python - How to create a LinkExtractor rule based on href in Scrapy
I am trying to create a simple crawler with Scrapy
(scrapy.org). In the documentation example, URLs matching item.php
are allowed. How can I write a rule that allows URLs starting with http://example.com/category/
where the GET parameter page
has a value of one or more digits? Other parameters may also be present, and the order of the parameters is random. How can I write such a rule?
A few valid URLs are:
- http://example.com/category/?page=1&sort=a-z&cache=1
- http://example.com/category/?page=1&sort=a-z#
- http://example.com/category/?sort=a-z&page=1
Here is my code so far:
import scrapy
# note: in newer Scrapy releases these live in scrapy.spiders and scrapy.linkextractors
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/category/']

    rules = (
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
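For context, the strings in allow= are regular expressions that, as far as I know, Scrapy applies with re.search semantics, so they match anywhere in the URL unless anchored with ^. A minimal sketch of that behaviour using plain re (no Scrapy required; the URLs are made up for illustration):

```python
import re

# The allow pattern from the spider above: accepts any URL containing "item.php"
pattern = re.compile(r'item\.php')

urls = [
    'http://example.com/item.php?id=1',   # contains item.php -> accepted
    'http://example.com/category/',       # no item.php -> rejected
]

for url in urls:
    print(url, '->', bool(pattern.search(url)))
```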
Test for http://example.com/category/
at the start of the string, and for a page
parameter with one or more digits in its value:

Rule(LinkExtractor(allow=('^http://example.com/category/\?.*?(?=page=\d+)', )), callback='parse_item'),
Demo (using your example URLs):
>>> import re
>>> pattern = re.compile(r'^http://example.com/category/\?.*?(?=page=\d+)')
>>> should_match = [
...     'http://example.com/category/?sort=a-z&page=1',
...     'http://example.com/category/?page=1&sort=a-z&cache=1',
...     'http://example.com/category/?page=1&sort=a-z#'
... ]
>>> for url in should_match:
...     print "matches" if pattern.search(url) else "doesn't match"
...
matches
matches
matches
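As an alternative to a regex lookahead, one could parse the query string and check the page parameter directly. A sketch using the standard library's urllib.parse (the function name has_numeric_page is my own, not part of Scrapy); a predicate like this could be used, for example, in a process_links callback on the Rule:

```python
from urllib.parse import urlparse, parse_qs

def has_numeric_page(url):
    """Return True if the URL is under /category/ and carries a numeric 'page' param."""
    parts = urlparse(url)
    if not parts.path.startswith('/category/'):
        return False
    # parse_qs maps each parameter name to a list of values, regardless of order
    page_values = parse_qs(parts.query).get('page', [])
    return any(value.isdigit() for value in page_values)

for url in [
    'http://example.com/category/?page=1&sort=a-z&cache=1',
    'http://example.com/category/?sort=a-z&page=1',
    'http://example.com/category/?sort=a-z',
]:
    print(url, has_numeric_page(url))
```

This sidesteps parameter-order issues entirely, since parse_qs does not care where page appears in the query string.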