python - How to create a LinkExtractor rule based on href in Scrapy
I am trying to create a simple crawler with Scrapy (scrapy.org). Per the example, item.php URLs are allowed. How can I write a rule that allows only URLs starting with http://example.com/category/ where the page GET parameter has one or more digits as its value? The other parameters don't matter, and the parameters can appear in any order. What should such a rule look like?
A few valid values are:
- http://example.com/category/?page=1&sort=a-z&cache=1
- http://example.com/category/?page=1&sort=a-z#
- http://example.com/category/?sort=a-z&page=1
 
Here is my code so far:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/category/']

    rules = (
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'id: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
Test for http://example.com/category/ at the start of the string, followed by a page parameter with one or more digits as its value:
Rule(LinkExtractor(allow=(r'^http://example.com/category/\?.*?(?=page=\d+)', )), callback='parse_item'),

Demo (using the example URLs):
>>> import re
>>> pattern = re.compile(r'^http://example.com/category/\?.*?(?=page=\d+)')
>>> should_match = [
...     'http://example.com/category/?sort=a-z&page=1',
...     'http://example.com/category/?page=1&sort=a-z&cache=1',
...     'http://example.com/category/?page=1&sort=a-z#'
... ]
>>> for url in should_match:
...     print "matches" if pattern.search(url) else "doesn't match"
...
matches
matches
matches
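One caveat with the lookahead pattern: `.*?(?=page=\d+)` can also fire on near-misses such as `?subpage=1`, because the lookahead only needs the text `page=<digits>` to appear somewhere after the `?`. If that matters, a sketch of a stricter alternative is to parse the query string instead of using a regex; `parse_qs` builds a dict, so parameter order is irrelevant. The helper name below is illustrative, not part of Scrapy:

```python
# Sketch: validate the URL by parsing its query string rather than
# regex-matching it. has_numeric_page is a hypothetical helper name.
try:
    from urllib.parse import urlparse, parse_qs   # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs       # Python 2

def has_numeric_page(url):
    """True if url is under /category/ and has a page=<digits> parameter."""
    parts = urlparse(url)                  # fragment ('#...') is split off here
    if not parts.path.startswith('/category/'):
        return False
    page_values = parse_qs(parts.query).get('page', [])
    return any(value.isdigit() for value in page_values)

print(has_numeric_page('http://example.com/category/?sort=a-z&page=1'))  # True
print(has_numeric_page('http://example.com/category/?subpage=1'))        # False
```

A predicate like this could be applied from a Rule's process_links callable to drop unwanted links after extraction, rather than encoding everything in the allow regex.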