Processing web pages concurrently with Ruby
I'm trying to process the content of several different pages given an array of their URLs, using Ruby threads. However, when trying to open a URL I get this error: #<SocketError: getaddrinfo: name or service not known>
This is how I'm trying to do it:
    sites.each do |site|
      threads << Thread.new(site) do |url|
        puts url
        #web = open(url) { |i| i.read }   # same issue when opening the page this way
        web = Net::HTTP.new(url, 443).get('/', nil)
        lock.synchronize do
          new_md5[sites_hash[url]] = Digest::MD5.hexdigest(web)
        end
      end
    end
sites is an array of URLs.
The same program works fine when run sequentially:
    sites.each { |site|
      web = open(site) { |i| i.read }
      new_md5 << Digest::MD5.hexdigest(web)
    }
What's the problem?
Ugh. You're going to open a thread for every site you have to process? What if you have 10,000 sites?
Instead, set a limit on the number of threads, turn sites into a queue, and have each thread remove a site, process it, and take another site. If there are no more sites in the queue, the thread can exit.
The example in the Queue documentation will get you started.
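Here's a minimal sketch of that pattern, assuming a sites array and a hash of results as in the question; THREAD_COUNT is just an illustrative tuning knob:

    require 'net/http'
    require 'digest/md5'

    THREAD_COUNT = 10                      # assumed limit; tune for your workload

    queue = Queue.new
    sites.each { |site| queue << site }

    lock    = Mutex.new
    new_md5 = {}

    workers = Array.new(THREAD_COUNT) do
      Thread.new do
        loop do
          url = begin
                  queue.pop(true)          # non-blocking pop
                rescue ThreadError
                  break                    # queue is drained, this worker is done
                end
          body = Net::HTTP.get(URI(url))   # plain GET; add error handling as needed
          lock.synchronize { new_md5[url] = Digest::MD5.hexdigest(body) }
        end
      end
    end

    workers.each(&:join)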
Instead of using get and always retrieving the entire body, use a backing database that keeps track of the last time a page was processed. Use head to check whether the page has been updated since then; if it has, then do a get. That will reduce your, and their, bandwidth and CPU usage. It's about being a good network citizen, and playing nice with other people's toys. If you don't play nice, they might not let you play with them any more.
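A rough sketch of the head-then-get idea using Net::HTTP; the last_modified hash is just a stand-in for whatever backing store you use:

    require 'net/http'

    last_modified = {}                     # stand-in for the backing database

    def fetch_if_changed(url, last_modified)
      uri  = URI(url)
      http = Net::HTTP.new(uri.host, uri.port)
      http.use_ssl = (uri.scheme == 'https')

      head  = http.request_head(uri.request_uri)
      stamp = head['last-modified']

      # Skip the GET if the server reports the same Last-Modified as last time.
      return nil if stamp && stamp == last_modified[url]

      last_modified[url] = stamp
      http.request_get(uri.request_uri).body
    end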
I've written hundreds of spiders and site analyzers, and I'd recommend you always have a backing database and use it to keep track of the sites you're going to read, when you last read them, whether they were up or down the last time you tried to get a page, and how many times you've tried to reach them while they were down. (That last one is so you don't bang your code's head against the wall trying to reach dead/down sites.)
I had a 75-thread app that read pages. Each thread wrote its findings to the database and, if a page needed to be processed, its HTML was written to a record in a table. A single app then read that table and did the processing. It was easy for that single app to stay ahead of the 75 threads, because they're the ones dealing with the slow internet.
The big advantage of using a backend database is that your code can be shut down, and it'll pick up at the same spot, with the next site to be processed, if you write it correctly. You can scale it to run on multiple hosts too.
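As an illustration only, the tracking table could look something like this using the sqlite3 gem; the table and column names are made up:

    require 'sqlite3'

    db = SQLite3::Database.new('spider.db')
    db.execute <<~SQL
      CREATE TABLE IF NOT EXISTS sites (
        url          TEXT PRIMARY KEY,
        last_read_at TEXT,                 -- when the page was last fetched
        is_up        INTEGER,              -- was the site reachable last time?
        failed_tries INTEGER DEFAULT 0,    -- consecutive failures, to skip dead sites
        html         TEXT                  -- raw body for the separate processing app
      )
    SQL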
Regarding not being able to find the host:
Some things I see in your code:
- You're not handling redirects. "Following Redirection" shows how to do that.
- The request is going to port 443, not 80, so Net::HTTP isn't happy trying to use non-SSL against an SSL port. See "Using Net::HTTP.get for an https url", which discusses how to turn on SSL.
Either of those could explain why using open works but your code doesn't. (I'm assuming you're using OpenURI in conjunction with your single-threaded code, even though you don't show it, since open by itself doesn't know what to do with a URL.)
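Here's a sketch that addresses both points with a plain Net::HTTP fetch; the redirect limit of 5 is arbitrary:

    require 'net/http'

    def fetch(url, limit = 5)
      raise ArgumentError, 'too many redirects' if limit.zero?

      uri  = URI(url)
      http = Net::HTTP.new(uri.host, uri.port)   # wants a host name, not a full URL
      http.use_ssl = (uri.scheme == 'https')     # needed when talking to port 443

      response = http.get(uri.request_uri)
      case response
      when Net::HTTPRedirection
        fetch(response['location'], limit - 1)   # follow the redirect
      else
        response.body
      end
    end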
In general, I'd recommend using Typhoeus and Hydra to process large numbers of sites in parallel. Typhoeus will handle the redirects for you, along with allowing you to use head requests. You can also set how many requests are handled at the same time (concurrency), and it automatically handles duplicate requests (memoization) so redundant URLs don't get pounded.
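A sketch of what that can look like with Typhoeus and Hydra; the concurrency limit of 20 is just an example:

    require 'typhoeus'
    require 'digest/md5'

    hydra   = Typhoeus::Hydra.new(max_concurrency: 20)
    new_md5 = {}

    sites.each do |url|
      request = Typhoeus::Request.new(url, followlocation: true)   # follows redirects
      request.on_complete do |response|
        new_md5[url] = Digest::MD5.hexdigest(response.body) if response.success?
      end
      hydra.queue(request)
    end

    hydra.run   # blocks until every queued request has finished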