Processing web pages concurrently with Ruby
I'm trying to process the content of several different pages given an array of their URLs, using Ruby threads. However, when trying to open a URL I get this error: #<SocketError: getaddrinfo: name or service not known>
This is how I'm trying to do it:
    sites.each do |site|
      threads << Thread.new(site) do |url|
        puts url
        #web = open(url) { |i| i.read }   # same issue when opening the page this way
        web = Net::HTTP.new(url, 443).get('/', nil)
        lock.synchronize do
          new_md5[sites_hash[url]] = Digest::MD5.hexdigest(web)
        end
      end
    end
sites is an array of URLs.
The same program works fine when run sequentially:
    sites.each { |site|
      web = open(site) { |i| i.read }
      new_md5 << Digest::MD5.hexdigest(web)
    }
What's the problem?
Ugh. You're going to open a thread for every site you have to process? What if you have 10,000 sites?
Instead, set a limit on the number of threads, turn sites into a queue, and have each thread remove a site, process it, and take another site. If there are no more sites in the queue, the thread can exit.
The example in the Queue documentation will get you started.
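Here's a minimal sketch of that pattern, assuming a sites array and a hash of results as in the question; THREAD_COUNT is just an illustrative tuning knob:

    require 'net/http'
    require 'digest/md5'

    THREAD_COUNT = 10                      # assumed limit; tune for your workload

    queue = Queue.new
    sites.each { |site| queue << site }

    lock    = Mutex.new
    new_md5 = {}

    workers = Array.new(THREAD_COUNT) do
      Thread.new do
        loop do
          url = begin
                  queue.pop(true)          # non-blocking pop
                rescue ThreadError
                  break                    # queue is drained, this worker is done
                end
          body = Net::HTTP.get(URI(url))   # plain GET; add error handling as needed
          lock.synchronize { new_md5[url] = Digest::MD5.hexdigest(body) }
        end
      end
    end

    workers.each(&:join)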
Instead of using get and always retrieving the entire body, use a backing database that keeps track of the last time a page was processed. Use head to check whether the page has been updated since then; if it has, then do a get. That will reduce your, and their, bandwidth and CPU usage. It's about being a good network citizen, and playing nice with other people's toys. If you don't play nice, they might not let you play with them any more.
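A rough sketch of the head-then-get idea using Net::HTTP; the last_modified hash is just a stand-in for whatever backing store you use:

    require 'net/http'

    last_modified = {}                     # stand-in for the backing database

    def fetch_if_changed(url, last_modified)
      uri  = URI(url)
      http = Net::HTTP.new(uri.host, uri.port)
      http.use_ssl = (uri.scheme == 'https')

      head  = http.request_head(uri.request_uri)
      stamp = head['last-modified']

      # Skip the GET if the server reports the same Last-Modified as last time.
      return nil if stamp && stamp == last_modified[url]

      last_modified[url] = stamp
      http.request_get(uri.request_uri).body
    end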
I've written hundreds of spiders and site analyzers, and I'd recommend you always have a backing database and use it to keep track of the sites you're going to read, when you last read them, whether they were up or down the last time you tried to get a page, and how many times you've tried to reach them while they were down. (That last one is so you don't bang your code's head against the wall trying to reach dead/down sites.)
I had a 75-thread app that read pages. Each thread wrote its findings to the database and, if a page needed to be processed, its HTML was written to a record in a table. A single app then read that table and did the processing. It was easy for that single app to stay ahead of the 75 threads, because they're the ones dealing with the slow internet.
The big advantage of using a backend database is that your code can be shut down, and it'll pick up at the same spot, with the next site to be processed, if you write it correctly. You can scale it to run on multiple hosts too.
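As an illustration only, the tracking table could look something like this using the sqlite3 gem; the table and column names are made up:

    require 'sqlite3'

    db = SQLite3::Database.new('spider.db')
    db.execute <<~SQL
      CREATE TABLE IF NOT EXISTS sites (
        url          TEXT PRIMARY KEY,
        last_read_at TEXT,                 -- when the page was last fetched
        is_up        INTEGER,              -- was the site reachable last time?
        failed_tries INTEGER DEFAULT 0,    -- consecutive failures, to skip dead sites
        html         TEXT                  -- raw body for the separate processing app
      )
    SQL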
Regarding not being able to find the host:
Some things I see in your code:
- You're not handling redirects. "Following Redirection" shows how to do that.
- The request is going to port 443, not 80, so Net::HTTP isn't happy trying to use non-SSL against an SSL port. See "Using Net::HTTP.get for an https url", which discusses how to turn on SSL.
Either of those could explain why using open works but your code doesn't. (I'm assuming you're using OpenURI in conjunction with your single-threaded code, even though you don't show it, since open by itself doesn't know what to do with a URL.)
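Here's a sketch that addresses both points with a plain Net::HTTP fetch; the redirect limit of 5 is arbitrary:

    require 'net/http'

    def fetch(url, limit = 5)
      raise ArgumentError, 'too many redirects' if limit.zero?

      uri  = URI(url)
      http = Net::HTTP.new(uri.host, uri.port)   # wants a host name, not a full URL
      http.use_ssl = (uri.scheme == 'https')     # needed when talking to port 443

      response = http.get(uri.request_uri)
      case response
      when Net::HTTPRedirection
        fetch(response['location'], limit - 1)   # follow the redirect
      else
        response.body
      end
    end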
In general, I'd recommend using Typhoeus and Hydra to process large numbers of sites in parallel. Typhoeus will handle the redirects for you, along with allowing you to use head requests. You can also set how many requests are handled at the same time (concurrency), and it automatically handles duplicate requests (memoization) so redundant URLs don't get pounded.
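A sketch of what that can look like with Typhoeus and Hydra; the concurrency limit of 20 is just an example:

    require 'typhoeus'
    require 'digest/md5'

    hydra   = Typhoeus::Hydra.new(max_concurrency: 20)
    new_md5 = {}

    sites.each do |url|
      request = Typhoeus::Request.new(url, followlocation: true)   # follows redirects
      request.on_complete do |response|
        new_md5[url] = Digest::MD5.hexdigest(response.body) if response.success?
      end
      hydra.queue(request)
    end

    hydra.run   # blocks until every queued request has finished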