python - Parsing an HTML file using BeautifulSoup
I have an HTML file:
<html> <head></head> <body> text1 text2 <a href="xycl7q.html"> text3 </a> </body> </html>
I want to collect text1, text2, and text3 separately. text3 is no problem, but I am not able to capture text1 and text2. I am doing this:
from urllib import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'myurl'
html = urlopen(url).read()
soup = BeautifulSoup(html)
soup.body.get_text()
I get the texts (first problem: text3 is included again), but they are not separated, and I cannot simply split on whitespace because text1 and text2 might contain spaces. For instance, if text1 is "hello world" and text2 is "foo bar", at the end I want a list of 2 strings:
results = ['hello world', 'foo bar']
How can I do that? Thanks for any answers.
The text you want is the first child node of "body". You can pull it out and strip off the crud:
>>> from bs4 import BeautifulSoup as bs
>>> soup = bs("""<html>
... <head></head>
... <body>
... text1
... text2
... <a href="xycl7q.html">
... text3
... </a>
... </body>
... </html>""")
>>> body = soup.find('body')
>>> type(next(body.children))
<class 'bs4.element.NavigableString'>
>>> next(body.children)
u'\n text1 \n text2\n '
>>> [stripped for stripped in (item.strip() for item in next(body.children).split('\n')) if stripped]
[u'text1', u'text2']
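For anyone on Python 3, here is a minimal sketch of the same idea as a script rather than a REPL session. It is a variation on the approach above, not the same code: instead of looking only at the first child of "body", it walks every direct child and keeps the bare text nodes, so text placed after the link would be picked up too. The helper name direct_text_lines, the inline HTML constant, and the choice of the built-in "html.parser" backend are assumptions made for illustration. It also assumes each piece of text sits on its own line, as in the question.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample document from the question, with the example values filled in.
HTML = """<html>
<head></head>
<body>
hello world
foo bar
<a href="xycl7q.html">
text3
</a>
</body>
</html>"""


def direct_text_lines(tag):
    """Return the non-empty, stripped lines of text sitting directly inside `tag`.

    Hypothetical helper: text inside child elements (such as <a>) is skipped.
    """
    lines = []
    for child in tag.children:
        # Child elements are Tag objects; bare text is a NavigableString.
        if isinstance(child, NavigableString):
            lines.extend(line.strip() for line in child.splitlines() if line.strip())
    return lines


soup = BeautifulSoup(HTML, "html.parser")

# Text that belongs to <body> itself, one list entry per line.
results = direct_text_lines(soup.body)

# The link text is collected separately from its own tag.
link_text = soup.body.a.get_text(strip=True)

print(results)
print(link_text)
```

Splitting on line breaks rather than on whitespace is what keeps "hello world" together as one string. If the real page puts text1 and text2 on the same line, this sketch cannot tell them apart and some other delimiter would be needed.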