python - Parsing a html file using BeautifulSoup -

- June 15, 2013

I have an HTML file:

<html>
    <head></head>
    <body>
        text1
        text2
        <a href="xycl7q.html">
            text3
        </a>
    </body>
</html>

I want to collect text1, text2, and text3 separately. text3 is no problem, but I am not able to capture text1 and text2. This is what I am doing:

from urllib import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'myurl'
html = urlopen(url).read()
soup = BeautifulSoup(html)
soup.body.get_text()

I get the texts (first problem: text3 is included again), but they are not separated, and text1 and text2 might themselves contain spaces. For instance, if text1 is "hello world" and text2 is "foo bar", at the end I want a list of 2 strings:

 results = ['hello world', 'foo bar']

How can I do that? Thanks for your answers...

The text you want is the first child node of "body". You can pull it out and strip off the crud:

>>> from bs4 import BeautifulSoup as bs
>>> soup = bs("""<html>
...     <head></head>
...     <body>
...         text1  
...         text2
...         <a href="xycl7q.html">
...             text3
...         </a>
...     </body>
... </html>""")
>>> body = soup.find('body')
>>> type(next(body.children))
<class 'bs4.element.NavigableString'>
>>> next(body.children)
u'\n        text1  \n        text2\n        '
>>> [stripped for stripped in (item.strip() for item in next(body.children).split('\n')) if stripped]
[u'text1', u'text2']
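The same idea can be written as a small script. The sketch below is a variation on the session above, not a copy of it: it assumes Python 3 and the built-in "html.parser" backend, and it walks every direct text child of "body" rather than only the first one, so text that follows the link would also be picked up. The helper name direct_text_lines and the sample strings are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample document using the example strings from the question.
HTML = """<html>
    <head></head>
    <body>
        hello world
        foo bar
        <a href="xycl7q.html">
            text3
        </a>
    </body>
</html>"""


def direct_text_lines(tag):
    """Return the non-empty, stripped lines of text that sit directly inside tag."""
    lines = []
    for child in tag.children:
        # Skip child tags such as <a>; keep only bare text nodes.
        if isinstance(child, NavigableString):
            for line in child.splitlines():
                line = line.strip()
                if line:
                    lines.append(line)
    return lines


soup = BeautifulSoup(HTML, "html.parser")
results = direct_text_lines(soup.body)        # text1 and text2, one string per line
link_text = soup.body.a.get_text(strip=True)  # text3, collected separately

print(results)
print(link_text)
```

Splitting on line breaks keeps "hello world" as a single string, which is what separates this from calling get_text() on the whole body.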
