Beautiful Soup: using the attrs parameter of find()/find_all() to select HTML tags

Note: This article applies to Python 3.

Background

I have recently been working on a web-scraping project in Python 3, using Beautiful Soup to parse the HTML tree. I ran into several problems when selecting HTML tags, but fortunately most of them can be solved with the attrs parameter of find()/find_all().

Problems & Workarounds

find()/find_all() keyword arguments (**kwargs) can’t match some HTML5 attributes (e.g. the data-* attributes), because a hyphenated name is not a valid Python keyword

example:

>>> from bs4 import BeautifulSoup
>>> html_content = """<a data-pb-other="area">内地剧</a>"""
>>> html_soup = BeautifulSoup(html_content, "html.parser")
>>> html_soup.find(data-pb-other="area")
  File "<stdin>", line 1
SyntaxError: keyword can't be an expression

workaround:

>>> html_soup.find(attrs={"data-pb-other": "area"})
<a data-pb-other="area">内地剧</a>
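
The same attrs dictionary works with find_all(). The following is a minimal sketch rather than code from the original project: the markup in SAMPLE_HTML and the variable names are invented for illustration. It also shows that a value of True matches any tag that carries the attribute, whatever its value.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
SAMPLE_HTML = """
<ul>
  <li><a data-pb-other="area" href="/area/1">Mainland</a></li>
  <li><a data-pb-other="area" href="/area/2">Overseas</a></li>
  <li><a data-pb-other="year" href="/year/2015">2015</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A hyphenated attribute name is not a valid Python identifier, so it
# cannot be written as a keyword argument; pass it in the attrs dict.
area_links = soup.find_all("a", attrs={"data-pb-other": "area"})
area_texts = [a.get_text() for a in area_links]

# A value of True matches every tag that has the attribute at all.
tagged_links = soup.find_all("a", attrs={"data-pb-other": True})
tagged_values = [a["data-pb-other"] for a in tagged_links]

print(area_texts)     # ['Mainland', 'Overseas']
print(tagged_values)  # ['area', 'area', 'year']
```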

CSS selectors can’t match some HTML5 attributes (e.g. the data-* attributes) in older Beautiful Soup releases

example:

>>> html_content = """<a data-pb-other="area">内地剧</a>"""
>>> html_soup = BeautifulSoup(html_content, "html.parser")
>>> html_soup.select('a[data-pb-other="area"]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/site-packages/bs4/element.py", line 1313, in select
    'Unsupported or invalid CSS selector: "%s"' % token)
ValueError: Unsupported or invalid CSS selector: "a[data-pb-other="area"]"

workaround:

>>> html_soup.find("a", attrs={"data-pb-other": "area"})
<a data-pb-other="area">内地剧</a>
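
The traceback above comes from an older Beautiful Soup release that parsed selectors itself. Since Beautiful Soup 4.7.0, select() is backed by the Soup Sieve package, which accepts this selector. If the code has to run on both old and new installations, one option is to try the selector first and fall back to attrs. This is only a sketch: the helper name find_area_links and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html_content = (
    '<a data-pb-other="area" href="/area/1">Mainland</a>'
    '<a href="/about">About</a>'
)
soup = BeautifulSoup(html_content, "html.parser")


def find_area_links(soup):
    """Return <a> tags whose data-pb-other attribute equals "area".

    Tries the CSS selector first. Older Beautiful Soup releases reject it
    with ValueError, and newer ones raise NotImplementedError when Soup
    Sieve is not installed; in both cases fall back to the attrs form.
    """
    try:
        return soup.select('a[data-pb-other="area"]')
    except (ValueError, NotImplementedError):
        return soup.find_all("a", attrs={"data-pb-other": "area"})


links = find_area_links(soup)
print([a.get_text() for a in links])  # ['Mainland']
```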

CSS selectors can’t match tags on multiple attributes

example:

>>> html_content = """<a class="movie_name" id="movie_name">大道通天第1集</a>"""
>>> html_soup = BeautifulSoup(html_content, "html.parser")
>>> html_soup.select('a[class="movie_name" id="movie_name"]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/site-packages/bs4/element.py", line 1313, in select
    'Unsupported or invalid CSS selector: "%s"' % token)
ValueError: Unsupported or invalid CSS selector: "a[class="movie_name""

workaround:

>>> html_soup.find("a", attrs={"class": "movie_name", "id": "movie_name"})
<a class="movie_name" id="movie_name">大道通天第1集</a>
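
Note that standard CSS does not put two attributes inside one pair of brackets; it chains one bracket per attribute, as in a[class="movie_name"][id="movie_name"]. The sketch below compares the attrs form with that chained selector. It assumes Beautiful Soup 4.7.0 or later with Soup Sieve installed (older releases may still reject the selector), and the sample markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: only the first link has both attributes.
html_content = """
<a class="movie_name" id="movie_name" href="/play/1">Episode 1</a>
<a class="movie_name" href="/play/2">Episode 2</a>
"""
soup = BeautifulSoup(html_content, "html.parser")

# attrs form: every key/value pair in the dict must match.
by_attrs = soup.find_all("a", attrs={"class": "movie_name", "id": "movie_name"})

# CSS form: one bracket per attribute, chained with no space in between.
# Assumes select() is backed by Soup Sieve (Beautiful Soup 4.7.0+).
by_css = soup.select('a[class="movie_name"][id="movie_name"]')

print([a.get_text() for a in by_attrs])  # ['Episode 1']
print([a.get_text() for a in by_css])    # ['Episode 1']
```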
