Note: This article applies to Python 3 environments.
Background
Recently I have been working on a web-scraping project in Python 3, using Beautiful Soup to parse the HTML tree. I ran into several problems when selecting HTML tags, but luckily most of them can be solved with the attrs parameter of find()/find_all().
Problems & Workarounds
- The **kwargs parameter of find()/find_all() can't find tags by some HTML 5 attributes (e.g. the data-* attributes), because a hyphenated attribute name is not a valid Python keyword argument
example:
>>> from bs4 import BeautifulSoup
>>> html_content = """<a data-pb-other="area">内地剧</a>"""
>>> html_soup = BeautifulSoup(html_content, "html.parser")
>>> html_soup.find(data-pb-other="area")
  File "<stdin>", line 1
SyntaxError: keyword can't be an expression
workaround:
>>> html_soup.find(attrs={"data-pb-other": "area"})
<a data-pb-other="area">内地剧</a>
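The same workaround can be written as a small standalone script. This is only a sketch: the sample markup (including the second link) is assumed for illustration, and "html.parser" is chosen simply because it ships with Python.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, for illustration only.
html_content = (
    '<a data-pb-other="area">内地剧</a>'
    '<a data-pb-other="year">2015</a>'
)
soup = BeautifulSoup(html_content, "html.parser")

# data-pb-other contains hyphens, so it is not a valid Python identifier and
# cannot be written as a keyword argument. The attrs dictionary takes the
# attribute name as an ordinary string key instead.
area_tag = soup.find(attrs={"data-pb-other": "area"})
print(area_tag.get_text())

# find_all() accepts the same dictionary and returns every matching tag.
year_tags = soup.find_all("a", attrs={"data-pb-other": "year"})
print(len(year_tags))
```

Because the attribute name is just a dictionary key here, any attribute name that HTML allows can be matched this way.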
- In the Beautiful Soup version used here, the CSS selector method select() can't find tags by some HTML 5 attributes (e.g. the data-* attributes)
example:
>>> html_content = """<a data-pb-other="area">内地剧</a>"""
>>> html_soup = BeautifulSoup(html_content, "html.parser")
>>> html_soup.select('a[data-pb-other="area"]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/site-packages/bs4/element.py", line 1313, in select
    'Unsupported or invalid CSS selector: "%s"' % token)
ValueError: Unsupported or invalid CSS selector: "a[data-pb-other="area"]"
workaround:
>>> html_soup.find("a", attrs={"data-pb-other": "area"})
<a data-pb-other="area">内地剧</a>
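Newer Beautiful Soup releases (4.7.0 and later, where select() is backed by the soupsieve package) accept this attribute selector, so the failure above depends on the installed version. The sketch below tries select() first and falls back to find_all() when the selector is rejected with ValueError, as in the traceback above. The helper name select_by_attr and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def select_by_attr(soup, tag_name, attr_name, value):
    """Hypothetical helper: try a CSS attribute selector, else use find_all()."""
    selector = '{}[{}="{}"]'.format(tag_name, attr_name, value)
    try:
        return list(soup.select(selector))
    except ValueError:
        # Older releases raise ValueError for selectors they cannot parse,
        # so fall back to the attrs dictionary, which works everywhere.
        return list(soup.find_all(tag_name, attrs={attr_name: value}))


# Assumed sample markup, for illustration only.
html_content = (
    '<a data-pb-other="area">内地剧</a>'
    '<a data-pb-other="year">2015</a>'
)
soup = BeautifulSoup(html_content, "html.parser")

matches = select_by_attr(soup, "a", "data-pb-other", "area")
print([tag.get_text() for tag in matches])
```

Either branch returns a plain list of tags, so calling code does not need to know which one ran.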
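- The CSS selector method select() rejects a selector that puts multiple attributes inside one pair of brackets (this form is not valid CSS, which expects one pair of brackets per attribute)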
example:
>>> html_content = """<a class="movie_name" id="movie_name">大道通天第1集</a>"""
>>> html_soup = BeautifulSoup(html_content, "html.parser")
>>> html_soup.select('a[class="movie_name" id="movie_name"]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/site-packages/bs4/element.py", line 1313, in select
    'Unsupported or invalid CSS selector: "%s"' % token)
ValueError: Unsupported or invalid CSS selector: "a[class="movie_name""
workaround:
>>> html_soup.find("a", attrs={"class": "movie_name", "id": "movie_name"})
<a class="movie_name" id="movie_name">大道通天第1集</a>
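For comparison, standard CSS chains one bracket pair per attribute. The sketch below assumes Beautiful Soup 4.7.0 or later (select() backed by soupsieve); older releases may still reject the chained form, in which case the attrs dictionary remains the safe choice. The sample markup is assumed for illustration.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, for illustration only.
html_content = (
    '<a class="movie_name" id="movie_name">大道通天第1集</a>'
    '<a class="movie_name" id="other">other</a>'
)
soup = BeautifulSoup(html_content, "html.parser")

# Standard CSS form: one bracket pair per attribute, chained without spaces.
# Assumes a Beautiful Soup release whose select() is backed by soupsieve.
by_selector = soup.select('a[class="movie_name"][id="movie_name"]')

# Equivalent lookup with find_all() and an attrs dictionary; every key in
# the dictionary must match for a tag to be returned.
by_attrs = soup.find_all("a", attrs={"class": "movie_name", "id": "movie_name"})

print(len(by_selector), len(by_attrs))
```

Both calls return the same single tag here; the second link is excluded because its id differs.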
References: