Note: This article applies to Python 3 environments.

Background

Recently I have been working on a web-scraping project in Python 3, using Beautiful Soup to parse the HTML tree. I ran into several problems when selecting HTML tags, but luckily most of them can be solved with the attrs parameter of find()/find_all().
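As a rough sketch of how the attrs parameter relates to ordinary keyword arguments: the markup and the variable names here are made up for illustration and are not taken from my project. For attribute names that are valid Python identifiers, both forms should select the same tag.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html_content = '<p id="intro">hello</p><p id="outro">bye</p>'
soup = BeautifulSoup(html_content, "html.parser")

# A keyword argument and an attrs dictionary are two ways to
# express the same attribute filter.
by_keyword = soup.find(id="intro")
by_attrs = soup.find(attrs={"id": "intro"})
print(by_keyword is by_attrs)

# find_all() accepts the same attrs parameter and returns every match.
# A value of True matches any tag that has the attribute at all.
paragraphs = soup.find_all("p", attrs={"id": True})
print([p.get_text() for p in paragraphs])
```

The dictionary form is the more general of the two, since its keys are plain strings and are not restricted by Python's rules for argument names.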

Problem & Workaround

  • The **kwargs parameter of find()/find_all() can’t match some HTML5 attributes (e.g. the data-* attributes), because a name containing a hyphen is not a valid Python keyword argument

    example:

    >>> from bs4 import BeautifulSoup
    >>> html_content = """<a data-pb-other="area">内地剧</a>"""
    >>> html_soup = BeautifulSoup(html_content, "html.parser")
    >>> html_soup.find(data-pb-other="area")
      File "<stdin>", line 1
    SyntaxError: keyword can't be an expression
    

    workaround:

    >>> html_soup.find(attrs={"data-pb-other": "area"})
    <a data-pb-other="area">内地剧</a>
    
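  • In older Beautiful Soup versions, select() rejects CSS attribute selectors for some HTML5 attributes (e.g. the data-* attributes); versions 4.7.0 and later, where select() is backed by Soup Sieve, accept them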

    example:

    >>> html_content = """<a data-pb-other="area">内地剧</a>"""
    >>> html_soup = BeautifulSoup(html_content, "html.parser")
    >>> html_soup.select('a[data-pb-other="area"]')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python3.4/site-packages/bs4/element.py", line 1313, in select
        'Unsupported or invalid CSS selector: "%s"' % token)
    ValueError: Unsupported or invalid CSS selector: "a[data-pb-other="area"]"
    

    workaround:

    >>> html_soup.find("a", attrs={"data-pb-other": "area"})
    <a data-pb-other="area">内地剧</a>
    
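  • select() can’t find tags by several attributes written inside one pair of brackets (standard CSS gives each attribute its own brackets, e.g. a[class="movie_name"][id="movie_name"], which Beautiful Soup 4.7.0 and later accept)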

    example:

    >>> html_content = """<a class="movie_name" id="movie_name">大道通天第1集</a>"""
    >>> html_soup = BeautifulSoup(html_content, "html.parser")
    >>> html_soup.select('a[class="movie_name" id="movie_name"]')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python3.4/site-packages/bs4/element.py", line 1313, in select
        'Unsupported or invalid CSS selector: "%s"' % token)
    ValueError: Unsupported or invalid CSS selector: "a[class="movie_name""
    

    workaround:

    >>> html_soup.find("a", attrs={"class": "movie_name", "id": "movie_name"})
    <a class="movie_name" id="movie_name">大道通天第1集</a>
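
The workarounds above can be combined into one small script. This is only a sketch: the markup is a guess modeled on the examples, and find_text() is a hypothetical helper of my own, not part of Beautiful Soup. It also covers the case where nothing matches, since find() returns None then.

```python
from bs4 import BeautifulSoup

# Assumed markup, modeled on the examples in this article.
html_content = (
    '<a data-pb-other="area">内地剧</a>'
    '<a class="movie_name" id="movie_name">大道通天第1集</a>'
)
soup = BeautifulSoup(html_content, "html.parser")


def find_text(soup, name, attrs):
    """Hypothetical helper: text of the first matching tag, or None."""
    tag = soup.find(name, attrs=attrs)
    return tag.get_text() if tag is not None else None


# A hyphenated attribute name is fine as a dictionary key.
area = find_text(soup, "a", {"data-pb-other": "area"})
# Several attributes can be required at once in the same dictionary.
movie = find_text(soup, "a", {"class": "movie_name", "id": "movie_name"})
# No tag carries this value, so the helper returns None.
missing = find_text(soup, "a", {"data-pb-other": "year"})
print(area, movie, missing)
```

Checking for None before calling get_text() avoids an AttributeError when a page does not contain the tag you expect.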
    
