Beautiful Soup - How to Fix Broken Tags

I'd like to know how to fix broken HTML tags before parsing them with Beautiful Soup.

In the following script, the td> needs to be replaced with <td>.

How can I do the substitution so that Beautiful Soup sees the tag?

from BeautifulSoup import BeautifulSoup

s = """
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>"""

a = BeautifulSoup(s)

left = []
right = []

for tr in a.findAll('tr'):
    l, r = tr.findAll('td')
    left.extend(l.findAll(text=True))
    right.extend(r.findAll(text=True))

print left + right


Edit (working):

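I grabbed a list of all HTML tags from the W3C (it should be complete) to match against. The pattern looks for a known tag name that directly follows a > (with optional whitespace in between) and is itself followed by >, then puts the missing < back. Try it out: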

import re

# re.VERBOSE makes the regex ignore the newlines and indentation used to lay
# out the tag list, so the layout does not leak into the alternatives.
tagPattern = re.compile(r""">\s*(!--|!DOCTYPE|
    a|abbr|acronym|address|applet|area|
    b|base|basefont|bdo|big|blockquote|body|br|button|
    caption|center|cite|code|col|colgroup|
    dd|del|dfn|dir|div|dl|dt|
    em|
    fieldset|font|form|frame|frameset|
    head|h1|h2|h3|h4|h5|h6|hr|html|
    i|iframe|img|input|ins|
    kbd|
    label|legend|li|link|
    map|menu|meta|
    noframes|noscript|
    object|ol|optgroup|option|
    p|param|pre|
    q|
    s|samp|script|select|small|span|strike|strong|style|sub|sup|
    table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|
    u|ul|
    var)>""", re.VERBOSE)

# Put the missing "<" back in front of the captured tag name.
fixedString = tagPattern.sub(r"><\g<1>>", s)
bs = BeautifulSoup(fixedString)

Produces:

>>> print s

<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>

>>> print fixedString

<tr><td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>
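
To check that the repaired string yields the cells the question's loop is after, here is a minimal sketch in Python 3. It uses the standard library's html.parser in place of Beautiful Soup (a swapped-in parser, so the example runs without third-party packages). The CellCollector class and the shortened tag list are assumptions made for illustration and are not part of the original answer.

```python
import re
from html.parser import HTMLParser

s = """
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>"""

# Shortened tag list for illustration; the full list is used the same way.
pattern = re.compile(r">\s*(table|tbody|td|th|tr)>")
fixed = pattern.sub(r"><\g<1>>", s)


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of each <td>, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:          # tolerate a <td> outside any <tr>
                self.rows.append([])
            self.in_td = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.in_td:
            self.rows[-1][-1] += data


collector = CellCollector()
collector.feed(fixed)
collector.close()

left = [row[0] for row in collector.rows]
right = [row[1] for row in collector.rows]
print(left + right)
```

With the repaired input, both rows contain two cells, so the two-way split that the question's loop relies on (l, r = tr.findAll('td')) has exactly two values to unpack in each row.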


This one should also match broken closing tags (/endtag> instead of </endtag>):

# Group 1 captures the optional "/", group 2 the tag name.
closingPattern = re.compile(r""">\s*(/?)(!--|!DOCTYPE|
    a|abbr|acronym|address|applet|area|
    b|base|basefont|bdo|big|blockquote|body|br|button|
    caption|center|cite|code|col|colgroup|
    dd|del|dfn|dir|div|dl|dt|
    em|
    fieldset|font|form|frame|frameset|
    head|h1|h2|h3|h4|h5|h6|hr|html|
    i|iframe|img|input|ins|
    kbd|
    label|legend|li|link|
    map|menu|meta|
    noframes|noscript|
    object|ol|optgroup|option|
    p|param|pre|
    q|
    s|samp|script|select|small|span|strike|strong|style|sub|sup|
    table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|
    u|ul|
    var)>""", re.VERBOSE)
fixedString = closingPattern.sub(r"><\g<1>\g<2>>", s)
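
The same idea can be wrapped in a small function that builds the alternation from a list of tag names instead of spelling it out by hand. This is only a sketch in Python 3: the function name repair_missing_lt and the shortened TAGS list are hypothetical, chosen for illustration.

```python
import re

# Shortened, hypothetical tag list; substitute the full list when needed.
TAGS = ["table", "tbody", "td", "th", "tr"]


def repair_missing_lt(html, tags=TAGS):
    """Re-insert a missing '<' before a known tag name that directly follows '>'."""
    # re.escape guards against names containing regex metacharacters.
    names = "|".join(re.escape(tag) for tag in tags)
    # Group 1: optional '/', so closing tags are repaired too. Group 2: tag name.
    pattern = re.compile(r">\s*(/?)(" + names + r")>")
    return pattern.sub(r"><\g<1>\g<2>>", html)


broken = "<tr>\ntd>LABEL1</td><td>INPUT1</td>\n/tr>"
print(repair_missing_lt(broken))
```

Because the pattern requires a > immediately before the broken tag (apart from whitespace), it does not repair a broken tag at the very start of the input or one that follows text content, and it does not match tags that carry attributes. Whitespace between the preceding > and the repaired tag is dropped, as in the output shown earlier.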