How can I convert HTML to plain text in Python?

I need to get plain text from an HTML document while honoring <br> elements as newlines. BeautifulSoup's .text drops <br> elements instead of turning them into newlines. HTML2Text is quite nice, but it converts to Markdown. How else could I approach this?


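I like to use the following method. To honor line breaks, do a manual .replace('<br>', '\r\n') on the string before passing it to strip_tags(html). Note that a literal replace only matches the exact spelling '<br>', so variants such as <br/> or <br /> need their own replacements.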

From this question:

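from html.parser import HTMLParser  # Python 3; in Python 2 the module was HTMLParser

class MLStripper(HTMLParser):
    """Collects only the text nodes of the HTML it is fed."""
    def __init__(self):
        super().__init__()  # initializes parser state (this also calls reset())
        self.fed = []
    def handle_data(self, d):
        # Called for each run of text between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()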
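
As an alternative to replacing '<br>' in the string beforehand, the parser itself can emit a newline whenever it meets a <br> tag. The following is only a minimal sketch for Python 3: the names BrAwareStripper and html_to_text are hypothetical and not part of the answer above, and it writes '\n' rather than '\r\n', which is an assumption you may want to change.

```python
from html.parser import HTMLParser

class BrAwareStripper(HTMLParser):
    """Hypothetical variant of MLStripper that turns <br> into a newline."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; <br/> is also routed through here
        if tag == 'br':
            self.parts.append('\n')  # assumption: '\n' instead of '\r\n'

    def handle_data(self, data):
        # Text between tags, with character references already decoded
        self.parts.append(data)

    def get_text(self):
        return ''.join(self.parts)

def html_to_text(html):
    parser = BrAwareStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()

print(html_to_text('line one<br>line two<br/>line &amp; three'))
```

Because the check happens on the parsed tag rather than on the raw string, <br>, <br/>, <br /> and <BR> are all handled by the same branch.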