I have the following Python code using Genshi (simplified):
from genshi.template import MarkupTemplate

with open(pathToHTMLFile, 'r') as f:
    template = MarkupTemplate(f.read())
finalPage = template.generate().render('html', doctype='html')
The source HTML file contains entities such as &reg;. Genshi replaces these with the corresponding UTF-8 characters, which causes problems in the viewer that eventually displays the resulting HTML (the output is used as a stand-alone file, not as a response to a web request). Is there any way to prevent Genshi from parsing these entities? The more common ones, like &amp;, are passed through just fine.
&amp; isn't passed through: it's parsed into an ampersand character and then serialised back to &amp; on the way out, because that's necessary to represent a literal ampersand in HTML. &copy;, on the other hand, is not a necessary escape, so it can be left as its literal character.
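To make the round trip concrete, here is a minimal sketch using only Python's standard `html` module. It is not Genshi's own code, just an illustration of the same parse-then-serialise behaviour, and the sample string is made up:

```python
from html import escape, unescape

# Made-up sample input containing one "necessary" and one "optional" entity.
source = "AT&amp;T &copy; 2010"

# Parsing resolves every entity reference to the character it stands for.
parsed = unescape(source)

# Serialising escapes only what HTML requires (&, <, >), so the ampersand
# becomes &amp; again while the copyright sign stays a literal character.
serialised = escape(parsed, quote=False)

print(parsed)
print(serialised)
```

The ampersand only looks as if it was passed through untouched; both entities were decoded, and only one of them needed escaping again.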
So no, there's no way to stop the entity reference being parsed. But you can ensure that non-ASCII characters are re-escaped on the way back out by serialising to plain ASCII:
template.generate().render('html', doctype='html', encoding='us-ascii')
You still won't get the entity reference &copy; in your output, but you will get the character reference &#169;, which is equivalent and should hopefully be understood by whatever is displaying the final file.
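The effect of serialising to ASCII can be reproduced with plain Python, without Genshi. This is a rough sketch that assumes the serialiser falls back to numeric character references for anything the target encoding can't represent. The helper name `to_ascii_html` and the sample text are hypothetical:

```python
from html import unescape


def to_ascii_html(text: str) -> bytes:
    """Encode text as ASCII, writing non-ASCII characters as &#NNN; references."""
    return text.encode("us-ascii", errors="xmlcharrefreplace")


# Hypothetical rendered text containing the literal characters ® and ©.
rendered = "Acme\u00ae widgets \u00a9 2010"

ascii_bytes = to_ascii_html(rendered)
print(ascii_bytes)

# A browser (or any HTML parser) decodes the numeric references back to the
# same characters, which is why they are equivalent to the named entities.
assert unescape(ascii_bytes.decode("ascii")) == rendered
```

If the viewer really does need named entities rather than numeric ones, that would have to be a separate post-processing step on the output.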