Removing extra spaces in Chinese HTML files using lxml


I have a bunch of improperly formatted Chinese HTML files. They contain unnecessary spaces and line breaks, which will be displayed as extra spaces in the browser. I've written a script using lxml to modify the HTML files. It works fine on simple tags, but I'm stuck on nested ones. For example:

<p>祝你<span>19</span>岁
    生日快乐。</p>

will be displayed in the browser as:

祝你19岁 生日快乐。

Notice the extra space; this is what needs to be deleted. The resulting HTML should look like this:

<p>祝你<span>19</span>岁生日快乐。</p>

How do I do this?

Note that the nesting (like the span tag) could be arbitrary, but I don't need to touch the content of the nested elements; they should be preserved as they are. Only the text in the outer element needs to be formatted.
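For context on why the nesting matters: lxml splits the character data of a mixed-content element between the parent's `.text` (everything before the first child) and each child's `.tail` (everything after that child's closing tag), so the stray whitespace can end up on either attribute. A quick sketch:

```python
import lxml.html

# lxml splits mixed content: the parent's .text holds the part before
# the first child; each child's .tail holds the part after its closing tag.
p = lxml.html.fragment_fromstring("<p>祝你<span>19</span>岁\n    生日快乐。</p>")
span = p[0]
print(repr(p.text))     # '祝你'
print(repr(span.text))  # '19'
print(repr(span.tail))  # '岁\n    生日快乐。' (the whitespace lives here)
```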

This is what I've got:

# -*- coding: utf-8 -*-

import lxml.html
import re

s1 = """<p>祝你19岁
    生日快乐。</p>"""
p1 = lxml.html.fragment_fromstring(s1)
print(p1.text)                        # I get the whole line.
p1.text = re.sub(r"\s+", "", p1.text)
print(lxml.html.tostring(p1, encoding="unicode"))  # spaces are removed.

s2 = """<p>祝你<span>19</span>岁
    生日快乐。</p>"""
p2 = lxml.html.fragment_fromstring(s2)
print(p2.text)            # I get "祝你"
print(p2.tail)            # I get None
i = p2.itertext()
print(next(i))            # I get "祝你"
print(next(i))            # I get "19" from <span>
print(next(i))            # I get the tail text, but how do I assemble them back?
print(p2.text_content())  # The whole text, but how do I put <span> back?


Alternatively, I wonder whether this is possible without an HTML/XML parser at all, given that the extra space appears to be caused by line wrapping.

I built a regular expression that looks for whitespace between Chinese characters, with the help of this solution: https://stackoverflow.com/a/2718268/267781

I don't know whether the catch-all (any whitespace between characters) or the more specific [char]\n\s*[char] is the better fit for your problem.

# -*- coding: utf-8 -*-
import re

# Whitespace between Chinese characters in HTML
## Character ranges taken from this solution: https://stackoverflow.com/a/2718268/267781
han = ('\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u3005\u3007\u3021-\u3029'
      '\u3038-\u303a\u303b\u3400-\u4db5\u4e00-\u9fc3\uf900-\ufa2d'
      '\ufa30-\ufa6a\ufa70-\ufad9\U00020000-\U0002a6d6\U0002f800-\U0002fa1d')
## \s+ (catch-all)
fixwhitespace2 = re.compile(r'([%s])\s+([%s])' % (han, han))
## \n\s* (more specific); the capturing groups keep the two characters,
## which a plain sub('') would delete along with the whitespace
fixwhitespace = re.compile(r'([%s])\n\s*([%s])' % (han, han))

sample = '<html><body><p>\u795d\u4f6019\u5c81\n    \u751f\u65e5\u5feb\u4e50\u3002</p></body></html>'

print(fixwhitespace.sub(r'\1\2', sample))

Yielding

<html><body><p>祝你19岁生日快乐。</p></body></html>


However, here's how you might do it using the parser and XPath to find line feeds:

# -*- coding: utf-8 -*-
from lxml import etree
import re

han = ('\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u3005\u3007\u3021-\u3029'
      '\u3038-\u303a\u303b\u3400-\u4db5\u4e00-\u9fc3\uf900-\ufa2d'
      '\ufa30-\ufa6a\ufa70-\ufad9\U00020000-\U0002a6d6\U0002f800-\U0002fa1d')
fixwhitespace = re.compile(r'([%s])\n\s*([%s])' % (han, han))
sample = '<html><body><p>\u795d\u4f6019\u5c81\n    \u751f\u65e5\u5feb\u4e50\u3002</p></body></html>'

doc = etree.HTML(sample)
for t in doc.xpath("//text()[contains(., '\n')]"):
    if t.is_tail:
        t.getparent().tail = fixwhitespace.sub(r'\1\2', t)
    elif t.is_text:
        t.getparent().text = fixwhitespace.sub(r'\1\2', t)

print(etree.tostring(doc, encoding='unicode'))

Yields:

<html><body><p>祝你19岁生日快乐。</p></body></html>
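As a check that the text()/is_tail route also handles the nested case from the question, here is a self-contained sketch; I use the narrower CJK Unified Ideographs range \u4e00-\u9fff here just to keep the pattern short, which is an assumption on my part:

```python
import re
from lxml import etree

# Whitespace run between two CJK Unified Ideographs; the capturing
# groups keep the two characters when substituting.
cjk = '\u4e00-\u9fff'
fix = re.compile(r'([%s])\s+([%s])' % (cjk, cjk))

doc = etree.HTML('<p>祝你<span>19</span>岁\n    生日快乐。</p>')
for t in doc.xpath("//text()[contains(., '\n')]"):
    parent = t.getparent()         # for a tail string, the element it trails
    if t.is_tail:
        parent.tail = fix.sub(r'\1\2', t)
    elif t.is_text:
        parent.text = fix.sub(r'\1\2', t)

print(etree.tostring(doc, encoding='unicode'))
# the whitespace, sitting in the tail of <span>, is removed:
# <html><body><p>祝你<span>19</span>岁生日快乐。</p></body></html>
```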

I'm curious which variant best matches your working data.