Stripping Markup (XML?) from a Document Using Python

I have a file that contains the names of scientists in the following format: <scientist_names> <scientist>abc</scientist> </scientist_names>. I want to use Python to extract the scientists' names from this format. How should I do it? I would like to use regular expressions but don't know how to use them. Please help.


As noted, this appears to be XML. In that case, you should use an XML parser rather than regular expressions to process this document; I recommend lxml ( http://lxml.de ).
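
As a rough illustration of the parser approach, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without installing anything; lxml.etree offers a largely compatible interface. The sample document and the function name scientist_names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Sample document in the format from the question (contents are illustrative).
DOC = """<scientist_names>
  <scientist>abc</scientist>
  <scientist>def</scientist>
</scientist_names>"""


def scientist_names(xml_text):
    """Return the text of every <scientist> element, in document order."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the nesting depth does not matter.
    return [el.text.strip() for el in root.iter("scientist") if el.text]


names = scientist_names(DOC)
print(names)
```

To read from a file instead of a string, ET.parse(path).getroot() can take the place of ET.fromstring.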

Given your requirements, you may find SAX-style parsing more convenient than DOM-style parsing. With SAX, you simply register handlers that the parser calls whenever it encounters a particular tag. This works well as long as the meaning of a tag does not depend on its context, and it pays off most when you have more than one type of tag to process (which may not be the case here).
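
For comparison, a SAX-style version might look like the sketch below. It uses the standard library's xml.sax module; the class name ScientistHandler and the helper sax_scientist_names are hypothetical names chosen for this example.

```python
import xml.sax


class ScientistHandler(xml.sax.ContentHandler):
    """Collect the text found inside each <scientist> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # None means we are outside a <scientist> tag

    def startElement(self, name, attrs):
        if name == "scientist":
            self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "scientist" and self._buffer is not None:
            self.names.append("".join(self._buffer).strip())
            self._buffer = None


def sax_scientist_names(xml_bytes):
    """Parse the document and return the collected names."""
    handler = ScientistHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.names


SAX_DOC = b"<scientist_names> <scientist>abc</scientist> </scientist_names>"
sax_names = sax_scientist_names(SAX_DOC)
print(sax_names)
```

The handler never holds the whole document in memory, which is the main practical advantage of SAX for large files. For a file this small, either style should do.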

If your input document may be malformed, you may wish to use Beautiful Soup instead: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML