I'm writing a Python script to assign grammatical categories to words in several text files. In each text file, I have file headers within angle brackets <>. Throughout the texts there are also additional lines with information such as time stamps, page numbers, and questions from the transcriber. I want to remove these lines. This is basically what the text files look like:
<title Titipuru Supay> <speaker name> <sex female> <dialect Pastaza> <register narrative> <contributor name> chan; payguna serenkya man chiga; <ima?> payguna kirina man, chiga, mana shayachira; ninagunan shi tujsirani nira: illaparani nira shi illapay <173> pasasha, ima shi kasna nin, nisha,
Even though there are the same number of headers in each file, the other <> material varies, so I can't just eliminate specific lines. So I thought I'd try something simple, like a re.sub statement that removes everything in between <> as well as the brackets themselves.
with open(file, encoding='utf-8') as file_in:
    text = file_in.read()
    re.sub(r"<.*>", " ", text)
I tried <.*> on pythex.org and regex101, and it worked in both places with a test string, but not in my script (yes, I have import re). I also tried other solutions like:
Am I just not getting the regex right, or is there something deeper here?
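From what I understand, you may have several <...> on the same line. A greedy .* in <.*> then runs from the first < to the last > on that line and swallows the text in between. In this case, you are much safer with a negated character class solution: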
text = re.sub(r"<[^>]*>", " ", text)
The text variable, of course, should be updated, as Python strings are immutable and re.sub returns a new string rather than modifying text in place. The regex now matches <, then zero or more characters other than >, and then a closing >.
See the regex demo
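To make the difference concrete, here is a minimal sketch that runs both patterns on a shortened version of the sample line from the question. The `sample`, `greedy`, and `cleaned` names are just for illustration, and the final whitespace cleanup is an optional extra that was not part of the original question.

```python
import re

# Shortened, single-line version of the sample text from the question.
sample = "<title Titipuru Supay> <sex female> chan; payguna <ima?> kirina man <173> pasasha"

# Greedy: .* runs from the first "<" to the last ">" on the line,
# so the real text between the tags is removed as well.
greedy = re.sub(r"<.*>", " ", sample)

# Negated character class: each match stops at the nearest ">".
# The result has to be assigned, because re.sub returns a new string.
cleaned = re.sub(r"<[^>]*>", " ", sample)

# Optional (an assumption about the desired output): collapse the runs
# of spaces that the replacements leave behind.
cleaned = re.sub(r"\s+", " ", cleaned).strip()

print(repr(greedy))   # only the text after the last tag survives
print(cleaned)        # chan; payguna kirina man pasasha
```

Note that `sample` itself is unchanged after both calls, which is why calling re.sub without assigning its return value appears to do nothing.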