Python: removing lines in angle brackets with regex

I'm writing a Python script to assign grammatical categories to words in several text files. Each text file has file headers within angle brackets <>. Throughout the texts there are also additional bracketed lines with information such as time stamps, page numbers, and questions from the transcriber. I want to remove these lines. This is basically what the text files look like:

<title      Titipuru Supay>
<speaker    name>
<sex        female>
<dialect    Pastaza>
<register   narrative>
<contributor    name>

chan; payguna serenkya man chiga;
<ima?>
payguna kirina man, chiga, mana
shayachira; ninagunan shi tujsirani nira:
illaparani nira shi illapay
<173>
pasasha, ima shi kasna nin, nisha,

Even though there are the same number of headers in each file, the other <> material varies, so I can't just eliminate specific lines. So I thought I'd try something simple like a re.sub statement that removes everything in between <>, including the brackets.

with open(file, encoding='utf-8') as file_in:
    text = file_in.read()
    re.sub(r"<.*>", " ", text)

I tried <.*> on pythex.org and regex101, and it worked in both places with a test string, but not in my script (yes, I have import re). I also tried other patterns such as \<.*\>

Am I just not getting the regex right, or is there something deeper here?


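From what I understand, you may have several <...> on the same line. In that case the greedy .* in <.*> matches from the first < to the last > on the line and swallows everything in between, so you are much safer with a negated character class solution: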

text = re.sub(r"<[^>]*>", " ", text)

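Note that the result must be assigned back to text: Python strings are immutable, so re.sub returns a new string instead of modifying its argument, which is why your original call appeared to do nothing. The regex now matches <, then zero or more characters other than >, and then >.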
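To make both points concrete, here is a small sketch comparing the two patterns. The sample string is made up for illustration (two bracketed items placed on one line), so treat it as an assumption rather than your actual data:

```python
import re

# Hypothetical line with two bracketed items, to show the difference.
sample = "chan; <ima?> payguna <173> kirina"

# Calling re.sub without using the return value changes nothing.
re.sub(r"<[^>]*>", " ", sample)
unchanged = sample

# Greedy: .* runs to the LAST '>' on the line, so "payguna" is lost too.
greedy = re.sub(r"<.*>", " ", sample)

# Negated character class: each <...> is matched on its own.
safe = re.sub(r"<[^>]*>", " ", sample)

print(repr(unchanged))
print(repr(greedy))
print(repr(safe))
```

Only the last call keeps the word between the two bracketed items.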

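If the goal is to drop the bracketed lines entirely rather than leave a space behind, one possible variant anchors the pattern to line starts with re.MULTILINE and also consumes the trailing line break. This is a sketch that assumes every bracketed item sits on a line of its own, as in the sample from the question; the text literal below is an abbreviated copy of that sample:

```python
import re

# Abbreviated copy of the sample from the question.
text = (
    "<title      Titipuru Supay>\n"
    "\n"
    "chan; payguna serenkya man chiga;\n"
    "<ima?>\n"
    "payguna kirina man, chiga, mana\n"
    "<173>\n"
    "pasasha, ima shi kasna nin, nisha,\n"
)

# With re.MULTILINE, ^ matches at the start of every line.
# [ \t]* eats trailing blanks and \n? removes the line break itself.
cleaned = re.sub(r"^<[^>\n]*>[ \t]*\n?", "", text, flags=re.MULTILINE)

print(cleaned)
```

The blank line that followed the header is kept, since only lines starting with < are removed.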