I'm trying to parse a txt file and put sentences in a list that fit my criteria. The text file consists of several thousand lines and I'm looking for lines that start with a specific string, lets call this string 'start'. The lines in this text file can belong together and are somehow seperated with \n
at random.
This means I have to look for any string that starts with 'start', put it in an empty string 'complete' and then continue scanning each line after that to see if it also starts with 'start'.
If not then I need to append it to 'complete' because then it is part of the entire sentence. If it does I need to append 'complete' to a list, create a new, empty 'complete' string and start appending to that one. This way I can loop through the entire text file without paying attention to the number of lines a sentence exists of.
My code thusfar:
import sys, string
lines_1=[]
startswith = ('keys', 'values', 'files', 'folders', 'total')
completeline = ''
with open (sys.argv[1]) as f:
data = f.read()
for line in data:
if line.lower().startswith(startswith):
completeline = line
else:
completeline += line
lines_1.append(completeline)
# check some stuff in output
for l in lines_1:
print "______"
print l
print len(lines_1)
However this puts the entire content in 1 item in the list, where I'd like everything to be seperated.
Keep in mind that the lines composing one sentence can span one, two, 10 or 1000 lines so it needs to spot the next startswith
value, append the existing completeline
to the list and then fill completeline
up with the next sentence.
Much obliged!
Two issues:
- Iterating over a string, not lines:
When you iterate over a string, the value yielded is a character, not a line. This means for line in data:
is going character by character through the string. Split your input by newlines, returning a list, which you then iterate over. e.g. for line in data.split('\n'):
- Overwriting the completeline inside the loop
You append a completed line at the end of the loop, but not when you start recording a new line inside the loop. Change the if
in the loop to something like this:
if line.lower().startswith(startswith):
if completeline:
lines_1.append(completeline)
completeline = line