Iterating through an array and searching for each item in the array in a file


I don't know if I'm even asking this question the right way, but I want to search through a log file and look for each word in an array. At this point, I've asked the user to drag the file in question into the terminal, then to build an array out of inputs. The program should print out every line a word is found in.

Once I get that working I'll format, have a counter, or make a little summary of what I found in the file, etc.

Here's what I've got so far, only when I run it, it doesn't actually find any words. I've been looking through re usage examples, but I think they may be overly complicated for what I have in mind:

    def wordsToFind():
        needsWords = True
        searchArray = []
        print "Add words to search ('done') to save/continue."
        while needsWords == True:
            word = raw_input("Enter a search word: ")
            if word.lower() == "done":
                needsWords = False
                print word + " added"
        return searchArray

    def getFile():
        file_to_read = raw_input("Drag file here:").strip()
        return file_to_read

    def main():
        filePath = getFile()
        searchArray = wordsToFind()
        print "Words searched for: ", searchArray
        searchCount = []

        with open(filePath, "r") as inFile:
            for line in inFile:
                for item in searchArray:
                    if item in line:
                        print item


Obviously, any suggestions to optimize or recommendations for better python-coding are strongly welcomed here, I only know what I know, and appreciate all the help!

This is exactly the kind of problem that map-reduce is intended to solve. In case you are not familiar, map-reduce is a simple, two-step process. Suppose you have a list storing the words you are interested in finding in a text. Your mapper function can iterate through this list of words for each line of the text, and if a word appears in the line, it returns a value, say ['word', lineNum], which is stored in a results list. The mapper is essentially a wrapper over a for loop. You can then take your results list and "reduce" it by writing a reducer function, which in this case could turn the results list, which should look like [['word1', 1] ... ['word1', n] ...], into an object that looks like {'word1': [1, 2, 5], 'word3': [7], ...}.

This approach is advantageous because you abstract the process of iterating over lists while performing a common action on each item, and should your analysis needs change (as they often do), you only need to change your mapper/reducer functions without touching the rest of the code. Additionally, this method is highly parallelizable, should performance ever become an issue (just ask Google!).
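To make the two steps concrete, here is a minimal sketch on a few in-memory lines. The sample data and the names `map_step` and `reduce_step` are made up for illustration; they are not part of the full implementation further down.

```python
import re

# Hypothetical sample data standing in for the log file and the search words.
lines = ["error: disk full", "all good", "disk error again"]
words = ["error", "disk", "missing"]

def map_step(lines, words):
    """Emit a [word, line_number] pair for every search word found in a line."""
    pairs = []
    for line_num, line in enumerate(lines, start=1):
        # Break the line into lowercase alphanumeric tokens.
        tokens = re.findall(r"\w+", line.lower())
        for word in words:
            if word in tokens:
                pairs.append([word, line_num])
    return pairs

def reduce_step(pairs):
    """Collapse the [word, line_number] pairs into {word: [line_numbers]}."""
    summary = {}
    for word, line_num in pairs:
        summary.setdefault(word, []).append(line_num)
    return summary

pairs = map_step(lines, words)
summary = reduce_step(pairs)
print(summary)
```

Words that never match (here, "missing") simply produce no pairs, so they do not show up in the summary at all.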

Python ships with map and reduce: map() is a built-in function, and reduce() is a built-in in Python 2 that moved to the functools module in Python 3; look them up in the Python docs. So you can see how they work, I implemented a version of map/reduce based on your problem without using the built-in versions. Since you didn't specify how your data was stored, I made a couple of assumptions about it, namely that the list of words of interest is given as a comma-separated file. To read the text files, I used readlines() to get an array of lines, and a regular expression pattern to split the lines into words (namely, split on anything that isn't an alphanumeric character). Of course, this might not suit your needs, so you can change this to whatever makes sense for the files you're looking at.
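For comparison, here is a small sketch of the standard-library versions applied to a single line. The sample line and the helper names `count_in_line` and `add_count` are hypothetical. Importing reduce from functools works on Python 2.6+ as well as Python 3.

```python
from functools import reduce

# Hypothetical sample data: one line of text and the words to look for.
line = "disk error on disk 2"
words = ["error", "disk", "missing"]

def count_in_line(word):
    """Return a (word, count) pair for the sample line."""
    return (word, line.split().count(word))

def add_count(accum, pair):
    """Fold one (word, count) pair into the running dict of counts."""
    word, count = pair
    if count:
        accum[word] = accum.get(word, 0) + count
    return accum

# map() applies count_in_line to every word; reduce() folds the pairs into one dict.
counts = reduce(add_count, map(count_in_line, words), {})
print(counts)
```

Note that functools.reduce calls its function as function(accumulator, item), which is the opposite argument order from the hand-written reducer in the code below.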

I tried to stay away from the esoteric Python features (no lambdas!), so hopefully the implementation is clear. One last note: I used a loop to iterate over the lines of the text file, and a map function to iterate over the list of words of interest. You could use nested map functions instead, but I wanted to keep track of the loop index (since you care about line numbers). If you really want to nest the map functions, you could store each line as a tuple of line and line number when you read the file, or you could modify the map function to return the index; your choice.
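If you do want to try the nested variant, one possible sketch is below. It pairs each line with its number using enumerate() and nests one map() inside another. The sample data and the names `match_line` and `match_word` are assumptions for illustration only.

```python
import re

# Hypothetical sample data standing in for the file contents and the word list.
lines = ["Error: disk full", "all good", "disk error again"]
words = ["error", "disk"]

# Store every line together with its 1-based line number.
numbered_lines = list(enumerate(lines, start=1))

def match_line(numbered_line):
    """Outer mapper: check every search word against one numbered line."""
    line_num, text = numbered_line
    tokens = re.split(r"\W+", text.lower())

    def match_word(word):
        """Inner mapper: return [word, line_num] on a match, [] otherwise."""
        return [word, line_num] if word in tokens else []

    return list(map(match_word, words))

# One list of results per line, each holding one entry per search word.
nested = list(map(match_line, numbered_lines))

# Flatten the per-line lists and drop the empty placeholders.
map_output = [pair for per_line in nested for pair in per_line if pair]
print(map_output)
```

The line number travels with the line itself, so no loop index is needed.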

I hope this helps!

    #!/usr/bin/env python

    #Regexp library
    import re

    #This function returns a new array containing
    #the elements after they have been modified by whatever function we passed in.
    def mapper(function, sequence):

        #List to store the results of the map operation
        result = []

        #Iterate over each item in sequence, append the values to the results list
        #after they have been modified by the "function" supplied as an argument in the
        #mapper function call.
        for item in sequence:
            result.append(function(item))

        return result

    #The purpose of the reduce function is to go through an array, and combine the items
    #according to a specified function - this specified function should combine an element
    #with a base value
    def reducer(function, sequence, base_value):

        #Need to get a base value to serve as the starting point for the construction of
        #the result
        #I will assume one is given, but in most cases you should include extra validation
        #here to either ensure one is given, or some sensible default is chosen

        #Initialize our accumulative value object with the base value
        accum_value = base_value

        #Iterate through the sequence items, applying the "function" provided, and
        #storing the results in the accum_value object
        for item in sequence:
            accum_value = function(item, accum_value)

        return accum_value

    #With these functions it should be sufficient to address your problem, what remains
    #is simply to get the data from the text files, and keep track of the lines in
    #which words appear
    if __name__ == '__main__':

        word_list_file = 'FILEPATH GOES HERE'

        #Read in a file containing the words that will be searched in the text file
        #(assumes words are given as a comma separated list)
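        infile = open(word_list_file, 'rt')    #Open file
        content = infile.read()    #read the whole file as a single string
        infile.close()
        #split the string into an array of words, trimming whitespace and lowercasing
        #so they compare equal to the lowercased line words in my_mapper
        word_list = [word.strip().lower() for word in content.split(',') if word.strip()]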

        target_text_file = 'FILEPATH GOES HERE'

        #Read in the text to analyze
        infile = open(target_text_file, 'rt')   #Open file
        target_text_lines = infile.readlines()    #Read the whole file as an array of lines

        #With the data loaded, the overall strategy will be to loop over the text lines, and
        #we will use the map function to loop over the word_list and see if they are in
        #the current text file line

        #First, define the my_mapper function that will process your data, and will be passed to
        #the map function
        def my_mapper(item):

            #Split the current sentence into words
            #Will split on any non alpha-numeric character. This strategy can be revised
            #to find matches to a regular expression pattern based on the words in the
            #words list. Either way, make sure you choose a sensible strategy to do this.
            current_line_words = re.split(r'\W+', target_text_lines[k])

            #lowercase the words
            current_line_words = [word.lower() for word in current_line_words]

            #Check if the current item (word) is in the current_line_words list, and if so,
            #return the word and the line number
            if item in current_line_words:
                return [item, k+1]    #Return k+1 because k begins at 0, but I assume line
                                      #counting begins with 1?
            else:
                return []   #Technically, this does not need to be added, it can simply
                            #return None by default, but that requires manually handling
                            #the None values later so the reducer doesn't crash on them,
                            #and I am being lazy :D

        #With the mapper function established, we can proceed to loop over the text lines of the
        #array, and use our map function to process the lines against the list of words.

        #This array will store the results of the map operation
        map_output = []

        #Loop over text file lines, use mapper to find which words are in which lines, store
        #in map_output list. This is the exciting stuff!
        for k in range(len(target_text_lines)):
            map_output.extend(mapper(my_mapper, word_list))

        #At this point, we should have a list of lists containing the words and the lines they
        #appeared in, and it should look like [['word1', 1] ... ['word25', 5] ... [] ...]
        #As you can see, the post-map array will have an entry for each word that appeared in
        #each line, and if a particular word did not appear in a particular line, there will be
        #an empty list instead.

        #Now all that remains is to summarize our data, and that is what the reduce function is
        #for. We will iterate over the map_output list, and collect the words and which lines
        #they appear at in an object that will have the format { 'word': [n1, n2, ...] }, where
        #n1, n2, ... are the lines the word appears in. As in the case for the mapper
        #function, the output of the reduce function can be modified in the my_reducer function
        #you supply to it. If you'd rather it return something else (like say, word count), this
        #is the function to modify.

        def my_reducer(item, accum_value):
            #First, verify item is not empty
            if item != []:
                #If the word already exists in the output object, append the current line
                #value to it; if not, add it to the object and create a list holding the
                #current line value
                if item[0] in accum_value:
                    #Check this word/line combination isn't already stored in the output dict
                    if item[1] not in accum_value[item[0]]:
                        accum_value[item[0]].append(item[1])
                else:
                    accum_value[item[0]] = [item[1]]

            return accum_value

        #Now we can call the reduce function, save its output, print it to the screen, and we're
        #done! (Note that for the base value we are just passing in an empty object, {})
        reduce_results = reducer(my_reducer, map_output, {})

        #Print results to screen
        for result in reduce_results:
            print('word: {}, lines: {}'.format(result, reduce_results[result]))