Removing unknown characters from a text file

I have a large number of files containing data I am trying to process using a Python script.

The files are in an unknown encoding, and if I open them in Notepad++ they contain numerical data separated by a load of 'null' characters (represented as NULL in white on black background in Notepad++).

To handle this, I split the file contents on the null character \x00 and keep only the numerical values using the following script:

import os

stripped_data = []
for root, dirs, files in os.walk(PATH):  # PATH is the top-level directory holding the data files
    for rawfile in files:
        # 'r' opens the file in text mode
        with open(os.path.join(root, rawfile), 'r') as h:
            line = h.read()
        # Fields are separated by NUL characters; keep only those that parse as numbers
        for raw_value in line.split('\x00'):
            try:
                float(raw_value)
                stripped_data.append(raw_value.strip())
            except ValueError:
                pass

However, there are sometimes other unrecognised characters in the file (as far as I have found, always at the very beginning) - these show up in Notepad++ as 'EOT', 'SUB' and 'ETX'. They seem to interfere with the processing of the file in Python - the file appears to end at those characters, even though there is clearly more data visible in Notepad++.

How can I remove all non-ASCII characters from these files prior to processing?


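You are opening the file in text mode. On Windows, that means the first Ctrl-Z character (SUB, \x1a) is treated as an end-of-file marker, so read() stops there even though more data follows. Specify 'rb' instead of 'r' in open() to read the file in binary mode, where Ctrl-Z is just another byte.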
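
As a minimal sketch of the binary-mode approach, the per-file logic might look like the following. The function name extract_numbers is hypothetical, and decoding each field as ASCII (silently dropping any non-ASCII bytes) is an assumption about the data, not something the question confirms.

```python
def extract_numbers(path):
    """Return the numeric fields of a NUL-separated file as strings.

    Hypothetical helper: assumes the numeric fields are plain ASCII text.
    """
    # Binary mode: no byte (including Ctrl-Z, \x1a) is treated as end-of-file
    with open(path, 'rb') as handle:
        data = handle.read()

    values = []
    for raw_value in data.split(b'\x00'):
        # Drop any non-ASCII bytes, then trim surrounding whitespace
        text = raw_value.decode('ascii', errors='ignore').strip()
        try:
            float(text)  # only checks that the field parses as a number
        except ValueError:
            continue  # empty fields and control-character runs end up here
        values.append(text)
    return values
```

Inside the os.walk loop this would be called as stripped_data.extend(extract_numbers(os.path.join(root, rawfile))). Fields made up only of control characters such as EOT, SUB or ETX fail the float() check and are skipped, so no separate clean-up pass is needed for them.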