I have a large number of files containing data I am trying to process using a Python script.
The files are in an unknown encoding; when I open them in Notepad++ they contain numerical data separated by many null characters (shown as NUL in white on a black background in Notepad++).
To handle this, I split each file on the null character \x00 and keep only the values that parse as numbers, using the following script:
    import os

    stripped_data = []
    for root, dirs, files in os.walk(PATH):
        for rawfile in files:
            (dirName, fileName) = os.path.split(rawfile)
            (fileBaseName, fileExtension) = os.path.splitext(fileName)
            h = open(os.path.join(root, rawfile), 'r')
            line = h.read()
            h.close()
            for raw_value in line.split('\x00'):
                try:
                    test = float(raw_value)
                    stripped_data.append(raw_value.strip())
                except ValueError:
                    pass
However, there are sometimes other unrecognised characters in the file (as far as I have found, always at the very beginning) - these show up in Notepad++ as 'EOT', 'SUB' and 'ETX'. They seem to interfere with the processing of the file in Python - the file appears to end at those characters, even though there is clearly more data visible in Notepad++.
How can I remove all non-ASCII characters from these files prior to processing?
You are opening the file in text mode. On Windows, that means the first Ctrl-Z character (SUB, \x1a) is treated as an end-of-file marker, which is why the read stops early. Specify 'rb' instead of 'r' in open().
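A minimal sketch of the binary-mode approach, using a temporary file with made-up sample bytes (the EOT/ETX/SUB prefix and the numeric values are assumptions standing in for your real data):

```python
import os
import tempfile

# Sample file contents: leading control characters (EOT, ETX, SUB)
# followed by NUL-separated numbers, mimicking the data described.
sample = b'\x04\x03\x1a12.5\x0042.0\x003.7\x00'
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(sample)

# Open in binary mode ('rb') so Ctrl-Z (\x1a) does not truncate the read.
with open(path, 'rb') as h:
    raw = h.read()

stripped_data = []
for chunk in raw.split(b'\x00'):
    # Decode to text, then strip the control characters seen at the
    # start of the files (EOT \x04, ETX \x03, SUB \x1a) plus whitespace.
    text = chunk.decode('ascii', errors='ignore').strip('\x03\x04\x1a \r\n')
    try:
        stripped_data.append(float(text))
    except ValueError:
        pass  # skip anything that is not a number

print(stripped_data)  # [12.5, 42.0, 3.7]
os.remove(path)
```

Because every chunk goes through the float() filter anyway, stray control characters elsewhere in the file are discarded the same way as at the start.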