What is the best way to handle uploaded text files in different encodings?


Internally our PHP application uses UTF-8, and we process .csv files and fixed-width (text) files. We have written some nice libraries (classes, essentially) to work with these files.

We recently added the ability for administrators to upload files of these types so they could be processed, and quickly ran into issues across multiple operating systems. We soon realised that the files being read in used different encodings from our application (e.g. Windows-1252 or ISO-8859).

Since it is impossible to control which encoding submitted files use, my question is: what is the best way to handle uploaded text files of different encodings? I can currently think of two solutions:

  • When a file is received, detect its encoding and convert it to UTF-8, then re-save it. The rest of the system then only needs to be UTF-8 aware and can ignore 'encoding' issues.
  • Change the CSV / fixed-width libraries so they become encoding-aware themselves.
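
To make the first option concrete, here is a minimal sketch of normalising at the point of input. The function name `normalizeToUtf8` and the Windows-1252 fallback are assumptions for illustration only: the idea is that valid UTF-8 passes through untouched and everything else is treated as the fallback encoding. It needs the mbstring and iconv extensions.

```php
<?php
// Hypothetical helper: normalise uploaded bytes to UTF-8 at the point of input.
// Assumption: anything that is not valid UTF-8 is in $fallbackEncoding.
function normalizeToUtf8(string $bytes, string $fallbackEncoding = 'Windows-1252'): string
{
    // Strip a UTF-8 byte order mark if one is present.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        $bytes = substr($bytes, 3);
    }

    // Already valid UTF-8: nothing to convert.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // Otherwise fall back to the assumed legacy encoding.
    $converted = iconv($fallbackEncoding, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $fallbackEncoding to UTF-8");
    }
    return $converted;
}
```

The upload handler would then re-save the file once, for example with `file_put_contents($path, normalizeToUtf8(file_get_contents($path)))`, and the rest of the system only ever sees UTF-8.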

I also thought about the pros and cons of each:

  • Converting the input keeps the rest of the libraries smaller and reduces duplication, but it seems wasteful in terms of processing.
  • Making the libraries encoding-aware internally seems to involve more code, but might be faster.
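
For comparison, the second option might look roughly like the sketch below, where the reader converts on the fly while it parses. `readCsvAsUtf8` is a hypothetical name, and the sketch assumes the caller already knows the source encoding. It relies on PHP's built-in `convert.iconv.*` stream filter, which is available when the iconv extension is loaded.

```php
<?php
// Hypothetical reader: yields CSV rows as UTF-8 regardless of the source encoding.
// Assumption: the caller knows (or has been told) the source encoding.
function readCsvAsUtf8(string $path, string $sourceEncoding): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    if (strcasecmp($sourceEncoding, 'UTF-8') !== 0) {
        // Built-in stream filter: converts bytes to UTF-8 as they are read.
        $filter = stream_filter_append(
            $handle,
            "convert.iconv.$sourceEncoding/UTF-8",
            STREAM_FILTER_READ
        );
        if ($filter === false) {
            fclose($handle);
            throw new RuntimeException("No iconv filter for $sourceEncoding");
        }
    }

    try {
        // All CSV parameters are passed explicitly to avoid relying on defaults.
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            yield $row;
        }
    } finally {
        fclose($handle);
    }
}
```

Nothing is re-saved here, but every reader (CSV, fixed-width, and any future format) has to carry the same encoding parameter, which is the duplication mentioned above.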

Thoughts please?

Edit: I am really interested to know where, architecturally, character encoding conversion should happen: at the point of input, or while the files are being used?


This is tricky, and there is no perfect solution.

phpMyAdmin, for example, lets the user specify the encoding of the uploaded file. Since no automatic detection method is 100% reliable, letting the user choose is the best way to go IMO, if at all possible.
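
A small illustration of why detection can only ever guess: in single-byte encodings nearly every byte is valid, so the same byte converts cleanly under several encodings and simply means something different in each. The three encodings below are picked purely as examples.

```php
<?php
// One byte, three plausible readings. Every conversion succeeds, so nothing
// in the data itself says which encoding the author actually used.
$byte = "\xE9";
$readings = [];
foreach (['ISO-8859-1', 'Windows-1251', 'ISO-8859-7'] as $encoding) {
    $readings[$encoding] = iconv($encoding, 'UTF-8', $byte);
}

// ISO-8859-1   reads the byte as "é" (Latin small e with acute)
// Windows-1251 reads the byte as "й" (Cyrillic short i)
// ISO-8859-7   reads the byte as "ι" (Greek small iota)
```

A detector can use statistics or language hints to rank these candidates, but only the person who produced the file knows which one is right.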

An import dialog that allows the user to select the right encoding while seeing a preview of what their data looks like in that encoding might be optimal.

One way to do this could be:

  • Receive the uploaded file and store it in a temporary file

  • Display a dialog with a drop-down selection of the most important encodings

  • Have an iframe that, whenever the selected value in the drop-down changes, converts the contents of the uploaded file using iconv() (source = the selected encoding; target = UTF-8) and shows a preview.

  • When the user selects an encoding, do a final iconv() and store the file as UTF-8.
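
The server side of those steps could be sketched as follows. The names `SUPPORTED_ENCODINGS`, `convertUpload` and `previewUpload` are hypothetical, and the whitelist is only an example. The preview splits on the newline byte before converting, which assumes ASCII-compatible encodings (so not UTF-16).

```php
<?php
// Hypothetical whitelist backing the drop-down; extend it to match your users.
const SUPPORTED_ENCODINGS = ['UTF-8', 'Windows-1252', 'ISO-8859-1', 'ISO-8859-15'];

// Final step: convert the raw upload to UTF-8 using the encoding the user picked.
function convertUpload(string $bytes, string $selectedEncoding): string
{
    // Never pass an arbitrary request parameter straight to iconv().
    if (!in_array($selectedEncoding, SUPPORTED_ENCODINGS, true)) {
        throw new InvalidArgumentException("Unsupported encoding: $selectedEncoding");
    }
    if ($bytes === '') {
        return '';
    }
    $utf8 = iconv($selectedEncoding, 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException("Data is not valid $selectedEncoding");
    }
    return $utf8;
}

// Preview step: convert only the first few lines so the iframe stays fast.
function previewUpload(string $bytes, string $selectedEncoding, int $maxLines = 5): array
{
    // Cut the raw bytes into lines first, then convert just those lines.
    $lines = array_slice(explode("\n", $bytes, $maxLines + 1), 0, $maxLines);
    return array_map(
        fn(string $line): string => convertUpload(rtrim($line, "\r"), $selectedEncoding),
        $lines
    );
}
```

The iframe would call something like `previewUpload()` on the temporary file each time the drop-down changes and escape the result with `htmlspecialchars()` before printing it. Once the user confirms, `convertUpload()` runs over the whole file and the result is stored as UTF-8.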