How to parse a string that is in a different Java encoding


I have a string that I have read in from a Word document. I think it is in "Cp1252" encoding. Java uses UTF8.

How do I search that string for those special characters in Cp1252 and replace them with an appropriate UTF8 character?

Specifically, I want to replace the "En Dash" character with a plain "-".

The following code block takes projDateString, which comes from the Word document, and tries to do that:

    byte[] test = projDateString.getBytes("Cp1252");
    for (int i = 0; i < test.length; i++) {
        System.out.println("test[" + i + "] = " + Integer.toHexString(test[i]));
    }
    String projDateString2 = new String(test);
    projDateString2.replaceAll("\0x96", "\u2013");
    System.out.println("projDateString2: " + projDateString);

I am not sure I am setting up projDateString2 correctly. As you can see, the hex value of that dash is ffffff96 when I getBytes on the string using Cp1252 encoding. If I getBytes with UTF8 it comes in as 3 hex values instead of one.

This gives me the following output:

test[0] = 30
test[1] = 38
test[2] = 2f
test[3] = 32
test[4] = 30
test[5] = 31
test[6] = 30
test[7] = 20
test[8] = ffffff96
test[9] = 20
test[10] = 50
test[11] = 72
test[12] = 65
test[13] = 73
test[14] = 65
test[15] = 6e
test[16] = 74
projDateString2: 08/2010 ΓÇô Present

As you can see, the replace did nothing, and the println still gives me garbage characters instead of a plain "-".

Java strings are always in UTF-16, at least as far as the API is concerned... but you can generally just think of them as being "Unicode". The fact that they're UTF-16 is only really relevant when it comes to characters outside the Basic Multilingual Plane, i.e. with Unicode values above U+FFFF. They have to be represented as surrogate pairs in Java. But I don't think you need to worry about this in your case. So just think of the values in Strings as "Unicode text" without a specific encoding... in particular, definitely not in UTF-8 or CP1252. Those are the encodings used to convert binary data (e.g. a byte array) into text data (e.g. a string).
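To make the distinction concrete, here is a minimal sketch (the class name EncodingDemo and the hex helper are made up for illustration) showing that the en dash is a single char in a String, and only becomes one byte or three bytes once you pick an encoding. It also masks each byte with 0xff, which is why it prints 96 rather than the sign-extended ffffff96 from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Render a byte array as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xff so negative bytes are not sign-extended.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String enDash = "\u2013"; // U+2013 EN DASH, one char in the String
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println(enDash.length());                              // 1
        System.out.println(hex(enDash.getBytes(cp1252)));                 // 96
        System.out.println(hex(enDash.getBytes(StandardCharsets.UTF_8))); // e2 80 93
    }
}
```

The String itself is the same in both cases; only the byte representation you ask for differs.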

You shouldn't use String.getBytes() or new String(byte[]) without specifying the encoding; that's the problem. In your code, new String(test) is the call that omits it. Those overloads always use the platform default encoding, which is almost always the wrong choice.

You say you "have a string that I have read in from a Word document" - how did you read it in? How did it start off life?

If you have the bytes and you know the relevant encoding, you should use:

    String text = new String(bytes, encoding);

You should never have to deal with a string which has been created using the wrong encoding - if you get to that stage, you're almost bound to be risking information loss. Tackle the problem as early as you possibly can, rather than trying to fix the data up later on.
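As a rough illustration of the information-loss point, the sketch below decodes the same bytes twice: once with the charset they were written in, and once with the wrong one. The sample bytes and the names DecodeDemo and decode are assumptions for the example, not anything from your code. It assumes the bytes really are windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Decode raw bytes with an explicitly named charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Hypothetical sample: "2010 ", the byte 0x96, then " P", as windows-1252.
        byte[] bytes = {0x32, 0x30, 0x31, 0x30, 0x20, (byte) 0x96, 0x20, 0x50};

        String right = decode(bytes, Charset.forName("windows-1252"));
        String wrong = decode(bytes, StandardCharsets.UTF_8);

        // Correct charset: byte 0x96 becomes U+2013 (en dash).
        System.out.println(right.charAt(5) == '\u2013'); // true

        // Wrong charset: 0x96 is not valid UTF-8 on its own, so the decoder
        // substitutes U+FFFD and the original byte value is gone.
        System.out.println(wrong.charAt(5) == '\uFFFD'); // true
    }
}
```

Once the replacement character is in the string, nothing in the string tells you it used to be an en dash, which is why the decoding has to be right at the point where bytes become text.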

The next thing to understand is that the String class in Java is immutable. Calling replaceAll on a string won't change the existing string. It will instead return a new string with the replacements made.

So this statement:

    projDateString2.replaceAll("\0x96", "\u2013");

will never do what you want. Even if everything else is correct, you should be using:

    projDateString2 = projDateString2.replaceAll("\0x96", "\u2013");

(or something similar). I don't think that actually will do what you want anyway, but you need to be aware of it for when everything else is sorted out.
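For when everything else is sorted out, here is a small sketch of the replacement itself. It assumes the string has already been decoded correctly, so the en dash is present as U+2013; ReplaceDemo and normalizeDash are names invented for the example. It also shows why the "\0x96" pattern can't match: in a Java string literal, \0 is an octal escape for the NUL character, and x96 is just three ordinary characters after it.

```java
public class ReplaceDemo {
    // Replace every en dash (U+2013) with a plain ASCII hyphen.
    static String normalizeDash(String s) {
        // replace(char, char) is a literal replacement; no regex is involved.
        return s.replace('\u2013', '-');
    }

    public static void main(String[] args) {
        String original = "08/2010 \u2013 Present";

        // Strings are immutable: this result is discarded and original is unchanged.
        original.replace('\u2013', '-');

        // Keep the returned value instead.
        String fixed = normalizeDash(original);

        System.out.println(original.equals(fixed)); // false
        System.out.println(fixed);                  // 08/2010 - Present

        // NUL followed by 'x', '9', '6': four chars, none of them an en dash.
        System.out.println("\0x96".length());       // 4
    }
}
```

Using replace rather than replaceAll is a design choice here: there is no pattern to match, just one character to swap, so the regex machinery isn't needed.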