The utf8 data looks good in mysql but is broken in rails

advertisements

I'm setting a rails environment up for one of my colleagues, who's using a mac (in case that's relevant). I've pulled the data down from our live mysql database and made a local development database with that data. If i open the mysql console, and look at the data for a record which has extended charset characters in its name field, then it looks fine. However, in the rails console (and in a rails-generated web page) the encoding is broken: an endash is replaced by "—" for example.

The only rails config options i know about that are relevant to this is in config/database.yml. I currently have this set:

encoding: utf8
collation: utf8_general_ci

which makes it work fine on my machine for example. But like i say it's not working on my colleague's machine. Any ideas anyone?

EDIT 1: on the live server, where i copied the data FROM, the charset info looks like this:

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

EDIT 2: in response to @eggyal's comment i've done a couple of mysqldumps, which has been quite revealing. Here's the first dump:

$ mysqldump -u root -h127.0.0.1  dbname lessons --where="id=79510"
-- MySQL dump 10.11
--
-- Host: 127.0.0.1    Database: e_learning_resource_v3
-- ------------------------------------------------------
-- Server version   5.0.32-Debian_7etch4-log

/*!40101 SET @[email protected]@CHARACTER_SET_CLIENT */;
/*!40101 SET @[email protected]@CHARACTER_SET_RESULTS */;
/*!40101 SET @[email protected]@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @[email protected]@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @[email protected]@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @[email protected]@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @[email protected]@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @[email protected]@SQL_NOTES, SQL_NOTES=0 */;

--
-- Table structure for table `lessons`
--

DROP TABLE IF EXISTS `lessons`;
CREATE TABLE `lessons` (
  `id` int(11) NOT NULL auto_increment,
  `name` varchar(255) default NULL,
  `description` text,
  `user_id` int(11) default NULL,
  `created_at` datetime default NULL,
  `privacy` int(11) default '1',
  `is_official` tinyint(1) default '0',
  `is_readonly` tinyint(1) default NULL,
  `comments_allowed` tinyint(1) default NULL,
  `hours` int(11) default NULL,
  `sessions` int(11) default NULL,
  `updated_at` datetime default NULL,
  `custom_menu_swf` varchar(255) default NULL,
  `pupil_liked_at` datetime default NULL,
  `user_liked_at` datetime default NULL,
  `pupil_favorite_count` int(11) default '0',
  `user_favorite_count` int(11) default '0',
  `teacher_notes` text,
  `pupil_notes` text,
  PRIMARY KEY  (`id`),
  KEY `user_id` (`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

--
-- Dumping data for table `lessons`
--
-- WHERE:  id=79510

LOCK TABLES `lessons` WRITE;
/*!40000 ALTER TABLE `lessons` DISABLE KEYS */;
INSERT INTO `lessons` VALUES (79510,'Jazz–Man',NULL,NULL,'2014-04-03 12:08:05',1,0,NULL,NULL,NULL,NULL,'2014-04-03 12:08:05',NULL,NULL,NULL,0,0,NULL,NULL);
/*!40000 ALTER TABLE `lessons` ENABLE KEYS */;
UNLOCK TABLES;
/*!40103 SET [email protected]_TIME_ZONE */;

/*!40101 SET [email protected]_SQL_MODE */;
/*!40014 SET [email protected]_FOREIGN_KEY_CHECKS */;
/*!40014 SET [email protected]_UNIQUE_CHECKS */;
/*!40101 SET [email protected]_CHARACTER_SET_CLIENT */;
/*!40101 SET [email protected]_CHARACTER_SET_RESULTS */;
/*!40101 SET [email protected]_COLLATION_CONNECTION */;
/*!40111 SET [email protected]_SQL_NOTES */;

-- Dump completed on 2014-04-03 11:16:42

So, this was just a straight mysqldump and it's got the broken character in it (Jazz–Man) in the "INSERT INTO lessons" line.

I do it again with some extra options, and the data looks ok in the dump file:

$ mysqldump -u root -h127.0.0.1  dbname lessons --extended-insert --single-transaction --default-character-set=latin1 --skip-set-charset --where="id=79510"
-- MySQL dump 10.11
--
-- Host: 127.0.0.1    Database: e_learning_resource_v3
-- ------------------------------------------------------
-- Server version   5.0.32-Debian_7etch4-log
/*!40103 SET @[email protected]@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @[email protected]@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @[email protected]@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @[email protected]@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @[email protected]@SQL_NOTES, SQL_NOTES=0 */;

--
-- Table structure for table `lessons`
--

DROP TABLE IF EXISTS `lessons`;
CREATE TABLE `lessons` (
  `id` int(11) NOT NULL auto_increment,
  `name` varchar(255) default NULL,
  `description` text,
  `user_id` int(11) default NULL,
  `created_at` datetime default NULL,
  `privacy` int(11) default '1',
  `is_official` tinyint(1) default '0',
  `is_readonly` tinyint(1) default NULL,
  `comments_allowed` tinyint(1) default NULL,
  `hours` int(11) default NULL,
  `sessions` int(11) default NULL,
  `updated_at` datetime default NULL,
  `custom_menu_swf` varchar(255) default NULL,
  `pupil_liked_at` datetime default NULL,
  `user_liked_at` datetime default NULL,
  `pupil_favorite_count` int(11) default '0',
  `user_favorite_count` int(11) default '0',
  `teacher_notes` text,
  `pupil_notes` text,
  PRIMARY KEY  (`id`),
  KEY `user_id` (`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

--
-- Dumping data for table `lessons`
--
-- WHERE:  id=79510

LOCK TABLES `lessons` WRITE;
/*!40000 ALTER TABLE `lessons` DISABLE KEYS */;
INSERT INTO `lessons` VALUES (79510,'Jazz–Man',NULL,NULL,'2014-04-03 12:08:05',1,0,NULL,NULL,NULL,NULL,'2014-04-03 12:08:05',NULL,NULL,NULL,0,0,NULL,NULL);
/*!40000 ALTER TABLE `lessons` ENABLE KEYS */;
UNLOCK TABLES;
/*!40103 SET [email protected]_TIME_ZONE */;

/*!40101 SET [email protected]_SQL_MODE */;
/*!40014 SET [email protected]_FOREIGN_KEY_CHECKS */;
/*!40014 SET [email protected]_UNIQUE_CHECKS */;
/*!40111 SET [email protected]_SQL_NOTES */;

-- Dump completed on 2014-04-03 11:18:20

So, it looks like the extra options did the trick:

--extended-insert --single-transaction --default-character-set=latin1 --skip-set-charset


When a MySQL client interacts with the server:

  1. the server receives any text merely as a string of bytes; the client will have previously told it how such text would be encoded.

  2. if the server then has to store that text in a table, it must transcode it to the encoding of the relevant column (if different).

  3. if the client subsequently wants to retrieve such text, the server must transcode it to the encoding expected by the client.

If the encodings used by the client in steps 1 and 3 are the same (which is usually the case, especially when the client in both cases is the same application), then it often goes unnoticed if the client is using an encoding other than the one it said it would. For example, suppose the client tells MySQL that it will use latin1, but actually sends data in utf8:

  • The string 'Jazz–Man' is sent to the server in UTF-8 as 0x4a617a7ae280934d616e.

  • MySQL, decoding those bytes in Windows-1252, understands them to represent the string 'Jazz–Man'.

  • To store in a utf8 column, MySQL transcodes the string to its UTF-8 encoding 0x4a617a7ac3a2e282ace2809c4d616e. This can be verified by using SELECT HEX(name) FROM lessons WHERE id=79510.

  • When the client retrieves the value, MySQL thinks that it wants it in latin1 and so transcodes to the Windows-1252 encoding 0x4a617a7ae280934d616e.

  • When the client receives those bytes, it decodes them as UTF-8 and therefore understands the string to be 'Jazz–Man'.

Conclusion: the client doesn't realise anything is wrong. Problems are only detected when a different client (one that does not misstate its UTF-8 connection as latin1) tries to use the table. In your case, this occurred when mysqldump obtained an export of the data; using the --default-character-set=latin1 --skip-set-charset options effectively forced mysqldump to behave in the same broken way as your application, so it ended up with correctly encoded data.

To fix your issue going forward, you must:

  1. Configure your application so that it correctly sets its MySQL connection character set (e.g. set encoding: utf8 in config/database.yml for Rails);

  2. Recode the data in your database, e.g. UPDATE lessons SET name = BINARY CONVERT(name USING latin1) (note that this must be done for every misencoded text column).

Also note that you will probably want to perform these two actions atomically, which may require some thought.