Base cntr next (1)
|

Converting your data from latin1 to utf8mb4

Overview

There are many different encoding schemes, each with their purpose, benefits and disadvantages. For most languages, a traditional scheme such as UTF-8 might be enough to cover all the letters and a bunch of the most used symbols.

But, what happens when you need an encoding for the data in your MySQL database that encompasses different characters?

In that case, a conversion of an encoding scheme to another is needed. This is more common when working with some foreign languages, but still it is a good thing to know. In this article we’ll tell you how to convert data from latin1 (otherwise called ISO-8859-1) to utf8mb4.

Why would I want to do that?

As said before, the encoding scheme you are using on your MySQL data might not cover all the characters you need. You don’t want your data to have some empty squares or a different symbol instead of the actual characters written in there. That is why you would need to convert your data to another encoder.

Let’s say you have a MySQL table with data encoded with latin1 (ISO-8859-1) encoding scheme, but latin1 encodes just the first 256 code points of the Unicode character set, so you have some symbols or letters in different languages that are not being displayed correctly. You would want to change to an encoding scheme that covers more characters.

A popular option would be the UTF-8 scheme, which contains many different characters, but converting into that one is not recommended. The best practice would be converting your data to the utf8mb4 set, which uses a maximum of four bytes per code point, contrary to the maximum of three bytes offered by UTF-8, so it supports many supplemental characters.

Converting new data

So, before we convert the data from latin1 to utf8mb4, we need to convert the database, and then the tables and columns, so that all new data will be under the utf8mb4 set.

We will start by converting the database:

ALTER DATABASE db_name CHARACTER SET utf8 COLLATE utf8_general_ci

Then, the next step will be to convert the table (or tables) to utf8mb4. From now on, all newly created columns among those tables will have the proper charset:

ALTER TABLE for_each_table_in_db CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

Note that you will need to execute these statements for each database and table inside the database.

Converting existing data

Now that we took care of the new data, we are ready to convert existing data in the tables among the database. In order to do it, we need to run the following statements for each database and the tables inside of it:

select table_schema, table_name, column_name, data_type, character_set_name, collation_name 
from information_schema.columns 
where 
   table_schema = 'db_name' 
   and table_name = 'tbl_name' 
   and data_type in ('varchar', 'char', 'text', 'tinytext', 'mediumtext', 'longtext');

If you find any columns with the previously mentioned data types and character set, then you would need to convert it from latin1 (it would also work to convert from other encoding schemes) to utf8mb4. 

This can be done by using the following statement to convert a single column (varchar(10) type is just an example here):

ALTER TABLE table_name CHANGE column_name column_name varchar(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci

NOTE: We do not need to alter blob-based fields because they are stored as “raw” data without encoding.

Automation options

There are three common options when altering a table:

Executing statements directly

A direct alteration works fine, as long as you can afford to have some downtime on you system while the table is locked.

ALTER TABLE table_name
   CHANGE c1 c1 varchar(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci,
   CHANGE c2 c2 varchar(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci,
   ...
   CHANGE cN cN varchar(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Running pt-osc

This method offers the benefit of altering tables without downtime. Keep in mind that, if you have several columns, then you would need to combine several alters into one.

pt-online-schema-change --user=<username> --password=<password> --alter "CHANGE c1 c1 varchar(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci" D=db_name,t=tbl_name --execute

Running changes on a replica with the following failover

Unfortunately, when working with MySQL or MariaDB, the third option will not work in this case due to replication limitations. Basically, if a replication has a different character set than the source, it will not work. But, you can read more about it in this article.

Post-check

In case you want to be extra careful, and make sure that your data was converted to the utf8mb4 encoding scheme, you can run the same statement we ran here to extract the columns that would be converted. This is so that you can verify that those columns have indeed the new charset.

Conclusion

There are many articles concerning the differences between encoding schemes and the characters they encompass. Once you find the encoding scheme you need for your data, then you can follow the guidelines mentioned through this article to convert your database, tables, and the necessary columns so that all of your data is in the charset you need.

If you are looking for more MySQL database tricks, then stay around and take a look at our articles. You will find that managing your database is not as complicated as it seems!