Born Geek » Unicode and the Web: Part 2

In my previous article on Unicode, I discussed a little bit of background on Unicode, how to prep PHP to serve UTF-8 encoded content, and how to handle displaying Unicode characters. There's still a bit more we need to talk about, however, before we can truly claim internationalization support.

Prepping MySQL for Unicode

MySQL allows you to specify a character encoding at four different levels: server, database, table, and column. This flexibility becomes quite useful when working on a shared host (like I do at DreamHost). In my particular case, I do not have control over either the server or database setting (and both are unfortunately set to latin1). As a result, I set my desired character encoding at the table level.

To see what your current server and database settings are, issue the following SQL commands at the MySQL command prompt:

SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'character_set_database';

To see what character set a table is using, issue the following command:

SHOW CREATE TABLE myTable;

If you are fortunate enough to have control over the database-level character set, you can set it using the following command:

(CREATE | ALTER) DATABASE ... DEFAULT CHARACTER SET utf8;

The table-specific commands are similar:

(CREATE | ALTER) TABLE ... DEFAULT CHARACTER SET utf8;

Column level character encoding can be specified when creating a table or by altering the desired column:

CREATE TABLE MyTable ( column1 TEXT CHARACTER SET utf8 );
ALTER TABLE MyTable MODIFY column1 TEXT CHARACTER SET utf8;

I personally recommend setting the character encoding at the highest level you have control over. That way, you won't have to remember to set it on any new tables or columns (or even databases).

If you have existing tables that do not use the utf8 character encoding, you can convert them with a simple command:

ALTER TABLE ... CONVERT TO CHARACTER SET utf8;

Be very careful when attempting to convert your data. The convert command assumes that the existing data really is encoded in the column's current character set (latin1, in my case). If UTF-8 bytes have already been stored in a latin1 column, those characters will become corrupted during the conversion process. There are some ways to get around this limitation, which may be helpful if you've already got some Unicode data stored in your database.
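One commonly documented workaround (not necessarily the one linked above) is to pass the column through a binary type first, so that MySQL stops interpreting the stored bytes, and then relabel it as utf8. Here's a minimal sketch that only builds the two statements. The helper name utf8_repair_statements is made up for illustration, and it assumes a TEXT column that already holds raw UTF-8 bytes. Note that MODIFY replaces the whole column definition, so any other attributes (NOT NULL, defaults) would need to be restated, and you should try this on a backup first.

```php
<?php
// Hypothetical helper: builds the two ALTER statements for the
// "round-trip through a binary type" workaround. Names are illustrative.
function utf8_repair_statements($table, $column)
{
    // Backtick-quote identifiers, doubling any embedded backticks.
    $t = '`' . str_replace('`', '``', $table) . '`';
    $c = '`' . str_replace('`', '``', $column) . '`';

    return array(
        // Step 1: switch to a binary type so the bytes are left untouched.
        "ALTER TABLE $t MODIFY $c BLOB",
        // Step 2: relabel those same bytes as utf8 text.
        "ALTER TABLE $t MODIFY $c TEXT CHARACTER SET utf8",
    );
}

// Print the statements for review before running them by hand.
foreach (utf8_repair_statements('MyTable', 'column1') as $sql) {
    echo $sql, ";\n";
}
```

Printing the statements rather than executing them is deliberate: this kind of change is easier to review and run manually, one column at a time.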

Communicating with MySQL

Once our tables are ready to accept Unicode data, we need to make some minor changes in the way we connect our application to the database. Essentially, we will be specifying the character encoding that our connection should use. This call needs to be made very early in the order of operations. I personally make this call immediately after creating my database connection. There are several ways we can set the character encoding, depending on the version of PHP and the programming paradigms in use. The first method involves a call to the mysql_query() function:

mysql_query("SET NAMES 'utf8'");

An alternative to this in PHP version 5.2.3 or later involves a call to the mysql_set_charset() function:

mysql_set_charset('utf8',$conn);

And yet another alternative, if you're using the MySQL Improved extension, comes via the set_charset() function. Here's an example from my code:

// Change the character set to UTF-8 (have to do it early)
if(! $db->set_charset("utf8"))
{
    printf("Error loading character set utf8: %s\n", $db->error);
}

Once you have specified the character encoding for your database connection, your database queries (both setting and retrieving data) will be able to handle international characters.

Accepting Unicode Input

The final hurdle in adding internationalization support to our web application is accepting Unicode input from the user. This is pretty easy to do, thanks to the accept-charset attribute on the form element:

<form accept-charset="UTF-8" ... >

Explicitly setting the character encoding on each form that can accept extended characters from your users will solve all kinds of potential problems (see the "Form submission and i18n" link in the Resources section below for much more on this topic).
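Even with accept-charset set, nothing forces a client to actually send well-formed UTF-8, so it can be worth checking on the server as well. Here's a minimal sketch, assuming the PCRE functions are available. The helper name is_valid_utf8 and the 'comment' form field are made up for illustration.

```php
<?php
// Hypothetical helper: returns true when $input is well-formed UTF-8.
// An empty pattern with the /u modifier makes PCRE validate the subject;
// preg_match() returns false (not 0) when the subject is malformed.
function is_valid_utf8($input)
{
    return preg_match('//u', $input) === 1;
}

// 'comment' is an illustrative form field name.
$comment = isset($_POST['comment']) ? $_POST['comment'] : '';
if (!is_valid_utf8($comment)) {
    $comment = ''; // reject the value, or re-prompt the user
}
```

What to do with a malformed value (discard it, show an error, try to repair it) is an application decision; the sketch simply discards it.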

Potential Pitfalls

Since PHP's core string functions treat each byte as a separate character, there are some potential coding problems that you might run into in your application:

Checking String Length

Using the strlen function to check the length of a given string can cause problems with strings containing international characters. For example, strlen would return 20 for a string of 10 characters from an alphabet that UTF-8 encodes with two bytes per character. This might cause problems if you are expecting the string to be no longer than 10 characters. Thankfully, there's an elegant hack that we can use to get around this:

function utf8_strlen($string) {
    return strlen(utf8_decode($string));
}

The utf8_decode function converts each UTF-8 character into a single ISO-8859-1 byte, and will turn anything outside of the standard ISO-8859-1 encoding into a question mark. Either way, every character ends up being counted as a single character by the strlen function (which is exactly what we wanted). Pretty slick!
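If the mbstring extension is available (an assumption; it isn't always enabled on shared hosts), mb_strlen gives the same answer without the decoding trick. Here's a quick comparison using the 10-character case described above, with a Cyrillic letter written out as raw UTF-8 bytes:

```php
<?php
// Ten copies of the Cyrillic letter "Ж" (U+0416), which is two bytes in UTF-8.
// Written as byte escapes so the result doesn't depend on this file's encoding.
$text = str_repeat("\xD0\x96", 10);

$bytes = strlen($text);             // counts bytes
$chars = mb_strlen($text, 'UTF-8'); // counts characters (requires mbstring)

echo "bytes: $bytes, characters: $chars\n"; // prints "bytes: 20, characters: 10"
```

Passing the encoding explicitly avoids depending on whatever mbstring's internal encoding happens to be set to on the server.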

Case Conversions

Forcing a particular case for string comparisons can be problematic with international character sets. In some languages, case has no meaning, so there's not a whole lot that one can do short of creating a lookup table. One example of such a lookup table comes from the mbstring extension. The DokuWiki project implemented this solution in their conversion to UTF-8.
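Where mbstring is installed, its case conversion functions can be called directly. Here's a minimal sketch, assuming mbstring is loaded; utf8_equals_ignore_case is a made-up helper name, not part of any library.

```php
<?php
// Hypothetical helper: case-insensitive comparison of two UTF-8 strings.
function utf8_equals_ignore_case($a, $b)
{
    return mb_strtolower($a, 'UTF-8') === mb_strtolower($b, 'UTF-8');
}

$upper = "\xC3\x89COLE"; // "ÉCOLE" as UTF-8 bytes
$lower = "\xC3\xA9cole"; // "école" as UTF-8 bytes

var_dump(utf8_equals_ignore_case($upper, $lower)); // bool(true)
```

Simple lowercasing like this is not a complete answer for every language, so treat it as a starting point rather than a guarantee.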

Using Regular Expressions

The Perl-Compatible Regular Expression (PCRE) functions in PHP support the UTF-8 encoding, through use of the /u pattern modifier. If you are making use of regular expressions in your application, you'll definitely want to look into this modifier.
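To see what the modifier changes, here's a small sketch that runs the same pattern with and without /u against a two-byte character. The variable names are just for illustration.

```php
<?php
$e_acute = "\xC3\xA9"; // "é" as two UTF-8 bytes

// Without /u the dot matches a single byte, so a two-byte character fails.
$without_u = preg_match('/^.$/', $e_acute);

// With /u the dot matches one whole UTF-8 character.
$with_u = preg_match('/^.$/u', $e_acute);

// \p{L} (any letter) is a Unicode property class usable under /u.
$letters_only = preg_match('/^\p{L}+$/u', "\xC3\xA9cole");

var_dump($without_u, $with_u, $letters_only); // int(0), int(1), int(1)
```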

Additional Resources

In learning about how to add internationalization support to web applications, I gathered a number of excellent resources that I highly recommend bookmarking. Without further ado, here's the list I've created:

Character Sets / Character Encoding Issues
Handling UTF-8 with PHP
MySQL and UTF-8
Do you know your character encodings?
A tutorial on character code issues - Lots of theory; in-depth discussion
MySQL and UTF-8 — no more question marks!
Form submission and i18n
Survival guide to i18n
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software

2 Comments

kip

4:54 PM on Aug 19, 2008

Thanks, I've switched my site to UTF-8 recently too (actually I had been thinking about it for a while and finally got motivated to do it after your last post reminded me of it). Lots of useful information here I didn't know about. I've found that the MySQL stuff isn't strictly necessary--you can store a UTF-8 string in a regular MySQL table, you'll just see some funny characters if you are querying the database directly rather than serving it on a page. And the regular expressions thing I thought would be a problem, until I learned that no byte in a UTF-8 string will start with a 0 bit unless it is ASCII, so there's no chance of a byte in a UTF-8 encoded string having the same value as, say, an ampersand.

Jonah

10:51 PM on Aug 19, 2008

If you don't specify the MySQL encoding, you can get corrupted data. From one of the resource links above:

If you have characters that don’t encode the same in UTF-8 and latin1 (e.g. text in Chinese, Russian, ...) then the behavior depends on both the table definition and the encoding of the connection to the database.

So, if the table or column character set is latin1, and your connection is using the SET NAMES utf8 directive, the result will be a string where all non-latin1 characters are converted to question marks (which is clearly not what you wanted). I like forcing everything to UTF-8 for this very reason; it helps prevent unforeseen problems. Here's some more in-depth information on this topic.
