Browsing all posts tagged mysql

Even though the site aggravated me at first, I still occasionally troll Stack Overflow. One of the leading problems I see in questions pertaining to PHP & MySQL, is people's use of the MySQL extension in PHP. This extension, it turns out, is being deprecated. But does the documentation reflect this fact? Yes and no.

Certain function pages, such as mysql_real_escape_string, have big red boxes at the top indicating that the extension is being deprecated. "Don't use this", they seem to shout. Other function pages, however, such as the mysql_result page, don't have these warnings. Likewise, the top-level MySQL Drivers and Plugins page lists the MySQL extension first, with no indication whatsoever that the extension is being deprecated.

At the very least, every single documentation page that deals with the MySQL extension in any form or fashion, needs to include information about its intended deprecation. Otherwise, thousands upon thousands of programmers will write code using a plugin that is quickly nearing it's end-of-life. Which, based on what I see at Stack Overflow, already seems to be the case.

I ran into an interesting phenomenon with PHP and MySQL this morning while working on a web application I've been developing at work. Late last week, I noted that page loads in this application had gotten noticeably slower. With the help of Firebug, I was able to determine that a 1-second delay was consistently showing up on each PHP page load. Digging a little deeper, it became clear that the delay was a result of a change I recently made to the application's MySQL connection logic.

Previously, I was using the IP address 127.0.0.1 as the connection host for the MySQL server:

$db = new mysqli("127.0.0.1", "myUserName", "myPassword", "myDatabase");

I recently changed the string to localhost (for reasons I don't recall):

$db = new mysqli("localhost", "myUserName", "myPassword", "myDatabase");

This change yielded the aforementioned 1-second delay. But why? The hostname localhost simply resolves to 127.0.0.1, so where is the delay coming from? The answer, as it turns out, is that IPv6 handling is getting in the way and slowing us down.

I should mention that I'm running this application on a Windows Server 2008 system, which uses IIS 7 as the web server. By default, in the Windows Server 2008 hosts file, you're given two hostname entries:

127.0.0.1 localhost
::1 localhost

I found that if I commented out the IPV6 hostname (the second line), things sped up dramatically. PHP bug #45150, which has since been marked "bogus," helped point me in the right direction to understanding the root cause. A comment in that bug pointed me to an article describing MySQL connection problems with PHP 5.3. The article dealt with the failure to connect, which happily wasn't my problem, but it provided one useful nugget: namely that the MySQL driver is partially responsible for determining which protocol to use. Using this information in my search, I found a helpful comment in MySQL bug #6348:

The driver will now loop through all possible IP addresses for a given host, accepting the first one that works.

So, long story short, it seems as though the PHP MySQL driver searches for the appropriate protocol to use every time (it's amazing that this doesn't get cached). Apparently, Windows Server 2008 uses IPV6 routing by default, even though the IPV4 entry appears first in the hosts file. So, either the initial IPV6 lookup fails and it then tries the IPV4 entry, or the IPV6 route invokes additional overhead; in either case, we get an additional delay.

The easiest solution, therefore, is to continue using 127.0.0.1 as the connection address for the database server. Disabling IPV6, while a potential solution, isn't very elegant and it doesn't embrace our IPV6 future. Perhaps future MySQL drivers will correct this delay, and it might go away entirely once the world switches to IPV6 for good.

As an additional interesting note, the PHP documentation indicates that a local socket gets used when the MySQL server name is localhost, while the TCP/IP protocol gets used in all other cases. But this is only true in *NIX environments. In Windows, TCP/IP gets used regardless of your connection method (unless you have previously enabled named pipes, in which case it will use that instead).

When I added the favorite photos feature to my photo album software, I wanted a way to randomly show a subset of said favorites on the albums display page. I initially thought about implementing my own means of doing this through PHP. Ultimately, I wanted random selection without replacement, so that viewers would not see multiple copies of the same image in the 'Favorites Preview' section. Thankfully, MySQL saved the day!

When sorting a MySQL query, you can opt to sort randomly:

SELECT {some columns} FROM {some tables}
WHERE {some condition} ORDER BY rand()

The rand() function in PHP essentially gives you random selection without replacement for free! How great is that? It was an easy solution to a not-so-simple problem, and saved me a lot of programming time.

Update: I have since learned that the ORDER BY rand() call is horribly inefficient for large data sets. As such, it should ideally be avoided. There's a great article describing ways to work around these performance limitations.

In my previous article on Unicode, I discussed a little bit of background on Unicode, how to prep PHP to serve UTF-8 encoded content, and how to handle displaying Unicode characters. There's still a bit more we need to talk about, however, before we can truly claim internationalization support.

Prepping MySQL for Unicode

MySQL allows you to specify a character encoding at four different levels: server, database, table, and column. This flexibility becomes quite useful when working on a shared host (like I do at DreamHost). In my particular case, I do not have control over either the server or database setting (and both are unfortunately set to latin1). As a result, I set my desired character encoding at the table level.

To see what your current system and database settings are, issue the following SQL commands at the MySQL command prompt:

SHOW VARIABLES LIKE 'character_set_system';
SHOW VARIABLES LIKE 'character_set_database';

To see what character set a table is using, issue the following command:

SHOW CREATE TABLE myTable;

If you are fortunate enough to have control over the database-level character set, you can set it using the following command:

(CREATE | ALTER) DATABASE ... DEFAULT CHARACTER SET utf8;

The table-specific commands are similar:

(CREATE | ALTER) TABLE ... DEFAULT CHARACTER SET utf8;

Column level character encoding can be specified when creating a table or by altering the desired column:

CREATE TABLE MyTable ( column1 TEXT CHARACTER SET utf8 );
ALTER TABLE MyTable MODIFY column1 TEXT CHARACTER SET utf8;

I personally recommend setting the character encoding as high up as you have the capability to. That way, you won't have to remember to set it on any new tables or columns (or even databases).

If you have existing tables that do not use the utf8 character encoding, you can convert them with a simple command:

ALTER TABLE ... CONVERT TO CHARACTER SET utf8;

Be very careful when attempting to convert your data. The convert command assumes that the existing data is encoded as latin1. Any Unicode characters that already exist will become corrupted during the conversion process. There are some ways to get around this limitation, which may be helpful if you've already got some Unicode data stored in your database.

Communicating with MySQL

Once our tables are ready to accept Unicode data, we need to make some minor changes in the way we connect our application to the database. Essentially, we will be specifying the character encoding that our connection should use. This call needs to be made very early in the order of operations. I personally make this call immediately after creating my database connection. There are several ways we can set the character encoding, depending on the version of PHP and the programming paradigms in use. The first method involves a call to the mysql_query() function:

mysql_query("SET NAMES 'utf8'");

An alternative to this in PHP version 5.2 or later involves a call to the mysql_set_charset() function:

mysql_set_charset('utf8',$conn);

And yet another alternative, if you're using the MySQL Improved extension, comes via the set_charset() function. Here's an example from my code:

// Change the character set to UTF-8 (have to do it early)
if(! $db->set_charset("utf8"))
{
    printf("Error loading character set utf8: %s\n", $db->error);
}

Once you have specified the character encoding for your database connection, your database queries (both setting and retrieving data) will be able to handle international characters.

Accepting Unicode Input

The final hurdle in adding internationalization support to our web application is accepting unicode input from the user. This is pretty easy to do, thanks to the accept-charset attribute on the form element:

<form accept-charset="utf8" ... >

Explicitly setting the character encoding on each form that can accept extended characters from your users will solve all kinds of potential problems (see the "Form submission and i18n" link in the Resources section below for much more on this topic).

Potential Pitfalls

Since PHP (prior to version 6) considers a character just one byte long, there are some potential coding problems that you might run into in your application:

Checking String Length

Using the strlen function to check the length of a given string can cause problems with strings containing international characters. For example, a string comprising 10 characters of a double-byte alphabet would return a length of 20. This might cause problems if you are expecting the string to be no longer than 10 characters. Thankfully, there's an elegant hack that we can use to get around this:

function utf8_strlen($string) {
    return strlen(utf8_decode($string));
}

The utf8_decode function will turn anything outside of the standard ISO-8859-1 encoding into a question mark, which gets counted as a single character in the strlen function (which is exactly what we wanted). Pretty slick!

Case Conversions

Forcing a particular case for string comparisons can be problematic with international character sets. In some languages, case has no meaning. So there's not a whole lot that one can do short of creating a lookup table. One example of such a lookup table comes from the mbstring extension. The Dokuwiki project implemented this solution in their conversion to UTF-8.

Using Regular Expressions

The Perl-Compatible Regular Expression (PCRE) functions in PHP support the UTF-8 encoding, through use of the /u pattern modifier. If you are making use of regular expressions in your application, you'll definitely want to look into this modifier.

Additional Resources

In learning about how to add internationalization support to web applications, I gathered a number of excellent resources that I highly recommend bookmarking. Without further ado, here's the list I've created:

Dustin and his wife recently uncovered an interesting limitation of my Monkey Album software: characters outside of the ISO-8859-1 (Latin 1) character set don't render properly. This comes as no surprise, seeing as I didn't design for Unicode. Being a rather egregious display error, I decided to set out and fix the problem. In the process, I learned quite a lot about Unicode, and how it affects web applications. This post will be the first of two detailing how to add Unicode support to a web application. I will only be exposing a tip of the Unicode iceberg in these posts. The ideas and practices behind Unicode support can (and do) fill the pages of many books. That said, let's jump in.

Brief Background

For the uninitiated, Unicode is a coded character set. That is, it maps a unique scalar value (a code point) to each character in a character set. ASCII is another example of a coded character set. Each character in a coded character set is intended to be encoded using a character encoding scheme. ISO-8859-1 is an example of a character encoding scheme.

It is important to note that ISO-8859-1 is the default encoding for documents on the web served via HTTP with a MIME type beginning with "text/". So, if you're not set up to specifically serve another encoding, your web pages are most likely using ISO-8859-1. This works just fine if you speak English or a subset of European languages. But because the ISO-8859-1 character encoding uses only 8 bits for its encoding scheme, it is limited to 256 possible characters. It turns out that 256 characters isn't enough for international text representation (the Chinese and Japanese languages come to mind). What can we do?

Thankfully, we have a solution in Unicode. A number of Unicode encoding schemes are available for us to use: UTF-7, UTF-8, UTF-16, and UTF-32. Each has its merits and detractors, but it turns out that UTF-8 is the preferred encoding of choice in the computing world (it's a nice trade off between space allocation and capability). As a bonus, UTF-8 works nicely with ASCII, which makes migrating English-based websites much easier.

Unfortunately, we have another major problem to deal with. All PHP releases (prior to the upcoming PHP 6) internally represent a character with 8-bits. That's right: PHP has no native support for international characters (yet)! This means that we have to be extra vigilant in our pursuit of internationalization support. So how do we do it?

Prepping Our PHP for Unicode

In order for our PHP application to properly display Unicode characters, we need to do some preparatory work. This involves setting the appropriate character encoding in a few places. We'll first set the encoding in the header:

header('Content-Type: text/html; charset=UTF-8');

Remember that the header() function must be called before we output any HTML, so it needs to appear early in the chain of events. Note also that the header call incorrectly labels the encoding as a 'charset,' making the naming conventions even more confusing.

We can also specify the encoding through the use of a meta tag (I recommend setting this even if you set the header):

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

If you take this route, make sure this tag is placed near the top of the <head> element in your HTML (before your <title> element, in fact). Otherwise, the browser may select an incorrect encoding.

To verify that that the appropriate encoding is being used, you can use the View Page Info feature in Firefox (just right click the page and you'll see it in the context menu). Here's an example:

Page Info Dialog in Firefox Showing UTF-8 Encoding
Displaying Unicode Text

One of the primary functions that PHP provides to convert characters into their HTML entity equivalents is the aptly named htmlentities() function. However, since we're converting our application to support UTF-8, we don't need to make use of this function. Why is this? First, HTML entities are generally only understood by web browsers. By converting special characters into HTML entity equivalents, it becomes much harder to move data between the web application and other data sources (RSS feeds, for example). Second, and most importantly, UTF-8 allows us to display extended characters directly. To quote Harry Fuecks [PDF], with UTF-8 "we don't need no stinkin' entities." Instead, we should only worry about the "special five":

  • Ampersand (&)
  • Double Quote (")
  • Single Quote (')
  • Less Than (<)
  • Greater Than (>)

Thankfully, PHP gives us the htmlspecialchars() function to handle these five special characters. One very important thing to note is that this function allows you to specify the character encoding to use when parsing the supplied text. For example:

htmlspecialchars($incomingString, ENT_QUOTES, "UTF-8");

Specifying the character encoding is very important when using this function! Otherwise, you open yourself up to to a rather nasty cross-site-scripting vulnerability, something that even Google was susceptible to a while back. In short, the character encoding specified in your htmlspecialchars() call should match the encoding being served by the page.

What Next?

In the next article, I'll cover the following topics:

  • Prepping MySQL databases for Unicode
  • Accepting Unicode characters from the user
  • Potential PHP pitfalls
  • Useful resources (loads of helpful links)

As always, if you have suggestions or questions, feel free to post them.

Sun Purchases MySQL

Jan 18, 2008

It seems as if Sun Microsystems has purchased MySQL. I don't fully understand the motivation behind this purchase, but Sun must have some plan; otherwise they wouldn't have paid one billion dollars for the company. In my opinion, Sun doesn't have the greatest track record in software, so it should be interesting to see what happens as a result of this change. According to the official MySQL post:

Will MySQL’s support for other programming languages and operating systems now be given less attention? Absolutely not. MySQL is still being managed by the same people, and the charter is still the same.

We can only hope.

While working on my rewrite of Monkey Album, I ran into an interesting programming dilemma. In the past week or so, I've been introduced to the MySQLi extension in PHP. The current Monkey Album implementation makes use of the PHP 4 mysql_*() calls, so I thought I'd try out the MySQLi interface to see how it works.

MySQLi includes support for what are known as "prepared statements" (only available in MySQL 4.1 and later). A prepared statement basically gives you three advantages: (1) SQL logic is separated from the data being supplied, (2) incoming data is sanitized for you which increases security, and (3) performance is increased, since a given statement only needs to be parsed a single time.

It seems to me that the performance benefit can only be seen in situations where the query is executed multiple times (in a loop, for example). In fact, an article on prepared statements confirms this suspicion; the author in fact mentions that prepared statements can be slower for queries executed only once.

So here's the problem I face: the queries that get executed in Monkey Album are, for the most part, only ever executed once. So, do I make use of prepared statements just to get the security benefit? It doesn't seem worth it to me, since I can get the same security by escaping all user input (something I already do today). Does someone with more knowledge of this stuff have an opinion? If so, please share it.

While working on my photo album software, I ran into an interesting SQL problem. I wanted to be able to display information about my photo albums, along with the number of images in each album. The problem is that my data is broken up into two tables: an albums table and an images table. My goal was to use exactly one SQL query to access all of the data, including the count of images. And I wanted empty albums (no images) to also show up in the query's results. But try as I might, I couldn't get the query to return the data I wanted. I finally found a solution that works, and I present an example below.

Let's suppose we have two MySQL tables: one that represents directories, and another that represents files. The directories table has the following columns:

  • ID
  • Name

And the files table has the following columns:

  • ID
  • Parent_ID
  • Name

The Parent_ID field in the files table corresponds to the ID field in the directories table. In order to select both the count of files in each directory, as well as all of the directory information, we do a simple join. But here's the trick: the order of your tables matters! Here's the query that works for this scenario:

SELECT d.*, Count(f.ID) AS Count FROM directories d LEFT JOIN files f ON f.Parent_ID = d.ID GROUP BY d.ID

When the tables are reversed in the JOIN, only tables with 1 or more entries show up in the results. What a subtle change! Hopefully someone will find this tip useful. It sure took me a while to get this working.