Fixing Corrupt Database Encoding

From MythTV Official Wiki
Revision as of 21:39, 23 December 2008 by Sphery (talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Background

In 0.21-fixes and below, MythTV required that the database be set up such that MySQL thought that text columns were encoded using the latin1 encoding and that all communications with the database be done using the latin1 encoding. Some distros (notably, Gentoo) and some users have misconfigured their MySQL servers incorrectly with the effect that all 8-bit characters are corrupted in the database.

The end result of this misconfiguration is that any user with a misconfigured database can not successfully upgrade to post-r16789 SVN trunk (which means that upgrades to 0.22, when released, will also fail). Before attempting to upgrade to post-r16789 SVN trunk or to 0.22, those users who have been running MythTV against a misconfigured database must fix the data encoding as described below.

Note that the use of the latin1 encoding does not prevent MythTV from using non-Latin characters. Instead, MythTV actually writes UTF-8 characters into the column and does all required conversions when reading/writing database data. The fact that the database uses latin1 encoding simply means that the MySQL server is unable to understand the data in the database--i.e. any 8-bit characters will be encoded in such a way that the MySQL server or any mysql client will see gibberish rather than the expected character. However, MythTV programs (including mythfrontend's GUI and OSD and MythWeb) will be able to interpret the data properly, meaning the user will see the proper characters through MythTV programs. For example, in a misconfigured database, the character, "å", would appear as shown when using MySQL programs, such as the mysql command-line client. When properly encoded, the character would be, "Ã¥". Note that properly encoded, the character (which is represented as a 2-byte character in UTF-8) seems to MySQL programs to be 2 characters. Similarly, 3-byte UTF-8 characters would be represented as 3 characters when using MySQL programs.

Determining if your database is misconfigured

To determine if your database is misconfigured, execute the following command from the shell (fixing the path to the mysql command-line client and/or the username/database name as required) and type in the password when prompted:

echo 'status;' | mysql -umythtv -p mythconverg

Misconfigured server

A misconfigured MySQL server will show output such as:

--------------
mysql  Ver 14.12 Distrib 5.0.60, for pc-linux-gnu (i486) using readline 5.2

Connection id:          14
Current database:       mythconverg
Current user:           mythtv@localhost
SSL:                    Not in use
Current pager:          stdout
Using outfile:          ''
Using delimiter:        ;
Server version:         5.0.60-log Gentoo Linux mysql-5.0.60-r1
Protocol version:       10
Connection:             Localhost via UNIX socket
Server characterset:    utf8
Db     characterset:    latin1
Client characterset:    utf8
Conn.  characterset:    utf8
UNIX socket:            /var/run/mysqld/mysqld.sock
Uptime:                 4 min 45 sec

Threads: 10  Questions: 45369  Slow queries: 0  Opens: 125  Flush tables: 1  Open tables: 107  Queries per second avg: 159.189

Note the Server characterset, Client characterset, and Conn. characterset are utf8.

Properly-configured server

A properly configured MySQL server should show output as below:

--------------
mysql  Ver 14.12 Distrib 5.0.60, for pc-linux-gnu (i686) using readline 5.2

Connection id:          8162
Current database:       mythconverg
Current user:           mythtv@localhost
SSL:                    Not in use
Current pager:          stdout
Using outfile:          ''
Using delimiter:        ;
Server version:         5.0.60
Protocol version:       10
Connection:             Localhost via UNIX socket
Server characterset:    latin1
Db     characterset:    latin1
Client characterset:    latin1
Conn.  characterset:    latin1
UNIX socket:            /var/run/mysqld/mysql.sock
Uptime:                 14 days 22 hours 24 min 56 sec

Threads: 5  Questions: 15125446  Slow queries: 42  Opens: 9904  Flush tables: 1  Open tables: 64  Queries per second avg: 11.722

Note the Server characterset, Db characterset, Client characterset, and Conn. characterset are latin1.

Recognizing the difference

The important difference between the output of a misconfigured versus a properly-configured server is that some or all of the character sets in a misconfigured database will be utf8. For a MythTV 0.21-fixes or below system, the character sets will be latin1.

Note that once upgrade to post-r16789

Fixing the database corruption

Note that the approach described here will only work on a database that was corrupted due to a misconfigured MySQL server, as described above. If you have any other type of corruption, applying the procedure outlined below will further corrupt your database--likely breaking it.

To fix the broken encoding, you will need to create a database backup.