Fixing Corrupt Database Encoding
- 1 Background
- 2 Determining if your database is misconfigured
- 3 Fixing the database corruption
- 4 What to do if problems occur
- 5 Miscellaneous information
In 0.21-fixes and below, MythTV required that the database be set up such that MySQL thought that text columns were encoded using the latin1 encoding and that all communications with the database be done using the latin1 encoding. Some distros (notably, Gentoo) and some users have misconfigured their MySQL servers incorrectly with the effect that all 8-bit characters are corrupted in the database.
The end result of this misconfiguration is that any user with a misconfigured database can not successfully upgrade to post-r16789 SVN trunk (which means that upgrades to 0.22, when released, will also fail) and any non-Latin characters in data will be improperly displayed, even in MythTV 0.21-fixes and below. Before attempting to upgrade to post-r16789 SVN trunk or to 0.22, those users who have been running MythTV against a misconfigured database must fix the data encoding as described below. Those users running MythTV 0.21-fixes or below with non-Latin characters in their data should probably fix the data encoding as soon as possible.
When do I need to fix my database configuration?
You must fix your database configuration before upgrading to post-r16789 SVN trunk or to 0.22. If you plan to run MythTV 0.21-fixes, you may continue to use your system without fixing your database configuration. However, it doesn't hurt to fix the database configuration, now. If you decide not to fix your configuration now, make sure you remember to do so before upgrading MythTV in the future.
Note also that those users whose data contains non-Latin characters will see character/data corruption even in MythTV 0.21-fixes until they fix their database configurations. Those who notice character corruption should fix their database configurations as soon as possible.
Note on MythTV character encoding
Note that the use of the latin1 encoding does not prevent MythTV from using non-Latin characters. Instead, MythTV actually writes UTF-8 characters into the column and does all required conversions when reading/writing database data. The fact that the database uses latin1 encoding simply means that the MySQL server is unable to understand the data in the database--i.e. any 8-bit or multi-byte characters will be encoded in such a way that the MySQL server or any mysql client will see gibberish rather than the expected character. However, MythTV programs (including mythfrontend's GUI and OSD and MythWeb) will be able to interpret the data properly, meaning the user will see the proper characters through MythTV programs.
For example, in a misconfigured database, the character, "å", would appear as shown when using MySQL programs, such as the mysql command-line client. When properly encoded for MythTV, the character would be shown as, "Ã¥", by MySQL clients. Similarly, the character, "á", would be represented as "Ã¡" in a properly-encoded database. Note that--properly encoded--these characters (which are both represented as 2-byte characters in UTF-8) seem to MySQL programs to be 2 distinct characters. Similarly, 3-byte UTF-8 characters would be represented as 3 characters when viewed using MySQL programs.
Determining if your database is misconfigured
To determine if your database is misconfigured, execute the following command from the shell (fixing the path to the mysql command-line client and/or the username/database name as required) and type in the password when prompted:
echo 'status;' | mysql -umythtv -p mythconverg
Note that the default USE flags for configuring MySQL on Gentoo-based systems include the
-latin1 flag. This flag configures the MySQL server incorrectly for MythTV. While Gentoo users may check which USE flags were specified for MySQL, they should also verify their current configuration is incorrect by querying the server status, as shown above, and comparing results with the examples below.
A misconfigured MySQL server will show output such as:
-------------- mysql Ver 14.12 Distrib 5.0.60, for pc-linux-gnu (i486) using readline 5.2 Connection id: 14 Current database: mythconverg Current user: mythtv@localhost SSL: Not in use Current pager: stdout Using outfile: '' Using delimiter: ; Server version: 5.0.60-log Gentoo Linux mysql-5.0.60-r1 Protocol version: 10 Connection: Localhost via UNIX socket Server characterset: utf8 Db characterset: latin1 Client characterset: utf8 Conn. characterset: utf8 UNIX socket: /var/run/mysqld/mysqld.sock Uptime: 4 min 45 sec Threads: 10 Questions: 45369 Slow queries: 0 Opens: 125 Flush tables: 1 Open tables: 107 Queries per second avg: 159.189
Client characterset, and
Conn. characterset are
A properly configured MySQL server should show output as below:
-------------- mysql Ver 14.12 Distrib 5.0.60, for pc-linux-gnu (i686) using readline 5.2 Connection id: 8162 Current database: mythconverg Current user: mythtv@localhost SSL: Not in use Current pager: stdout Using outfile: '' Using delimiter: ; Server version: 5.0.60 Protocol version: 10 Connection: Localhost via UNIX socket Server characterset: latin1 Db characterset: latin1 Client characterset: latin1 Conn. characterset: latin1 UNIX socket: /var/run/mysqld/mysql.sock Uptime: 14 days 22 hours 24 min 56 sec Threads: 5 Questions: 15125446 Slow queries: 42 Opens: 9904 Flush tables: 1 Open tables: 64 Queries per second avg: 11.722
Client characterset, and
Conn. characterset are
Recognizing the difference
The important difference between the output of a misconfigured versus a properly-configured server is that some or all of the character sets in a misconfigured database will be utf8. For a MythTV 0.21-fixes or below system, the character sets will be latin1.
Note that once upgraded to post-r16789 SVN trunk (or 0.22 or above), the output of the status command will differ--namely, the
Db characterset will be
utf8 in a properly-configured system. The value of the others is not important on post-r16789 SVN trunk/0.22+ versions of MythTV.
Fixing the database corruption
Note that the approach described here will only work on a database that was corrupted due to a misconfigured MySQL server, as described above. If you have any other type of corruption, applying the procedure outlined below will further corrupt your database--likely breaking it.
Backing up the MythTV database
To fix the broken encoding, you will first need to create a database backup, as described by Database Backup and Restore. Note that this is not an optional step that's recommended solely for safety--the backup will actually be used to "uncorrupt" the data.
Changing the backup file
The backup file created above was encoded using UTF-8 encoding. However, restoring it, unchanged, will restore an exact copy of the corrupt database. Instead, we will modify the backup to "re-encode" the data into the format that MythTV expects. To do so, execute the following command from the shell while in the directory that contains the backup created above (change the backup filename, as appropriate):
zcat mythconverg-1214-20081217145744.sql.gz | sed 's/SET NAMES utf8/SET NAMES latin1/' > mythconverg-to_uncorrupt.sql
If your backup file is not gzip'ed (if it is uncompressed), use cat rather than zcat.
Changing the MySQL server configuration
Then, once you have successfully created a database backup and modified it to "uncorrupt" your database, you will need to reconfigure your MySQL server such that it does not specify a database server default character set. In so doing, rather than forcing all database clients to use utf8 encoding for all communications (even though a program, such as MythTV, may have been written to use a different encoding), you will have configured your server such that, by default, clients use the character encoding of the database to which they connect (but may still request a specific character encoding). Therefore, changing this should not break other programs using other databases on the server; however, verifying this is up to the user. Note, also, that the original configuration--the one that won't work with MythTV--may have been causing data corruption for data used by the other applications the same way it did for MythTV applications, and changing the configuration may suddenly make the corruption visible.
The Gentoo database configuration file, /etc/mysql/my.cnf, should be similar to the one shown below. To fix the server configuration to work with MythTV 0.21-fixes, we must remove the lines starting with, "character-set-server" and "default-character-set". You can do so by executing the following commands as root (or with root permissions, using sudo):
mv /etc/mysql/my.cnf /etc/mysql/my.cnf-orig sed 's/^\(default-character-set.*\)/#\1/ ; s/^\(character-set-server.*\)/#\1/' /etc/mysql/my.cnf-orig > /etc/mysql/my.cnf
These commands will preserve the original my.cnf (as /etc/mysql/my.cnf-orig ) and create a copy at /etc/mysql/my.cnf with the incorrect lines commented.
Once this is done, restart the mysql server.
Clearing the corrupt database
For the following commands, you will need to use the mysql command-line client. The MySQL user you choose to use must have sufficient permissions to perform the commands. The MySQL
root user will have sufficient permissions. You can log in to the database as the
root user by executing the following command from the shell (change the database name--here,
mysql -uroot -p mythconverg
Once logged in, drop the corrupt the database (change the database name, if appropriate):
DROP DATABASE IF EXISTS mythconverg;
Then, create a new empty database (change the database name, if appropriate):
CREATE DATABASE IF NOT EXISTS mythconverg;
Then, set the appropriate character set (change the database name, if appropriate):
ALTER DATABASE mythconverg DEFAULT CHARACTER SET latin1;
Restoring the backup
Restore the backup, making sure to specify the filename using
--filename mythconverg-to_uncorrupt.sql and, if required, the directory (using
--directory /path/to/backup). During this process, the data stored inside the database will be converted to the format required by MythTV.
Backing up the uncorrupted database
Now that the database server and database data have been fixed, you should back up your database again, as described by Database Backup and Restore. Once complete, you should ensure you never restore a backup created before this time (a backup of the corrupt database).
What to do if problems occur
If problems occur while attempting to follow the above steps, please post a message to the [mailing list] explaining your configuration (distro/output of MySQL status), what part failed, and the error messages you received (please copy/paste the messages to ensure all relevant info is provided). You may also get a faster response if you mention my nick (sphery) in the message.
In the meantime, you may "revert" to the invalid configuration you were using previously by restoring your /etc/mysql/my.cnf-orig and the backup file you made at the beginning of the process (i.e. the one from before the zcat/sed script above (in the example, the one called,
Example misconfigured MySQL configuration file
# /etc/mysql/my.cnf: The global mysql configuration file. # $Header: /var/cvsroot/gentoo-x86/dev-db/mysql/files/my.cnf-4.1,v 1.3 2006/05/05 19:51:40 chtekk Exp $ # The following options will be passed to all MySQL clients [client] #password = your_password port = 3306 socket = /var/run/mysqld/mysqld.sock [mysql] character-sets-dir=/usr/share/mysql/charsets default-character-set=utf8 [mysqladmin] character-sets-dir=/usr/share/mysql/charsets default-character-set=utf8 [mysqlcheck] character-sets-dir=/usr/share/mysql/charsets default-character-set=utf8 [mysqldump] character-sets-dir=/usr/share/mysql/charsets default-character-set=utf8 [mysqlimport] character-sets-dir=/usr/share/mysql/charsets default-character-set=utf8 [mysqlshow] character-sets-dir=/usr/share/mysql/charsets default-character-set=utf8 [myisamchk] character-sets-dir=/usr/share/mysql/charsets [myisampack] character-sets-dir=/usr/share/mysql/charsets # use [safe_mysqld] with mysql-3 [mysqld_safe] err-log = /var/log/mysql/mysql.err # add a section [mysqld-4.1] or [mysqld-5.0] for specific configurations [mysqld] character-set-server = utf8 default-character-set = utf8 user = mysql port = 3306 socket = /var/run/mysqld/mysqld.sock pid-file = /var/run/mysqld/mysqld.pid log-error = /var/log/mysql/mysqld.err basedir = /usr datadir = /var/lib/mysql skip-locking key_buffer = 48M max_allowed_packet = 8M table_cache = 128 sort_buffer_size = 48M net_buffer_length = 8M thread_cache_size = 4 query_cache_type = 1 query_cache_size = 4M read_buffer_size = 256K read_rnd_buffer_size = 512K myisam_sort_buffer_size = 8M language = /usr/share/mysql/english # security: # using "localhost" in connects uses sockets by default # skip-networking bind-address = 127.0.0.1 log-bin server-id = 1 # point the following paths to different dedicated disks tmpdir = /tmp/ #log-update = /path-to-dedicated-directory/hostname # you need the debug USE flag enabled to use the following directives, # if needed, uncomment them, start the server and issue # #tail -f /tmp/mysqld.sql /tmp/mysqld.trace # this will show you *exactly* what's happening in your server ;) #log = /tmp/mysqld.sql #gdb #debug = d:t:i:o,/tmp/mysqld.trace #one-thread # uncomment the following directives if you are using BDB tables #bdb_cache_size = 4M #bdb_max_lock = 10000 # the following is the InnoDB configuration # if you wish to disable innodb instead # uncomment just the next line #skip-innodb # # the rest of the innodb config follows: # don't eat too much memory, we're trying to be safe on 64Mb boxes # you might want to bump this up a bit on boxes with more RAM innodb_buffer_pool_size = 32M # this is the default, increase it if you have lots of tables innodb_additional_mem_pool_size = 8M # # i'd like to use /var/lib/mysql/innodb, but that is seen as a database :-( # and upstream wants things to be under /var/lib/mysql/, so that's the route # we have to take for the moment #innodb_data_home_dir = /var/lib/mysql/ #innodb_log_arch_dir = /var/lib/mysql/ #innodb_log_group_home_dir = /var/lib/mysql/ # you may wish to change this size to be more suitable for your system # the max is there to avoid run-away growth on your machine innodb_data_file_path = ibdata1:10M:autoextend:max:128M # we keep this at around 25% of of innodb_buffer_pool_size # sensible values range from 1MB to (1/innodb_log_files_in_group*innodb_buffer_pool_size) innodb_log_file_size = 5M # this is the default, increase it if you have very large transactions going on innodb_log_buffer_size = 8M # this is the default and won't hurt you # you shouldn't need to tweak it set-variable = innodb_log_files_in_group=2 # see the innodb config docs, the other options are not always safe innodb_flush_log_at_trx_commit = 1 innodb_lock_wait_timeout = 50 [mysqldump] quick max_allowed_packet = 16M [mysql] # uncomment the next directive if you are not familiar with SQL #safe-updates [isamchk] key_buffer = 20M sort_buffer_size = 20M read_buffer = 2M write_buffer = 2M [myisamchk] key_buffer = 20M sort_buffer_size = 20M read_buffer = 2M write_buffer = 2M [mysqlhotcopy] interactive-timeout