[mythtv] Database schema upgrade 1215 and 1216 causes loss of data

Wed Apr 23 20:21:42 UTC 2008

On 04/23/2008 03:58 PM, mythtv at miwers.dk wrote:
> I just took a look at ticket #5070 after having problems upgrading the 
> database (duplicate key errors).
> http://svn.mythtv.org/trac/ticket/5070
>
> It seems that converting the columns from latin1 to utf8 causes loss of 
> data, when the columns are first converted from varchar to varbinary and 
> then back to varchar again.
>
> When MySQL hits a special character during conversion from varbinary to 
> varchar, it gives a "Warning (1366) Incorrect string value" and 
> truncates the rest of the string.
>
> Since we're talking loss of data here (program titles, recording titles 
> and such), someone really should revert or fix that change, so nobody 
> else looses data over this. (good thing there's a DB backup, but still...)
>
> I made som tests and posted my results in the ticket.
> Skipping the varchar to varbinary conversion, and instead convert 
> directly from varchar (latin1) to varchar (utf8) worked fine for me 
> without loss of data.
>   

I'm not a dev, so don't get too upset by my (not important) opinion, but 
considering there should /never/ be any "special" characters in any 
valid MythTV database, this is not a problem. 

Myth has always required a latin1 character encoding in the DB.  Where 
characters exist that cannot be expressed in latin1, they are encoded as 
multiple latin1 characters before being written to the DB.  Myth had 
always done the character encoding conversions itself upon reading the 
data from the DB.  This approach allowed us to use 8-byte characters for 
most data (throughout much of the world) and to only get a multi-byte 
penalty where special (non-latin) characters actually exist in the data.

Many have warned users not to convert their databases to use other 
character sets.  I've been saying that doing so is creating a database 
time bomb--this being yet another case where the time bomb goes off.  
See the nice big warning at the top of:  
http://www.mythtv.org/wiki/index.php/Utf8_Text_in_OSD

> Is there a reason why it's converted to varbinary first?
> It seems like an unnecessary step to me...

Because we need to convert the UTF-8 data stored in latin1 encoding to 
real UTF-8.  If we have MySQL do the conversion directly from latin1 to 
UTF-8 on a /valid/ DB (one where any multibyte characters would be 
encoded as multiple latin1 characters), MySQL will add extra (garbage) 
characters where the extra bytes (which MySQL thinks are extra 
characters) of the multibyte character exist.  By converting from latin1 
to varbinary (=no character set, no data conversions), we're removing 
MySQL's assumptions about the data format so that it can convert from 
varbinary (=raw data already in UTF-8 binary format) to (MySQL's 
internally-recognized) UTF-8.

Therefore, I'd say that anyone who has corrupted his/her database by 
doing character set conversions is likely to have issues, and, IMHO, the 
devs shouldn't be expected to waste their time cleaning up messes that 
users made of their own data by altering the database outside the 
official process.

Users who had problems with character encodings should have reported the 
issues as specific bugs that could have been fixed properly, but it 
seems many people took the "easy" way and instead broke their DB's.

If you really want to help, figure out how to convert your (broken) 
pre-upgrade UTF-8 database to latin1 with UTF-8 encoded in latin1 so 
that any users who broke their databases can unbreak them before doing 
an upgrade.  Then, document the process on the wiki (with links from, 
i.e., the broken page to which I linked, above).

Again, just my opinion.

Mike