[mythtv-users] SOLVED Random lockups on Mythbackend

f-myth-users at media.mit.edu
Sun Apr 10 21:17:56 UTC 2011


    > Date: Sun, 10 Apr 2011 15:47:11 -0500
    > From: dwoody1 <dwoody1 at charter.net>

    > After a lot of testing I found the problem. The memory was bad. The
    > memory test ran for 4 days and over 100 passes without even one failure
    > (go figure).

That's perfectly reasonable, if unfortunate.

Lots of people assume, "If it passes memtest86+, it -must- be good,"
but that assumes that memtest86+ can test everything, which it can't.
(If it -fails-, you know you have a problem, but if it passes, you
actually don't know much.)

For example, I've got an old EPoX Ultra 9NPA3 that would corrupt a few
bits out of a few GB (in a pattern-dependent and mostly reproducible
way), -only- if CPU throttling was enabled.  Since memtest86+ always
runs at full CPU, it couldn't detect the problem.  But I spent a while
tracking it down and implicated only the memory or its datapaths.  I
could take one of the problematic files (a few hundred meg to a gig)
and read it once to get it into the FS cache in RAM.  After that, an
md5sum of the cached copy would return incorrect results in a loop
like "sleep 10; md5sum thing", but correct results without the sleep,
or with the sleep if I ran something CPU-intensive in another shell
at the same time.  [I had earlier
exonerated all disk datapaths---tried IDE, SATA, and USB---and
since I was using a crypto FS, I -knew- those paths couldn't be
flipping random bits or they'd be flipping random -cleartext-
bits and trashing entire sectors of the FS, which wasn't happening.
I'm very glad I was paranoid and tested the machine before putting
it into production---I'd originally found the problem simply copying
a terabyte to it and checksumming the results, and when they didn't
match, started backtracing, beginning with the network hardware.]
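
The "sleep 10; md5sum thing" loop described above can be sketched
roughly as follows.  This is only an illustration, not the script that
was actually used: the function names, the default of 20 passes, and
the 10-second idle time are assumptions.  It hashes the same file
repeatedly with an idle gap before each pass and counts how many
distinct digests come back.

```python
import hashlib
import sys
import time


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_digests(path, passes=20, idle_seconds=10.0):
    """Hash `path` `passes` times, sleeping `idle_seconds` before each pass.

    Returns a dict mapping each digest seen to how often it occurred.
    More than one key means the reads disagreed with each other.
    (Hypothetical helper; the defaults are illustrative assumptions.)
    """
    counts = {}
    for _ in range(passes):
        # Idle first, so a throttling CPU has a chance to clock down.
        time.sleep(idle_seconds)
        d = md5_of_file(path)
        counts[d] = counts.get(d, 0) + 1
    return counts


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python rehash.py FILE")
    results = repeated_digests(sys.argv[1])
    for d, n in sorted(results.items(), key=lambda item: -item[1]):
        print(f"{n:4d}  {d}")
    if len(results) > 1:
        print("Digests disagree: the data changed between reads.")
```

For this to exercise RAM rather than the disk, the file has to fit in
free memory and be read once beforehand so later passes come from the
cache.  Comparing a run with idle_seconds=10 against a run with
idle_seconds=0 (or with a CPU-heavy job running alongside) mirrors the
with-sleep versus without-sleep comparison.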

My solution for that motherboard was to just disable CPU throttling.

Now, it's -possible- that some different brand of memory might have
been just enough different that these marginal throttling-dependent
changes in datapath speeds wouldn't have led to corruption, but I
didn't feel like screwing with it; turning off throttling didn't
matter and instantly fixed the problem for good.  It's now the first
thing I try when I see nondeterministic behavior, along with disabling
any spread-spectrum the motherboard might have available.  (I have a
pair of other, even older, motherboards where having SS on often leads
to boots where the clock runs fast by many minutes per hour; turning
off SS cuts the incidence of that by about 10x or more.)

