[mythtv] Re: [mythtv-commits] mythtv commits

Bruce Markey bjm at lvcm.com
Tue Jun 1 22:16:42 EDT 2004


David Engel wrote:
> On Tue, Jun 01, 2004 at 08:12:18AM -0500, Kevin Kuphal wrote:
> 
>>What about using full text searches?  I think these are supported in 
>>MYSQL3 (but I could be mistaken) and they're generally pretty fast in 
>>matching text compared to a LIKE I believe.
> 
> 
> Interesting.  I didn't know about full text searches and patterned the
> search rules after the existing search functionality, though I did
> consider switching from LIKE to REGEXP.
> 
> On one hand, the full text searches could provide the flexibility I'd
> hoped to provide with REGEXP, but in a much user friendlier way.  On
> the other hand, the "fuzzy" nature of full text searches gives me some
> pause over false positives.  I'm also concerned about adding data to
> tables with existing full text indexes being slow, think
> mythfilldatabase.

I don't think about mythfilldatabase much 'cause I'm usually
asleep at the time =). If it added significant time (even 30sec
or a minute) but saved tenths of a second while the user was
interacting it would be a win. If this is any indication, I
don't think the text indexing will be a problem.

mysql> ALTER TABLE program add FULLTEXT (title,subtitle,description);
Query OK, 26198 rows affected (3.40 sec)
Records: 26198  Duplicates: 0  Warnings: 0

I'd be more concerned about fuzzy matches. Using search terms
to find a list to choose from and matches to automatically record
seem similar but in practice, they are used differently.

I may want to do a search for "Lakers" so that I can get a proglist
for games, pre-game shows, highlights, player interviews, documentaries
and a few fishing shows then choose which things I'd like to set
for Single record. However, if I want to automatically record games
I want "Los Angeles Lakers" preferably in the subtitle only and
preferably with just the title "NBA Basketball".

Fuzzy matches would be a good thing for the search pages but may
cause false positives for k*Search rules, as you mentioned, and
this could lead to some unintended conflicts.

I don't see the time for the search rules as being much of a
problem. I usually have about 190-230 items in my record list.
Scheduler runs used to average .6-.7 sec. I've added five search
rules and scheduler runs are now about 1.1-1.2 sec. I don't think
the latency is really any issue under say 3 sec. The old scheduler
used to take me 3 to 9 seconds so I don't feel the new scheduler
with a few search rules is a problem. I could see that if someone
got carried away and added 40 or 50 search rules they may be
disappointed.

I went to find some test words to compare but ended up finding
more questions than benchmarks ;-). for the amount of data I have
and the CPU, all my LIKE queries for any word took 0.09 to 0.12
seconds. All the full text searches reported 0.00. However, it doesn't
seem to support wildcards at all. Someone correct me if I'm wrong
but you couldn't do "Junkyard%Wars" as you can with LIKE. Also,
it seems to do exact word matches as in this example with "John"
and "Johnson"

mysql> select count(*) from program where description like '%John%';
+----------+
| count(*) |
+----------+
|      236 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from program where description like '%Johnson%';
+----------+
| count(*) |
+----------+
|       24 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM program WHERE MATCH (title,subtitle,description) AGAINST ('John') and description LIKE '%John%';
+----------+
| count(*) |
+----------+
|      183 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM program WHERE MATCH (title,subtitle,description) AGAINST ('John') and description LIKE '%Johnson%';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.01 sec)

If I search for "the" it returns 0 even though there are 7423.
That's good that it doesn't index these but "before" and "after"
also return 0.

The "Miss Universe Pageant" has one showing with the description:

"Women vie for the crown at the 2004 competition; co-hosts Daisy
Fuentes and Billy Bush; Gloria Estefan performs."

If I match:

'Women vie'    141
'Women'        141
'vie'            0
'Women 2004'   211
'2004'          71

"Vie" doesn't match? How big is their stopword list? In addition
to each word being some form of exact match and no wildcards, each
word is "or"d. This may not be what we need. For the purpose of
setting automatic record rules, I kind of like wildcard or regex
matches. I think the full text searchs would be less flexible, and
less accurate for limiting the result to exactly what to want to
record. I think the full text is better suited to scoring results
for a search box on a website or that kind of thing.

--  bjm


More information about the mythtv-dev mailing list