This won’t interest most visitors to this site, but I had to post it somewhere as it will help someone out there, so it’s filed under Techie stuff.
A bout of heavy Google crawling of the site forced me to look at some of our slow database queries, which could be overlooked when we had one or two visitors per minute, but not when Google was hitting our site up to ten times per second (as it was at one point).
I got almost all of them optimised and the site was running fine, there were just a few left that were taking 2-3s or more to execute and I couldn’t work out why. We could live with 2-3s occasionally, but when the requests are coming multiple times a second they pile up, MySQL can’t catch up with the backlog and eventually the server crashes.
The queries were like this (some details have been changed for security reasons) – fulltext matches on product titles, used to find “related products” using three keywords chosen from the product title:
SELECT prod_title FROM titles WHERE MATCH (prod_title) AGAINST ('riders destiny dvd') LIMIT 8;
SELECT prod_title FROM titles WHERE MATCH (prod_title) AGAINST ('ghost town dvd') LIMIT 8;
SELECT prod_title FROM titles WHERE MATCH (prod_title) AGAINST ('ridge war new') LIMIT 8;
So why would a query like
SELECT prod_title FROM titles WHERE MATCH (prod_title)
AGAINST ('cruise mission impossible') LIMIT 8 run in less than 0.01 seconds, but queries like the above end up in my slow queries log, taking 5-6 seconds, and often over 10s when the backlog started to pile up? Most of the delay was in the Fasttext Initialization stage (as revealed by the profiler). I tried changing all kinds of settings with InnoDB, I tried limiting the number of results still further, but no luck. Of course, I did have a fulltext index on the field in question, else all the fulltext queries would have been slow.
So can you spot what is specific about the first 3 queries? If you can then you can have my job because it took me around 10 days to realise. Here are some more slow ones, see if you can spot it now:
SELECT prod_title FROM titles WHERE MATCH (prod_title) AGAINST ('riddick collection dvd') LIMIT 8;
SELECT prod_title FROM titles WHERE MATCH (prod_title) AGAINST ('riddle room new') LIMIT 8;
SELECT prod_title FROM titles WHERE MATCH (prod_title) AGAINST ('riddick collection new') LIMIT 8;
SELECT prod_title FROM titles WHERE MATCH (prod_title) AGAINST ('riddle dvd 2013') LIMIT 8;
SELECT prod_title FROM titles WHERE MATCH (prod_title) AGAINST ('ride cowgirl dvd') LIMIT 8;
The first thing I (finally) realised was that a lot of them had “dvd” among the keywords. But that didn’t really explain anything. It wasn’t till I realised that the titles that didn’t have “dvd” often had the word “new” that the penny finally dropped. It was three-letter keywords slowing down my fulltext searches! Sure enough, when I eliminated 3-letter keywords search times were drastically reduced. I even added a fourth keyword to my searches (to compensate for possible lost relevance in removing the 3-letter words), and expanded the LIMIT to 10 and searches remained in the hundredths-of-a-second range, and slow queries all but disappeared from my logs!
So I hope this will help someone else who is having problems with slow fulltext queries on InnoDB – try eliminating 3-letter words, if your application will allow!
Well, I don’t have an explanation as I don’t know enough about how MySQL works. I can tell you that my
innodb_ft_min_token_size setting (specifying the minimum word length to index) is/was set to the default of 3. I may actually set it to 4 and rebuild the index to see if that improves things further.
I thought maybe it might be about stop-words since a lot of those are 3 letters – maybe it was having to check the 3-letter words against the stopwords table somehow, but that doesn’t really make sense to me – there are four-letter words in there too, and words like “dvd” and “new” aren’t in there either.
Maybe someone can enlighten me, but I hope in any case the above article helps someone solve their problems with slow InnoDB fulltext queries!
Some new light I might be able to shed on this issue – I am almost certain this is what is going on. I found I can run this query:
SELECT prod_title FROM titles WHERE MATCH (prod_title)
AGAINST ('tom cruise mission impossible') LIMIT 8;
…and it will still run in a few hundredths of a second, even though it has the 3-letter “tom” in there. However, if I exchange “dvd” for “tom”, the query suddenly takes between half a second and a second to execute. What’s going on? Well, it seems obvious now. The string “dvd” has a very high cardinality in my index, at least 250,000 occurrences. Same for the string “new”, for example, which occurs more than 300,000 times. “Tom” on the other hand occurs less than 2000 times.
That MUST be what’s going on – words which are extremely common in a fulltext index need to be added to the stopwords list, or eliminated from queries, otherwise they slow down searches immensely. So it might not be 3-letter words at all – it’s just that they are the most commonly occurring strings. Someone more knowledgeable than me can maybe explain why.
Have I discovered something that is blindingly obvious to everyone else who knows anything about databases, or did I just make an important discovery? Either way it’s speeded up my queries no end, so I am happy, and hopefully I will help someone else out too.
This is definitely an issue with stopwords. I have at least one other word that has a very high frequency in my index, and that is “pack”. Not 3 letters, but still extremely common (think 3-pack, 4-pack etc.), occurring more than 200,000 times in product titles. And somehow most of my remaining slow queries have the word ‘pack’ in them:
'petkin jumbo pack wipes'
'sims showtime expansion pack'
'hold white pack spray'
That pretty much confirms it for me – I need to create a custom stopword table, add words like “pack” and any other very common words I can find (or just filter them out in my code) and the problem will pretty much go away. Cross-indexing with very common words can evidently cause very slow query times, presumably because the algorithm has to search through many occurrences of the stopword just to find a collation with other words in the search phrase. In fact just by eliminating the three-letter stopwords we have sped up queries to the point that we don’t even need to cache the results anymore, we can have them run on every page refresh with no major impact on site performance. Eliminating remaining high-frequency stopwords should finally put this problem to rest.