Usually, “searching a database” means looking for rows that contain a “document” in a field that matches a certain keywords set (mostly using the MATCH AGAINST construct); let’s say you want to do the opposite instead, that is finding those rows that have a field containing keywords, or a phrase, that matches to a text you already know. This is what I call an inverted search, or reverse search.
Real-life scenario: you build a website for job offers, candidates come in searching for listings that suit them, but find none, or find just a few that are not satisfactory. Either they keep coming to the website doing a search (let’s say they use the keywords “veterinar* cat” because they want to take care of animals, and like cats above all), or they subscribe to the RSS feed for that search (if your site offers one)… or they “save” this preference, and have the site send them an email as soon as a new offer pops up that matches their search, and this is where this article comes in.
Now, it’s easy to save the keywords for each user in a MySQL table, but how easy is it to check if the text of new messages matches said keywords? In PHP, you would go about building an array of keywords, and using strpos() on each of them to see if they are all contained in the text.
In MySQL you have the INSTR() which is the equivalent of PHP’s strpos(), but you cannot store arrays in a field (I mean real arrays, not serialized ones), and storing as many rows as the keywords are is going to be troublesome, as normalization isn’t needed: a “set of keywords” is mostly unique to each user subscribing to a search, so it makes sense to store them in a text field in the form of “keyword1,keyword2,keyword3,…”; again, if you were in PHP, you would do a explode(“,”,$keywords) to obtain the array of keywords, but in MySQL there is no such function, so you pretty much have to create one.
Ideally, you need a function where you pass the known document/text to match against (the concatenated subject and the body of the job offer, in this example), the string in the form “keyword1,…,keywordN“, and a parameter which tells the function that the delimiter (“glue” in PHP’s explode/implode) is a comma, so:
somefunction(document, keywords, delimiter)
This function would then return a positive integer if all of the keywords are present in the text, or 0 otherwise.
The function I put together by scraping info around was:
DELIMITER ;; DROP FUNCTION IF EXISTS `matchkwdarray`;; CREATE FUNCTION `matchkwdarray`(`str` text, `arr` text, `delimit` tinytext) RETURNS char(1) CHARSET utf8 DETERMINISTIC BEGIN DECLARE i INT DEFAULT 1; DECLARE matches INT DEFAULT 1; DECLARE token TINYTEXT; label:WHILE(matches>0) DO SET token=replace(substring(substring_index(arr, delimit, i), length(substring_index(arr, delimit, i - 1)) + 1), delimit, ''); IF token='' THEN LEAVE label; END IF; SET matches=INSTR(str, token); SET i=i+1; END WHILE; RETURN matches; END;; DELIMITER ;
To add this to your MySQL database, you can either connect via CLI, or, as most people with a website on a shared hosting, not having command-line access, using a database frontend; phpMyAdmin, bloated as it is, doesn’t support functions/routines, so you need another one, and I found Adminer to be a massively better alternative, performance and flexibility-wise, and it’s only a 150kb-something single PHP file, which makes it all the better.
So, when you test the function, you have the following:
SELECT matchkwdarray('this is just a very simple test', 'just,test', ',') Returns: 2 SELECT matchkwdarray('this is just a very simple test', 'some,stuff', ',') Returns: 0 SELECT matchkwdarray('this is just a very simple test', 'this,very,stuff', ',') Returns: 0
The function is resource efficient: as soon as a keyword is not in the text, it stops processing the keywords array. Surely, if you don’t want a 100% match but are fine with lower than that, you can easily modify the function to “stream” the whole keywords array, even removing the last character off each keywords to allow some variations, and return a percentage match which you can then evaluate later on.