String split using full text indexes

By | March 16, 2017

Recently I needed a way to extract keywords from email messages in order to create a list of noise words for a spam filter as well as implement an auto complete kind of option for the search box. The first option that came to mind was to use Term extraction in a SSIS package. The plan was to run the package every few hours to identify and extracts terms from new emails. Below I have attached a link to a video where I show how to use SSIS to extract terms. But this method wouldn’t give a lot of flexibility to the user to search for recent emails in addition we knew there had to be a better way.

The second option that was considered was to use any one of the STRING_SPLIT functions out there and simply process a list of keywords as we insert the data into the table. While this could be technically done it still made us anxious about the performance impact of doing such an operation while performing the insert. In addition it didn’t really distinguish between noise words and actual words like we have in SSIS.

The third option and the one we settled on was using the full text index catalog. We already had a full text index on the column in order to search emails for recent keywords. Naturally this means the index already has a list of keywords and a mapping to the documents they belong to. We simply needed a way to access them. Enter

sys.dm_fts_index_keywords_by_document

The above DMV allows us to query the full text catalog to identify the different words pick from columns within the table. Naturally we can specify the column as well as the min number of occurrence.

SELECT  * FROM sys.dm_fts_index_keywords_by_document
( 
    DB_ID('AdventureWorksDw2012'),     
    OBJECT_ID('DimProduct') 
)

 

Result set below

The above DMV allows us have a refreshed list of new keywords without having to implement another set of logic around it. You will also notice it doesn’t have any noise words in the output. The disadvantage of this approach is that string split only works in the context of natural language processing not comma delimited etc.

We also considered using another function related to FTS

SELECT * FROM sys.dm_fts_parser (' "The LL Frame provides a safe comfortable ride while offering superior bump absorption in a value-priced aluminum frame. It provides a nice ride" ', 1033, 0, 0)

This gives us a more detailed breakdown of the individual words within the text and their position. However since it was still a function we decided against using it and reinventing the wheel.


So as you can see when it comes to splitting strings using a delimiter such as space bar we have the ability to use Full text Search which has the added advantage of performing some standard house keeping as well.

References

http://www.enabledbusinesssolutions.com/blog/mary-series-3-using-ssis-to-extract-terms/

https://msdn.microsoft.com/en-in/library/mt684588.aspx