Monthly Archives: March 2017

String split using full text indexes

Recently I needed a way to extract keywords from email messages in order to create a list of noise words for a spam filter as well as implement an auto complete kind of option for the search box. The first option that came to mind was to use Term extraction in a SSIS package. The plan was to run the package every few hours to identify and extracts terms from new emails. Below I have attached a link to a video where I show how to use SSIS to extract terms. But this method wouldn’t give a lot of flexibility to the user to search for recent emails in addition we knew there had to be a better way.

The second option that was considered was to use any one of the STRING_SPLIT functions out there and simply process a list of keywords as we insert the data into the table. While this could be technically done it still made us anxious about the performance impact of doing such an operation while performing the insert. In addition it didn’t really distinguish between noise words and actual words like we have in SSIS.

The third option and the one we settled on was using the full text index catalog. We already had a full text index on the column in order to search emails for recent keywords. Naturally this means the index already has a list of keywords and a mapping to the documents they belong to. We simply needed a way to access them. Enter

sys.dm_fts_index_keywords_by_document

The above DMV allows us to query the full text catalog to identify the different words pick from columns within the table. Naturally we can specify the column as well as the min number of occurrence.

SELECT  * FROM sys.dm_fts_index_keywords_by_document
( 
    DB_ID('AdventureWorksDw2012'),     
    OBJECT_ID('DimProduct') 
)

 

Result set below

The above DMV allows us have a refreshed list of new keywords without having to implement another set of logic around it. You will also notice it doesn’t have any noise words in the output. The disadvantage of this approach is that string split only works in the context of natural language processing not comma delimited etc.

We also considered using another function related to FTS

SELECT * FROM sys.dm_fts_parser (' "The LL Frame provides a safe comfortable ride while offering superior bump absorption in a value-priced aluminum frame. It provides a nice ride" ', 1033, 0, 0)

This gives us a more detailed breakdown of the individual words within the text and their position. However since it was still a function we decided against using it and reinventing the wheel.


So as you can see when it comes to splitting strings using a delimiter such as space bar we have the ability to use Full text Search which has the added advantage of performing some standard house keeping as well.

References

http://www.enabledbusinesssolutions.com/blog/mary-series-3-using-ssis-to-extract-terms/

https://msdn.microsoft.com/en-in/library/mt684588.aspx

Charting controls: – Things to keep in mind – Part 2

In this second part of the series on charting controls we take a look at a few lesser known charting controls. You can view the first part of the series by clicking here.

Don’t use Area charts to plot a single metric unless it has to do with Volume.

If there is a single metric that needs to be plotted it can always be plotted against a line graph. The only exception would be if you’re trying to give a sense of volume to the data. Much like population charts or volume of shares traded. Take the below example where I am trying to show the number of flights being flown every month. The line chart doesn’t give a real indication from a volumes perspective. If on the other hand I was trying to plot the ticket price variation on a monthly basis then the line graph makes much more sense.

 

Using the right series with just the right kind of data

Area charts are best used to show variations in quantity proportionate to other metrics over time. For example we can instantly see that June and July seem to cause a spike in late Aircraft arrival. While at the same time being able to compare the relative sizes of each type of delay against the other.

This too can be done in a Line graph but the differences won’t be as obvious. This is a limitation of the data I am using so I have attached a better example of an Area chart below. The below chart plots the different genres popularity over the last 60 years. As you can see there are much more discrete series than the four I have plotted above. But in spite of it we can see the most popular genres at any time simply by visually comparing the areas. The limitation of this chart is it is possible for a few large metrics to completely mitigate any data points for smaller metrics. E.g. in the previous chart there is one more series called Security delay which is so insignificant that it doesn’t even look like it’s part of the area chart. It helps to have values such that the average of each series in the chart has a difference of at least 250% from each other.

Use a time series

As you can see from the above examples Area charts are best suited for time series data. In other words almost always Area charts are used when the x axis represents some measure of time. Which is why it is really good for plotting seasonal changes etc.

Using Polar charts

A polar chart or radar chart is a variation of the Area chart where the time series component is removed and replaced with some discrete series groups. Much like how I mentioned in my previous post that a line chart is good for showing flights by month and a bar chart is good for showing flights by carrier. One of the most common places the average person might have seen this chart is to plot performance characteristics of car within a dashboard. Below is an example of a Radar/polar chart showing a performance characteristics of a Car.? Notice how I am plotting only one care here. The nature of this chart is such that plotting more than 2 -3 different cars will result in the chart looking cluttered. The best aspect of this kind of chart is to notice outliers quickly. See how the car score a zero for safety. When used often enough the shapes make inherent sense to the user and the chart becomes super easy to read.

All cares plotted for reference

 

An interesting conversation with Google Assistant

If you have read my previous posts you know that I have an interest in AI and machine learning. Recently my phone was updated to include Google Assistant. For those using the iPhone Google Assistant is the Android version of Siri and supposedly smarter at doing things for you. While I have already been using Google Now( the older version ) for some time and feel comfortable using it I wasn’t ready for what happened next. As an adult I know how to use these tools and so it didn’t really dazzle me when I could send out email or open YouTube videos using voice. But when my 6 year old got hold of my phone that is when I realized that things have actually changed.

I find it difficult to keep my daughter entertained so it was shocking that she spent a solid 45 minutes using Google Assistant and didn’t get bored even once. I have previously talked about The Turing Test. Which according to google is

“a test for intelligence in a computer, requiring that a human being should be unable to distinguish the machine from another human being by using the replies to questions put to both.”

After what I observed I would like to propose a new definition.

“a test for intelligence in a computer, requiring that a human being should be able to distinguish the machine from another human being by using the replies to questions put to both while still being able to interact with the machine as if it is a human being”

While this might seem like chat bots are included there is a key distinction here, the user is still treating the machine like it would a friend. I am obviously referring to the emotional aspect. When I use Google Assistant I treated it like a machine doing a job for me but when my daughter used it she treated it like a friend who maybe has some difficulty understanding the language. While at the same time being aware that she is talking to a machine resulting in her saying things like “What is wrong with you” something she would never say to any actual person. Naturally google replied “I am sorry I might be a little buggy” and my daughter replied “like a lady bug”

The fact is that depending on the age and inclination to imagine/ ignore a few things computers are already capable of passing the truing test. The two played tic tack toe for about 10 minutes and I believe google graciously let her win. Look at me I am already referring to Google assistant like its human. The fact is if a machine can keep a 6 year old with a short attention span engaged in conversation for 45 it’s a big deal.

While I am not ready to a have a Google assistant Powered T-800 babysit I think the future is looking good for AI and everything it can do for us. All we need now is to get it into the hands of the needy.

Charting controls: – Things to keep in mind

I often do a number of trainings on MSBI and PowerBI and one of the tasks we do during the course is “I would ask all participants to use the raw data and represent any information in the way they see it being used by the CEO of the company”. The participants have complete freedom to select any kind of charting control and do any changes they want to the report. Often once this exercise is done I go to each of them and explain what is great and not so great about their choice. In this post I try to explain some of the common mistakes people make when using charting controls.

Too Much 3D

By far the most common problem with beginners is the over use of 3D visualizations. While I can understand the appeal. Making a chart 3D doesn’t convey any new information and when not implemented properly can often result in distracting the user from the actual numbers. Take the below example. Both pie charts show the same information. Do you find yourself staring at the blue rim at the bottom of the 3D chart? It is much easier to focus on the numbers on the left hand side chart than the right. A clean report not only performs better during loading but also helps people stay focused on the stuff that matters i.e. the actual numbers. I often see that this mistake is made by application developers used to designing websites and windows applications.

Note that I am not saying using 3D is bad in fact later on I will show an example where 3D becomes critical to the success of the report.

Too much data too little space

The next most common mistake I see includes trying to put too much data within a single control. This is often visible when using Bar charts. Database developers often make this mistake. They try to show a single chart that answers more than 1 question. While this might be valid in some scenarios often it’s a pain to try and visualize only the subset of data that you’re interested in. PowerBI and visual interactions are a great solution to this problem because a subset of secondary charts can be configured to show only relevant data points based on the interaction with the primary chart. However this is not true for SSRS as a result the user ends up with too much information on the chart and often misses out on the one relevant data point.

Ideally a chart should answer just one question or at most two. If a chart is trying to display more than 2 dimensions it should probably be a drill through report. Take for example the below chart. Here I have listed the number of flights for the year 2015 by Month for each carrier.

 

Since I am showing the data only for 2015 it actually makes sense to rewrite the report like below.

 

 

Notice that the above chart has some noticeable difference. First it is a stacked chart secondly the x Axis represents time and the series is the Carrier. The amount of information being displayed is the same however if you notice the chart pointer you will see the US Air ceased operations in June and the number of flights operated by it seem to be taken over by AA as can be seen from the higher percent of flights operated by them in July. Knowing which chart is best suited and how to represent the data best comes with experience but if the developer puts a little thought into the report before designing it these things become clear at the get go.

Not everything needs to be chart

This is another common issue I find with developers. Often when I ask for a report it’s assumed that I am asking for a chart visual. Developers needs to understand that while charts are good. Sometimes it is much easier to represent the data clearly using a standard table or matrix. With the ability to group on rows and columns and add subtotals it provides a lot of flexibility to show “big picture” like grand totals as well as detailed analysis. For example going of the same data set we used earlier I have created a Matrix report which conveys the same information where even the numbers are available for further analysis. Not to mention additional coding can be performed to change the color of the value depending on how number net change has occurred compared to previous month etc. I have added a data bar in order to give a sense of trend here just to point out that tables and Matrix can show trends without needing to resort to a line chart or bar chart.

Using Bar chart where a Line chart should be used

A line a continuous set of points and as such when the data is continuous and implies some sort of continuity it makes sense to plot it on the Line chart for everything else there is the bar chart. For example when I am plotting the number of flights per month it makes sense for me to use a line chart. As shown below

The same could also be represented as a bar graph as shown below. While I can admit the bar graph results in a more filled out graph it doesn’t suitably imply on the continuous nature of the months against which the data is being plotted.

The line graph can be made more “FULL” by changing the axis properties to show a scale that is range bound. This helps not only to fill up the chart but also accentuate the peaks and valleys and shows subtle changes in the data more clearly.

In the next post I cover some of the other charting controls such as pyramid, KPIs as well as new controls available in tools like PowerBI.