Monthly Archives: July 2015

ETL 2 million rows – Comparison of BULK INSERT, bcp, OLE DB, ADO.NET, OPENROWSET, SQL Server Destination task

A quick video comparing the performance of different ETL options when importing 2 million rows into a database. All tasks use bulk-logged options where possible.

ETL method performance

Have you seen this link on the ETL world record using SSIS?

https://msdn.microsoft.com/en-us/library/dd537533.aspx

Summary

Method                     Time
Bulk Insert                28 sec
DFT OLE DB – 500           1 min 13 sec
DFT OLE DB – 9500          35 sec
SQL Server Destination     15 sec
ADO.NET                    1 min 20 sec
bcp                        40 sec
OPENROWSET                 36 sec
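
For reference, here is a minimal sketch of two of the methods; the table name, file path and options are placeholder assumptions, not the exact setup from the video:

-- BULK INSERT (T-SQL); minimally logged when combined with TABLOCK
-- under the BULK_LOGGED or SIMPLE recovery model (other conditions apply)
BULK INSERT dbo.SalesStaging
FROM 'C:\data\sales.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    TABLOCK,
    BATCHSIZE = 100000
);

-- The bcp equivalent from the command line (character mode):
-- bcp MyDb.dbo.SalesStaging in C:\data\sales.csv -c -t, -T -S localhost -b 100000 -h "TABLOCK"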

Why I don’t have a lot of personal info on the internet – OLAP, Big Data and the future of privacy

I was asked this question while chatting with a few friends. The main complaint was that I don’t have much of a social media presence. The only social network I really follow is LinkedIn, and that’s because it is mostly a professional network. More importantly, the reason I don’t have a massive internet presence is that I have been part of some significant BI projects and appreciate the level of information that can be extracted by someone who knows how.

While most of today’s discussions revolve around Big Data, I would like to share an example from my days at Thomson Reuters, where we used SSAS to query data from the FAST search engine during the subprime mortgage crisis. What most people didn’t realize about the crisis was that a few months before the banks started collapsing, Citibank, HSBC and others announced quarterly losses in excess of 10.8 billion dollars, mainly due to bad debt from subprime mortgages. Within 24 hours of this news hitting the wire, the most frequently searched term was Lehman Brothers. About 6–8 months later, Lehman Brothers filed for bankruptcy. Sitting in IT, we weren’t able to connect the dots; those in the industry were able to forecast which banks were in trouble and planned mergers and acquisitions in anticipation.

Cut to the present.

Today big news doesn’t just belong to big multinational corporations. Things that shape public perception can be anything from Gangnam Style to the presidential elections. But a mistake people make is to assume that big analytics is done only for big things. A lot of companies use social media to keep track of the buzz around their products, and even of when people talk about poor service. While this is a good thing, it is still a double-edged sword. I recently posted about performing BI on your phone bill: take your phone bill for a few months, plug it into SQL Server and identify patterns (a toy query after the examples below shows how).

e.g.

  • If you answer a call in the middle of the night, I know the person you are talking to is important to you.
  • If I track the average time between the first call of the day and the last, I will know when you go to sleep and when you wake up.
  • If you get calls on the weekend, I can assume those people are personal friends rather than professional contacts.
  • If you call someone in the middle of the night, I can assume they are very close to you and among the people you depend on.
  • If certain numbers only call you, and certain other numbers only receive calls from you, then I can make certain assumptions about the dynamics of the relationship.
  • If you are on a call regularly at a particular time each week, I can assume you have a recurring meeting at that time.

This and many more….
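
Here is that toy query (the CallLog table and its columns are hypothetical, purely for this sketch); it surfaces the recurring-meeting pattern from the last bullet:

-- Hypothetical schema: CallLog(CallTime datetime, Number varchar(20))
-- Numbers you talk to at roughly the same time every week
SELECT Number,
       DATENAME(WEEKDAY, CallTime) AS DayOfWeek,
       DATEPART(HOUR, CallTime)    AS HourOfDay,
       COUNT(*)                    AS Occurrences
FROM dbo.CallLog
GROUP BY Number, DATENAME(WEEKDAY, CallTime), DATEPART(HOUR, CallTime)
HAVING COUNT(*) >= 4   -- same number, same weekday, same hour, four or more times
ORDER BY Occurrences DESC;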

Now you’re probably thinking that all this is based on the assumption that I have your phone bill. The thing is, I could probably have something better, like a mobile app installed on your phone (especially the free kind, the poorly built one that is easy to hack, or the one where I tell you outright but you are too busy to read the terms). Now imagine the same thing happening to your online conversations on Facebook, Twitter or LinkedIn (they have apps too). Say I take it one step further and correlate this with your bank statement: things get even more interesting. Take it further and tie it to your medical records, and it gets more interesting still, and so on.

How did things get this far? It’s actually very simple. When Facebook launched, a lot of people claimed that, being an online platform, it could never replace real face-to-face interactions, but today we see how pervasively it has impacted almost all the social connections a person has. This false sense of security allowed companies to really mine the data without end users putting any restrictions on what they share on the internet.

So am I winning my fight against the internet?

    No, because it’s not a fight; it’s a choice I made about how much of my life I want a clustered index on.

    Even if it were a fight, I wouldn’t hope to win, because when it’s all said and done, the internet still makes my life a lot easier than it makes it hard.

    Would I ever adopt social media if my privacy were guaranteed? No, because I still lack the social skills to really make friends on the internet.

Should you be bothered by this post? No, not really; at least until I spin up a Hadoop cluster in Azure HDInsight and decide to mine Twitter for all your tweets ;-). I am just kidding, I would never do that to you. You can trust** me.


https://azure.microsoft.com/en-in/documentation/articles/hdinsight-analyze-twitter-data/

You can now create nonclustered indexes on the Clustered Columnstore index

This post is to correct a mistake I made in my previous post on the updatable nonclustered columnstore index in SQL 2016; in that post I mentioned that a clustered columnstore index doesn’t allow other indexes to be created on the table. This is both true and false.

Let me explain.

If you’re using SSMS to create the index, you will see the screen and message below.

However, if you use the T-SQL command, you can still create the index on the table, as shown below. In this case I guess it’s just SSMS that hasn’t been updated.
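
For illustration, a minimal sketch of the T-SQL that succeeds (table and index names are my own placeholders):

-- Hypothetical table for illustration
CREATE TABLE dbo.SalesFact (id INT NOT NULL, Qty INT, price MONEY, Logdate DATE);

-- Clustered columnstore index on the whole table
CREATE CLUSTERED COLUMNSTORE INDEX cci_SalesFact ON dbo.SalesFact;

-- In SQL 2016 this additional nonclustered b-tree index succeeds from T-SQL,
-- even though the SSMS dialog blocks it
CREATE NONCLUSTERED INDEX ix_SalesFact_id ON dbo.SalesFact (id);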

I have raised a Connect item for this at the link below:

https://connect.microsoft.com/SQLServer/Feedback/Details/1552617

How to achieve something similar in SQL 2014

http://www.brentozar.com/archive/2014/03/add-nonclustered-indexes-clustered-columnstore-indexes/


SQL 2016 – Live Query Stats

I am a bit confused about this feature, so I thought I would post about it to get feedback on its usage in SQL 2016. SQL 2016 introduces a feature called Live Query Statistics. As the name suggests, it provides details on the actual tasks being executed, in real time. Keep in mind that previously we had the estimated (before) and actual (after) execution plans; with this feature we get the current execution plan, where we can see the order and progress of the tasks within the plan as they are being executed.

The most useful details are provided when you run SET STATISTICS PROFILE ON before viewing the Live Query Statistics.

SET STATISTICS PROFILE ON – notice the % done values under each task and the time in sec
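
To try it yourself, here is a minimal sketch; the slow cross join is just my placeholder, any long-running query will do. Enable the session option, click “Include Live Query Statistics” on the SSMS toolbar, then execute:

SET STATISTICS PROFILE ON;

-- A deliberately slow query: a three-way cross join of a system view.
-- Row counts vary per database, so cancel it once you have watched the
-- live plan animate.
SELECT COUNT(*)
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b
CROSS JOIN sys.all_objects AS c;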


The feature is only really useful for long-running queries, since smaller queries finish before the UI can display the animation. The feature also degrades performance a bit. You can view live execution plans from within Activity Monitor for long-running queries. One of the ways I see this being used is by an L1 DBA looking to kill long-running queries that are consuming system resources: you can now check how far along a query is before deciding to kill it.

Is there any other use you can think of for this feature that you couldn’t do before?

SQL 2016 – Activity Monitor – Active Expensive Queries

Hurray, L1 DBAs: you now have an easier way to find active expensive queries. A new tab has been added in Activity Monitor to show DBAs the currently active queries. However, I would still hope that everybody is familiar with using sys.dm_exec_requests.
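
For anyone who prefers the DMV route, a minimal sketch of the classic query; nothing here is specific to SQL 2016:

-- Currently executing requests with their statement text, most CPU first
SELECT r.session_id,
       r.status,
       r.cpu_time,
       r.total_elapsed_time,
       r.logical_reads,
       t.text AS statement_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID   -- exclude this query itself
ORDER BY r.cpu_time DESC;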

Plenty of new options are popping up all over Management Studio; get exploring.

SQL 2016 – Updatable Columnstore Index: finally, OLTP and DW workloads delivered

So SQL 2016 comes with an updatable nonclustered columnstore index. Wondering why this is a big deal? It is, and here’s why.

In SQL 2012 the nonclustered columnstore index was introduced, but it was severely handicapped by the fact that once it was created, the table became read-only.

In SQL 2014 the clustered columnstore index was introduced, offering better compression and performance comparable to a nonclustered columnstore index, but it also had several limitations, such as not supporting primary key / foreign key relationships, which was not practical from a relational database standpoint, especially for clients who wanted to use the feature to improve the performance of existing TB-scale databases. It did, however, allow DML on the table.

In SQL 2016 we now have the ability to insert data into a table with a nonclustered columnstore index, which gives us the power of performing DW workloads without compromising the seek performance of OLTP workloads. Now you could argue that a clustered columnstore index would provide the best of both worlds, but you will see in the screenshots below that the clustered columnstore index doesn’t really match up to a traditional index when it comes to seeks. That would have been all right if we were allowed to create additional indexes on the table alongside the clustered columnstore index; however, that is not the case.

So SQL 2016 is the version where we finally have a properly functioning index that actually delivers on the promise of OLTP and DW workloads running side by side, without compromising basic design practices like referential integrity.

Comparisons

Data Warehouse Workload

I created a test table on a VM running Windows 2012 and SQL 2016 with 3 GB RAM. The PK table has 1,000 rows and the FK table approximately 2.5 M rows; about 1.7 M rows qualify for the query below.

I create a nonclustered columnstore index and run the query; the average query execution time is 408 ms.

Now I drop the nonclustered columnstore index and replace it with a clustered columnstore index; the average query execution time is now 293 ms.

Now I drop the clustered columnstore index and replace it with a traditional index, with key columns on id and Qty and included columns on price and Logdate, supplemented with a clustered index on id, giving the index the best possible chance to perform. The average query execution time is 818 ms.
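
For reference, here is a sketch of the three configurations compared, assuming the fk_table schema implied by the queries at the end of this post:

-- 1. Nonclustered columnstore index over the query's columns
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_fk
    ON dbo.fk_table (id, Qty, price, Logdate);

-- 2. Clustered columnstore index (a table can hold only one columnstore
--    index, so drop the nonclustered columnstore first)
-- DROP INDEX ncci_fk ON dbo.fk_table;
CREATE CLUSTERED COLUMNSTORE INDEX cci_fk ON dbo.fk_table;

-- 3. Traditional b-tree indexes (drop the clustered columnstore first)
-- DROP INDEX cci_fk ON dbo.fk_table;
CREATE CLUSTERED INDEX cix_fk ON dbo.fk_table (id);
CREATE NONCLUSTERED INDEX ix_fk_qty
    ON dbo.fk_table (id, Qty) INCLUDE (price, Logdate);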

OLTP Workload

Now we explore the OLTP workload using the same indexes. Notice that the nonclustered columnstore index performs a scan on the table, and there is a missing index suggestion to improve the seek behavior.

Next we create a traditional nonclustered index and execute the query.

The final numbers

Read performance (ms)

                            DW      OLTP
Traditional indexes         818     96
Non Clustered Columnstore   408**   176
Clustered Columnstore       293     150

**This number gets better with the number of rows added.

Write performance

                            100 Rows
Traditional indexes         16
Non Clustered Columnstore   20
Clustered Columnstore       15
NC index + NCCS index       24

While I did see some decrease in write performance for the combination of a nonclustered index with a nonclustered columnstore index, I don’t think it is significant enough to be of concern. If you have a high-concurrency system where these times matter, then you’re probably better off not hosting both OLTP and DW workloads on the same database to begin with.

My Code

select cd.country, cd.city, cd.Currency, ft.Logdate, sum(ft.price) as price
from [dbo].custinfo cd
inner join fk_table ft
    on cd.id = ft.id
where ft.Qty >= 3 and ft.Qty <= 8
group by cd.country, cd.city, cd.Currency, ft.Logdate
go

select cd.country, cd.city, cd.Currency, ft.Logdate, sum(ft.price) as price
from [dbo].custinfo cd
inner join fk_table ft
    on cd.id = ft.id
where cd.id = round(rand()*1000, 0)
group by cd.country, cd.city, cd.Currency, ft.Logdate


Fill Factor and Page split demonstrated

Recently on one of the popular SQL forums I read a post by a DBA who seemed experienced but still had a poor understanding of fill factor and why it is important for SQL Server. I’ve been asked this question a number of times during trainings, so here is a video that shows some fundamental behavior of data pages when rows are inserted into SQL Server.

In the video I cover how a data page looks when you insert data into it, what typically happens when a data page is full, and the changes to the data page when a clustered index is added with a lower fill factor than the default value of 100.

I then go on to demonstrate how a lower fill factor ensures that new rows can be added to pages that are already allocated, and how a page split might arise.

This is followed by how to correct the issue once page splits have occurred. The video covers just the basics, but if you need any further clarification, please feel free to ping me; a sketch you can follow along with appears after the video link.

https://www.youtube.com/watch?v=ON4fy7_mbUQ
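
If you want to reproduce the basics yourself, here is a minimal sketch along the same lines; it is my own illustration, not the exact script from the video:

-- Hypothetical demo table with a clustered index at 70% fill factor
CREATE TABLE dbo.FillFactorDemo (id INT NOT NULL, payload CHAR(2000) NOT NULL);

CREATE CLUSTERED INDEX cix_demo ON dbo.FillFactorDemo (id)
    WITH (FILLFACTOR = 70);

-- Inspect page fullness and fragmentation before and after your inserts
SELECT index_level, page_count,
       avg_page_space_used_in_percent, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.FillFactorDemo'), NULL, NULL, 'DETAILED');

-- Rebuilding reapplies the fill factor, correcting pages bloated by splits
ALTER INDEX cix_demo ON dbo.FillFactorDemo REBUILD WITH (FILLFACTOR = 70);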