Monthly Archives: October 2015

Using Azure Scheduler

Azure Scheduler is a feature that allows users to schedule activities within Azure. It works by grouping jobs into job collections, and it is a great way to automate calls to web services.

Creating the job

After clicking Create Scheduler Job, the screen below appears.

Click Custom Create, choose the region and give the job a name. Notice that job collections are created in the Standard tier by default; this can be changed to Free later.

Press the next arrow and configure the Steps within the job

Press the Next button to visit the next screen, where you can define the schedule as one time or recurring as shown below.

Press the Finish button to start creating the job.

Changing the job to Free

Select the job collection from the list

Select Scale from the menu and change the job collection to the Free tier. Note that on the Free tier a recurring job can run at most once per hour.

Once changed, press Save at the bottom and confirm by pressing Yes.

Viewing the Job History

Click the History option from the menu, select the job from the drop-down list, and filter by status as needed.

Clicking a job history entry enables the View History Details button at the bottom of the screen; clicking it displays any errors or messages from the request.

Modify a Schedule

Select the job collection and then select Jobs from the menu

Select the job you want to modify; this opens a page where all the properties of the job can be modified.

The hazards of data types when deleting data

Recently, a question was posted in one of the forums where a developer had run a delete statement similar to the one below:

delete from #tmp
where data =1

This deleted all the data in the table, even though there was a large number of different values in the data column. The first and most obvious thought was that only the first line had been selected and run, but that wasn't the case.

A common thing developers overlook when writing code is the data type. In this case the data column was varchar. So what comes after '1' in varchar?

'11', '12', '13', '2', '111' and pretty much everything else.

The below code is a quick demonstration.

create table #tmp
( id int,
data varchar(10)
)

insert into #tmp
select 1, '01'
union all
select 2, '1'

delete from #tmp
where data = 1

-- notice how both rows got deleted: the varchar column gets converted to int
-- confirm this by looking at the implicit conversion in the execution plan

drop table #tmp

Keep in mind that implicit conversions occur when performing lookups; this often adds to query performance issues, and as this case shows, it can sometimes cause data issues as well. When SQL Server performs an implicit conversion it follows data type precedence, so here it converts the target column to the data type used by the user, not the other way around.
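A simple way to avoid the problem, sketched below against the same #tmp table from the demo, is to compare the column against a literal of its own data type so no conversion is applied to the column side:

```sql
-- Compare the varchar column against a varchar literal, so no implicit
-- conversion happens on the column (#tmp is the table from the demo above)
delete from #tmp
where data = '1'   -- only the row whose value is exactly '1' is deleted
```

With the string literal, the row containing '01' survives, which is what the developer almost certainly intended.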

Call me anything but a GURU

The first time someone called me a SQL Guru it felt a bit odd, because I never considered myself one. But once the initial shock had worn off, I felt extremely proud about finally having attained a title I had been aiming for almost all my career. However, over time I have realized the title brings with it an enormous responsibility, and that is something I wasn't prepared for. While this post is from my personal experience, I am sure most people will be able to relate to it. Attaining recognition is a double-edged sword: on the one hand it feels great to be acknowledged, while on the other hand it adds a lot of pressure to keep earning that recognition.

I have met a number of Gurus / Experts / Ninjas, call them what you will, but one thing is for sure: they never gave the title to themselves, it was always given to them by someone else. Obviously, reaching such a high standing takes years of commitment and constant improvement of oneself and one's processes, and this brings with it certain other traits, which I mention below:

Extreme pride in one's work and ability. After spending years learning and doing a job, people naturally take pride in it and gain confidence in their ability; 9 times out of 10 this is a good thing. The one time it isn't is when that same pride turns into overconfidence and an opinion or belief held by the Guru starts becoming preachy. I have done this, I know it was wrong, and most importantly I have tried to refrain ever since. I have had long-standing beliefs, things I learnt or trusted from day one, turned on their heads by someone I felt didn't know as much as I did.

Pressure to be correct all the time. This too is 9 times out of 10 a good thing; it keeps us on our toes, constantly forcing us to check and double-check our numbers. But sometimes the pressure of losing face is more than some can bear, and rather than saying mea culpa it turns into an all-out slugfest with egos bruised everywhere. I reserve my right to be human and make mistakes. Being a Guru doesn't mean you can't be wrong; it means you will try your best to be right.

Pressure to reinvent one's self. Nobody likes to be called a one-trick pony. Once you become a Guru, the challenge is to stay relevant: constantly updating your skills, learning and trying out new things, and incorporating them into your daily life. There are still only 24 hours in a day, and the more knowledgeable you become, the harder it gets to choose topics of substance that others aren't already covering. This pressure is a good thing, but people's reactions to it differ.

Humility. It's the feeling mountaineers and astronauts get when they see the world from that high up: a sense of proportion at how vast the world is, when after you climb the hill you realize there is a whole world still left to conquer. Gurus often get their title before they have fully mastered all aspects of their field. A DBA well known for performance tuning might have only a slight idea of BI but is still called a SQL Guru, and the expectation is that he knows everything, while all that goes on in the Guru's mind is "look at all this stuff I don't know."

Protecting a reputation. Most experts I know don't use aliases or nicknames; they use their actual names, be it in blogs, online forums, websites, or social media. They put everything on the table, every time. With this comes a level of accountability, as well as a nagging sensation that you're being quoted out of context or referenced and linked back to at the drop of a hat. For every point being argued, I am sure we will find one expert for it and one against; people then leave it to the experts to slug it out.

Always being nice. Last time I checked I am human, and there are still plenty of things in the world that annoy me to no end. However, being a Guru means that people look up to you and try to emulate you, and with this comes one of our biggest challenges: being nice all the time. We need to always conduct ourselves professionally no matter how pissed off we are. We might want to scream our heads off at how stubborn the other guy is being, but we can't, because it sets a bad example and only brings down others' opinion of us. Every once in a while we explode, only to write a sheepish email about how sorry we are and how we should know better. But in the end we are still human and will still get angry, frustrated, or just have one of those days.

So in summary, what I am trying to say is that being a Guru is not the end of the journey, as most people believe. It's actually the start of a much more epic one, one that requires a lot of the values that got us started along our paths and that we lost along the way: from a time when we didn't know better, weren't afraid to admit it, and couldn't care less what everybody else thought.

So call me an Amateur any day! It derives from the Latin amare, which means "to love"; in other words, "a person who does something for the love of it."

Does Partitioning Improve Performance?

Partitioning is a great feature for improving the way we manage our data, but it can also improve the performance of select statements against the table. This requires a good understanding of the way partitioning works and of how properly placing files on disk can improve IO performance. One common way partitioning improves performance for loading data is the sliding window method, but in this post I am mainly talking about how to improve read performance. Before showing the improvements from partitioning, it's important to visit indexes, since a common belief is that a good index provides faster data access than partitioning can.

You can watch the video here

Does partitioning improve performance?

We start off with a table that has 8.9 M rows in it.

The Final results are posted below as a summary to give some context

Description                                                                        Time (sec)
Unpartitioned, unindexed table                                                     22
Unpartitioned, unindexed table (MAXDOP=4, single file)                             22
Partitioned, unindexed table (serial)                                              24
Partitioned, unindexed table (MAXDOP=4, four files in filegroup)                   16
Unpartitioned, clustered index column (serial and parallel)                        28
Unpartitioned, clustered and non-clustered index columns                           15
Non-clustered columnstore index (no aggregate, no column elimination, batch mode)  20


As a benchmark I try fetching data for all customers of a particular age, the distribution of data by age is shown below.

After flushing the buffer pool the time taken to fetch data from the heap is shown below

select *
from [dbo].[stg_Daytrip]
where REF_AGE =30

SQL Server parse and compile time:


CPU time = 0 ms, elapsed time = 0 ms.


(202768 row(s) affected)


SQL Server Execution Times:


CPU time = 3525 ms, elapsed time = 22752 ms.


So we have a baseline of about 22 sec with a table scan as the execution plan.


Now I create a unique clustered index on the ID column, which is an identity column. After rerunning the query the new stats are shown below; this step is unnecessary on its own but is included for clarity.
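The post doesn't show the DDL for this step, but it would look something like the following (the index name is my own):

```sql
-- ID is the identity column mentioned above; the index name is illustrative
create unique clustered index CIX_stg_Daytrip_ID
on dbo.stg_Daytrip (ID)
```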

SQL Server parse and compile time:


CPU time = 0 ms, elapsed time = 26 ms.


SQL Server parse and compile time:


CPU time = 0 ms, elapsed time = 0 ms.


(202768 row(s) affected)


SQL Server Execution Times:


CPU time = 4181 ms, elapsed time = 28139 ms.

Not much of a change; in fact it's essentially the same, with the table scan replaced by a clustered index scan.


Next we add a non-clustered index on the REF_AGE column to improve the ability to seek and filter rows by age. Shouldn't this prevent the need to scan the entire table?
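A sketch of the DDL for this step (the index name here is illustrative; the post's actual index is named NonClusteredIndex-20151022-130840):

```sql
-- single-column non-clustered index to support seeks on REF_AGE
create nonclustered index IX_stg_Daytrip_REF_AGE
on dbo.stg_Daytrip (REF_AGE)
```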

SQL Server parse and compile time:


CPU time = 0 ms, elapsed time = 0 ms.


SQL Server parse and compile time:


CPU time = 0 ms, elapsed time = 0 ms.


SQL Server parse and compile time:


CPU time = 0 ms, elapsed time = 5 ms.


SQL Server parse and compile time:


CPU time = 0 ms, elapsed time = 0 ms.


(202768 row(s) affected)


SQL Server Execution Times:


CPU time = 3900 ms, elapsed time = 26736 ms.


There is not much change, and this is important! The above query fetches 202768 rows, which the optimizer decides is significant enough not to change the execution plan. Next I free the procedure cache to see if we can force a new execution plan to be created. After rerunning the query, the execution plan is still the same. The reason for this is:

Even after filtering for rows with age 30, SQL Server still needs to fetch all the remaining columns; this would require an expensive key lookup, which the optimizer decided wasn't worth it.


So when it comes to data warehousing, just creating an index on the column doesn't ensure it gets used, especially since DW queries, unlike OLTP queries, end up fetching a significant portion of the data anyway. Now you could argue that no business case will fetch all columns from the table, so the index should have included columns. That would be right for an OLTP system, but in DW systems we design fact tables where all columns are used in some query or other, and creating a different set of indexes with different included columns for each scenario is neither practical nor efficient, and it degrades load performance too.

For the sake of completeness I have forced the index to be used with a query hint, as shown below:



select *
from [dbo].[stg_Daytrip] with (index = [NonClusteredIndex-20151022-130840])
where REF_AGE = '30'

SQL Server parse and compile time:


CPU time = 0 ms, elapsed time = 0 ms.


(202768 row(s) affected)


SQL Server Execution Times:


CPU time = 2652 ms, elapsed time = 15741 ms.


As you can see, after two indexes and a query hint we have brought the execution time down to about 15 sec. However, imagine how many more indexes will be needed before all reports are covered.


So maybe we are using the wrong type of index. With SQL 2012 and above, Microsoft has provided us with a special kind of index: the clustered and non-clustered columnstore indexes. These provide much better IO and CPU characteristics than traditional indexes. Below are the stats produced with a NONCLUSTERED COLUMNSTORE INDEX in place.
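For reference, a non-clustered columnstore index (as in SQL 2012/2014) is created along these lines; the index name is my own, and the column list is illustrative since in practice you would name all the fact table's columns:

```sql
-- a nonclustered columnstore index requires an explicit column list;
-- include all fact columns so any report query can be answered from it
create nonclustered columnstore index NCCI_stg_Daytrip
on dbo.stg_Daytrip (ID, REF_AGE /* , ...remaining fact table columns */)
```

Note that on SQL 2012/2014 this index makes the table read-only until it is dropped or disabled, which matters for DW load schedules.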

SQL Server parse and compile time:


CPU time = 0 ms, elapsed time = 0 ms.


SQL Server parse and compile time:


CPU time = 0 ms, elapsed time = 0 ms.


(202768 row(s) affected)


SQL Server Execution Times:


CPU time = 20592 ms, elapsed time = 20445 ms.




Notice how the execution time has increased to about 20 sec. This is obviously worse than when we had normal indexes (the mode used by the non-clustered columnstore here is batch mode, which is the best way to fetch data using this index). So it looks like columnstore indexes are bad, right? WRONG.


A single columnstore index can now be used to answer any query off the fact table. So I don't need to create a large number of indexes on my fact table for all my other reports, and I don't need included columns, which create additional copies of the data and increase the size of the DW. Columnstore indexes work much faster when the data is being summarized, a common use case for DW queries.


Here I am not summarizing my data, so there is no benefit from batch mode execution on the CPU, and since I am fetching all the columns from the table, no columns are being eliminated. If I execute the same query with only half the columns in the table, the execution time also drops to about half.

Partitioning the table

Now that we know how indexes affect our query, the next step is to create a partitioned table and see the improvements. After dropping all the indexes I create two different filegroups: the primary filegroup now contains only the rows where age is 30, and the rest of the data has been moved to another filegroup. The primary filegroup also now has four files in it.
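The post doesn't show the partitioning DDL, but a sketch along these lines matches the description: a partition function that isolates age 30, and a scheme that puts that partition on the primary filegroup (the one with four files) and everything else on a second filegroup. The filegroup and object names here are my own:

```sql
-- RANGE RIGHT with boundaries (30, 31) creates three partitions:
--   REF_AGE < 30   -> FG_Rest
--   REF_AGE = 30   -> PRIMARY (the filegroup with four files)
--   REF_AGE >= 31  -> FG_Rest
create partition function pf_Age (int)
as range right for values (30, 31)

create partition scheme ps_Age
as partition pf_Age to (FG_Rest, [PRIMARY], FG_Rest)

-- the table is then created (or rebuilt) on the scheme, e.g.
-- create table dbo.stg_Daytrip ( ... ) on ps_Age (REF_AGE)
```

Mapping two partitions to the same filegroup is legal, and it keeps the layout exactly as described: only the age-30 rows live on the multi-file primary filegroup.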

SQL Server parse and compile time:


CPU time = 0 ms, elapsed time = 0 ms.


SQL Server parse and compile time:


CPU time = 0 ms, elapsed time = 0 ms.


(202768 row(s) affected)


SQL Server Execution Times:


CPU time = 3806 ms, elapsed time = 24412 ms.


After executing the query I see that performance has degraded to about 24 sec. Hmm, so it looks like partitioning doesn't improve performance?

But see what happens when I execute the same query, this time forcing parallelism and thus using multiple threads.

select *
from [dbo].stg_Daytrip
where REF_AGE ='30'
option ( maxdop 4)

SQL Server parse and compile time:


CPU time = 0 ms, elapsed time = 0 ms.


(202768 row(s) affected)


SQL Server Execution Times:


CPU time = 1809 ms, elapsed time = 16313 ms.




The execution time came down to 16 seconds; this is the improvement obtained just by placing the data in multiple files. In my case all these files are located on a single SSD drive; imagine if there were multiple dedicated disk LUNs, the improvement would be even better.



Setting up Integrated Security / Windows Authentication with Azure SQL Databases

In this lengthy post I explain how to set up Azure AD authenticated users so that they can be used to connect to a SQL database hosted in Azure. Users of Windows Azure will note that until now the only way to connect to a SQL database hosted in Azure was via usernames and passwords created for logins, the typical SQL authentication format. However, with a large number of users and multiple applications connecting to the same database, it can be cumbersome to manage SQL logins within the Azure database. A long-standing need for Azure developers has been some form of AD-authenticated access to the database, since that has been the industry standard for a long time: a place where all users can be managed centrally.

With the latest preview we can now add AD or domain-authenticated users in Azure and then map them to contained users within the Azure database. Here is how you can do it.

Create an Azure AD and an Account.

Click New at the bottom of the Azure Management Console and follow the green arrows


Once you click Custom Create, add the details below.

Here, the domain is the name of the domain that belongs to you. After pressing the check mark you will see the entry below in the Azure portal.

Add a User to the azure AD Domain


Click the domain into which you want to add the user; you can repeat this step as many times as you need, once for each user. In our case we will need a Database Admin AD account as well as a database contained AD user account.


Click the Users link on the top of the AD Page

Click Add User at the bottom of the page

In the pop-up below, enter the account name for the user; for example, in this case I call it testuser1.

Press the arrow at the bottom of the page, enter details like first name, last name, etc., and press the arrow again.

Create a temporary password for the user in the screen below and share it with the user. The user will be prompted to change the password the next time they log in.

Note that once created, a domain cannot be deleted until all users under the domain are deleted first.

Create an Azure V12 Database Server

Click the SQL Database option in the Azure portal and select Servers

Click the add button at the bottom of the screen

Enter the username and password details for the server admin in the screen below. This is not the same as the AD account created in the previous steps; it is the Azure equivalent of the SA account.

Make sure the check box in red is checked; this will automatically create the database as a V12. If you already have a server, you can upgrade it as shown here.

Once the database is created we need to add our AD user as the Admin for this SQL Server.

Adding AD user as Admin account


Connect to your Azure database by selecting the database from within the management portal.

Navigate to new portal link on the dashboard screen as shown below

Within the new portal navigate to the Server properties as shown below

Agree to the Terms

Then click Set admin at the top of the screen.

Press Select at the bottom of the screen, followed by the Save button marked in red.

Now you have added an AD-authenticated admin account for your SQL Server instance. This account is important because only it can create a contained user that authenticates with Azure AD.

Download and install SQL 2016 SSMS

The next step is to install SQL 2016 SSMS, which in turn installs the providers needed to connect with Azure AD managed accounts.

You can download SSMS 2016 here

Once installed, connect to your Azure database using the newly created and permissioned AD admin account.

Add the connection details

Press Connect and open a new query window.

Adding a Contained DB user

Before adding a contained DB user, go back to the first step and add an AD account for the contained user; the steps are exactly the same as for creating the AD admin account.
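Once connected to the user database as the AD admin, the contained user is created with syntax like the following (the user name is a placeholder for the AD account you created in the portal):

```sql
-- run in the user database, connected as the AD admin account
create user [testuser1@mydomain.onmicrosoft.com] from external provider

-- then grant whatever access the user needs, for example:
alter role db_datareader add member [testuser1@mydomain.onmicrosoft.com]
```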




Once the account is created you can start connecting with the contained user account; be sure to connect to the user database instead of the master database. Also, if you get a message saying the password has expired, make sure you're not using the default temporary password generated by Azure and that you have changed the password to a new value.

And that's all there is to it; now you have AD-authenticated users connecting to SQL Azure. This is especially useful for connecting across domains where trust relationships are a problem.

Upgrading existing Azure databases to V12

V12 of the Azure database has been out for some time now. Combined with all the new features being released in Azure and Power BI, it makes sense to upgrade. This post shows you how:

Login to Azure website and select the database that you want to upgrade

Under the quick glance menu on the right select the new portal

Within the new portal, click SQL Database, then the name of the database, followed by Server Version, and then Upgrade this Server.

In the screen asking for the server name, enter the name in the text box.

After verification there should be no more messages. Simply Press OK at the bottom of the screen.

Once the upgrade is scheduled you will see the below screen explaining how long the upgrade might take and the Server version will say “Upgrade Scheduled” instead of V2

When creating a new server in Azure, you can simply check the box at the bottom of the screen where it asks for the DBA username and password.

For a list of what's new in Azure V12, click here.

Very Large datasets GB to TB – Where to get them for free?

Every once in a while I need a very large dataset to work with. Most often this requires creating dummy data using one of my favorite data generators, Mockaroo. But sometimes a few GBs are not enough, and at these times I turn to a little-known feature of AWS (Amazon Web Services). AWS provides a repository of public datasets that run into the hundreds of GBs, covering everything from climate data, NASA satellite data, census data, and economic indicators to transportation data.

You can find these datasets at the below link

Keep in mind the data is so huge that the only way it can be provided is as an AWS snapshot, from which you need to create a volume. The sizes of the datasets range from 15 GB to 541 TB, so there is something for everybody. Some of my favorites include the Wikipedia stats, transportation data, and daily global weather data.

Watch out for the processing costs associated with each instance, by the way. This would also be a great starting place for those looking to test PolyBase with SQL 2016. For those who need an intro to AWS, I have included links below on how to launch an AWS instance and attach a snapshot volume to it.

Attaching a Volume






Measuring Uptime in SQL Server – Why nines don’t work

A key part of every DBA's job is maintaining uptime; a DBA's capability is measured by the amount of uptime he is able to ensure on the servers he manages. An important part of this is calculating the uptime of those servers. While it might sound simple, over the years I have seen different calculations being used to derive the uptime for a server.

In its simplest terms, uptime is the total time the server has been running over a particular duration. So if the defined time period is one year, the calculation goes as shown below:

Count the number of minutes in the year (total time) = 525600

Count the number of minutes the server was down during that year = 5.26

Calculate the uptime in nines format: 1 - (downtime / total time) = 1 - (5.26 / 525600) = 99.9990%
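The same arithmetic as a quick T-SQL sanity check:

```sql
declare @totalMinutes decimal(12,2) = 525600  -- minutes in a (non-leap) year
declare @downMinutes  decimal(12,2) = 5.26

select cast(100 * (1 - @downMinutes / @totalMinutes) as decimal(8,4)) as UptimePct
-- returns 99.9990
```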


However, depending on where you work, the definition of "downtime" differs. For example, a hosting provider doesn't count the server being offline during non-business hours as downtime. The same amount of downtime might also yield a different number of nines if measured over a leap year. Planned deployments are generally not counted as downtime, since downtime implies unexpected shutdown. Sometimes the server doesn't have to be down; it simply needs to be unresponsive. An unresponsive server doesn't get counted under downtime, since technically the server is still up, even though the end user sees it as down; unresponsiveness is also more difficult to measure. Similarly, network downtime doesn't count as database downtime even if the database is unavailable, since there is nothing to fix on the database side.

Most often the DBA measures only the database uptime. If a server fails, the downtime of the server might be significantly higher than the downtime of the database, e.g. when using clustering or mirroring.

While the concept of five nines sounds good from a business standpoint, the true measure of server and infrastructure health is "mean time between failures" (MTBF), a term the airline industry is very familiar with. The mean time between failures, as its name suggests, is the time duration between failures; put another way, it's the frequency with which errors happen on the system. How is this a truer metric?

To achieve four nines you have a window of about 52 minutes a year. This means you can have one failure of 52 minutes, or ten failures of 5.2 minutes each. So in the best case the MTBF is one year, and in the worst case failures occur roughly once a month. Considering that, would you still feel safe deploying mission-critical databases to the latter? A drawback of MTBF is that it doesn't measure the duration or impact of a failure; you can have a failure that lasts two days, no other failures for the rest of the year, and still end up with a number that looks really good. A combination of the two metrics provides a more complete picture of how frequently, and how badly, a server is affected by outages.
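To make the worst case concrete, here is the same arithmetic in T-SQL (the failure count of ten is the hypothetical from the paragraph above):

```sql
declare @failures int = 10               -- hypothetical: ten 5.2-minute outages
declare @daysInYear decimal(6,1) = 365.0

select @daysInYear / @failures as MTBF_Days
-- 36.5 days: roughly one failure a month, yet still "four nines" of uptime
```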

So the next time you decide to go with a hosting or cloud provider, find out both the much-advertised nines and the MTBF before making a call.

SSIS and SSRS Training for Wells Fargo

A 4-day training for MS SQL Server SSRS and SSIS with Wells Fargo in Whitefield, Bangalore. It's going to be awesome: we're going to create a SQL Server monitoring application using SSIS and SSRS, and hopefully a few other types of reports too.

Busy months ahead.