
Why does the query throw NULL and Schrödinger’s cat?

An interesting problem came up in #SQLHelp today. The query is fairly simple and aims to pick randomly among the values 1.0, 0.6 and 0.3 by generating a random index between 1 and 3. I am not sure of the use case for it, but this is what the code tries to achieve.

SELECT choose(ABS(CHECKSUM(newid()))%3+1,1.0,0.6,0.3)


When the above query is executed multiple times, the result is sometimes NULL. The question is why?

Let’s break the problem down a little further. The first thing to note is that the expression below will always return a value between 1 and 3. It does not return NULLs or any other value.

ABS(CHECKSUM(newid()))%3+1
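As a quick sanity check (a sketch of my own, with sys.all_objects used only as a convenient row source), we can evaluate the expression a large number of times and confirm that only the values 1, 2 and 3 ever appear:

-- Bucket 100,000 evaluations of the expression; only 1, 2 and 3 should show up,
-- each roughly a third of the time.
SELECT d.bucket, COUNT(*) AS occurrences
FROM (
    SELECT TOP (100000) ABS(CHECKSUM(newid()))%3+1 AS bucket
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
) AS d
GROUP BY d.bucket
ORDER BY d.bucket;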

The below diagram explains the flow a little more clearly. The expectation is that the orange block runs first and then, based on its output, one of three possible paths is taken.

The above flow makes a lot of sense based on how the code is written. However, we need to consider the unique nature of the CHOOSE function: it is essentially a CASE statement without an ELSE block. So the actual flow, as far as SQL Server is concerned, is more like the diagram below.

Notice how in the above case the operation is more serial in nature, because the check is performed three times, compared to our original assumption (the check is performed once and control simply passes to the corresponding block). In other words, the query can be rewritten as below.

SELECT CASE
           WHEN ABS(CHECKSUM(newid()))%3+1 = 1 THEN 1.0
           WHEN ABS(CHECKSUM(newid()))%3+1 = 2 THEN 0.6
           WHEN ABS(CHECKSUM(newid()))%3+1 = 3 THEN 0.3
       END


But even in this case we should land on one of the three possible values, i.e. after the orange block is evaluated we would still end up with 1, 2 or 3, so the execution (serial or parallel) should still return a value from the green block. And here is where the OP made the mistake: the value of NEWID() changes with every call.

In the below screenshot, notice how NEWID() returns different values even when called multiple times within the same row of the same statement.
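The same behavior can be reproduced without the screenshot using a throwaway query like the one below (my own); every textual reference to NEWID() in the same row is a separate call:

SELECT newid() AS first_call,
       newid() AS second_call,
       ABS(CHECKSUM(newid()))%3+1 AS first_index,
       ABS(CHECKSUM(newid()))%3+1 AS second_index;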

Because the value of NEWID() changes with each call, the actual execution flow of the query looks like this: every WHEN branch evaluates a fresh NEWID(), so each comparison independently has roughly a one-in-three chance of matching. About (2/3) × (2/3) × (2/3) ≈ 30% of the time all three comparisons miss, and the value traverses the whole logic without being caught by any single condition, leaving us with NULL.
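A rough empirical check of that math (again a sketch of my own) is to run the expanded CASE over many rows and count the NULLs; the share should hover around 30%:

-- Count how often the expanded CASE falls through every branch.
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN d.val IS NULL THEN 1 ELSE 0 END) AS null_rows
FROM (
    SELECT TOP (100000)
           CASE
               WHEN ABS(CHECKSUM(newid()))%3+1 = 1 THEN 1.0
               WHEN ABS(CHECKSUM(newid()))%3+1 = 2 THEN 0.6
               WHEN ABS(CHECKSUM(newid()))%3+1 = 3 THEN 0.3
           END AS val
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
) AS d;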

To avoid the problem, the OP should have written the query as below, forcing SQL Server to use the persisted value of NEWID() until it reaches the end of the CASE (CHOOSE) statement.

DECLARE @newid UNIQUEIDENTIFIER = newid()

SELECT @newid,
       ABS(CHECKSUM(newid()))%3+1,
       choose(ABS(CHECKSUM(@newid))%3+1, 1.0, 0.6, 0.3),
       ABS(CHECKSUM(@newid))%3+1
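Looked at through the earlier CASE expansion, the variable is what fixes things: every WHEN branch now compares the same persisted value, so exactly one branch must match. Roughly:

DECLARE @newid UNIQUEIDENTIFIER = newid()

-- Every branch compares the same @newid value, so one (and only one) branch matches.
SELECT CASE
           WHEN ABS(CHECKSUM(@newid))%3+1 = 1 THEN 1.0
           WHEN ABS(CHECKSUM(@newid))%3+1 = 2 THEN 0.6
           WHEN ABS(CHECKSUM(@newid))%3+1 = 3 THEN 0.3
       END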


While writing this post it occurred to me that this behavior bears a very close resemblance to something in quantum physics called superposition. In simple terms, superposition states that an electron can be thought of as being in multiple states until the moment it is measured. In other words, the cat is both dead and alive, and the true value can only be derived by measuring it. Much like our NEWID(): until we measure it, it is thought to occupy all possible states, and depending on when we measure it the value will converge to any one of the possible states, or none at all.

The use of table valued parameters and its performance impact – analysis

I read a very informative post today on LinkedIn about how developers sometimes don’t understand the impact of using table valued parameters (TVPs) in a stored procedure. A table valued parameter allows developers to pass an array or table to a stored procedure in a single call, so it’s basically passing a table as the input. The post goes on to explain that because TVPs do not have statistics, they hinder the optimizer when it builds the execution plan. It further recommends that the developer move data from the TVP into a temp table (which does have statistics created for it) before running the logic.

This got me thinking about the behavior of the optimizer when it uses the TVP versus the temp table, e.g. would the overhead of loading into the temp table, calculating statistics and recompiling be worth the improvement in performance, if any?

To start off, I create a user-defined table type:

CREATE TYPE LocationTableType AS TABLE   
( rid INT
, [OriginCityName] VARCHAR(200)
, [OriginStateName] VARCHAR(200)
, [DestCityName] VARCHAR(200)
, [DestStateName] VARCHAR(200)
 );


Then I create my procedure, which takes the TVP as an input parameter.

CREATE PROCEDURE checkrecompile @Location LocationTableType READONLY
AS
SET NOCOUNT ON

-- Copy the TVP contents into a temp table (swap the joins below to test the TVP directly)
SELECT * INTO #location FROM @Location

SELECT [YEAR], [Quarter], [MONTH], COUNT(*)
FROM On_Time o
INNER JOIN #location l
--INNER JOIN @Location l
ON l.rid = o.rid
GROUP BY [YEAR], [Quarter], [MONTH]
GO


Once done, I start calling the above procedure for two different use cases: one in which we pass only 100 rows into the TVP and another with 100,000 rows, to see the impact.


DECLARE @Location AS LocationTableType;  
 
/* Add data to the table variable. */  
INSERT INTO @Location (rid,[OriginCityName],[OriginStateName],[DestCityName],[DestStateName])
    SELECT TOP 100000 rid,[OriginCityName],[OriginStateName],[DestCityName],[DestStateName]  -- switch to TOP 100 for the small-input test
    FROM on_time;
PRINT '$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$'  -- separator between batches in the output
/* Pass the table variable data to a stored procedure. */  
EXEC checkrecompile @Location;  
GO
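Timings like the ones quoted below can be captured with SET STATISTICS TIME; a minimal sketch for the 100-row case (switch the TOP clause for the large case):

DECLARE @Location AS LocationTableType;

INSERT INTO @Location (rid,[OriginCityName],[OriginStateName],[DestCityName],[DestStateName])
    SELECT TOP 100 rid,[OriginCityName],[OriginStateName],[DestCityName],[DestStateName]
    FROM on_time;

SET STATISTICS TIME ON;   -- CPU and elapsed time are reported in the Messages tab
EXEC checkrecompile @Location;
SET STATISTICS TIME OFF;
GO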

The below execution plan shows the behavior when using a TVP with 100 rows as the input. You can ignore the plan in Query 1 because it is only used to populate the variable (something usually done by the application). This query takes on average 1 ms.

The next execution plan is for 100,000 rows in a TVP. This query takes on average 850 ms.

Notice the warnings that arise from an insufficient memory grant, causing a spill to tempdb; hence the need for statistics.

From the above we can see that using a TVP doesn’t change the shape of the execution plan, but it does cause inefficient memory grants for workloads whose input sizes vary widely.

Now we change the procedure code to store data from the TVP into a temp table before running the join and GROUP BY.
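In other words, the procedure now looks roughly like the below (the earlier listing already included the SELECT … INTO #location line; the only change is which join is active):

ALTER PROCEDURE checkrecompile @Location LocationTableType READONLY
AS
SET NOCOUNT ON

-- Copy the TVP contents into a temp table so statistics are created for the join
SELECT * INTO #location FROM @Location

SELECT [YEAR], [Quarter], [MONTH], COUNT(*)
FROM On_Time o
INNER JOIN #location l   -- join the temp table instead of @Location
ON l.rid = o.rid
GROUP BY [YEAR], [Quarter], [MONTH]
GO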

The below execution plan shows the performance for 100 rows. Here we see two plans: the first for loading the data into the temp table and the second for the query execution. Notice how it isn’t very different from the plan we had for a TVP, except for the Hash Match. This query takes 5 ms to complete, as opposed to 1 ms for the TVP, but if we execute the same code once more the execution time dips to 1 ms because the recompile overhead is no longer there. However, if I switch to 100,000 rows and then back to 100 rows, we again encounter the recompile overhead.

Now let’s explore the temp table logic with 100,000 rows.

As you can see, we now benefit from parallelism because statistics are available for the data in the temp table, and as a result we get a significantly different execution plan. The above query takes on average 321 ms, compared to the 850 ms execution time of the TVP. So we definitely see the advantage of using temp tables when it comes to a large number of rows. Naturally we see increased CPU usage, but that is an acceptable trade-off considering the number of rows we are dealing with.

If the above procedure is recompiled due to a change in the number of rows, we still see only an additional 73 ms of overhead, which answers the question of recompile cost for large data sets.

So, to summarize:

Using a TVP makes sense for small datasets (typically a few hundred rows)

Using a TVP also makes sense if the server is already CPU bound or tempdb is not optimized

Using a TempTable makes sense for large datasets (typically many thousands of rows)

For patterns where the number of input rows to the TVP fluctuates frequently, it makes sense to use a TVP if the majority of calls pass only a few hundred rows (it’s much better to have a query that takes 1 ms vs a query whose recompile takes 6 ms); otherwise use a temp table (it’s much better to have a query that recompiles in 73 ms than a query that doesn’t and still takes three times as long).


Reference

Table valued parameters and performance
