Monthly Archives: January 2017

Set based vs row by row (procedural code)

I was reading a blog today about converting procedural (row-by-row) code to set-based code (link below) and wanted to take a moment to explore another approach, just to provide a more complete reference. To provide some context, the business case is explained first.

A hospital wants to identify people whose heights have changed over time in order to detect whether a person is who they say they are. Since a person's height doesn't change after a certain age, the approach can be used to detect identity theft. The requirement is to compare the heights logged during different visits and flag any patient who has more than one height on record.

The code to create the dummy data is shown below and was taken from the article linked at the end of this post.

SET NOCOUNT ON;
 
CREATE TABLE Dates
(ID INT, VisitDate DATETIME);
 
--populate table with 20 visit dates
DECLARE @i INT, @startdate DATETIME;
SET @i = 1;
SET @startdate = GETDATE();
 
WHILE @i <= 20
BEGIN
    INSERT Dates
    (ID, VisitDate)
    VALUES (@i, @startdate);
 
    SET @startdate = DATEADD(dd,7, @startdate);
    SET @i = @i+1;
END
 
CREATE TABLE PatientHeight
(PatientID  INT NOT NULL
,Height INT);
 
-- populate table with 10,000 patient IDs with heights between 59 and 74 inches
SET @i = 1;
 
WHILE @i <= 10000
BEGIN
    INSERT PatientHeight
    (PatientID, Height)
    VALUES (@i, @i % 16 + 59);
 
    SET @i = @i+1;
END
 
ALTER TABLE PatientHeight ADD CONSTRAINT PK_PatientHeight
    PRIMARY KEY(PatientID);
 
-- cartesian join produces 200,000 PatientVisit records
 
SELECT 
    ISNULL(PatientID, -1) AS PatientID, 
    ISNULL(VisitDate, '19000101') AS VisitDate,
    Height
INTO PatientVisit
FROM PatientHeight
CROSS JOIN Dates;
 
ALTER TABLE PatientVisit ADD CONSTRAINT PK_PatientVisit
    PRIMARY KEY(PatientID, VisitDate);
 
-- create changes of height
SET @i = 3;
 
WHILE @i < 10000
BEGIN
    UPDATE pv
    SET Height = Height +2
    FROM PatientVisit pv
    WHERE PatientID = @i
    AND pv.VisitDate = 
    (SELECT TOP 1 VisitDate 
    FROM Dates 
    WHERE id = ABS(CHECKSUM(@i)) % 19);
 
SET @i = @i + 7;
END
 
/*
-- drop the sample tables when you are finished with this example.
 
DROP TABLE Dates;
DROP TABLE PatientHeight;
DROP TABLE PatientVisit;
*/

The first approach uses a cursor to iterate row by row and identify whether there has been a change in height. The second approach uses a ranking function to perform a self-join on patient ID to identify any change. Both approaches are reproduced below to save the reader time.

CURSOR BASED APPROACH

-- CURSOR BASED APPROACH
CREATE TABLE #Changes
( PatientID INT
, VisitDate    DATETIME
, BeginHeight SMALLINT
, CurrentHeight    SMALLINT);
 
DECLARE @PatientID        INT
,        @CurrentID        INT
,        @BeginHeight    SMALLINT
,        @CurrentHeight    SMALLINT
,        @VisitDate        DATETIME;
 
SET @PatientID = 0;
 
DECLARE Patient_cur CURSOR FAST_FORWARD FOR
SELECT PatientID
, VisitDate
, Height
FROM PatientVisit
ORDER BY PatientID
,VisitDate;
 
OPEN Patient_cur;
 
FETCH NEXT FROM Patient_cur INTO @CurrentID, @VisitDate, @CurrentHeight;
 
WHILE @@FETCH_STATUS = 0
BEGIN
    -- first record for this patient
    IF @PatientID <> @CurrentID
    BEGIN
        SET @PatientID = @CurrentID;
        SET @BeginHeight = @CurrentHeight;
    END
 
    IF @BeginHeight <> @CurrentHeight
    BEGIN
        INSERT #Changes (PatientID, VisitDate, BeginHeight, CurrentHeight)
        VALUES (@PatientID, @VisitDate, @BeginHeight, @CurrentHeight);
 
        SET @BeginHeight = @CurrentHeight;
    END
 
    FETCH NEXT FROM Patient_cur INTO @CurrentID, @VisitDate, @CurrentHeight;
END
 
CLOSE Patient_cur;
DEALLOCATE Patient_cur; 
 
SELECT * FROM #Changes
 
DROP TABLE #Changes
GO

SET BASED APPROACH

-- RANKING FUNCTION BASED APPROACH
 
WITH PV_RN AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY PatientID, VisitDate) AS ROWID, * 
    FROM PatientVisit
)
SELECT t1.PatientID
,t2.VisitDate AS  DateChanged
,t1.Height AS HeightChangedFrom
,t2.Height AS HeightChangedTo
FROM PV_RN t1 
JOIN PV_RN t2 ON t2.ROWID = t1.ROWID + 1
    WHERE t1.patientid = t2.patientid
        AND t1.Height <> t2.Height
ORDER BY t1.PatientID, t2.VisitDate;
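
For completeness (this is my addition, not from the referenced article): on SQL Server 2012 and later the same previous-row comparison can be written with LAG, which avoids the ROWID self-join entirely.

-- LAG BASED APPROACH (SQL Server 2012+)
SELECT PatientID
,VisitDate AS DateChanged
,PrevHeight AS HeightChangedFrom
,Height AS HeightChangedTo
FROM
(
    SELECT PatientID, VisitDate, Height,
           LAG(Height) OVER (PARTITION BY PatientID ORDER BY VisitDate) AS PrevHeight
    FROM PatientVisit
) AS pv
WHERE PrevHeight IS NOT NULL
  AND PrevHeight <> Height
ORDER BY PatientID, VisitDate;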

There is, however, a third option which I believe is much simpler and also much faster; it is listed below. Here we use the aggregate function STDEV, which calculates the standard deviation of the heights that were captured. When the height hasn't changed, STDEV returns zero; otherwise it returns a non-zero number, which can be used to identify patients with different heights on record.

THIRD APPROACH

-- MATH OPERATOR BASED APPROACH
-- HAS THE ADVANTAGE OF PROVIDING A MARGIN FOR ERROR FOR MEASUREMENTS AS WELL
 
SELECT PatientID, MIN(Height) AS MinHeight, MAX(Height) AS MaxHeight
FROM PatientVisit
WHERE PatientID IN
(
    SELECT PatientID
    FROM PatientVisit
    GROUP BY PatientID
    HAVING STDEV(Height) > 0
)
GROUP BY PatientID;
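
As a side note (my variation, not part of the original post), the same patients can be found without the IN subquery by comparing MIN and MAX directly; for this data a non-zero standard deviation and MIN <> MAX identify exactly the same rows.

SELECT PatientID, MIN(Height) AS MinHeight, MAX(Height) AS MaxHeight
FROM PatientVisit
GROUP BY PatientID
HAVING MIN(Height) <> MAX(Height);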

I hope this helps clarify what we are doing with a set-based approach, since recently I have noticed that ranking functions and CTEs seem to be the go-to answer for most developers when they think of set-based logic.

Reference

http://sqlmag.com/t-sql/programming-sql-set-based-way

Calculating Pi in MS SQL Server

Pi is a very commonly used mathematical constant; in fact, it is probably more famous than the equation E = mc². We might not find a lot of use for it within the database community, but it is a fascinating value to calculate, and so we set out to do just that.

First we tried using the built-in PI() function within SQL Server.
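
A minimal way to reproduce that first attempt (the original post showed it as a screenshot):

-- PI() returns a float, which is only good for about 15 significant digits
SELECT PI() AS BuiltInPi;   -- 3.14159265358979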


As you can see, this causes a problem since the value is accurate only to about 14 decimal places, and if you are familiar with the math you will know that the value of pi has been calculated to well over a trillion places. So we were nowhere close.

Naturally we were not aiming for a trillion places, but we were hoping for more than 14, so the effort continued with fractional approximations.

Starting with 22/7, which is the most commonly used fraction, we get a value that is not even close in terms of accuracy.

So we went on to 355/113, which is better than 22/7 but still not as good as our original result from the built-in function.
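
Both fraction attempts can be reproduced with something like the following (the DECIMAL precision here is an arbitrary choice on my part):

-- 22/7    = 3.14285714...  (wrong from the third decimal place)
-- 355/113 = 3.14159292...  (wrong from the seventh decimal place)
SELECT CAST(22 AS DECIMAL(38,20)) / 7    AS Approx_22_7,
       CAST(355 AS DECIMAL(38,20)) / 113 AS Approx_355_113;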

So we decided to go basic and simply calculate pi as the circumference of a circle divided by its diameter, which gave us a result with the same accuracy as the value of pi derived from the built-in function.

So now we needed to go extreme and start using some proper math. While there are many different series that approximate the value of pi, we found that most do not converge fast enough; often after 100 iterations we are only accurate to 6 or 7 decimal places. The computational overhead meant that we needed a series that converges faster. Enter the Chudnovsky algorithm, which is the algorithm used to set the current world record for computing pi.

The equation is simple enough that we can code for it.
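
For reference (the original post showed the formula as an image), the Chudnovsky series is

$$\frac{1}{\pi} = 12 \sum_{k=0}^{\infty} \frac{(-1)^k \,(6k)!\,(545140134k + 13591409)}{(3k)!\,(k!)^3\,640320^{3k+3/2}}$$

and each term adds roughly 14 more correct decimal digits, which is why it converges so much faster than the simpler series.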

But first we needed a way to calculate factorials. Once we wrote the function it became clear that this method is unsuitable within MS SQL Server, since by around k = 2 we reach numbers that cause arithmetic overflow, so the approach was abandoned in favor of a less rigorous but easier-to-work-with series, the Nilakantha series.
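
As an illustration, here is a minimal sketch of the kind of factorial helper involved (an assumption on my part, the original function was not shown). DECIMAL(38,0) caps out just below 10^38, so the running product overflows at 34!, and the 640320^(3k+3/2) term of the Chudnovsky series blows past that limit even sooner.

CREATE FUNCTION dbo.fn_Factorial (@n INT)
RETURNS DECIMAL(38,0)
AS
BEGIN
    DECLARE @result DECIMAL(38,0) = 1, @i INT = 2;
    WHILE @i <= @n
    BEGIN
        -- raises an arithmetic overflow error once the running product exceeds DECIMAL(38,0)
        SET @result = @result * @i;
        SET @i = @i + 1;
    END
    RETURN @result;
END
GO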

The Nilakantha series is much easier to work with.
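
For reference, the Nilakantha series (again shown as an image in the original post) is

$$\pi = 3 + \frac{4}{2 \cdot 3 \cdot 4} - \frac{4}{4 \cdot 5 \cdot 6} + \frac{4}{6 \cdot 7 \cdot 8} - \frac{4}{8 \cdot 9 \cdot 10} + \cdots$$

which only ever needs the product of three consecutive integers per term, so the intermediate values stay far smaller than the factorials above.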

The code was implemented using a simple WHILE loop to calculate each term of the above series, and pretty soon we were reaching accuracy of up to 21 decimal places in a matter of seconds. So we went ahead and made the procedure a natively compiled stored procedure to see if we could improve the execution time and perform more iterations. In our final attempt we performed a billion iterations, which took 6 minutes and resulted in accuracy to 27 decimal places.

The final output we got was

3.1415926535897932384626433831700000000

The more accurate value would be

3.14159265358979323846264338327950288419

The code we used is shown below.

CREATE PROCEDURE ncsp_calculatepi
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS BEGIN ATOMIC WITH
(
    TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english'
)
    -- Nilakantha series: pi = 3 + 4/(2*3*4) - 4/(4*5*6) + 4/(6*7*8) - ...
    DECLARE @id DECIMAL(38,10) = 2;
    DECLARE @PI DECIMAL(38,37) = 3;
    DECLARE @even DECIMAL(38,37) = 1.00000000000000000000000000000;
 
    WHILE @id < 10000000000
    BEGIN
        -- add the next alternating term 4/(n*(n+1)*(n+2))
        SET @PI = @PI + @even * (4.000000000000000000000000000 / (@id * (@id+1) * (@id+2)));
        SET @id = @id + 2;
 
        -- flip the sign for the next term
        SET @even = @even * -1;
    END
 
    SELECT @PI AS PI;
END
GO
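
To run it, simply execute the procedure; note that natively compiled stored procedures require a database with a MEMORY_OPTIMIZED_DATA filegroup, so this only works where In-Memory OLTP has been set up.

EXEC dbo.ncsp_calculatepi;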

Frankly, beyond 10 million iterations there is not much gain in accuracy (26 decimal places) relative to the execution time, and after 10 billion iterations we didn't see any further improvement in accuracy.

The biggest challenge with calculating pi inside SQL Server is the limited precision of its data types: DECIMAL tops out at 38 digits, which rules out any approach that could provide accuracy in the range of hundreds of decimal places.

Actual vs Estimated plan and the use of Time to arrive at the best plan

I came across a comment on LinkedIn today about the difference between the actual and estimated execution plans in SQL Server and how they are used to diagnose performance issues. I wanted to point out that the estimated and actual execution plans are similar most of the time, but there are plenty of cases where they differ, at which point using the execution plan from the DMVs doesn't really make sense. I have already pointed this out in another blog post, but I am mentioning another example here for the sake of the reader.

If you look at the query below you will see there is conditional logic in it.
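
The original post showed the query as a screenshot. A hypothetical sketch of that kind of conditional logic (the table and column names here are placeholders of my own; only the procedure name getdatafromcondition is taken from the DMV query further down) might look like this:

CREATE PROCEDURE getdatafromcondition (@OrderID INT = NULL)
AS
BEGIN
    IF @OrderID IS NULL
        -- broad query: favors a scan-style plan
        SELECT OrderID, OrderDate, CustomerID FROM dbo.Orders;
    ELSE
        -- selective query: favors a seek-style plan
        SELECT OrderID, OrderDate, CustomerID FROM dbo.Orders
        WHERE OrderID = @OrderID;
END
GO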

When viewing the estimated execution plan you see a generic plan that will work regardless of the input parameter.

The actual execution plan, however, looks completely different depending on the input.

So basically the estimated execution plan cannot be used to determine exactly what happened in every case.

The second thing is that the plan you see when you run the query below

SELECT q.plan_handle, h.query_plan
FROM sys.dm_exec_query_stats q
    OUTER APPLY sys.dm_exec_sql_text(q.sql_handle) g
    OUTER APPLY sys.dm_exec_query_plan(q.plan_handle) h
WHERE g.text LIKE '%getdatafromcondition%';

 

is the estimated execution plan; I understand that some DBAs assume this to be the actual plan. If you need the actual plan, run Profiler or Extended Events (XEs) instead.
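
A minimal Extended Events sketch for capturing actual plans (the session name, target file and filter below are my own placeholders; query_post_execution_showplan is an expensive event, so scope it tightly and stop the session when you are done):

CREATE EVENT SESSION CaptureActualPlans ON SERVER
ADD EVENT sqlserver.query_post_execution_showplan
(
    ACTION (sqlserver.sql_text)
    -- hypothetical filter: narrow this to the workload you care about
    WHERE (sqlserver.database_name = N'MyDatabase')
)
ADD TARGET package0.event_file (SET filename = N'CaptureActualPlans.xel');
GO
 
ALTER EVENT SESSION CaptureActualPlans ON SERVER STATE = START;
GO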

The third thing I wanted to address is that time is not a criterion when arriving at the best plan for execution. But before that we need to address what "best" means. Here we refer to the best plan for now, not the best possible plan. The best plan for now is determined by a number of factors that are combined into a relative number we call cost. This number is not absolute and cannot be used to compare cost across plans either. The cost, however, is a multifaceted measure derived from factors such as the number of rows being fetched, the type of algorithm (operator) being used, the current system overhead, and so on. But the time for the query to complete execution is not a factor here.

Think of it this way: I want to travel from Bangalore to Mysore (about 178 km away). I could assume that I average about 45 km per hour, so I should take about 4 hours to get there. But it doesn't work like that; I won't know my average speed until I actually make the drive. And since we also reuse the first plan, we get into a chicken-and-egg situation where you won't know how long the query will take until you actually run it, and as a result you can't estimate based on it.

The second thing is that, like the road, my average speed is determined by a number of factors such as my car (system configuration), the traffic rules (Resource Governor), the actual traffic (system load), and the weather (locking, blocking, etc.); while some of these, like Resource Governor, are fixed and can be planned for, the others aren't.

So arriving at an execution plan based on a factor that can vary wildly from run to run would often result in plans that are less effective most of the time. Having said this, there is a time component that is present, and that is the time taken to do the planning; in this case, planning the trip to Mysore could take just as long as the trip itself. The optimizer tries to figure out the best possible plan by balancing cost against the time taken to arrive at such a plan (not the time taken for the plan to actually execute). I hope this helps clarify some of the concepts.

The other blog

http://www.enabledbusinesssolutions.com/blog/plan-reuse-for-conditional-statements-in-ms-sql-server/