Set based vs row by row (procedural code)

By | January 25, 2017

I was reading a blog about how to write procedural code today ( link below) and wanted to take a moment to explore another approach just to provide a more complete reference. To provide some context I am explaining the business case here.

A hospital wants to identify people whose heights have changed over time in order to detect if the person is who they say they are. Since the height of a person doesn’t change after a certain age the approach is used to detect identity theft. The requirement is that we compare the heights logged during different visit and flag any patients where they have different heights logged.

The code to create the dummy data is mentioned below and was taken from the link mentioned earlier.

SET NOCOUNT ON;
 
CREATE TABLE Dates
(ID INT, VisitDate DATETIME);
 
--populate table with 20 visit dates
DECLARE @i INT, @startdate DATETIME;
SET @i = 1;
SET @startdate = GETDATE();
 
WHILE @i <= 20
BEGIN
    INSERT Dates
    (ID, VisitDate)
    VALUES (@i, @startdate);
 
    SET @startdate = DATEADD(dd,7, @startdate);
    SET @i = @i+1;
END
 
CREATE TABLE PatientHeight
(PatientID  INT NOT NULL
,Height INT);
 
-- populate table with 1000 patientids with heights between 59 and 74 inches
SET @i = 1;
 
WHILE @i <= 10000
BEGIN
    INSERT PatientHeight
    (PatientID, Height)
    VALUES (@i, @i % 16 + 59);
 
    SET @i = @i+1;
END
 
ALTER TABLE PatientHeight ADD CONSTRAINT PK_PatientHeight
    PRIMARY KEY(PatientID);
 
-- cartesian join produces 200,000 PatientVisit records
 
SELECT 
    ISNULL(PatientID, -1) AS PatientID, 
    ISNULL(VisitDate, '19000101') AS VisitDate,
    Height
INTO PatientVisit
FROM PatientHeight
CROSS JOIN Dates;
 
ALTER TABLE PatientVisit ADD CONSTRAINT PK_PatientVisit
    PRIMARY KEY(PatientID, VisitDate);
 
-- create changes of height
SET @i = 3;
 
WHILE @i < 10000
BEGIN
    UPDATE pv
    SET Height = Height +2
    FROM PatientVisit pv
    WHERE PatientID = @i
    AND pv.VisitDate = 
    (SELECT TOP 1 VisitDate 
    FROM Dates 
    WHERE id = ABS(CHECKSUM(@i)) % 19);
 
SET @i = @i + 7;
END
 
/*
-- return AdventureWorks to its previous state when you are finished
-- with this example.
 
DROP TABLE Dates;
DROP TABLE PatientHeight;
DROP TABLE PatientVisit;
*/

The first approach uses a cursor to iterate row by row and identify if there has been a change in height. The second approach uses a ranking function to perform a self-join based on patient id to identify any change. Both these approaches are detailed below as well to save the reader time.

CUROSR BASED APPROACH

-- CURSOR BASED APPROACH
CREATE TABLE #Changes
( PatientID INT
, VisitDate    DATETIME
, BeginHeight SMALLINT
, CurrentHeight    SMALLINT);
 
DECLARE @PatientID        INT
,        @CurrentID        INT
,        @BeginHeight    SMALLINT
,        @CurrentHeight    SMALLINT
,        @VisitDate        DATETIME;
 
SET @PatientID = 0;
 
DECLARE Patient_cur CURSOR FAST_FORWARD FOR
SELECT PatientID
, VisitDate
, Height
FROM PatientVisit
ORDER BY PatientID
,VisitDate;
 
OPEN Patient_cur;
 
FETCH NEXT FROM Patient_cur INTO @CurrentID, @VisitDate, @CurrentHeight;
 
WHILE @@FETCH_STATUS = 0
BEGIN
-- first record for this patient
IF @PatientID <> @CurrentID
BEGIN
    SET @PatientID = @CurrentID;
    SET @BeginHeight = @CurrentHeight;
END 
 
IF @BeginHeight <> @CurrentHeight
BEGIN
INSERT #Changes ( PatientID
, VisitDate
, BeginHeight
, CurrentHeight)
VALUES
(@PatientID
, @VisitDate
, @BeginHeight
, @CurrentHeight);
 
SET @BeginHeight = @CurrentHeight;
 
END
 
FETCH NEXT FROM Patient_cur INTO @CurrentID, @VisitDate, @CurrentHeight;
 
END
 
CLOSE Patient_cur;
DEALLOCATE Patient_cur; 
 
SELECT * FROM #Changes
 
DROP TABLE #Changes
GO

SET BASED APPROACH

-- RANKING FUNCTION BASED APPROACH
 
WITH PV_RN AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY PatientID, VisitDate) AS ROWID, * 
    FROM PatientVisit
)
SELECT t1.PatientID
,t2.VisitDate AS  DateChanged
,t1.Height AS HeightChangedFrom
,t2.Height AS HeightChangedTo
FROM PV_RN t1 
JOIN PV_RN t2 ON t2.ROWID = t1.ROWID + 1
    WHERE t1.patientid = t2.patientid
        AND t1.Height <> t2.Height
ORDER BY t1.PatientID, t2.VisitDate;

There is however a third option which I believe would be much simpler and is also much faster has been listed below. Here we use the math operator STDEV which calculates a standard deviation for the heights that were captured. When the height hasn’t changed STDEV will return a value of zero else a non-zero number which can be used to identify patient’s with different heights.

THIRD APPROACH

-- MATH OPERATOR BASED APPROACH
-- HAS THE ADVANTAGE OF PROVIDING A MARGIN FOR ERROR FOR MEASUREMENTS AS WELL
 
SELECT PatientID , MIN(Height) ,MAX(Height) FROM PatientVisit WHERE PatientID in (
SELECT  patientid
FROM  PatientVisit 
GROUP BY PatientID
HAVING STDEV(height) > 0)
GROUP BY PatientID

Hope this approach helps clarify what we are doing with set based approach since recently I have noticed that ranking functions and CTEs seem to be the go to approach for most developers when they think set based logic.

Reference

http://sqlmag.com/t-sql/programming-sql-set-based-way