Monthly Archives: February 2018

How Random is the RAND() function? SQL, Chance and the universe.

I use the RAND() function quite a bit to generate dummy data, so recently I was curious to find out how random it actually is. Randomness implies that all possible outcomes have an equal probability of occurring and therefore cannot be predicted in advance. Since the outcome can't be predicted, it is pure chance and therefore random.

Under this assumption we should, over a suitably long run, see almost every number repeated a roughly equal number of times, while over a short run numbers should generally not repeat.

The test for this is fairly simple: insert a million random values into a table and count the number of times each value occurs.

While inserting the rows we realize that randomness is also affected by the cardinality of the data. A coin has only two states, so a test of maybe ten flips should be enough to verify the randomness of the coin. A die needs many more iterations, and a random number generator could theoretically require an infinite number of iterations before it can conclusively be deemed random. This might sound ridiculous at first, because a million unique numbers might seem sufficient to prove randomness, but it doesn't conclusively prove there isn't a certain number the generator is trying to avoid. If a die is loaded such that it never lands on six, you can find that out after about 30 rolls. But how do you do that for something like RAND()?
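The 30-roll figure can be sanity-checked with some quick arithmetic (a Python sketch of the probability, nothing SQL-specific): the chance of a fair die producing no six at all in n rolls is (5/6)^n, which drops to roughly 0.4% by roll 30, so a six-free streak that long is strong evidence the die is loaded.

```python
# Probability that a FAIR six-sided die shows no six in n rolls.
# If a real die goes 30 rolls without a six, either we hit a ~0.4%
# fluke or the die is biased against six.
def p_no_six(n: int) -> float:
    return (5 / 6) ** n

for n in (10, 20, 30):
    print(f"{n} rolls: P(no six) = {p_no_six(n):.4f}")
```

The same reasoning does not transfer easily to RAND(): with billions of possible outputs instead of six faces, an "avoided" value would take an impractically large sample to detect.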

Mathematically, if the probability of one state is reduced, the probability of every other state increases proportionally.
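A one-line illustration of that proportionality, sketched in Python: remove one face from a fair die and the remaining faces each absorb an equal share of the freed-up probability.

```python
# If a die is loaded so that one face (say, six) can never come up,
# the lost probability mass is redistributed across the other faces:
# each goes from 1/6 to 1/5.
fair = [1 / 6] * 6                    # a fair six-sided die
loaded = fair[:5]                     # face six is now impossible
total = sum(loaded)                   # remaining mass: 5/6
loaded = [p / total for p in loaded]  # renormalise so the probabilities sum to 1

print(loaded)  # each remaining face now carries probability 1/5
```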

Notice in the test below that, despite the high cardinality of randvalue, there are duplicates (226). The decimal places offer an effectively unlimited series of numbers that could be explored, yet we still end up with duplicates.
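A duplicate count of this order is exactly what the birthday paradox predicts for a generator with a finite internal state. Below is a Python sketch of the arithmetic; the state-space size m = 2^32 is an assumption for illustration, not SQL Server's documented internals. Drawing n values uniformly from m possibilities yields about n − m·(1 − (1 − 1/m)^n) duplicates, which for a million draws is on the order of a hundred, the same ballpark as the 226 observed.

```python
# Expected number of duplicate draws when sampling n values uniformly
# from a space of m distinct values (birthday-paradox arithmetic).
# m = 2**32 is an ASSUMED state-space size for illustration only.
def expected_duplicates(n: int, m: int) -> float:
    distinct = m * (1 - (1 - 1 / m) ** n)  # expected count of distinct values seen
    return n - distinct

print(expected_duplicates(1_000_000, 2**32))  # roughly a hundred duplicates
```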


But the scientific method requires that a test be repeatable, so we run it again.

The second time around we see results comparable to the first run. At this point it's safe to say that the RAND() function is not truly random.

Just to be sure, we run the same test with only 100K iterations (maybe the duplicates are caused by the values rolling back on themselves). We see that even at 100K iterations we encounter duplicates, well before the available range is exhausted.

So why does this happen? The RAND() function actually accepts a seed value which, if not provided by the user, is assigned automatically by SQL Server itself.

The seed value can be of tinyint, smallint, or int datatype. For any input seed, the same random number is generated, so RAND() behaves more like a hashing algorithm: for a given input there is always the same output. Basically, while the output of the function appears to be a random number, the probability of that number being generated is not purely chance. If the seed is an int, there are only 2,147,483,647 possible inputs and, as a result, only the same number of possible outputs.
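This seed-determines-output behaviour is easy to demonstrate with any seeded PRNG. Here is a Python sketch using Python's own generator as a stand-in for RAND(); the mechanics differ, but the principle is the same: the same seed always replays the same sequence.

```python
import random

# A seeded PRNG is a pure function of its seed: reseeding with the
# same value replays exactly the same "random" sequence.
a = random.Random(42)
b = random.Random(42)

seq_a = [a.random() for _ in range(5)]
seq_b = [b.random() for _ in range(5)]

assert seq_a == seq_b  # identical: the seed fully determines every output
print(seq_a)
```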

In order to be truly random in this strict sense, the function would need to be aware of its previous outputs and ensure they do not repeat, so that every value keeps an equal probability. However, this would be extremely resource intensive and not really practical. The other option is to build some inherent randomness into the logic, such as quantum randomness at the atomic level. Thanks to The Big Bang Theory, I am sure everyone is aware of Schrödinger's cat by now. Or use cosmic background radiation or gamma particles to seed randomness into the chip-level design itself.
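Most operating systems already expose this kind of environmental entropy. As an illustration (in Python rather than T-SQL), os.urandom reads from the OS entropy pool, which is fed by unpredictable physical events such as interrupt timings and, on many CPUs, hardware noise sources, so there is no seed to replay:

```python
import os

# os.urandom reads from the operating system's entropy pool, which is
# seeded from unpredictable physical events rather than a fixed numeric
# seed, so two reads are independent and cannot be replayed.
token1 = os.urandom(16)
token2 = os.urandom(16)

print(token1.hex())
print(token2.hex())
```

Unlike RAND(seed), collisions between two such reads are astronomically unlikely rather than guaranteed by a shared seed.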

The more we use statistical analysis for machine learning, AI, game theory, etc., the more important it becomes to define what random chance is, and to recognize that not everything that seems random actually is. But that's a topic for chaos theory and another day.

Tracking DML Operations in SQL tables

Recently I was asked if there was a better way than triggers to perform audits on a table. Essentially the client has a number of tables, and each table needs a framework to audit the DML operations occurring within it. The challenge is not really the audit part but how the data is consumed after it has been logged: in some cases the data needs to be returned to the application, while in other cases it just needs to be stored in a history table for legal reasons. To begin with, let me list some of the common ways I have seen this kind of functionality implemented.

Triggers

For most people this is the very first thing they think of: simply put an AFTER trigger on the table that performs a subsequent insert into another table using the internal inserted and deleted tables. While this is often the simplest approach, there are performance concerns, since the trigger executes synchronously within the same transaction and can slow down writes. In addition, on highly transactional tables triggers can break downstream functionality such as replication. See the link below on how to go about using triggers. The problem grows with the number of tables, since every table needs its own trigger, and it complicates deployments whenever there is a change to the underlying table.

https://solutioncenter.apexsql.com/how-to-create-and-use-dml-triggers-to-audit-data-changes/

Storing T-Log backups

While this approach is not as common, I have found it implemented in a number of smaller companies that simply retain transaction log backups for a long duration, allowing them to audit any point in time historically, with storage space being the main constraint. This approach has the advantage of not cluttering the database with additional objects and not having to introduce additional features. The disadvantages are having to retain backups for a long duration and losing audit capability if any backup is corrupted. Also, the data is not readily accessible: it has to be restored on another server, which can be a problem for VLDBs. My main issue with this approach, however, has been that the database often goes down just when the data you want audited is being changed, e.g. when the database is hacked. So as a way to track records it is not perfect; fundamentally, backups are not meant to be used as an auditing tool.

https://docs.microsoft.com/en-us/sql/relational-databases/backup-restore/transaction-log-backups-sql-server

Temporal tables

This option is the most attractive at the moment, since it provides an inbuilt capability to track temporal data. Basically, it tracks changes over time, which is what auditing in this case is about. However, it is a feature available only from SQL Server 2016 and therefore not yet an option for many customers. By creating a shadow table for the selected tables and versioning rows using a system timestamp, it allows users to navigate back and forth to any point in time to see the state of the table. Think of it as an SCD type 2 where all columns are historical, tracked using a start date and end date.

http://www.enabledbusinesssolutions.com/blog/sql-2016-temporal-tables-the-way-we-are-most-likely-to-end-up-using-it/

Usage scenarios for temporal tables are listed below

https://docs.microsoft.com/en-us/sql/relational-databases/tables/temporal-table-usage-scenarios

Database Audits

While this does say "audit", it isn't really the best fit for DML operations the way we want. It can tell us who did what and when, but it doesn't record what happened to the existing data. A great tool from a security perspective but, as a way to log DML, it sucks. It is not very user friendly to extract the details you require, and even when you do, it isn't really practical to consume the data from an application perspective.

https://docs.microsoft.com/en-us/sql/relational-databases/security/auditing/create-a-server-audit-and-database-audit-specification

Change Data Capture

As its name suggests, it captures changes in data and therefore seems like an ideal tool for this case, since we can specify which tables to track and even track specific columns. With the ability to see net changes or all rows, it seems to provide every option we require. While it offers better performance than triggers (because changes are read from the transaction log), it has some significant drawbacks too. First, it needs to be enabled table by table and can clutter the database with the additional functions and jobs created to track the changes. Second, changes are not retained indefinitely, so we need to write additional logic to extract the data and load it into history tables later on. Naturally this is an overhead and, more importantly, it is not real time, unlike trigger-based logic.

http://www.enabledbusinesssolutions.com/blog/script-to-enabled-and-use-cdc-in-ms-sql-server/

Change Tracking

Change tracking is a lightweight version of CDC, in a way. It captures the fact that a row has changed but not the nature of the change: it simply tracks the primary key of the row and lets you know which version of the row is current and how many versions ago it was created. Not really helpful for an audit, but great when using ETL tools to reconcile data across tables.

https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/work-with-change-tracking-sql-server

Within the Query

Probably the most elegant solution of the lot, but it requires a massive amount of changes to the application code to accommodate the required functionality. Essentially we piggyback on the internal inserted and deleted tables, via the OUTPUT clause, to capture the changes as the DML operation happens, and then decide how to handle them within the transaction. Because the entire logic is written into the procedure, it allows massive customization of which columns and which rows qualify for audit. With the ability to add your own style of row versioning, the same history table could serve both auditing and ETL processes.

The code below might provide a better explanation of how the whole thing works. Naturally this approach provides a lot of flexibility, though it doesn't avoid the trigger drawback of running synchronously inside the DML statement; but if you use additional built-in functions, it provides the level of detail CDC offers, in real time.

-- CREATE MASTER TABLE
CREATE TABLE insertedtest
(	Id INT , 
	FullName VARCHAR(100),
	Age INT ,
	PreferredAddress VARCHAR(10)
)
GO
-- CREATE TRACKING TABLE
CREATE TABLE insertedtest_hist
(	Id INT , 
	FullName VARCHAR(100),
	Age INT ,
	PreferredAddress VARCHAR(10)
)
GO 
-- INSERT FIRST ROW INTO MASTER - PARALLEL INSERT INTO TRACKING TABLE
INSERT INTO insertedtest (Id , FullName , Age , PreferredAddress)
OUTPUT inserted.Id , inserted.FullName , inserted.Age, inserted.PreferredAddress
INTO insertedtest_hist (Id , FullName , Age , PreferredAddress)
VALUES (1, 'Insert1'  , 13, 'BLR')
GO
-- INSERT SECOND ROW INTO MASTER - PARALLEL INSERT INTO TRACKING TABLE
INSERT INTO insertedtest (Id , FullName , Age , PreferredAddress)
OUTPUT inserted.Id , inserted.FullName , inserted.Age, inserted.PreferredAddress
INTO insertedtest_hist (Id , FullName , Age , PreferredAddress)
VALUES (2, 'Insert2'  , 13, 'HYD')
GO
-- UPDATE THE MASTER TABLE; THE CHANGED ROW IS CAPTURED INTO THE HISTORY TABLE AS WELL
UPDATE insertedtest
SET Age = 25
OUTPUT inserted.Id , inserted.FullName , inserted.Age, inserted.PreferredAddress
INTO insertedtest_hist (Id , FullName , Age , PreferredAddress)
WHERE id = 2
-- TEST THE RESULT
SELECT * FROM insertedtest_hist
SELECT * FROM insertedtest

PowerBI Calendar control

A client recently asked for a calendar layout within a report. This is one area where PowerBI seems to be lacking significantly: there are no native calendar controls available, and the ones that can be added aren't really what most people would call a typical calendar.

Here are some of the common examples you may find when you search for it.

One of the most basic ways this data can be presented is as a matrix, similar to the one below; while it does the trick, it is not visually appealing.

You could also download a custom control by searching for "Custom Calendar by Akvelon", which is the closest I have come to an actual calendar so far.


Another calendar control I keep seeing quite a bit of is the one below; while it does the job, I find it isn't as intuitive as the control provided by Akvelon. The link for this calendar can be found here.