Monthly Archives: June 2018

Online Indexes Failures in Azure for VLDBs

Recently we had a requirement that was caused by a unique situation. A table with a column of datatype nvarchar(max) was inadvertently populated with text in excess of 20MB per row. This resulted in the table blowing up from around 3GB to over 250GB and later 350GB. Naturally the sudden surge in the size of the database was immediately noticed and we needed to rectify the issue. The environment was an Azure database, and the growth in DB size ran us into the limits of the service tier (the max DB size allowed is only 200GB). So we needed to fix the problem.

Here is how we went about it:-

Since the application had reached its service tier limit we needed to bring the size of the database down ASAP. Mainly we needed to ensure there were free data pages available for new data. Since the database was already at its limit we could not NULL out the column batch by batch, as this would make the transaction log grow. So we ended up dropping the column and recreating it. This immediately made the database available for new instructions, but the size of the database still remained approximately 300GB, because the pages that had been allocated had not yet been released back to the OS; in other words, the database had not been shrunk.
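A rough sketch of that step, using the table from later in this post and a hypothetical column name (the real schema isn't shown here):

ALTER TABLE [dbo].[On_Time] DROP COLUMN [Remarks]   -- metadata-only drop of the bloated nvarchar(max) column, so it completes immediately
GO
ALTER TABLE [dbo].[On_Time] ADD [Remarks] NVARCHAR(MAX) NULL   -- recreate it empty so the application keeps working
GO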

So we now had the table accessible but the size of the database needed to come down. Traditionally we are offered two solutions to achieve this within SQL Server:-

DBCC CLEANTABLE

Reclaims space from dropped variable-length columns in tables or indexed views
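For reference, the command itself is straightforward (the database/table names are taken from the later example, and the batch size of 100000 rows per transaction is just an illustration):

DBCC CLEANTABLE ('madworks', 'dbo.On_Time', 100000)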

As the above description makes clear, DBCC CLEANTABLE can fix up data pages in cases where there has been a change to variable-length columns, in this case a dropped column. Which is exactly what we did. However we encountered a different problem now. The database was constantly in use and as a result it was taking far too long to acquire the required locks on the underlying table. The second problem we faced with this approach was that the service tier assigns only a limited amount of resources from the elastic pool, so we could not supercharge the command by allocating more resources in the hope it would complete sooner. After waiting for over 12 hours we realized this was a no-go. So we moved on to the next step.

REBUILD INDEX.

Now in order to reclaim the space we would need to create a whole new set of data pages, which is what happens when we launch a clustered index rebuild. Since the clustered index rebuild will allocate and reassign data into new data pages, it would automatically take care of the unwanted space allocated to the nvarchar(max) column. We hoped that, because of the options available with REBUILD INDEX, we would see better results than with DBCC CLEANTABLE. Initially we issued a straightforward command to rebuild the index, such as the one below:-

In its simplest form you can rebuild an index with the below command

ALTER INDEX [PK_On_Time] ON [dbo].[On_Time] REBUILD

 

But this would not work for us, since the table has other columns of datatype ntext etc., which means the index can only be rebuilt OFFLINE. If you try to rebuild a clustered index ONLINE on a table with these LOB datatypes you get the below error:-

Msg 2725, Level 16, State 2, Line 3

An online operation cannot be performed for index ‘PK_On_Time’ because the index contains column ‘OriginState’ of data type text, ntext, image or FILESTREAM. For a non-clustered index, the column could be an include column of the index. For a clustered index, the column could be any column of the table. If DROP_EXISTING is used, the column could be part of a new or old index. The operation must be performed offline.
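Incidentally, you can check up front whether a table still carries any of these legacy LOB columns with a quick query against the catalog views (using the same table as an example):

SELECT c.name AS column_name, t.name AS type_name
FROM sys.columns c
JOIN sys.types t ON c.user_type_id = t.user_type_id
WHERE c.object_id = OBJECT_ID('dbo.On_Time')
AND t.name IN ('text', 'ntext', 'image')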

So we really had only one choice, and that was to rebuild the index OFFLINE:-

 

USE [madworks]
GO
ALTER INDEX [PK_On_Time] ON [dbo].[On_Time] 
REBUILD PARTITION = ALL 
WITH (
 
PAD_INDEX = OFF, 
/** OFF by default, so the intermediate-level pages of the b-tree are filled to near capacity; 
if ON, the fill factor stored in sys.indexes is applied to the intermediate pages as well. Not an issue in most cases.**/
 
STATISTICS_NORECOMPUTE = OFF,
/** OFF by default, which leaves automatic statistics updates enabled; setting it ON stops stale statistics 
from being recomputed automatically, leaving it to the DBA to update them later. 
Query plans may be affected by out-of-date statistics, 
so recompute stats as close to the index operation as possible.**/
 
SORT_IN_TEMPDB = OFF, 
/** Default is OFF. If the table is small the sort is usually done in memory and this setting is ignored.
If the table is large, setting this option to ON can speed up the rebuild by using the multiple files allocated to tempdb
as part of tempdb optimization. In Azure the tempdb files are on specially optimized disks, so it is usually a good idea to set this to ON in Azure.
**/
ONLINE = OFF, 
/** Default is OFF, which exclusively locks the table until the rebuild operation is complete. This makes the table unavailable and blocks other queries.
Set to ON for large or frequently accessed tables, since an index rebuild can go on for hours and the system will otherwise become unresponsive.
In our case we are forced to set it OFF because of the ntext columns etc.
Setting it OFF makes the rebuild finish faster, but the table is unusable until it completes, and vice versa.**/
 
ALLOW_ROW_LOCKS = OFF, 
/** When ON (the default) SQL Server may take row-level locks on the index; a large number of row locks can make lock management heavy. OFF disallows row locks.**/
 
ALLOW_PAGE_LOCKS = OFF,
/** When ON (the default) SQL Server may take page-level locks, which keeps lock management lighter than row locks, but other queries might wait longer for locks.
SETTING BOTH ROW AND PAGE LOCKS TO OFF MEANS ONLY TABLE LOCKS ARE TAKEN, SO THE TABLE ACTS LIKE AN OFFLINE REBUILD, i.e. not available to other queries for DML operations.
**/
RESUMABLE = OFF
/** RESUMABLE = ON lets a rebuild be paused and resumed, but it requires ONLINE = ON, so it cannot be used with this offline rebuild.**/
)
 
 
GO

So armed with the above script we went ahead and tried to rebuild the clustered index. Performing an OFFLINE rebuild naturally means we cannot access the table in the meantime, so we tried to do this during off-peak hours, but the operation would not complete even after 8 hours of running, mostly due to Azure service tier limitations. We constantly kept running into excessive waits for WRITELOG and other associated wait types. So clearly the index rebuild wasn't going to work either, because it never completed within the SLA and the daily rollback was killing us.
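For anyone who wants to see the same thing on their own database, the waits can be checked with the database-scoped wait stats DMV available in Azure SQL Database:

SELECT TOP 10 wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_db_wait_stats
ORDER BY wait_time_ms DESC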

Finally, left with no other choice, we went with the below approach (sketched in code after the list).

  • Copied the data from the table into a staging table.
  • Dropped the FKs, indexes and constraints on the original table so that it acted as a standalone table.
  • Truncated the standalone table.
  • Repopulated it with data from the staging table.
  • Rebuilt the index (which completed in about 20 min) – since the table is now 3 GB and we aren't trying to rebuild a 300 GB table.
  • Recreated all FKs, indexes etc.
  • Reseeded the identity column.
  • Shrank the database.
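A minimal sketch of those steps, with an illustrative column list and staging table name (not the actual schema):

SELECT * INTO dbo.On_Time_Staging FROM dbo.On_Time                -- 1. copy the data out
GO
-- 2. drop the FKs, non-clustered indexes and constraints on dbo.On_Time (schema specific, omitted here)
TRUNCATE TABLE dbo.On_Time                                         -- 3. deallocate the bloated pages
GO
SET IDENTITY_INSERT dbo.On_Time ON
INSERT INTO dbo.On_Time ([Id], [FlightDate], [OriginState])        -- 4. repopulate, keeping the original identity values
SELECT [Id], [FlightDate], [OriginState] FROM dbo.On_Time_Staging
SET IDENTITY_INSERT dbo.On_Time OFF
GO
ALTER INDEX [PK_On_Time] ON [dbo].[On_Time] REBUILD                -- 5. rebuild against 3 GB instead of 300 GB
GO
-- 6. recreate the FKs and other indexes, then reseed and reclaim the space
DBCC CHECKIDENT ('dbo.On_Time', RESEED)
DBCC SHRINKDATABASE (N'madworks')
GO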

The whole thing took about 2 hours to complete and test.


Are we looking at gender pay gap incorrectly?

This is a topic that is very dear to me, not because of the gender aspect of it but because I am a strong believer in equal pay for equal work regardless of gender, race, religion or nationality. I strongly believe there is a wage gap and that there is gender bias, but I have been looking at the data myself and have found things I am not able to reconcile with what I see in the real world. This has already been analyzed by a number of universities and newspapers around the world. But I don't believe things until I see them with my own eyes (more on this in another blog).

The first and foremost issue is that we look at the pay gap as a monolith. The most common number thrown around is 80 cents to the dollar. That however is a gross generalization. Unless humanity is a uniform porridge of equal portions of men, women, black, white, married, single, young, old, rich, poor, Christian and Muslim, trying to bucket people into just two groups doesn't make sense. But even assuming all things being equal (which they are not), we will still have more men than women because of how evolution works; see the video below for why.

https://www.youtube.com/watch?v=3IaYhG11ckA

 

To get a meaningful understanding we need to do an apples-to-apples comparison. So we look at the finer details, like within a specific industry, and during my research I found that the most common example used is that of CEOs. But the truth is CEOs are not the best occupation to use, because their salaries vary not because of their ability but because of the market valuation of the companies they serve. The closest relatable example I can think of is software development, because this profession inherently doesn't have any gender bias, unlike truck drivers or construction workers. Why? Because a mind is a mind, and it doesn't really matter what container it is using.

The below graph shows the distribution of the pay ratio of women/men on the X axis and the population ratio of women/men on the Y axis. Any industry where the pay/gender details were unavailable or the numbers were too low was ignored. What stands out in the graph is the fact that the software industry does a decent job of pay parity, with quite a few occupations above 90% (the remaining 10% can be accounted for by overtime, travel, educational qualifications, position within the organization etc.). However we see something shocking in terms of the number of women vs men in the industry: in almost all cases the number of men is more than double the number of women. The root cause for this disparity is that women do not pursue technical skills as much as men; I know this to be especially true in India. Additionally, women are often willing to trade pay for benefits like flexible working hours, WFH etc.

Now let's look at the service industry, which is not technical. Here again we see that there is a population disparity in terms of the number of women employed in each sector. Again the jobs are physically demanding and as a result less attractive as a career option. But if you notice, you will see that despite the rigours of the job and the obvious disparity in the population of women, the pay ratio is close to 85% for most sectors. Basically in these profiles we see that men take the high-risk jobs but the pay only slightly reflects the difference in risk; the rest can be accounted for under overtime etc. as mentioned above. FYI the blue dot at the top right is hotel staff, where tips etc. are not considered.

So at this point we have established that there are certain industries that the majority of women actively try to avoid working in. (Yes, I said it. It's an open secret and the numbers don't lie.) But let's look at the industries where there is no such disparity between men and women in terms of employment. Like I mentioned earlier, a truck driver or construction worker is predominantly a male occupation, just like teaching, nursing, textiles etc. are dominated by females. The below graph shows only those occupations where the population ratio of women to men is above 97%.

Almost immediately we can see two trends:-

  • Almost all the jobs are low paying jobs.
  • In some of these industries women actually get paid more than men.

Naturally we also see sales agents hovering at 58%, but then sales is based on commissions and the ability to sell, which some people have and others don't.

Hopefully by now you're convinced that the problem cannot be adequately described by "80 cents to a dollar", which I feel is more marketing hype than anything else. But what if I told you this isn't even the root cause of my concern?

To understand this I need to first show you where I got my data from. The data for the above charts and analysis comes from the US Labor Department website. Kudos to them for providing such accurate data; most countries don't have anything near as complete as this dataset. Now if you look at the link provided in the references section, you will see that the value provided for pay is the median, not the average.

Why does this matter?

This matters because the median means something totally different and is not an accurate way to measure pay if you're going to treat the group as a monolith. The median takes the middle number from a set of values, which makes sense if the distribution of numbers follows a Gaussian curve and we have an equal number of people on either side of the median.

This distribution implies that the majority of employees are in middle management. But any HR person will tell you almost every job follows a pyramid structure.

So we are looking at something like the chart below, but it is not the entire Gaussian curve; it should actually be only the right-hand side.

 

Take a fictional company, where everything else is equal, with 16 employees and a total pay budget of 1M, with the pay structure shown below:-
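To make the arithmetic concrete, here is one hypothetical pyramid-shaped pay structure that fits those constraints (these are my own illustrative figures, not the exact ones from the chart):

  • 8 junior staff at 40,000 each = 320,000
  • 5 mid-level staff at 60,000 each = 300,000
  • 2 senior staff at 90,000 each = 180,000
  • 1 top-level employee at 200,000

The total is exactly 1,000,000, so the average is 62,500 per head, while the median (the average of the 8th and 9th sorted values) is (40,000 + 60,000) / 2 = 50,000.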

Notice that the median multiplied by the number of employees results in a shortage of 200,000 against the budget, while the average doesn't. So now that I have made this point, let's try to approximate this model to something more realistic. If you remember the first scatter plot, you will see the gender ratio for software engineers is 0.25, i.e. for every 4 guys there is one girl.

If we take the above headcount and split it by this ratio (giving the advantage to women wherever possible) we get something like this. Notice I still have only 16 employees and all I have done is add gender as a slicer.

Notice that men and women are given the same salary here (there is no pay gap). However when we calculate the median, notice how women end up with a smaller number compared to men. How can that be when we know for a fact that the salary is the same? Well, it's because of the population distribution. In fact if you divide 42857 by 50000 you get .85, which is exactly what the first scatter plot shows.

If you are arguing that there is no woman in the top level, then know that if I did add one, the median would only have gone up, further proving my point. So what I am trying to say is that even when we give men and women the same salary, we can see differences purely because of the way we are calculating things.

Summary

There is a pay gap between the genders, but it's not as bad as we think, at least until we can get more accurate data. The real problem seems to be the low percentage of women employed in high-paying sectors.

Disclaimer

I understand that people better than me have dealt with this and probably know things I don't, which is why I have transparently included everything I have used to arrive at this conclusion in the link below. This is the data set used, and it reflects the numbers published by the US Labor Department as well.

References

https://www.bls.gov/cps/cpsaat39.htm

Adding a Quintillion

An interesting problem was asked on #sqlhelp recently about adding very large numbers, in excess of 20 digits, in an SSAS tabular model. While most of us will find that the bigint or decimal datatype is large enough for anything we can imagine storing, it is still an interesting problem, mostly because as the volume of data increases we are bound to get to a point where even 20 digits might seem insufficient. Until recently trillions were still considered rare, which is no longer the case.

So how do we approach the problem?

Let's assume for now that the numbers are individually on the very-large-number scale and we encounter the issue only when we add them. Let's assume we create a table with a bigint datatype as shown below
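Something along these lines, where the table name and the 21-digit value are purely illustrative:

CREATE TABLE dbo.bigsum (n BIGINT)
GO
INSERT INTO dbo.bigsum VALUES (123456789012345678901)   -- 21 digits: fails with an arithmetic overflow, since bigint tops out at 9,223,372,036,854,775,807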

The solution here would be simply to change the datatype to decimal (38, 0)

But what if we encounter numbers bigger than this?

No problem, because we can store up to 38 digits in the decimal datatype
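Continuing the illustrative example, widening the column lets us store values right up to 38 digits:

ALTER TABLE dbo.bigsum ALTER COLUMN n DECIMAL(38, 0)
GO
INSERT INTO dbo.bigsum VALUES (12345678901234567890123456789012345678)   -- 38 digits fit comfortably in DECIMAL(38, 0)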

Let’s go for an even larger number 137654213712812673161231237234125412123132

Now even the decimal datatype is insufficient. Do we have another datatype capable of storing even larger numbers?

Unfortunately not, so we need to implement some sort of workaround. We could store the data in a varchar(200) datatype so that the number is represented at its actual value; however, at that point we lose all ability to perform math operations on the underlying datatype.

In this approach I have used the string datatype to break the number into smaller pieces and then perform the operation on the smaller pieces first. Here is the result we are trying to verify.

First I have a table in which the numbers are being stored.

CREATE TABLE [dbo].[bignumbers](
	[id] [INT] IDENTITY(1,1) NOT NULL,
	[number] [VARCHAR](100) NULL
) ON [PRIMARY]

 

Then I insert a sample number into the table

INSERT INTO bignumbers
SELECT   '123456789012345678901234567890123456.123456789012345678901234567890123456'

 

Finally I run the below script to generate the SUM of the above numbers. I wrote the code in a bit of a hurry, so please don't judge me for not making it pretty. Also I am sure it could be rewritten to be a bit more efficient, but I haven't tried that yet; maybe in another post.

The below code has been tested for up to decimal(73, 36). Please feel free to share your thoughts and let me know if you find an error.

DROP TABLE IF EXISTS #results 
 
-- Number of digits in the longest fractional part; shorter fractions are right-padded to this length before summing
DECLARE @SigDigits INT =(SELECT  MAX(LEN(CAST(SUBSTRING(number, CHARINDEX('.',number)+1,100) AS DECIMAL(38,0)))) FROM bignumbers)
 
-- Sum the fractional parts as integers, right-padding shorter fractions with zeros so the digit positions line up
DECLARE @DVal AS VARCHAR(100) 
SELECT @DVal= 
SUM(
	CAST(
			SUBSTRING	(number, 
						CHARINDEX('.',number)+1,
						100
						)+ REPLICATE('0',@SigDigits-LEN(
															SUBSTRING	(number, 
															CHARINDEX('.',number)+1,100))
							)
						AS DECIMAL(38,0))
						)
FROM bignumbers
 
-- If the fractional sum overflowed past @SigDigits digits, carry the excess into the integer part
DECLARE @Carryover INT =0
IF LEN(@DVal) > @SigDigits
BEGIN 
SET @Carryover =  LEFT(@DVal,LEN(@DVal)-@SigDigits) 
SET  @DVal= RIGHT(@DVal,@SigDigits)
END 
 
-- Sum the integer parts (each capped at its rightmost 37 digits so the CAST to DECIMAL(38,0) cannot overflow) and add the carry
SELECT  
'Little' =SUM(CAST(CASE 
	WHEN LEN(SUBSTRING(number,1 ,  CHARINDEX('.',number)-1)) >= 37
		THEN RIGHT(SUBSTRING(number,1 ,  CHARINDEX('.',number)-1),37) 
		ELSE 
		SUBSTRING(number,1 ,  CHARINDEX('.',number)-1) 
		END AS DECIMAL(38,0)))+  @Carryover ,  @DVal AS Decimals
INTO #results
FROM bignumbers
 
 
-- Stitch the integer and fractional sums back together as a string
SELECT CAST(Little AS VARCHAR(38))+'.'+CAST(Decimals AS VARCHAR(38)) AS Final
 FROM #results
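With just the single row inserted above, the Final column simply echoes the original value back, which is an easy sanity check of the split-and-recombine logic; insert the same value twice and you get 246913578024691357802469135780246912.246913578024691357802469135780246912, a 72-digit result that no single DECIMAL column (capped at 38 digits of precision) could represent.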

Proof is in the pudding.

Query to find size of tables in Azure database

Quick and dirty way to find the size of all tables in an Azure database.

DROP TABLE IF EXISTS #list
DROP TABLE IF EXISTS #data
go
CREATE TABLE #data (
name VARCHAR(100),
ROWS VARCHAR(100),
reserved VARCHAR(100),
datasize VARCHAR(100),
indexsize VARCHAR(100),
unused VARCHAR(100))
 
-- Build one 'insert into #data exec sp_spaceused' statement per table (views are excluded since sp_spaceused fails on them)
SELECT ' insert into #data exec  sp_spaceused '''+table_schema+'.'+table_name+'''' AS Query INTO #list   FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'
ALTER TABLE #list 
ADD id INT IDENTITY(1,1)
 
 
-- Run each generated sp_spaceused statement in turn (the identity values in #list start at 1)
DECLARE @COUNT INT =1 
WHILE @count<= (SELECT MAX(id) FROM #list) 
BEGIN
DECLARE @q NVARCHAR(1000) = (SELECT query FROM #list WHERE id = @COUNT)
 
 
EXEC SP_EXECUTESQL @q
SET @COUNT =@COUNT + 1
 
END
 
 
 
SELECT * FROM #data
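Since sp_spaceused returns the sizes as text such as '136 KB', a small cast helps if you want the biggest tables first (a sketch):

SELECT name, [ROWS],
CAST(REPLACE(reserved, ' KB', '') AS BIGINT) / 1024.0 AS reserved_mb
FROM #data
ORDER BY CAST(REPLACE(reserved, ' KB', '') AS BIGINT) DESC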

 

Analyze Actual Execution Plan in SSMS

This might be a hidden gem in SSMS that most people aren't aware of, so I figured I would post about it. I recently used this feature to understand the impact of inaccurate cardinality estimates on the performance of a query. In order to view this option you need to run the query within SSMS with the Actual Execution Plan option selected. Once the plan is displayed you can right-click anywhere within the plan to see the below option.

After this you will see a screen as shown below, where you need to select a single execution plan from the results. Usually there are multiple plans because there is a function call or multiple batches within the query; you can analyze only one at a time.

You will notice from the above screenshot that there is a tab called Scenarios. Click the tab to explore all the places where the Cardinality Estimator might be impacted.

An example would look like below

In this case we can see the joins and other operators are impacted by the difference between estimated and actual row counts, which could point to any of the following:-

  • Parameter sniffing
  • Stale Statistics
  • Cardinality Estimator issues

And this way we can identify some high-level issues within the plan. Please note that right now the tool only identifies issues with the estimator, but in future releases we can expect more robust analysis of other metrics as well. You can also compare differences between plans, which becomes really useful for studying the impact of code changes, post-ETL workloads and parameter sniffing.

References

https://blogs.msdn.microsoft.com/sql_server_team/new-in-ssms-query-performance-troubleshooting-made-easier/