Finding candidate keys/Primary keys from data that needs to be imported

By | January 3, 2019

This is a continuation from the previous post where we went and inserted data from multiple CSV files into SQL server tables using linked server connection. Often the next step is to identify how the tables are related to each other. While there is no guarantee that the relationships defined here are 100% accurate it will help identify the obvious ones and make the job easier.

First we get a count of rows in the table, we can use this to compare with counts from distinct values of the columns later. The counts are stored in a table so that we can perform other checks later if needed.

CREATE TABLE [dbo].[TblRowCounts](
	[id] [INT] IDENTITY(1,1) NOT NULL,
	[tabname] [VARCHAR](1000) NULL,
	[counts] [INT] NULL
) ON [PRIMARY]
GO
 
INSERT INTO TblRowCounts  ( tabname , counts)
EXEC sp_msforeachtable 'select "?",count(*) from ?';

Now that we have the table row counts, we get the distinct Column values counts for each table. If the column is a candidate we should see the numbers match. First we create a table to store the counts for each column

CREATE TABLE [dbo].[ColRowCounts](
	[id] [INT] IDENTITY(1,1) NOT NULL,
	[tabname] [VARCHAR](1000) NULL,
	[colname] [VARCHAR](100) NULL,
	[counts] [INT] NULL
) ON [PRIMARY]

Next we calculate the columns distinct values count by using a cursor.

DECLARE @tabname VARCHAR(100) 
DECLARE @colname VARCHAR(100)
 
DECLARE loopcol CURSOR FOR 
SELECT TABLE_NAME , COLUMN_NAME FROM INFORMATION_SCHEMA.cOLUMNS
WHERE DATA_TYPE not like '%text%'
 
OPEN loopcol
FETCH NEXT FROM loopcol INTO @tabname, @colName
 
WHILE @@FETCH_STATUS =0
BEGIN
DECLARE @SQL NVARCHAR(4000) = ' insert into ColRowCounts (tabname , colname , counts) select'++CHAR(39)+@tabname+CHAR(39)++' ,'+CHAR(39)+@colname+CHAR(39)+' , count(distinct (['+@colname+'])) from '+@tabname+''
EXEC (@SQL )
 
FETCH NEXT FROM loopcol INTO @tabname , @colname
 
END 
CLOSE loopcol 
DEALLOCATE loopcol

With the row counts inserted into the tables we can now compare the numbers to see columns which have distinct counts the same as the total number of rows in the table , in other words columns with unique values.

SELECT c.tabname , c.colname
FROM [dbo].[TblRowCounts] t
INNER join ColRowCounts c
ON t.tabname = '[dbo].['+c.tabname+']'
and t.counts= c.counts

A slightly modified version of this script allows me to script out the alter table command that will convert the above columns to NOT NULLABLE as should be the case based on the numbers we got from the previous scripts.

WITH cte AS (
SELECT c.tabname , c.colname
FROM [dbo].[TblRowCounts] t
INNER join ColRowCounts c
ON t.tabname = '[dbo].['+c.tabname+']'
and t.counts= c.counts)
 
SELECT 'ALTER TABLE ' +t.tabname + ' ALTER COLUMN ['+colname+'] ' + DATA_TYPE+ CASE WHEN CHARACTER_MAXIMUM_LENGTH  IS NOT NULL THEN 
+'('+CAST(CHARACTER_MAXIMUM_LENGTH AS VARCHAR(100))+')'
ELSE ''
END + ' NOT NULL'
FROM cte  t
INNER join INFORMATION_SCHEMA.COLUMNS c
ON t.tabname= c.TABLE_NAME
and t.colname= c.COLUMN_NAME

With the columns being not nullables they are now ready to have primary keys built on top of them. To build out the script I use the below code block to generate scripts but do not execute them without reviewing the column data first.

WITH cte AS (
SELECT c.tabname , c.colname
FROM [dbo].[TblRowCounts] t
INNER join ColRowCounts c
ON t.tabname = '[dbo].['+c.tabname+']'
and t.counts= c.counts)
 
SELECT 'ALTER TABLE ' +t.tabname + ' ADD CONSTRAINT PK_'+REPLACE(colname,' ','')+' PRIMARY KEY CLUSTERED (['+colname+'])', COUNT(*)
FROM cte  t
GROUP BY t.tabname , colname
ORDER BY t.tabname

With the Primary keys created the next step obviously is to create the Foreign Keys,  in order to do this we rely on the column names being the same in both tables. If the Primary key column name can be found on any other table we assume the two columns are related and go ahead with creating the PK- FK Relationship.

DROP TABLE #fktables
go
SELECT row_number() OVER( ORDER BY c.table_name) AS rid ,  'Alter table ' +c.TABLE_NAME+' Add constraint FK_'+c.TABLE_NAME+REPLACE(c.COLUMN_NAME,' ','')+' FOREIGN KEY (['+c.COLUMN_NAME+']) REFERENCES '+ k.TABLE_NAME+' (['+ k.COLUMN_NAME+']) ' AS def
INTO #fktables
FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE k
INNER join INFORMATION_SCHEMA.COLUMNS c
ON c.COLUMN_NAME = k.COLUMN_NAME
and c.TABLE_NAME != k.TABLE_NAME
ORDER BY c.TABLE_NAME
 
DECLARE @cnt INT =1
WHILE @cnt <= (SELECT MAX(rid) FROM #fktables)
BEGIN TRY 
DECLARE @SQL NVARCHAR(2000) = (SELECT def FROM #fktables WHERE rid =@cnt )
EXEC (@SQL)
SET @cnt = @cnt +1 
END TRY
BEGIN CATCH 
SET @cnt =@cnt+1 
END CATCH

 

With this we have taken raw data from CSV files imported them into SQL Server and then established meaningful relationships between previously unrelated data in disjoint tables. The rest of the work is pretty much standard for DB developers e.g creating SPs etc.