This is a continuation from the previous post where we went and inserted data from multiple CSV files into SQL server tables using linked server connection. Often the next step is to identify how the tables are related to each other. While there is no guarantee that the relationships defined here are 100% accurate it will help identify the obvious ones and make the job easier.
First we get a count of rows in the table, we can use this to compare with counts from distinct values of the columns later. The counts are stored in a table so that we can perform other checks later if needed.
CREATE TABLE [dbo].[TblRowCounts]( [id] [INT] IDENTITY(1,1) NOT NULL, [tabname] [VARCHAR](1000) NULL, [counts] [INT] NULL ) ON [PRIMARY] GO INSERT INTO TblRowCounts ( tabname , counts) EXEC sp_msforeachtable 'select "?",count(*) from ?';
Now that we have the table row counts, we get the distinct Column values counts for each table. If the column is a candidate we should see the numbers match. First we create a table to store the counts for each column
CREATE TABLE [dbo].[ColRowCounts]( [id] [INT] IDENTITY(1,1) NOT NULL, [tabname] [VARCHAR](1000) NULL, [colname] [VARCHAR](100) NULL, [counts] [INT] NULL ) ON [PRIMARY]
Next we calculate the columns distinct values count by using a cursor.
DECLARE @tabname VARCHAR(100) DECLARE @colname VARCHAR(100) DECLARE loopcol CURSOR FOR SELECT TABLE_NAME , COLUMN_NAME FROM INFORMATION_SCHEMA.cOLUMNS WHERE DATA_TYPE not like '%text%' OPEN loopcol FETCH NEXT FROM loopcol INTO @tabname, @colName WHILE @@FETCH_STATUS =0 BEGIN DECLARE @SQL NVARCHAR(4000) = ' insert into ColRowCounts (tabname , colname , counts) select'++CHAR(39)+@tabname+CHAR(39)++' ,'+CHAR(39)+@colname+CHAR(39)+' , count(distinct (['+@colname+'])) from '+@tabname+'' EXEC (@SQL ) FETCH NEXT FROM loopcol INTO @tabname , @colname END CLOSE loopcol DEALLOCATE loopcol
With the row counts inserted into the tables we can now compare the numbers to see columns which have distinct counts the same as the total number of rows in the table , in other words columns with unique values.
SELECT c.tabname , c.colname FROM [dbo].[TblRowCounts] t INNER join ColRowCounts c ON t.tabname = '[dbo].['+c.tabname+']' and t.counts= c.counts
A slightly modified version of this script allows me to script out the alter table command that will convert the above columns to NOT NULLABLE as should be the case based on the numbers we got from the previous scripts.
WITH cte AS ( SELECT c.tabname , c.colname FROM [dbo].[TblRowCounts] t INNER join ColRowCounts c ON t.tabname = '[dbo].['+c.tabname+']' and t.counts= c.counts) SELECT 'ALTER TABLE ' +t.tabname + ' ALTER COLUMN ['+colname+'] ' + DATA_TYPE+ CASE WHEN CHARACTER_MAXIMUM_LENGTH IS NOT NULL THEN +'('+CAST(CHARACTER_MAXIMUM_LENGTH AS VARCHAR(100))+')' ELSE '' END + ' NOT NULL' FROM cte t INNER join INFORMATION_SCHEMA.COLUMNS c ON t.tabname= c.TABLE_NAME and t.colname= c.COLUMN_NAME
With the columns being not nullables they are now ready to have primary keys built on top of them. To build out the script I use the below code block to generate scripts but do not execute them without reviewing the column data first.
WITH cte AS ( SELECT c.tabname , c.colname FROM [dbo].[TblRowCounts] t INNER join ColRowCounts c ON t.tabname = '[dbo].['+c.tabname+']' and t.counts= c.counts) SELECT 'ALTER TABLE ' +t.tabname + ' ADD CONSTRAINT PK_'+REPLACE(colname,' ','')+' PRIMARY KEY CLUSTERED (['+colname+'])', COUNT(*) FROM cte t GROUP BY t.tabname , colname ORDER BY t.tabname
With the Primary keys created the next step obviously is to create the Foreign Keys, in order to do this we rely on the column names being the same in both tables. If the Primary key column name can be found on any other table we assume the two columns are related and go ahead with creating the PK- FK Relationship.
DROP TABLE #fktables go SELECT row_number() OVER( ORDER BY c.table_name) AS rid , 'Alter table ' +c.TABLE_NAME+' Add constraint FK_'+c.TABLE_NAME+REPLACE(c.COLUMN_NAME,' ','')+' FOREIGN KEY (['+c.COLUMN_NAME+']) REFERENCES '+ k.TABLE_NAME+' (['+ k.COLUMN_NAME+']) ' AS def INTO #fktables FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE k INNER join INFORMATION_SCHEMA.COLUMNS c ON c.COLUMN_NAME = k.COLUMN_NAME and c.TABLE_NAME != k.TABLE_NAME ORDER BY c.TABLE_NAME DECLARE @cnt INT =1 WHILE @cnt <= (SELECT MAX(rid) FROM #fktables) BEGIN TRY DECLARE @SQL NVARCHAR(2000) = (SELECT def FROM #fktables WHERE rid =@cnt ) EXEC (@SQL) SET @cnt = @cnt +1 END TRY BEGIN CATCH SET @cnt =@cnt+1 END CATCH
With this we have taken raw data from CSV files imported them into SQL Server and then established meaningful relationships between previously unrelated data in disjoint tables. The rest of the work is pretty much standard for DB developers e.g creating SPs etc.