Monthly Archives: February 2016

Using Openrowset to import data from a CSV file

If you are using a 64-bit edition of Windows then you are most likely using a 64-bit edition of Office, and as a result of Access too. If, however, you have a 32-bit version of Office installed, you will face issues when trying to create an OPENROWSET connection to a CSV file, as I did recently. The solution is to install the 64-bit Access Database Engine from the link below.

https://www.microsoft.com/en-in/download/details.aspx?id=13255

The same link also hosts the 32-bit engine for the reverse scenario.

Once installed you will be able to query the csv file as shown in the screenshot below

using openrowset to query CSV file
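For reference, a minimal sketch of the kind of query that worked once the engine was installed is shown below. The folder (C:\Temp) and file name (data.csv) are placeholders; adjust the provider string to match the Access Database Engine version you installed, and note that ad hoc distributed queries must be enabled on the instance.

-- Assumes the 64-bit Access Database Engine (ACE) provider is installed.
EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
EXEC sp_configure 'Ad Hoc Distributed Queries', 1; RECONFIGURE;

-- C:\Temp and data.csv are placeholder names; point them at your own folder and file.
SELECT *
FROM OPENROWSET(
        'Microsoft.ACE.OLEDB.12.0',
        'Text;Database=C:\Temp;HDR=YES;FMT=Delimited',
        'SELECT * FROM [data.csv]') AS csv;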

This problem manifests itself in a number of ways with different error messages so the best way to implement this is to start with a clean slate.

Loading Resumes and searching them using SQL Server – Part 2/2 (PDF+ FileTable + Full Text Search)

In this post we continue from where we created our FileTable and copied some PDF documents into it. For the previous post please click here. There is very little content available on how to achieve this, so I am going to be as thorough as possible here.

First we need to download the filters that help SQL Server read and understand the varbinary content within the FileTable. The good news is that for MS Office documents the filters come preinstalled and are available by default. For PDF documents, however, there is a lot of confusion about what to install and how to go about it.

STEP 1:- Check whether the IFilter for Adobe PDF files is already installed by running the query below.

select document_type, path from sys.fulltext_document_types
where document_type like '%pdf%'

STEP 2:- If it's not installed, download and install the IFilter (PDFiFilter64Installer.zip) from the location below.

There is a known bug in the version 11 IFilter when used with SQL Server 2014, so in this case I am using version 9. Also, the links posted on most websites are broken, so I have provided the direct FTP link here, which hosts all the versions. Please select the version you need.

ftp://ftp.adobe.com/pub/adobe/acrobat/win/9.x/

STEP 3:-

Once the above setup file is installed, you will need to add the IFilter location to the PATH environment variable. This is the most important step.

 

If you simply create a full text index after installing the above filter, you will encounter one of the error messages below in your full-text search logs, located in this folder:

C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\Log\SQLFT000XXXXX.LOG

Error '0x8004fd02: The filter daemon MSFTEFD failed to load an IFilter interface for document, so it can't be indexed.'

No appropriate filter was found during full-text index population for table or indexed view

This is because SQL Server is not able to find the location of the filter within the environment variables. To correct the above issues, add the path of the bin folder of the IFilter installation to the PATH environment variable:

C:\Program Files\Adobe\Adobe PDF iFilter 9 for 64-bit platforms\bin\

Steps are shown in the screenshot below.

022116_0942_LoadingResu1.png

STEP 4:- Restart Full text search and MS SQL Server Services

 

022116_0942_LoadingResu2.png

STEP 5:- Initialize the full text search modules to detect the new filter

EXEC sp_fulltext_service 'load_os_resources', 1;
EXEC sp_fulltext_service 'verify_signature', 0;
EXEC sp_fulltext_service 'restart_all_fdhosts';

Once the above code has been run, re-run the query from step 1 to verify that the filter and its path now appear in the results.

STEP 6:- Create the full text index on the FileTable by right-clicking it and selecting Define Full-Text Index, as shown below:

create a full text index in SQL Server

For steps on how to create a full text index click here.
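If you prefer T-SQL over the wizard, a minimal sketch is shown below. It assumes the dbo.ProfileResumes FileTable created in the previous post; the catalog name and the unique key index name are placeholders that you need to replace with the actual names on your system.

-- A minimal T-SQL sketch, assuming the dbo.ProfileResumes FileTable from part 1.
-- Find the real unique key index name with: EXEC sp_help 'dbo.ProfileResumes'
CREATE FULLTEXT CATALOG ResumeCatalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.ProfileResumes
(
    name LANGUAGE 1033,
    file_stream TYPE COLUMN file_type LANGUAGE 1033
)
KEY INDEX PK__ProfileR__placeholder   -- replace with the actual unique key index name
ON ResumeCatalog
WITH CHANGE_TRACKING AUTO;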

STEP 7:- Run a Full text search query to verify the functionality is working as expected.

022116_0942_LoadingResu4.png
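As a rough illustration, a query along these lines can confirm the PDF content is being indexed; the keyword 'SQL' is just a sample term to search for.

-- A hedged example: search the document content for a sample keyword.
SELECT name, file_type
FROM dbo.ProfileResumes
WHERE CONTAINS(file_stream, 'SQL');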

References

 

https://www.adobe.com/devnet-docs/acrobatetk/tools/AdminGuide/Acrobat_Reader_IFilter_configuration.pdf

https://support.microsoft.com/en-us/kb/972650

The above link is not required in most cases, but it is still worth looking into in case you get the error messages listed above.

Loading Resumes and searching them using SQL Server

Recently I came across a requirement where we needed to load resumes into SQL Server and then search these documents for specific keywords. Important constraints: we were using SQL Server 2014, the resumes were in PDF format, and the IO needed to be efficient. In this series of posts I will show how we went about the process step by step.

The first step is to create a repository for the resumes; naturally we have a few options within SQL Server to achieve this. We could use a varbinary column, but that works well only when resumes are small (say around 1 MB), and that was an assumption we couldn't make. LOB data stored in the table would also add serious overhead to searches, and we wanted to keep the IO activity on the table efficient. Finally, resumes might be imported in bulk, and we needed a way to achieve that as easily as possible.

Recently I have been using this approach in a number of cases, for everything from resume search to document search, browsing crawled pages and much more. It's a good example of how SQL Server can work well with unstructured data too. The latest example is where I downloaded all my emails using the Gmail API and loaded them into SQL Server to find out which accounts send me the most spam and which contacts I interact with professionally as well as personally. Check out the post where I created a network graph within SQL Server, linked in the References section below.

Since we were using SQL Server 2014 we decided to go with FileTables.

First we enabled Filestream

See the screenshot below to enable Filestream from the Configuration Manager, then set the access level with the query below.

 

sp_configure 'filestream access level', 2
GO
RECONFIGURE

Next we created the database and added a Filestream Filegroup

 

ALTER DATABASE [ResumeDB] ADD FILEGROUP [Resumes] CONTAINS FILESTREAM
GO

Next we add a file (actually a folder) to the filegroup (replace the path below as needed).

ALTER DATABASE [ResumeDB]
ADD FILE (NAME = filestream_data,
FILENAME = 'C:\Program Files\Microsoft SQL Server\DATA\Resumes') TO FILEGROUP Resumes

Next we set the Filestream directory name and set non-transacted access to FULL, so that files can be added directly to the folder without having to go through SQL Server.

ALTER DATABASE [ResumeDB]
SET FILESTREAM (NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'Resumes')

Next we create an actual FileTable within SQL Server. This automatically captures metadata about each file (size, type, attributes, etc.), keeps track of nested folders, and exposes functions to navigate the filesystem. More importantly, we can add files to the table without having to issue Insert/Update/Delete statements; a direct copy-paste into the folder is automatically reflected in the table.

CREATE TABLE dbo.ProfileResumes AS FILETABLE
WITH
(
FILETABLE_DIRECTORY = 'Resumes',
FILETABLE_COLLATE_FILENAME = database_default
)
GO

Loading files into the file table (simply copy paste)

loading files into file tables
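Once a few files have been copied in, a quick query against the FileTable's built-in metadata columns (these column names are part of the fixed FileTable schema) confirms they landed:

-- The FileTable exposes fixed metadata columns such as name, file_type and cached_file_size.
SELECT name,
       file_type,
       cached_file_size,
       is_directory,
       creation_time
FROM dbo.ProfileResumes;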

 

 

Now that we have a place to save our resumes, the next step is to enable searching the actual document content, which we will explore in the next post.

References

 

http://enabledbusinesssolutions.com/blog/creating-a-network-graph-in-sql-server/

Video on Memory Optimization for SSIS packages

Improving RAM Utilization within SSIS packages using data types. In this video we explore the impact of incorrect data types on how SSIS allocates RAM and the simple and easy fix to the problem.

This video demonstrates how using incorrect data types causes SSIS to overestimate the amount of RAM needed and, as a result, causes the package to throw an out-of-memory exception. We then go on to show how using the correct data types helps SSIS overcome the issue.

https://www.youtube.com/watch?v=UlA48CGHtbE

Chinese Hackers(Cracker) and Azure, have I hit the bigtime??

I have a website hosted on Azure which I use for my trainings. Today I received a number of alerts from Azure showing high CPU usage for the service in question. After investigating the error logs, it turned out to be a brute force attack against the database. The IP address of the attacker is Chinese, but no surprise there. The good thing is I already have a number of built-in security features enabled within the system, so after a number of attempts the server simply shut down, restarted, and that was that.

The good thing is that this was a honeypot meant exactly for this purpose, and now that I know the site is actively being targeted, I can take additional steps to harden the security. The cloud is only as secure as your implementation; in fact, in most cases it is more secure, but that doesn't mean it is immune. So do yourself a favor and check whether you have implemented at least the following in your environments:

  • Enable firewalls
  • Disable sa or admin accounts and replace them with non-conventional names
  • Use non-standard ports
  • Don't host the database and the website on the same machine (domain names give away the IP, and the IP can be used to connect to databases)
  • Enable auditing, and look at the audit logs every once in a while
  • Separate data into parts that would compromise your business and parts that would not, and follow different processes for each
  • Try hacking your own site / database – don't know how? Google it
  • Localize for your geography – never had customers from China? Then don't allow Chinese IPs
  • Change your passwords regularly and don't use the same password everywhere
  • Share as little info on the internet as possible (esp. social media)

To the question, have I hit the bigtime? No. The fact that Chinese hackers are trying to attack my site doesn't mean anything; they simply try every site they can lay their hands on. The key to avoiding data theft is to make it a major pain to attempt. If it takes me 20 hours to penetrate your network, I had better be getting paid for it, or else it's just not worth my time.

PS:- You don't need to have a site to get hacked; email accounts are much more enticing than databases.

 

Airline On-Time Performance and Other Stats

Recently I came across the website of the Bureau of Transportation Statistics and downloaded the airline on-time performance data for the last 3 years. Below is a snapshot of some of the information derived from this data.

After verifying the data against other sources on the internet, I can say that we have got most of it right. There is obviously a ton of additional analytics that can be performed on this data set. The total size of the database is 10 GB, and most of the reports below come from an Analysis Services cube.
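As a rough idea of how numbers like these can be derived relationally, a sketch along the following lines produces the "cities covered per airline" count; the table and column names (dbo.FactOnTime, Carrier, DestCityName) are hypothetical and not the actual schema or cube used here.

-- Hypothetical flattened fact table; the real reports were built from an SSAS cube.
SELECT Carrier,
       COUNT(DISTINCT DestCityName) AS CitiesCovered
FROM dbo.FactOnTime
GROUP BY Carrier
ORDER BY CitiesCovered DESC;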

Biggest Airlines by

Number of Cities Covered

Airline                         Cities Covered
SkyWest Airlines Inc.           204
ExpressJet Airlines Inc.        178
Delta Air Lines Inc.            152
Envoy Air                       131
American Airlines Inc.          95
United Air Lines Inc.           92
Southwest Airlines Co.          85
US Airways Inc.                 78
Alaska Airlines Inc.            66
JetBlue Airways                 61
Frontier Airlines Inc.          61
Spirit Air Lines                34
Virgin America                  17
Hawaiian Airlines Inc.          17

Number of Flights Operated

020916_1038_AirlineOnTi1.png

Fleet Strength

fleet strength for airlines

Average number of Flights per Aircraft

Airlines data on time performance

Most popular Cities as Destination

City                        Flights
Chicago, IL                 369337
Atlanta, GA                 348468
Dallas/Fort Worth, TX       239940
Houston, TX                 199333
Denver, CO                  196134
Los Angeles, CA             195380
New York, NY                194200
San Francisco, CA           148303
Phoenix, AZ                 146426
Las Vegas, NV               133811

Least popular Cities as Destination

City                        Flights
Mammoth Lakes, CA           139
Pago Pago, TT               106
Wilmington, DE              98
Adak Island, AK             96
Hyannis, MA                 83
St. Cloud, MN               82
Gustavus, AK                77
Dillingham, AK              77
King Salmon, AK             63
Ithaca/Cortland, NY         12

Cities with most delays (all reasons, all flights)

Origin City                 Delayed Minutes   Delayed Flights   Avg Delay/Flight (min)
Chicago, IL                 5452156           160916            33.88
Atlanta, GA                 3572648           127256            28.07
Dallas/Fort Worth, TX       3174609           95349             33.29
New York, NY                2790672           70500             39.58
Houston, TX                 2659053           85188             31.21
Denver, CO                  2575669           87203             29.54
Los Angeles, CA             2359480           79643             29.63
San Francisco, CA           1819348           56541             32.18
Las Vegas, NV               1700751           55918             30.42
Phoenix, AZ                 1542966           58498             26.38

Cities with most weather delays

City                        Delayed Minutes
Chicago, IL                 432091
Atlanta, GA                 260577
Dallas/Fort Worth, TX       258031
New York, NY                154452
Houston, TX                 106600
Orlando, FL                 101593
Denver, CO                  79801
Washington, DC              73884
Baltimore, MD               62147
Newark, NJ                  55081
Miami, FL                   51891

Airlines with most delays

Airline                          Dep Delay Minutes   Late Aircraft Delay   Late Arrival %   Weather %   NAS %   Security %
Southwest Airlines Co.: WN       13751710            5637097               41%              4%          11%     0%
American Airlines Inc.: AA       7553314             2544657               34%              6%          21%     0%
United Air Lines Inc.: UA        7514417             2433827               32%              4%          18%     0%
Delta Air Lines Inc.: DL         7496463             1881465               25%              7%          21%     0%
ExpressJet Airlines Inc.: EV     6354042             2404925               38%              2%          25%     0%
SkyWest Airlines Inc.: OO        5982698             2545780               43%              4%          19%     0%
Envoy Air: MQ                    3549468             1347362               38%              10%         25%     0%
JetBlue Airways: B6              3362118             1219179               36%              3%          27%     0%
Spirit Air Lines: NK             2001802             634822                32%              2%          44%     0%
US Airways Inc.: US *            1783141             523060                29%              4%          34%     0%
Frontier Airlines Inc.: F9       1396298             570711                41%              1%          38%     0%
Alaska Airlines Inc.: AS         1030534             349084                34%              3%          26%     1%
Virgin America: VX               596614              208708                35%              5%          33%     0%
Hawaiian Airlines Inc.: HA       317553              122332                39%              4%          2%      0%

* US Airways merged with America West 9/05; reporting for both starting 10/07.


What is a PK?

Recently someone posted a question on one of the popular LinkedIn groups about one of the most fundamental structures in an RDBMS: the PK. It was followed by a number of comments, and while the comments were essentially saying the same old things we all know about the PK, it wasn't being said well (in my opinion), so here I try to discuss some fundamental things everybody should know about the PK, or primary key.

Before we get into the Primary key let’s just talk about the meaning of the word Key.

Ok, so what is a key column?

In essence, a key is a column (or set of columns) whose combination of values can be used to identify rows within a table.

If a key has more than one column it's called a composite key; for example, a composite primary key is a PK with more than one column.

What if I have two columns either of which can be used to identify rows within a table?

Then the correct term would be you have two candidate key columns either of which independently can be used to identify the rows.

How is that different from a business/natural key?

A business/natural key is a column (or columns) that your business understands or uses to identify a row within the table. For example, customers may register on a site using their email ids, and the business can identify a customer by looking up their email id. However, it might not be the best column to use throughout your system, since the business may choose to use the email id differently, or it might need to be encrypted, etc.

Instead you create an artificial key column that helps isolate the business process from the internals of the database design. This artificial key is called a surrogate key.

So how does this apply to primary keys vs foreign keys?

The difference is that in a primary key the unique combination of values from the key columns identifies exactly one row in the table, while a foreign key may point to one or more rows. In addition, a FK only accepts values that are already present as part of a PK (or UNIQUE constraint) on some other table.

Ahhh, but you forgot to mention that a PK also has a clustered index on it.

No, I didn’t. You don’t have to have a clustered index on a PK Column(s). Checkout the screenshot below, the clustered index is created on the names column but the PK is the ID column also notice that is uses a Unique NonClustered index now.

020716_1357_WhatisaPK1.png
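A minimal sketch of the same idea in T-SQL is shown below; the table and column names are made up purely for illustration.

-- Hypothetical table: the PK is nonclustered, while the clustered index lives on Name.
CREATE TABLE dbo.Customer
(
    ID   INT          NOT NULL,
    Name VARCHAR(100) NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY NONCLUSTERED (ID)
);

CREATE CLUSTERED INDEX CIX_Customer_Name ON dbo.Customer (Name);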

 

So you’re saying we can use a Unique Non Clustered Index to establish referential integrity between tables?

Yes, tables can be related to each other without having to use an actual primary key. See the screenshot below; notice that the database diagram tooltips say Primary OR Unique Constraint.

Pk vs UNQ index

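For a concrete (hypothetical) illustration, a foreign key can reference a UNIQUE constraint rather than a primary key:

-- Hypothetical tables: the FK on Orders references a UNIQUE constraint, not a PK.
CREATE TABLE dbo.Product
(
    ProductID   INT         NOT NULL,
    ProductCode VARCHAR(20) NOT NULL,
    CONSTRAINT UQ_Product_Code UNIQUE (ProductCode)
);

CREATE TABLE dbo.Orders
(
    OrderID     INT         NOT NULL,
    ProductCode VARCHAR(20) NOT NULL,
    CONSTRAINT FK_Orders_Product FOREIGN KEY (ProductCode)
        REFERENCES dbo.Product (ProductCode)
);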

 

But surely it's got to be unique and NOT NULL?

Yes, it has to be unique: e.g. if you have two columns, then the value from Col A combined with the value from Col B should be unique across all rows in the table, and the columns cannot accept any NULL values.

Is that it?

Just one more thing: a table can have only one PK. Why? It is defined that way in the ANSI standard, that's all. So it's not related to the fact that a table can have only one clustered index? Not really; ANSI doesn't make any recommendations with regard to indexes.

So should every table have a PK?

If you have followed normalization, ideally the combination of all the columns within the table would be unique; however, it is a lot easier to refer to just one column than to the entire set every time you want to check. It helps simplify validation, joins, and a lot of internal business logic as well.

Is there any drawback to creating a PK on every table?

The size used by the PK can be a problem for very large tables if the column has not been selected properly. Once implemented, it is very difficult to undo an incorrect choice of PK, since the columns will be used in many joins and WHERE conditions within the system. Because of the default behavior of creating a clustered index on the PK columns, there is an IO impact as well if the columns are not chosen properly.

If I am only looking for referential integrity, should I bother with a PK when I can achieve the same using a UNIQUE constraint?

While the end result of either would be similar in most cases, keep in mind that a UNIQUE constraint allows a single NULL and is implemented by default using a nonclustered index, which has size and IO implications. So, if your data allows it and you have accounted for things like index fragmentation, still consider using a PK before a UNIQUE constraint.

Very large table performance – basic data capture

Every once in a while I see a post on the forums that is similar to the one below:-

Hi

Can anybody provide tips on how to improve performance on a very large table, the table has 20 million rows in it.

Thanks

Most of the first 10 replies are basically just people asking for more info. So I wrote the script below to help both the OP and the people providing answers with some context. The number of rows alone is not an accurate indicator of the size of the table or of the kinds of performance issue that might be encountered on it.

DECLARE @TableName VARCHAR(50) = 'FactAirline'

SELECT DISTINCT
@@VERSION AS Version,
t.name AS 'tbl_name', -- table name for obvious reasons
'ix_type' = i.type_desc, -- the format of the data stored in the table: heap vs clustered vs nonclustered
t.OBJECT_ID, -- object id for the table in case we need it later
max_column_id_used AS '#_of_columns', -- number of columns, to figure out if we are dealing with a wide table (fact) vs a deep table (OLTP transaction table)
ps.partition_id, -- partition id, to know if the table is already partitioned and the indexes are aligned
p.ROWS AS '#_of_rows', -- number of rows, because "large" is subjective per user
ps.used_page_count, -- actual number of pages used, to figure out rows/page and understand the size of the table (1 million rows can occupy fewer than 10 pages or more than 100 pages)
ps.in_row_used_page_count*8 AS 'data_ix_used_size_in_kb', -- size of the object, a true measure of a big table
CAST(ps.in_row_used_page_count AS VARCHAR(20)) + ' | ' + CAST(ps.lob_used_page_count AS VARCHAR(20)) + ' | ' + CAST(ps.row_overflow_used_page_count AS VARCHAR(20)) AS in_vs_lob_vs_overflow_page_count, -- pages in each category (helps identify whether the data is traditional rows or LOB data)
i.name AS 'ixname',
ip.alloc_unit_type_desc, -- describes the allocation unit type being analyzed
ip.index_level, -- the level at which fragmentation occurs: 0 = leaf, highest number is the root
ip.avg_fragmentation_in_percent, -- index fragmentation level, to identify if the index is the cause of the performance issue
stp.Statsinfo -- stats info, to identify whether the stats are up to date; if not, that might be the reason for the issue
FROM sys.tables t
LEFT OUTER JOIN sys.partitions p
ON t.OBJECT_ID = p.OBJECT_ID
LEFT OUTER JOIN sys.dm_db_partition_stats ps
ON p.partition_id = ps.partition_id
LEFT OUTER JOIN sys.indexes i
ON ps.index_id = i.index_id
AND ps.OBJECT_ID = i.OBJECT_ID
CROSS APPLY (SELECT * FROM sys.dm_db_index_physical_stats(DB_ID(), t.OBJECT_ID, ps.index_id, p.partition_number, 'DETAILED')) ip
LEFT OUTER JOIN sys.stats st
ON st.OBJECT_ID = t.OBJECT_ID
AND st.name LIKE '%' + i.name + '%'
CROSS APPLY (SELECT (SELECT rows_sampled, ROWS, create_date, steps, modification_counter FROM sys.dm_db_stats_properties(t.OBJECT_ID, st.stats_id) dd FOR XML PATH('StatsInfo'), ROOT('Root')) AS Statsinfo) stp
WHERE t.name = @TableName
ORDER BY ix_type, i.name

Details of why each column was included and how to read it are mentioned below.

Version:- This column indicates the version, edition, architecture and service pack of SQL Server, and the OS level. This is probably the most important column, since different OS and SQL combinations have different features and memory/CPU capacities, so it helps eliminate solutions that are not relevant (e.g. partitioning on Standard edition).

Tbl_name:- The Name of the table, useful if meaningful naming conventions are used e.g. Fact_internetSale, CustomerProfiles etc. This can be used to infer or make certain assumptions on the nature of the table.

Ix_type:- Lets you know whether the table has an index and, if so, what type. Pay attention to the type, since columnstore indexes etc. do not display all the info and are treated differently. There will be one row per index.

Object_id:- Just the id of the table for future reference

#_of_Columns:- Indicates whether we are dealing with a wide table (typical of a DW) or a narrow table (typical of an OLTP system).

Partition_id:- Multiple values per ix_type indicate that the table is partitioned and the index is aligned; this helps avoid suggestions that are already implemented.

#_of_rows:- An estimate of the number of rows in the table, to provide an idea of the average number of rows per page.

Used_page_count:- Number of pages used by this table, an indication of the true on-disk size of the table.

Data_ix_used_size_in_kb:- Indicates the size of the table/index on disk for both data and index. A true metric for the size of the table.

In_vs_lob_vs_overflow_page_count:- Number of pages allocated as in-row, LOB or row-overflow. Helps identify whether the table is large because of special data types.

Ix_name:- Identifies the index the following columns refer to; there may be multiple rows based on the number of levels in the index.

Avg_fragmentation_in_percent:- Indicates the level of fragmentation at each level of the B-tree; helps identify whether a simple index rebuild or reorganize might fix the issue.

Stats Info:- Provides details of the statistics for the index, like the number of rows sampled and the modification counter, which indicate how recently the stats were updated.

WARNING:- The script has been tested on SQL Server 2014; certain DMVs do not exist in older versions, and tables with columnstore indexes are handled differently, so modify as needed.

 

Session State request to session state server

Problem

Unable to make the session state request to the session state server. Please ensure that the ASP.NET State service is started and that the client and server ports are the same. If the server is on a remote machine, please ensure that it accepts remote requests by checking the value of HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\aspnet_state\Parameters\AllowRemoteConnection. If the server is on the local machine, and if the before mentioned registry value does not exist or is set to 0, then the state server connection string must use either 'localhost' or '127.0.0.1' as the server name.

Solution

The ASP.NET State Service is not running on the machine where the developer is debugging.

Simply go to Start > Run > Services.msc, then right-click the ASP.NET State Service and start it.

ASP.Net SessionStateServices

Mostly required when the application's session state mode is set to StateServer.