Monthly Archives: September 2015

Full Text Search on Varbinary (max) Columns and using the Right filters

I love Full text search in MS SQL Server. Although I haven’t seen it being used a lot in client locations, I do use it a lot within my own company as well as for some of my clients who have very specific requirements. One of the more common requirements we find is document search. Typically this would be like searching a resume or a product catalog for products that meet a specific filter criteria. While the solution is a mix of multiple features like FILESTREAM or Filetable and Full Text Search; in this example I am focusing mostly on the Full Text Search aspect of it.

The Important thing to keep in mind when working with FULLTEXT search on varbinary columns is that it is not the same as FTS on a varchar column. Mainly because the data is stored in VARBINARY format FTS requires a file type column to be defined as well. This column helps full text search identify the type of file the content comes from.


CREATE TABLE [dbo].[filecontent](
[fileid] [int] NULL,
[Filecontent] [varbinary](max) NULL,
[filevid] [int] IDENTITY(1,1) NOT NULL,
[doctype] [varchar](10) NULL,
PRIMARY KEY CLUSTERED
(
[filevid] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO


For example in the above create table script the doctype column of varchar (10) helps differentiate the contents as coming from a pdf file or a doc file or docx file. Without this column you wouldn’t be able to proceed with creating the full text index on varbinary columns.

Once you have a filetype or doc type column the next thing to keep in mind is that by default FTS doesn’t understand all doc types. To know which doc types FTS understands you can run the below query.


select * from sys.fulltext_document_types


If you do not find the filetype you are looking for, e.g. in this example docx, you can download the filter from the MS website.

http://www.microsoft.com/en-us/download/details.aspx?id=17062

Once the appropriate filter has been installed you need; run the below commands in SSMS


EXEC sp_fulltext_service 'update_languages';
EXEC sp_fulltext_service 'load_os_resources', 1;
EXEC sp_fulltext_service 'restart_all_fdhosts';


And then restart the Full text Service within the service manager console. If you already have a FT index on the table then you need to drop and recreate it.

In order to create the full text index you can simply right click the table and select Full Text Index > Define Full text Index

Make sure you have PK on the table before pressing Next

Select the Unique Index you want to use and then Press Next

In the Type Column select the name of the Column which stores the file type, and select the language to use.

If you want FTS to automatically track changes to the file content select the same and press Next, choose Manual if the file contents are updated infrequently.

Select the FTS catalog you have created else check “Create a new Catalog” and give it a name. Press Next

If you would like the FTS to be populated at a predefined schedule the configure it in the below screen, it’s very similar to scheduling a job. Press Next

You will see a summary

Press Finish

Once the FTS index is created, you can force the index to be populated by – clicking Start Full Population below.

Once the index is populated you can query the FTS catalog as shown below

Some common example where you can use Full text search is to search for digital documents in an online library as I am doing above, I am searching my training content for posts related to SQL Server Analysis Services. Another way this was implemented was for a client of mine who used expensive parsing software to extract resume content and store the data in a varchar column to perform searches, once we migrated to FileTable and full text search we were able to improve search relevance as well as overall ETL performance by a factor of 10.

I recently use a combination of Full text search + SSIS Fuzzy grouping + term extraction to generate a list of Tags for all the blog posts I have uploaded so far. Naturally there are other ways you can use FTS as well a simple example would the search bar on most websites where the html content is saved in a varbinary column so that relevant web pages can be displayed in the search results much like how google indexes web pages.

Have Databases become a glorified versions of EXCEL?

Recently I was having a discussion about how coding and design patterns have changed in RDMBS and if we can still call it RDBMS anymore. Very early in my career while attending an interview I was asked what a relational database is, my answer was “anything that stored data in a row and column format and dependencies can be established between these tables would be a relational database”. The Interviewer laughed at my answer and said by my definition EXCEL would also be an RDBMS to which I replied yes, even though I knew it wasn’t, I didn’t get through, but 10 years later worked for that company as a consultantJ. While the interview obviously let me know that my fundamental understanding of RDBMS was flawed I couldn’t help but ask myself why isn’t Excel an RDBMS, I can certainly do a lot of the stuff I do within SQL Server within it.

Coming back to my discussion we mainly talked about how PK and FKs are rarely implemented anymore, how CRM applications create copies of the tables and how consistency isn’t as important as it was before. As we discussed further I started comparing the databases and Entity models created now with Excel and here are some of the comparison I made:-

-An excel work book is a database it can be copied, migrated, upgraded, it can be backed up and password protected (not exactly TDE but still better than nothing).

– A Worksheet within a work book is like a table, it stores rows and columns and column headers.

– You can have computed columns within the worksheet by adding a formula to the column, data in the tables can be grouped, sorted (order by) and filtered (where condition).

-You can JOIN data from two worksheets using VLOOKUP function

– You can create a view by building a Pivot table on top of the existing data from a worksheet.

– You can create PK and FK type relationships within Excel by using a Drop down list and referencing the dataset off the main worksheet.

-You can have programmatic access to data within the worksheets using MACROS which can be saved just like Stored Procedure, the Macro is specific to the Workbook just like procedures are specific to the database.

– You have checkpoint like behavior with Auto save.

– You have atomicity in Excel by default since DML to any cell within the workbook is all or nothing and each cell is its own transaction.

– Isolation is achieved because when you open a Network shared Excel Workbook it opens in read only mode. Think of the database transactions as being is serializable isolation level.

– Durability is achieved because unlike Auto save the actual commit is implemented by pressing the save button itself, more like an explicit commit in the DB

– Consistency – While not truly following consistency, the data still returns to consistent state if you rollback ( Ctrl +Z or Undo) , circular references and PK and FK issues raise error thus preventing consistency issue to begin with.

With ACID making way for BASE it’s acknowledged that Traditional RDBMS rules have been too strict to work well within the design challenges of a highly concurrent distributed landscape we have today. While Excel isn’t BASE I think the future of databases is Excel-Like data stores that provide developers with the flexibility the need to store any kind of data any way they see fit. In other words the rules of the RDBMS are decided by the developer using the database at design time and not a constraint within which the developer has to code.

References

https://support.office.com/en-in/article/Create-a-drop-down-list-7693307a-59ef-400a-b769-c5402dce407b

 

Why VW emissions fraud is a good thing for climate change.

What does 51700 kilotons have to do with this post?

That is the min total CO2 Emissions/year from the 11 Million cars that VW says had the cheating device.

So what do the below countries have in common?

They each produce an estimated 51700 kt of CO2 emissions annually. VW and the fraud that happened, could easily wipe out the annual CO2 Debt of all of these countries combined.

Lesotho , Kiribati , São Tomé and Príncipe , Marshall Islands , Federated States of Micronesia , Vanuatu , Dominica , Comoros , Tonga , Samoa , Turks and Caicos Islands , Timor-Leste , Solomon Islands , Saint Vincent and the Grenadines , Palau , Guinea-Bissau , Saint Kitts and Nevis , Grenada , Central African Republic , Burundi , Cape Verde , Saint Lucia , Belize , Chad , Gambia , Bermuda , Bhutan , Antigua and Barbuda , Eritrea , Andorra , Djibouti , Cayman Islands , Rwanda , Somalia , Mali , Greenland , Sierra Leone , Seychelles , Faroe Islands , Liberia , French Polynesia , Swaziland , Macau , Maldives , Guinea , Malawi , Fiji , Niger , Barbados , Togo , Burkina Faso , Guyana , Laos , Iceland , Madagascar , Republic of the Congo , Haiti , Mauritania , Aruba , Palestine , Suriname , Zambia

I wonder how many VWs were sold to the countries above. I wonder if they even had a dealership.

So what does this mean?

Looks like the governments can talk all they want about climate change but it’s not going to mean much until the private sector is also equally committed. It’s surprising that one corporate entity had the ability to do so much damage. But when you think about it, it’s not really shocking remember how FB, Google and MS ended up giving the NSA details of almost every internet communication of people around the world. The larger the corporate the larger impact they have on global users too and their moral slant or that of those with power over them can impact everybody else. Remember the BP oil disaster in Gulf of Mexico or the Banks during the Subprime mortgage crisis?

When the global recession happened I was shocked that my job in an IT Firm that had nothing to do with banks could hang in the balance because some Board Members sitting in a Bank decided to take risks they can’t afford. As citizens of the world we need to understand that we are far more connected than ever before (2 weeks ago markets around the world crashed when China devalued their currency). And the CEOs around the world need to listen to Uncle Ben “with great power comes great responsibility”.

So what’s next?

This would be great time for Governments to really make BIG CAR/ BIG OIL start delivering on the promise of clean tech.

After all there is nothing more justified than pseudo moral outrage to teach somebody a lesson or get people who don’t really care to buy in. If you doubt this ask Colin Powell.

The equation is supposed to be (Govt a buffer between you and corporate greed).

YOU ———-à Government ß———— Corporate Greed

But actually it is

Government ß———You ————à Corporate Greed

FB +MS+GOOG    

Privacy/ Clean Air

VW

SubPrime/ Nuclear testing

Savings, health, Clean Water

BP, BHOPAL Gas

                   

What does this have to do with Climate change?

I guess this is one of the most important points being argued in the Climate change talks, if you have heard of “differentiated responsibility climate change” it basically means:-

Common but differentiated responsibilities (CBDR), principle of international environmental law establishing that all states are responsible for addressing global environmental destruction yet not equally responsible. The principle balances, on the one hand, the need for all states to take responsibility for global environmental problems and, on the other hand, the need to recognize the wide differences in levels of economic development between states. These differences in turn are linked to the states’ contributions to, as well as their abilities to address, these problems. CBDR was formalized in international law at the 1992 United Nations Conference on Environment and Development (UNCED) in Rio de Janeiro.

Why does the country Iceland need to adhere to strict emission regulations when a German Car maker doesn’t? Did you know that the German Govt Subsidies VW so should they take over the debt of VM when it comes to the emissions?

Don’t get me wrong I am not accusing the private sector of being reckless with the environment. If anything VW has set back positive efforts by many companies that take CSR seriously and I fear that these companies are going to face the wrath due to no fault of their own. But the line has been crossed question is what are you going to do about it , slap on the wrist like the Subprime crisis or enforce rules penalties so strict that they end up paying for the damage by decades of RnD to make the world a better place.

References

http://www3.epa.gov/otaq/climate/documents/420f14040a.pdf

http://data.worldbank.org/indicator/EN.ATM.CO2E.KT/countries

If you’re wondering why rules is striked out , it’s because that’s what VW just broke.

Video on Performance tuning memory allocation and spill issues

I was inspired by a recent article I read on parameter sniffing to add my own contribution to the topic, I explain how tempdb spools occur due to poor memory grants and they impact of parameter values in determining how much resources are allocated. I then go on to cover two different ways that the DBA and Developer can manipulate the behavior of the Optimizer to ensure that procedures are granted enough memory.

Continue reading

VM and IP Configurations

Recently there was a post on DB mirroring not working on a VM, in the post I had suggested checking the network connectivity and NIC configurations in the VM software since there are different ways that the VM software assigns IP to the Virtual devices. For the purpose of this post I am using VMware as the example but most of it holds true for HyperV as well.

Here is a screenshot of the different types of Network connections that can be established via VMware

As you can see in the above example I have used a bridged connection, this means that the VM is treated like another machine on the same network as the host.

Nat or Network address translation is used when there are limited IP addresses usually you find something similar when using the cloud since the Cloud provider has a limited number of IP addresses available.

In this case the VM translates (catches and redirects) any request from the Guest and passes it on to the Host so that the network sees only the host IP address and not the IP of the Guest OS.

Host Only – A connection shared with the host only in other words this guest had no access to the network. The Guest and the Host share their own private network created using a Virtual DHCP available within the VM software itself.

As you can see depending on how the NIC is configured the Router or DHCP server sees different IP addresses and therefore firewalls and mirroring endpoints need to be configured differently.

Reference

https://www.vmware.com/support/ws4/doc/network_bridged_ws.html

https://www.vmware.com/support/ws3/doc/ws32_network6.html

https://www.vmware.com/support/ws3/doc/ws32_network21.html

 

Suspect MSDB and no Backups

Yesterday I got a call from a client about MSDB being suspect after a drive failure. Unfortunately the client didn’t have any backups. Now the correct and best option is to use a backup and restore the database. This way you don’t lose any of the jobs and other configurations that you may have set over the years. In this case I didn’t really have much choice so after getting permissions to restart SQL Server we shut down the database instance and then copied over a fresh copy of the Data and log files for msdb from the below location.

“C:/Program Files/Microsoft SQL Server/MSSQL12.SQL_14/MSSQL/Binn/Templates/”

Once SQL server service starts it sees the new blank copy of the msdb database and the server is back to normal. Except for the missing jobs and backup history etc.

I would normally try my best to recover the data from the databases, but when it comes to system database there are limited options. It always a good idea to backup system databases yet I see that most companies tend to not give them the same level of importance.

The client had a user database too that was marked suspect and had allocation page errors which were not resolving even after DBCC CHECKDB with Repair allow data loss, in this case I didn’t really have much of a choice as this database was also not being backed up but after resetting the database status I was able to provide the client read access to the database so that important tables could be exported.

 

SQL Server is an awesome product.

While I might seem biased in my choice of wording for the title, the fact is SQL Server is an awesome product.

If you looking for actual numbers click here. MS SQL Server is the top OLTP and among the top DW RDBMS.

More importantly you see that SQL server provide everything you need to manage almost all kinds of data. Along with Office and Windows Server it’s one of those few products MS has got right and keep getting right, be it the focus on Self Service BI when nobody else was paying attention to it or just the foresight to include PolyBase when BIGDATA wasn’t that Big or MDX way before that. MS has shown good thinking on what the market needs well in advance of the need actually arising. Obviously they sometimes get it wrong, remember the Zune or Surface?

Sometime they come from behind like the XBOX or Azure but the one thing I think that contributes to the success of SQL Server has been the massive community behind it. Recently with all the testing I have been doing on SQL 2016 I have been raising my share of connect items and more importantly than me raising the connect items is the wealth of information already out there , some on MSDN others on peoples blogs. As if this wasn’t enough, the guys at MS then go ahead and either resolve the connect bug in a few days or sometime even hours showing that they are just as committed to making sure that these things get addressed. As long as we can keep this up I really think SQL Server is gonna be miles ahead of the competition.

I guess the guys at MS can thank all those DBA out there who have made it a point to be a pain in the A** when it comes to SQL server, we all have a better product because of it.

http://enabledbusinesssolutions.com/blog/proof-that-ms-sql-server-is-the-better-relational-database/

 

The History of War Analyzed

War is an ugly topic, but I wanted to know if we have learnt anything from history and so I downloaded the data for all the wars since 1900 till date to get an understanding of how things have changed.

The below chart shows the duration in years on the x axis and the count of wars for this duration. Overall we see that the overwhelming majority of wars are short duration and typically last between 1 to 2 years.

The below graph explains the trend further showing that civil wars tend to drag on longer, often the protracted wars result in victory to the Main party e.g. India in Insurgency in Jammu Kashmir , Sri Lanka against the LTTE and Spain against ETA.

Maybe we could take hope in the fact that short duration wars reduce the impact of these wars on humanity, if it hadn’t been for the below chart which shows the number of wars being fought had increased.

Some of the popular waring nations, the list includes civil wars (orange) and conventional wars. Interesting to note that countries that have the highest tallies usually have both internal and external wars. Keep in mind the list includes countries that have provided covert support for waring nations as well.

In the last 115 years we have never had a single year when the world was at peace. The below chart shows the number of conflicts by year, as you can see, we live in a more violent time today with more countries and war zones.

List of countries with Win Loss Record (this is only a fraction of the total number of countries)

There are more facts to be displayed but by this point I am too depressed to continue. Interesting thing to note about the countries that lost the wars is that they often seem to end up in turmoil even after the war, while the countries that win the wars typically move higher up the economic ladder.

Below is a world map of Conventional wars.

Below is a world map of Civil wars and insurgencies

In Summary, the only place war hasn’t happened are the places where people don’t exist. Surprisingly, you will see that Europe, Africa and Regions near the Panama Canal seem to have the largest amount of strife. Europe has obviously settled down after the world wars but not much help for Africa.

The Cold war has obviously raised its share of tensions as well the Middle East Conflicts. Please note the data does contain some issues with mappings of country names mainly due to change in name such as Soviet Union to Russia etc. Almost every country in the United Nations has been at War in the last 100 years and things obviously has gotten better in limiting the scope and damage caused due to war but we are no closer to eliminating it all together.

Geographic Isolation does provide a measure of comfort since countries that have barriers like Oceans and mountains between their neighbors tend to fare better with fewer enemies and more wins during conflicts; example include United States, India and Australia. The United States tends to fare better in all-out war than aiding insurgencies while India seems to be able to tackle domestic unrest more easily. Also if you’re thinking most of the countries have a war against them because of the world wars then think again, almost every country has had a war either before or after the world war too.

So why did I write this post, despite its depressing content? As a database professional I am fortunate in being able to view and work with data in different ways, after analyzing my phone bill (a real eye opener), bank statements (saved me a ton of money), climate change data (got rid of certain misconceptions) and now war data (reinforced my assessment of the world) I look forward to the next enlightening dataset and any surprises it might have for me.