Sunday, December 7, 2014

Helping SQL Server with more statistics – Filtered Statistics

Existing System: We generally let SQL Server create statistics automatically, so for each column used in predicates it builds a statistics object with the required information. A statistics histogram can hold at most about 200 steps (sampled distinct values). So if a column holds about 100K distinct names, SQL Server can store accurate information for only about 200 of them and keeps the information about the remaining names in ranges. For example, if the first sample point is the name “Adam” and the second is “Ben”, the histogram stores exactly how many rows in the table have the name “Adam” and how many have “Ben”. Along with this it stores the number of rows whose values fall between “Adam” and “Ben”, and the number of distinct values in that range.
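You can look at this histogram directly. The sketch below assumes an AdventureWorks-style table and an existing statistics object named `IX_Person_LastName`; both names are illustrative, so substitute ones that exist in your database:

```sql
-- Show the histogram behind an existing statistics object.
-- 'Person.Person' and IX_Person_LastName are illustrative names.
DBCC SHOW_STATISTICS ('Person.Person', IX_Person_LastName) WITH HISTOGRAM;

-- Each row of the output is one histogram step:
--   RANGE_HI_KEY        = the sampled value itself ("Adam", "Ben", ...)
--   EQ_ROWS             = exact row count for that value
--   RANGE_ROWS          = rows falling strictly between this step and the previous one
--   DISTINCT_RANGE_ROWS = distinct values inside that range
```

The AVG_RANGE_ROWS column (RANGE_ROWS divided by DISTINCT_RANGE_ROWS) is exactly the per-name estimate discussed below.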

Problem Statement: Given the above, SQL Server can end up working with very skewed statistics and therefore pick less-than-optimal query plans. Let’s extend the example above to understand this better. Consider a table containing the following names, in alphabetical order from “Adam” to “Ben”:

Name      Number of Rows
Adam         200,000
Alan          25,000
Ariez        500,000
Bacary        50,000
Bane          25,000
Ben          200,000

For this sample of data, SQL Server stores accurate information for the two sample points (“Adam” – 200K and “Ben” – 200K) but only approximate values for everything in between. It knows only that there are 4 distinct names between “Adam” and “Ben” totalling 600K rows, so it estimates about 150K rows per distinct name. When a query searches for the name “Ariez”, SQL Server therefore estimates about 150K rows, while the table actually holds about 500K, which can push it toward an inefficient query plan instead of the most optimized one.

Solution: With this anomaly in place, we should provide SQL Server with additional statistics so that it can make better decisions about which plans to use. This is where filtered statistics come into the picture. In the sample above we would create filtered statistics on the values between “Adam” and “Ben”. SQL Server then produces another histogram with up to 200 sample points (or fewer if there are fewer distinct values) over just that range, and can therefore know far more precisely that there are about 500K rows with the name “Ariez”, and likewise for the other names.
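A minimal sketch of creating such a filtered statistic, assuming a table `dbo.Customers` with a `Name` column matching the example (the table, column, and statistic names are illustrative):

```sql
-- Filtered statistics covering only the "Adam"–"Ben" range.
-- This gives the optimizer up to 200 fresh histogram steps
-- dedicated to just this slice of the data.
CREATE STATISTICS Stat_Name_Adam_Ben
ON dbo.Customers (Name)
WHERE Name > 'Adam' AND Name < 'Ben'
WITH FULLSCAN;
```

WITH FULLSCAN scans every row rather than sampling, which is affordable here because the filter restricts the statistic to a narrow range.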

Saturday, October 25, 2014

Basic IO Test with SQL Server’s Performance Oriented Features

Microsoft has released at least one major performance-boosting feature in its Enterprise Edition in each recent release. A few of them:

SQL Server 2008: Row-level and page-level compression, which reduce the IO flowing from the disk. This needs to be tested properly to make sure that the increased CPU load of decompressing the data is really worth the decreased IO load.
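A sketch of enabling page compression on the test table, with an estimate run first (the table name follows the test setup below; both statements assume Enterprise Edition):

```sql
-- Estimate the space saving before committing to the rebuild:
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'Sales',
     @object_name      = 'SalesOrderHeader',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = 'PAGE';

-- Apply page compression; the rebuild rewrites every page in compressed form.
ALTER TABLE Sales.SalesOrderHeader
REBUILD WITH (DATA_COMPRESSION = PAGE);
```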

SQL Server 2012: ColumnStore indexes, which improve query performance drastically because they store data column-wise instead of in the traditional row-based layout. The disadvantage was that a table had to be read-only to carry a ColumnStore index. However, there was still a workaround: switching partitions in and out of these read-only tables.

SQL Server 2014: Updatable ColumnStore indexed tables, and the In-Memory OLTP engine, which keeps data in memory without the fear of it being paged out as it ages.

As part of a comparative test, I wanted to see how these features behave with a very simple query, for experimental purposes.

Note: The “AdventureWorks2014” database was used for testing. Table: “SalesOrderHeader”. Deciding column: “OrderDate”.

All five scenarios are exclusive of each other and were run in isolation.

The following query was run under each scenario:

SELECT COUNT(*) FROM Sales.SalesOrderHeader SOH WITH (NOLOCK) WHERE OrderDate > '2013-01-01'

Scenario 1: The table has no index on the deciding column

Table 'SalesOrderHeader'. Scan count 1, logical reads 688, physical reads 0, read-ahead reads 693, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


Scenario 2: The table has Page compressions applied to it

Table 'SalesOrderHeader_Mod'. Scan count 1, logical reads 330, physical reads 0, read-ahead reads 344, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


Scenario 3: The table has a NonClustered index on the deciding column

Table 'SalesOrderHeader'. Scan count 1, logical reads 61, physical reads 0, read-ahead reads 59, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

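The index used in Scenario 3 can be created along these lines (the index name is illustrative):

```sql
-- Nonclustered index on the deciding column for Scenario 3.
CREATE NONCLUSTERED INDEX IX_SalesOrderHeader_OrderDate
ON Sales.SalesOrderHeader (OrderDate);
```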

Scenario 4: The table has a NonClustered ColumnStore index on the deciding column

Table 'SalesOrderHeader'. Scan count 1, logical reads 13, physical reads 0, read-ahead reads 21, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

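The columnstore index for Scenario 4 can be created along these lines (the index name is illustrative):

```sql
-- Nonclustered columnstore index on the deciding column for Scenario 4.
-- Note: in SQL Server 2014 a *nonclustered* columnstore index still makes
-- the table read-only; only the clustered columnstore index is updatable.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SalesOrderHeader_OrderDate
ON Sales.SalesOrderHeader (OrderDate);
```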

Scenario 5: The table is created as memory-optimized, so the IO cost in the execution plan is nil because all the data is already present in memory.

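A sketch of a memory-optimized copy of the table for Scenario 5. The database needs a MEMORY_OPTIMIZED_DATA filegroup first; the table and column list here are trimmed to an illustrative minimum around the deciding column:

```sql
-- Memory-optimized table sketch (assumes the database already has a
-- MEMORY_OPTIMIZED_DATA filegroup). Names and bucket count are illustrative.
CREATE TABLE Sales.SalesOrderHeader_InMem
(
    SalesOrderID INT NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 65536),
    OrderDate    DATETIME NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
```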

Summary:

Test Case   Scenario                                                   IO Pages Read from Disk   Read-Ahead Reads
1           No index on the deciding column                            688                       693
2           Page compression applied to the table                      330                       344
3           NonClustered index on the deciding column                  61                        59
4           NonClustered ColumnStore index on the deciding column      13                        21
5           Table created with In-Memory optimization                  0                         0

Observation:

While working with some of the data, I saw that in some tests the NonClustered index performed better than the ColumnStore index, which made me think twice and dig deeper into the question of when ColumnStore indexes win outright. It was puzzling because the NonClustered index contains just the same columns as the ColumnStore index, so in which cases could one outweigh the other? I finally stumbled upon the following blog post, which says that “NonClustered indexes were still going to be more effective for specific types of queries (for example – singleton operations, smaller range scans, etc.).”

http://www.sqlskills.com/blogs/joe/comparing-query-performance-when-using-an-ideal-nonclustered-index-versus-columnstore-index/

ColumnStore indexes perform way better than NonClustered indexes when the data volume is huge and the ColumnStore execution can leverage parallelism and batch-mode processing.

Thursday, June 26, 2014

Data Cleansing using Data Quality Services – SQL Server 2012

Cleansing the data from the source
ETL processes pull data from a vast number of sources. We need to cleanse the data from all of them into a common clean form by comparing it with a knowledge base that was built beforehand. Building a knowledge base is an incremental process. As an example, take the following Excel file, whose City column now needs to be cleansed by comparing it against a knowledge base:
Steps to cleanse the above source file
1.) Open “SQL Server Data Quality Services” client and click on “New Data Quality Project” to start the process of cleansing:
2.) Enter a name for the project, select the knowledge base that should be used as the reference for cleaning the source data, and select the cleansing activity:
3.) Select the source file in the next window along with the column that needs to be cleaned up. We also need to select the domain of the knowledge base to refer against. Here we select the “City” domain of the “Customer” knowledge base that was built in the previous blog post:
4.) Click on Start to begin cleansing. The process analyzes the source file against the domain and either corrects values by itself or offers suggestions:
5.) Here we see that one record was identified as a new value for “City” because no such value exists in the knowledge base, and the service was also unable to associate it with any existing city in the domain; otherwise it would have corrected it automatically or suggested which city it might refer to. Since it is shown as a new value, it is now the validator’s job to either accept it or map it to one of the existing cities. I know that “Dilhi” is a spelling mistake and should refer to “Delhi”, so I enter the correct value in the “Correct to” field and approve it. The record then goes to the Corrected tab:
6.) We also see that a record is already present in the Corrected tab. The city given in the source was “Bangaloore”; Data Quality Services analyzed it and mapped it to the corrected value “Bangalore” with a confidence level of 87%, which is pretty high. The operator still has control over whether to accept this suggestion, by approving or rejecting it.
7.) The corrected data can then be exported to either Excel or a SQL Server table.
8.) The final corrected source:
Conclusion:
Any source can be passed through Data Quality Services and cleansed against a knowledge base before it moves to the next stage of an ETL process or any independent process. We can also use the same source files as new learning material for the knowledge base, adding the new values found in them if we want to. I will cover this in another blog post.

Tuesday, June 24, 2014

An Introduction to Data Quality Services - SQL Server 2012 - Creating first Knowledge Base

Introduction to Data cleansing
Data stored in database tables or Excel files can accumulate discrepancies once it starts coming from different sources. Two values might refer to the same object, but because of differences in spelling they get detected as completely separate items. For example, Excel1 might record Customer1’s city as “Bangalore” while Excel2 states it as “Banglore”. Even though both are meant to point to the same city, due to a manual error any system would see them as different cities, one of them invalid. So before we can feed these sources into any system, we need to cleanse them so that no data discrepancy remains. Microsoft SQL Server 2012 introduced a great feature for this purpose, called “SQL Server Data Quality Services”.
Installing SQL Server Data Quality Services
This feature is not installed by default; only its installer is. We need to run the “Data Quality Server Installer”, which can be found in the SQL Server directories or in the Programs menu. The service installs in about 10 minutes, after which we are ready to start using it.
Everyone needs a Knowledge Base
For any data cleansing activity, we need to know what the correct value should be. So we first create a reference set of correct values against which sources can be cleaned. This is called a knowledge base.
Sample source for Knowledge Base:
We need a source to create a knowledge base. We could also build one by entering values manually, but for now I will use a source file, as that is quicker than typing in the base by hand. Consider the following list of states and cities, which we will use as our knowledge base to clean other sources containing customer data:
Steps to create a New Knowledge Base
1.) Go to Programs and start “SQL Server Data Quality Services”. It opens a UI; after selecting the server where the service is installed, choose “New Knowledge base”. The following window opens to create a new knowledge base:
2.) Input a name for this knowledge base and select “Knowledge Discovery” to start building the base. In the next window we need to define the source of the base and map its columns to universal domains, which can later be used as a reference to cleanse columns of these types:
3.) Click on the new-domain icon beside “Mappings” to create a domain. Here we create a City domain and then, similarly, a State domain:
4.) Once the domains are created, map the column names to the domains to start discovering the values:
5.) In the next window, click the “Start” button to start analyzing the source and checking the domains for any new values. Since this is the first time we are creating the knowledge base, all the values will be identified as new values for the domains:
6.) On clicking “Next”, we can see both domains and all the values that have now been added into them from the Excel file. We also see that it automatically corrected one of the state values (“Andra Pradesh”) to another (“Andhra Pradesh”); it chose “Andhra Pradesh” because that value came first in the Excel list.
7.) Select “Error” from the drop-down to see all the values that were rectified to a new value:
8.) Once the data discovery is done, we need to publish the knowledge base so that other projects can start using it to cleanse their sources. Click on “Publish” and go through the remaining windows to publish it:
9.) Once the knowledge base is published, you can re-open it and check it. We can also add new values to it manually, or again select a source and add the newly detected values from it to grow the knowledge base.
Conclusion
So now we are ready with the knowledge base, and in future blog posts I will experiment with how we can use this base to cleanse sources while transforming them to the destination.