Wednesday, May 20, 2015

A Few SQL Server Storage Definitions

Page - Basic Unit of SQL Server Storage - 8KB

Extent: 8 Consecutive pages - 64KB

Extent Types:

  • Uniform Extent: all 8 pages of the extent belong to the same IAM (object)
  • Mixed Extent: the pages in the extent belong to more than one IAM (object), so a maximum of 8 different IAMs can share an extent

When is a page allocated from each extent type?
  • Mixed Extent: if the table is very small, its pages are allocated from mixed extents. The first 8 pages of a table are stored in mixed extents.
  • Uniform Extent: once the table grows beyond 8 pages, uniform extents are allocated for the new pages, and the free pages in each uniform extent are reserved for that particular IAM.
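The allocation type of each page can be inspected directly. A minimal sketch, assuming SQL Server 2012 or later (sys.dm_db_database_page_allocations is undocumented, and dbo.SomeTable is a placeholder name):

```sql
-- List each allocated page of a table and whether it came from a mixed extent
SELECT allocated_page_file_id,
       allocated_page_page_id,
       is_mixed_page_allocation   -- 1 = allocated from a mixed extent
FROM   sys.dm_db_database_page_allocations(DB_ID(), OBJECT_ID('dbo.SomeTable'),
                                           NULL, NULL, 'DETAILED')
WHERE  is_allocated = 1;
```

For a table under 8 pages we would expect is_mixed_page_allocation = 1 for its pages; pages allocated after the table crosses 8 pages should show 0.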

GAM Page (Global Allocation Map): This page tracks the allocation status of extents: a GAM bit of 1 means the extent is free (unallocated), and a bit of 0 means the extent is allocated.
A single GAM page can keep track of about 64,000 extents, i.e. almost 4 GB of data. So a single GAM page covers roughly a 4 GB interval of a data file.
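The arithmetic behind a GAM page's coverage can be sketched as follows (one bit per 64 KB extent):

```sql
SELECT 64000 * 64              AS TotalKB,  -- 64,000 extents * 64 KB per extent
       64000 * 64 / 1048576.0  AS TotalGB;  -- roughly 3.9 GB per GAM interval
```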

SGAM Page (Shared GAM): This keeps track of whether a mixed extent contains free pages or not.
  • If the SGAM bit = 1, the extent is a mixed extent with free pages available for use
  • If the SGAM bit = 0, the extent is either a uniform extent or a mixed extent with no free pages

Page Structure:
  • The first 96 bytes form the page header, which holds the page type and other metadata
  • The data records follow the header
  • At the end of the page is the slot array, which stores the offset of each record within that particular page
More information about storage can be found in Paul Randal's blog posts: http://www.sqlskills.com/blogs/paul/inside-the-storage-engine-gam-sgam-pfs-and-other-allocation-maps/
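To see this structure for yourself, the page header and slot array can be dumped with DBCC PAGE (a sketch for a test system only; the database name, file id, and page id below are placeholders):

```sql
DBCC TRACEON (3604);                          -- route DBCC output to the client
DBCC PAGE ('AdventureWorks2012', 1, 200, 2);  -- db, file id, page id, dump style
```

The output shows the header fields (m_type, m_slotCnt, etc.), then the records, then the slot array of record offsets.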

Wednesday, May 6, 2015

Controlled IO for Heavy Processes using SQL Server 2014 Resource Governor

You may frequently have experienced that when certain queries run, they slow down the whole server and hence impact the business, because other applications stop responding in a timely manner. This usually happens because some queries need a lot of IO operations against the disks, which puts a huge strain on the IO subsystem and results in ever-growing IO queues. Other queries then go into a wait state, waiting for the IO subsystem to complete the requests of the IO-intensive queries that started before them.

Microsoft introduced a feature called "Resource Governor" in SQL Server 2008, which helped control the utilization of specific resources (CPU and memory) by different groups of queries. The feature is available only in the Enterprise, Developer, and Evaluation editions of SQL Server. In SQL Server 2014, Microsoft enhanced it with another controllable resource: the IO subsystem. This is a major addition, as administrators have long wanted a way to cap the IO consumed by certain types of queries so that they do not bog down the whole server.

Enabling and Implementing IO Utilization using Resource Governor:

1.) Enable Resource Governor by right-clicking “Resource Governor” under the Management folder of SSMS Object Explorer and selecting “Enable”

2.) Create a resource pool for IO that specifies MIN and MAX IOPS per volume
 
USE master;
GO
CREATE RESOURCE POOL RestrictedIOPool
WITH (MAX_IOPS_PER_VOLUME = 30, MIN_IOPS_PER_VOLUME = 1);
GO

3.) Create a workload group that is directed to the above resource pool and hence gets controlled IO utilization. By default there is a “default” group, which contains all other queries, and an “internal” group, which is used by SQL Server for its internal processes.

USE master;
GO
CREATE WORKLOAD GROUP RestrictedIOGroup USING RestrictedIOPool;
GO

4.) Create a classifier function to route sessions into workload groups. This function is called whenever a new session connects, and the session is classified into a particular workload group, so the feature does carry a small performance overhead. In this example, every session that connects under the login “Backups” goes into the newly created group; I will make sure that every resource-intensive query I want to govern runs under this particular login:

USE master;
GO

CREATE FUNCTION dbo.RestrictedIOUsage()
RETURNS SYSNAME
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @GroupName SYSNAME;

    IF SUSER_NAME() = 'Backups'
        SET @GroupName = 'RestrictedIOGroup';
    ELSE
        SET @GroupName = 'default';

    RETURN @GroupName;
END
GO


5.) Once the above components are created, we need to register the classifier function with Resource Governor:
USE master;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.RestrictedIOUsage);
ALTER RESOURCE GOVERNOR RECONFIGURE;
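Once reconfigured, we can verify which workload group each session was classified into (a quick check using the DMVs; the group and login names assume the objects created in the steps above):

```sql
SELECT s.session_id,
       s.login_name,
       g.name AS workload_group
FROM   sys.dm_exec_sessions AS s
JOIN   sys.dm_resource_governor_workload_groups AS g
       ON s.group_id = g.group_id
WHERE  s.is_user_process = 1;
```

A session logged in as “Backups” should show up under RestrictedIOGroup; everything else under default.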

Demo: checking the IOPS counter while running a resource-intensive operation

I will run a consistency check of the AdventureWorks database and watch how many IOPS the operation generates. Since I run it under the login “Backups”, it is automatically sent to the RestrictedIOGroup, which in turn is directed to the RestrictedIOPool:
DBCC CHECKDB(AdventureWorks2012)

Add the following counters in the Perfmon window under the MSSQL$SQL2014_Instance_Name:Resource Pool Stats object:

1.) Disk Read IO/Sec - Default Instance
2.) Disk Read IO/Sec - RestrictedIOpool Instance


clip_image002

Here you can clearly see that the IOPS stays around 30, the maximum we set for this pool. This way, no query in this pool can consume all the available IO for its request, so other queries can still be processed simultaneously:



clip_image004

Now let us increase the MAX IOPS for this Resource pool and see how it behaves:

ALTER RESOURCE POOL RestrictedIOPool WITH (MAX_IOPS_PER_VOLUME = 300);
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO

Here we can see that the IOPS now goes well beyond 30, rising as high as needed to satisfy the request, effectively without an upper bound:

clip_image006

This is a very valuable feature, and it should help administrators control their IO subsystem across different groups of users and workloads.

Reducing the Page Splits – The Hidden IO intensive operation

Existing system: We generally have a system in place that checks the fragmentation levels of various indexes and reorganizes or rebuilds them accordingly. I have noticed various tables that get rebuilt every day because their fragmentation exceeds 40%.
Assumption: Since the index pages are getting so highly fragmented on a regular basis, it is likely that throughout the day many pages are being split to accommodate new data. A page split is a very resource-intensive operation and hence a constant, hidden source of heavy server load. The splits probably happen because we keep the index pages completely full. This is determined by the fill factor: if no fill factor is defined explicitly at the index level, the server-level setting is used, and the server default fill factor is 100%, so all our indexes must be getting filled completely.
Investigation: Once the above assumption is confirmed, we can try changing the fill factor of various indexes to about 80 or 90% to start with, so that there is enough free space in the pages to accommodate new data and page splits are drastically reduced. We can then analyze page splitting for the next 3 to 4 months and adjust the fill factor further based on that analysis.
Few important points:
1.) If the clustered index is on an identity key, there will not be any page splits from inserts, as the data is entered in order and nothing needs to be inserted between existing rows. Hence the above theory applies to indexes that are not incremental in nature (clustered or non-clustered).
2.) Be cautious when you lower the fill factor of an existing index, as the size of the index grows in proportion. For example, if an index has about 10K pages and we rebuild it with a fill factor of 80% (leaving 20% of each page free), the index grows to about 12.5K pages.
3.) Another point to investigate: since lowering the fill factor increases the number of pages, queries against these tables have to read more pages, so IO increases. We need to weigh this against the cost of the page splits that happen when pages are 100% full and see which is the better option. My expectation is that page splits need to be reduced drastically, as they are a very heavy operation.
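As a sketch of the proposed change (the index and table names here are hypothetical), the fill factor is applied at rebuild time, and the cumulative page-split counter can be sampled before and after to measure the effect:

```sql
-- Rebuild with 10% free space left on each leaf page
ALTER INDEX IX_Customer_Name ON dbo.Customer
REBUILD WITH (FILLFACTOR = 90);

-- Server-wide cumulative page-split counter (sample it twice and diff)
SELECT cntr_value AS page_splits_cumulative
FROM   sys.dm_os_performance_counters
WHERE  counter_name = 'Page Splits/sec';
```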




Sunday, December 7, 2014

Helping SQL Server with more statistics – Filtered Statistics

Existing System: We generally let SQL Server create statistics automatically, so for every column used in predicates it creates a statistics object with the required information. When SQL Server creates statistics, the histogram holds at most about 200 sample points (steps of distinct values). So if a column has about 100K distinct names, SQL Server can accurately store the information of only about 200 of them and stores the rest as ranges. For example, if the first sample point is the name “Adam” and the second is “Ben”, it accurately stores in the statistics histogram how many rows have the names “Adam” and “Ben”. Along with this, it stores the number of rows that fall between “Adam” and “Ben” and the number of distinct values in that range.

Problem Statement: With the above behavior, SQL Server can end up working from very skewed statistics at various times and hence using less-than-optimal query plans. Let's use the sample above to understand this better. Consider a table containing the following names, from “Adam” to “Ben” in alphabetical order:

Name      Number of Rows
Adam      200,000
Alan      25,000
Ariez     500,000
Bacary    50,000
Bane      25,000
Ben       200,000

For this sample, SQL Server stores accurate information for the sample points (“Adam” – 200K and “Ben” – 200K) but only approximate values for everything in between. It knows only that there are 4 distinct names between “Adam” and “Ben” totalling 600K rows, so it estimates about 150K rows per distinct name. Hence, when a query searches for the name “Ariez”, SQL Server estimates about 150K rows while in reality there are about 500K, which can lead it to pick an inefficient query plan instead of the optimal one.
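The histogram itself can be inspected to confirm this kind of skew (the table and statistics names below are hypothetical): RANGE_HI_KEY shows the sample points, EQ_ROWS the exact counts at those points, and AVG_RANGE_ROWS the per-value estimate used for everything in between.

```sql
DBCC SHOW_STATISTICS ('dbo.Customers', 'IX_Customers_Name') WITH HISTOGRAM;
```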

Solution: Given the above anomaly, we should provide SQL Server with additional statistics so it can make better decisions about which plans to use. This is where filtered statistics come into the picture. In the above sample, we should create filtered statistics on the values between “Adam” and “Ben”. SQL Server then builds another histogram with up to 200 sample points (or fewer, if there are fewer distinct values) over just that range, and can therefore tell much more accurately that there are about 500K rows with the name “Ariez”, and likewise for the other names.
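A minimal sketch of such a filtered statistics object, assuming a hypothetical dbo.Customers table with a Name column:

```sql
-- Statistics restricted to the skewed range between the two sample points
CREATE STATISTICS Stat_Customers_Name_AdamToBen
ON dbo.Customers (Name)
WHERE Name >= 'Adam' AND Name <= 'Ben'
WITH FULLSCAN;
```

The optimizer can use this extra histogram whenever a query's predicate falls inside the filtered range.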

Saturday, October 25, 2014

Basic IO Test with SQL Server’s Performance Oriented Features

Microsoft has released at least one major performance-boosting feature in the Enterprise Edition of each recent release. A few of them:

SQL Server 2008: Row-level and page-level compression, which reduce the IO flow from the disk. This needs to be tested properly to make sure the extra CPU load of decompressing the data is really worth the reduced IO load.

SQL Server 2012: ColumnStore indexes, which improve query performance drastically because they store data column-wise instead of using the traditional row-based storage. The disadvantage was that a table had to be read-only to have a ColumnStore index, although there was still a workaround of switching partitions in and out of these read-only tables.

SQL Server 2014: Updatable ColumnStore indexed tables, plus the In-Memory OLTP engine, which keeps data in memory without the fear of it being paged out as it ages.

As a comparative test, I wanted to see how these features behave with a very simple query.

Note: The “AdventureWorks2014” DB was used for testing purpose. Table: “SalesOrderHeader” Deciding Column: “OrderDate”

All 5 scenarios are mutually exclusive and were run in isolation.

The following query was run under each scenario:

SELECT COUNT(*)
FROM Sales.SalesOrderHeader AS SOH WITH (NOLOCK)
WHERE OrderDate > '2013-01-01';
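The read counts reported under each scenario come from SET STATISTICS IO ON. How scenarios 2 to 4 could be set up can be sketched as follows (the index names are assumptions, not necessarily what the test used):

```sql
SET STATISTICS IO ON;

-- Scenario 2: page compression
ALTER TABLE Sales.SalesOrderHeader REBUILD
    WITH (DATA_COMPRESSION = PAGE);

-- Scenario 3: nonclustered index on the deciding column
CREATE NONCLUSTERED INDEX IX_SOH_OrderDate
    ON Sales.SalesOrderHeader (OrderDate);

-- Scenario 4: nonclustered ColumnStore index on the deciding column
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SOH_OrderDate
    ON Sales.SalesOrderHeader (OrderDate);
```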

Scenario 1: The table has no index on the deciding column

Table 'SalesOrderHeader'. Scan count 1, logical reads 688, physical reads 0, read-ahead reads 693, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

clip_image002

Scenario 2: The table has Page compressions applied to it

Table 'SalesOrderHeader_Mod'. Scan count 1, logical reads 330, physical reads 0, read-ahead reads 344, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

clip_image004

Scenario 3: The table has a NonClustered Index on the deciding column

Table 'SalesOrderHeader'. Scan count 1, logical reads 61, physical reads 0, read-ahead reads 59, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

clip_image006

Scenario 4: The table has a ColumnStore NonClustered Index on the deciding column

Table 'SalesOrderHeader'. Scan count 1, logical reads 13, physical reads 0, read-ahead reads 21, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

clip_image008

Scenario 5: The table is created as In-Memory, so the IO cost in the execution plan is nil because all the data lives in memory.

clip_image010

Summary:

Test Case   Scenario                                                 IO Pages Read from the Disk   Read-Ahead Reads
1           No index on the deciding column                          688                           693
2           Page compression applied to the table                    330                           344
3           NonClustered index on the deciding column                61                            59
4           ColumnStore NonClustered index on the deciding column    13                            21
5           Table created with In-Memory optimization                0                             0

Observation:

While working with some of the data, I saw that in some tests the NonClustered index performed better than the ColumnStore index, which made me think twice and search further into which cases ColumnStore indexes clearly win. It puzzled me because the NonClustered index contains just the same columns as the ColumnStore index, so in which cases could one outweigh the other? I finally stumbled upon the following blog post, which says that “NonClustered indexes were still going to be more effective for specific types of queries (for example – singleton operations, smaller range scans, etc.).”

http://www.sqlskills.com/blogs/joe/comparing-query-performance-when-using-an-ideal-nonclustered-index-versus-columnstore-index/

ColumnStore indexes perform far better than NonClustered indexes when the data volume is huge and the query execution can leverage batch mode and parallelism.

Thursday, June 26, 2014

Data Cleansing using Data Quality Services – SQL Server 2012

Cleansing the data from a source
ETL processes consume data from a vast number of sources. We need to cleanse the data from all of them into a common, clean form by comparing it with a knowledge base built earlier; building a knowledge base is an incremental process. As an example, take the following Excel file, whose City column now needs to be cleansed by comparing it with a knowledge base:
clip_image002
Steps to cleanse the above source file
1.) Open “SQL Server Data Quality Services” client and click on “New Data Quality Project” to start the process of cleansing:
clip_image004
2.) Enter a name for the project and also select the knowledge base which should be used as a reference to cleaning the source data and select the cleansing activity:
clip_image006
3.) Select the source file in the next window and the column that needs to be cleaned. We also need to select the domain of the knowledge base against which to compare. Here we selected the “City” domain of the “customer” knowledge base built in the previous blog post:
clip_image008
4.) Click on “Start” to begin cleansing. The process analyzes the source file against the domain and either corrects values by itself or gives suggestions:
clip_image010
5.) Here we see that it identified one record as a new value for “City”: it found no such value in the knowledge base and could not associate the value with any existing city in the domain (otherwise it would have corrected it automatically or suggested which city it might refer to). It is now the validator's job to accept it as new or map it to an existing city. I know “Dilhi” is a spelling mistake that should be “Delhi”, so I enter the correct value in the “Correct to” field and approve it. The record then goes to the Corrected tab:
clip_image012
6.) We also see that a record is already present in the Corrected tab. The city given in the source was “Bangaloore”; Data Quality Services analyzed it and mapped it to the corrected value “Bangalore” with a confidence level of 87%, which is pretty high. The operator still has control over whether to accept this suggestion, by approving or rejecting it.
clip_image014
7.) The corrected data can then be exported to either Excel or a SQL Server table
clip_image016
8.) The final corrected source:
clip_image018
Conclusion:
Any source can be passed through Data Quality Services and cleansed against a knowledge base before being handed to the next stage of an ETL process or any independent process. We can also use the same source files as new learning input for the knowledge base, adding any new values from them if we want to. I will cover this in another blog post.

Tuesday, June 24, 2014

An Introduction to Data Quality Services - SQL Server 2012 - Creating first Knowledge Base

Introduction to Data cleansing
Data in database tables or Excel files can become inconsistent once it starts coming from different sources. Two values might refer to the same real-world object but, due to spelling differences, be detected as completely separate items. For example, Excel1 might record Customer1's city as “Bangalore” while Excel2 states it as “Banglore”. Even though both are meant to point to the same city, due to a manual error any system would treat them as two different cities, one of them invalid. So before we can feed these sources into any system, we need to cleanse them so that there is no data discrepancy. Microsoft SQL Server 2012 introduced a great feature for this purpose: “SQL Server Data Quality Services”.
Installing SQL Server Data Quality Services
This feature is not installed by default. We need to run its installer, “Data Quality Server Installer”, which can be found in the SQL Server directories or the Programs menu. The service installs in about 10 minutes, and then we are ready to start using it.
Everyone needs a Knowledge Base
For any data-cleansing activity, we need to know what the correct values should be. So we need to build a reference that sources can be cleaned against. This is called a knowledge base.
Sample source for Knowledge Base:
We need a source to create a knowledge base. We could also build the knowledge base by entering values manually, but for now I will use a source file, as that is quicker than typing the base in by hand. Consider the following list of states and cities, which we will use as our knowledge base to clean other sources containing customer data.
clip_image002[7]
Steps to create a New Knowledge Base
1.) Go to Programs and start “SQL Server Data Quality Services”. It opens a UI; after selecting the server where the service is installed, select “New Knowledge Base”. The following window opens to create a new knowledge base:
clip_image004[7]
2.) Input a name for the knowledge base and select “Knowledge Discovery” to start building it. In the next window we need to define the source of the base and map its columns to universal domains, which can later be used as a reference to cleanse these types of columns:
clip_image006[6]
3.) Click on the new-domain icon beside “Mappings” to create a domain. Here we create a City domain and then, similarly, a State domain:
clip_image008[6]
4.) Once the domains are created, map the column names to the domains to start discovering the values
clip_image010[6]
5.) In the next window, click the “Start” button to start analyzing the source and checking the domains for new values. Since this is the first time we are creating the knowledge base, all the values will be identified as new values for the domains:
clip_image012[6]
clip_image014[6]
clip_image016[6]
6.) On clicking “Next”, we can see both domains and all the values that have now been added into them from the Excel file. We also see that it automatically corrected one of the state values (“Andra Pradesh”) to another (“Andhra Pradesh”); it chose “Andhra Pradesh” because that value came first in the Excel list.
clip_image018[6]
7.) Select “Error” from the drop-down to see all the values that were rectified to a new value:
clip_image020[6]
8.) Once the data discovery is done, we need to publish the knowledge base so that other projects can start using it to cleanse sources. Click on “Publish” and go through the remaining windows to publish it:
clip_image022[6]
clip_image024[6]
9.) Once the knowledge base is published, you can re-open it and check it. We can also add new values to it manually, or again select a source and add newly detected values from it to grow the knowledge base.
clip_image026[6]
clip_image028[6]
clip_image030[6]
Conclusion
So now we are ready with the knowledge base, and in future posts I will experiment with how we can use this base to cleanse sources while transforming them to their destination.