Sunday, December 7, 2014

Helping SQL Server with more statistics – Filtered Statistics

Existing System: We generally have the SQL Servers to automatically create statistics. So for every column it would create a statistics with the required information. When SQL creates statistics, it’s able to create it with about 200 sample points (Or distinct values). So if there are about 100K distinct names in a column then SQL Server would just be able to accurately store the information of about 200 names and stores the information of other names in ranges. For e.g., if the first sample point is of name “Adam” and the second sample point is of name “Ben” then it accurately stores the information as to how many records in the table are with name “Adam” and “Ben” in the statistics histogram. Along with this it also stores the number of records that are between “Adam” and “Ben” and also the number of distinct value.

Problem Statement: With the above theory it would always be happening that SQL Server is always using very skewed statistics at various times and hence must be using a not so optimized query plans. Let’s take the above samples to understand this better. Consider that the following few names are used in a table from “Adam” to “ben” in alphabetical order:

Name

Number of Rows

Adam

200,000

Alan

25,000

Ariez

500,000

Bacary

50,000

Bane

25,000

Ben

200,000

For the above sample of data, SQL Server would be storing the accurate information of the sample points (“Adam” – 200K and “Ben” – 200K). However it stores an approximate value for the values that are in between these sample points. For e.g., It would just be having an information that there are 4 distinct names between “Adam” and “Ben” and the number of records are 600K so it would estimate that there are about 150K records for each distinct name. Hence when in queries we search for the name “Ariez”, SQL Server would estimate that there are about 150K records with the name “Ariez” but in actual it’s having about 500K records which make it to use some inefficient query plan instead of the most optimized one.

Solution: With the above anomaly in place, we should be providing SQL Servers with some more statistics for it to take better decisions as to which plans are to be used. This is where Filtered indexes come into picture. In the above sample we should be creating filtered statistics on the values between “Adam” and “Ben”. SQL Server in such case would produce another statistic with about 200 sample points (Or lesser if there are less number of distinct values) and hence it would be able to better inform SQL Server that there are about 500K records with the name “Ariez” and other possible names.

No comments: