Thursday, June 26, 2014

Data Cleansing using Data Quality Services – SQL Server 2012

Cleansing the data from the source
ETL processes ingest data from a vast number of sources. We need to cleanse the data from all of these sources into a common, clean data set by comparing it against a knowledge base that was built earlier. Building a knowledge base is an incremental process. For example, consider the following Excel file, whose City column now needs to be cleansed by comparing it against a knowledge base:
clip_image002
Steps to cleanse the above source file
1.) Open “SQL Server Data Quality Services” client and click on “New Data Quality Project” to start the process of cleansing:
clip_image004
2.) Enter a name for the project, select the knowledge base that should be used as a reference for cleansing the source data, and select the Cleansing activity:
clip_image006
3.) Select the source file in the next window, along with the column that needs to be cleansed. We also need to select the knowledge-base domain to compare against. Here we have selected the “City” domain of the “Customer” knowledge base, which was built in the previous blog post:
clip_image008
4.) Click “Start” to begin cleansing. The process analyzes the source file against the domain and either corrects values by itself or offers suggestions:
clip_image010
5.) Here we see that one record was identified as a new value for “City”, because no such value exists in the knowledge base and the service could not associate it with any existing city in the domain; otherwise it would have corrected it automatically or suggested which city it might refer to. Since it is shown as a new value, it is now the validator's job to accept it as-is or map it to one of the existing cities. I know that “Dilhi” is a spelling mistake and should be “Delhi”, so I enter the correct value in the “Correct to” field and approve it. The record then moves to the Corrected tab:
clip_image012
6.) We also see that a record is already present in the Corrected tab. The city given in the source was “Bangaloore”; Data Quality Services analyzed it and mapped it to the corrected value “Bangalore” with a confidence level of 87%, which is quite high. The operator still has control over whether to accept this suggestion, by approving or rejecting it.
clip_image014
7.) The corrected data can then be exported to either Excel or a SQL Server table:
clip_image016
8.) The final corrected source:
clip_image018
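The correct / suggest / new-value decision walked through above can be thought of as a fuzzy match between each source value and the domain, checked against a confidence threshold. Here is a minimal conceptual sketch in Python; DQS's actual matching algorithm (and its 87% score) is internal, so `difflib` and the 0.85 threshold below are my own stand-ins, and the domain list is assumed for illustration:

```python
import difflib

# Assumed stand-in for the "City" domain of the "Customer" knowledge base.
city_domain = ["Bangalore", "Delhi", "Mumbai", "Chennai", "Hyderabad"]

def cleanse(value, domain, threshold=0.85):
    """Return (status, output_value, confidence) for one source value."""
    if value in domain:
        return ("correct", value, 1.0)
    # Find the closest domain value and how similar it is (0.0 - 1.0).
    best = max(domain, key=lambda d: difflib.SequenceMatcher(
        None, value.lower(), d.lower()).ratio())
    confidence = difflib.SequenceMatcher(None, value.lower(), best.lower()).ratio()
    if confidence >= threshold:
        return ("corrected", best, confidence)   # confident enough to auto-correct
    return ("new", value, confidence)            # left for the validator to decide

print(cleanse("Bangaloore", city_domain))  # close enough: corrected to "Bangalore"
print(cleanse("Dilhi", city_domain))       # too far off: surfaced as a new value
```

With this threshold, “Bangaloore” is confidently folded into “Bangalore”, while “Dilhi” falls below the cutoff and is surfaced as a new value for the validator, mirroring the two cases in the screenshots.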
Conclusion:
Any source can be passed through Data Quality Services and cleansed against a knowledge base before being handed to the next stage of an ETL process, or used in any independent process. We can also use the same source files as new learning input for the knowledge base, adding the new values from them if we want to. I will cover this in another blog post.

Tuesday, June 24, 2014

An Introduction to Data Quality Services - SQL Server 2012 - Creating first Knowledge Base

Introduction to Data Cleansing
Database tables, or data stored in Excel files, can become inconsistent once the data starts coming from different sources. Records might refer to the same object, but spelling differences cause them to be detected as completely separate items. For example, Excel1 might give Customer1's city as “Bangalore” while Excel2 states it as “Banglore”. Even though both are meant to point to the same city, due to a manual error any system would see them as two different cities, one of them invalid. So before we can feed these sources into any system, we need to cleanse them so that there are no data discrepancies. Microsoft SQL Server 2012 has introduced a great feature for this purpose: “SQL Server Data Quality Services”.
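The problem is visible in a couple of lines of Python: to any exact-match comparison, the two spellings are simply unrelated values. This is a conceptual illustration only, not DQS code:

```python
from collections import Counter

# Rows arriving from two different sources; "Banglore" is a manual typo.
rows = [
    {"customer": "Customer1", "city": "Bangalore"},  # from Excel1
    {"customer": "Customer1", "city": "Banglore"},   # from Excel2
]

# An exact-match count treats the two spellings as two different cities.
counts = Counter(row["city"] for row in rows)
print(counts)  # two distinct keys, even though one city was meant
```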
Installing SQL Server Data Quality Services
This feature is not installed by default; only its installer is. We need to run the “Data Quality Server Installer”, which can be found in the SQL Server directories or in the Programs menu. The service installs in about 10 minutes, after which we are ready to start using it.
Everyone needs a Knowledge Base
For any data cleansing activity, we need to know what the correct value should be. So we need to build a reference set of known-good values against which sources can be compared and cleaned. This is called a knowledge base.
Sample source for Knowledge Base:
We need a source to create a knowledge base. We could also build the knowledge base by entering values manually, but for now I will use a source, as that is quicker than typing the base in by hand. Consider the following list of states and cities, which we will use as our knowledge base to clean other sources containing customer data.
clip_image002[7]
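Conceptually, a knowledge base is a set of known-good values per domain. As a mental model only (not how DQS stores it internally), building one from a source table might look like this, with the sample rows assumed for illustration:

```python
# Sample source rows for the knowledge base (values assumed for illustration).
source_rows = [
    {"State": "Karnataka",      "City": "Bangalore"},
    {"State": "Maharashtra",    "City": "Mumbai"},
    {"State": "Delhi",          "City": "Delhi"},
    {"State": "Andhra Pradesh", "City": "Hyderabad"},
]

# Each domain is simply the set of distinct values discovered for its column.
knowledge_base = {
    "State": {row["State"] for row in source_rows},
    "City": {row["City"] for row in source_rows},
}
print(knowledge_base["City"])  # e.g. {'Bangalore', 'Mumbai', 'Delhi', 'Hyderabad'}
```

Cleansing a new source then amounts to checking each incoming value against the relevant domain set, exactly what the Cleansing project in the previous post does through the UI.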
Steps to create a New Knowledge Base
1.) Go to Programs and start “SQL Server Data Quality Services”. It opens a UI; select the server where the service is installed and then choose “New Knowledge Base”. The following window will open to create a new knowledge base:
clip_image004[7]
2.) Input a name for this knowledge base and select “Knowledge Discovery” to start building the base. In the next window we need to define the source for the base and map its columns to domains, which can later be used as a reference to cleanse columns of these types:
clip_image006[6]
3.) Click on the new-domain icon beside “Mappings” to create a domain. Here we create a “City” domain and then, similarly, a “State” domain:
clip_image008[6]
4.) Once the domains are created, map the column names to the domains to start discovering the values:
clip_image010[6]
5.) On the next window, click the “Start” button to start analyzing the source and checking the domains for any new values. Since this is the first time we are creating the knowledge base, all the values will be identified as new values for the domains:
clip_image012[6]
clip_image014[6]
clip_image016[6]
6.) On clicking “Next”, we can see both domains and all the values that have now been added to them from the Excel file. We also see that it automatically corrected one of the state values (“Andra Pradesh”) to another (“Andhra Pradesh”). It corrected it to “Andhra Pradesh” because that value came first in the Excel list.
clip_image018[6]
7.) Select “Error” from the drop-down to see all the values that were rectified to a new value:
clip_image020[6]
8.) Once the data discovery is done, we need to publish the knowledge base so that other projects can start using it to cleanse their sources. Click “Publish” and then go through the remaining windows to publish it:
clip_image022[6]
clip_image024[6]
9.) Once the knowledge base is published, you can re-open it and check it. We can also add new values to it manually, or again select some source and add newly detected values from it to grow the knowledge base.
clip_image026[6]
clip_image028[6]
clip_image030[6]
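The first-value-wins correction seen in step 6 can be sketched as a discovery pass that folds each near-duplicate into the earliest similar value already in the domain. `difflib.get_close_matches` and the 0.9 cutoff here are stand-ins for DQS's internal matcher, not its actual algorithm:

```python
import difflib

# State values in the order they appear in the Excel source;
# "Andhra Pradesh" comes first, so its spelling wins.
discovered_states = ["Andhra Pradesh", "Karnataka", "Andra Pradesh", "Maharashtra"]

def discover(values, threshold=0.9):
    """Fold near-duplicates into the first similar value seen."""
    domain, corrections = [], {}
    for value in values:
        match = difflib.get_close_matches(value, domain, n=1, cutoff=threshold)
        if match:
            corrections[value] = match[0]  # corrected to the earlier spelling
        else:
            domain.append(value)           # genuinely new domain value
    return domain, corrections

domain, corrections = discover(discovered_states)
print(domain)       # ['Andhra Pradesh', 'Karnataka', 'Maharashtra']
print(corrections)  # {'Andra Pradesh': 'Andhra Pradesh'}
```

Because discovery processes values in source order, whichever spelling appears first becomes the domain value and later near-matches are rectified to it, which is exactly why “Andra Pradesh” was corrected to “Andhra Pradesh” above.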
Conclusion
So now we are ready with the knowledge base, and in future blog posts I will experiment with how we can use this base to cleanse sources while transforming them to the destination.