Tuesday, June 24, 2014

An Introduction to Data Quality Services - SQL Server 2012 - Creating first Knowledge Base

Introduction to Data cleansing
Any database tables or data stored in excel files can start getting discrepant once they start coming from different sources. They might refer to some common object but due to some differences in spellings would lead to be detected as completely separate items. For e.g., Excel1 might contain Customer1 city as “Bangalore” and Excel2 might state it as “Banglore”. Even though they meant to point to the same city, due to a manual error they would not be seen by any system as different cities and invalid as well. So before we can feed these sources into any system, we are required to cleanse them so that there is no data discrepancy around. Microsoft SQL Server 2012 has introduced a great feature for this purpose. It’s called “SQL Server Data Quality Services”.
Installation SQL Server Data Quality Services
This feature is not installed by default but its installer. We need to run its installer “Data Quality Server Installer” which would be present in the SQL Server directories or in the programs. The service would get installed in about 10 minutes and we should be ready to start using it.
Everyone needs a Knowledge Base
For any data cleansing activities, we would really need to know as to what the correct value should be instead. So we actually need to create a knowledge base here, which could be used as a reference to the source to clean them. This is called a Knowledge base.
Sample source for Knowledge Base:
We need a source to create a Knowledge base. We can also create the knowledge base by manually entering the value. But for now I would use a source as it would be quicker than manually typing in the base. Consider the following list of states and cities which we would use as our knowledge base and clean other sources containing customer data using this.
clip_image002[7]
Steps to create a New Knowledge Base
1.) Go to Programs and start the “SQL Server Data Quality Services”. It would open a UI and then select “New Knowledge base” from it after selecting the server where this services is installed. The following window will open to create a new knowledge base
clip_image004[7]
2.) Input a name for this Knowledge base and select “Knowledge Discovery” to start building the base. I the next window we need to define the Source of base and also map the columns as universal domains which could be used as a reference to cleanse these types of columns
clip_image006[6]
3.) Click o the new domain icon besides “Mappings” to create a domain. Here we need to create a city domain and then similarly a State domain:
clip_image008[6]
4.) Once the domains are created, map the column names to the domains to start discovering the values
clip_image010[6]
5.) On the next window we can click the “Start”button to start analyzing the source and checking in the domain if there are any new vales in it. Since this is the first time we are creating the knowledge base, all the values will be identifies as New values for the domain:
clip_image012[6]
clip_image014[6]
clip_image016[6]
6.) On client “Next”, we can see that both the domains and all the values that has now been added into them from the excel. We also see that its automatically corrected once of the state values (“Andra Pradesh”) to another (“Andhra Pradesh”). It corrected it to “Andhra Pradesh as that came first in the excel list”.
clip_image018[6]
7.) Select “Error” from drop down to see all the values that were rectified to a new value:
clip_image020[6]
8.) Once the data discovery is done, we need to publish the Knowledge base so that other projects can start using this base to cleanse the sources. Click on “Publish” and then go through other windows to publish it:
clip_image022[6]
clip_image024[6]
9.) Once the Knowledge base is published, you can re-open it and check. We can also add new values to it manually or again select some source and try adding newly detected values from it to increase the knowledge base.
clip_image026[6]
clip_image028[6]
clip_image030[6]
Conclusion
So now we are ready with the knowledge base and in future blogs I will experiment on how we can use this base to cleanse the Sources while transforming it to the destination.

































No comments: