Cleansing the data from a source
ETL processes typically pull data from a large number of sources. We need to cleanse the data from all these sources into a common, clean form by comparing it against a knowledge base that was built beforehand. Building a knowledge base is an incremental process. For example, take the following Excel file, whose City column now needs to be cleansed by comparing it against a knowledge base:
Steps to cleanse the above source file
1.) Open the “SQL Server Data Quality Services” client and click on “New Data Quality Project” to start the cleansing process:
2.) Enter a name for the project, select the knowledge base that should be used as a reference for cleaning the source data, and select the Cleansing activity:
3.) Select the source file in the next window, along with the column that needs to be cleansed. We also need to select the knowledge base domain to validate against. Here we have selected the “City” domain of the “Customer” knowledge base, which was built in the previous blog post:
4.) Click on Start to begin cleansing. The process analyzes the source file against the domain and either corrects values by itself or gives a suggestion:
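Conceptually, the decision between “correct it by itself”, “give a suggestion”, and “flag as new” is driven by two confidence thresholds, which are configurable in the DQS Administration settings (the defaults are roughly 80% for auto-correction and 70% for suggestions). A minimal Python sketch of that decision rule, where `score` stands in for DQS's internal matching score (the real scoring algorithm is not reproduced here):

```python
def classify(score: float,
             auto_correct_min: float = 0.8,  # mirrors DQS "Min score for auto corrections"
             suggestion_min: float = 0.7):   # mirrors DQS "Min score for suggestions"
    """Illustrative version of the cleansing decision rule."""
    if score >= auto_correct_min:
        return "Corrected"   # applied automatically; the operator can still reject it
    if score >= suggestion_min:
        return "Suggested"   # shown to the operator for approval
    return "New"             # no close enough match in the domain

# A value matching a domain entry with score 0.87 lands in the Corrected tab:
print(classify(0.87))  # -> Corrected
print(classify(0.72))  # -> Suggested
print(classify(0.40))  # -> New
```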
5.) Here we see that one record was identified as a new value for “City”: no such value exists in the knowledge base, and DQS could not associate it with any existing city in the domain (otherwise it would have corrected it automatically or suggested which city it might refer to). Since it is shown as a new value, it is now the validator's job to either accept it as-is or map it to one of the existing cities. I know that “Dilhi” is a spelling mistake and should be “Delhi”, so I enter the correct value in the “Correct to” field and approve it. The record then moves to the Corrected tab:
6.) We also see that a record is already present in the Corrected tab. The city given in the source was “Bangaloore”; Data Quality Services analyzed it and mapped it to the corrected value “Bangalore” with a confidence level of 87%, which is quite high. The operator still has control over whether to accept this suggestion, by approving or rejecting it.
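DQS computes that 87% confidence with its own matching algorithm, but the general idea of scoring string similarity can be illustrated with Python's standard-library `difflib` (the ratio below is an analogous measure, not the figure DQS reports):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Ratio of matching characters to total characters (0.0 to 1.0)
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

score = similarity("Bangaloore", "Bangalore")
print(round(score, 3))  # -> 0.947: close enough to propose "Bangalore" as the correction
```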
7.) The corrected data can then be exported to either Excel or a SQL Server table.
8.) The final corrected source:
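Once the corrected data is exported, a downstream process can split it into clean rows and rows that were changed. A minimal Python sketch, assuming a CSV export with `City_Source`, `City_Output`, and `City_Status` columns (the exact column names and status values depend on the export options chosen, so treat these as placeholders):

```python
import csv
import io

# Hypothetical sample of an exported cleansing result
export = io.StringIO(
    "City_Source,City_Output,City_Status\n"
    "Bangaloore,Bangalore,Corrected\n"
    "Dilhi,Delhi,Corrected\n"
    "Mumbai,Mumbai,Correct\n"
)

rows = list(csv.DictReader(export))
# Keep only the rows DQS actually changed, for auditing
corrected = [r for r in rows if r["City_Status"] == "Corrected"]
for r in corrected:
    print(f'{r["City_Source"]} -> {r["City_Output"]}')
```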
Conclusion:
Any source can be passed through Data Quality Services and cleansed against a knowledge base before it moves to the next stage of an ETL process, or as an independent step. We can also use the same source files as new training input for the knowledge base and, if we want, add the new values found in them to it. I will cover this in another blog post.