by Marty Shaw CSM, PSM
Major Accounts - Global Solutions
As with most complex topics, a good place to begin is with a definition of terms, since different people often define the same terms somewhat differently.
What do you mean by “data cleansing”?
The term “data cleansing” refers to the processes and procedures whose goal is high quality data, or a superior level of “data quality.” That raises the question: what is “data quality”? In general terms, data quality means having data in a form that is fit for its intended purpose. For example, if you are putting a piece of printed matter into the mail stream addressed to London, UK, the address must be correct, in the correct format and, of course, deliverable. The same is true of an email address: it should be formatted correctly and deliverable to the intended recipient. Virtually any data element must be fit for its intended purpose to be deemed high quality data. The way to achieve high quality data is through proper data capture and data cleansing techniques, which help ensure the data is fit for the purpose you intended.
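To make “fit for purpose” concrete, here is a minimal sketch of a syntactic email check in Python. The pattern is an illustrative assumption, not a full RFC 5322 validator, and a passing format check says nothing about actual deliverability, which requires mailbox-level verification:

```python
import re

# A deliberately loose pattern: one "@", a non-empty local part, and a
# domain containing at least one dot. Real validators are far stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email(address: str) -> bool:
    """Return True if the address passes a basic syntactic check."""
    return bool(EMAIL_PATTERN.match(address.strip()))

print(looks_like_email("jane.doe@example.co.uk"))  # True
print(looks_like_email("jane.doe@@example.com"))   # False: two "@" signs
```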
What are the best practices for data cleansing?
Best practices can vary with the use case, but ideally they begin with the discussion, design, and implementation of an overall data cleansing strategy. That strategy guides the tactics, and the tactics lead to the intended outcomes when they are tracked closely throughout the process and adjusted as needed.
Let’s say, for example, your organization’s desired outcome is the integration of disparate data sets spread across data warehouses in numerous countries.
The best practices for integrating data silos fall into the following steps:
- Data gathering
- Data parsing
- Postal address hygiene
- Email address, phone number, and other data hygiene (if available)
- Matching and merging (duplicate reconciliation)
- Managing metadata
- Creating a system of reference
For a closer look into the best practices for each of the steps listed above, see the companion article “What Are The Best Practices For Data Silo Integration?”
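As an illustration of the matching and merging step, here is a simplified sketch using Python’s standard-library difflib for fuzzy comparison. The record fields, the blocking on city, and the 0.85 threshold are all assumptions for illustration; production matching engines use far more sophisticated scoring:

```python
from difflib import SequenceMatcher

# Illustrative records; the field names and values are invented.
records = [
    {"id": 1, "name": "Acme Industries Ltd", "city": "London"},
    {"id": 2, "name": "ACME Industries Limited", "city": "London"},
    {"id": 3, "name": "Globex Corporation", "city": "Berlin"},
]

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.85  # tune per data set: higher means fewer false merges

# Pairwise comparison is O(n^2): fine for a sketch, not for millions of
# rows, where records are first "blocked" into smaller candidate groups.
for i, left in enumerate(records):
    for right in records[i + 1:]:
        if left["city"] == right["city"]:
            score = similarity(left["name"], right["name"])
            if score >= THRESHOLD:
                print(f"Probable duplicate: id {left['id']} ~ id {right['id']} ({score:.2f})")
```

In this toy data set, records 1 and 2 are flagged as a probable duplicate pair; a human or a survivorship rule would then decide which values to keep in the merged record.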
These seven steps may vary somewhat depending on your particular use case, though most projects will employ some, if not all, of them. As mentioned earlier, the overall data cleansing strategy will guide which best practices to use, helping ensure your data is fit for its intended purposes.
What is data cleansing in statistics, its importance and benefits?
The subject matter expertise of the data scientist is key to valid data analysis. These experts apply years of experience to your particular data quality challenges. The “science” part of the data scientist’s toolbox includes, among other things, statistical analysis of the data. What hypotheses are in play for the data set? What patterns emerge as the data is analyzed? Data scientists look at where the biggest bang for the buck can be realized in a data cleansing project. They would call it “frequency distribution” or “long tail” analysis: the focus is on improving as much of the data as possible toward the “head” of the distribution, and less so in the “tail.” For a deeper dive, the Wikipedia article on the long tail can help. The importance of statistical analysis is that it focuses effort on where the biggest data “pain points” exist, and the benefit is the highest return on your data quality investment.
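To make the frequency-distribution idea concrete, here is a minimal sketch that tallies hypothetical validation error codes and ranks them, so cleansing effort can be aimed at the head of the distribution. The error codes and counts are invented for illustration:

```python
from collections import Counter

# Hypothetical error codes flagged during address hygiene; in practice
# these would come from your validation tooling.
error_log = [
    "missing_postcode", "missing_postcode", "bad_country_code",
    "missing_postcode", "invalid_street", "missing_postcode",
    "bad_country_code", "truncated_city",
]

frequency = Counter(error_log)

# The "head" of the distribution: the few error types that account for
# most of the volume, and therefore the best cleansing return.
for error_type, count in frequency.most_common():
    share = count / len(error_log)
    print(f"{error_type:<18} {count:>3}  ({share:.0%})")
```

Here “missing_postcode” accounts for half of all errors, so fixing postcode capture first delivers the largest single improvement.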
Why do we cleanse data?
To use the classic phrase: “Garbage in, garbage out.” Your organization’s data is an asset from which you generate much of your revenue. If your data is better, your decisions are better, and your resulting revenue is greater. As with any asset, it is important to maintain data at its peak performance level to ensure it has a long and productive life. So, how do you implement the best data cleansing services? The answer lies in the framework identified above, with the nuances crafted to your organization’s individual technical and business objectives.