Cleaning data is no different from many types of maintenance - just as disorganization creates disruptions in life, working with unclean data can be a recipe for disaster for an enterprise. Wasting time searching for clean data, making bad decisions, running processes inefficiently, and incurring regulatory fines are just a few of the mishaps that can confront an organization with dirty data. A company cannot be nimble enough to react to sudden customer requests when it’s busy searching for data that is hard to find. Time is money, and wasted effort is a lost business opportunity.
So how does one know when data is dirty? Dirty data can be laced with minor mistakes - it can have missing values, misspellings, word transpositions, and misfielded values. Dirty data can also be repetitive, such as two copies of the same folder stored in different places, so that each copy appears unique and neither can be trusted as accurate. Additionally, data can be dirty if it’s improperly lumped together. Two sources that share the same last name could be treated as identical, which can be either a life-saver or detrimental.
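As a rough illustration of spotting the problems described above, the following sketch flags missing values and likely duplicates in a small set of records. The records, field names, and helper functions (`missing_fields`, `normalized_key`, `profile`) are all hypothetical, and the duplicate rule (same name and email once case and spacing are ignored) is only one assumption among many possible ones.

```python
from collections import Counter

# Hypothetical customer records; the field names are illustrative only.
records = [
    {"name": "Ann Lee", "email": "ann@example.com", "city": "Boston"},
    {"name": "ann lee ", "email": "ANN@example.com", "city": "Boston"},
    {"name": "Raj Patel", "email": "", "city": "Austin"},
    {"name": "Mia Chen", "email": "mia@example.com", "city": None},
]

def missing_fields(record):
    """Return the names of fields whose value is None or blank."""
    return [key for key, value in record.items()
            if value is None or str(value).strip() == ""]

def normalized_key(record):
    """Build a comparison key that ignores case and stray whitespace."""
    name = " ".join(record["name"].split()).lower()
    email = (record["email"] or "").strip().lower()
    return (name, email)

def profile(records):
    """Summarize missing values and likely duplicates in a list of records."""
    missing = {}
    for index, record in enumerate(records):
        gaps = missing_fields(record)
        if gaps:
            missing[index] = gaps
    counts = Counter(normalized_key(record) for record in records)
    duplicates = [key for key, seen in counts.items() if seen > 1]
    return {"missing": missing, "duplicates": duplicates}

report = profile(records)
```

Here `report["missing"]` maps a record's position to its empty fields, and `report["duplicates"]` lists the normalized keys that occur more than once. A key that is too loose, such as last name alone, would produce exactly the improper lumping described above.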
Similar to dust ending up all over the house, bad data can accumulate as it migrates from place to place. By embedding and enforcing good data hygiene, a company can prevent errors that affect the speed and accuracy of business decisions and processes.
When data cleaning was previously done by the human eye, dirty data was easier to recognize. With the addition of data cleaning software, some of these dirty spots might not get caught, but it’s better than doing all of the cleaning by hand.
Data cleaning software has the following goals:
- Parse, standardize, and correct data from any source, domain, or type
- Validate data according to business rules and requirements
- Enrich data with internal or external data sources to fill gaps within data already in possession
- Match and consolidate data by embedding data duplication checks directly into workflows or applications
- Perform data quality checks on data sets any time, in real time, before analyzing, moving, or integrating data
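To make these goals a little more concrete, here is a minimal sketch of standardizing, enriching, validating, and consolidating a few records. It does not represent any particular product. The reference table `CITY_TO_STATE`, the field names, and the validation rules are assumptions chosen for illustration.

```python
import re

# Hypothetical reference data used to fill in a missing state.
CITY_TO_STATE = {"boston": "MA", "austin": "TX"}

def standardize(record):
    """Trim whitespace, normalize case, and keep only digits in the phone."""
    return {
        "name": " ".join(record.get("name", "").split()).title(),
        "phone": re.sub(r"\D", "", record.get("phone", "")),
        "city": record.get("city", "").strip().title(),
        "state": record.get("state", "").strip().upper(),
    }

def enrich(record):
    """Fill a missing state from the reference table when the city is known."""
    if not record["state"]:
        state = CITY_TO_STATE.get(record["city"].lower(), "")
        record = {**record, "state": state}
    return record

def validate(record):
    """Return a list of rule violations; these rules are assumed examples."""
    errors = []
    if not record["name"]:
        errors.append("name is required")
    if len(record["phone"]) != 10:
        errors.append("phone must have 10 digits")
    if len(record["state"]) != 2:
        errors.append("state must be a 2-letter code")
    return errors

def consolidate(records):
    """Keep one record per (name, phone) pair, preferring the first seen."""
    unique = {}
    for record in records:
        unique.setdefault((record["name"], record["phone"]), record)
    return list(unique.values())

raw = [
    {"name": "  ann   lee", "phone": "(617) 555-0100", "city": "boston", "state": ""},
    {"name": "Ann Lee", "phone": "617.555.0100", "city": "Boston", "state": "ma"},
    {"name": "Raj Patel", "phone": "555-0199", "city": "Austin", "state": "TX"},
]

cleaned = consolidate([enrich(standardize(r)) for r in raw])
problems = {r["name"]: validate(r) for r in cleaned if validate(r)}
```

Standardizing before consolidating matters in this sketch: the two Ann Lee rows only match once spacing, case, and phone punctuation have been normalized. Records that fail validation are reported in `problems` rather than silently dropped, so the people who know the data can decide what to do with them.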
With this many objectives, one can conclude that cleaning all data is not only a waste of time but also nearly impossible.
Even with the largest IT department imaginable, the data cleaning job has to be a company-wide initiative. Self-service data cleaning gets the data to the people who know what it should look like. The IT department shouldn’t be cleaning the sales department’s data - the sales department should be cleaning and monitoring the sales-related data with the right tools and expertise.
Additionally, one data cleaning process won’t work for all data. The organization must employ self-service governance in order to determine what needs to be “cleaned.”
Prioritization is also key to a cleaning process - know what needs to be cleaned each week and what can wait a few months. A company should pick a slow time of year to check where each department stands. Are things running like they should? Is everything easy to find, or is clean data laced with discrepancies?
Data should be high-quality from the very beginning. It needs to be relevant, accurate, complete, and consistent. Maintaining that quality minimizes room for errors or loss of information, and any that do occur need to be corrected as soon as they are discovered.
With data, it’s easiest to clean in real time so that no one re-cleans things that are already spick-and-span. Validation and correction are crucial factors in data cleaning.
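One way to picture cleaning at the point of entry, sketched under simple assumptions: each record is corrected as it arrives and marked so that it is not cleaned a second time. The `_clean` flag, the `clean_on_entry` function, and the whitespace-only correction are hypothetical choices for illustration.

```python
def clean_value(value):
    """Collapse repeated whitespace and trim the ends of a text value."""
    return " ".join(value.split())

def clean_on_entry(record, store):
    """Clean a record as it arrives, mark it clean, and add it to the store.

    Records already marked clean are stored as-is, so no work is repeated.
    """
    if record.get("_clean"):
        store.append(record)
        return record
    cleaned = {
        key: clean_value(value) if isinstance(value, str) else value
        for key, value in record.items()
    }
    cleaned["_clean"] = True  # hypothetical marker meaning "already cleaned"
    store.append(cleaned)
    return cleaned

store = []
first = clean_on_entry({"name": "  Ann   Lee ", "visits": 3}, store)
second = clean_on_entry(first, store)  # already clean, so passed through
```

In practice the correction step would include whatever validation rules the owning department defines. The flag only keeps later steps from redoing work on records that have not changed.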
Some good rules of thumb:
- Always clean “data” after or during data entry, before running a report, moving data in or out of the data warehouse, when updating master data, etc.
- Clean “data” when it’s noticeably dirty.
- Always clean “data” before consuming.
- Don’t spread “unclean data” onto other people.
- Refrain from spreading dirty “data” to other areas of the organization.
- Avoid touching anything with unclean “data.”
Dirty data is harmful to business and potentially unsafe, but with due diligence and proper data management, a company can stop dirty data from accumulating - enhancing the workflow of all lines of business. With each department knowing its own data, processes are streamlined and IT is free to take care of immediate tech issues. It’s a clean win for all.