The ultimate purpose of data is to drive decisions. But data isn’t as reliable or accurate as we want to believe, and this leads to a most undesirable result: bad data means bad decisions. As data professionals, part of our mission is to make data “good enough” for use by others. We spend time scrubbing and cleaning data to make it consumable by other teams. It may seem like a never-ending task, and that’s because it is.
The Origins of Dirty Data
The root cause of dirty data is simple: human intervention. If not for humans, data would be clean, perfect, and clear. We change standards and business requirements and forget why we were collecting some metrics in the first place. Some of the code we write will fail at some point, and we fat-finger data entry from time to time.
As a result, we end up with the following situations:
Duplicate data: a single event recorded and entered into a dataset twice.
Missing data: fields that should contain values but don’t.
Invalid data: information that isn’t entered correctly or isn’t maintained.
Bad data: typos, transpositions, and variations in spelling or formatting (say hello to Unicode!).
Inappropriate data: data entered in the wrong field.
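To make these categories concrete, here is a small, hypothetical table in pandas (just one tool option; every column name and value is made up for illustration) that packs several of these problems into four rows:

```python
# A hypothetical example table; every name and value is made up for illustration.
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1001, 1001, 1002, 1003],      # duplicate data: 1001 entered twice
    "item": ["soda", "soda", "pop", "chps"],   # bad data: a regional term and a typo
    "amount": [1.99, 1.99, np.nan, 3.49],      # missing data: no value recorded
    "store": ["12", "12", "twelve", "12"],     # invalid data: text in a numeric field
})
print(orders)
```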
Data Cleaning Techniques
Remember, all data is dirty; you won’t be able to make all of it perfect. Your focus should be on making it good enough to pass along to the next person. The first step is to examine the data and ask yourself, “Does this data make sense?” Data should tell a story or answer a question. Make sure your data does, too. Then, before you make even the smallest change, make a copy, or backup, of your data. I can’t stress this enough.
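If you happen to be working in Python with pandas (one option among many; the file names here are made up), that backup can be a couple of lines at the top of your notebook:

```python
# A minimal sketch, assuming the raw data lives in a CSV named sales.csv (a made-up name).
import pandas as pd

raw = pd.read_csv("sales.csv")               # the untouched original
raw.to_csv("sales_backup.csv", index=False)  # a backup on disk before any change
df = raw.copy()                              # all cleaning happens on the copy
```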
Depending on your dataset and the specific issues within it, here are some data cleaning techniques you might use.
Identify and remove duplicate data: Tools such as Microsoft Excel and Microsoft Power BI make this easy. Of course, you’ll need to know whether the rows are true duplicates or two independent observations. For relational databases, we often use primary keys to enforce the uniqueness of records. But such constraints aren’t available for every system logging data.
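Here’s a minimal pandas sketch, assuming a hypothetical orders table where order_id should behave like a primary key:

```python
# A minimal sketch; the orders table and its columns are assumptions for illustration.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "item": ["soda", "chips", "chips", "soda"],
    "amount": [1.99, 3.49, 3.49, 1.99],
})

# Review the suspected duplicates before deleting anything.
print(orders[orders.duplicated(subset="order_id", keep=False)])

# Keep the first occurrence of each order_id, mimicking a primary key constraint.
deduped = orders.drop_duplicates(subset="order_id", keep="first")
```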
Remove data that doesn’t fit: If the data doesn’t help you answer the question you’re asking, remove it. For example, if you’re analyzing store sales, you might want to focus on sales of a specific item or category of items.
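A quick sketch of that kind of filtering in pandas, with made-up column names and categories:

```python
# A minimal sketch of filtering to the question at hand; column names are assumptions.
import pandas as pd

sales = pd.DataFrame({
    "item": ["soda", "chips", "soda", "candy"],
    "category": ["beverage", "snack", "beverage", "snack"],
    "amount": [1.99, 3.49, 2.49, 0.99],
})

# Keep only the rows that help answer the question, e.g., beverage sales.
beverage_sales = sales[sales["category"] == "beverage"]
print(beverage_sales)
```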
Identify and fix issues with spelling, etc.: There are many ways to manipulate strings to help get your data formatted and looking pretty. For example, you could use the TRIM function to remove extra spaces from the text in a column, then sort the data and look for details such as capitalization and spelling. Keep in mind the use of regional terms, though, such as calling a sugary beverage “pop” versus the correct term, “soda.”
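Here’s a rough pandas equivalent of that trim-and-standardize workflow; the pop-to-soda mapping is just an illustration of handling regional terms:

```python
# A rough sketch of string cleanup in pandas; the values and mapping are illustrative.
import pandas as pd

items = pd.Series(["  Soda", "soda ", "POP", "Chips"])

cleaned = (
    items.str.strip()               # like TRIM: drop leading/trailing spaces
         .str.lower()               # standardize capitalization
         .replace({"pop": "soda"})  # map the regional term to one spelling
)
print(cleaned.sort_values())        # sorting makes leftover variants easy to spot
```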
Normalize data: Set a standard for the data. If the data is a number, make sure it’s stored as a number, not text. If it’s categorical, make sure every entry is a valid value for that category. Spelling, capitalization, etc., are all ways to set standards and normalize data to some degree.
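A small sketch of what enforcing types and categories might look like in pandas (the column names and allowed category values are assumptions):

```python
# A sketch of enforcing types and categories; names and allowed values are assumptions.
import pandas as pd

df = pd.DataFrame({
    "amount": ["1.99", "3.49", "two"],     # numbers stored as text
    "size": ["Small", "LARGE", "medium"],  # inconsistent categorical entries
})

# Coerce text to numbers; anything that can't be parsed becomes NaN for review.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Standardize capitalization, then restrict entries to the allowed category values.
df["size"] = pd.Categorical(df["size"].str.lower(),
                            categories=["small", "medium", "large"])
print(df.dtypes)
```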
Remove outliers: But only when it makes sense to do so! If the outlier was due to poor collection, it could be safe to remove. Hammond’s Law states, “Ninety percent of the time, the next measurement will fall outside the 90% confidence interval.” Be mindful that outliers and database anomalies are innocent until proven guilty. And if you do remove an outlier, be aware that some statistics (such as the median) are resistant to outliers and others (such as the mean) aren’t.
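If you do go hunting for outliers, one common convention (not the only one, and not from this article) is the 1.5 × IQR rule; here’s a sketch in pandas:

```python
# A sketch of the common 1.5 * IQR rule for flagging outliers; the data is made up.
import pandas as pd

amounts = pd.Series([1.99, 2.49, 2.99, 3.49, 250.00])

q1, q3 = amounts.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag rather than silently delete, so the decision can be reviewed and documented.
print(amounts[(amounts < lower) | (amounts > upper)])
```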
Fix missing values: You have two options here. You can either remove the record or update the missing value. Yes, this is how we get faux null values. You’re going to need to make some hard choices, so make certain you document what you’re doing. A great way to do this is to use Jupyter notebooks. For categorical data, you might decide to use a word such as “missing.” For numerical data, you might want to use a 0, an average, or perhaps values drawn from a random distribution. Again, track every change you make. I avoid using faux nulls for any data unless it makes sense to note the absence of information collected.
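A short pandas sketch of both options; the fill values here are illustrative choices you would want to document, not recommendations:

```python
# A sketch of both options; fill values are illustrative choices to document, not rules.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": ["beverage", None, "snack"],
    "amount": [1.99, np.nan, 3.49],
})

# Option 1: remove records with missing fields.
dropped = df.dropna()

# Option 2: fill in faux values, and record the decision (e.g., in a Jupyter notebook).
filled = df.fillna({"category": "missing", "amount": df["amount"].mean()})
```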
A Never-Ending Process
We all fall into the same trap: We don’t have time to do it right in the first place but somehow think there will be time to fix it later. When it comes to data, it’s a never-ending process of curating, consuming, and cleaning. And no matter how much you scrub your data, it’ll never be perfectly clean, but it can be good enough. Maybe we didn’t go to school to become data janitors, but here we are. We understand that while all data is dirty, data is still useful. It just needs a good cleaning.