Just because data is dirty does not mean it's garbage.
There is a common saying among people who work regularly with data that "80% of the time spent working with data is spent cleaning it", and indeed it is often an arduous task that is not only time-consuming but also error-prone. The CLEAN step of the Data Pipeline works as a group with the VERIFY and ANALYSE steps: while simple datasets may only need to go through VERIFY, CLEAN and ANALYSE once, more complex datasets will see you go back and forth between the three.
Data cleaning can be broken down into three activities:
1. Data tidying is the process of cleaning the structure of the data, without touching its content, following the principles of tidy data.
2. Data editing is the process of modifying the content of the data to prepare it for your analysis. This involves correcting problems with the values stored in the dataset.
3. Data consolidation is the process of adding complementary data to your main dataset. This step is an opportunity to complement or extend the dataset that you have collected, verified and cleaned (see the sketch after this list). The goal may be to produce a more complete analysis thanks to the addition of a new variable, or simply to consolidate in one dataset all the data that you will need to answer your research questions.
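As a minimal sketch of the consolidation step, the pandas snippet below merges a hypothetical main dataset with a complementary one on a shared key. All column names and figures are illustrative, not taken from a real source, and pandas is just one possible tool for this:

```python
import pandas as pd

# Hypothetical main dataset: number of health facilities per municipality.
facilities = pd.DataFrame({
    "municipality": ["Antipolo", "Tagaytay", "Calamba"],
    "facility_count": [12, 5, 9],
})

# Hypothetical complementary dataset: population per municipality
# (illustrative figures only).
population = pd.DataFrame({
    "municipality": ["Antipolo", "Tagaytay", "Calamba"],
    "population": [880_000, 85_000, 540_000],
})

# Consolidation: bring the extra variable into the main dataset so that a
# richer analysis (here, facilities per 100,000 people) becomes possible.
consolidated = facilities.merge(population, on="municipality", how="left")
consolidated["facilities_per_100k"] = (
    consolidated["facility_count"] / consolidated["population"] * 100_000
)
print(consolidated)
```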
There are two general kinds of problems with the data that you might encounter when cleaning it:
1. Formatting problems - problems related to HOW the dataset is structured. These are usually resolved in the data tidying step.
2. Content problems - problems related to WHAT is written or stored in the dataset. These are usually resolved in the data editing step.
Examples of formatting problems include:
Type of problem | Problem | Example | Suggested solution |
---|---|---|---|
Formatting problem | A single data point is stored in multiple columns. | Address is saved as separate BARANGAY, MUNICIPALITY, and PROVINCE columns but you need it as a single string. | Concatenate the values into a single column. |
Formatting problem | Separate data points are stored in a single column. | Address is saved as a single string but you need separate BARANGAY, MUNICIPALITY, and PROVINCE columns. | Split the column into several columns. |
Formatting problem | Multi-line or merged headers | Population data divided by age and sex where one row is used for age categories and another row is used for sex categories. NOTE: This might be useful for making the data more human-readable, but it can easily become problematic for analysis, especially in applications that expect a single-line header. | Convert the headers into a single line. |
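The pandas sketch below illustrates one way to apply the suggested solutions for these three formatting problems. The column names and values are hypothetical, and the specific functions used (`agg`, `str.split`, flattening a `MultiIndex`) are one possible approach rather than the only one:

```python
import pandas as pd

# Hypothetical address data illustrating the first two formatting problems.
df = pd.DataFrame({
    "BARANGAY": ["San Roque", "Poblacion"],
    "MUNICIPALITY": ["Antipolo", "Tagaytay"],
    "PROVINCE": ["Rizal", "Cavite"],
})

# 1. One data point spread over several columns: concatenate into one string.
df["ADDRESS"] = df[["BARANGAY", "MUNICIPALITY", "PROVINCE"]].agg(", ".join, axis=1)

# 2. Several data points in one column: split the string back into parts.
parts = df["ADDRESS"].str.split(", ", expand=True)
parts.columns = ["BARANGAY", "MUNICIPALITY", "PROVINCE"]

# 3. Multi-line headers (age x sex): flatten a two-row header into one line.
multi = pd.DataFrame(
    [[10, 12, 11, 13], [20, 25, 22, 27]],
    columns=pd.MultiIndex.from_product([["0-14", "15-64"], ["F", "M"]]),
)
multi.columns = ["_".join(col) for col in multi.columns]
print(multi.columns.tolist())  # ['0-14_F', '0-14_M', '15-64_F', '15-64_M']
```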
Examples of content problems include:
Type of problem | Problem | Suggested solution |
---|---|---|
Content problem | Extra white spaces in the cell value | Trim/remove the extra white spaces. |
Content problem | Wrong or inconsistent use of case (lower, UPPER, CamelCase) | Convert the text to the case of choice (lowercase, UPPERCASE, Proper Case). |
Content problem | Spelling mistakes | Correct the spelling. You can use a spell checker. |
Content problem | Missing, incorrect, or blank values | Find, verify, and add the correct values. |
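Here is a short pandas sketch of these content fixes, using a made-up `province` column; the spelling correction relies on an explicit mapping you would build after verifying the data, not on an automatic spell checker:

```python
import pandas as pd

# Made-up column showing the content problems from the table above.
df = pd.DataFrame({"province": ["  Rizal", "CAVITE", "cavte", None, "laguna  "]})

# Extra white space: trim it.
df["province"] = df["province"].str.strip()

# Inconsistent case: convert to the case of choice (Proper Case here).
df["province"] = df["province"].str.title()

# Spelling mistakes: correct them with an explicit, verified mapping.
df["province"] = df["province"].replace({"Cavte": "Cavite"})

# Missing values: flag them so the correct values can be found and added.
print(df[df["province"].isna()])
```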
A lot of data work involves small datasets processed in spreadsheets. Although spreadsheets are convenient, they do not keep a history of the modifications applied to the dataset, which leads to problems if a mistake is made along the way. An easy way to address this is to create a new tab every time an important modification is applied to the data. This technique is especially useful for data cleaning.
More tech-savvy data practitioners might use a version control system such as Git, hosted on GitHub or GitLab, for the same purpose.