
Verify

How bad is the data?

All data are, in one way or another, bad. Just as it is rare to find data that is magically available to analyze, it is even rarer to find good-quality data that automatically fits your needs (unless you are the one who collects the data, though that can be problematic too).

The verify step involves assessing how appropriate the data is for your project, which you can do by checking its internal consistency, checking for outliers, and consulting the metadata when available.

Things to consider

Data trustworthiness, completeness, and quality

Data trustworthiness refers to how much we can trust the data to represent what it says it does.

How do you know if you can rely on what the data says? Data generated by a low quality or faulty sensor may not be trustworthy. A dataset collected by a professional survey enumerator may be more trustworthy than one collected by an amateur. Whether it is sensors or humans behind the data, assessing the trustworthiness of the data requires an understanding of the methodology and choices behind the creation of the dataset.

Data completeness refers to how much the data covers the reality it tries to represent.

A dataset may be incomplete because of inconsistent data collection practices or because key elements are absent: a social survey dataset that does not record gender, for example, may deprive analysts of a real understanding of the social dynamics the survey was trying to capture. Once again, understanding the methodology behind the data collection, as well as the topic the dataset covers, is essential to assess whether the data is complete (enough).
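One quick way to get a first impression of completeness is to check that the columns you expect are present and to count missing values. A minimal sketch with pandas, where the file name and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical survey file; replace with your own data.
df = pd.read_csv("survey.csv")

# Check that the columns you expect are actually present.
expected_columns = {"respondent_id", "age", "gender", "district"}
missing_columns = expected_columns - set(df.columns)
print("Missing columns:", missing_columns or "none")

# Count missing values per column to spot patchy data collection.
print(df.isna().sum())
```

This only tells you about gaps inside the file, not about what the collection process left out entirely, which is why reading the methodology still matters.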

Data quality refers to how well the dataset is structured and documented.

An example of a well-structured dataset is one that follows tidy data principles. Other principles apply to non-tabular data such as JSON, but someone working with that kind of data will most likely already know them.
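To illustrate the tidy data idea, here is a minimal sketch that reshapes a "wide" table (one column per year) into tidy form, where each row is one observation. The table and its values are invented for the example:

```python
import pandas as pd

# A small "wide" table: one column per year instead of a year variable.
wide = pd.DataFrame({
    "country": ["Aland", "Borduria"],
    "2020": [100, 80],
    "2021": [110, 85],
})

# Reshape into tidy form: one row per country-year observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="population")
print(tidy)
```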

Quality can also refer to the documentation of the data. The minimum requirement for well-documented data is for it to have metadata. Data dictionaries, data inventories, and documentation about the collection and processing methodologies used are also relevant here. The more documented the data, the higher its quality, since a properly documented dataset will allow you to check its completeness and trustworthiness.
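A data dictionary does not need to be elaborate. A minimal sketch of one, here written as a plain Python structure with invented field names, is simply a list of the columns, their types, and what they mean:

```python
# A minimal data dictionary: one entry per column. Field names are
# hypothetical and only serve to illustrate the idea.
data_dictionary = [
    {"name": "respondent_id", "type": "integer", "description": "Unique ID assigned to each respondent"},
    {"name": "age", "type": "integer", "description": "Respondent's age in completed years"},
    {"name": "gender", "type": "string", "description": "Self-reported gender"},
    {"name": "district", "type": "string", "description": "District where the interview took place"},
]

for field in data_dictionary:
    print(f"{field['name']} ({field['type']}): {field['description']}")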

The four verification methods

The four most common methods to perform verification of data are:

1. Asking the source: the people who produced or published the data are most likely the best experts on it, and you should take advantage of this. Too often, new data practitioners see themselves as fighting a one-on-one battle with the data. This should not be the case: it is not only highly inefficient but also blinds you to your possible biases and prevents you from getting other perspectives on the data.

2. Asking experts: data practitioners from the civic sector often work on datasets that cover many different topics, some of them complex or outside their domain of expertise. It is important to remember that even if you are not an expert on a topic, there is probably someone who is, whom you can reach and who can help you better contextualise the dataset. This is an essential step for any serious data project. Working with data is never an individual effort.

3. Statistical checks: when exploring a large new dataset, diving in right away will probably result in confusion. Instead, you’ll want to make sure you understand what each column header means, what type of value to expect in the data, and whether it fits what you have in mind. A common approach is to create a statistical summary of your data: calculating the mean, median, maximum (max), and minimum (min) values and the standard deviation for key columns should give you a good idea of what the data looks like. You can also do some exploratory data analysis to look for possible errors or outliers in your data (see the first sketch after this list).

4. Common sense check: this is probably the most important of all. It is the ability to spot weird patterns or values in the dataset, such as a sudden spike in population, a negative bid amount, or a start date that is later than the end date (see the second sketch after this list). This relies on having a general feel for reading data, but also on good background knowledge about the context the data comes from. This is why data skills alone are not sufficient to work with any and all kinds of data: the data needs to make sense to you and your context so that you won’t miss important insights.
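A minimal sketch of a statistical summary, assuming a hypothetical contracts.csv file with a numeric bid_amount column:

```python
import pandas as pd

# Hypothetical contracts dataset; replace with your own file.
df = pd.read_csv("contracts.csv")

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
print(df.describe())

# The median appears as the 50% row above; it can also be computed directly.
print(df["bid_amount"].median())
```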
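And a sketch of a couple of common sense checks on the same hypothetical file, assuming it has bid_amount, start_date, and end_date columns:

```python
import pandas as pd

# Same hypothetical contracts dataset as above.
df = pd.read_csv("contracts.csv")

# Flag rows with values that cannot be right: a bid amount below zero.
negative_bids = df[df["bid_amount"] < 0]
print(f"{len(negative_bids)} rows have a negative bid amount")

# Parse dates and flag rows whose start date falls after their end date.
start = pd.to_datetime(df["start_date"], errors="coerce")
end = pd.to_datetime(df["end_date"], errors="coerce")
print(f"{(start > end).sum()} rows have a start date later than the end date")
```

Which checks make sense depends entirely on the dataset at hand; the point is to encode your expectations and let the data tell you where it breaks them.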

Common issues

No verification process

Ideally, every new dataset should go through at least one of the four verification methods before it is used in a project. Of course, this is not always done. A lack of rigor and commitment to data verification can lead teams to miss inconsistencies in the data. At best, this leads to a painstaking backtracking exercise through the data pipeline. At worst, your team will simply assume that the inconsistencies can be ignored.
