Written in Livemark
(2022-06-19 13:39)

Machine-readable data

In the context of working with data, machine-readable does not simply mean openable by a computer. For files or formats to be considered machine-readable, they should allow for easy extraction, processing, and analysis of the data that they contain. The most common application of this is with tabular data. A simple test for machine-readability is whether you can automatically compute the average of tabular data stored in the file. You can do this easily with spreadsheets, but not with Word documents, PDFs, or images.
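The averaging test can be sketched in a few lines of Python. This is only an illustration, assuming a small hypothetical CSV table: the `region` and `households` columns, their values, and the `column_average` helper are all made up for the example.

```python
import csv
import io
from statistics import mean

# Hypothetical CSV content standing in for a machine-readable file on disk.
csv_text = """region,households
North,120
South,80
East,100
"""


def column_average(csv_file, column):
    """Return the average of a numeric column in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return mean(float(row[column]) for row in reader)


# io.StringIO lets the string be read like a file; with a real file,
# open(path, newline="") would take its place.
average = column_average(io.StringIO(csv_text), "households")
print(average)
```

The same table saved as a scanned image or a PDF would first need manual re-entry or an extraction step before an average could be computed, which is why those formats fail the test.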

Common examples of machine-readable formats for tabular data are:

Why is machine-readability important?

Having machine-readable data allows data users to focus on creating knowledge, providing insight, and building solutions with the data instead of spending a lot of time converting data from one format to another. It also facilitates reusability of data and replicability of results. Machine-readability is a prerequisite to having open and reusable data.

Converting non-machine-readable data into machine-readable data

When working with data, it is often necessary to convert non-machine-readable files into machine-readable ones. This includes activities such as:

These steps are usually done in the get phase of the Data Pipeline.

Tidy data

The tidy data principles state that for data to be tidy, it must be stored such that:

  1. Each variable forms a column, and that column contains one "type" of data
  2. Each observation forms a row
  3. Each type of observational unit forms a table

There are multiple ways in which data can become untidy. Hadley Wickham's paper on the subject identifies these as:

  1. Column headers contain values, rather than names
  2. Multiple variables are stored in a single column
  3. Variables are stored in both rows and columns
  4. Multiple observational types are stored in a single table
  5. A single observational unit is stored in multiple tables
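As an illustration of the first problem in the list above, the sketch below reshapes a table whose column headers are values (years) into a tidy one. The table, its column names, and the `melt` helper are hypothetical and written only for this example. It is a minimal standard-library version of the reshaping step, not a reference implementation.

```python
import csv
import io

# Hypothetical untidy table: "2019" and "2020" are values of a "year"
# variable, but they appear as column headers.
wide_text = """province,2019,2020
A,10,12
B,7,9
"""


def melt(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    tidy = []
    for row in rows:
        for header, value in row.items():
            if header == id_column:
                continue
            tidy.append({
                id_column: row[id_column],
                variable_name: header,
                value_name: value,
            })
    return tidy


wide_rows = list(csv.DictReader(io.StringIO(wide_text)))
tidy_rows = melt(wide_rows, "province", "year", "cases")
for row in tidy_rows:
    print(row)
```

After reshaping, each row holds one observation (one province in one year) and each of `province`, `year`, and `cases` is a column of its own.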

Ensuring that data is tidy will help you easily identify:

  1. The types or categories of data points, with one type per column. Each type of information is described across multiple observations.
  2. The individual observations, with one observation per row. An observation is a collection of data points made about a specific thing.
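The two points above can be seen directly when a tidy table is read in code. In this sketch, which again uses a hypothetical table and made-up column names, the header row gives the variables and every remaining row is one observation.

```python
import csv
import io

# Hypothetical tidy table: one variable per column, one observation per row.
tidy_text = """province,year,cases
A,2019,10
A,2020,12
B,2019,7
"""

reader = csv.DictReader(io.StringIO(tidy_text))
observations = list(reader)    # each dict is one observation
variables = reader.fieldnames  # each header is one variable

print(variables)
print(len(observations))
```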