Written in Livemark
(2022-06-19 13:39)

Machine-readable data

In the context of working with data, machine-readable does not simply mean openable by a computer. For files or formats to be considered machine-readable, they should allow for easy extraction, processing, and analysis of the data that they contain. The most common application of this is with tabular data. A simple test for machine-readability is whether you can automatically compute the average of tabular data stored in the file. You can do this easily with spreadsheets but not so much with Word documents, PDFs, or images.
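The "average test" described above can be sketched in a few lines of Python. This is only an illustration using the standard library; the sample data, the column names, and the helper column_average are made up for the example and do not come from a real dataset.

```python
import csv
import io
from statistics import mean

# Hypothetical sample data standing in for the contents of a real .csv file.
CSV_TEXT = """item,amount
Office supplies,1200
Laptops,45000
Printer ink,800
"""


def column_average(csv_file, column):
    """Return the mean of a numeric column read from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return mean(float(row[column]) for row in reader)


# io.StringIO lets the example run without a file on disk;
# with a real file you would pass open("data.csv", newline="") instead.
average = column_average(io.StringIO(CSV_TEXT), "amount")
print(average)
```

No comparable one-step approach exists for the same table stored as a scanned image or a PDF, which is what makes those formats fail the test.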

Common examples of machine-readable formats for tabular data are:

Comma-Separated Values files or CSV (.csv)
Other delimited text files, such as tab-separated values (.tsv)
Spreadsheet files (.ods, .xls, .xlsx)
JavaScript Object Notation or JSON (.json)
Databases

Because of recent advancements in technology, the line between traditional machine-readable and non-machine-readable file formats for tabular data is slowly disappearing. There are now applications that make it possible to directly extract and process tabular data from PDFs and images, albeit not as easily as with spreadsheets.

Machine-readable can have different meanings in other contexts. A file or format that's not normally considered machine-readable in one context may be considered machine-readable in another. For example, image files are not normally considered machine-readable if the purpose is to extract a table from the image, but the same image may be considered machine-readable for purposes of image analysis or pattern recognition.

Why is machine-readability important?

Consider this scenario:
Juan and Pedro, persons of similar skill and capabilities, are both tasked with analysing the procurement activities of Procuring Entity A for the past 10 years.

Juan is given a PDF document containing the tables of A’s procurement activities.
Pedro is given the same dataset but in spreadsheet format.

Who do you think will be able to provide answers faster and more accurately, Juan or Pedro?

Having machine-readable data allows data users to focus on creating knowledge, providing insight, and building solutions with the data instead of spending a lot of time converting data from one format to another. It also facilitates reusability of data and replicability of results. Machine-readability is a prerequisite to having open and reusable data.

Converting non-machine-readable data into machine-readable data

When working with data, it is often necessary to convert non-machine-readable files into machine-readable ones. This includes activities such as:

Extracting a table from a PDF
Getting a table from a webpage using web scraping tools
Digitizing hard copy documents and extracting the data from them

These steps are usually done in the get phase of the Data Pipeline.
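As a rough sketch of the second activity, getting a table from a webpage, the example below pulls the cells out of an HTML table and writes them as CSV using only Python's standard library. The TableExtractor class and the sample HTML are hypothetical and written for this illustration; real pages are often messier (nested tables, merged cells) and usually call for a dedicated scraping tool.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell in the HTML it is fed."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical page fragment; in practice this would be downloaded HTML.
HTML = """
<table>
  <tr><th>Year</th><th>Contracts</th></tr>
  <tr><td>2020</td><td>14</td></tr>
  <tr><td>2021</td><td>19</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(HTML)
rows = parser.rows

# Write the extracted rows out in a machine-readable format (CSV).
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Once the table is in CSV form, it can be loaded into a spreadsheet or processed directly by the later phases of the pipeline.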

Tidy data

The tidy data principles state that for data to be tidy, it must be stored such that:

Each variable forms a column, and that column contains one "type" of data
Each observation forms a row
Each type of observational unit forms a table

There are multiple ways by which data can become untidy. Hadley Wickham's paper on the subject identifies these as:

Column headers contain values, rather than names
Multiple variables are stored in a single column
Variables are stored in both rows and columns
Multiple types of observational units are stored in a single table
A single observational unit is stored in multiple tables
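To make the first problem in this list concrete, here is a minimal sketch of reshaping a table whose column headers are values (years) into tidy form, with one variable per column and one observation per row. The data and the helper melt are invented for illustration; data libraries provide their own, more complete versions of this operation.

```python
# Hypothetical untidy table: the headers "2020" and "2021" are values
# of a "year" variable, not variable names.
wide_rows = [
    {"entity": "A", "2020": 14, "2021": 19},
    {"entity": "B", "2020": 7, "2021": 9},
]


def melt(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    tidy = []
    for row in rows:
        for header, value in row.items():
            if header == id_column:
                continue
            tidy.append(
                {
                    id_column: row[id_column],
                    variable_name: header,
                    value_name: value,
                }
            )
    return tidy


tidy_rows = melt(wide_rows, "entity", "year", "contracts")
for tidy_row in tidy_rows:
    print(tidy_row)
```

Each resulting row is a single observation (one entity in one year), and the year is now stored in its own column instead of in the headers.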

Ensuring that data is tidy will help you easily identify:

The types or categories of data points, with one type per column. Each type of information is described across multiple observations.
The individual observations, with one observation per row. An observation is a collection of data points made about a specific thing.
