In the context of working with data, machine-readable does not simply mean openable by a computer. For files or formats to be considered machine-readable, they should allow for easy extraction, processing, and analysis of the data that they contain. The most common application of this is with tabular data. A simple test for machine-readability is whether you can automatically compute the average of tabular data stored in the file. You can do this easily with spreadsheets, but not with Word documents, PDFs, or images.
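The averaging test described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the sample data, the column names, and the helper `column_average` are all hypothetical, and the sketch assumes the file is CSV with a header row and a purely numeric column.

```python
import csv
import io
from statistics import mean

# Hypothetical sample data standing in for the contents of a CSV file.
CSV_TEXT = """city,population
A,100
B,200
C,300
"""


def column_average(csv_text: str, column: str) -> float:
    """Return the mean of a numeric column in CSV-formatted text.

    Assumes the first row is a header and every value in the
    requested column can be parsed as a number.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return mean(float(row[column]) for row in reader)


average = column_average(CSV_TEXT, "population")
print(average)
```

The same figures pasted into a Word document, a PDF, or an image would first have to be extracted (by hand, by parsing the layout, or by optical character recognition) before a calculation like this could run.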
Common examples of machine-readable formats for tabular data are:
Having machine-readable data allows data users to focus on creating knowledge, providing insight, and building solutions with the data instead of spending a lot of time converting data from one format to another. It also facilitates reusability of data and replicability of results. Machine-readability is a prerequisite to having open and reusable data.
When working with data, it is often necessary to convert non-machine-readable files into machine-readable ones. This includes activities such as:
These steps are usually done in the get phase of the Data Pipeline.
The tidy data principles state that for data to be tidy, it must be stored such that:
There are multiple ways in which data can become untidy. Hadley Wickham's paper on the subject identifies these as:
Ensuring that data is tidy will help you easily identify: