Where is the data located and is it accessible?
This step entails knowing where to look for your data, finding it, and knowing how accessible it is. How difficult this step is depends on how well you defined your data problem.
Finding data also depends on your creativity and critical thinking. When data seems hard to find, consider looking at proxy indicators: indirect measures or signs that point to a phenomenon when no direct measure or sign is available.
No single person can know where to find all the data a project needs, which is why experience, contextual knowledge, and contacts in the relevant fields are key assets that will help you find the right dataset for your project.
Be mindful that several sources may maintain similar datasets, and that one dataset may fit some projects better than others. Your task is to understand the precise data needs of your project so that you can compare them against all the available data that you find. This step is important because it may lead your team to review the scope or research question of the project.
When looking for data, you can:
There are a lot of tools, techniques, and data sources that can help you in finding data both online and offline. These include:
Different types of digital files use different structures to hold information. For example, a text file is structured differently from an image file, which is structured differently from a web page. At the same time, most computer applications can only open a few file types because they are programmed to work only with specific structures. For example, a word processor cannot open a spreadsheet file. It is important to know the different file formats and how they relate to the data that you require so that you can better plan how to get the data.
Some of the most common file formats/file extensions that you might encounter when working with data include:
.txt - TXT is the extension for basic text files. It is not a structured data format per se, but it is possible to write data in a text file and have the right software recognize the structure despite the .txt extension such as in delimited text files (see CSV/TSV below).
.csv/.tsv - CSV stands for Comma-Separated Values and is used for storing tabular data: data arranged in rows and columns. An alternative is the TSV (Tab-Separated Values) file format, which uses tabs instead of commas to separate the values. Both are simply text arranged in a structured way. The character used to separate values in the file (e.g. comma for CSV and tab for TSV) is known as the delimiter, hence the general name given to these kinds of files: delimited text files. The .csv or .tsv extension indicates to the software how to read the file, but many applications can automatically detect the tabular structure in delimited text files even without the .csv or .tsv extension.
.xls/.xlsx - proprietary formats used by Microsoft Office to store its spreadsheets. Spreadsheets store tabular data similarly to CSV files but they also include information that is not purely data (e.g. the formulas used to compute cell values) and can store multiple tables in one file. As a consequence, spreadsheets are usually heavier (in terms of file size and the computing power needed to open them) than CSVs. Additionally, being a proprietary format might make .XLS and .XLSX less suitable for data sharing even if they have widespread use because of people's familiarity with Excel.
.ods - an open file format for spreadsheets developed and maintained as part of the Open Document Format for Office Applications (ODF). It is widely used in both free and open source office applications (LibreOffice, ONLYOFFICE) and proprietary ones (MS Office). Being an open format means that there is no need to purchase a proprietary application in order to open and use it.
.doc/.docx - proprietary formats used by Microsoft Office to store word documents. They store more information (e.g. text formatting, images, links) than simple text files which makes them heavier to use. Similar to .XLS/.XLSX, they may be less suitable for data sharing even if they have widespread use because of people's familiarity with Word.
.odt - an open file format for word documents developed and maintained as part of the Open Document Format for Office Applications (ODF).
JSON - JavaScript Object Notation is designed to be lightweight, web-native, easy to read by programming languages, and easy to share through APIs. JSON and CSV are both open, widely supported formats, but they hold different data structures: CSV is designed for tabular data, while JSON arranges its data in a tree-like hierarchy of nested objects and arrays. The Open Contracting Data Standard uses a JSON schema as its data model.
GeoJSON - a JSON-based format that carries location information, i.e. coordinates (latitude and longitude) and vector geometries (point, line, polygon). It is useful for working with geospatial information on the web.
shapefile - a proprietary format for storing spatial vector data for ESRI products. Similar to .XLS, it is a well-known and widely used format even if it has limitations.
geopackage - an open format by the Open Geospatial Consortium for storing spatial data (both vector and raster) which can overcome the limitations of shapefiles.
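The format list above describes two ideas: delimited text files, where a delimiter separates the values, and the difference between tabular data (CSV) and tree-like data (JSON). Both can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library, not a recommended workflow. The sample values and the helper names read_delimited and rows_to_tree are invented for this example.

```python
import csv
import io
import json

# Invented sample values, used only to illustrate the two structures.
SAMPLE = "region,town,count\nnorth,alpha,1\nnorth,beta,2\nsouth,gamma,3\n"


def read_delimited(text):
    """Guess the delimiter of a delimited text sample and parse its rows."""
    # Sniffer inspects the sample and guesses which candidate separates values.
    dialect = csv.Sniffer().sniff(text, delimiters=",\t;")
    reader = csv.DictReader(io.StringIO(text), dialect=dialect)
    return dialect.delimiter, list(reader)


def rows_to_tree(rows, group_key):
    """Nest flat rows under the values of one column (tabular -> tree-like)."""
    tree = {}
    for row in rows:
        rest = {k: v for k, v in row.items() if k != group_key}
        tree.setdefault(row[group_key], []).append(rest)
    return tree


if __name__ == "__main__":
    delimiter, rows = read_delimited(SAMPLE)
    print(repr(delimiter))
    print(json.dumps(rows_to_tree(rows, "region"), indent=2))
```

Delimiter guessing is a heuristic and can fail on small or irregular files, so when you already know the delimiter it is safer to state it explicitly than to rely on detection.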
When looking for the data that you need, it is common to find multiple datasets and sources pertaining to the same data. Resist the temptation to settle on the first dataset you find and adjust the project around it without first investigating whether better options exist.
Sometimes it may be more useful to create the needed dataset out of several quality datasets rather than settling for the obvious choice.
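As a rough sketch of what building one dataset out of several can look like, the Python snippet below combines two tables that share an identifier column. It assumes both sources use the same identifier values, which often needs checking and cleaning in practice. The function name combine and the column names in the usage example are hypothetical.

```python
def combine(primary, secondary, key):
    """Keep every primary row and add fields from the matching secondary row.

    Rows are dictionaries, such as those produced by csv.DictReader.
    Returns the combined rows and the key values that had no match.
    """
    lookup = {row[key]: row for row in secondary}
    combined, unmatched = [], []
    for row in primary:
        merged = dict(row)
        extra = lookup.get(row[key])
        if extra is None:
            # Record the gap so it can be reviewed rather than silently ignored.
            unmatched.append(row[key])
        else:
            for field, value in extra.items():
                # Values from the primary source win when both have a field.
                merged.setdefault(field, value)
        combined.append(merged)
    return combined, unmatched
```

For example, calling combine(schools, budgets, "school_id") with two hypothetical lists of rows would return every school row, extended with budget fields where a matching school_id exists, along with the identifiers that had no budget record.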