Data Quality Control and Assurance Starts Now!

Introduction
In the previous post I provided an overview of the database development stage of the data pipeline, and emphasized that database design is largely driven by the data acquisition protocols. In this post I’ll 1) provide an overview of the third stage in the data pipeline: Data Acquisition, and 2) discuss why data Quality Assurance and Control (QAQC), stage four in the pipeline, should begin now.

Data Acquisition
Data acquisition is the act of obtaining data. In ecology and the environmental sciences this typically means going out into the field to collect data. Data acquisition may also include acquiring preexisting data sets from colleagues or public data repositories. In the data acquisition phase you’ll need to polish the data acquisition protocols you first prepared in the planning phase, decide how data will be recorded, train and calibrate your field crews, and develop and implement field data QAQC procedures.
Data QAQC is the next stage in the pipeline. However, the Data Acquisition stage includes elements of Data QAQC, as indicated by the red arrow looping back into the Data Acquisition block in Figure 1. Data QAQC in the Data Acquisition stage focuses on two things: preventing data errors through protocol development, data validation procedures, and field crew training and calibration; and catching data errors early, before they are incorporated into the project database.
Protocols and Recording Data
The first steps in the data acquisition process occur before setting foot in the field. Data acquisition protocols will need to be fleshed out and standard operating procedures drafted. You’ll also need to decide how the data will be recorded. Traditionally, field data in ecology and the environmental sciences were recorded on paper data sheets. However, with the growing availability of smartphones and tablets, especially over the past 10 years, electronic data capture has become increasingly popular.

Electronic data capture can range from something as simple as recording data in a spreadsheet app to a custom-built app designed specifically for your project. Options for app development include 1) developing your own custom apps if you, or someone on your data management team, is a software developer; 2) commercially available options like Fulcrum; or 3) open source options like Open Data Kit.
However, as discussed in the previous post, regardless of how you decide to record data, bear in mind that stage 2 in the pipeline, database development, is inextricably linked to the data acquisition phase. Thus your paper field forms or electronic data capture applications will need to mirror your project database.
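One way to keep a field form and the project database in step is to compare their field names before the crews head out. The sketch below is only an illustration: the column names, the form fields, and the `compare_form_to_database` helper are all hypothetical, and a real project would read the names from its own database schema and form definition.

```python
# Hypothetical column names for a vegetation plot table in the project database.
DATABASE_COLUMNS = {"plot_id", "visit_date", "observer", "species_code", "cover_pct"}

# Hypothetical field names taken from the paper form or data capture app.
FORM_FIELDS = {"plot_id", "visit_date", "observer", "species_code", "cover_percent"}


def compare_form_to_database(form_fields, database_columns):
    """Return the names that appear on only one side, sorted for easy reading."""
    form_fields = set(form_fields)
    database_columns = set(database_columns)
    return {
        "missing_from_form": sorted(database_columns - form_fields),
        "missing_from_database": sorted(form_fields - database_columns),
    }


mismatches = compare_form_to_database(FORM_FIELDS, DATABASE_COLUMNS)
for problem, names in mismatches.items():
    print(problem, names)
```

Here the form calls the field `cover_percent` while the database expects `cover_pct`, so the name shows up in both lists. A mismatch like this is much cheaper to fix before the field season than after the data come back.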
Field Crew Training and Calibration
The next steps in data acquisition, field crew training and calibration, also occur before setting foot in the field, or early on in your team’s field deployment. Field crew training is a critical step in the data acquisition process, and one that will pay off later in more efficient and productive field crews, and higher quality data. Training steps may include:
- Having your team read through the protocols,
- Meeting with your team to discuss the protocols and address questions,
- Getting your team familiar with the paper field forms, or for electronic data capture, ensuring your team knows how to use the mobile devices and data collection apps, and
- Performing a live field training at a local park or green space, or on the first field day.
Field crew calibration is another important step in the data acquisition process. As used here, calibration refers to the process of aligning all members of a field crew on the protocols, and maximizing the consistency of measurements between individuals and crews. Calibration typically occurs in the field, and involves all crews recording data at the same sampling location, while discussing the protocols and comparing results.
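Comparing results during calibration can be made a little more systematic with a simple agreement check. The sketch below assumes each crew measured the same three trees, and flags any tree where the crews disagree by more than a chosen tolerance. The measurements, the crew names, and the 5% coefficient of variation threshold are all made up for illustration; the right tolerance depends on your protocol.

```python
from statistics import mean, stdev

# Hypothetical calibration data: three crews measure the same trees (diameter, cm).
calibration_measurements = {
    "tree_01": {"crew_a": 32.1, "crew_b": 32.4, "crew_c": 31.9},
    "tree_02": {"crew_a": 18.0, "crew_b": 21.5, "crew_c": 18.3},
    "tree_03": {"crew_a": 45.2, "crew_b": 45.0, "crew_c": 45.6},
}

# Assumed tolerance: flag items where the coefficient of variation exceeds 5%.
MAX_CV_PERCENT = 5.0


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)


def flag_disagreements(measurements, max_cv_percent):
    """Return {item: CV%} for items where the crews disagree beyond the tolerance."""
    flagged = {}
    for item, by_crew in measurements.items():
        cv = coefficient_of_variation(list(by_crew.values()))
        if cv > max_cv_percent:
            flagged[item] = round(cv, 1)
    return flagged


flagged = flag_disagreements(calibration_measurements, MAX_CV_PERCENT)
for item, cv in flagged.items():
    print(f"{item}: crews disagree (CV = {cv}%), revisit the protocol together")
```

With these example numbers only `tree_02` is flagged, because one crew's measurement is well off the other two. A flag like that is a prompt to go back to the tree as a group and work out where the protocol was read differently.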
Early Error Detection
Field data QAQC is all about catching and correcting errors in the data early on. Field data QAQC should occur on a daily basis, typically in the evenings, and also on days when foul weather forces your field crews to stay inside. If you are recording data digitally, field QAQC can be automated to some degree by building data validation checks into the data collection apps, and using domain lists to constrain the allowable values for a field. An additional benefit of digital data acquisition is that field data QAQC can be conducted digitally in a spreadsheet, in a data analysis program like R, or in a local copy of a database on the field laptop. Daily check-ins with all field crew members, especially early on in a field trip, can help resolve protocol questions that would otherwise result in data errors and inconsistencies. If you’re managing a large, complex project with many field crews, you may also want to assign a dedicated field data QAQC specialist whose job it is to oversee field data QAQC.
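To make the idea of validation checks and domain lists concrete, here is a minimal sketch of an evening QAQC pass over the day's records. The field names, the species codes in the domain list, and the 0 to 100 cover range are assumptions chosen for the example, not part of any particular app or protocol.

```python
# Hypothetical domain list: the only species codes the protocol allows.
SPECIES_CODES = {"ABLA", "PIEN", "PIAL"}

# Hypothetical allowable range for percent cover.
COVER_MIN, COVER_MAX = 0, 100


def validate_record(record):
    """Return a list of human-readable problems found in one field record."""
    errors = []

    # Required field check.
    if not record.get("plot_id"):
        errors.append("plot_id is missing")

    # Domain list check: the value must be one of the allowed codes.
    code = record.get("species_code")
    if code not in SPECIES_CODES:
        errors.append(f"species_code {code!r} is not in the domain list")

    # Range check: the value must be numeric and within the allowed bounds.
    cover = record.get("cover_pct")
    if not isinstance(cover, (int, float)) or not COVER_MIN <= cover <= COVER_MAX:
        errors.append(f"cover_pct {cover!r} is outside {COVER_MIN} to {COVER_MAX}")

    return errors


# Hypothetical records from one day in the field.
records = [
    {"plot_id": "P01", "species_code": "ABLA", "cover_pct": 35},
    {"plot_id": "P01", "species_code": "abla", "cover_pct": 20},
    {"plot_id": "", "species_code": "PIEN", "cover_pct": 140},
]

for row_number, record in enumerate(records, start=1):
    for error in validate_record(record):
        print(f"row {row_number}: {error}")
```

The first record passes, the second fails the domain check because the code was typed in lower case, and the third is missing its plot ID and has an impossible cover value. The same rules could live inside the data collection app itself, so that bad values are rejected at entry rather than found in the evening.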

Lastly, field data should always be backed up daily. For paper data forms this may include taking digital photos of the forms, or scanning the forms to PDF using a small, wireless scanner. Digital photos should be offloaded from the cameras and saved to a laptop or external hard drive. Data acquired digitally can be backed up by saving backup files on a laptop or, in more sophisticated workflows, uploading the data to a database on a remote server.
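For the simple case of saving backup files on a laptop or external drive, a daily backup can be scripted. The sketch below copies each file into a folder named for the day and confirms that every copy matches the original by comparing checksums. The folder layout, the file names, and the `backup_field_data` helper are assumptions for illustration, and temporary folders stand in for the real field data folder and backup drive.

```python
import hashlib
import shutil
import tempfile
from datetime import date
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def backup_field_data(source_dir, backup_root, day=None):
    """Copy the files in source_dir to backup_root/<YYYY-MM-DD>/ and verify each copy.

    Only files directly inside source_dir are copied; subfolders are skipped.
    """
    day = day or date.today()
    target_dir = Path(backup_root) / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for source in sorted(Path(source_dir).iterdir()):
        if not source.is_file():
            continue
        target = target_dir / source.name
        shutil.copy2(source, target)  # copy2 also preserves file timestamps
        if sha256_of(source) != sha256_of(target):
            raise IOError(f"Backup of {source.name} does not match the original")
        copied.append(target)
    return copied


# Demonstration with temporary folders and an arbitrary example date.
with tempfile.TemporaryDirectory() as tmp:
    field_dir = Path(tmp) / "field_data"
    field_dir.mkdir()
    (field_dir / "plots.csv").write_text("plot_id,cover_pct\nP01,35\n")

    backups = backup_field_data(field_dir, Path(tmp) / "backups", day=date(2024, 7, 15))
    backup_names = [path.name for path in backups]
    backup_folder = backups[0].parent.name
    print(backup_folder, backup_names)
```

Keeping one folder per day means an error introduced later cannot silently overwrite an earlier good copy, and the checksum comparison catches a copy that was cut short by a failing drive or a dead battery.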
Next Time on Elfinwood Data Science Blog
In this post I provided an overview of stage 3 in the data pipeline: Data Acquisition. In the next post I’ll provide an overview of Data QAQC, stage 4 of the data pipeline.