How data acquisition protocols drive database design.
In the previous post I made the distinction between data stewardship and data management, introduced the data pipeline (Figure 1), provided an overview of the planning stage of the pipeline, and made a call to action for everyone involved in data management to think like a data steward. In this post I’ll provide an overview of the second stage in the data pipeline: Database Development.
Database development is the act of creating a database system. A database system is an integrated set of software that provides the storehouse for the data you plan to acquire, and the tools for accessing and interacting with the data. During this phase the data steward(s) works with the IT professionals and subject matter experts that are part of the data management team to decide on which data management software to use, how users will access the data, and how the database schema will be organized. The database development team will rely on the data model, prepared during the planning phase, to guide the database development process. The database development stage is inextricably linked to the data acquisition phase, as illustrated by the double headed arrow connecting these two stages in Figure 1. This is because the purpose of the database system is storage of the data that will be acquired during the data acquisition phase.
The database development stage is inextricably linked to the data acquisition phase.
In practice what this means is that each data element that will be recorded during the data acquisition phase must be assigned a data type (e.g., integer) and a storage location within the database (e.g., data column within a data table). Additionally, each categorical data attribute will require a table in the database that contains the code and title of allowable values (i.e., data domain or domain list), and which is linked to the associated data column using a foreign key constraint. Thus the data acquisition protocols in large part drive the database design (Figure 1).
The database system must also be designed to allow database users to access the data. For single user applications, data access is simple, and the database can be stored on a single computer or network. Data access for database systems designed for two to a couple 100 users located in the same geographic region may be as simple as a database on a local file system or server, using an out of the box front end. For larger applications where 1000s to millions of users are spread across the globe, database access may necessitate running an instance of the database on a cloud server with a custom web frontend.
Figure 1. The upper portion of the first page of the U.S. Army Corp of Engineers (USACE) wetland determination field data form. The field form is used to record data in the field, and reflects the USACE protocols for conducting wetland determinations under Section 404 of the Clean Water Act. A database designed to store the data recorded on this form would require a data column for each data attribute (e.g., Project/Site). Data domain tables would be required for categorical data attributes. For instance, the attribute “Local relief” the domain table would include the values ‘concave’, ‘convex’, and ‘none’.
The Data QAQC and Analysis & Reporting phases should also be considered during the database development stage. For QAQC this may include designing tables and columns in the database for use in tracking progress/completion of data QAQC tasks (e.g., data reviewer initials and timestamp) and including audit tables that automatically track changes made to the data on a row-by-row level. For data analysis and reporting the database should be designed to integrate as much as possible with the common analyses and reporting workflows associated with these stages in the pipeline. For instance, outlier analysis is a common first step in many data analysis workflows. To integrate the analysis with the database, a data column could be included for flagging records that are determined to be outliers. This column could then be used in a data filter to withhold those records from subsequent analyses.
Additional database design elements could include data attributes that are automatically calculated (for instance, using a trigger function) based on logical criteria that are programmed into the database. For instance, in Figure 1 under “SUMMARY OF FINDINGS” there are the data attributes “Hydrophytic Vegetation Present?”, “Hydric Soils Present?”, and “Wetland Hydrology Present?”. Per the USACE protocols, the data attribute “Is the Sample Area within a Wetland” should be checked as ‘Yes’ only if all three of the above data attributes are checked as ‘Yes.’ In the database the data attribute “Is the Sample Area within a Wetland” could be designed to be automatically calculated depending on the values attributed to the data attributes “Hydrophytic Vegetation Present?”, “Hydric Soils Present?”, and “Wetland Hydrology Present?”. A major benefit to an approach like this is that if, in the process of the reviewing the data in the office it is determined that one or more of the three were assigned incorrectly in the field (thus resulting in a change to those data attributes in the office), then the calculated field is updated automatically. Thus one less change that needs to be made manually in the database.
For the reporting stage, design the database with consideration for the reporting elements, for instance tables and figures, that will be included in the manuscripts and reports that will be produced from this data. For instance, if your reports include tables that contain codes, titles, and descriptions of categories from your domain lists, then a data column could be included in the domain tables with a column to store descriptions of each category. The descriptions could then be exported from the database to construct the report table.
Next Time on Elfinwood Data Science Blog
In this post we provided an overview of stage 2 in the data pipeline: database development. In the next post I’ll provide an overview of data acquisition, stage 3 of the data pipeline.