“By failing to prepare, you are preparing to fail.” Benjamin Franklin
In the last post I provided an overview of the final stage in the data pipeline: Metadata and Archiving. The first stage in the data pipeline is planning, which I introduced in my third post, “The Data Pipeline: Planning.” In this post I’ll cover the planning stage in more detail, focusing on Data Management Plans (DMPs). Because a surprising amount of information on DMPs is available online, I will summarize the common themes across several sources, including government agencies and universities.
“I think you should always bear in mind that entropy is not on your side.” Elon Musk
Recall from my third post that a data management plan provides the foundation for the data pipeline and lays out the framework for how the data management team will complete each step. This is important because, as Elon Musk, CEO of Tesla and SpaceX, once said, “I think you should always bear in mind that entropy is not on your side.” In my experience, this is true in all aspects of life, and no less so when it comes to data management. On a more practical note, some project sponsors require that a data management plan be prepared at the proposal stage (e.g., U.S. National Science Foundation [NSF]) or as one of the first deliverables after a project begins.
Writing a Data Management Plan
I reviewed information on DMPs from the following university, government, and literature sources: U.S. Fish and Wildlife Service, U.S. Geological Survey, Cornell University, Massachusetts Institute of Technology, NSF, Data Observation Network for Earth (DataONE), and Michener (2015). I then compiled a list of the elements common to these sources, which is provided below with annotations.
Common Themes of DMPs
Basic project information and purpose
Every DMP should include a basic project description, including the purpose, specific project objectives, project team, contact information, and timeline.
Roles and Responsibilities
The DMP should include a description of the roles and responsibilities of the project team, including the assignment of the data steward(s) and managers.
Budget
The DMP should include a data management budget estimate. When preparing the estimate, be sure to budget for each stage in the data pipeline.
Data Description
The DMP should include a listing and brief description of the data to be collected (e.g., tabular, photographs, video, audio, physical samples, geospatial), data sources (e.g., new data collection vs. existing), the data types (e.g., character varying), and the estimated volume (e.g., gigabytes).
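One way to keep this listing up to date is to store it as a small machine-readable inventory and total the estimated volume from it. The sketch below is only an illustration: the `DatasetEntry` fields, dataset names, and sizes are hypothetical and not drawn from any of the sources reviewed here.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    """One row of a hypothetical DMP data inventory."""
    name: str
    kind: str             # e.g., tabular, photographs, geospatial
    source: str           # "new" or "existing"
    est_gigabytes: float  # estimated storage volume


# Illustrative entries only; the names and sizes are made up.
inventory = [
    DatasetEntry("vegetation_plots", "tabular", "new", 0.5),
    DatasetEntry("plot_photos", "photographs", "new", 40.0),
    DatasetEntry("landcover_map", "geospatial", "existing", 12.0),
]


def total_volume_gb(entries):
    """Sum the estimated storage volume across all datasets."""
    return sum(e.est_gigabytes for e in entries)


def volume_by_source(entries):
    """Group the estimated volume by new vs. existing data."""
    totals = {}
    for e in entries:
        totals[e.source] = totals.get(e.source, 0.0) + e.est_gigabytes
    return totals


print(f"Total estimated volume: {total_volume_gb(inventory):.1f} GB")
print(volume_by_source(inventory))
```

Keeping the inventory in one structure like this means the storage estimate can be recalculated whenever a dataset is added or resized.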
Data Organization and Storage
The DMP should include a description of how the data will be organized and stored. This may include database schemas, file structures, code repositories, and, for physical samples that will be maintained over time, a curation plan. This section should also include information on how the data will be backed up periodically to ensure it is not lost.
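As a minimal sketch of the backup idea, assuming the data live as ordinary files on disk, the Python below copies a file to a backup folder and confirms the copy by comparing SHA-256 checksums. The function names, folder layout, and example file are hypothetical, and a real plan would also cover scheduling and off-site copies.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(src: Path, backup_dir: Path) -> Path:
    """Copy src into backup_dir and verify the copy by checksum."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    dest = backup_dir / src.name
    shutil.copy2(src, dest)  # copy2 also preserves file timestamps
    if sha256_of(src) != sha256_of(dest):
        raise IOError(f"Backup of {src} failed checksum verification")
    return dest


# Example with a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw" / "plots.csv"
    raw.parent.mkdir()
    raw.write_text("plot_id,cover_pct\n1,45.5\n")
    copy = backup_file(raw, Path(tmp) / "backup")
    print(f"Backed up {raw.name} to {copy.parent.name}/ (checksum verified)")
```

Verifying the checksum after each copy is a design choice: it catches a truncated or corrupted backup at the time it is made, when the original is still available.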
Data Quality Assurance
The DMP should include a description of the data quality assurance and control (QA/QC) process. This may include standard operating procedures, flow charts, and a description of database control structures.
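To make the QA/QC idea concrete, here is a small sketch of automated checks for missing values, invalid types, and out-of-range values. The field names (`plot_id`, `cover_pct`) and their limits are hypothetical, and checks like these would supplement, not replace, the documented procedures.

```python
def check_record(record, rules):
    """Return a list of QA/QC problems found in one record.

    rules maps each field name to (type, minimum, maximum).
    """
    problems = []
    for field, (ftype, low, high) in rules.items():
        value = record.get(field)
        if value is None or value == "":
            problems.append(f"{field}: missing")
            continue
        try:
            value = ftype(value)
        except (TypeError, ValueError):
            problems.append(f"{field}: not a valid {ftype.__name__}")
            continue
        if not (low <= value <= high):
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems


# Hypothetical rules for a vegetation plot table.
RULES = {
    "plot_id": (int, 1, 9999),
    "cover_pct": (float, 0.0, 100.0),
}

records = [
    {"plot_id": "12", "cover_pct": "45.5"},
    {"plot_id": "13", "cover_pct": "145"},
    {"plot_id": "", "cover_pct": "abc"},
]

for rec in records:
    print(rec, "->", check_record(rec, RULES) or "OK")
```

Returning a list of problems, rather than stopping at the first one, lets a reviewer see everything wrong with a record in a single pass.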
Data Processing and Workflows
At the core of this element of the DMP are the concepts of data lineage and provenance. Wikipedia (2020) defines data lineage as the data's origin, what happens to it, and where it moves over time, and data provenance as records of the inputs, entities, systems, and processes that influence the data of interest, providing a historical record of the data and its origins. In essence, the DMP should describe the data from the time they are first obtained, through QA/QC review, analysis, and reporting, to archiving. Topics in this section of the DMP may include:
- Flow charts illustrating the flow of data through the stages in the data pipeline
- Standard operating procedures
- Code repositories
- Descriptions of anticipated data transformations
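The lineage idea can be sketched in a few lines of Python, assuming each processing step is logged with its inputs and outputs. The file names and step names below are hypothetical, and a real project might rely on a workflow tool or database instead of an in-memory list.

```python
import json
from datetime import datetime, timezone


def log_step(log, step, inputs, outputs, description):
    """Append one processing step to an in-memory provenance log."""
    log.append({
        "step": step,
        "inputs": list(inputs),
        "outputs": list(outputs),
        "description": description,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })


def lineage_of(log, target):
    """Trace a file back through the log to its original sources."""
    produced_by = {out: entry for entry in log for out in entry["outputs"]}
    sources, stack, seen = [], [target], set()
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        entry = produced_by.get(name)
        if entry is None:
            sources.append(name)  # no recorded producer: an original input
        else:
            stack.extend(entry["inputs"])
    return sorted(sources)


# Hypothetical two-step workflow: QA/QC review, then analysis.
log = []
log_step(log, "qaqc", ["raw_plots.csv"], ["clean_plots.csv"],
         "Removed records that failed range checks")
log_step(log, "analysis", ["clean_plots.csv", "landcover.tif"],
         ["summary.csv"], "Summarized cover by land cover class")

print(json.dumps(log, indent=2))
print(lineage_of(log, "summary.csv"))
```

Here `lineage_of` answers the question "where did this file come from?" by walking backward from an output to the inputs that no logged step produced.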
Metadata
The DMP should include a description of the process of preparing metadata (i.e., data about the data), including the minimum standards for complete metadata. See The Data Pipeline: Metadata & Archiving for more details.
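A minimum metadata standard can be enforced with a simple completeness check, sketched below. The required fields are a hypothetical minimum that I chose for illustration; a real project would take its list from whichever metadata standard the DMP adopts.

```python
# Hypothetical minimum standard; substitute the fields your DMP requires.
REQUIRED_FIELDS = ("title", "creator", "date_collected", "description", "license")


def missing_metadata(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Illustrative record with two gaps.
record = {
    "title": "Vegetation plot survey",
    "creator": "Field crew A",
    "date_collected": "2020-07-18",
    "description": "   ",
}

print(missing_metadata(record))
```

Treating a blank string the same as a missing field is deliberate, since an empty description is no more useful than an absent one.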
Data Access and Sharing
The DMP should include a section describing data access while the project is active and after project completion. Will the data be submitted to a public data repository? If so, which repository will be used? This section may also include information regarding data ownership, intellectual property rights, attribution, and licensing (e.g., Creative Commons).
Archiving
Lastly, the DMP should include a description of how the data will be archived and maintained over time after the project is complete.
Tools for Preparing DMPs
A variety of tools and resources are available for preparing DMPs, including the following:
- DMP Tool: Build your own data management plan
- ezDMP: Helps you build data management plans for NSF grant applications
- NSF DMP Requirements for proposals
- Ten Simple Rules for Creating a Good Data Management Plan (Michener 2015)
- University of Michigan Guidelines for Effective Data Management Plans
- University of Minnesota DMP examples from UMN researchers
- U.S. Geological Survey Data Management Plan Checklist
Next Time on Elfinwood Data Science Blog
In this post I provided a detailed review of data management plans. In the next post I’ll provide an overview of data management software. If you like this post, please consider subscribing to this blog or following me on social media.
Wikipedia contributors. (2020, July 18). Data lineage. In Wikipedia, The Free Encyclopedia. Retrieved 18:56, July 20, 2020, from https://en.wikipedia.org/w/index.php?title=Data_lineage&oldid=968340793
Copyright © 2020, Aaron Wells