The Data Pipeline: Planning

And why we should all try to think like data stewards.


In the previous post I discussed some of the challenges of, and common misconceptions about, data management. In this post I’ll cover the difference between data management and data stewardship, give an overview of the seven stages of the data pipeline, and make a call to action for all of us to think like data stewards when it comes to the data we collect.

Data Stewardship vs. Data Management

First, let’s clarify the distinction between data management and data stewardship.

  • Data Stewardship: The act of overseeing data through all stages of the data pipeline.
  • Data Management: A broad term that encompasses all of the processes and actions taken to complete one or more steps in the data pipeline.

Data stewards are those who oversee the data pipeline and coordinate the actions of the data managers (Figure 1). Data managers include anyone charged with working on one or more of the processes or actions required to move data through the data pipeline. A data management team is everyone involved in data management for a given project, and includes the data steward(s) and data managers.

The Data Pipeline

Figure 1 shows the seven stages of the data pipeline. An overview of the planning stage is provided below. In the next few blog posts we will provide an overview of the remaining six stages.

Figure 1. The seven stages of the data pipeline.


Data planning is the act of preparing a timeline and identifying the resources, including adequate funding, personnel, and infrastructure, required to complete the remaining six stages in the data pipeline. The centerpiece of this stage is the data management plan, in which the data management team, including the data steward(s) and individual data managers, is identified, and roles and responsibilities are defined. The data management plan provides the foundation for the data pipeline, and lays out the framework for how the data management team will complete each step. The data stewards are actively involved in preparing the plan, and they consult with the individual data managers as needed on specific technical elements.

A data management plan provides the foundation and framework for the remaining steps of the data pipeline.

At this stage you will also prepare the data acquisition protocols, if they are not already available, and a database model that specifies the details of each data element in the protocols and how each will be stored in the project database. You’ll also need to decide how the data will be recorded: on paper forms, for instance, or electronically on a mobile device. This is also the stage to budget for data management activities. If you are preparing a funding proposal, you’ll need at a minimum an outline of your data management plan. The outline will allow you to prepare a reasonable estimate of anticipated data management costs, including the costs of preparing a formal data management plan if your proposal is successful.
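To make the idea of a database model a little more concrete, here is a minimal sketch in Python of how the data elements from a protocol might be written down and then turned into a table in a project database. Everything specific in it is an assumption for illustration only: the `DataElement` class, the `plot_visit` table, and the four example fields are hypothetical and describe an imaginary plot survey, not any particular project. SQLite is used simply because it ships with Python.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """One entry in a data dictionary: a field collected under a protocol."""
    name: str         # column name in the project database
    sql_type: str     # storage type, e.g. INTEGER, REAL, TEXT
    required: bool    # whether the protocol requires a value
    description: str  # what is recorded, and in which units


# Hypothetical data elements for an imaginary plot survey (assumption).
ELEMENTS = [
    DataElement("plot_id", "TEXT", True, "Unique plot identifier"),
    DataElement("visit_date", "TEXT", True, "Visit date, ISO 8601 (YYYY-MM-DD)"),
    DataElement("observer", "TEXT", True, "Initials of the person recording"),
    DataElement("canopy_cover_pct", "REAL", False, "Canopy cover, percent (0-100)"),
]


def create_table_sql(table: str, elements: list[DataElement]) -> str:
    """Build a CREATE TABLE statement from the data dictionary."""
    columns = [
        f"{e.name} {e.sql_type}{' NOT NULL' if e.required else ''}"
        for e in elements
    ]
    return f"CREATE TABLE {table} (\n  " + ",\n  ".join(columns) + "\n)"


def build_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) plot_visit table and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(create_table_sql("plot_visit", ELEMENTS))
    return conn


if __name__ == "__main__":
    # Print the table definition so it can be reviewed alongside the protocol.
    print(create_table_sql("plot_visit", ELEMENTS))
```

Writing the elements down in one place means the same list can drive the database table, the field form, and the documentation, so the three are less likely to drift apart. A real database model would typically also cover relationships between tables, allowed values, and units, which this sketch leaves out.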

A Call to Action

In the previous post we covered some of the challenges of data management, and some of the costs of poor data management. In this post we’ve introduced the data pipeline and a brief overview of the first stage, planning. By now it should be clear that 1) data management isn’t optional, and 2) forethought and planning are required in advance of data acquisition, if we want to get the most out of the data we acquire, and optimize project budgets and timelines.

My call to action is for everyone charged with managing data to think like a data steward, and take the long view on data management.

The data pipeline provides a conceptual model to guide us through the process of data management. The data steward is charged with overseeing the data through each stage in the pipeline and coordinating the work of the data managers. Data stewards must keep the bigger picture in mind, while simultaneously having the capacity to zoom into any stage along the pipeline and work with data managers on the details. They recognize the interconnectedness of each stage of the data pipeline, and consequently the importance of having an integrated data management team.

My call to action is then for everyone charged with managing data to think like a data steward, and take the long view on data management. Keep the bigger picture in mind at all times, and recognize that the end result of each stage in the pipeline depends on the individual actions taken in the stages before.

Next Time on Elfinwood Data Science Blog

In the next post I’ll provide an overview of database development, stage 2 of the data pipeline.
