The Challenges of Data Management
It’s OK. Just let it all out. Let those big tears flow out like a river of denial down those pouty cheeks of yours. OK, there now, are you done? Maybe now we can talk about why data management matters, and why it can be challenging to develop good data management habits?
Seriously though, in my experience as an ecologist over the past 20 years, data management seems to be the persona non grata of the environmental science workflow of study planning -> field work -> analysis -> reporting. Now don’t get me wrong, there have always been those who have taken data management seriously. How else, without dedicated data management experts, software engineers, and data scientists, would we have an amazing tool like the Structured Query Language (SQL) standard; database software like PostgreSQL, MySQL, and MongoDB; and publicly available databases like the Global Biodiversity Information Facility (GBIF)?
The rules of the game (SQL standard) and the tools (database software) for data management have been available for some time, so why does data management in the environmental sciences seem to be undervalued? It’s likely that data management isn’t entirely undervalued; it’s also likely that part of the problem is that, as environmental scientists, we may not currently have the “tools in our tool chest” to become good data managers. After all, we’re field scientists, not data scientists, right? I would argue that we need to be both.
For the remainder of this post I’ll discuss several misconceptions that contribute to the undervaluing and poor execution of data management. I point out these misconceptions not to judge, but as someone who was susceptible to them early in my career.
Misconceptions about Data Management
Here are three misconceptions about data management. I’ll cover each in more detail below.
- Misconception #1: It will take more time and money to properly manage my data than to not.
- Misconception #2: Data management begins only after I’ve acquired data.
- Misconception #3: College courses in statistics and data science are substitutes for coursework and training in data management.
Misconception #1: It will take more time and money to properly manage my data than to not.
This misconception is particularly insidious because of the costs, in both dollars and lost opportunities, associated with poor data management. The truth is that not managing your data properly will ultimately cost you more time and money. Redman (2016) reports that bad data cost the U.S. $3 trillion in 2016. That’s an astounding amount of money, roughly 20% of the U.S. gross domestic product (GDP) in 2016!
To put this in perspective for scientists, we can use the above figures to roughly estimate the annual cost of bad data relative to the total dollar amount the National Science Foundation (NSF) awarded in 2016. In 2016, NSF awarded $7.5 billion in total funding (NSF 2017). If we assume that the same ~20% share applies to research funding, the cost of bad data to scientists in 2016 was roughly $1.5 billion ($7.5 billion * 0.2). It’s important to note that this is likely a minimum estimate, as it includes only NSF-funded projects and not projects funded by other government agencies, NGOs, and private companies.
In 2016 alone, the cost of bad data to scientists in the U.S. was approximately $1.5 billion.
This is a startlingly large number, and while it’s clearly a very rough estimate, it does begin to give us a sense of the magnitude of the costs resulting from poor data management. For instance, to roughly estimate the total cost of poor data management for yourself, just take 20% of your annual project budgets; that’s about how much you may be wasting on poor data management every year.
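The back-of-the-envelope arithmetic above can be sketched in a few lines of Python. The 20% share is the rough figure derived from Redman (2016) above; the function name and the example project budget are hypothetical, and the output is only as good as that very rough assumption.

```python
# Rough share of spending lost to bad data, taken from the ~20% figure above.
# This is an assumption carried over from the text, not a measured value.
BAD_DATA_SHARE = 0.20


def estimated_bad_data_cost(budget_dollars, share=BAD_DATA_SHARE):
    """Return a rough estimate of the dollars lost to poor data management."""
    if budget_dollars < 0:
        raise ValueError("budget_dollars must be non-negative")
    if not 0 <= share <= 1:
        raise ValueError("share must be between 0 and 1")
    return budget_dollars * share


# Total NSF funding awarded in 2016 (NSF 2017), in dollars.
nsf_2016 = 7.5e9
print(f"NSF 2016: ${estimated_bad_data_cost(nsf_2016) / 1e9:.1f} billion")

# Hypothetical annual project budget, for illustration only.
my_budget = 250_000
print(f"My projects: ${estimated_bad_data_cost(my_budget):,.0f}")
```

Swapping in your own annual budget for `my_budget` gives a quick, if crude, sense of what is at stake on your projects.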
It’s been said that “time is money, and money is time”, and anyone who’s had to retain a lawyer knows there is much truth to that saying. However, in addition to monetary costs, there are very real costs expressed as a reduction in quality of life and a decrease in productivity resulting from time spent dealing with poorly managed data. When you invest your time and money in good data management you’re working smarter, not harder. Your investment will yield dividends in an improved quality of life, increased productivity, and a reduction in cost overruns on your project budgets.
Misconception #2: Data management begins only after I’ve acquired data.
Another common misconception about data management is that it only begins after you have data in hand. The truth is that data management begins in the planning and budgeting phase of a project, well before any data are acquired.
In practical terms this means that when a project is first conceived you need to think through and budget for all aspects of data management. This includes preparing a data management plan, developing a project database and data recording system, conducting quality assurance and quality control (QA/QC) review, preparing metadata, and archiving data. If the project is a multi-year project then you’ll also need to budget for the costs of periodic database maintenance. I’ll cover each of the above aspects of data management in future posts.
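As one small illustration of a project database and data recording system with quality checks built in, here is a minimal sketch using Python’s built-in sqlite3 module. The table, column names, valid ranges, and example records are all hypothetical; a real project would define them in its own data management plan.

```python
import sqlite3

# Hypothetical schema: the table, columns, and valid ranges below are
# illustrative assumptions, not a recommended standard.
SCHEMA = """
CREATE TABLE tree_survey (
    record_id   INTEGER PRIMARY KEY,
    plot_id     TEXT NOT NULL,
    survey_date TEXT NOT NULL
        CHECK (survey_date GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'),
    species     TEXT NOT NULL,
    dbh_cm      REAL NOT NULL CHECK (dbh_cm > 0 AND dbh_cm < 1000)
);
"""


def record_observation(conn, plot_id, survey_date, species, dbh_cm):
    """Insert one field record; return True if it passed the database checks."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO tree_survey (plot_id, survey_date, species, dbh_cm) "
                "VALUES (?, ?, ?, ?)",
                (plot_id, survey_date, species, dbh_cm),
            )
        return True
    except sqlite3.IntegrityError:
        # A NOT NULL or CHECK constraint rejected the record.
        return False


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up records: one plausible, one with an impossible negative diameter.
ok = record_observation(conn, "P01", "2016-07-14", "Pinus albicaulis", 32.5)
bad = record_observation(conn, "P01", "2016-07-14", "Pinus albicaulis", -4.0)
count = conn.execute("SELECT COUNT(*) FROM tree_survey").fetchone()[0]
print(ok, bad, count)
```

The point of the sketch is that the database itself refuses obviously bad records at entry time, so some quality control happens before the data ever reach analysis.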
Thinking about data management only after collecting data is kinda like departing on a cross-country road trip without a road map. You likely have an end destination in mind, and you’ll eventually get there, but you’ll inevitably take many wrong turns and hit dead ends along the way. In our personal lives this may be just what we want or need: a free-spirited jaunt across the country, an adventure. However, it’s my experience that adventures in data management are typically costly.
Misconception #3: College courses in statistics and data science are substitutes for coursework and training in data management.
A third common misconception is that college courses in statistics and data science are substitutes for classes in data management. They are not, unless a particular course in one of these topics specifically covers data management concepts, implementation, and best practices.
Statistics courses typically focus on the theories behind the numbers, proper study design, and specific statistical methods and analyses. Courses in data science will often provide an overview of relational databases, including theory, how to create a database, and how to access data using SQL and other programming languages. This is great because it provides the background for designing statistically rigorous scientific studies and understanding database systems, along with the tools for creating a database and accessing the data within. However, there is more to data management than what many of these courses typically cover, and the students completing them are left with an incomplete picture of what data management entails.
Ideally a collegiate-level course in data management would, in addition to providing an overview of relational databases and SQL, cover preparing a data management plan, developing a project database and data recording system, conducting data quality assurance and quality control review, preparing metadata, and archiving data. I propose that this is more than enough content to justify stand-alone college courses in data management. I would go further and suggest that a course in data management should be a prerequisite to all courses in statistics and data science.
Next Time on Elfinwood Data Science Blog
In the next post I’ll cover the difference between data management and stewardship, introduce the data pipeline, provide an overview of the planning stage of the data pipeline, and make a call to action for all of us to think like data stewards when it comes to the data we collect.
References
National Science Foundation (NSF). 2017. Agency Financial Report FY 2016. National Science Foundation, Alexandria, Virginia. https://www.nsf.gov/pubs/2017/nsf17002/pdf/nsf17002.pdf
Redman, T.C. 2016. Bad Data Costs the U.S. $3 Trillion Per Year. Harvard Business Review. September 22, 2016. https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year