About Elfinwood Data Science Blog.

Introduction
Elfinwood is a term used to describe the stunted forests characteristic of most subalpine and alpine regions worldwide. Also referred to as krummholz, these miniature forests occur at the upper altitudinal limits of trees, where the environmental conditions, namely extremely cold temperatures and strong winds, impose restrictions on the trees’ physiology and force them to adapt. The most notable adaptation is the change in growth form from tall and erect to short and squat, or even prostrate. As a vegetation ecologist, I see elfinwood as a physical manifestation of ecology, the branch of biology that deals with the relationship of organisms to one another and to the environment. Stands of elfinwood tell me something about the environment, namely that I’m at the transition zone between subalpine and alpine physiography, and about the plant species that I’m likely to encounter.
In the digital age in which we now live, data management and analysis have gone through a transition of their own. Over the past 2 decades, computational power and connectivity have increased exponentially, and the term “data science” has come into widespread use.
For an outline of current blog posts with hyperlinks, jump to: Next Time on Elfinwood Data Science Blog
Data Science
Data science has been defined in various ways. The following definition by Irizarry (2020) sums it up nicely:
“Data science is an umbrella term to describe the entire complex and multistep processes used to extract value from data.”
I interpret the term “value” here to refer to data-driven, actionable insights, or to improvements in the understanding of a system, gained by analyzing data and synthesizing the results.
Irizarry (2020) lists 3 areas of expertise:
- Data Engineer: Deals with the hardware, efficient computing, and data storage infrastructure.
- Data Science Software Developer: Develops data science software.
- Data Analyst: Analyzes and explores data, fits models, applies machine learning algorithms, and presents the results.
I propose a fourth area of expertise, that of the Data Steward. Data stewards oversee all elements of data management, from preparing a data management plan, to developing and maintaining the database infrastructure, to overseeing data quality assurance and quality control (QA/QC) and archiving. Going forward, I’ll refer to the 4 groups above collectively as “data scientists”.
It is my experience that the role of the data steward, and the importance of properly managing data in the environmental sciences, have been undervalued until relatively recently. Our colleges and universities do a good job of teaching us the philosophies that underlie our scientific disciplines, the field methods and protocols necessary for conducting field surveys, and the statistical underpinnings and data analysis techniques necessary to plan an effective study design and to summarize and synthesize our data. However, it seems that rarely, if ever, are we taught how to properly manage and curate the field data that we collect. Perhaps this is in part because data management isn’t sexy. Field work is sexy, and analyzing data and publishing the results in scientific journals is sexy, but data management…definitely not.
Additionally, with the vast computational power available to us today, data scientists are capable of rapidly analyzing massive amounts of data. Take, for instance, Google Earth Engine (GEE), Google’s cloud computing geospatial software that combines petabytes (i.e. thousands of terabytes) of satellite and geospatial imagery with planetary-scale analysis capabilities, all freely available on the web. Google Earth Engine gives just about anyone who’s willing to learn a little JavaScript the power to perform spatial analysis at the scale of the entire globe. To fully benefit from these advances in technology, it’s more imperative than ever that we properly manage and share our data. To facilitate this, at a minimum we need to coordinate our database schemas and domain lists within, and between, our respective disciplines.
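To make the idea of a shared domain list a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the column name growth_form, the allowed values, the example records, and the helper find_invalid_values are my assumptions, not part of any agreed standard. It simply checks records against one agreed list of allowed values.

```python
# Hypothetical shared domain list: the allowed values for a "growth_form" column.
GROWTH_FORM_DOMAIN = {"tree", "shrub", "forb", "graminoid"}


def find_invalid_values(records, column, domain):
    """Return (row_index, value) pairs whose value is not in the domain list."""
    return [
        (i, rec.get(column))
        for i, rec in enumerate(records)
        if rec.get(column) not in domain
    ]


# Made-up field records, as might arrive from two projects with different habits.
records = [
    {"plot_id": "A1", "growth_form": "tree"},
    {"plot_id": "A2", "growth_form": "Shrub"},      # capitalization differs from the list
    {"plot_id": "B1", "growth_form": "krummholz"},  # value not in the agreed list
]

invalid = find_invalid_values(records, "growth_form", GROWTH_FORM_DOMAIN)
print(invalid)
```

Rows 1 and 2 are flagged because their values don't match the list exactly. When two projects validate against the same list, their data can be combined without first reconciling category names.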
Elfinwood Data Science Blog
The objective of this blog is to help improve data management in general, with examples drawn from the biological and environmental sciences. To this end, I’ll present an introduction to data management and a database schema model, with the intent of moving us towards a more integrated approach to data management. The material presented will be equivalent to that of a graduate-level course in data management. Here is a preliminary list of topics that I plan to cover:
- Why data management matters
- The Data Pipeline
- Preparing a data management plan
- Recording data in the field
- Field data management
- Data management software
- Version control
- PostgreSQL: An Overview
- Data types
- Schema models
- Data tables
- Data columns
- Reference tables and referential integrity
- Superplot/plot/subplot concepts
- Single visit sample units
- Multiple visit sample units
- Spatial data
- Missing data
- Managing voucher specimen and lab sample data
- Data completeness tiers and minimum dataset
- Database views
- Metadata
Next Time on Elfinwood Data Science Blog
In the next post I’ll discuss several misconceptions that contribute to the undervaluing and poor execution of data management. Below is an outline of current blog posts grouped by topic.
- The next 6 posts after that, beginning with The Data Pipeline: Planning, are a series on the 6 stages of data management, from planning to metadata and archiving.
- After that are 2 posts: Data Management Plans, and Databases and Data Management Systems.
- Next, beginning with Version Control: An Overview, is a series of 5 posts with hands-on lessons on version control using Git and GitHub.
- The post Learning Data Science describes how to use the learning-data-science GitHub repository for learning data analysis and management in association with this blog.
- The post Creating a PostgreSQL Database begins a series of hands-on lessons on data management and analysis using PostgreSQL and R.
Literature Cited
Irizarry, R. A. (2020). The Role of Academia in Data Science Education. Harvard Data Science Review, 2(1). https://doi.org/10.1162/99608f92.dd363929
Copyright © 2020, Aaron Wells