Elfin…what?

About Elfinwood Data Science Blog.

A stand of elfinwood, including lodgepole pine (*Pinus contorta ssp. contorta*) and subalpine fir (*Abies lasiocarpa*), in the northern Wind River Range, WY.

Introduction

Elfinwood is a term used to describe the stunted forests characteristic of most subalpine and alpine regions worldwide. Also referred to as krummholz, these miniature forests occur at the upper altitudinal limits of trees, where environmental conditions, namely extremely cold temperatures and strong winds, impose restrictions on the trees’ physiology and force them to adapt. The most notable adaptation is the change in growth form from tall and erect to short and squat, or even prostrate. As a vegetation ecologist, I see elfinwood as a physical manifestation of ecology, the branch of biology that deals with the relationship of organisms to one another and to the environment. A stand of elfinwood tells me something about the environment, namely that I’m at the transition zone between subalpine and alpine physiography, and about the plant species that I’m likely to encounter.

In the digital age in which we now live, data management and analysis have gone through a transition of their own. Over the past two decades, computational power and connectivity have increased exponentially, and the term “data science” was coined.

For an outline of current blog posts with hyperlinks, jump to: Next Time on Elfinwood Data Science Blog

Data Science

Data science has been defined in various ways; the following definition by Irizarry (2020) sums it up nicely:

“Data science is an umbrella term to describe the entire complex and multistep processes used to extract value from data.”

I interpret the term “value” here to refer to data-driven, actionable insights, or to improvements in the understanding of a system, gained by analyzing data and synthesizing the results.

Irizarry (2020) lists three areas of expertise:

Data Engineer: Deals with the hardware, efficient computing, and data storage infrastructure.
Data Science Software Developer: Develops data science software.
Data Analyst: Analyzes and explores data, fits models, applies machine learning algorithms, and presents the results.

I propose a fourth area of expertise, that of Data Steward. Data stewards oversee all elements of data management, from preparing a data management plan, to developing and maintaining database infrastructure, to overseeing data quality assurance and quality control (QA/QC) and archiving. Going forward, I’ll refer to the four groups above collectively as “data scientists”.

It is my experience that the role of the data steward, and the importance of properly managing data in the environmental sciences, have been undervalued until relatively recently. Our colleges and universities do a good job of teaching us the philosophies that underlie our scientific disciplines, the field methods and protocols necessary for conducting field surveys, and the statistical underpinnings and data analysis techniques necessary to plan an effective study design and to summarize and synthesize our data. However, it seems that rarely, if ever, are we taught how to properly manage and curate the field data that we collect. Perhaps this is in part because data management isn’t sexy. Field work is sexy, and analyzing data and publishing the results in scientific journals is sexy, but data management…definitely not.

Additionally, with the vast computational power available to us today, data scientists are capable of rapidly analyzing massive amounts of data. Take, for instance, Google Earth Engine (GEE), Google’s cloud computing geospatial software, which combines petabytes (i.e. thousands of terabytes) of satellite and geospatial imagery with planetary-scale analysis capabilities, all freely available on the web. Google Earth Engine gives just about anyone who’s willing to learn a little JavaScript the power to perform spatial analysis at the scale of the entire globe. To fully benefit from these advances in technology, it’s more imperative than ever that we properly manage and share our data. To facilitate this, at a minimum we need to coordinate our database schemas and domain lists within, and between, our respective disciplines.
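As a rough illustration of what a shared domain list can look like in practice, here is a minimal sketch. It uses Python's built-in sqlite3 module purely for convenience, as a stand-in for a full database system such as PostgreSQL. The table names (ref_growth_form, tree_obs), the column names, and the domain values are all hypothetical, made up for this example. The idea is simply that the allowed values live in one reference table, and the data table is only permitted to use values found there.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one domain list and one data table.

    All table, column, and code names here are hypothetical examples.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")

    # The domain list: the only growth-form codes anyone is allowed to record.
    conn.execute(
        """
        CREATE TABLE ref_growth_form (
            code        TEXT PRIMARY KEY,
            description TEXT NOT NULL
        )
        """
    )
    conn.executemany(
        "INSERT INTO ref_growth_form (code, description) VALUES (?, ?)",
        [
            ("erect", "Tall, upright tree"),
            ("krummholz", "Stunted or prostrate tree"),
        ],
    )

    # The data table: growth_form must match a code in the domain list.
    conn.execute(
        """
        CREATE TABLE tree_obs (
            obs_id      INTEGER PRIMARY KEY,
            species     TEXT NOT NULL,
            growth_form TEXT NOT NULL REFERENCES ref_growth_form (code)
        )
        """
    )
    return conn


def try_insert(conn, species, growth_form):
    """Insert one observation; return False if the domain list rejects it."""
    try:
        conn.execute(
            "INSERT INTO tree_obs (species, growth_form) VALUES (?, ?)",
            (species, growth_form),
        )
        return True
    except sqlite3.IntegrityError:
        # Raised when growth_form is not a code in ref_growth_form.
        return False


if __name__ == "__main__":
    db = build_demo_db()
    print(try_insert(db, "Abies lasiocarpa", "krummholz"))  # code is in the list
    print(try_insert(db, "Abies lasiocarpa", "shrubby"))    # code is not in the list
```

If two projects agree on the same reference table, their observations can be combined without first reconciling free-text spellings of the same category, which is the kind of coordination described above.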

Elfinwood Data Science Blog

The objective of this blog is to help improve data management in general, with examples drawn from the biological and environmental sciences. To this end, I’ll present an introduction to data management and a database schema model, with the intent of moving us towards a more integrated approach to data management. The materials presented will be equivalent to those of a graduate-level course in data management. Here is a preliminary list of topics that I plan to cover:

Why data management matters
The Data Pipeline
Preparing a data management plan
Recording data in the field
Field data management
Data management software
Version control
PostgreSQL: An Overview
Data types
Schema models
Data tables
Data columns
Reference tables and referential integrity
Superplot/plot/subplot concepts
Single visit sample units
Multiple visit sample units
Spatial data
Missing data
Managing voucher specimen and lab sample data
Data completeness tiers and minimum dataset
Database views
Metadata

Next Time on Elfinwood Data Science Blog

In the next post I’ll discuss several misconceptions that contribute to the undervaluing and poor execution of data management. Below is an outline of current blog posts grouped by topic.

The next six posts after that, beginning with The Data Pipeline: Planning, are a series on the six stages of data management, from planning to metadata and archiving.
After that are two posts, on Data Management Plans and on Databases and Data Management Systems.
Next, beginning with Version Control: An Overview, is a series of five posts with hands-on lessons on version control using Git and GitHub.
The post Learning Data Science describes how to use the learning-data-science GitHub repository for learning data analysis and management in association with this blog.
The post Creating a PostgreSQL Database begins a series of hands-on lessons on data management and analysis using PostgreSQL and R.

Literature Cited

Irizarry, R. A. (2020). The Role of Academia in Data Science Education. Harvard Data Science Review, 2(1). https://doi.org/10.1162/99608f92.dd363929