In the last post I provided an overview of data management plans: Data Management Plans. In this post I’ll be presenting on databases and data management systems (DBMS), including a discussion about what constitutes a database, the basics of relational databases, and an overview of several popular DMBS software options.
What is a Database?
Oracle, a developer of database software, defines databases as follows:
A database is an organized collection of structured information, or data, typically stored electronically in a computer system.
A database is typically controlled by a database management system (DBMS), which Oracle defines as:
A DBMS is a comprehensive database software program that serves as an interface between the database and its end users or programs, allowing users to retrieve, update, and manage how the information is organized and optimized.
Using Spreadsheets as Databases
For many people, particularly when first starting out in the natural sciences, spreadsheets are the go to tool for data management. I am no exception. When I first started graduate school spreadsheets were my thing. Little did I know that they were soon to become my enemy! According to the above definition a spreadsheet that contains data would be considered a database. However, a spreadsheet is not a DBMS, and I do not recommend using spreadsheets as a database. Below is a listing of several reasons why you should avoid spreadsheets for data management.
- Lack of Scalability: For small, simple datasets spreadsheets can work OK, but for large, complex datasets spreadsheets are ineffective for data management.
- Sorting: Sorting data in spreadsheets is error prone.
- Referential Integrity: Spreadsheets lack the inherent ability to establish referential integrity, i.e., structured, enforceable relationships between data elements.
- Version Control: Spreadsheets are prone to version control issues.
- Querying: Querying data in spreadsheets is generally limited to filtering, sorting, and pivot tables. Whereas querying options in a DBMS are virtually endless.
- Data Types: Data are classified based on the values they can take on (e.g., integer). The different classes of data are referred to as data types. Data type support is available in spreadsheets, but is generally limited to a small number of basic types.
Recent advances in cloud-based spreadsheets (e.g., Google Sheets) have made spreadsheets more flexible and powerful. For instance, multiple users can access and simultaneously edit a Google Sheet (thus improving scalability), a version history is maintained for each file (thus improving version control), and filter views are just a straight up cool concept. However, I still would not recommend using them as a primary data management tool unless your dataset is small and simple. Never the less, if you are dead set on using a spreadsheet as a database here are a few tips to help improve your experience.
- Use cloud-based spreadsheets to help maintain version control and improve collaboration.
- Use data validation as a check on the data that is entered into columns. See how to do this in Excel and Google Sheets.
- Create and save filter views for data filtering.
Referential integrity is important because it helps ensure data integrity, and is a feature any good DBMS. DBMSs that support referential integrity are called Relational Database Management Systems (RDBMS). Referential integrity is a database construct in which relationships between data elements are structured and enforced through the use of database constraints. There are several types of database constraints that can be employed, the most common are primary keys, unique keys, foreign keys, and check constraints.
- Primary keys: Defines a column or group of columns that are used as a unique identifier within and between tables. Null values are not allowed in any columns included in a primary key. Single column primary keys require that the values in each row are unique, while multi-column primary keys require that the combination of values across all columns in a row are unique.
- Unique keys: Ensures that the values in each row of a column, or across a set of columns are unique.
- Foreign keys: Defines a relationship between values in columns from 2 different tables such that the value in the column of the first table must be present in the related column of the second table.
- Check constraint: Defines an allowable range of values that can be included in a column.
DBMSs typically structure data using a data model, the most common of which organizes data into columns and rows within tables. Data models are often expressed as flow charts termed “Entity Relationship Diagrams” or ERDs that display the tables, columns, primary keys, and relationships between tables. Figure 1 presents an example ERD for a hypothetical avian mist net database that contains data tables (i.e., mist_net, net_conditions) and reference tables (i.e., ref_mist_net, ref_species) that contain the allowable values for the related columns in the data tables.
There are many DBMS software options available today. For the remainder of this post I’ll present a brief overview of several popular options. However, first we need to make the distinction between database systems that can be deployed on a server (“database servers”), and those that cannot (“non-server databases”). In the most basic terms a server is a computer that provides resources to one or more computers (termed “clients”) on a network. The resources often include files and file systems, databases, and software. In essence, database servers are databases that can be configured on a server and accessed simultaneously by multiple users across a network. Non-server databases are run on a single computer and can only be accessed by one person at a time.
Microsoft Access is a non-server RDBMS software product that comes as part of Microsoft 365. Of the 4 DBMSs that I present here, Access is the only one that must be purchased, the others are all free, open source software programs. Access is a Windows based program that combines a General User Interface (GUI) with a stand-alone database. Access can also be configured as a “frontend” to a “backend” database server like PostregreSQL (see below), meaning that Access is used to view and enter data, but the data is actually stored in a database on a server. Access offers the ability to write queries using Structured Query Languange (SQL), custom reports, and a wide variety of customizations using VisualBasic.
MySQL is an open-source RDBMS database server owned by Oracle Corporation. MySQL is free and open-source software under the terms of the GNU General Public License, and is also available under a variety of proprietary licenses (Wikipedia Contributors 2020). When Oracle purchased MySQL the developers, based on concerns that MySQL would someday be made proprietary, created a copy of the underlying code and created MariaDB which is intended to remain free and open-source software in perpetuity under GNU. MySQL and MariaDB generally remain similar, with the exception of some recent new features. MySQL and MariaDB both conform to the SQL standard, and can be accessed from the command line, and from a variety of GUIs, for instance dbForge Studio, DBeaver, and HeidiSQL.
SQLite is a “small, fast, self-contained, high-reliability, full-featured, SQL database engine.” SQLite is also free and open source, and is cross-platform compatible, making it extremely flexible. SQLite can be used as a stand-alone RDBMS, for cross-platform data transfers, and as the backend database for websites, webforms, and apps. SQLite is a non-server database. However, for free software SQLite is incredibly powerful and versatile. A listing of appropriate uses for SQLite, and situations where a client/server RDBMS may work better is provided on the SQLite website here: https://www.sqlite.org/whentouse.html.
PostgreSQL is an open-source RDBMS database server with a 30 year development history. PostgreSQL conforms to the SQL standard, and can be accessed from the command line, and from a variety of GUIs, for instance pgAdmin, DBeaver, and dbForge Studio. In future posts I’ll go into more detail about PostgreSQL, including providing code for a database template that will be used in future posts for learning how to manage data in PostgreSQL.
Next Time on Elfinwood Data Science Blog
In this post I provided an overview of databases and data management systems. In the next post I’ll provide an overview of version control. If you like this post then please consider subscribing to this blog (see below) or following me on social media.
Wikipedia contributors. (2020, August 24). GNU General Public License. In Wikipedia, The Free Encyclopedia. Retrieved 18:40, September 11, 2020, from https://en.wikipedia.org/w/index.php?title=GNU_General_Public_License&oldid=974617467
Wikipedia contributors. (2020, September 7). MariaDB. In Wikipedia, The Free Encyclopedia. Retrieved 18:41, September 11, 2020, from https://en.wikipedia.org/w/index.php?title=MariaDB&oldid=977207761
Wikipedia contributors. (2020, September 3). MySQL. In Wikipedia, The Free Encyclopedia. Retrieved 18:11, September 11, 2020, from https://en.wikipedia.org/w/index.php?title=MySQL&oldid=976518000
Follow My Blog
Get new content delivered directly to your inbox.
Copyright © 2020, Aaron Wells