In the previous 2 posts I covered SQL select statements, and we looked at the Lesson_007 and Lesson_008 files in the learning-data-science GitHub repository. Lesson 007 covered the basics of SQL selects statements, and Lesson 008 covered LIMIT and OFFSET, column aliases, data type casting, creating columns on-the-fly, mathematical operations.
In this post, Lesson 009 we’ll go into more detail about filtering in select statements using the WHERE clause, and I’ll provide examples from my_first_database.
Since data analysis and management are best learned by doing, the objective of this and future posts is to get readers hands on experience using PostgreSQL and R. This lesson builds on earlier posts so, if this is your first time reading this blog and you’d like to follow along on your computer, then please go back and start from the post Learning Data Science and work through the lessons up to this point.
Lesson 9 Video
Check out the accompanying video on the Elfinwood Data Science YouTube channel here: Elfinwood Data Science Blog Lesson 9 Video.
For this post we’ll refer to files in the Lesson 9 folder in the learning-data-science repository. If you have not already forked the learning-data-science repository, and cloned the fork to your computer, please do so now following the instructions provided in the post Learning Data Science. If you have previously forked and cloned the repository, but don’t have the files for this lesson, then follow the instructions here to update your fork with the latest files.
If you are not familiar with version control using Git and GitHub, I recommend going back and working through the hands on tutorials in my series on version control starting with “Version Control: An Overview.”
The WHERE clause is used to filter rows in SELECT statements. The WHERE clause is placed after the FROM clause and is followed by one or more boolean expressions combined using AND or OR operators. Boolean expressions are written using comparison functions or operators and evaluate to true or false.
For example, the following select statement returns all columns and only those rows from the public.site table where the soil_restrict_layer_code column is equal to permafrost (Figure 1A).
SELECT * FROM public.site WHERE soil_restrict_layer_code = 'permafrost'
Note the use of single quotes around the value permafrost. This is standard SQL syntax when referring to values in columns with text or character varying data types.
Here’s another example using the greater than or equal to operator, the following select statement will return only those rows in the public.sample table where the expression elevation_m >= 1000 is true (i.e., plots with an elevation_m greater than or equal to 1000 meters). The results are displayed in Figure 1B.
SELECT study_value, sample_id, elevation_m FROM public.sample WHERE elevation_m >= 1000;
Filtering using AND
Here’s another example using a combination of boolean expressions strung together using AND:
SELECT study_value, sample_id, elevation_m FROM public.sample WHERE elevation_m >= 1000 AND elevation_m <= 1200 ORDER BY elevation_m ASC;
The above select statement will return the rows in public.sample with an elevation_m greater than or equal to 1000 meters and less than or equal to 1200 meters and sorted in ascending order by elevation_m (Figure 1C). The above boolean expression can be expressed in mathematical notation as 1000 ≤ elevation_m ≤ 1200.
Filtering using BETWEEN
The above select statement could also be written using BETWEEN to express the range of values in the boolean expression of the WHERE clause. For instance, the select statement below will return the same results as above:
SELECT study_value, sample_id, elevation_m FROM public.sample WHERE elevation_m BETWEEN 1000 AND 1200 ORDER BY elevation_m ASC;
BETWEEN is one of several comparison predicates that are useful for writing boolean expressions. Comparison predicates are similar to operators but require special syntax mandated by the SQL standard (PostgreSQL 2021).
Filtering Using Other Types of Predicates
Other commonly used comparison predicates include IS NULL, IS NOT NULL, IS TRUE, and IS FALSE. The first two predicates test for the NULL condition. NULL is a special type of missing data and indicates that data does not exist. NULL is represented as a blank cell in a database table. I’ll cover missing data in more detail in a future post.
The following select statement provides an example of using the IS NOT NULL predicate to return only those rows in the public.site table for which a site_field_note exists, i.e., rows for which the site_field_note column is not null (Figure 2).
What the Null?
In database lingo NULL means that data does not exist, i.e., there is no value. Because there is no value, standard comparison operators (e.g., =, >, <) won’t work with NULL. Instead, to filter for NULL use the predicates IS NULL, IS NOT NULL, IS DISTINCT FROM, or IS NOT DISTINCT FROM.
SELECT study_value, sample_id, site_field_note FROM public.site WHERE site_field_note IS NOT NULL
The second two predicates, IS TRUE and IS FALSE, are used with boolean (i.e., true/false) fields to test for the expressed condition, e.g., the expression column1 IS TRUE will return all rows where the value of column1 is true.
Other Options for Filtering Boolean Fields
There are many ways to filter boolean fields in addition to the IS TRUE and IS FALSE predicates. The equal sign operator can also be used in conjunction with the following strings (that are not case sensitive) to represent true and false:
- TRUE: t, true, y, yes, on, 1
- FALSE: f, false, n, no, off, 0
Thus, the following WHERE clauses are equivalent to WHERE column1 IS TRUE:
- WHERE column1 = ‘t’
- WHERE column1 = ‘true’
- WHERE column1 = ‘TRUE’
- WHERE column1 = ‘yes’
- WHERE column1 = ‘on’
- WHERE column1 = ‘1’
And the following are equivalent to WHERE column1 IS FALSE:
- WHERE column1 = ‘false’
- WHERE column1 = ‘FALSE’
- WHERE column1 = ‘no’
- WHERE column1 = ‘off’
- WHERE column1 = ‘0’
Read more about the boolean data type here: https://www.postgresql.org/docs/current/datatype-boolean.html
Filtering using OR
Here’s an example of using the OR operator in the WHERE clause:
SELECT study_value, sample_id, physiography_value FROM public.site WHERE physiography_value = 'a' OR physiography_value = 'u'
The above select statement will return the rows in the public.site table for which the physiography_value column is equal to either a or u (Figure 3A).
Filtering using IN
The above select statement could also be written using the IN operator. The IN operator is a type of subquery expression. Subquery expressions are a special kind of boolean expression that evaluate a query within a query (i.e., a subquery) to true or false. We’ll cover subqueries in more detail in a future post. For now, suffice it to know that IN is used with the WHERE clause and uses a set of values as the evaluation criteria. Rows that contain one of the values in the set will be returned in the query. The set of values is provided as a comma separate list within parentheses after the IN keyword. For instance, the following select statement is equivalent to the above select statement:
SELECT study_value, sample_id, physiography_value FROM public.site WHERE physiography_value IN ('a','u')
The IN operator can be used with the NOT operator to filter for rows with the values do not match the list of values in the subquery expression. For instance, the below query will select all rows where physiography_value does not equal a or u:
SELECT study_value, sample_id, physiography_value FROM public.site WHERE physiography_value NOT IN ('a','u')
Filtering using LIKE
The LIKE expression and its converse NOT LIKE are types of pattern matching boolean expressions which evaluate to true if a string in the target column matches the user supplied pattern. LIKE expressions are constructed by providing a target column, followed by the LIKE operator, followed by a combination of literal characters (i.e., the pattern of interest) and wildcard characters.
There are 2 types of wildcard characters used with LIKE:
- Percent sign (%): matches any sequence of zero or more characters
- Underscores ( _ ): matches any single character
The relative position of the wildcard characters with respect to the literal characters is used to specify which part of the search string is the wildcard (i.e., zero or more characters).
Here’s an example of LIKE using the percent sign:
SELECT study_value, sample_id FROM public.sample WHERE sample_id LIKE 'GAAR%'
The above select statement will return all rows where the value in the sample_id field starts with GAAR followed by zero or more characters (Figure 4A). Placing the percent sign before the literal characters will return all rows with the literal characters at the end of the string. For instance, the following select statement will return rows that end with 2007 and are preceded by zero or more characters (Figure 4B).
SELECT study_value, sample_id FROM public.sample WHERE sample_id LIKE '%2007'
And the following will return all rows that contain T01 anywhere in the string (Figure 4C):
SELECT study_value, sample_id FROM public.sample WHERE sample_id LIKE '%T01%'
Here’s an example using the underscore wildcard character:
SELECT study_value, sample_id, geomorph_value FROM public.site WHERE geomorph_value LIKE '_b_'
The above select statement will return all rows with a geomorph_value that is 3 characters long, with b as the second character (Figure 5A).
The percent sign and underscore wildcard characters can also be used in combination. For example, the following select statement will return those rows in which microtopo_value starts with a single character followed by an r and ending with zero or more characters (Figure 5B):
SELECT study_value, sample_id, microtopo_value FROM public.site WHERE microtopo_value LIKE '_r%'
If you want to match a literal underscore or percent sign without matching other characters then the character of interest must be preceded with an escape character. The default escape character is a backslash. For example, the following select statement will find only those rows in the public.site table where the site_field_note column contains a percent sign (Figure 5C):
SELECT study_value, sample_id,site_field_note FROM public.site WHERE site_field_note LIKE '%\%%'
Whereas the following will return all rows where the site_field_note contains any number of characters:
SELECT study_value, sample_id,site_field_note FROM public.site WHERE site_field_note LIKE '%%%'
Filtering using Regular Expressions
Regular expressions are an advanced type of pattern matching expression that can be used in select statements similar to LIKE. However, regular expression are much more powerful than LIKE and can be used to finding complex patterns. This can be especially useful for querying unstructured data, for instance a free-form notes field.
Here’s one example of using regular expressions. The following query will select all rows in which the site_field_note column contains a stand alone zero (0) followed by one or more numbers between zero and 9, followed by one or more spaces, followed by one or more upper or lower case letters. When using regular expressions with WHERE the tilde character (~) is placed between the column name and regular expression and is approximately equivalent to LIKE.
SELECT sample_id, site_field_note FROM public.site WHERE site_field_note ~ '((?<!\d)0-[0-9]+)\s+([A-Za-z]+)'
Here are the parts (called “tokens”) of the regular expression broken down and explained:
- Parentheses : Capture groups. All of the tokens within represent a substring of interest.
- (?<!\d) : Negative look behind. Specifies a group that can not match before the expression of interest. In this case I’m specifying the substring of interest can not be preceded by any digit (\d)
- 0 : matches the number zero (0)
- – : matches a dash character
- [0-9] : matches any number from zero (0) to nine (9)
- + : match one or more of the preceding characters
- \s+ : match one or more white spaces
- [A-Za-z]+ : match one or more capital or lower case letters, i.e., A through Z or a through z
The results of the above select statement are presented in Figure 6.
As I’m no expert in regular expressions, a deep dive into them is beyond the scope of this blog. If you’d like to learn more about regular expressions and practice writing them please refer to regular-expressions.info, RegexOne, RegExr, and Regular Expressions 101.
Order of Precedence
SQL boolean expressions follow an order of operations or precedence rules similar to mathematical expressions. SQL operator precedence are described in detail in the Lexical Structure section of the PostgreSQL documentation. In brief, boolean expressions within parentheses take precedence over boolean expressions outside parentheses and AND takes precedence over OR. You should keep these rules in mind when constructing complex boolean statements because the results of a select statement will differ depending on if parentheses are used or not. For example the following 2 select statements will return different results:
Select Statement 1: Without Parentheses
SELECT sample_id, physiography_value, site_ec_us_cm, site_ph FROM public.site WHERE physiography_value = 'l' AND site_ec_us_cm >= 300 OR site_ph >=7.0 ORDER BY sample_id
Select Statement 2: With Parentheses
SELECT sample_id, physiography_value, site_ec_us_cm, site_ph FROM public.site WHERE physiography_value = 'l' AND (site_ec_us_cm >= 300 OR site_ph >=7.0) ORDER BY sample_id
This is because in the first statement the AND is processed first, and the OR is processed second. Therefore the result of the first query includes all rows where physiography_value equals ‘l’ and site_ec_us_cm is greater than or equal to 300, plus all the rows (regardless of physiography_value) with a site_ph greater than or equal to 7.0.
Alternatively, in the second statement the expression within the parentheses is evaluated first, followed by the expression before the AND operator. Therefore the second query will first find all the rows where site_ec_us_cm is greater than or equal to 300 or site_ph is greater than or equal to 7.0, and then within that set will return only those rows that satisfy the physiography_value criteria, a much more limited number of rows.
PostgreSQL. 2021. Documentation – PostgreSQL 14. Available online: https://www.postgresql.org/docs/current/functions-comparison.html (Accessed 2022-01-01)
Next Time on Elfinwood Data Science Blog
In this post, we learned about SQL select statements and looked at a number of examples of select statements to query data from a single table my_first_database. In the next post I’ll cover querying data from more than one table in a single select statment using SQL joins.
Please support this blog
This blog is provided free of charge and free of advertisements. The best way to provide support is to follow this blog and share it with a friend. Several options exist for supporting this blog.
- If you like this post then please consider subscribing to this blog (see below) or following me on social media.
- If you’d like to follow this blog, but you don’t want to follow by subscribing through WordPress then a couple of options exist:
- Subscribe to our YouTube channel and watch the latest lesson videos here: Elfinwood Ecology YouTube Channel
- Lastly, consider voluntarily contributing to the cause and help me pay for the website hosting and domain fees ($102 annually), and some of the time I spend preparing content (see below).
If you find value in the content of this blog then please consider voluntarily contributing to the cause and help me pay for the website hosting and domain fees ($102 annually), and some of the time I spend preparing content. Click on the PayPal button below to contribute using your PayPal account, or click on one of the credit card icons to contribute using a credit card.
Follow My Blog
Subscribe to get the latest updates!
Copyright © 2022, Aaron Wells