Lesson 009: Advanced filtering using SQL WHERE

A brilliant yellow stand of subalpine larch (Larix lyallii) near Cutthroat Pass in the North Cascades, WA.

Introduction

In the previous 2 posts I covered SQL select statements, and we looked at the Lesson_007 and Lesson_008 files in the learning-data-science GitHub repository. Lesson 007 covered the basics of SQL selects statements, and Lesson 008 covered LIMIT and OFFSET, column aliases, data type casting, creating columns on-the-fly, mathematical operations.

In this post, Lesson 009 we’ll go into more detail about filtering in select statements using the WHERE clause, and I’ll provide examples from my_first_database.

Since data analysis and management are best learned by doing, the objective of this and future posts is to get readers hands on experience using PostgreSQL and R. This lesson builds on earlier posts so, if this is your first time reading this blog and you’d like to follow along on your computer, then please go back and start from the post Learning Data Science and work through the lessons up to this point.

Lesson 9 Video
Check out the accompanying video on the Elfinwood Data Science YouTube channel here: Elfinwood Data Science Blog Lesson 9 Video.

For this post we’ll refer to files in the Lesson 9 folder in the learning-data-science repository. If you have not already forked the learning-data-science repository, and cloned the fork to your computer, please do so now following the instructions provided in the post Learning Data Science. If you have previously forked and cloned the repository, but don’t have the files for this lesson, then follow the instructions here to update your fork with the latest files.

If you are not familiar with version control using Git and GitHub, I recommend going back and working through the hands on tutorials in my series on version control starting with “Version Control: An Overview.”

Filtering Rows

The WHERE clause is used to filter rows in SELECT statements. The WHERE clause is placed after the FROM clause and is followed by one or more boolean expressions combined using AND or OR operators. Boolean expressions are written using comparison functions or operators and evaluate to true or false.

For example, the following select statement returns all columns and only those rows from the public.site table where the soil_restrict_layer_code column is equal to permafrost (Figure 1A).

SELECT * FROM public.site
	WHERE soil_restrict_layer_code = 'permafrost'

Note the use of single quotes around the value permafrost. This is standard SQL syntax when referring to values in columns with text or character varying data types.

Here’s another example using the greater than or equal to operator, the following select statement will return only those rows in the public.sample table where the expression elevation_m >= 1000 is true (i.e., plots with an elevation_m greater than or equal to 1000 meters). The results are displayed in Figure 1B.

SELECT
  study_value,
  sample_id,
  elevation_m
FROM public.sample
WHERE elevation_m >= 1000;
Filtering using AND

Here’s another example using a combination of boolean expressions strung together using AND:

SELECT
  study_value,
  sample_id,
  elevation_m
FROM public.sample
WHERE elevation_m >= 1000 AND elevation_m <= 1200
ORDER BY elevation_m ASC;

The above select statement will return the rows in public.sample with an elevation_m greater than or equal to 1000 meters and less than or equal to 1200 meters and sorted in ascending order by elevation_m (Figure 1C). The above boolean expression can be expressed in mathematical notation as 1000 ≤ elevation_m ≤ 1200.

Filtering using BETWEEN

The above select statement could also be written using BETWEEN to express the range of values in the boolean expression of the WHERE clause. For instance, the select statement below will return the same results as above:

SELECT
  study_value,
  sample_id,
  elevation_m
FROM public.sample
WHERE elevation_m BETWEEN 1000 AND 1200
ORDER BY elevation_m ASC;

BETWEEN is one of several comparison predicates that are useful for writing boolean expressions. Comparison predicates are similar to operators but require special syntax mandated by the SQL standard (PostgreSQL 2021).

Filtering Using Other Types of Predicates

Other commonly used comparison predicates include IS NULL, IS NOT NULL, IS TRUE, and IS FALSE. The first two predicates test for the NULL condition. NULL is a special type of missing data and indicates that data does not exist. NULL is represented as a blank cell in a database table. I’ll cover missing data in more detail in a future post.

The following select statement provides an example of using the IS NOT NULL predicate to return only those rows in the public.site table for which a site_field_note exists, i.e., rows for which the site_field_note column is not null (Figure 2).

What the Null?

In database lingo NULL means that data does not exist, i.e., there is no value. Because there is no value, standard comparison operators (e.g., =, >, <) won’t work with NULL. Instead, to filter for NULL use the predicates IS NULL, IS NOT NULL, IS DISTINCT FROM, or IS NOT DISTINCT FROM.

SELECT study_value, sample_id, site_field_note
FROM public.site
WHERE site_field_note IS NOT NULL

The second two predicates, IS TRUE and IS FALSE, are used with boolean (i.e., true/false) fields to test for the expressed condition, e.g., the expression column1 IS TRUE will return all rows where the value of column1 is true.

Figure 2. The first 15 rows of the results from the select statement using the IS NOT NULL predicate to return only those rows where a note exists in the site_field_note column.
Other Options for Filtering Boolean Fields

There are many ways to filter boolean fields in addition to the IS TRUE and IS FALSE predicates. The equal sign operator can also be used in conjunction with the following strings (that are not case sensitive) to represent true and false:

  • TRUE: t, true, y, yes, on, 1
  • FALSE: f, false, n, no, off, 0

Thus, the following WHERE clauses are equivalent to WHERE column1 IS TRUE:

  • WHERE column1 = ‘t’
  • WHERE column1 = ‘true’
  • WHERE column1 = ‘TRUE’
  • WHERE column1 = ‘yes’
  • WHERE column1 = ‘on’
  • WHERE column1 = ‘1’

And the following are equivalent to WHERE column1 IS FALSE:

  • WHERE column1 = ‘false’
  • WHERE column1 = ‘FALSE’
  • WHERE column1 = ‘no’
  • WHERE column1 = ‘off’
  • WHERE column1 = ‘0’

Read more about the boolean data type here: https://www.postgresql.org/docs/current/datatype-boolean.html

Filtering using OR

Here’s an example of using the OR operator in the WHERE clause:

SELECT study_value, sample_id, physiography_value
FROM public.site
WHERE physiography_value = 'a' OR physiography_value = 'u'

The above select statement will return the rows in the public.site table for which the physiography_value column is equal to either a or u (Figure 3A).

Filtering using IN

The above select statement could also be written using the IN operator. The IN operator is a type of subquery expression. Subquery expressions are a special kind of boolean expression that evaluate a query within a query (i.e., a subquery) to true or false. We’ll cover subqueries in more detail in a future post. For now, suffice it to know that IN is used with the WHERE clause and uses a set of values as the evaluation criteria. Rows that contain one of the values in the set will be returned in the query. The set of values is provided as a comma separate list within parentheses after the IN keyword. For instance, the following select statement is equivalent to the above select statement:

SELECT study_value, sample_id, physiography_value
FROM public.site
WHERE physiography_value IN ('a','u')

The IN operator can be used with the NOT operator to filter for rows with the values do not match the list of values in the subquery expression. For instance, the below query will select all rows where physiography_value does not equal a or u:

SELECT study_value, sample_id, physiography_value
FROM public.site
WHERE physiography_value NOT IN ('a','u')
Filtering using LIKE

The LIKE expression and its converse NOT LIKE are types of pattern matching boolean expressions which evaluate to true if a string in the target column matches the user supplied pattern. LIKE expressions are constructed by providing a target column, followed by the LIKE operator, followed by a combination of literal characters (i.e., the pattern of interest) and wildcard characters.

There are 2 types of wildcard characters used with LIKE:

  • Percent sign (%): matches any sequence of zero or more characters
  • Underscores ( _ ): matches any single character

The relative position of the wildcard characters with respect to the literal characters is used to specify which part of the search string is the wildcard (i.e., zero or more characters).

Here’s an example of LIKE using the percent sign:

SELECT study_value, sample_id
FROM public.sample
WHERE sample_id LIKE 'GAAR%'

The above select statement will return all rows where the value in the sample_id field starts with GAAR followed by zero or more characters (Figure 4A). Placing the percent sign before the literal characters will return all rows with the literal characters at the end of the string. For instance, the following select statement will return rows that end with 2007 and are preceded by zero or more characters (Figure 4B).

SELECT study_value, sample_id
FROM public.sample
WHERE sample_id LIKE '%2007'

And the following will return all rows that contain T01 anywhere in the string (Figure 4C):

SELECT study_value, sample_id
FROM public.sample
WHERE sample_id LIKE '%T01%'

Here’s an example using the underscore wildcard character:

SELECT study_value, sample_id, geomorph_value
FROM public.site
WHERE geomorph_value LIKE '_b_'

The above select statement will return all rows with a geomorph_value that is 3 characters long, with b as the second character (Figure 5A).

The percent sign and underscore wildcard characters can also be used in combination. For example, the following select statement will return those rows in which microtopo_value starts with a single character followed by an r and ending with zero or more characters (Figure 5B):

SELECT study_value, sample_id, microtopo_value
FROM public.site
WHERE microtopo_value LIKE '_r%'

If you want to match a literal underscore or percent sign without matching other characters then the character of interest must be preceded with an escape character. The default escape character is a backslash. For example, the following select statement will find only those rows in the public.site table where the site_field_note column contains a percent sign (Figure 5C):

SELECT study_value, sample_id,site_field_note
FROM public.site
WHERE site_field_note LIKE '%\%%'

Whereas the following will return all rows where the site_field_note contains any number of characters:

SELECT study_value, sample_id,site_field_note
FROM public.site
WHERE site_field_note LIKE '%%%'
Filtering using Regular Expressions

Regular expressions are an advanced type of pattern matching expression that can be used in select statements similar to LIKE. However, regular expression are much more powerful than LIKE and can be used to finding complex patterns. This can be especially useful for querying unstructured data, for instance a free-form notes field.

Here’s one example of using regular expressions. The following query will select all rows in which the site_field_note column contains a stand alone zero (0) followed by one or more numbers between zero and 9, followed by one or more spaces, followed by one or more upper or lower case letters. When using regular expressions with WHERE the tilde character (~) is placed between the column name and regular expression and is approximately equivalent to LIKE.

SELECT sample_id, site_field_note
FROM public.site
WHERE site_field_note ~ '((?<!\d)0-[0-9]+)\s+([A-Za-z]+)'

Here are the parts (called “tokens”) of the regular expression broken down and explained:

  • Parentheses : Capture groups. All of the tokens within represent a substring of interest.
  • (?<!\d) : Negative look behind. Specifies a group that can not match before the expression of interest. In this case I’m specifying the substring of interest can not be preceded by any digit (\d)
  • 0 : matches the number zero (0)
  • – : matches a dash character
  • [0-9] : matches any number from zero (0) to nine (9)
  • + : match one or more of the preceding characters
  • \s+ : match one or more white spaces
  • [A-Za-z]+ : match one or more capital or lower case letters, i.e., A through Z or a through z

The results of the above select statement are presented in Figure 6.

Figure 6: The results of filtering using a regular expression. Notice that the total 8 rows that are returned meet the requirements of the regular expression, i.e., contains a stand alone zero (0) followed by one or more numbers between zero and 9, followed by one or more spaces, followed by one or more upper or lower case letters.

As I’m no expert in regular expressions, a deep dive into them is beyond the scope of this blog. If you’d like to learn more about regular expressions and practice writing them please refer to regular-expressions.info, RegexOne, RegExr, and Regular Expressions 101.

Order of Precedence

SQL boolean expressions follow an order of operations or precedence rules similar to mathematical expressions. SQL operator precedence are described in detail in the Lexical Structure section of the PostgreSQL documentation. In brief, boolean expressions within parentheses take precedence over boolean expressions outside parentheses and AND takes precedence over OR. You should keep these rules in mind when constructing complex boolean statements because the results of a select statement will differ depending on if parentheses are used or not. For example the following 2 select statements will return different results:

Select Statement 1: Without Parentheses

SELECT sample_id, physiography_value, site_ec_us_cm, site_ph
FROM public.site
WHERE physiography_value = 'l' AND site_ec_us_cm >= 300 OR site_ph >=7.0
ORDER BY sample_id

Select Statement 2: With Parentheses

SELECT sample_id, physiography_value, site_ec_us_cm, site_ph FROM public.site
WHERE physiography_value = 'l' AND (site_ec_us_cm >= 300 OR site_ph >=7.0)
ORDER BY sample_id

This is because in the first statement the AND is processed first, and the OR is processed second. Therefore the result of the first query includes all rows where physiography_value equals ‘l’ and site_ec_us_cm is greater than or equal to 300, plus all the rows (regardless of physiography_value) with a site_ph greater than or equal to 7.0.

Alternatively, in the second statement the expression within the parentheses is evaluated first, followed by the expression before the AND operator. Therefore the second query will first find all the rows where site_ec_us_cm is greater than or equal to 300 or site_ph is greater than or equal to 7.0, and then within that set will return only those rows that satisfy the physiography_value criteria, a much more limited number of rows.

Literature Cited

PostgreSQL. 2021. Documentation – PostgreSQL 14. Available online: https://www.postgresql.org/docs/current/functions-comparison.html (Accessed 2022-01-01)

As a supplement to this post, I encourage you to read more about and practice writing Regular expressions by referring to the following websites: regular-expressions.info, RegexOne, and RegExr.

Next Time on Elfinwood Data Science Blog

In this post, we learned about SQL select statements and looked at a number of examples of select statements to query data from a single table my_first_database. In the next post I’ll cover querying data from more than one table in a single select statment using SQL joins.

Please support this blog

This blog is provided free of charge and free of advertisements. The best way to provide support is to follow this blog and share it with a friend. Several options exist for supporting this blog.

  1. If you like this post then please consider subscribing to this blog (see below) or following me on social media.
  2. If you’d like to follow this blog, but you don’t want to follow by subscribing through WordPress then a couple of options exist:
  3. Subscribe to our YouTube channel and watch the latest lesson videos here: Elfinwood Ecology YouTube Channel
  4. Lastly, consider voluntarily contributing to the cause and help me pay for the website hosting and domain fees ($102 annually), and some of the time I spend preparing content (see below).

Consider Contributing

If you find value in the content of this blog then please consider voluntarily contributing to the cause and help me pay for the website hosting and domain fees ($102 annually), and some of the time I spend preparing content. Click on the PayPal button below to contribute using your PayPal account, or click on one of the credit card icons to contribute using a credit card.

$5.00


Follow My Blog

Subscribe to get the latest updates!

Join 18 other followers

Copyright © 2022, Aaron Wells

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: