How Regular Expression Alternation Works
2020-08-18 - By Robert Elder
This article is part of Series On Regular Expressions.
What is 'Alternation'?
Here is an example of a regular expression that uses an alternation.
cat|dog
This will match match a piece of text that contains either the word 'cat' or the word 'dog'. You could say that 'alternation' is used whenever there is more than one alternative for the string you want to match.
A Pet Food Example
Here is a diagram that shows the control flow graph of how a regular expression engine could interpret this regex. In this diagram, the regular expression matcher starts at the 'BEGIN' node and the moves to the 'SPLIT' node where it effectively splits control into two possible paths. First it tries to match the path with the word 'cat'. If this fails, it tries the other path with the word 'dog'. If neither path has a match, control moves to the 'FAILURE' node and the search will either re-start and continue at next character, or fail completely if it's already reached the end of the text being searched.
You can also specify more than two possible alternatives by separating them all with the vertical line or 'pipe' symbol:
cat|dog|mouse|snake|bird
Since alternation can be thought of as a list of alternatives, it's worth comparing it to the character class. As we saw in the first part of this series, the character class can be thought of as a list of possible characters that can appear at a specific position in our string. Regardless of what you put inside a character class, the entire class can only match a single character in your search string. With alternation, each possible alternative on either side of the pipe symbol can be any number of characters.
Because the alternatives in an alternation can be a variable number of characters, it's common to use parentheses to define the start and end position for the list of alternatives. For example, this regular expression:
For sale: (cat|dog|mouse|snake|bird) food
will match any of the following strings:
"For sale: cat food"
"For sale: dog food"
"For sale: mouse food"
"For sale: snake food"
"For sale: bird food"
However, if we didn't use the parentheses and wrote this regex instead:
For sale: cat|dog|mouse|snake|bird food
The first and last choices in the alternation would have included the very start and the very end of the regular expression, which isn't what we want here. The presence of spaces does not break up the alternation since they're treated as just another part of the pattern that will be searched for. Here is a list of strings that would match this alternation:
"For sale: cat"
"dog"
"mouse"
"snake"
"bird food"
Any part of a regular expression that's inside parentheses is commonly referred to as a sub-expression or also as a group. Sub-expressions also have a number of other important features and uses that will not be discussed at this time.
A very common use case for alternation is to match suffixes or prefixes in words. For example, this regular expression will match several different variations of the word 'employ'.
employ(er|ee|ment|ing|able)
Watch Out For Prefixes
At its core, alternation is actually quite simple. In fact, it's probably correct to say that it's easier to understand alternation that it is to understand character classes. There is however, one caveat that you should be aware of in order to avoid surprises. If one of the choices in your alternation is a prefix another, your regex may not always match the same thing in every regex engine.
For example, consider this alternation:
cat|dog|mouse|snake|bird|caterpillar
It contains both the word 'cat' and the word 'caterpillar'. Since 'cat' is prefix of the word 'caterpillar' there can be ambiguity about which alternative to match.
In most modern regular expression engines, the part of the alternation that will match is simply the first choice that matches, starting on the left and moving towards the right. This means that the order in which you specify options in the alternation may affect which option will be chosen. For example, if we pass the text "The cat likes to play with the caterpillar." with the regex shown above through grep using Perl-compatible regular expression mode, it will only match the word 'cat' twice:
echo "The cat likes to play with the caterpillar" | grep -Po 'cat|dog|mouse|snake|bird|caterpillar'
outputs the following:
cat
cat
However, if we re-arrange the ordering of the alternation, so that 'caterpillar' is the first option we'll be able to match the word 'cat' and the word 'caterpillar':
echo "The cat likes to play with the caterpillar" | grep -Po 'caterpillar|dog|mouse|snake|bird|cat'
outputs the following:
cat
caterpillar
POSIX Alternation Gives Different Results
Now, if we change from using Perl-compatible mode to POSIX-based extended regular expression mode (by changing grep's command-line flag from -P to -E), the result is the same no matter what the ordering is of the choices in the alternation:
echo "The cat likes to play with the caterpillar" | grep -Eo 'cat|dog|mouse|snake|bird|caterpillar'
outputs the following:
cat
caterpillar
echo "The cat likes to play with the caterpillar" | grep -Eo 'caterpillar|dog|mouse|snake|bird|cat'
outputs the following:
cat
caterpillar
This is because POSIX based regular expression engines are required to always try all possible options in an alternation and pick the longest match when searching from the leftmost position in the search text.
Since all of these details can be confusing and hard to remember, if you simply make sure that the longest possible alternation choices are placed first in the alternation, then you should be able to avoid this ambiguity even when you don't exactly know whether you're using a Perl-like or POSIX-line regular expression engine. If possible, I would also recommend avoiding POSIX based regular expression engines when you can. The POSIX regular expression standards are much older and less well-defined than newer Perl based ones.
This article is part 3 of a Series On Regular Expressions. You can read Part 4 on quantifiers here.
The Regular Expression Visualizer, Simulator & Cross-Compiler Tool
Published 2020-07-09 |
$20.00 CAD |
Regular Expression Character Escaping
Published 2020-11-20 |
How Do Regular Expression Quantifier Work?
Published 2020-08-18 |
Interesting Regular Expression Test Cases
Published 2020-07-09 |
Character Ranges & Class Negation in Regular Expressions
Published 2020-05-31 |
Guide To Regular Expressions
Published 2020-07-09 |
Character Classes in Regular Expressions - A Gentle Introduction
Published 2020-05-10 |
Join My Mailing List Privacy Policy |
Why Bother Subscribing?
|