2020-05-31 - By Robert Elder
This article is part of Series On Regular Expressions.
This is the second article in a series that will train you to become a Regular Expression Master. The first part of the series is: Character Classes in Regular Expressions - A Gentle Introduction.
This article will discuss character ranges and negated character classes.
In the last section, we saw how a character class is basically just a way to specify a list of characters that can appear at a specific position in your search string.
We also reviewed the first three characters that often require escaping inside a character class. Now is an appropriate time to review the fourth character that must be escaped in a character class: the 'dash', or 'hyphen', character.
Recall the example we reviewed where we wanted to search for references to figure names. These figure names included a single letter of the alphabet to identify the figure. We used the character class '[abcdefghijklmnopqrstuvwxyz]' to specify that the last character could be any lower-case letter of the alphabet. But this syntax looks ugly and takes up a lot of space to write out. Wouldn't it be better if there was a short-hand notation that allowed us to specify every lower-case letter of the alphabet without actually writing out every letter? Something like 'a' '-' 'z'. Well, this is exactly how it's done in real regular expressions. This feature is called 'character ranges'.
Introducing Character Ranges
And now you can see why the '-' character is the fourth character that must be escaped inside a character class. When you write 'a-z' inside a character class, as in '[a-z]', it will match all lower-case letters of the alphabet.
If, for some reason, you *actually* wanted the dash character to be treated literally, you could do so by escaping it with a backslash, as in '[a\-z]'.
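To see the difference between a range and an escaped dash in practice, here is a minimal sketch using Python's built-in `re` module. The choice of Python and the sample string are assumptions made purely for illustration; other engines should behave the same way for this simple case.

```python
import re

sample = "a-z and m"  # hypothetical sample text

# 'a-z' inside the class is a range: every lower-case letter matches.
range_matches = re.findall(r"[a-z]", sample)

# With the dash escaped, the class lists three literal characters: 'a', '-' and 'z'.
literal_matches = re.findall(r"[a\-z]", sample)

print(range_matches)    # ['a', 'z', 'a', 'n', 'd', 'm']
print(literal_matches)  # ['a', '-', 'z', 'a']
```

Notice that the range version never matches the dash itself, while the escaped version never matches 'n', 'd' or 'm'.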
Common Character Ranges
Some other common examples of character ranges include matching all upper-case letters: '[A-Z]'
Matching all number digits: '[0-9]'
Or matching only a part of the alphabet, such as characters A to F: '[A-F]'
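These three common ranges can be tried out with a short sketch like the one below. It assumes Python's `re` module, and the sample text is invented for the example.

```python
import re

text = "Room 4B, Floor G"  # hypothetical sample text

upper_found = re.findall(r"[A-Z]", text)   # all upper-case letters
digit_found = re.findall(r"[0-9]", text)   # all number digits
a_to_f_found = re.findall(r"[A-F]", text)  # only the letters A to F

print(upper_found)   # ['R', 'B', 'F', 'G']
print(digit_found)   # ['4']
print(a_to_f_found)  # ['B', 'F']
```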
Ranges Between Arbitrary Letters & Digits
You're not just limited to these fairly common character ranges either. Many people don't realize that you can also create character ranges between arbitrary consecutive digits or letters of the alphabet. Ranges such as '[b-y]' or '[3-7]' are valid in almost any regular expression engine.
When defining these character ranges, the most important constraint to be aware of is that the character that starts the range can't come after the character that ends it. For example, if you try to create a character range from d to a ('[d-a]'), you'll get an error about an invalid range. But if you instead re-arrange the order of the characters to '[a-d]', you'll be able to match the characters 'a' to 'd' as expected.
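The ordering constraint is easy to confirm for yourself. The sketch below assumes Python's `re` module, which rejects a reversed range by raising `re.error` when the pattern is compiled. The helper name `compiles` is made up for this example, and the exact wording of the error message will differ between engines.

```python
import re

def compiles(pattern):
    """Return True if the engine accepts the pattern, False if it is rejected."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"[d-a]"))  # False: the start of the range comes after the end
print(compiles(r"[a-d]"))  # True
print(re.findall(r"[a-d]", "abcdef"))  # ['a', 'b', 'c', 'd']
```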
Ranges Between Arbitrary Characters
Note that some regular expression engines only support the simpler character ranges that involve letters and numbers. For example, in grep using basic regular expression mode, you can do the following:
echo "_ASDF_" | grep "[a-d]"
But you can't do the same thing using symbols as a part of the character range:
echo "_ASDF_" | grep "[=-_]"
However, if you enable Perl compatible regular expressions with grep, this does work:
echo "_ASDF_" | grep -P "[=-_]"
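For comparison, here is how one engine that does support symbol ranges treats the same class. This is a sketch that assumes Python's `re` module rather than grep. In this engine a range covers every character whose code point lies between the two endpoints, which is why a range between two symbols can also pick up letters.

```python
import re

text = "_ASDF_"

# '=' is code point 61 and '_' is code point 95, so the range [=-_]
# spans everything in between, including the upper-case letters (65 to 90).
print(ord("="), ord("_"))  # 61 95

symbol_range_matches = re.findall(r"[=-_]", text)
print(symbol_range_matches)  # ['_', 'A', 'S', 'D', 'F', '_']

# Lower-case letters start at code point 97, which is outside the range.
lower_matches = re.findall(r"[=-_]", "asdf")
print(lower_matches)  # []
```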
Multiple Ranges In One Class
It's worth noting that you can put more than one character range inside a character class at the same time. Here is a character class that will match all lower-case and all upper-case letters at the same time: '[a-zA-Z]'
Multiple Ranges & Symbols
You can even include a mix of individual characters as well as character ranges at the same time. This is where character classes can start to become a bit hard to read, and the diagram illustrated here starts to become very useful:
The trick to reading this regular expression is to first look for all the unescaped dash characters. Once you find them and recognize that they define character ranges, you can pick out where the individual characters are.
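As a worked illustration of that reading trick, consider the class '[a-f0-9_.]'. This class is a hypothetical example chosen for this sketch, not the one from the original diagram. It has two unescaped dashes, so it contains two ranges ('a-f' and '0-9'), and what remains ('_' and '.') are individual characters. The sketch assumes Python's `re` module.

```python
import re

# Hypothetical mixed class: two ranges (a-f, 0-9) plus two single characters (_ and .)
mixed = re.compile(r"[a-f0-9_.]")

text = "cafe_01.XYZ!"  # made-up sample text
kept = "".join(mixed.findall(text))

print(kept)  # cafe_01.
```

The upper-case letters and the exclamation mark are left out because they fall in neither range and are not listed individually.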
Does Order Matter In A Character Class?
Something you may have wondered is whether the order of characters in a character class matters. There are several points to make on this topic: Recall that the endpoints of a character range must be in ascending order for the regex to work properly, so in this case you can say that the order matters. However, what about the order in which two different character ranges are specified? In this case, the ordering will have no functional effect on which characters the regex will match. For example, this character class '[a-zA-Z]' will match exactly the same characters as the character class '[A-Za-z]'.
Similarly, the ordering of individual characters in a character class has no functional effect when determining which characters a regex will match. For example, the character class '[abcdef]' will match exactly the same characters as the character class '[fedcba]'.
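One way to convince yourself of this is to test every printable ASCII character against both versions of a class and compare the results. The sketch below assumes Python's `re` module, and the helper `matched_set` is a name invented for this example.

```python
import re
import string

def matched_set(char_class):
    """Return the set of printable ASCII characters that the class matches."""
    pattern = re.compile(char_class)
    return {ch for ch in string.printable if pattern.fullmatch(ch)}

# Swapping the order of two ranges does not change what is matched.
print(matched_set(r"[a-zA-Z]") == matched_set(r"[A-Za-z]"))  # True

# Neither does reordering individual characters.
print(matched_set(r"[abcdef]") == matched_set(r"[fedcba]"))  # True
```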
Don't Break Up Escaped Characters
However, you should be careful when re-arranging characters in a character class that contains escaped characters. Consider the following case where you have a character class that contains an escaped backslash as the endpoint of a character range:
If you move one of the backslashes one character forward, this will completely change which characters the class matches:
This shouldn't be surprising though, because we saw earlier that putting two backslashes together represents a single backslash character due to the way the escaping rules work. If we break up these two characters that are intended to represent a single escaped character, then there can be no expectation of getting the same result.
Therefore, as long as you take care not to break up individual escaped characters or character ranges, the order in which you specify items in a character class does not make any functional difference when determining what characters the class will match.
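Here is a small sketch of what goes wrong when an escaped character is split up. The two classes below are hypothetical stand-ins chosen for this example (they are not necessarily the ones from the original figures), and the sketch assumes Python's `re` module. In '[\\-_]' the escaped backslash is the start of a range that runs up to the underscore. Moving the second backslash one position to the right gives '[\-\_]', which is just an escaped dash followed by an escaped underscore.

```python
import re

text = "\\]^_-"  # the five characters: \ ] ^ _ -

# Escaped backslash as a range endpoint: matches code points 92 to 95 (\ ] ^ _).
intact = re.findall(r"[\\-_]", text)

# Same characters with one backslash moved: now only '-' and '_' are listed.
broken = re.findall(r"[\-\_]", text)

print(intact)  # ['\\', ']', '^', '_']
print(broken)  # ['_', '-']
```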
You might point out that the simple search program we created in the last section might run a bit faster or slower depending on the order of the characters in the character class. However, the code used in this program was just a simple example made for the purposes of illustrating the easiest possible way to implement character class searches. Real regular expression engines are likely to use many optimizations which make them much faster, but more complicated.
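Many engines will also treat an unescaped dash literally when it is the first or last character in the class, as in '[-a-z]' or '[a-z-]'. Another example is the use of an unescaped dash character just after a character range, such as in '[A-Z-z]'.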
If you try these same examples in the Ruby programming language, only the first two will work without complaining. The last example will issue a warning message.
ruby -e 'puts "abc".scan(/[a-z-]/)'
ruby -e 'puts "abc".scan(/[-a-z]/)'
ruby -e 'puts "abc".scan(/[A-Z-z]/)'
-e:1: warning: character class has '-' without escape: /[A-Z-z]/
Therefore, to keep things simple, just stick to always escaping a literal 'dash' character in a character class.
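To show that the escaped form gives up nothing, the sketch below compares it with the two positional forms. It assumes Python's `re` module, which happens to accept all three spellings. Other engines may warn about or reject the unescaped ones, which is the reason for preferring the escaped version.

```python
import re

text = "a-b"

escaped = re.findall(r"[a-z\-]", text)   # dash escaped: always unambiguous
at_end = re.findall(r"[a-z-]", text)     # unescaped dash as the last character
at_start = re.findall(r"[-a-z]", text)   # unescaped dash as the first character

print(escaped)   # ['a', '-', 'b']
print(at_end)    # ['a', '-', 'b']
print(at_start)  # ['a', '-', 'b']
```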
Introducing Character Class Negation
Now that you understand how character ranges work, let's review the last special character that must be escaped inside a character class. The 'caret' character, sometimes called a 'hat' character, can be used inside a character class to specify that the included characters are *not* to be matched instead.
For example, if we write the character class '[^abc]':
This will match any character *except* for the characters 'a', 'b', or 'c'.
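A quick sketch of this behaviour, assuming Python's `re` module and a made-up sample string:

```python
import re

text = "abcd a1"  # hypothetical sample text

# Everything except 'a', 'b' and 'c' matches, including the space and the digit.
not_abc = re.findall(r"[^abc]", text)

print(not_abc)  # ['d', ' ', '1']
```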
A common use case for this feature is matching any instance of a single character surrounded by double-quotes. In this case we don't actually care whether the quoted character is a letter, a number, a symbol or a space. We just want to make sure that it's a single character surrounded by two double quotes. By using a negated character class, we can specify that we want the middle character to match anything other than a double-quote.
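The quoted-character search could be sketched like this. The pattern is one plausible way to write it, assuming Python's `re` module, and the sample text is invented for the example.

```python
import re

text = 'keys: "a", "7", " ", "ab"'  # hypothetical sample text

# A double quote, then exactly one character that is not a double quote,
# then a closing double quote.
single_quoted = re.findall(r'"[^"]"', text)

print(single_quoted)  # ['"a"', '"7"', '" "']
```

The last item, "ab", is skipped because it has two characters between the quotes.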
Character classes that use the caret character in this way are called 'negated' character classes. This is because they specify the characters we want to subtract from the set of possible matches instead of the characters we want to add to the set of possible matches.
When specifying negated character classes, the caret character must always be the first character inside the character class in order for it to have its special meaning. If we put the '^' symbol anywhere other than the start of the character class, it will lose its special meaning, and revert back to matching the literal caret character.
The 'caret' Symbol Inverts The Entire Character Class
Whenever the caret character is present as the first character in a character class, it will change the meaning of the entire character class, and not just the characters that are adjacent to the caret character.
For example, if you want to understand what this negated character class does:
just think of this simpler regular expression first:
and realize that the caret character at the start of the character class simply inverts what characters the entire class will match.
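This inversion can be checked mechanically. In the sketch below, the class '[a-z0-9_]' is a hypothetical example picked for illustration (it is not necessarily the one from the original figures), and `matched_set` is a helper name invented here. Assuming Python's `re` module, every printable ASCII character lands in exactly one of the two sets.

```python
import re
import string

def matched_set(char_class):
    """Return the set of printable ASCII characters that the class matches."""
    pattern = re.compile(char_class)
    return {ch for ch in string.printable if pattern.fullmatch(ch)}

plain = matched_set(r"[a-z0-9_]")     # the simpler class
negated = matched_set(r"[^a-z0-9_]")  # the same class with a leading caret

# No character is matched by both, and together they cover everything.
print(plain & negated == set())                   # True
print(plain | negated == set(string.printable))   # True
```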
The caret character is probably one of the most confusing symbols in regular expressions. This is because it's the only regular expression character that can have 3 completely different and unrelated meanings depending on where it's positioned.
An unescaped caret character can have one of the following 3 meanings:
- 1) [^abc] A negated character class as we just saw.
- 2) [abc^] A literal '^' character if it's located in a character class anywhere other than at the start of the class.
- 3) ^[abc] (not covered yet) Whenever the caret character appears outside a character class, it means to anchor that position in the regex to the start of the string or line, depending on the environment.
It's important to realize that the first special meaning isn't related at all to the third meaning.
An escaped caret character can be placed anywhere inside or outside a character class to be used without special meaning:
[\^] [^abc\^] \^
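The different meanings of the caret can be put side by side in one short sketch. It assumes Python's `re` module, and the three-character sample string is made up for the example.

```python
import re

text = "a^b"

negated = re.findall(r"[^ab]", text)      # 1) negated class: anything but 'a' or 'b'
literal = re.findall(r"[ab^]", text)      # 2) caret not at the start: a literal '^'
anchored = re.findall(r"^[ab]", text)     # 3) outside the class: start-of-string anchor
escaped_in = re.findall(r"[\^]", text)    # escaped inside a class: a literal '^'
escaped_out = re.findall(r"\^", text)     # escaped outside a class: a literal '^'

print(negated)      # ['^']
print(literal)      # ['a', '^', 'b']
print(anchored)     # ['a']
print(escaped_in)   # ['^']
print(escaped_out)  # ['^']
```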
We have now completed the review of all 5 special characters that need to be escaped inside a character class.
Although there are only 5 symbols on your keyboard that must be escaped inside a character class, this doesn't mean we've reviewed every possible escape character out there. In fact, there are many more escape characters that are supported by most regular expression engines. A detailed review of all escape characters would likely be too tedious for the viewer at this point. The next section will instead skip to providing an introduction to quantifiers and alternation. Once you learn how to use character classes in combination with quantifiers, you will begin to know what it feels like to be a true regular expression master.