Regular Expression Character Escaping
2020-11-20 - By Robert Elder
This article is part of a Series On Regular Expressions.
In the last few sections of this series, we've gained an understanding of character classes, quantifiers and alternation. You'll be pleased to know that you're now very close to being able to read or write almost any regular expression. There's just one remaining major common cause of confusion: character escaping. In this section, we'll try to gain a fairly comprehensive understanding of the most common ways to escape characters in a regular expression. When learning how to correctly escape characters in a regular expression, it helps to divide the escaping rules into two different lists of characters that need to be escaped: One for characters inside a character class, and one for characters outside a character class. To keep this simple, this article will not consider Unicode characters and instead focus only on single-byte ASCII characters.
Character Escaping Inside A Character Class
In the sections Character Classes in Regular Expressions - A Gentle Introduction and Character Ranges & Class Negation in Regular Expressions we reviewed the 5 characters that need to be escaped inside a character class (anywhere inside [...]):
- 1) [ Purpose: Start of character class.
- 2) ] Purpose: End of character class.
- 3) \ Purpose: Escaping.
- 4) - Purpose: Character ranges.
- 5) ^ Purpose: Class Negation.
Character Escaping Outside A Character Class
But what can we say about characters that always need to be escaped outside a character class? For starters, let's consider the 5 characters from the list above, and take note of a few differences:
- 1) [ Character 1) from the list above obviously needs to be escaped outside a character class, otherwise, you'd never be able to define a character class!
- 2) ] Since character 2) from the list above is used to close a character class, some regular expression flavours will let you get away with not escaping a ']' as long as it's outside a character class. Some environments (such as in the Ruby programming language) will require that you escape it.
- 3) \ The '\' character is used to specify escaped characters both inside and outside a character class. Therefore, it must always be escaped when specifying a literal '\' character.
- #) - Character 4) from the list above is used to specify character ranges in a character class, but you can't specify character ranges outside a character class. Therefore, the '-' character does not need to be escaped outside a character class, and we won't add it to our list of characters to escape outside a character class.
- 4) ^ The '^' character (the 5th character in the list above) needs to be escaped outside a character class too, but for a different reason (which will be discussed later).
As you can see, the list of characters that need to be escaped outside a character class includes all the characters that need to be escaped inside a character class (except for the '-' character). So far, we've only listed 4 characters that need to be escaped outside a character class.
But what other characters need to be escaped outside a character class? Unsurprisingly, you also need to escape the characters that are used to represent quantifiers:
- 5) * This character must be escaped since it represents the 'zero or more' quantifier.
- 6) + This character must be escaped since it represents the 'one or more' quantifier.
- 7) ? This character must be escaped since it represents the 'zero or one' quantifier. The '?' character also has another special meaning (that hasn't been discussed yet) when it's used as the first character inside a set of parentheses.
- 8) { This character must be escaped since it represents the start of a range quantifier (a quantifier like {3,4}, {2}, {3,} etc.).
- 9) } This character sometimes (depending on the regex engine) needs to be escaped since it represents the end of a range quantifier.
And also unsurprisingly, you need to escape the '|' character which is used for alternation:
- 10) | This character must be escaped since it separates 'alternatives' in an alternation.
We also saw how parentheses are used to identify 'groups' or 'sub-expressions', so they need to be escaped too:
- 11) ( This character must be escaped since it is used to open a 'group' or 'sub-expression'.
- 12) ) This character must be escaped since it is used to close a 'group' or 'sub-expression'.
In the second part of this series, we discussed how the '^' character is used inside a character class to invert what the character class will match. We also briefly mentioned how it has a completely different meaning when it's found outside of a character class. When the '^' character is found outside of a character class, it's special meaning is to anchor that part of the regex to the 'start of string' or 'start of line' (depending on the regex engine/current flags). Similarly, this leads us to introduce one new character that needs to be escaped outside a character class: The '$' character will anchor its position in the regex to the 'end of string' or 'end of line' (again, depending on the regex engine/current flags).
- 4) ^ This character must be escaped since its special meaning is to anchor the regex to the 'start of string' or 'start of line'.
- 13) $ This character must be escaped since its special meaning is to anchor the regex to the 'end of string' or 'end of line'.
The last special character that needs to be escaped outside a character class is the '.' character. This character is commonly described as representing 'any character other than a newline', although there are some caveats to this. This special character will be discussed in more detail later in this article:
- 14) . This character needs to be escaped outside a character class since it represents 'any character other than a newline'.
The 14 Characters To Escape Outside A Character Class
It is impossible create a perfect and consistent list of characters that need escaping in every regular expression engine because all regular expression flavours have subtle differences. However, if you were to try and come as close as possible to defining one list to rule them all, the following 14 characters would come pretty close. Here are the 14 characters that you'll need to be careful about escaping in a regular expression (outside a character class):
[
]
\
^
*
+
?
{
}
|
(
)
$
.
Hexadecimal Escaping Characters
Most regular expression engines support more than one way to escape many characters. For example, a common way to escape any single-byte character in a regex is to use 'hex escaping'. For example, the hexadecimal equivalent of the character 'a' when encoded in ASCII or UTF-8 is '\x61'. Hexadecimal escaping is not supported by all regular expression implementations (most notably POSIX-based implementations).
Here is an animation that shows how you can use hex-based escaping to describe any single-byte character:
As you can see from the above animation, hexadecimal escaping allows you to specify non-printable characters as well as printable ones too. This also shows how printable characters can be specified in more than one way in a regular expression: In fact, some characters can be specified at least three different ways. For example, a tab character can usually be specified literally like this:
' '
or it can be specified in hexadecimal as '\x09', or it can be specified as '\t'.
Escaped ASCII Characters
There are a few other commonly supported escape sequences that will specify individual ASCII characters. These are: '\r' (carriage return), '\n' (newline), '\v' (vertical tab), and '\f' (form feed):
Can You Escape Any Character?
As we saw in the last section, if you put a backslash in front of characters like 'r', 'n', 'v', or 'f' to 'escape' them, you'll get a very different character that has its own special meaning. Therefore, it's worth asking: What happens if you use a backslash to 'escape' any other character, even if you don't need to? Well, you can try doing exactly that using the animations on this page, and you'll find that the result varies. The animations on this page use Javascript's regular expression engine, so the results you see here will only apply to the regex engine used by the Javascript implementation in your browser. If you try to use an escape sequence like '\h', '\i', or '\j', you'll just end up with the same literal character as if you hadn't escaped it at all. This is because most regular expression engines will just ignore the backslash if it comes before a character that doesn't have any special escaped meaning. This is why it's generally a good idea to escape ']' or '}' characters even if you don't need to: For some regular expression implementations, these characters need to be escaped. For others, they don't need to be escaped, but they'll just be treated as literal characters anyway.
However, the specifications for some regular expression implementations (POSIX for example), state in their documentation that when you escape a character that doesn't need to be escaped, the result will be 'undefined behaviour'. Usually the behaviour that most implementations default to is just to interpret the character literally, but you should keep this in mind because relying on undefined behaviour can get you into trouble.
Digit Characters
We just discussed how '\h', '\i', and '\j' will be interpreted as 'h', 'i', and 'j' in Javascript's regular expression engine, but what about something like '\d'? If you guessed that it specifies a literal 'd' character, you are completely wrong! In fact, '\d' doesn't even represent an individual character, it actually represents multiple characters! The '\d' escape sequence represents 'any single digit character from 0 to 9'. In fact, you could just mentally replace any occurrence of '\d' in your regex with '[0-9] and it would mean the same thing. The one potentially confusing exception to this is that using '\d' is really like using a character class without needing to include the brackets, so you can write things like '[\d]' and it will mean the same thing as '\d' or '[0-9]'. Writing something like '[[0-9]]' (with the nested character class) probably won't do what you expect in most regular expression engines. However, some regular expression engines (like in Java) do allow the nested character class syntax like '[[0-9]]'.
If you want 'any character that is not a digit', you can use the '\D' escape sequence which matches the opposite set of characters:
Word Characters
In addition to the '\d' escape sequence for digit characters, there is also a dedicated escape sequence for 'word characters' using the '\w' escape sequence, and a corresponding '\W' escape sequence to match any characters except word characters:
Space Characters
Similarly, you can use '\s' as a dedicated escape sequence for 'space' characters, and '\S' to match anything that is not a space character:
In the example above, the '\s' escape sequence was shown to include many of the expected 'space' characters such as normal space, tab, newline and carriage return. However, it also includes several 'less common' space characters such as the 'vertical tab' ('\v'), 'form feed' ('\f') and the character at hex code \x00A0 (the UTF-16 code for a non-breaking space).
To make things even more confusing, let's run a test on the command-line to quickly output all 256 1 byte characters all followed by a newline. We'll pipe them through grep and see which lines are matched to determine what it will match with the '\s' escape sequence:
for i in {0..255}; do echo -e $(printf "\\\\x%X\n" $i); done | grep -aPo "\s" | xxd
And here is the result:
00000000: 090a 0b0a 0c0a 0d0a 200a
As you can see above, it matched the tab, vertical tab, form feed, carriage return, and a normal space, but the 'line' with the newline or '\xA0' character isn't there!
Let's write a small script with Python and try the same thing again, while piping the results in to 'xxd' to see what it does:
import re
for i in range(0,256):
s = chr(i)
matcher = re.compile(r"\s")
if matcher.match(s):
print(s)
First, with python 2:
python2 test.py | xxd
which produces this output:
00000000: 090a 0a0a 0b0a 0c0a 0d0a 200a .......... .
As you can see above, the output shows that it matched a tab, a newline, a vertical tab, a form feed, a carriage return and a normal space.
Now let's try again with python 3:
python3 test.py | xxd
which produces this output:
00000000: 090a 0a0a 0b0a 0c0a 0d0a 1c0a 1d0a 1e0a ................
00000010: 1f0a 200a c285 0ac2 a00a .. .......
As you can see from above, even between different versions of python, the list of characters that '\s' will match is different! The lesson here should not be to try and memorize every single difference listed here. Instead it should be to realize that '\s' can almost always be expected to match the 'standard space' characters such as normal spaces and tabs. However, whenever you start talking about all the less frequently used types of 'spaces' or Unicode characters, then you can anticipate lots of inconsistencies.
The Period Character
The period character is typically described as matching 'any character except for a newline'. Here is a visualization showing that it matches every single-byte character, except for a newline and a carriage return in Javascript:
But just like with the '\s' escape sequence, the period character can have some subtle differences in what it will match between different regex engines. Let's see what it won't match in grep:
for i in {0..255}; do echo -e $(printf "\\\\x%X\n" $i); done | grep -aPo "." | xxd
which outputs the following:
00000000: 000a 010a 020a 030a 040a 050a 060a 070a ................
00000010: 080a 090a 0b0a 0c0a 0d0a 0e0a 0f0a 100a ................
00000020: 110a 120a 130a 140a 150a 160a 170a 180a ................
00000030: 190a 1a0a 1b0a 1c0a 1d0a 1e0a 1f0a 200a .............. .
00000040: 210a 220a 230a 240a 250a 260a 270a 280a !.".#.$.%.&.'.(.
00000050: 290a 2a0a 2b0a 2c0a 2d0a 2e0a 2f0a 300a ).*.+.,.-.../.0.
00000060: 310a 320a 330a 340a 350a 360a 370a 380a 1.2.3.4.5.6.7.8.
00000070: 390a 3a0a 3b0a 3c0a 3d0a 3e0a 3f0a 400a 9.:.;.>.=.<.?.@.
00000080: 410a 420a 430a 440a 450a 460a 470a 480a A.B.C.D.E.F.G.H.
00000090: 490a 4a0a 4b0a 4c0a 4d0a 4e0a 4f0a 500a I.J.K.L.M.N.O.P.
000000a0: 510a 520a 530a 540a 550a 560a 570a 580a Q.R.S.T.U.V.W.X.
000000b0: 590a 5a0a 5b0a 5c0a 5d0a 5e0a 5f0a 600a Y.Z.[.\.].^._.`.
000000c0: 610a 620a 630a 640a 650a 660a 670a 680a a.b.c.d.e.f.g.h.
000000d0: 690a 6a0a 6b0a 6c0a 6d0a 6e0a 6f0a 700a i.j.k.l.m.n.o.p.
000000e0: 710a 720a 730a 740a 750a 760a 770a 780a q.r.s.t.u.v.w.x.
000000f0: 790a 7a0a 7b0a 7c0a 7d0a 7e0a 7f0a y.z.{.|.}.~...
From reviewing the above, it appears that in my version of GNU grep with -P mode, '.' matches any character except a newline or any character above \x80. It does, however, match a carriage return (which wasn't matched by Javascript's '.' character).
Checking with Python shows that it also matches a carriage return character. Python's regular expression engine also supports flags that change the behaviour of the '.' character. The lesson here is that every regular expression engine implements things a bit differently, so you need to consult the documentation (or better yet test it) before relying on any particular behaviour.
Backreferences
Backreferences are a topic that we haven't discussed yet, but they need to be mentioned at least briefly here since they look like 'escaped' characters. When you see a pattern like '\1', '\2' or '\3', these are typically backreferences. But to make things even more confusing, they might not always be backreferences! Without diving into too much detail on how backreferences work, whenever part of a regex gets matched inside '(' ... ')' characters, some regular expression engines allow you to refer back to them according to their 'backreference number'. Using the pattern '\1' is a way to say "whatever you matched in the group number 1, please match that again here."
There are a several cases where the pattern '\1' won't be treated as a backreference. The first case is when the number of the backreference is bigger than the number of backreferences that are actually used in the regex (or if you refer to a group before it has been matched in a regex engine that doesn't support this). The second is when the regex engine doesn't support backreferences at all. In these cases, the '\1', or '\9', or whatever number you use will usually be treated as an escaped octal sequence which brings us to the next section.
Octal Escaping
I'm just going to come right out and say it: Nobody uses octal notation anymore. Octal notation has historically been supported as an alternative to hexadecimal notation, but today you don't see it used preferentially very often anymore.
Despite this fact, octal notation is still usually supported in many contexts. In some regular expression engines, simply putting a '\' character before a number like this: '\123' will cause it to be treated as an octal character. Sometimes, you may need to prefix it with an 'o' like this: '\o{123}', or a zero like this: '\0123'.
Other Escape Sequences
At this point, we have discussed almost every commonly supported way of escaping characters in a regular expression. There are, however, many more escape sequences that are either not supported in every regex engine (\R \B, \cM, \p), or in some cases don't actually match characters but perform some a control or position matching operation (\Q, \E, \b).
This article is part of a Series On Regular Expressions.
The Regular Expression Visualizer, Simulator & Cross-Compiler Tool
Published 2020-07-09 |
$20.00 CAD |
How Do Regular Expression Quantifier Work?
Published 2020-08-18 |
Interesting Regular Expression Test Cases
Published 2020-07-09 |
How Regular Expression Alternation Works
Published 2020-08-18 |
Character Ranges & Class Negation in Regular Expressions
Published 2020-05-31 |
Guide To Regular Expressions
Published 2020-07-09 |
Character Classes in Regular Expressions - A Gentle Introduction
Published 2020-05-10 |
Join My Mailing List Privacy Policy |
Why Bother Subscribing?
|