Robert Elder Software Inc.
  • Home
  • Store
  • Blog
  • Contact
  • Home
  • Store
  • Blog
  • Contact
  • #linux
  • |
  • #commandline
  • |
  • #softwareengineering
  • |
  • #embeddedsystems
  • |
  • #compilers
  • ...
  • View All >>

An LL Grammar For Regular Expression Parsing

2020-07-09 - By Robert Elder

     This article is part of Series On Regular Expressions.

     Below is an LL grammar for a fairly complete regular exprssion parser.  This grammar is a documentation of the grammar supported by the Regular Expression Visualizer Tool demonstrated here.  This grammar supports most basic regular expression operations such as character classes, alternation (|), greedy and lazy quantifiers (*, +, {n}, {m,n}, {n,}, {,n}, *?, +?, {n}?, {m,n}?, {n,}?, {,n}?).  Sub-expressions are supported.  Simple escaped ASCII characters such as '\n', '\t' or '\x97' are supported.  Currently, the following features are not supported:  Unicode, backreferences, word mark breaks, capture groups, look-aheads, look-behinds.  Other less-universal regular expression features like nested character classes are not supported.

     LL grammars like this one are an excellent resource for writing recursive descent parsers.  The grammar described below was written in a format similar to Yacc grammars, although it has never been perfected enough to actually run through a parser-generator to generate source code.  Instead this hand-written grammar was used as a guide when creating a hand-written parser.

Update History

  • 2020-12-05: Added grammar rules to support non-capture groups, negative and positive lookbehind/lookaheads, atomic groups. This will support better error messages in the compiler until they are fully supported.

The Grammar

epsilon
	: 'epsilon' just means it's ok to match an empty string when no better option is available.

decimal_digit
	: '0' to '9'

decimal_number_rest
	: decimal_digit decimal_number_rest
	| epsilon

decimal_number
	: decimal_digit decimal_number_rest

hex_digit
	: '0' to '9'
	| 'A' to 'F'
	| 'a' to 'f'

escaped_hex_character
	| 'x' hex_digit hex_digit

escaped_ascii_character
	: any character not in 'x' or escaped_regex_class_character

escaped_regex_class_character
	: 'd'
	| 'D'
	| 's'
	| 'S'
	| 'w'
	| 'W'
	| 'N'  #  Not currently supported: Provide error message in compiler.
	| 'h'  #  Not currently supported: Provide error message in compiler.
	| 'H'  #  Not currently supported: Provide error message in compiler.
	| 'V'  #  Not currently supported: Provide error message in compiler.
	| 'R'  #  Not currently supported: Provide error message in compiler.
	| 'A'  #  Not currently supported: Provide error message in compiler.
	| 'z'  #  Not currently supported: Provide error message in compiler.
	| 'Z'  #  Not currently supported: Provide error message in compiler.
	| 'G'  #  Not currently supported: Provide error message in compiler.
	| 'b'  #  Not currently supported: Provide error message in compiler.
	| 'B'  #  Not currently supported: Provide error message in compiler.

escaped_character
	: '\' escaped_hex_character
	| '\' escaped_ascii_character
	| '\' escaped_regex_class_character

character_range
	: character_range_endpoint '-' character_range_endpoint

#  Any non-escaped character that can appear inside a character class.
#  Can include '^' because this is picked up directly in the grammar rule
#  and subsequent '^' characters are treated as literal. '^' is only
#  special when it's first.  Allow un-escaped '-' since that seems to
#  be a commonly expected use case.
class_ascii_character
	: \x00 to ','
	| '.' to '['
	# Avoid '[' and '\'
	| '^' to \xFF

#  Any non-escaped character that can appear outside a character class.
non_class_ascii_character
	#  Any character from 0x00 to 0xFF, except
	#  Avoid '^' '$'
	#  Avoid '(' ')' '*' '+'
	#  Avoid '.'
	#  Avoid '?'
	#  Avoid '{' '}'
	#  Avoid '[' '\'
        #  Don't avoid ']' since a lot of regex parsers allow it unescaped
	#  Avoid '|'

non_class_special_character
	: '.'

anchor
	: '^'
	| '$'

non_class_character
	: non_class_ascii_character
	| non_class_special_character
	| escaped_character

character_range_endpoint
	: class_ascii_character
	| escaped_character

character_or_range
	: character_range
	| character_range_endpoint


character_class_rest
	: character_or_range character_class_rest
	| ']'

character_class
	: '[' '^' character_or_range character_class_rest
	| '[' character_or_range character_class_rest

explicit_quantifier
	: '{' decimal_number '}'
	| '{' decimal_number ',' '}'
	| '{' ',' decimal_number  '}'
	| '{' decimal_number ',' decimal_number  '}'

implicit_quantifier
	: '+'
	| '?'
	| '*'

quantifier_operator
	: implicit_quantifier '?'
	| implicit_quantifier
	| explicit_quantifier '?'
	| explicit_quantifier

character_or_class
	: character_class
	| non_class_character

sub_expression
	: '(' regular_expression ')'
	#  Technically supported by doing nothing since back-references are not implemented yet.
	| '(' '?' ':' regular_expression ')'
	#  None of these are currently supported, but they should be parsed to provide good error message:
	| '(' '?' '!' regular_expression ')'
	| '(' '?' '=' regular_expression ')'
	| '(' '?' '<' '!' regular_expression ')'
	| '(' '?' '<' '=' regular_expression ')'
	| '(' '?' '>' regular_expression ')'

quantified_expression
	: character_or_class quantifier_operator
	| character_or_class
	| sub_expression quantifier_operator
	| sub_expression
	| anchor

concatenation_expression_rest
	: quantified_expression concatenation_expression_rest
	| epsilon

concatenation_expression
	: quantified_expression concatenation_expression_rest

alternation_expression_rest
	: '|' concatenation_expression alternation_expression_rest
	| '|' alternation_expression_rest
	: concatenation_expression alternation_expression_rest
	| epsilon

alternation_expression
	: '|' alternation_expression_rest
	| concatenation_expression alternation_expression_rest

regular_expression
	: alternation_expression regular_expression
	| epsilon
The Regular Expression Visualizer, Simulator & Cross-Compiler Tool
The Regular Expression Visualizer, Simulator & Cross-Compiler Tool
Published 2020-07-09
Regular Expression Laptop Stickers
$20.00 CAD
Regular Expression Laptop Stickers
Interesting Regular Expression Test Cases
Interesting Regular Expression Test Cases
Published 2020-07-09
Regular Expression Character Escaping
Regular Expression Character Escaping
Published 2020-11-20
How Do Regular Expression Quantifier Work?
How Do Regular Expression Quantifier Work?
Published 2020-08-18
How Regular Expression Alternation Works
How Regular Expression Alternation Works
Published 2020-08-18
Character Ranges & Class Negation in Regular Expressions
Character Ranges & Class Negation in Regular Expressions
Published 2020-05-31
Guide To Regular Expressions
Guide To Regular Expressions
Published 2020-07-09
Join My Mailing List
Privacy Policy
Why Bother Subscribing?
  • Free Software/Engineering Content. I publish all of my educational content publicly for free so everybody can make use of it.  Why bother signing up for a paid 'course', when you can just sign up for this email list?
  • Read about cool new products that I'm building. How do I make money? Glad you asked!  You'll get some emails with examples of things that I sell.  You might even get some business ideas of your own :)
  • People actually like this email list. I know that sounds crazy, because who actually subscribes to email lists these days, right?  Well, some do, and if you end up not liking it, I give you permission to unsubscribe and mark it as spam.
© 2025 Robert Elder Software Inc.
SocialSocialSocialSocialSocialSocialSocial
Privacy Policy      Store Policies      Terms of Use