An Artisan Guide to Building Broken C Parsers

2017-03-30 - By Robert Elder

Introduction

Understanding how to do things correctly is a necessary condition for being good at something, but true mastery demands that you also know all the wrong ways of doing something. In this article, I will present many examples of incorrect assumptions you can use to waste countless hours of your life building broken implementations of C parsers. I will pay special attention to the challenging aspects of the C language itself, focusing on the type system and scoping rules as opposed to any particular compiler theory. Finally, for the cases presented I will review the most correct algorithms that I am currently aware of, although I'll probably still find out that they are subtly wrong at some distant point in the future.

The cases examined here are interesting because a sufficiently advanced compiler will need to consider them at parse time. A simple compiler may be able to ignore many of them at parse time and push these decisions to later stages of semantic analysis. One of the core problems comes from the fact that parsing typedefed types is a context sensitive operation that requires introspection into the current state of the type system. Some work can be pushed to later stages of the compiler, but you have to be careful because it's easy to not notice one corner case that totally invalidates your original approach. For many situations, there is no precise answer on what work fundamentally needs to be done at parse time. This will depend on what kinds of error messages you want to produce, and how many compiler extensions your compiler must be able to handle.

Incorrect Assumption 1:

"All of the information needed to parse a C program is contained in its context-free grammar."

The following piece of code will vary in meaning (and in its corresponding parse tree) in a way that depends on the meaning of the identifier 'a' in the context where these symbols are parsed:

a * b

For example, if 'a' is an integer variable:

int main(void){
	int a = 2;
	int b = 3;
	return a * b;
}

then 'a * b' will be parsed as a multiplication expression, but if 'a' is a typedefed type:

typedef struct foo {int i;} a;

int main(void){
	a * b;
	b->i = 0;
	return 0;
}

then 'a * b' will be parsed as a declaration of a pointer to the variable 'b'. In this case the type of 'b' resolves to 'struct foo *'.

Here is another example that illustrates a case where removing a line of code will cause a later line of code to become a parser error:

typedef int a;

int main(void){
        int a;  /*  If you remove this line... */
        a++;    /*  this line will cause a parser error. */
        return 0;
}

Note that although removing the line indicated above is obviously an error, the point is that it causes a parser error, and not a semantic error which could otherwise be pushed to later stages of the compilation process.

Therefore, in order to correctly parse the C language, it is necessary for the parser to have some knowledge of the typedef and variable declaration system so that it can make context-sensitive decisions about what type of parse tree to build.

Incorrect Assumption 2:

"Scope levels are not important when considering typedef statements."

Scope levels are very important when considering typedef statements. For example the following code

typedef int node;
static node one = 1;

int main(void){
        typedef struct foo {int i;} node;
        node n;
        n.i = one;
        return n.i;
}

declares two different typedefed types called 'node' in two different scope levels. The above program compiles with no warnings or errors according to [1]

Incorrect Assumption 3:

"Scope levels are not important when declaring struct, union or enum types."

struct foo { int i; };

int main(void){
	struct foo { int j; } s;
	s.j = 1;
	(void)s;
	return 0;
}

Scope levels are absolutely important when declaring struct, union or enum types as demonstrated in the above program. The above program compiles with no warnings or errors according to [1].

Incorrect Assumption 4:

"A struct definition can only belong to one of two scopes: File scope, and function scope."

struct foo { int i; };

int main(void){
	struct foo { int j; };
	{
		struct foo { int k; };
		{
			struct foo { int l; };
			{
				struct foo { int m; } s;
				s.m = 1;
				(void)s;
			}
		}
	}
	return 0;
}

For most compilers, there is no practical limit[2] to the number of different scope levels that a struct, union or enum type can be defined in as demonstrated in the above program. The above program compiles with no warnings or errors according to [1].

Incorrect Assumption 5:

"The bookkeeping required for managing typedef definitions does not depend on understanding struct, union, or enum declarations."

In C89[3], typedef re-definition is not allowed, however it is allowed in C11[4]. This should bring up the question: "How ambitious are you, and how much syntax do you want your C compiler to support?".

Here is a trivial example showing how re-defining a typedef may or may not be allowed, depending on the types involved:

typedef int type1;
typedef int type1;    /*  Re-definition permitted if second definition matches. */
typedef type1 type2;  /*  'type2' eventually resolves to 'int' */
typedef type2 type1;  /*  So this re-definition is consistient and allowed. */
typedef double type1; /*  re-definition error since it changes the type of 'type1'. */

int main(void){
        return 0;
}

Doing this correctly in all cases will require that you take the scope of structs into account:

struct foo {int i;};
typedef struct foo type1;

int main(void){
	struct foo {int j;};
	typedef struct foo type2;
	typedef struct foo type3;
	typedef type3 type2; /* No error. */
	typedef type1 type2; /* Re-definition error. */
        return 0;
}

In the above example, the re-definition error occurs because the two structs are identical in name, but from different scopes. This means you can't simply compare the 'struct foo' part to calculate similarity, you also need information about what scope they come from.

At a bare minimum to support this case, you could simply track the struct tag name, and the scope it resides in, without actually understanding the contents of the struct at parse time. Then, for situations that have a re-definition error, you could simply keep track of the active typedef identifier names, but ignore their definitions in the parser, thereby avoiding many of the complexities of dealing with struct definitions at parse time. Another case will be presented in this article which demonstrates that such an approach will not work in all cases.

Incorrect Assumption 6:

"It is never necessary to traverse a struct or union definition at parse time."

The 'typeof' keyword is an extremely common, non-standard extension to C that allows you to identify the type of an expression and make use of this in places where you would otherwise identify a type. The use of typeof can be used to produce parse-time dependencies on being able to traverse and understand struct definitions. Consider the following program:

struct f1 {int i;};
struct f2 {int j;};

struct boo {
        struct f1 a;
        /* If you change 'f2' to 'f1'... */
        struct f2 b;
} b;

typeof(b.a) type1(void);

int main(void) {
        typeof(b.b) type1(void);
        typedef int type1;
        type1 t = 1; /* clang will emit a parser error on this line. */
        return t;
}

If we compile the program above in clang, it will unsurprisingly emit a semantic error complaining that the function prototype for 'type1' has been re-defined in a way that changes the type:

main.c:13:21: error: conflicting types for 'type1'
        typeof(b.b) type1(void);
                    ^
main.c:10:13: note: previous declaration is here
typeof(b.a) type1(void);
            ^
1 error generated.

If we leave everything the same and change the type of one of the struct members:

struct f1 {int i;};
struct f2 {int j;};

struct boo {
        struct f1 a;
        /* If you change 'f2' to 'f1'... */
        struct f1 b;
} b;

typeof(b.a) type1(void);

int main(void) {
        typeof(b.b) type1(void);
        typedef int type1;
        type1 t = 1; /* clang will emit a parser error on this line. */
        return t;
}

the new result will include semantic, as well as parse errors:

main.c:14:21: error: redefinition of 'type1' as different kind of symbol
        typedef int type1;
                    ^
main.c:13:21: note: previous definition is here
        typeof(b.b) type1(void);
                    ^
main.c:15:14: error: expected ';' after expression
        type1 t = 1; /* clang will emit a parser error on this line. */
             ^
             ;
main.c:15:15: error: use of undeclared identifier 't'
        type1 t = 1; /* clang will emit a parser error on this line. */
              ^
main.c:16:16: error: use of undeclared identifier 't'
        return t;
               ^
main.c:15:9: warning: expression result unused [-Wunused-value]
        type1 t = 1; /* clang will emit a parser error on this line. */
        ^~~~~
1 warning and 4 errors generated.

The reason for the syntax error appearing in the second version only is because when the type of the member changes from 'f2' to 'f1', the function re-declaration of 'type1' that takes place in main is allowed to succeed. Once this re-declaration succeeds it also creates an entry in the local identifier namespace for the 'type1' as a non-typedefed identifier. Since 'type1' already has an entry in this namespace, the typedef that attempts to re-use 'type1' on the next line fails. Previously, it was the function re-declaration that failed so the typedef rule was allowed to succeed. Therefore, the possible parse trees for the line 'type1 t = 1' in this function can change conditionally through the use of the 'typeof' operator and the resulting errors that are influenced by the type of a struct member.

Now at this point, you might think be thinking "Well these are both error cases, and it uses non-standard features of the C language, so it doesn't count!". In the very least, if you want to emulate clang's behaviour in all cases, you'll have to consider what happens in this situation. This is also where I should remind you that there are no absolute correct answers on what your parser should be able to handle. It's all about how ambitious you are, and how well you want to handle error cases or non-standard language features.

Admittedly, this case took me way too many hours to come up with. I was almost tempted to believe for a while that I would never be able to demonstrate a parse-time dependency on struct definitions. Removing such a dependency would greatly simplify what type of work is absolutely necessary in the parser, so it is tempting to think about. Given how much time it took me to come up with this case, I would not be surprised if there are many more simpler (and more practical) cases that demonstrate this same parse-time dependency on struct, union, or enum definitions.

Incorrect Assumption 6:

"You cannot define a struct inside another struct definition."

struct foo {
	struct boo {int i;} i;
};


int main(void){
	struct foo f;
	f.i.i = 1;
	return f.i.i;
}

Of course you can. The above program compiles with no warnings or errors according to [1].

Incorrect Assumption 7:

"You cannot define a struct inside a return type."

struct foo boo(void);

struct foo {int i;} boo(void){
	struct foo f;
	f.i = 0;
	return f;
}


int main(void){
	return boo().i;
}

Of course you can. The above program compiles with no warnings or errors according to [1].

Incorrect Assumption 8:

"Whenever you see the '{' or '}' characters that means a 'scope level', so a nested struct definition is only visible within the struct definition it was written in."

struct foo1{
	struct foo2{
		struct foo3{
			struct foo4{
				int i;
			} c;
		} b;
	} a;
};

static struct foo1 f1;
static struct foo2 f2;
static struct foo3 f3;
static struct foo4 f4;

int main(void){
	(void)f1; (void)f2; (void)f3; (void)f4;
	return 0;
}

'{' and '}' characters do not create a 'scope level' when defining a struct. Therefore, any named nested struct definition creates an entry in the same tag namespace of the outermost struct that the inner struct was declared in. The above program illustrates this fact and compiles with no warnings or errors according to [1].

Incorrect Assumption 9:

"An incomplete struct type can be completed in a deeper scope level."

The following piece of code is well-understood to declare an 'incomplete' struct type 'foo', and then later 'complete' the struct definition:

struct foo;

struct foo{
	int i;
};

This might lead one to believe that you could also complete the struct type in a later deeper scope:

int main(void){
	struct foo;
	{
		struct foo{
			int i;
		};
	}
	struct foo abc; /*  Error:  Incomplete struct type. */
	(void)abc;
	return 0;
}

The above code yields an error in gcc and clang because the 'struct foo' from a deeper scope represents a separate struct type leaving the first one incomplete and therefore unsuitable for declaring non-pointer variable declarations.

Incorrect Assumption 10:

"The addition of a struct at file scope cannot silently change the behaviour of pre-existing code that does not require the additional file-scope struct to compile."

This is false in cases that make use of non file-scope structs and also declare pointers to the struct before it has been defined. For example, compare the behaviour of the following two programs:

#include <stdio.h>
#include <stdlib.h>

int main(void){
        struct foo {
                struct buffer_object * p;
        };
        struct buffer_object {
                int arr[4];
        };
        struct foo f;
        int write_at = 5;
        int buffer_size = sizeof(f.p->arr) / sizeof(int);
        f.p = malloc(sizeof(struct buffer_object));

        if(write_at < buffer_size){
                printf("The buffer is size %d, so it's ok to write to location %d.\n", buffer_size, write_at);
                f.p->arr[write_at] = 0;
        }else{
                printf("Woah there!  The buffer is size %d, can't write to location %d.\n", buffer_size, write_at);
        }

        free(f.p);
        return 0;
}

Which has the following output when run through valgrind:

$  valgrind -q ./a.out
Woah there!  The buffer is size 4, can't write to location 5.

This next example is identical, except there is an additional struct at file scope (which could come from an #include):

#include <stdio.h>
#include <stdlib.h>

struct buffer_object {
	int arr[999];
};

int main(void){
        struct foo {
                struct buffer_object * p;
        };
        struct buffer_object {
                int arr[4];
        };
        struct foo f;
        int write_at = 5;
        int buffer_size = sizeof(f.p->arr) / sizeof(int);
        f.p = malloc(sizeof(struct buffer_object));

        if(write_at < buffer_size){
                printf("The buffer is size %d, so it's ok to write to location %d.\n", buffer_size, write_at);
                f.p->arr[write_at] = 0; /*  Write after buffer occurs. */
        }else{
                printf("Woah there!  The buffer is size %d, can't write to location %d.\n", buffer_size, write_at);
        }

        free(f.p);
        return 0;
}

But the output from the second program will be:

$ valgrind -q ./a.out
The buffer is size 999, so it's ok to write to location 5.
==19306== Invalid write of size 4
==19306==    at 0x400603: main (main.c:22)
==19306==  Address 0x5203054 is 4 bytes after a block of size 16 alloc'd
==19306==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==19306==    by 0x4005D5: main (main.c:18)
==19306==

Both of these programs will compile according to [1].

I found this case interesting enough that I wrote an entire blog post about it: Be Careful When Using Scoped Structures In C.

Incorrect Assumption 11:

"You cannot define a struct in a function parameter list."

Sure you can:

void fun(struct foo {int i;} a){
        fun(a);
}

int main(void){ return 0; }

This isn't something you're likely to see in a real program very often, but it's still in the language so your parser should consider it. This program will compile, although it issues warnings in gcc and clang due to the fact that the function prototype creates a new scope level and the struct definition will not be visible again at file scope:

struct foo;

void fun(struct foo {int i;} a){
        struct foo abc; /*  No Error:
	definition from parameter list is visible. */
        fun(a);
}

struct foo abc; /*  Error:  Incomplete struct type foo. */

int main(void){ return 0; }

An interesting side-effect of struct definition not being visible outside the function, is that the only way to call this function from within the same translation unit is to call it recursively from within the function. I believe you can still obey the C standard by calling this function from outside this translation unit (in another .c file that gets linked together) as long as you declare the prototype in that translation unit to have struct layout compatibility.

Incorrect Assumption 12:

"There are no weird edge cases with nested function parameter lists and struct declarations."

Of course there are:

int foo(struct foo {int i;} a, int (* b)(struct foo {double j;} c)){
        struct foo lol;
}

int main(void){ return 0; }

In this case, gcc and clang do not agree on what to do. gcc issues warnings but compiles, and clang says 'main.c:2:16: error: reference to 'foo' is ambiguous'. Based on my reading of the standard, I would say that gcc is probably doing the most correct thing here and clang may not be considering the tree-like structure of tag namespaces that you can possibly get from nested function prototypes.

Incorrect Assumption 13:

"If a struct has 2 members that are both of type 'struct foo', then those two members have exactly the same type."

Are you crazy? Of course not:

struct foo{
        int i;
};

int main(void){
        struct boo{
                struct foo f;
                struct foo { long double d; } g;
        };
        struct boo b;
        b.f = b.g;
        return 0;
}

Fortunately, gcc and clang both issue helpful error messages:

$ gcc -std=c89 -pedantic -Wall main.c
main.c: In function ‘main’:
main.c:12:6: error: incompatible types when assigning to type ‘struct foo’ from type ‘struct foo’
  b.f = b.g;
$ clang -std=c89 -pedantic -Weverything main.c
main.c:12:6: error: assigning to 'struct foo' from incompatible type 'struct foo'
        b.f = b.g;
            ^ ~~~
1 error generated.

Incorrect Assumption 14:

"An identifier can never change its meaning in the middle of a declarator list."

Here is an example of where it can:

#include <stdio.h>
typedef char foo;
int main(void) {
        unsigned int a = sizeof(foo), foo=sizeof(foo), b=sizeof(foo);
	printf("%u %u %u\n", a, foo, b);
	return 0;
}

Incorrect Assumption 15:

"An identifier can never change its meaning in the middle of a declaration specifier list."

The grammar rule for 'declaration specifiers' in C looks (depending on your flavour of C) something like this:

declaration_specifiers
	: storage_class_specifier declaration_specifiers
	| storage_class_specifier
	| type_specifier declaration_specifiers
	| type_specifier
	| type_qualifier declaration_specifiers
	| type_qualifier
	;

The quickest (non-precise) way to describe what 'declaration specifiers' are is to say that they are the things like 'int', 'long', 'const', 'volatile' etc. that you see in the declaration of a variable just before the identifier or any pointer, array, or function information. For example:

/*  These       */      /* And   */
/*  things      */      /* these */
/*  all         */      /* are   */
/*  declaration */      /* not.  */
/*  specifiers. */      /*       */
int                     a;
const int               b;
long long double        c;
extern                  foo();
static struct foo       f;
float                   abc[33];
typedef int             t1;
t1                      i;  /*  Only valid when preceded by typedef. */

You'll note from the grammar rules mentioned above, that the grammar allows for one or more specifiers combined in any order. This means that the following things are valid with respect to the grammar of C (although they might be semantic errors), but they will not cause an error in parsing:

int i;
const int i;
const const const const const int i;
struct foo struct boo struct woo f;
extern static long register volatile int float f;

Since a typedefed identifier counts as one of the 'declaration specifiers' you can put it in front of a variable when declaring it, and there is a fabulous corner-case where you're declaring a variable of a given type where the name of the variable is the same as the type:

typedef int type1;

int main(void){
	volatile type1 type1;
	type1 = 3;
	return type1;
}

This makes life hard for the parser, because you can include any number of specifiers before the actual identifier of the variable when you're declaring it:

typedef int type1;

int main(void){
	/*  The parser can handle this: */
	volatile volatile volatile int i;
	i = 3;
	/*  So why shouldn't it handle this? */
	type1 type1 type1 type1;
	type1 = 3;
	return type1;
}

In the above example, gcc and clang both issue warnings about the redundant volatile specifiers (semantic, not parser issues), but the key here is that both compilers issue a parse error after the second 'type1':

main.c:8:20: error: expected ';' at end of declaration
        type1 type1 type1 type1;
                   ^
                   ;

This is because after the consumption of the first typedef type identifier, we can no longer treat that same identifier as a specifier, otherwise we would never be able to recognize a declaration that declared a variable with the same identifier as the typedef name itself! Be careful though, because if you made your parser simply stop accepting any specifier, you'd get the following case wrong:

typedef int type1;

int main(void){
	type1 volatile type1;
	type1 = 3;
	return type1;
}

Incorrect Assumption 16:

"An unnamed struct, union, or enum definition that declares no instances is always useless and can be ignored."

Not necessarily:

int main(void){
	int i = sizeof(struct {int i;});
	return i;
}

Incorrect Assumption 17:

"An unnamed struct, or union definition that appears inside another struct or union definition, and also declares no instances is always useless and can be ignored."

No, it's actually a very special case that was formally added to the c11 standard[6] which causes the inner members to appear in the outer struct/union:

struct s1{
	int a;
	union {
		int b;
		int c;
	};
};

struct s2{
	union {
		struct {
			union {
				int i;
				long j;
			};
		};
		int w;
	};
};

int main(void){
	struct s1 x;
	x.a = 1;
	x.b = 2;
	x.c = 3;
	struct s2 y;
	y.i = 1;
	y.j = 2;
	y.w = 3;
	return 0;
}

The above program compiles with no warnings or errors, but only in C11.

Incorrect Assumption 18:

"Referencing a struct of incomplete type is always an error as long as it's not a pointer to struct."

No, for example you can declare a pointer to a function that takes a completed struct type, inside of that same struct definition:

struct foo{
        void (*f)(struct foo);
};

int main(void){
        return 0;
}

How To Handle These Cases Correctly?

There is a lot of say about this topic, so I moved it out to another article: Modelling C Structs And Typedefs At Parse Time.

Closing Thoughts

In a university classroom you can come up with clean theoretical results (using turing machines, finite-automata, LL and LR parsers etc.) about how powerful individual parsing techniques are, but if you go read the source code of real compilers (like clang or gcc), you'll see that they are full of source code that looks like this[5]:

/* FIXME. This will not work for multiple declarators; as in:
  __typeof__(a) b,c,d; */

if((isMicrosoftMissingTypename() &&
	!(isLibstdcxxPointerReturnFalseHack() && TryOCLZeroEventInitialization()) )){
	InitializedSomething = true; /* TODO: is it correct? */
}
/* FIXME: This is horrible. Adding a mechanism to skip an expression would be much cleaner. */

Things like the above are the reason why real programming languages have so many rough edges. The rough edges are not the result of any fundamental mathematical elegance, in fact they are mostly due to an absence of the latter.

If you found this article interesting, you'd probably also enjoy Modelling C Structs And Typedefs At Parse Time and The Magical World of Structs, Typedefs, and Scoping.

Notes

[1] The program compiles with no warnings or errors according to the command

gcc -std=c89 -pedantic -Wall main.c && clang -std=c89 -pedantic -Weverything main.c

with the following version information:

$  gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$  clang --version
clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
Target: x86_64-pc-linux-gnu
Thread model: posix

[2] There are minimum requirements imposed in section 2.2.4.1 (Translation limits) of the C89 standard and a conforming compiler may reject programs that go beyond these limits.

[3] C89 3.1.2.2 Linkages of identifiers: "The following identifiers have no linkage: an identifier declared to be anything other than an object or a function; an identifier declared to be a function parameter; an identifier declared to be an object inside a block without the storage-class specifier extern.". C89 3.5 DECLARATIONS "If an identifier has no linkage, there shall be no more than one declaration of the identifier (in a declarator or type specifier) with the same scope and in the same name space, except for tags as specified in 3.5.2.3".

[4] C11 6.7 Declarations "a typedef name may be redefined to denote the same type as it currently does, provided that type is not a variably modified type;".

[5] These are actual comments and functions obtained from the clang compiler source code. The comments and function calls have been re-arranged and removed from their original context in order to optimize comedic impact. Much appreciation is directed toward the maintainers of the clang project for their contributions to open source software.

[6] Unnamed Structure and Union Fields

How to Get Fired Using Switch Statements & Statement Expressions Published 2016-10-27	$40.00 CAD C Programming Fridge Magnets	The Jim Roskind C and C++ Grammars Published 2018-02-15	7 Scandalous Weird Old Things About The C Preprocessor Published 2015-09-20
Building A C Compiler Type System - Part 1: The Formidable Declarator Published 2016-07-07	Modelling C Structs And Typedefs At Parse Time Published 2017-03-30	Building A C Compiler Type System - Part 2: A Canonical Type Representation Published 2016-07-21	The Magical World of Structs, Typedefs, and Scoping Published 2016-05-09

Why Bother Subscribing?

Free Software/Engineering Content. I publish all of my educational content publicly for free so everybody can make use of it. Why bother signing up for a paid 'course', when you can just sign up for this email list?
Read about cool new products that I'm building. How do I make money? Glad you asked! You'll get some emails with examples of things that I sell. You might even get some business ideas of your own :)
People actually like this email list. I know that sounds crazy, because who actually subscribes to email lists these days, right? Well, some do, and if you end up not liking it, I give you permission to unsubscribe and mark it as spam.