Building A C Compiler Type System - Part 1: The Formidable Declarator

2016-07-07 - By Robert Elder

Updated July 15, 2016: Emphasize that declarators and declarations are completely different things.

Updated July 8, 2016: Corrections related to spiral rule and K & R C style declarators. See comment on HN discussion

Introduction

This will be the first part in a series of posts that are intended to document my thoughts during the process of building a type system for my c compiler. As of today this type system is not 100% complete, but I have accumulated so many thoughts that I feel I should write them down even if it is just for my own reference.

There are not many reference guides on the internet that describe how the C type system works well enough to write a C compiler. One idea would be to look at the source of a major compiler like gcc/clang, but I've chosen not to do that. The small amount of source I've looked at from these compilers suggests that it may not be worth the learning curve, and working in isolation might cause me to produce something novel and useful. The majority of what I've come up with is mostly the result of staring at the EBNF grammar of C for hundreds of hours, and thinking "Hmmm, can you do that?". Then I'll try it out in gcc or clang and I'll find out that I know less than I thought I did. After having done this for at least a few years now, I'm still learning things all the time. That brings me to my next point.

I have not made any consideration for the C++ type system at all in this work. Years ago, I would have said that I 'know' C++. I've taken courses using C++ in university, and written many C++ programs. I was once hired with the title 'C++ Software Developer' (where I wrote Java). Despite this, I would say that I know almost nothing at all about C++.

The Formidable Declarator

I call it the 'formidable' declarator because it has defeated my mind on more than one occasion. The simple declarator that you're used to is found in the following declaration:

int i;

The declarator is just the 'i' part. Here's another example:

int * i;

In this declaration, the declarator is the '* i' part. A declarator always contains at least an identifier, and possibly some other symbols. It is worth noting that a 'declarator' and a 'declaration' are completely different things. A 'declaration' has a rather intuitive meaning, whereas a 'declarator' is better thought of as a piece C jargon that comes from the name of a grammar rule that describes the C language (although there is also a grammar rule for declarations too).

int foo(int, char);

In this function declaration, the declarator is the 'foo(int, char)' part.

int boo(double), *k, arr[3];

In this declaration, there are 3 declarators: 'boo(double)', '*k', and 'arr[3]'.

These are all pretty simple, so what's the big deal? Well, here's one that is a bit more complicated:

int * (*(*(*foo)(char))(double))[3];

The declarator in this declaration is '* (*(*(*foo)(char))(double))[3]'. This is obviously a pointer to a function that takes a 'char', and returns a pointer to a function that takes a 'double', and returns a pointer to an array of 3 pointers to 'int's. Here's an example of how you might use this in a real program:

#include <stdio.h>

int * array_of_3_pointer[3];

int * (*(get_array_of_3_pointer(double d)))[3]{
        return &array_of_3_pointer;
}

int * (*(*(get_fcn_ptr_that_returns_array_of_3_pointer)(char c))(double))[3]{
        return get_array_of_3_pointer;
}

int main(void){
        int i = 42;
        int * (*(*(*foo)(char))(double))[3];
        foo = get_fcn_ptr_that_returns_array_of_3_pointer;
        (*foo('a')(1.1))[0] = &i;  /*  Point to the 'i' variable */
        (*((*foo('a')(1.1))[0]))++;  /*  Increment it indirectly */
        printf("Here is the number %d\n", i); /*  Prints 43 */
}

Abstract Declarators

The declarator has a friendly companion called the 'abstract declarator' that looks and feels very similar to the simple declarator. The only difference is that the abstract declarator does not contain an identifier. Abstract declarators are found in places where we need to describe types abstractly without reference to any specific instance of the variable. Examples include function prototypes, cast expressions, and sizeof arguments:

(void *)0x99;  /*  The abstract declarator is '*' */
sizeof(struct foo [20]);  /*  The abstract declarator is '[20]' */
int foo(int *, double **);  /*  The function declarator is 'foo(int *, double **)'.  It contains 2 abstract declarators: '*' and '**'. */

Are Declarators Unique?

If you decide to build your own data structure to represent declarators, you'll want to ask yourself the question "It is possible to write two different declarators with different token sequences that have exactly the same meaning?". The answer is, of course, yes. Even when ignoring whitespace. As you've seen above declarators are recursive, meaning that they can contain other declarators. The grammar rules allow you to place parentheses around any declarator. Depending on where they are placed, this may or may not change the meaning of the declarator:

/*  There is no difference between these two declarations. */
int * j;
int (* k);

/*  But these two are different. */
int * l[2];   /*  Array of two pointers to integers */
int (* m)[2]; /*  Pointer to an array of two integers */

Here is another example that shows this issue with abstract declarators:

/*  No error: declarations are the same.  */
void take_int_ptr(int (*)); /*  Takes an int pointer param. */
void take_int_ptr(int  * ); /*  Takes an int pointer param. */

void take_int_ptr(int (*i)){
        (void)i;
}

/*  Error:  Conflicting types declared (But only in C).  */
/*  Note:  You'll only get an error in C, because In C++ you're just overloading function declarations.  */
void take_fcn_ptr(void (*)(void)); /*  Takes a pointer to a function that takes 0 args and returns nothing. */
void take_fcn_ptr(void  * (void)); /*  Takes a pointer to a function that takes 0 args and returns a void ptr. */

void take_fcn_ptr(void (*fun)(void)){
        (void)fun;
}

You can even use these unnecessary parentheses to write some pretty wacky declarations:

int *((((((((*j))))))));
int * (*((((((((*(((((*foo)))))(char)))))))))(double))[3];
int (i);
int (arr[4])[3];

Problems With Canonicalizing Declarators

In order to determine if two types are equal, you will need a way of 'canonicalizing' both declarators and abstract declarators. One thought would be to just pass the programmer's representation through a function that would detect and remove superfluous parentheses, but you'd have to be careful since only some of the parentheses are not necessary, and the rest are necessary to prevent the '*' from binding in the wrong direction.

There is also another special case when considering declarators that involve function pointers. This was illustrated (but not explained) in the above example on incompatible function prototypes. This happens because 'function designators' and pointers to them are treated a bit inconsistently when it comes to switching back and forth between pointer and non-pointer types.

void take_fcn_ptr(void (void));    /* First arg is fcn ptr that takes 0 parameters and returns nothing. */
void take_fcn_ptr(void (*)(void)); /* Exact same thing written a different way. */

void take_fcn_ptr(void (*fun)(void)){
        (void)fun;
}

There are a number of sections that describe this in the standard, but I'll cite just one involving calling a function that expects a function pointer: C89 3.5.4.3 Function declarators: "For two function types to be compatible, both shall specify compatible return types. Moreover, the parameter type lists, if both are present, shall agree in the number of parameters and in use of the ellipsis terminator; corresponding parameters shall have compatible types. ... For each parameter declared with function or array type, its type for these comparisons is the one that results from conversion to a pointer type, as in 3.7.1 ..."

Here is a great example of this in action:

int f(());

This example compiles in my version of clang (with warnings), but my version of gcc does not seems to accept it. This example makes a bit more sense if you consider the legacy feature of C where the default specifier of a variable or function is 'int'. The above will effectively become:

int f(int ());

And after you consider the above citation in section 3.5.4.3, we end up with

int f(int (*)());

Note: A commenter pointed out that the grammar rules of C require that a parameter declaration must start with at least one declaration specifier. This prompted me to change the example from what I had before. Before I referenced this example

int f((((((((((((((((((((((((((((((((((()))))))))))))))))))))))))))))))))));

Which appears to be the same thing, but with a lot of extra parentheses.

An ideal canonical representation for a declarator should be able to describe all of the types we've seen above, but also allow for efficient comparison and modifications. When you consider the fact that a function can have a number of other declarators or abstract declarators in its parameter list, it would appear that this canonical representation might need to require a tree structure. In the next sections, I'll show why I believe that a list representation should be adequate.

The Three Fundamental Parts Of A Declarator

Any given declarator can be thought of as a list of zero or more 'declarator parts'. The concept of a 'declarator part' is part of my own taxonomy so don't look for descriptions of it elsewhere. The three fundamental types of declarator parts are:

() /*  Function  */
[] /*  Array */
*  /*  Pointer*/

The natural way to order the list is the same way in which you would operate on the type repeatedly to get close to whatever fundamental value it represents. This is the same order that you would use to describe what a given declarator type is if someone asked you, "What does this type describe?". Finally, this ordering is also very compatible with the order in which the parser builds the parse tree for a declarator, so it's easy to construct at compile time. In the original version of this article, I stated that this ordering is also equivalent to what you get from using the spiral rule, but it turns out that the spiral rule has some shortcomings that appear when you have consecutive '[]', '()', or '*'. See further reading.

Let's do some examples:

int * const arr[3];

/*  Declarator part list will be */
[]  -> With subtype '3'
*   -> With subtype 'const'

int (* arr)[3];

/*  Declarator part list will be */
*    -> With subtype bare
[3]  -> With subtype '3'

For the simplest of declarators, the declarator part list is simply an empty list. This representation normalizes over regular declarators and abstract declarators because the place where the identifier would go is simply at the start of the list if you wanted to include it. It also gives you the same representation regardless of how many superfluous parentheses are in the programmer's representation of the declarator. Here's the less trivial example from above:

int * (*(*(*foo)(char))(double))[3];
/*  Declarator part list will be */
*    -> With subtype bare
()   -> With subtype 'char'
*    -> With subtype bare
()   -> With subtype 'double'
*    -> With subtype bare
[]   -> With subtype '3'
*    -> With subtype bare

If a declarator contains a parameter list with yet more complex declarators we can still use a list, but we'll just have an item in the list that is of function type, and references some other data structure to keep track of that function's parameter list:

int * (*(*(*foo)(char, int * (*(*(*)(char))(double))[3], struct foo *))(double))[3];
/*  Declarator part list will be */
*    -> With subtype bare
()   -> With subtype 'char, int * (*(*(*)(char))(double))[3], struct foo *'
*    -> With subtype bare
()   -> With subtype 'double'
*    -> With subtype bare
[]   -> With subtype '3'
*    -> With subtype bare

Declarators and Typedefs

This representation leads to some very simple and efficient methods of manipulating types. One great example comes from dealing with resolving typedefs to determine what the 'real' type of something is.

typedef int * (*(*(*foo)(char))(double))[3];
foo (*(*b)[4])(float);

/*  (List 1) The Declarator part list for foo is */
*    -> With subtype bare
()   -> With subtype 'char'
*    -> With subtype bare
()   -> With subtype 'double'
*    -> With subtype bare
[]   -> With subtype '3'
*    -> With subtype bare

/*  (List 2) The Declarator part list observed in the declaration of 'b' is */
*    -> With subtype bare
[]   -> With subtype '4'
*    -> With subtype bare
()   -> With subtype 'float'

Now, the full declarator with the typedef of 'foo' resolved is simply the concatenation of list 1 onto the end of list 2:

b has type int * (*(*(*(*(*b)[4])(float))(char))(double))[3];
/*  The full Declarator part list of b */
*    -> With subtype bare
[]   -> With subtype '4'
*    -> With subtype bare
()   -> With subtype 'float'
*    -> With subtype bare
()   -> With subtype 'char'
*    -> With subtype bare
()   -> With subtype 'double'
*    -> With subtype bare
[]   -> With subtype '3'
*    -> With subtype bare

The first time I tried to write a function to resolve typedefs, I wrote an ungodly algorithm to manipulate the types based on the original parse tree. That wasn't fun and I don't recommend it.

Printing Declarators

Another common requirement is to be able to print one of the above declarators in the 'visual order' that you would see it described in a program. This can be complicated by the fact that you may need to add parentheses that were not even present in the original program when using typedefs:

typedef int a[3];
a * foo;

/*  The Declarator part list typedef declaration */
[]   -> With subtype '3'
/*  The Declarator part list foo declaration */
*   -> With subtype bare


/*  The Complete Declarator part list of foo */
*    -> With subtype bare
[]   -> With subtype '3'

The declarator in the typedef is a[3] and the declarator in the declaration of foo is '*foo'. If you want to print out the full type, you can't just string substitute '*foo' into 'a[3]' from the typedef, otherwise you'd get:

int * foo[3];

This is wrong because the '*' will bind to the 'int' instead of toward '[3]', and the type will be 'array of 3 pointers to ints'. What you actually want is this:

int (*foo)[3];

The general algorithm that I came up with for printing declarators in their 'visual order' involves recursing through a declarator part list in reverse order. Here it is in pseudocode:

print_declarator_in_visual_order(current_declarator, prev_was_fcn_or_arr){
        if(current_declarator is empty){
                /*  Done printing */
                return
        }else{
                switch(current_declarator.type){
                        case FUNCTION_DECLARATOR_PART or case ARRAY_DECLARATOR_PART:{
                                print_declarator_in_visual_order(current_declarator.previous_declarator,  true);
                                print_declarator_part(current_declarator);
                                break;
                        }case POINTER_DECLARATOR_PART:{
                                if(prev_was_fcn_or_arr){
                                        printf("(");
                                }
                                print_declarator_part(current_declarator);
                                print_declarator_in_visual_order(current_declarator.previous_declarator, false);
                                if(prev_was_fcn_or_arr){
                                        printf(")");
                                }
                                break;
                        }
                }
        }
}

/*  How to invoke this function */
print_declarator_in_visual_order(last_declarator_part, false);

I have not yet exhaustively unit tested this algorithm, but I've seen it work on enough cases to make me believe it should be correct for any declarator you can create in C. This should always print the declarator with the minimal number of parentheses necessary.

Declarator Dereferencing

Using this representation it becomes extremely easy to find the declarator that results from dereferencing a type (function calling, array indexing, or pointer dereferencing): You just remove the first element from the list and now you have the declarator for the new type. Here are some examples:

int arr[2][3][4];

/*  The declarator part list is  */
[]   -> With subtype '2'
[]   -> With subtype '3'
[]   -> With subtype '4'

/*  Dereference the array: */
arr[0];

/*  The new declarator part list (the array type) is  */
[]   -> With subtype '3'
[]   -> With subtype '4'

int (*p)[4];

/*  The declarator part list is  */
*    -> With subtype bare
[]   -> With subtype '4'

/*  Dereference the pointer: */
*arr;


/*  The new declarator part list (the type of thing being pointed to) is  */
[]   -> With subtype '4'

int (*f(char))(float);

/*  The declarator part list is  */
()   -> With subtype 'char'
*    -> With subtype bare
()   -> With subtype 'float'

/*  Call the function: */
f();

/*  The new declarator part list (the return type) is  */
*    -> With subtype bare
()   -> With subtype 'float'

Closure On Canonicalizing Declarators

The model presented above, where declarators are described as a list of 'declarator parts' appears to work quite well for a number of practical reasons. There were 2 problems related to being able to compare declarators written by programmers:

The first problem was related to surerpfluous brackets. The model presented here does not require any use of brackets at all, so this representation will never include a meaningless difference related to superfluous ones. The important information where brackets actually are necessary is not lost because it is described by ordering the elements in the list differently. The difference is identical to the difference you would see when describing the type using the spiral rule.

The second problem was related to the automatic type changes that are required when a function designator or array is included in a function prototype. Since the 'declarator part list' representation can represent function and pointer types agnostically, this second problem can be solved by simply starting with the function designator representation, and then doing a conversion afterward just after the function prototype has been encountered.

Conclusion

In this first part of my series on building a C type system, I presented a model for how to represent declarators in a compiler or any other context that needs to manipulate types from the C programming language. There are still a number of unaddressed topics, like how to consider specifiers, or what to do about declarators that are part of parameter lists, but I will leave these for a future post.

Keep reading Part 2: A Canonical Type Representation.

How to Get Fired Using Switch Statements & Statement Expressions Published 2016-10-27	$40.00 CAD C Programming Fridge Magnets	The Jim Roskind C and C++ Grammars Published 2018-02-15	7 Scandalous Weird Old Things About The C Preprocessor Published 2015-09-20
Modelling C Structs And Typedefs At Parse Time Published 2017-03-30	Building A C Compiler Type System - Part 2: A Canonical Type Representation Published 2016-07-21	The Magical World of Structs, Typedefs, and Scoping Published 2016-05-09	An Artisan Guide to Building Broken C Parsers Published 2017-03-30

Why Bother Subscribing?

Free Software/Engineering Content. I publish all of my educational content publicly for free so everybody can make use of it. Why bother signing up for a paid 'course', when you can just sign up for this email list?
Read about cool new products that I'm building. How do I make money? Glad you asked! You'll get some emails with examples of things that I sell. You might even get some business ideas of your own :)
People actually like this email list. I know that sounds crazy, because who actually subscribes to email lists these days, right? Well, some do, and if you end up not liking it, I give you permission to unsubscribe and mark it as spam.