Building A C Compiler Type System - Part 2: A Canonical Type Representation

2016-07-21 - By Robert Elder

Updated December 20, 2016: Corrected claims about calling function that defines a struct in its parameter list.

Introduction

In part 1, I introduced the concept of a declarator and explained why it can be confusing to a programmer. A possible canonical representation for declarators was discussed. In this section will make use of this representation to describe a method of canonicalizing all non-tag types that can be represented in the C programming language. This representation will be general enough that it will also be able to represent various nonsensical declarations that programmers are likely to create (which necessitate good error messages), and it will even be able to represent a few types that do not exist in the C programming language.

During our exploration, we'll also see a secret C type that's only possible through the magic of typedefs! We'll also see an example of a function that can be defined but cannot be called.

One Representation To Rule Them All

Let's jump straight to the punch-line before we review the details later in this article. Here is a general model for describing the 'type' of any non-tag type in the C programming language:

Specifier List

Declarator Part List

Bitfield

A Simple Example

int * i;

An Array Of Structs

const volatile struct abc {
	int i;
} foo[4][5];

Note that this declaration declares the array 'foo', but also defines the struct 'abc'. The structure definition itself would be described using another data structure. For the purposes of this discussion, it makes more sense to consider this as the equivalent declaration/definitions written a different way:

struct abc {
	int i;
};
const volatile struct abc foo[4][5];

const
volatile
struct abc

A Bitfield Member Of A Struct

long b:20;

long

:20

Multi-Declarator Declaration

extern volatile signed long long int g(int), * const * k, (*f(double))[];

This declaration actually declares 3 different items, and can be re-written a different way:

extern volatile signed long long g(int);
extern volatile signed long long * const * k;
extern volatile signed long long (*f(double))[];

extern
volatile
long
long

(int)

extern
volatile
long
long

*
* const

extern
volatile
long
long

(double)
*
[]

K & R C Style Function Declaration

int foo(a,b,c)
        int a;
        int b;
        int c;
{
        return a;
}

For K & R C style function declarations, the types of parameters (if specified) are part of the function definition and can be stored separately. When considering the function declaration, we only need to consider:

int foo(a,b,c)

(a,b,c)

Typedef

Typedef can be treated as just another regular old storage class specifier. Of course the 'typedef' specifier is just a syntatic convenience that would need to be removed when performing a comparison on types.

typedef int *a, foo(double);

typedef
int

typedef
int

(double)

Typedefed Type

A typedefed type acts just like a custom specifier that you can define. Of course, its expansion can also affect the declarator:

typedef int (*foo)(double);
static foo (*boo)(char, int);

The type definition for 'foo' can be represented as

typedef
int

*
(double)

The declaration of 'boo' can be represented as

static
foo

*
(char, int)

And the 'real' type of the variable 'boo' after removing the typedef and typename specifier is

static
int

*
(char, int)
*
(double)

A Secret Hidden Type

In writing this article, I realized that there are also a number of corner cases when considering 'const' or 'volatile' qualified typedefed types. In these cases, the 'const' qualifier can become absorbed into the declarator (for example when const qualifying a typedefed pointer). I also discovered that you can use typedefs to declare a secret type that can't even be declared without typedefs!

typedef const int * foo;
const foo a;

The 'full' type of a is

const
int

* const

This is just like declaring

const int * const a;

Similarly

typedef const int foo(void);
const foo a;

The 'full' type of a is

const
int

(void) const

which is impossible to declare on it's own! In consulting ISO C89 section 3.5.3 Type qualifiers: "If the specification of an array type includes any type qualifiers, the element type is so-qualified, not the array type. If the specification of a function type includes any type qualifiers, the behavior is undefined. ... Both of these can only occur through the use of typedef s.".

I believe if you were able to declare this type directly (without using typedef), it would look like this:

const int a(void) const;

but the grammar rules of C don't support this!

It's not hard to understand why you can't declare this: Functions in C are always 'constant' as far as the compiler is concerned, it doesn't make sense to even try to change one (at least in this language).

Specifiers Can Have Any Order

Remember that declaration specifiers can have any order:

struct foo {
	int i;
};

volatile struct foo const extern b;

volatile
struct foo
const
extern

Array Of Pointers

int * arr[8];

Pointer To An Array

int (* arr)[8];

Extremely Const But Also Volatile

There are also various nonsense declarations that don't accomplish anything, but nonetheless are accepted by compilers (but generate warnings).

const const const volatile;

const
const
const
volatile

Useless

'warning: useless type name in empty declaration'.

int;

Default Specifier

The case where all 3 sections are empty is this trivial example. Supporting this case will help your compiler be backward compatible with the B programming_language.

i;

This will produce compiler warnings, but will declare a variable 'i' of type 'int' by default.

Unnamed Struct With No Instances

'warning: unnamed struct/union that defines no instances'.

struct {int i;};

struct <unnamed struct 1234>

Function Returning Function

In C a function can't return a function, but if it could, this is how you would declare it:

void foo(int)(int);

void

(int)
(int)

Array of Functions

If you could return functions, why not create an array of them (compile error):

int foo[3](int);

[3]
(int)

Returning Array of Functions

Or why not create a function that returns an array of functions (compile error):

void foo(int)[3](double);

void

(int)
[3]
(double)

Impossible Function Pointer Bitfield

Attempting to declare a function pointer bitfield: "error: bit-field 'f' has non-integral type 'int (*)(int)'";

int (*f)(int):2;

*
(int)

Enum And Struct Together

Attempting to a variable that is both a struct and an enum (compile error):

enum foo struct boo i;

enum foo
struct foo

Unsigned Signed Double Int Char

unsigned signed double int char i; /*  Does not compile */

unsigned
signed
double
int
char

Applications

As you can see, using the model above you can describe any non-tag type that can be declared in C. Furthermore, this representation allows you to easily generalize over types that you may find in the following places:

Structure Members
Function Parameters
K & R C Function Parameters
Globals And Locals
Multi-declarator Declarations
Function Declarations
Function Definitions
Default specifier type declarations
Various Error Cases
Sizeof calculations
All cast expressions
Abstract function parameters

This representation can also expresses many impossible types, such as global bitfields, or other things described above. I think this is fine, since the current model allows us to keep the data structure extremely simple. My preference is to move the 'artificial' complexity of these impossible types into validation checks, and allowing the compiler to understand impossible things like functions returning functions is good for producing error messages. Using this representation, many otherwise difficult checks become very simple. For example, when validating the type of a bitfield, the declarator part list should always be empty. You should never have two adjacent declarator parts that are of function type (otherwise you have a function returning a function).

There are still a number of required data structures that haven't yet been described in a detailed way: Function parameter lists; struct, union or enum definitions; array constants; etc. These will be described in future sections. It should however be obvious that this data structure can be used recursively in at least some way when describing function parameter lists or struct definitions.

A Somewhat Formal Definition

The examples above should give you a good idea of how to represent any non-tag type in C, but for clarity's sake let's be a bit more specific. Any non-tag type that can be declared in the C programming language can be expressed as a combination of the following three things:

1) An ordered list of type specifiers, type qualifiers, or storage class specifiers.
2) An ordered list of 'declarator parts'.
3) A single bitfield constant.

All three items can be empty. It should be noted that as stated above, this representation is not one-to-one with what a programmer would consider to be distinct types in C. For example, 'signed int i', 'int i', and 'int signed i' all have the same meaning to the programmer, so a further step would be necessary before they could be directly compared for 'sameness'. Modifying the specifier list is necessary not only for trivial re-orderings, but it is also required for situations where certain operations ignore, remove or add specifiers like const, volatile, extern, etc.

In the previous section, no formal definition was provided for what a 'declarator part' is other than that every declarator part 'type' is either an array, function or pointer. Extra meta-data will be necessary to track things like array size, array type (flexible or not), function parameters or qualifiers on pointers such as const or volatile. No description for how to accomplish this has been provided yet.

Notes On Specifiers

In this section, I have made a number of references to 'specifiers'. The grammar rules for C use the term 'specifier' in a somewhat inconsistent way, but whenever I say 'specifiers', you can take this to mean any and all of type specifiers ('int', 'enum abc', ...), storage class specifiers ('extern', 'static', ...), and type qualifiers ('const', 'volatile', ...).

The grammar rules of C, allow us to define a struct (or enum or union) anywhere that you can reference a struct (or union or enum). You can even define a struct in a function parameter (although you'd then never be able to call the function externally within the same translation unit (you could, however, define the function to call itself recursively)):

void foo(struct abc{int i;} p){
	/*  This code will compile in gcc and clang. */
}

It is assumed that whenever you encounter a reference to a struct (or union or enum) that also defines it, that the definition body will not be included as part of the specifier when using the specifier list representation discussed here. Every struct, union or enum specifier will need a reference to the definition it belongs to. Structs, unions and enums obey scoping rules, so the following example:

struct abc{int i;} p;

void foo(struct abc{int i;} p){

}

int main(void){
	struct abc{int i;} p;
}

contains 3 different incompatible definitions of something called 'struct abc'. None of them can be assigned to each other. Therefore, struct, unions and enum specifiers should be internally represented in a way that indicates which scope level their definition belongs to. The above example would contain:

struct abc (file scope)

struct abc (function foo parameter list scope)

struct abc (main scope)

Another form of specifier is a 'type name' declared using the typedef keyword. These are also scoped and will need some description of the namespace they belong to. All other types of specifier, qualifiers etc. are simple C keywords which should not be difficult to represent internally.

In the previous part of this series, I introduced the idea of a 'declarator part', and discussed how a natural way to describe declarators uses a list of these parts. For a list of 'declarator parts' ordering definitely matters. In this section, I've also used a list to describe the specifiers. I believe I can say that the ordering of specifiers in a type never matters in C (I have been unable to find any counter-examples). If this assumption about the ordering of specifiers turns out to be true (in a very robust way) it might make sense to use a set or hashtable (with a counter for duplicate specifiers) to store specifiers instead. For now I'll stick with a list as the original ordering of the specifiers will at least be of interest for printing error messages.

Conclusion

In this section, we saw that every non-tag type in the C programming language can be described as a combination of a specifier list; a declarator part list; and a possible bitfield constant. This representation also allows for the type system to be generalized in meaningful ways that describe types that don't actually exist in the language, but would require specific and meaningful error messages. In future sections we will explore more aspects of representing types, such as data structures for tag types and declarator parts.

How to Get Fired Using Switch Statements & Statement Expressions Published 2016-10-27	$40.00 CAD C Programming Fridge Magnets	The Jim Roskind C and C++ Grammars Published 2018-02-15	7 Scandalous Weird Old Things About The C Preprocessor Published 2015-09-20
Building A C Compiler Type System - Part 1: The Formidable Declarator Published 2016-07-07	Modelling C Structs And Typedefs At Parse Time Published 2017-03-30	The Magical World of Structs, Typedefs, and Scoping Published 2016-05-09	An Artisan Guide to Building Broken C Parsers Published 2017-03-30

Why Bother Subscribing?

Free Software/Engineering Content. I publish all of my educational content publicly for free so everybody can make use of it. Why bother signing up for a paid 'course', when you can just sign up for this email list?
Read about cool new products that I'm building. How do I make money? Glad you asked! You'll get some emails with examples of things that I sell. You might even get some business ideas of your own :)
People actually like this email list. I know that sounds crazy, because who actually subscribes to email lists these days, right? Well, some do, and if you end up not liking it, I give you permission to unsubscribe and mark it as spam.