Building A C Compiler Type System - Part 2: A Canonical Type Representation
2016-07-21 - By Robert Elder
Updated December 20, 2016: Corrected claims about calling function that defines a struct in its parameter list.
Introduction
In part 1, I introduced the concept of a declarator and explained why it can be confusing to a programmer. A possible canonical representation for declarators was discussed. In this section will make use of this representation to describe a method of canonicalizing all non-tag types that can be represented in the C programming language. This representation will be general enough that it will also be able to represent various nonsensical declarations that programmers are likely to create (which necessitate good error messages), and it will even be able to represent a few types that do not exist in the C programming language.
During our exploration, we'll also see a secret C type that's only possible through the magic of typedefs! We'll also see an example of a function that can be defined but cannot be called.
One Representation To Rule Them All
Let's jump straight to the punch-line before we review the details later in this article. Here is a general model for describing the 'type' of any non-tag type in the C programming language:
Specifier List | Declarator Part List | Bitfield |
A Simple Example
int * i;
|
|
An Array Of Structs
const volatile struct abc {
int i;
} foo[4][5];
Note that this declaration declares the array 'foo', but also defines the struct 'abc'. The structure definition itself would be described using another data structure. For the purposes of this discussion, it makes more sense to consider this as the equivalent declaration/definitions written a different way:
struct abc {
int i;
};
const volatile struct abc foo[4][5];
|
|
A Bitfield Member Of A Struct
long b:20;
|
:20 |
Multi-Declarator Declaration
extern volatile signed long long int g(int), * const * k, (*f(double))[];
This declaration actually declares 3 different items, and can be re-written a different way:
extern volatile signed long long g(int);
extern volatile signed long long * const * k;
extern volatile signed long long (*f(double))[];
|
|
|
|
|
|
K & R C Style Function Declaration
int foo(a,b,c)
int a;
int b;
int c;
{
return a;
}
For K & R C style function declarations, the types of parameters (if specified) are part of the function definition and can be stored separately. When considering the function declaration, we only need to consider:
int foo(a,b,c)
|
|
Typedef
Typedef can be treated as just another regular old storage class specifier. Of course the 'typedef' specifier is just a syntatic convenience that would need to be removed when performing a comparison on types.
typedef int *a, foo(double);
|
|
|
|
Typedefed Type
A typedefed type acts just like a custom specifier that you can define. Of course, its expansion can also affect the declarator:
typedef int (*foo)(double);
static foo (*boo)(char, int);
The type definition for 'foo' can be represented as
|
|
The declaration of 'boo' can be represented as
|
|
And the 'real' type of the variable 'boo' after removing the typedef and typename specifier is
|
|
A Secret Hidden Type
In writing this article, I realized that there are also a number of corner cases when considering 'const' or 'volatile' qualified typedefed types. In these cases, the 'const' qualifier can become absorbed into the declarator (for example when const qualifying a typedefed pointer). I also discovered that you can use typedefs to declare a secret type that can't even be declared without typedefs!
typedef const int * foo;
const foo a;
The 'full' type of a is
|
|
This is just like declaring
const int * const a;
Similarly
typedef const int foo(void);
const foo a;
The 'full' type of a is
|
|
which is impossible to declare on it's own! In consulting ISO C89 section 3.5.3 Type qualifiers: "If the specification of an array type includes any type qualifiers, the element type is so-qualified, not the array type. If the specification of a function type includes any type qualifiers, the behavior is undefined. ... Both of these can only occur through the use of typedef s.".
I believe if you were able to declare this type directly (without using typedef), it would look like this:
const int a(void) const;
but the grammar rules of C don't support this!
It's not hard to understand why you can't declare this: Functions in C are always 'constant' as far as the compiler is concerned, it doesn't make sense to even try to change one (at least in this language).
Specifiers Can Have Any Order
Remember that declaration specifiers can have any order:
struct foo {
int i;
};
volatile struct foo const extern b;
|
Array Of Pointers
int * arr[8];
|
|
Pointer To An Array
int (* arr)[8];
|
|
Extremely Const But Also Volatile
There are also various nonsense declarations that don't accomplish anything, but nonetheless are accepted by compilers (but generate warnings).
const const const volatile;
|
Useless
'warning: useless type name in empty declaration'.
int;
|
Default Specifier
The case where all 3 sections are empty is this trivial example. Supporting this case will help your compiler be backward compatible with the B programming_language.
i;
This will produce compiler warnings, but will declare a variable 'i' of type 'int' by default.
Unnamed Struct With No Instances
'warning: unnamed struct/union that defines no instances'.
struct {int i;};
|
Function Returning Function
In C a function can't return a function, but if it could, this is how you would declare it:
void foo(int)(int);
|
|
Array of Functions
If you could return functions, why not create an array of them (compile error):
int foo[3](int);
|
|
Returning Array of Functions
Or why not create a function that returns an array of functions (compile error):
void foo(int)[3](double);
|
|
Impossible Function Pointer Bitfield
Attempting to declare a function pointer bitfield: "error: bit-field 'f' has non-integral type 'int (*)(int)'";
int (*f)(int):2;
|
|
:2 |
Enum And Struct Together
Attempting to a variable that is both a struct and an enum (compile error):
enum foo struct boo i;
|
Unsigned Signed Double Int Char
unsigned signed double int char i; /* Does not compile */
|
Applications
As you can see, using the model above you can describe any non-tag type that can be declared in C. Furthermore, this representation allows you to easily generalize over types that you may find in the following places:
- Structure Members
- Function Parameters
- K & R C Function Parameters
- Globals And Locals
- Multi-declarator Declarations
- Function Declarations
- Function Definitions
- Default specifier type declarations
- Various Error Cases
- Sizeof calculations
- All cast expressions
- Abstract function parameters
This representation can also expresses many impossible types, such as global bitfields, or other things described above. I think this is fine, since the current model allows us to keep the data structure extremely simple. My preference is to move the 'artificial' complexity of these impossible types into validation checks, and allowing the compiler to understand impossible things like functions returning functions is good for producing error messages. Using this representation, many otherwise difficult checks become very simple. For example, when validating the type of a bitfield, the declarator part list should always be empty. You should never have two adjacent declarator parts that are of function type (otherwise you have a function returning a function).
There are still a number of required data structures that haven't yet been described in a detailed way: Function parameter lists; struct, union or enum definitions; array constants; etc. These will be described in future sections. It should however be obvious that this data structure can be used recursively in at least some way when describing function parameter lists or struct definitions.
A Somewhat Formal Definition
The examples above should give you a good idea of how to represent any non-tag type in C, but for clarity's sake let's be a bit more specific. Any non-tag type that can be declared in the C programming language can be expressed as a combination of the following three things:
- 1) An ordered list of type specifiers, type qualifiers, or storage class specifiers.
- 2) An ordered list of 'declarator parts'.
- 3) A single bitfield constant.
All three items can be empty. It should be noted that as stated above, this representation is not one-to-one with what a programmer would consider to be distinct types in C. For example, 'signed int i', 'int i', and 'int signed i' all have the same meaning to the programmer, so a further step would be necessary before they could be directly compared for 'sameness'. Modifying the specifier list is necessary not only for trivial re-orderings, but it is also required for situations where certain operations ignore, remove or add specifiers like const, volatile, extern, etc.
In the previous section, no formal definition was provided for what a 'declarator part' is other than that every declarator part 'type' is either an array, function or pointer. Extra meta-data will be necessary to track things like array size, array type (flexible or not), function parameters or qualifiers on pointers such as const or volatile. No description for how to accomplish this has been provided yet.
Notes On Specifiers
In this section, I have made a number of references to 'specifiers'. The grammar rules for C use the term 'specifier' in a somewhat inconsistent way, but whenever I say 'specifiers', you can take this to mean any and all of type specifiers ('int', 'enum abc', ...), storage class specifiers ('extern', 'static', ...), and type qualifiers ('const', 'volatile', ...).
The grammar rules of C, allow us to define a struct (or enum or union) anywhere that you can reference a struct (or union or enum). You can even define a struct in a function parameter (although you'd then never be able to call the function externally within the same translation unit (you could, however, define the function to call itself recursively)):
void foo(struct abc{int i;} p){
/* This code will compile in gcc and clang. */
}
It is assumed that whenever you encounter a reference to a struct (or union or enum) that also defines it, that the definition body will not be included as part of the specifier when using the specifier list representation discussed here. Every struct, union or enum specifier will need a reference to the definition it belongs to. Structs, unions and enums obey scoping rules, so the following example:
struct abc{int i;} p;
void foo(struct abc{int i;} p){
}
int main(void){
struct abc{int i;} p;
}
contains 3 different incompatible definitions of something called 'struct abc'. None of them can be assigned to each other. Therefore, struct, unions and enum specifiers should be internally represented in a way that indicates which scope level their definition belongs to. The above example would contain:
|
|
|
Another form of specifier is a 'type name' declared using the typedef keyword. These are also scoped and will need some description of the namespace they belong to. All other types of specifier, qualifiers etc. are simple C keywords which should not be difficult to represent internally.
In the previous part of this series, I introduced the idea of a 'declarator part', and discussed how a natural way to describe declarators uses a list of these parts. For a list of 'declarator parts' ordering definitely matters. In this section, I've also used a list to describe the specifiers. I believe I can say that the ordering of specifiers in a type never matters in C (I have been unable to find any counter-examples). If this assumption about the ordering of specifiers turns out to be true (in a very robust way) it might make sense to use a set or hashtable (with a counter for duplicate specifiers) to store specifiers instead. For now I'll stick with a list as the original ordering of the specifiers will at least be of interest for printing error messages.
Conclusion
In this section, we saw that every non-tag type in the C programming language can be described as a combination of a specifier list; a declarator part list; and a possible bitfield constant. This representation also allows for the type system to be generalized in meaningful ways that describe types that don't actually exist in the language, but would require specific and meaningful error messages. In future sections we will explore more aspects of representing types, such as data structures for tag types and declarator parts.
How to Get Fired Using Switch Statements & Statement Expressions
Published 2016-10-27 |
$40.00 CAD |
The Jim Roskind C and C++ Grammars
Published 2018-02-15 |
7 Scandalous Weird Old Things About The C Preprocessor
Published 2015-09-20 |
Building A C Compiler Type System - Part 1: The Formidable Declarator
Published 2016-07-07 |
Modelling C Structs And Typedefs At Parse Time
Published 2017-03-30 |
The Magical World of Structs, Typedefs, and Scoping
Published 2016-05-09 |
An Artisan Guide to Building Broken C Parsers
Published 2017-03-30 |
Join My Mailing List Privacy Policy |
Why Bother Subscribing?
|