CPPGM Programming Assignment 5 (preproc)
Write a C++ application called
preproc that takes as input a set of C++ Source Files, executes translation phases 1 through 6 and the tokenization part of phase 7, and describes the resulting sequence of tokens to an output file. (Notice that this means
preproc will be a complete standard-compliant C++ Preprocessor and Lexer.)
preproc will be invoked as follows:
$ preproc -o <outfile> <srcfile1> <srcfile2> ... <srcfileN>
N >= 1
preproc shall read each
<srcfilei> file, execute phases 1 through 6 and the tokenization part of phase 7, then describe the resulting sequences of fully preprocessed and lexically analyzed tokens to
<outfile> in the specified format.
- If the first argument is not
-o, behaviour is undefined.
- If the
<outfile> argument or any
<srcfilei> argument starts with
-, behaviour is undefined.
These two relaxations give you the freedom to conditionally-implement implementation-defined non-default behaviour with command-line switches such as
--foo, and so on.
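A minimal sketch of parsing the invocation described above. The names `PreprocArgs` and `ParseArgs` are hypothetical and not part of the starter kit. Because behaviour is undefined when the first argument is not -o or when a file argument starts with -, this sketch simply rejects those cases; an implementation could equally treat them as optional switches.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical holder for the parsed command line.
struct PreprocArgs {
    std::string outfile;
    std::vector<std::string> srcfiles;
};

// `args` is the command line without the program name (argv[0]).
// Returns true when it matches `-o <outfile> <srcfile1> ... <srcfileN>`
// with N >= 1, and fills `out`; returns false otherwise.
bool ParseArgs(const std::vector<std::string>& args, PreprocArgs& out) {
    if (args.size() < 3 || args[0] != "-o") return false;
    for (std::size_t i = 1; i < args.size(); ++i) {
        // A leading '-' is where optional switches could hook in;
        // this sketch just refuses such arguments.
        if (!args[i].empty() && args[i][0] == '-') return false;
    }
    out.outfile = args[1];
    out.srcfiles.assign(args.begin() + 2, args.end());
    return true;
}
```

In main, the vector would typically be built from argv + 1 up to argv + argc.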
You should complete Programming Assignments 3 and 4 before starting this assignment.
The starter kit can be obtained from:
$ git clone git://git.cppgm.org/pa5.git
It contains a stub implementation of
preproc with some starter code, a compiled reference implementation and a test suite.
You will also want to reuse most of your code from PA1-PA4.
<srcfilei> should be opened and read relative to the current working directory of
std::ifstream. Each input file will be in UTF-8 format.
preproc shall create (or overwrite if it exists) a file with the name
<outfile>. You can use
std::ofstream to do so.
The first line of
<outfile> shall be:
<N> is the number of
Following that, for each
<srcfilei>, a per-srcfile section is output, in the same order as given on the command-line.
The first line of the per-srcfile section shall be:
<srcfilei> is the name of the infile given on the command-line.
The remainder of each per-srcfile output is the same as PA2
posttoken. This implies each per-srcfile section is terminated with an
eof, so there will be N
eofs in total. (
eof lines are not generated for
For example for a call of
$ preproc -o foo bar baz
bar is a file that contains
1 + 2 and
baz is a file that contains
A file called
foo should be created with the following contents:
preproc 2
sof bar
literal 1 int 01000000
simple + OP_PLUS
literal 2 int 02000000
eof
sof baz
identifier qux
eof
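The framing of the outfile can be sketched as follows. The layout (a `preproc <N>` line, then `sof <srcfilei>`, the token lines, and `eof` per file) is inferred from the example above. `SrcFileOutput` and `FormatOutfile` are hypothetical names, and the token lines are assumed to have been formatted already by your PA2 posttoken code.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical per-srcfile result: the name as given on the command line
// and the PA2-style token lines, without the terminating eof line.
struct SrcFileOutput {
    std::string name;
    std::vector<std::string> token_lines;
};

// Builds the full text of the outfile for the given source files,
// in command-line order.
std::string FormatOutfile(const std::vector<SrcFileOutput>& files) {
    std::ostringstream out;
    out << "preproc " << files.size() << "\n";
    for (std::size_t i = 0; i < files.size(); ++i) {
        out << "sof " << files[i].name << "\n";
        for (std::size_t j = 0; j < files[i].token_lines.size(); ++j) {
            out << files[i].token_lines[j] << "\n";
        }
        out << "eof\n";  // one eof per srcfile section
    }
    return out.str();
}
```

Building the whole text first and writing it with std::ofstream only once every srcfile has succeeded is one way to avoid producing a partial outfile on error, although the state of the outfile on failure is undefined anyway.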
For PA5 it is considered an error if any token stream at phase 7 contains a PA2
It is also an error if the
#error directive is encountered in a non-excluded section.
It is also an error if an
#if controlling expression contains a PA3
If an error occurs,
return EXIT_FAILURE as per the behaviour of
pptoken in PA1.
The state of the
outfile in case of
EXIT_FAILURE is undefined.
preproc standard output and standard error are undefined and ignored, whether or not an error occurs.
As per PA1
As usual to run the tests against the reference implementation use:
$ make test
Note that because the filesystem and current working directory are significant for PA5,
preproc must be run from the root of the PA5 starter kit (one directory up from the
make test takes care of this for you.
Also note that stdout and stderr are combined into a
.stdout file for PA5. This output is not significant. The
.my files without a further suffix are the actual
For ad hoc testing two wrapper scripts are included, called
preproc-stdin and preproc-ref-stdin. They create temporary files to hold the
outfile and then pass stdin to the temporary file, execute the
preproc, and write the outfile to stdout.
preproc-stdin uses your
preproc-ref-stdin uses the reference implementation
Logically each source file is processed in turn with no shared state between them.
The following preprocessing directives must be implemented:
- conditional inclusion (`#if`...`#endif`) (using PA3)
- source file inclusion (`#include`)
- macro replacement (using PA4)
- line control (`#line`)
- a course-defined list of pre-defined macros
- a course-defined list of pragmas (`#pragma`)
- error directive (`#error`)
- null directive and non-directive
Also you must implement the
Preprocessing Directive List
#if
#ifdef
#ifndef
#elif
#else
#endif
#include
#define
#undef
#line
#error
#pragma
If no tokens follow
# on a logical line, it is a null directive and is always ignored.
If the first token after
# is not an identifier, or not one of the identifiers in the above list, the logical line is a
non-directive. It is an error if in an active
#if section, or ignored if it is in an inactive one.
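The classification rules above can be sketched as a small function. `Tok`, `DirectiveKind` and `ClassifyDirective` are hypothetical names; a real implementation would use its own PA1 token type in place of `Tok`.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical minimal token: whether it is an identifier, and its spelling.
struct Tok {
    bool is_identifier;
    std::string text;
};

enum class DirectiveKind { Null, Known, NonDirective };

// `after_hash` holds the tokens that follow '#' on the logical line.
DirectiveKind ClassifyDirective(const std::vector<Tok>& after_hash) {
    static const std::set<std::string> kNames = {
        "if", "ifdef", "ifndef", "elif", "else", "endif",
        "include", "define", "undef", "line", "error", "pragma"};
    if (after_hash.empty()) return DirectiveKind::Null;  // null directive
    const Tok& first = after_hash.front();
    if (first.is_identifier && kNames.count(first.text) != 0) {
        return DirectiveKind::Known;
    }
    // Not an identifier, or an identifier outside the list.
    return DirectiveKind::NonDirective;
}
```

The caller still has to decide what a NonDirective means: an error in an active section, ignored in an inactive one.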
The following pre-defined macros shall be implemented:
#define __CPPGM__ 201303L // CPPGM course run version
#define __cplusplus 201103L // C++ version
#define __STDC_HOSTED__ 1 // hosted implementation
The above are fixed values.
#define __CPPGM_AUTHOR__ "John Smith" // Your full real name as enrolled in the course
Replace "John Smith" with a string literal that is your full real name.
#define __FILE__ "foo" // current presumed source file name
#define __LINE__ 123 // current presumed source line number
Behaviour specified below.
#define __DATE__ "Mmm dd yyyy" // build date from asctime
#define __TIME__ "hh:mm:ss" // build time from asctime
std::asctime function to implement
__TIME__. Call it once at the entry of main, and use the same build date and time for all srcfiles.
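One possible way to derive both values from a single std::asctime call is sketched below. std::asctime returns text of the form "Www Mmm dd hh:mm:ss yyyy" followed by a newline, so both macros can be cut out by position. The function names are hypothetical, and the use of local time (rather than UTC) is an assumption, since the text above does not say which to use.

```cpp
#include <ctime>
#include <string>

// Samples the clock once and returns the raw asctime text.
// Intended to be called once at the entry of main.
std::string BuildTimestamp() {
    std::time_t now = std::time(nullptr);
    return std::asctime(std::localtime(&now));  // assumption: local time
}

// "Www Mmm dd hh:mm:ss yyyy\n" -> "Mmm dd yyyy"
std::string DateFromAsctime(const std::string& s) {
    return s.substr(4, 7) + s.substr(20, 4);  // "Mmm dd " + "yyyy"
}

// "Www Mmm dd hh:mm:ss yyyy\n" -> "hh:mm:ss"
std::string TimeFromAsctime(const std::string& s) {
    return s.substr(11, 8);
}
```

The two results still need to be wrapped in double quotes to form the string literals that the macros expand to.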
You may conditionally-implement any other pre-defined macros provided they start with an underscore and then a capital (
_Foo) or another underscore (
The following pragmas shall be supported:
The following pragma shall NOT be supported (and so should be ignored):
#pragma cppgm_mock_unknown pptokens
You may conditionally-implement any additional pragmas.
The course-defined treatment of the _Pragma operator is as follows:
_Pragma(string-literal) shall be recognized only in a PA4
text-sequence, and only after all macro replacement. Any occurrence of the identifier
_Pragma must be followed by
( string-literal ) or it is an error. The pragma operator invocation tokens shall be removed from the
text-sequence after it is executed.
Source File Inclusion and
The course-defined handling of source file inclusion in the default case shall be as follows:
The current file shall be tracked by a string variable
__FILE__ shall be the same as the command-line argument.
#include directive is encountered it will match this form:
pptokens is any sequence of
pptokens shall be macro replaced as per PA4. If the resulting sequence of tokens is not a
header-name or an ordinary
string-literal, behaviour is undefined. (Notice that a
header-name can only be the result if it was already there, before macro replacement, as per PA1)
header-name types (
"foo") are treated the same. The delimiters are stripped and the resulting code points are converted into a UTF-8 string, we shall call
In the case of an ordinary
string-literal it shall be post-tokenized into a UTF-8 string, and likewise we shall call the string
__FILE__ contains a
/ character, a new string
pathrel is formed by concatenating (A) the sub-string of
__FILE__ up to and including the last
/; and (B) the string
__FILE__ = "foo/bar/baz"
#include "qux"
nextf = "qux"
pathrel = "foo/bar/" + "qux" = "foo/bar/qux"
Once the two strings
pathrel are identified they are searched as follows:
1. If
pathrel is defined and a file exists at that path relative to the current working directory (or absolute if it starts with
/), it shall be the include file.
2. Otherwise, if a file exists at the path
nextf relative to the current working directory (or absolute if it starts with a
/), it shall be the include file.
3. Otherwise, it is an error and
EXIT_FAILURE should be returned.
The new value of
__FILE__ is whichever one of 1 or 2 succeeded.
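The default search order can be sketched as below. `IncludeCandidates` and `ResolveInclude` are hypothetical names. Existence is approximated here by trying to open the file with std::ifstream, which is an assumption of this sketch; the starter kit's PA5GetFileId could be used for the check instead.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Returns the paths to try, in order: pathrel (only when the current
// __FILE__ contains a '/'), then nextf.
std::vector<std::string> IncludeCandidates(const std::string& current_file,
                                           const std::string& nextf) {
    std::vector<std::string> candidates;
    std::string::size_type slash = current_file.rfind('/');
    if (slash != std::string::npos) {
        // Sub-string up to and including the last '/', then nextf.
        candidates.push_back(current_file.substr(0, slash + 1) + nextf);
    }
    candidates.push_back(nextf);
    return candidates;
}

// Stores the first candidate that can be opened in `out_path`, which then
// becomes the new __FILE__. Returns false when none can be opened (an error).
bool ResolveInclude(const std::string& current_file, const std::string& nextf,
                    std::string& out_path) {
    std::vector<std::string> candidates = IncludeCandidates(current_file, nextf);
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        std::ifstream probe(candidates[i].c_str());
        if (probe.good()) {
            out_path = candidates[i];
            return true;
        }
    }
    return false;
}
```

Both candidates are passed to std::ifstream unchanged, so a path starting with / is opened as absolute and anything else relative to the current working directory.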
Recall that you can optionally implement command-line switches which alter or extend this behaviour. You may wish to implement a
-I <path> switch to add additional paths, and/or a
--stdinc switch which also searches
/usr/include, and so on. However, exactly the two search paths specified (
__FILE__ relative, and current working directory relative) must be the
preproc default behaviour.
Line Control and
pptokens is any sequence of
preprocessing-tokens are macro-replaced.
After macro replacement
#line will match one of the following two forms:
#line ppnumber
#line ppnumber string-literal
ppnumber should post-tokenize to a positive integer. The
string-literal if present shall be an ordinary
string-literal. If this is not the case behaviour is undefined.
The integer shall set the current value of
__LINE__, the string shall set the current value of
__FILE__ (notice that this will impact future
#pragma once handling is specified as follows:
There is a function in the starter code called
bool PA5GetFileId(const string& path, PA5FileId& out_fileid)
It takes a file
path as input and an out parameter of type
PA5FileId. It returns
true on success (or
false on failure, such as because the file does not exist).
So it should be used as follows:
string filepath = "foo/bar/baz";
PA5FileId fileid;
bool ok = PA5GetFileId(filepath, fileid);
if (ok) {
    // use fileid
} else {
    // file not found or inaccessible
}
For each srcfile maintain an (initially empty) set of fileids of headers that have been pragma onced (eg with a
When you process a
#pragma once, add the file id of
__FILE__ to that set.
Each time you encounter an
#include directive, lookup the file id and check if it is in that set. If it is ignore the
Side Note: PA5GetFileId is implemented by using the
stat(2) system call. It looks up the device id and inode number of the file, and uses those two numbers to differentiate files. We will implement system calls in a later assignment, so you don't need to understand this now.
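The per-srcfile set described above might look like the following sketch. `FileIdSketch` is only a stand-in for the starter kit's PA5FileId, whose real definition may differ; it is modelled here as a (device id, inode number) pair as in the side note. `PragmaOnceSet` and its members are hypothetical names.

```cpp
#include <set>
#include <utility>

// Stand-in for PA5FileId (assumption): device id and inode number.
typedef std::pair<unsigned long, unsigned long> FileIdSketch;

// One instance per srcfile, initially empty.
class PragmaOnceSet {
public:
    // Call when #pragma once is processed, with the id of the current file.
    void MarkOnce(const FileIdSketch& id) { once_.insert(id); }

    // Call when an #include has been resolved: true means the file was
    // already pragma onced and the directive should be skipped.
    bool ShouldSkip(const FileIdSketch& id) const {
        return once_.count(id) != 0;
    }

private:
    std::set<FileIdSketch> once_;
};
```

Keying the set on file ids rather than path strings means two different paths that reach the same file are treated as one.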
You should read the parts of clause 16 Preprocessing directives that you have not already read.
Design Notes (Optional)
You will need to start with PA1
pptoken code as usual, however you will need to add source file name and physical line tracking to support
One way to do this is before the entry to
PPTokenizer count physical new line characters before they enter the tokenizer, and then as
preprocessing-tokens are emitted, store the current source file name and line in each token. So that you are not storing a string in each token (although they are usually copy-on-write anyway), you may want to store filenames in a table and use an
int index. After macro invocation assign the source file and line of each produced token to be that of the head. When you invoke a
__LINE__ predefined macro, simply replace it with its own file or line.
Now we have a stream of
pptokens with file and line numbers marked.
The next step is to accumulate preprocessing directives and text sequences. You will need to do this in a stream, as some preprocessing directives will change the system state that will affect how later directives are handled. You can delimit preprocessing tokens as previously. A
new-line # signifies the start of a preprocessing directive and a
new-line ends one. This can be done with a little DFA or state machine.
Each time a preprocessing directive is encountered you will need to take a certain action.
Conditional inclusion is a bit more complicated. A
#ifndef will start an if group, a matching
#endif will close it. In between there can be one or more
#elifs and an optional
#else. Depending on the corresponding truth values of the controlling expressions, each section is either active or inactive.
Whether in an inactive or active group, a nested
#if group must be ordered correctly (with respect to
#endif); however, inside an inactive group all preprocessing directives (aside from their names) are ignored. For example, the following is valid:
#if 0
#if foo bar baz
#elif foo bar baz ... @
#else @@@@@
#endif @@@@@
#else
ok
#endif
This is because the nested
#if group is in an inactive section. Any sequence of tokens can come after the directive names in this case. However the following is invalid:
#if 0
#if foo bar baz
#else @@@@@
#elif foo bar baz ... @
#endif @@@@@
#else
ok
#endif
#elif directives are in the wrong order. This order must be correct even in an inactive section.
This can be implemented with a stack of states. When you enter an
#if group or
#include, push a new state onto the stack. As you proceed through an
#if group update the state. When you exit an
#if group or
#include pop the state off the stack. The state can keep track of whether or not you are in an
#if group, where in the
#if group you are, and whether or not you are currently active. As the different directives are encountered they alter the state on the top of the stack or push or pop states from it.
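A minimal sketch of such a state stack, restricted to conditional inclusion, is shown below. `ConditionalStack` and its members are hypothetical names. The sketch assumes one instance per file being processed, which stands in for pushing a state on #include. The caller passes the already evaluated controlling expression and may pass any value when the expression is not evaluated, such as inside an inactive section.

```cpp
#include <cstddef>
#include <vector>

class ConditionalStack {
public:
    // #if, #ifdef or #ifndef: opens a new if group.
    void OnIf(bool cond) {
        Group g;
        g.parent_active = Active();
        g.active = g.parent_active && cond;
        g.taken = g.active;
        g.seen_else = false;
        stack_.push_back(g);
    }
    // Returns false when the #elif is misplaced (no open group, or after #else).
    bool OnElif(bool cond) {
        if (stack_.empty() || stack_.back().seen_else) return false;
        Group& g = stack_.back();
        g.active = g.parent_active && !g.taken && cond;
        g.taken = g.taken || g.active;
        return true;
    }
    // Returns false when the #else is misplaced (no open group, or a second #else).
    bool OnElse() {
        if (stack_.empty() || stack_.back().seen_else) return false;
        Group& g = stack_.back();
        g.seen_else = true;
        g.active = g.parent_active && !g.taken;
        g.taken = true;
        return true;
    }
    // Returns false when there is no open group to close.
    bool OnEndif() {
        if (stack_.empty()) return false;
        stack_.pop_back();
        return true;
    }
    // True when tokens at the current position are in an active section.
    bool Active() const { return stack_.empty() || stack_.back().active; }
    // Number of open if groups; non-zero at end of file is an error.
    std::size_t Depth() const { return stack_.size(); }

private:
    struct Group {
        bool parent_active;  // was the enclosing section active?
        bool taken;          // has some section of this group been active?
        bool active;         // is the current section active?
        bool seen_else;      // has #else been seen in this group?
    };
    std::vector<Group> stack_;
};
```

The ordering checks run even when the group sits inside an inactive section, which is what makes the second example above invalid while the first is accepted.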
Standard Platform Includes (Optional/Ungraded)
This is an optional additional feature. You are not required to implement it, and it will not be graded.
--stdinc command-line switch that adds the following directories to your include file search path:
/usr/include/c++/4.7
/usr/include/c++/4.7/x86_64-linux-gnu
/usr/include/c++/4.7/backward
/usr/lib/gcc/x86_64-linux-gnu/4.7/include
/usr/local/include
/usr/lib/gcc/x86_64-linux-gnu/4.7/include-fixed
/usr/include/x86_64-linux-gnu
/usr/include
These are the standard system includes from the bootstrap environment.
You can extract this list from the bootstrap gcc compiler with the following command:
$ echo | g++ -E -v -x c++ -
Also of interest, to extract the list of predefined macros from gcc you can use this command:
$ gcc -dM -E - <<<''
If you have time feel free to experiment with preprocessing the headers in
/usr/include, but note that many headers use non-standard features that will not be compatible with your toolchain. This is expected, and not something to worry about. Much later in the course you will be making a self-hosting build of your compiler, and discarding the bootstrap environment. The produced self-hosted compiler will compile programs against your standard library headers, and not the standard library headers from the bootstrap environment. The bootstrap standard library is only used for building your "stage 1" compiler, which is dynamically linked against the bootstrap standard library (
libstdc++). Your stage 1 compiler will statically link compiled programs against your standard library. Your stage 1 compiler will then be used to compile your stage 2 compiler, by compiling your compiler sources. This stage 2 compiler will be statically linked against your standard library (as well as compiling programs against your standard library like the stage 1 compiler). Notice that this stage 2 compiler will have no dependencies on the bootstrap environment.
The important thing to understand for this assignment is that nowhere in this whole bootstrapping process does your
preproc code need to preprocess the headers in