CPPGM Programming Assignment 5 (preproc)
Overview
Write a C++ application called preproc
that takes as input a set of C++ Source Files, executes translation phases 1 through 6 and the tokenization part of phase 7, and describes the resulting sequence of tokens to an output file. (Notice that this means preproc
will be a complete standard-compliant C++ Preprocessor and Lexer.)
Command-Line Arguments
preproc
will be invoked as follows:
$ preproc -o <outfile> <srcfile1> <srcfile2> ... <srcfileN>
where N >= 1
preproc
shall read each <srcfilei>
file, execute phases 1 through 6 and the tokenization part of phase 7, then describe the resulting sequences of fully preprocessed and lexically analyzed tokens to <outfile>
in the specified format.
- If the first argument of
preproc
is not-o
, behaviour is undefined. - If the
<outfile>
argument or any<srcfilei>
argument starts with-
, behaviour is undefined
These two relaxations give you the freedom to conditionally-implement implementation-defined non-default behaviour with command-line switches such as -a
, --foo
, and so on.
Prerequisites
You should complete Programming Assignment 3 and 4 before starting this assignment.
Starter Kit
The starter kit can be obtained from:
$ git clone git://git.cppgm.org/pa5.git
It contains a stub implementation of preproc
with some starter code, a compiled reference implementation and a test suite.
You will also want to reuse most of your code from PA1-PA4.
Input Format
Each <srcfilei>
should be opened and read relative to the current working directory of preproc
using std::ifstream
. Each input file will be in UTF-8 format.
Output Format
preproc
shall create (or overwrite if it exists) a file with the name <outfile>
. You can use std::ofstream
to do so.
The first line of <outfile>
shall be:
preproc <N>
where <N>
is the number of <srcfiles>
.
Following that, for each <srcfilei>
, a per-srcfile section is output, in the same order as given on the command-line.
The first line of the per-srcfile section shall be:
sof <srcfilei>
where <srcfilei>
is the name of the infile given on the command-line.
The remainder of each per-srcfile output is the same as PA2 posttoken
. This implies each per-srcfile section is terminated with an eof
, so there will be N eofs
in total. (sof
and eof
lines are not generated for #include
files)
Example
For example for a call of
$ preproc -o foo bar baz
Where bar
is a file that contains 1 + 2
and baz
is a file that contains qux
A file called foo
should be created with the following contents:
preproc 2
sof bar
literal 1 int 01000000
simple + OP_PLUS
literal 2 int 02000000
eof
sof baz
identifier qux
eof
Error Reporting
For PA5 it is considered an error if any token stream at phase 7 contains a PA2 invalid
token.
It is also an error if the #error
directive is encountered in a non-excluded section.
It is also an error if an #if
controlling expression contains a PA3 error
result.
If an error occurs, preproc
should return EXIT_FAILURE
as per the behaviour of pptoken
in PA1.
The state of the outfile
in case of EXIT_FAILURE
is undefined.
preproc
standard output and standard error are undefined and ignored, whether or not an error occurs.
Restrictions
As per PA1
Testing
As usual to run the tests against the reference implementation use:
$ make test
Note that because the filesystem and current working directory are significant for PA5, preproc
must be run from the root of the PA5 starter kit (one directory up from the tests
directory). make test
takes care of this for you.
Also note that stdout and stderr are combined into a .stdout
file for PA5. This output is not significant. The.ref
and .my
files without a further suffix are the actual outfiles
.
For adhoc testing two wrapper scripts are included called preproc-stdin
and preproc-ref-stdin
. They create temporary files to hold the srcfile
and outfile
and then pass stdin to the temporary file, execute the preproc
, and write the outfile to stdout. preproc-stdin
uses your preproc
implementation. preproc-ref-stdin
uses the reference implementation preproc-ref
.
Features
Logically each source file is processed in turn with no shared state between them.
The following preprocessing directives must be implemented:
- conditional inclusion (`#if`...`#endif`) (using PA3)
- source file inclusion (`#include`)
- macro replacement (using PA4)
- line control (`#line`)
- a course-defined list of pre-defined macros
- a course-defined list of pragmas (`#pragma`)
- error directive (`#error`)
- null directive and non-directive
Also you must implement the _Pragma()
operator
Preprocessing Directive List
#if
#ifdef
#ifndef
#elif
#else
#endif
#include
#define
#undef
#line
#error
#pragma
If no tokens follow #
on a logical line, it is a null directive and is always ignored.
If the first token after #
is not an identifier, or not one of the identifiers in the above list, the logical line is a non-directive
. It is an error if in an active #if
section, or ignored if it is in an inactive one.
Pre-defined Macros
The following pre-defined macros shall be implemented:
#define __CPPGM__ 201303L // CPPGM course run version
#define __cplusplus 201103L // C++ version
#define __STDC_HOSTED__ 1 // hosted implementation
The above are fixed values.
#define __CPPGM_AUTHOR__ "John Smith" // Your full real name as enrolled in the course
Replace "John Smith" with a string literal that is your full real name.
#define __FILE__ "foo" // current presumed source file name
#define __LINE__ 123 // current presumed source line number
Behaviour specified below.
#define __DATE__ "Mmm dd yyyy" // build date from asctime
#define __TIME__ "hh:mm:ss" // build time from asctime
Use the std::asctime
function to implement __DATE__
and __TIME__
. Call it once at the entry of main, and use the same build date and time for all srcfiles.
You may conditionally-implement any other pre-defined macros provided they start with an underscore and then a capital (_Foo
) or another underscore (__foo
).
Pragmas
The following pragmas shall be supported:
#pragma once
The following pragma shall NOT be supported (and so should be ignored):
#pragma cppgm_mock_unknown pptokens
You may conditionally-implement any additional pragmas.
The course-defined treatment of the _Pragma operator is as follows:
A
_Pragma(string-literal)
shall be recognized only in a PA4text-sequence
, and only after all macro replacement. Any occurence of the identifier_Pragma
must be proceeded by( string-literal )
or it is an error. The pragma operator invocation tokens shall be removed from thetext-sequence
after it is executed.
Source File Inclusion and __FILE__
The course-defined handling of source file inclusion in the default case shall be as follows:
The current file shall be tracked by a string variable __FILE__
. Initially __FILE__
shall be the same as the command-line argument.
When a #include
directive is encountered it will match this form:
#include pptokens
where pptokens
is any sequences of preprocessing-tokens
.
pptokens
shall be macro replaced as per PA4. If the resulting sequence of tokens is not a header-name
or an ordinary string-literal
, behaviour is undefined. (Notice that a header-name
can only be the result if it was already there, before macro replacement, as per PA1)
Both header-name
types (<foo>
and "foo"
) are treated the same. The delimiters are stripped and the resulting code points are converted into a UTF-8 string, we shall call nextf
.
In the case of an ordinary string-literal
it shall be post-tokenized into a UTF-8 string, and likewise we shall call the string nextf
.
If __FILE__
contains a /
character, a new string pathrel
is formed by concatenating (A) the sub-string of __FILE__
up to and including the last /
; and (B) the string nextf
In pseudo-code:
__FILE__ = "foo/bar/baz"
#include "qux"
nextf = "qux"
pathrel = "foo/bar/" + "qux" = "foo/bar/qux"
Once the two strings nextf
and pathrel
are identified they are searched as follows:
- If
pathrel
is defined and a file exists of that path relative to the current working directory (or absolute if it starts with/
), it shall be the include file. - Otherwise, if a file exists of the path
nextf
relative to the current working directory (or absolute if it starts with a/
) shall be the include file. - Otherwise, it is an error and
EXIT_FAILURE
should be returned.
The new value of __FILE__
is whichever one of 1 or 2 succeeded.
Recall that you can optionally implement command-line switches which alter or extend this behaviour. You may wish to implement a -I <path>
switch to add additional paths, and/or a --stdinc
switch which also searches /usr/include
, and so on. However, exactley the two search paths specified (__FILE__
relative, and current working directory relative) must be the preproc
default behaviour.
Line Control and __LINE__
#line pptokens
where pptokens
is any sequences of preprocessing-tokens
. preprocessing-tokens
are macro-replaced.
After macro replacement #line
will match one of the following two forms:
#line ppnumber
#line ppnumber string-literal
The ppnumber
should post-tokenize to a positive integer. The string-literal
if present shall be an ordinary string-literal
. If this is not the case behaviour is undefined.
The integer shall set the current value of __LINE__
, the string shall set the current value of __FILE__
(notice that this will impact future #include
behaviour).
Pragma Once
The course-defined #pragma once
handling is specified as follows:
There is a function in the starter code called PA5GetFileId
:
bool PA5GetFileId(const string& path, PA5FileId& out_fileid)
It takes a file path
as input and an out parameter of type PA5FileId
. It returns true
on success (or false
on failure, such as because the file does not exist).
So it should be used as follows:
string filepath = "foo/bar/baz";
PA5FileId fileid;
bool ok = PA5GetFileId(filepath, fileid);
if (ok)
// use fileid
else
// file not found or unaccessible
For each srcfile maintain an (initially empty) set of fileids of headers that have been pragma onced (eg with a std::set<PA5FileId>
)
When you process a #pragma once
, add the file id of __FILE__
to that set.
Each time you encounter an #include
directive, lookup the file id and check if it is in that set. If it is ignore the #include
directive.
Side Note: PA5GetFileId is implemented by using the
stat(2)
system call. It looks up the device id and ino number of the file, and uses those two numbers to differentiate files. We will implement system calls in a later assignment, so you don't need to understand this now.
Reading
You should read the parts of clause 16 Preprocessing directives that you have not already read.
Design Notes (Optional)
You will need to start with PA1 pptoken
code as usual, however you will need to add source file name and physical line tracking to support __FILE__
and __LINE__
.
One way to do this is before the entry to PPTokenizer
count physical new line characters before they enter the tokenizer, and then as preprocessing-tokens
are emitted, store the current source file name and line in each token. So you are not storing a string in each token (although they are usually copy-on-write anyway), you may want to store filenames in a table and use an int
index. After macro invocation assign the source file and line of each produced token to be that of the head. When you invoke a __FILE__
or __LINE__
predefined macro, simply replace it with its own file or line.
Now we have a stream of pptokens
with file and line numbers marked.
The next step is to accumulate preprocessing directives and text sequences. You will need to do this in a stream, as some preprocessing directives will change the system state that will effect how later directives are handled. You can delimit preprocessing tokens as previously. A new-line #
signifies the start of a preprocessing directive and a new-line
ends one. This can be done with a little DFA or state machine.
Each time a preprocessing directive is encountered you will need to take a certain action.
Conditional inclusion is a bit more complicated. A #if
, #ifdef
or #ifndef
will start an if group, a matching #endif
will close it. In between there can be one of more #elifs
and an optional #else
. Depending on the corresponding truth values of the controlling expressions, each section is either active or inactive.
Whether in an inactive or active group, a nested #if
group must be ordered correctly (with respect to #elifs
, #else
and #endif
), however inside an inactive group all preprocessing directives (aside from there name) are ignored. For example, the following is valid:
#if 0
#if foo bar baz
#elif foo bar baz ... @
#else @@@@@
#endif @@@@@
#else
ok
#endif
This is because the nested #if
group is in an inactive section. Any sequence of tokens can come after the directive names in this case. However the following is invalid:
#if 0
#if foo bar baz
#else @@@@@
#elif foo bar baz ... @
#endif @@@@@
#else
ok
#endif
The #else
and #elif
directives are in the wrong order. This order must be correct even in an inactive section.
This can be implemented with a stack of states. When you enter an #if
group or #include
, push a new state onto the stack. As you proceed through an #if
group update the state. When you exit an #if
group or #include
pop the state off the stack. The state can keep track of whether or not you are in an #if
group, where in the #if
group you are, and whether or not you are currently active. As the different directives are encountered they alter the state on the top of the stack or push or pop states from it.
Standard Platform Includes (Optional/Ungraded)
This is an optional additional feature. You are not required to implement it, and it will not be graded.
Implement a --stdinc
command-line switch that adds the following directories to your include file search path:
/usr/include/c++/4.7
/usr/include/c++/4.7/x86_64-linux-gnu
/usr/include/c++/4.7/backward
/usr/lib/gcc/x86_64-linux-gnu/4.7/include
/usr/local/include
/usr/lib/gcc/x86_64-linux-gnu/4.7/include-fixed
/usr/include/x86_64-linux-gnu
/usr/include
These are the stanard system includes from the bootstrap enviornment.
You can extract this list from the bootstrap gcc compiler with the following command:
$ echo | g++ -E -v -x c++ -
Also of interest, to extract the list of predefined macros from gcc you can use this command:
$ gcc -dM -E - <<<''
If you have time feel free to experiment with preprocessing the headers in /usr/include
, but note that many headers use non-standard features that will not be compatible with your toolchain. This is expected, and not something to worry about. Much later in the course you will be making a self-hosting build of your compiler, and discard the bootstrap environment. The produced self-hosted compiler will compile programs against your standard library headers, and not the standard library headers from the bootstrap environment. The bootstrap standard library is only used for building your "stage 1" compiler, which is dynamically linked against the bootstrap standard library (libstdc++
). Your stage 1 compiler will statically link compiled programs against your standard library. Your stage 1 compiler will then be used to compile your stage 2 compiler, by compiling your compiler sources. This stage 2 compiler will be statically linked against your standard library (as well as compiling programs against your standard library like the stage 1 compiler). Notice that this stage 2 compiler will have no dependencies on the bootstrap environment.
The important thing to understand for this assignment is that nowhere in this whole bootstrapping process does your preproc
code need to preprocess the headers in /usr/include
.
