C++ Grandmaster Certification

CPPGM Programming Assignment 4 (macro)

Overview

Write a C++ application called macro that accepts a C++ Source File on standard input and executes phases 1 through 6 and the tokenization part of phase 7.

The input file may contain:

  • #define preprocessing directives
  • #undef preprocessing directives

The input file does NOT include any...

  • preprocessing directives other than #define or #undef
  • pre-defined macro names
  • the pragma operator

Output the resulting tokens as per PA2 posttoken.

Prerequisites

You should complete Programming Assignment 2 before starting this assignment.

Starter Kit

The starter kit can be obtained from:

$ git clone git://git.cppgm.org/pa4.git

It contains a stub implementation of macro, a compiled reference implementation and a test suite.

You will also want to reuse most of your code from PA2.

Input Format

The C++ Source File shall be read from standard input std::cin in UTF-8 format as per PA1 pptoken.

Output Format

The tokens shall be output as per PA2 posttoken.

Error Reporting

If an error occurs macro should return EXIT_FAILURE as per the behaviour of pptoken in PA1.

Example

For an input of:

#define a 3
#define f(x) x x
f(a)

Produces an output of:

literal 3 int 03000000
literal 3 int 03000000
eof

Restrictions

As per PA1

Features

Once you have executed phase 1-3 as per PA1 pptoken to produce a sequence of preprocessing-tokens, you will divide them into preprocessing directives and text-sequences.

Preprocessing directives can be identified by their prefix, one of:

start-of-file #
start-of-file whitespace-sequence #
new-line #
new-line whitespace-sequence #

Where start-of-file denotes the beggining of the token sequence.

Any of these patterns must start a preprocessing directive. The preprocessing directive is terminated by the next new-line.

The only preprocessing directives for PA4 will be macro defines and undefs:

// object-like macros
# define identifier new-line
# define identifier must-whitespace replacement-list new-line

// function-like macros
# define identifier no-whitespace ( ) replacement-list new-line
# define identifier no-whitespace ( identifier-list ) replacement-list new-line
# define identifier no-whitespace ( ... ) replacement-list new-line
# define identifier no-whitespace ( identifier-list , ... ) replacement-list new-line

// undefine macro
# undef identifier new-line

no-whitespace means that there is no whitespace-sequence between the two adjacent tokens. must-whitespace means there must be a whitespace-sequence between the two adjacent tokens. Otherwise whitespace-sequence may optionally appear between any tokens.

A replacement-list is a possibly empty sequence of any preprocessing-tokens apart from new-line.

An identifier-list is a comma-separated sequence of identifiers.

#define directives define macros, #undef directives undefine them.

A text-sequence is a maximal sequence of tokens that is not part of a preprocessing directive.

Each text-sequence must be macro replaced using the currently defined set of macros at that point in the file. Once this has been done the macro replaced preprocessing-tokens should be processed by phase 5-7 as per PA2 posttoken and output.

The rules for macro replacement are described in section 16.3 Macro Replacement of the standard.

If the name of the macro being replaced is found during this scan of the replacement list (not including the rest of the source file’s preprocessing tokens), it is not replaced. Furthermore, if any nested replacements encounter the name of the macro being replaced, it is not replaced. These nonreplaced macro name preprocessing tokens are no longer available for further replacement even if they are later (re)examined in contexts in which that macro name preprocessing token would otherwise have been replaced.

We shall course-define two invocations X and Y to be nested with respect to each other if the head macro name identifier token of Y was part of the replaced tokens of the X invocation. This nestedness relationship shall not survive parameter substitution:

#define f(x) 1 g(x)
#define g(x) 2 f(x)

f(f(x))

The trace of the above is:

f(f(x))
f(1 g(x))
f(1 2 f(x))
1 g(1 2 f(x))
1 2 f(1 2 f(x))

However a macro name token that is not replaced due to nesting is never replaced again, even after substitution as:

#define z z[0]
#define f(x) 1 x

f(z)

The trace is:

f(z)
f(z[0])
1 z[0]

z remains unavailable for invocation even after it is substituted in a parameter.

Finally we show:

#define f(x) 1 x
#define g(x) 2 x

g(f)(g)(3)

The trace is:

g(f)(g)(3)
2 f(g)(3)  <-- f is nested with respect to g as the token is part of the previous g invocation
2 1 g(3)   <-- g is nested with respect to f, therefore it is not replaced (g -> f -> g)

Note that this course-defined interpretation of the nesting semantics is not necessarily the same one used by all other preprocessors (in particular gcc).

Design Notes (optional)

As usual these design notes only describe one possible way to do this assignment, and are entirely optional.

First of all, it is recommended that you read 16.3 a couple of times before you continue reading these notes, and certainly before you start writing code. Some find the logic of macro replacement quite challenging to implement, so be prepared for this.

You can start by making a PreprocessingToken class that packs up the output of IPPTokenStream.

Perform a simple top-down predictive parse to get a stream of preprocessing directives and text-sequences.

Keep a table of currently defined macros.

When you encounter a #define add the macro to the table. If it currently exists check to see that the identifier-list and replacement-list is the same as the currently defined one. It is an error if they do not match.

When you encounter an #undef remove the macro from the table.

When you encounter a text-sequence you need to process any macro invocations it contains.

As a first step note that new-lines are not significant once a text-sequence has been identified, so you can collapse new-lines into whitespace-sequence at this point. After that you will also want to collapse each resulting contiguous sequences of whitespace-sequences into a singular whitespace-sequence.

As you find macro invocations you will invoke them and replace the invocation with a new sequence of tokens. These new tokens have to be rescanned.

Consider the following:

#define f(x) 1 x (
#define g(x) 2 x
f(g)b)

The text-sequence is as follows:

f(g)b)

After the first invocation f(g) is replaced by 1 g ( resulting in:

1 g (b)

This must be rescanned, and a new invocation is found. That of g (b), which is replaced by 2 b:

1 2 b

Given the nature of that operation, a good data structure to use is a stack, where the top of the stack is the start of the text sequence (ie token order is in reverse). That way when an invocation is replaced, its tokens can be popped off the top of the stack and the replacement list can be pushed onto it, and processing can continue from the top of the stack again.

After you identify a function-like macro invocation and its arguments, there are three possible transformations you may need to perform on each argument before they are used:

  • the argument is used as is
  • the argument is macro replaced before being used
  • the argument is stringized

Which of the three options is applied depends on the usage of the parameter within the replacement-list of the macro. If it follows a #, the argument is stringized. If it is next to a ## the argument is used as is. If neither, than it is macro replaced before being used.

The rules for nesting and recursion are difficult to understand, and even more difficult to implement.

We suggest keeping with each identifier token, a blacklist of nested macro names and a noninvokable flag. When an identifier token is produced as part of a macro invocation its blacklist is assigned to be that of the head token of the macro that was just invoked, and also the name of the macro that was used is added.

Once you have identified a new potential macro invocation, check to see if the heads name is contained in its own blacklist, if it is flag it as noninvokable. If the head token is flagged as noninvokable, abort the invocation.