CPPGM Programming Assignment 2 (posttoken)
Overview
Write a C++ application called posttoken
that accepts a C++ Source File on standard input that does NOT include any...
- preprocessing directives
- pre-defined macro names
- the pragma operator
...executes phases 1, 2, 3, 4, 5, 6 and the tokenization part of 7 of the Phases of Translation, and describes the resulting sequence of analyzed and categorized tokens
to standard output in the specified format.
Notice that by virtue of the restrictions, phase 4 (preprocessing) is a no-op (does nothing).
Prerequisites
You should complete Programming Assignment 1 before starting this assignment.
Starter Kit
The starter kit can be obtained from:
$ git clone git://git.cppgm.org/pa2.git
It contains a stub implementation of posttoken
with some optional starter code, a compiled reference implementation and a test suite.
You will also want to reuse most of your code from PA1.
Input Format
The C++ Source File shall be read from standard input std::cin
in UTF-8 format as per PA1.
Error Reporting
If an error occurs in phases 1, 2 or 3 main
should return EXIT_FAILURE
as per the behaviour of pptoken
in PA1.
If an error occurs while converting a preprocessing-token
to a token
than you should output an invalid
token (see below) and continue.
Preprocessing-tokens #
, ##
, %:
, %:%:
, non-whitespace-characters
, and header-names
, should be output as invalids
.
If a preprocessing-token
contains a pre-defined macro names or the pragma operator you may treat them as identifiers.
Restrictions
As per PA1
Output Format
posttoken
shall write to standard output the following in UTF-8 format:
For each token in the sequence, one line shall be printed consisting of space-character separated elements.
The first element is the Token Type, one of:
simple
identifier
literal
user-defined-literal
invalid
The second element is the UTF-8 encoded source from. The source is the PA1 Token Data. (In particular it should have universal-character-names
decoded, but escape-sequences
are not decoded).
In the case of string literal token concatenation, multiple preprocessing-tokens
are used to form a single token
. The source for such a token is a space-seperated list of the preprocessing-token
sources. If string literal concatenation fails than output an invalid
that also has a source of the same space seperated list.
The output should end with a eof
on line by itself.
Depending on the Token Type the remaining elements are as follows:
simple
simple
represent keywords
, operators
and punctuators
.
This includes true
, false
, nullptr
, new
and delete
- for this assignment we will treat them as keywords
and not literals
or operators
.
They are output as
simple <source> <TERMNAME>
Where <TERMNAME>
is given in the list below under Simple Token Types (defined below).
For example, for an input of:
auto &&
the output is:
simple auto KW_AUTO
simple && OP_LAND
identifier (in output context)
identifiers
are output as
identifier <source>
Where <source>
is the identifier in UTF-8, for example for an input of:
foo
the output is:
identifier foo
literal
literal <source> <type> <hexdump>
Where <type>
is the type of the literal - either a Fundamental Type (defined below), or an array of <n> <fundamental_type>
.
For example:
char
long long int
array of 42 char16_t
And <hexdump>
is the hexadecimal representation of the data in memory in the Linux x86-64 ABI (defined below).
Important: When scanning
floating-literals
for output, please use the functions from the starter codePA2Decode_float
,PA2Decode_double
andPA2Decode_long_double
. You may also implement scanning yourself but please pass the literal through these functions for PA2 output for bit-perfect compatibility with test harness and reference implementation.
For example, for an input of:
1000000 'A' "ABC" 3.2
the output is:
literal 1000000 int 40420F00
literal 'A' char 41
literal "ABC" array of 4 char 41424300
literal 3.2 double 9A99999999990940
In the case of string literal concatenation:
"abc" /* concat */ "def"
The output should have the sources joined space separated:
literal "abc" "def" array of 7 char 61626364656600
user-defined-literal
One of:
user-defined-literal <source> <ud-suffix> integer <prefix>
user-defined-literal <source> <ud-suffix> floating <prefix>
user-defined-literal <source> <ud-suffix> character <type> <hexdump>
user-defined-literal <source> <ud-suffix> string <type> <hexdump>
Where <ud-suffix>
is the ud-suffix
In the case of integer
and floating
, <prefix>
is the source without the ud-suffix
In the case of string
and character
, <type> <hexdump>
have the same meaning as in literal
(above).
For example:
123_foo 4.2_bar 0x3_baz "abc"_qux 'a'_quux
outputs:
user-defined-literal 123_foo _foo integer 123
user-defined-literal 4.2_bar _bar floating 4.2
user-defined-literal 0x3_baz _baz integer 0x3
user-defined-literal "abc"_qux _qux string array of 4 char 61626300
user-defined-literal 'a'_quux _quux character char 61
invalid
For any valid preprocessing-token
that does not posttokenize output:
invalid <source>
For example:
# 123abc 1..e
Should output:
invalid #
invalid 123abc
invalid 1..e
Also as stated above invalid string literal concatenation should output one invalid. For example:
u8"abc" u"def" U"ghi"
Should output:
invalid u8"abc" u"def" U"ghi"
Features
First you will need to apply the same functionality from pptoken to get the stream of processing-tokens
.
whitespace-sequence
and new-line
are ignored. (They are for preprocessing, which we shall implement in a later assignment)
Any occurences of non-whitespace-character
or header-name
output as an invalid
.
The remaining preprocessing-tokens
are:
identifier
preprocessing-op-or-punc
pp-number
character-literal
user-defined-character-literal
string-literal
user-defined-string-literal
We will discuss each of them:
identifier (in preprocessing-token context)
If it is a member of the keyword list, output it as a keyword, otherwise output it as an identifier.
preprocessing-op-or-punc
If you encounter the operators #
, ##
, %:
, %:%:
output them as invalid
.
For the remaining operators
and punctuation
you should map them to the appropriate simples
given in the table below, or output them as invalid
if not found.
pp-number
You need to analyze the string and classify it as matching the integer-literal
, floating-literal
, user-defined-integer-literal
or user-defined-floating-literal
grammar.
If it doesn't match any of these grammars, output it as invalid
.
In the case of user-defined-integer-literal
and user-defined-floating-literal
you need to split the ud_suffix
off and then output the prefix and ud-suffix
as a string.
In the case of integer-literal
and floating-literal
you will then need to calculate its appropriate type (see 2.14.2 and 2.14.4 in the standard). If no appropriate type is found output it as an invalid
.
Once the appropriate type is found you will then scan it into that type.
For floating-literals
once you have confirmed that they match the appropriate grammar, use the method shown in Decode_float
, Decode_double
and Decode_long_double
in the starter code to scan them for output. (This is just for testing so the hexdump is binary identical to the reference implementation.)
You may optionally check that floating-literals
are in the range of their types:
If the scaled value is not in the range of representable values for its type, the program is ill-formed. (course defined optional)
but this is not required for PA2, nor will it be tested for PA2.
For integer-literals
you need to check that they fit into the maximal type based on their integer-suffix
, if they do not output them as invalid
. Calculate the appropriate type to place them in and scan them into that type for output. This is explained in 2.14.2.
character-literal
Escape sequences (2.14.3.3 Table 7) need to be decoded into their code point.
We shall course define the limits of character literals as follows:
All character literals must contain one code point in the valid Unicode range.
[0 .. 0xD800)
and[0xE000 .. 0x110000)
If an ordinary character literals code point is less than or equal to 127 (it is in the ASCII range and representable as one UTF-8 code unit), its type is
char
, otherwise its type isint
. In both cases its value is its code point.For the remaining character literals types (u, U, L) the type is as per the standard (
char16_t
,char32_t
,wchar_t
), and ill-formed if the code point doesn't fit in a single code unit.
L
/wchar_t
is defined by the ABI (defined below) as 32 bits (UTF-32, same as U
)
user-defined-character-literal
The same as for character-literal except extract the ud-suffix
first
string-literal
First decode escape sequences (in non-raw literals) so that each string-literal
are sequences of code points.
You then need to consider maximal consequtive sequences of both kinds (user-defined and non-user-defined) together and apply the phase 6 rules about string concatenation.
A string-literal
without an encoding prefix is called an ordinary string literal.
There are 4 encoding-prefixes
: u8
(UTF-8), u
(UTF-16), U
(UTF-32), L
(UTF-32).
We shall course define the rule for concatenating string literals as follows:
- If a sequence of
string-literals
contains no encoding prefixes, the entire sequence is an treated as an ordinary string literal- If a sequence of
string-literals
contains one of the fourencoding-prefixes
, the entire sequence shall use that prefix.- If a sequence of
string-literals
contains two or more different types of the fourencoding-prefixes
, the program is ill-formed, output asinvalid
.
Read 2.14.5.13 for clarification of this.
Concatenate the sequences and append a terminating 0 code point.
You will then need to UTF-8, UTF-16 or UTF-32 encode them as appropriate depending on the encoding prefix. As our execution character set is UTF-8, ordinary string literals are UTF-8 (same as u8
).
The type is array of <n> <ch>
, where <n>
is the number of code units (not code points), including terminating 0 code point - and <ch>
is the appropriate type (char
, char16_t
, char32_t
, wchar_t
)
user-defined-string-literal
The rules are similiar to string-literal
, and in fact user-defined-string-literals
and combined with normal string-literals
as part of the maximal sequence.
The rule about combining ud-suffixes
is similiar to the one about encoding-prefixes
. The total number of ud-suffix
types must be 0 or 1, or the program is ill-formed. If it is 0 than it is a normal string-literal
. If it is 1 than it is a user-defined-string-literal
of that type.
Read 2.14.8.8 for clarification.
Testing / Reference Implementation
The test suite is similiar to PA1. Execute:
$ make test
In order to deal with invalids and keep going, stderr is not included in the output. It is placed in another file. So the content of the tests directory is:
tests/123-test-name.t // test stdin
tests/123-test-name.ref // reference impl stdout
tests/123-test-name.ref.exit_status // reference impl exit status
tests/123-test-name.ref.stderr // reference impl stderr
tests/123-test-name.my // your impl stdout
tests/123-test-name.my.exit_status // your impl exit status
tests/123-test-name.my.stderr // your impl stderr
If you add a test case:
tests/456-my-test-case.t
You can regenerate reference impl output with:
$ make ref-test
and then retest against it:
$ make test
However you should not modify existing tests or their reference output.
Definition: Fundamental Types
There are exactly 20 fundamental types in C++.
They are:
signed char
short int
int
long int
long long int
unsigned char
unsigned short int
unsigned int
unsigned long int
unsigned long long int
wchar_t
char
char16_t
char32_t
bool
float
double
long double
void
nullptr_t
Each fundamental type is different (distinct) from every other.
The C++ standard does not specify exactly the size, alignment and representation of each of these types.
This information is given in a separate document called the ABI (defined below).
Side note: In some cases the ABI specifies that two fundamental types have the same representation. For example
char
vssigned char
. Alsowchar_t
vschar32_t
. However the toolchain must still treat them as different types (as will your bootstrap compiler).
At this point please read the C++ standard 3.9.1 Fundamental types.
Definition: Linux x86-64 ABI
In order for the program produced by your toolchain to make system calls to the kernel, the size, alignment and representation of types needs to be agreed upon.
The C/C++ standards do not specify exact representations of types, so a specification called the ABI (Application Binary Interface) is formed.
The Linux x86-64 kernel adheres to an ABI specification called the "System V AMD64 ABI". It is documented here:
System V ABI AMD64 Architecture
It is course defined that your toolchain shall implement this ABI.
In particular:
- 8-bit byte
- Multibyte types are stored byte-wise little-endian
- Signed integers are stored using 2s-compliment
- The size of the fundamental types is given in the Table 3.1 Scalar Types of the ABI spec page 12
For this assignment this dictates the hexdump of literals.
Definition: Simple Token Types
The course mapping for simple tokens is given below, and this has also been transcribed in the starter code:
simple alignas KW_ALIGNAS
simple alignof KW_ALIGNOF
simple and_eq OP_BANDASS
simple and OP_LAND
simple asm KW_ASM
simple auto KW_AUTO
simple bitand OP_AMP
simple bitor OP_BOR
simple bool KW_BOOL
simple break KW_BREAK
simple case KW_CASE
simple catch KW_CATCH
simple char16_t KW_CHAR16_T
simple char32_t KW_CHAR32_T
simple char KW_CHAR
simple class KW_CLASS
simple compl OP_COMPL
simple const_cast KW_CONST_CAST
simple constexpr KW_CONSTEXPR
simple const KW_CONST
simple continue KW_CONTINUE
simple decltype KW_DECLTYPE
simple default KW_DEFAULT
simple delete KW_DELETE
simple do KW_DO
simple double KW_DOUBLE
simple dynamic_cast KW_DYNAMIC_CAST
simple else KW_ELSE
simple enum KW_ENUM
simple explicit KW_EXPLICIT
simple export KW_EXPORT
simple extern KW_EXTERN
simple false KW_FALSE
simple float KW_FLOAT
simple for KW_FOR
simple friend KW_FRIEND
simple goto KW_GOTO
simple if KW_IF
simple inline KW_INLINE
simple int KW_INT
simple long KW_LONG
simple mutable KW_MUTABLE
simple namespace KW_NAMESPACE
simple new KW_NEW
simple noexcept KW_NOEXCEPT
simple not_eq OP_NE
simple not OP_LNOT
simple nullptr KW_NULLPTR
simple & OP_AMP
simple -> OP_ARROW
simple ->* OP_ARROWSTAR
simple = OP_ASS
simple &= OP_BANDASS
simple | OP_BOR
simple |= OP_BORASS
simple : OP_COLON
simple :: OP_COLON2
simple , OP_COMMA
simple ~ OP_COMPL
simple -- OP_DEC
simple / OP_DIV
simple /= OP_DIVASS
simple . OP_DOT
simple ... OP_DOTS
simple .* OP_DOTSTAR
simple == OP_EQ
simple operator KW_OPERATOR
simple >= OP_GE
simple > OP_GT
simple ++ OP_INC
simple && OP_LAND
simple <% OP_LBRACE
simple { OP_LBRACE
simple <= OP_LE
simple ! OP_LNOT
simple || OP_LOR
simple ( OP_LPAREN
simple << OP_LSHIFT
simple <<= OP_LSHIFTASS
simple <: OP_LSQUARE
simple [ OP_LSQUARE
simple < OP_LT
simple - OP_MINUS
simple -= OP_MINUSASS
simple % OP_MOD
simple %= OP_MODASS
simple != OP_NE
simple + OP_PLUS
simple += OP_PLUSASS
simple ? OP_QMARK
simple } OP_RBRACE
simple %> OP_RBRACE
simple ) OP_RPAREN
simple >> OP_RSHIFT
simple >>= OP_RSHIFTASS
simple :> OP_RSQUARE
simple ] OP_RSQUARE
simple ; OP_SEMICOLON
simple * OP_STAR
simple *= OP_STARASS
simple ^ OP_XOR
simple ^= OP_XORASS
simple or_eq OP_BORASS
simple or OP_LOR
simple private KW_PRIVATE
simple protected KW_PROTECTED
simple public KW_PUBLIC
simple register KW_REGISTER
simple reinterpret_cast KW_REINTERPET_CAST
simple return KW_RETURN
simple short KW_SHORT
simple signed KW_SIGNED
simple sizeof KW_SIZEOF
simple static_assert KW_STATIC_ASSERT
simple static_cast KW_STATIC_CAST
simple static KW_STATIC
simple struct KW_STRUCT
simple switch KW_SWITCH
simple template KW_TEMPLATE
simple this KW_THIS
simple thread_local KW_THREAD_LOCAL
simple throw KW_THROW
simple true KW_TRUE
simple try KW_TRY
simple typedef KW_TYPEDEF
simple typeid KW_TYPEID
simple typename KW_TYPENAME
simple union KW_UNION
simple unsigned KW_UNSIGNED
simple using KW_USING
simple virtual KW_VIRTUAL
simple void KW_VOID
simple volatile KW_VOLATILE
simple wchar_t KW_WCHAR_T
simple while KW_WHILE
simple xor_eq OP_XORASS
simple xor OP_XOR
Design Tips (Optional)
These design tips are optional, and only suggest one way to complete this assignment.
You can start by creating an implementation of IPPTokenStream
from PA1, and connect up your PA1 code to it.
emit_whitespace_sequence
and emit_new_line
are no-ops.
Apart from string-literals
and user-defined-string-literals
each preprocessing-token
(IPPTokenStream
) outputs exactly one token (maybe invalid
).
string-literals
and user-defined-string-literals
need to be collected up into a sequence, terminated when you see some other token type. Once terminated they are processed as one token. You can implement string-literals
first as one to one (outputing one token like the others), and then refactor in this string concatenation feature after.
To output a token you can use the DebugPostTokenOutputStream
class from the starter code.
You will need to write UTF-8 and UTF-16 decoders and encoders for string and character literals. Recall that UTF-32 encoding/decoding is trivial (UTF-32 code units are one-to-one with code points).
