Derived from: none
Declared in: Lexer.h
The lexer object is meant to perform the lexical analysis, that is, group the input characters to form lexemes, and return the corresponding token. Hence, this object encapsulates all disk-related operations.
The input stream is buffered using a pair of buffers, whose length is configurable (the default is 1024, and should be a multiple of the file block size, for efficiency reasons).
Another buffer is declared to hold the last lexeme. The allocation size is configurable and is set to 128 by default. This buffer should be large enough to contain the longest lexeme to be dealt with. However, the library will accomodate lexemes of any size by using dynamic allocations.
Note that the length of the main buffers MUST have be strictly higher than the
length of the longest lexeme. If this is not the case, it's gonna screw up!!!
Even in modern programming languages, the longest lexemes (which are the variable
names) don't exceed 128 (or at most 256) characters.
To summarize a little bit: the main buffer should be large enough (to contain the longest
lexeme), but the lexeme buffer can be virtually of any size. Still, I recommend a length of 128 or 256 for it.
The lexeme pointer returned by GetLexeme() (either the lexeme buffer, or a dynamic one), can be used further:
LX_CHAR_SPACE LX_CHAR_ALPHA LX_CHAR_DIGIT LX_CHAR_MINUS LX_CHAR_ILLEGAL LX_CHAR_EOF LX_CHAR_LAST = LX_CHAR_EOFNote the illegal character code. Among the lexical rules the lexer implements, one of them states whether illegal characters are ignored or generate an error. The latter case is the most frequent.
Having a character C, its type is given by Charset[C]. In the method concerned with lexeme determination, one can easily skip the spaces in the input stream:
LexmStart = LexmEnd; while (Charset[*LexmEnd] == LX_CHAR_SPACE) Fwd;Note that the user can define any character as a space by assigning the LX_CHAR_SPACE code to it, see SetChar(). The fact is that the code assigned to any character is totally free, in case of particular needs. A default charset can be used by default, by calling SetDefaultCharset().
enum { MY_CODE_FOR_RIGHT_BRACKET = LX_CHAR_LAST+1, MY_CODE_FOR_UNDERSCORE, ... MY_CODE_FOR_WHATEVER, };Note that it is important to use LX_CHAR_LAST+1, to avoid any code "interference".
Even though this class provides the user with a whole bunch of facilities, it is unlikely that this class will ever be instanciated. The LAP library provides 2 derived classes, namely:
In compiling theory, the idea to write a lexer by constructing a DFA (Deterministic Finite Automaton).
This is the way Lex or its AT&T succesor Flex do the work of recognizing lexemes. Even though
a DFA is a very powerful way to perform lexical analysis, it is equally possible to implement the
DFA by representing each state by a procedure (or function, or method). This is how the user will
have to do if a pretty complex lexical analysis is to be performed.
This topic is covered in depth in some good books dealing with compiling theory ("Dragoon" for
instance).
Lexer(int32 BufferLength, int32 LexemeLength)
Creates a Lexer object and allocates the paired-buffers (2*BufferLength) bytes, and the lexeme buffer (LexemeLength bytes). As the allocations may have failed, the user should always check the status of the object after creation (see GetStatus()).
~Lexer(void)
Frees the buffers and the file descriptor, if any. Also flushes the symbols table
status_t GetStatus(void)
returns the Status of the object. It should be called right after the object instanciation to check for memory allocation failures:
Lexer* MyLexer = new Lexer(); // Default buffers length if (MyLexer->GetStatus() != B_NO_ERROR) // Failure else // Ok
bool Eof(void)
returns true if the end of the input file has been reached, false otherwise.
virtual void Terminate(void)
Used when the parsing of the file is over. It mainly closes the input file. The same lexer object can be reused for any other file, if needed.
void GetInputSize(off_t& )
Returns the size of the input in bytes (64 bits) integer.
char * GetLexeme(void)
Returns the lexeme starting at LexmStart and ending at LexmEnd. Note however that the character pointed to by LexmEnd is not part of the lexeme. The inner mechanism is a little bit more complicated in that both lexemes that fit in the lexeme buffer and those that don't, are dealt with. The length of the lexeme is determined by calling GetLexemeLen()
void DeleteLexeme(char *)
This function is used in conjunction with the output of GetLexeme(),
that is, a lexeme. This method is called to delete the buffer given in input. If the
input is a pointer to the lexeme buffer of the object Lexer, then, it is not deleted,
because it is needed for further lexemes.
Otherwise, it si actually deleted, because it corresponds to a dynamic allocation,
which is no more needed.
char * DetachLexeme(char *)
This function is used in conjunction with the output of GetLexeme(), that is, a lexeme. This method is called when a lexeme which has been returned is needed a further more. In that case, one have to prevent any possibility that the buffer contents are overwritten by subsequent calls of GetLexeme(). This is exactly what this method does: it detaches the buffer given in input from the lexer.
If the input is a pointer to the lexeme buffer of the object Lexer, then, a copy is returned because it is needed for further lexemes. Otherwise, nothing is done.
virtual void Fwd()
Advances the end-of-lexeme marker (or pointer) LexmEnd one position forward. If necessary, a disk load is performed to accomodate.
Note that no check is performed on whether the length of the lexeme exceeds the size of the main buffer, in which case, the lexeme returned will be incorrect.
Also note that a call to this function may fail if the end of the file has already been reached. In that case, false is returned, and the calling function also has to fail -at least, it's obvious. In that case, the macro FWD can be used. It has been defined as
if (!Fwd()) return false;We can rewrite the part of routine to skip spaces described above in a safer manner:
LexmStart = LexmEnd; while (Charset[*LexmEnd] == LX_CHAR_SPACE) FWD; // returns if Fwd() failsYou are strongly encouraged to use these macros as it provides a facility to deal with errors, in a much secure way.
virtual void Bwd()
Moves the end-of-lexeme marker (or pointer) LexmEnd one position backwards.
Note that LexmEnd will never get beyond LexmStart. If asked to do so, false is returned.
Also note that in much the same way as for Fwd(), a macro has been defined as:
if (!Bwd()) return false;You are strongly encouraged to use these macros as it provides a facility to deal with errors, in a much secure way.
void SetChar(char Char, uint8 Code) void SetChar(char CharStart, char CharEnd, uint8 Code)
The first method assigns the code of the character Char to Code. The second method sets the code of the characters whose ASCII code is between CharStart and CharEnd to Code
void SetDefaultCharset()
When called, apply the following mapping to the characters' code:
SetChar(0, 255, LX_CHAR_ILLEGAL); SetChar('a', 'z', LX_CHAR_ALPHA); SetChar('A', 'Z', LX_CHAR_ALPHA); SetChar('0', '9', LX_CHAR_DIGIT); SetChar(' ', LX_CHAR_SPACE); SetChar('\t', LX_CHAR_SPACE); SetChar('\n', LX_CHAR_SPACE);If needed, one can call SetDefaultCharset() and modify some other codes.
Note that these codes must fit in a 8-bit integer.
Going back to the DFA concept and its equivalent in "regular" programming (procedures),
it is clear that a single charset can't handle every single situation, given that for each
state of the DFA, a different charset is needed.
However, for simple input files, it can be enough.
bool AddSymbol(Symbol* Sym)
This function adds the symbol given in parameter to the symbol table. If the symbol already exists, false is returned, and the symbol is not added. Otherwise, the symbol is added and true is returned.
Note that there is neither scope nor overloaded names support.
Symbol* SearchSymbol(char* SymbolName)
This function looks up for the symbol SymbolName in the symbol table. If found, the symbol is returned. Otherwise, NULL is returned.
bool PrintSymbol(Symbol* Sym)
This function prints out a few information about the symbol given
in parameter.
This function always returns false, but if mustn't be considered as a failure.
For debugging purposes, a priori.
void PrintSymbols()
This function prints out a few information about the symbols stored in the
symbol table
For debugging purposes, a priori.