The LAP library: Lexer

Derived from: none

Declared in: Lexer.h


Overview

The lexer object is meant to perform the lexical analysis, that is, group the input characters to form lexemes, and return the corresponding token. Hence, this object encapsulates all disk-related operations.

The input stream is buffered using a pair of buffers, whose length is configurable (the default is 1024, and should be a multiple of the file block size, for efficiency reasons).

Another buffer is declared to hold the last lexeme. The allocation size is configurable and is set to 128 by default. This buffer should be large enough to contain the longest lexeme to be dealt with. However, the library will accomodate lexemes of any size by using dynamic allocations.

Note that the length of the main buffers MUST have be strictly higher than the length of the longest lexeme. If this is not the case, it's gonna screw up!!! Even in modern programming languages, the longest lexemes (which are the variable names) don't exceed 128 (or at most 256) characters.
To summarize a little bit: the main buffer should be large enough (to contain the longest lexeme), but the lexeme buffer can be virtually of any size. Still, I recommend a length of 128 or 256 for it.

Lexeme

A lexeme is determined by two pointers called LexmStart and LexmEnd. When a token is asked by calling NextToken(), both LexmStart and LexmEnd point to the same location. The lexer will loop over the input, move LexmEnd forward by calling Fwd() and backwards by calling Bwd(). Once the end of a lexeme is detected, the lexeme can be isolated in the lexeme buffer by calling GetLexeme(). If the lexeme fits into the lexeme buffer, then the pointer returned is that of the lexeme buffer. Otherwise, a pointer on a dynamic buffer is returned.

The lexeme pointer returned by GetLexeme() (either the lexeme buffer, or a dynamic one), can be used further:

Charset

The Lexer class provides another facility meant to easily determine how a character should be considered. The Lexer class owns an attribute called Charset which stores a 8-bit code for each of the 256 ASCII characters. A few codes are pre-defined and are:
  LX_CHAR_SPACE
  LX_CHAR_ALPHA
  LX_CHAR_DIGIT
  LX_CHAR_MINUS
  LX_CHAR_ILLEGAL
  LX_CHAR_EOF
  LX_CHAR_LAST = LX_CHAR_EOF
Note the illegal character code. Among the lexical rules the lexer implements, one of them states whether illegal characters are ignored or generate an error. The latter case is the most frequent.
In C++, examples of illegal characters are '$', '£' or 'é' (expected in comments). When reading a binary file, there should be no illegal characters.

Having a character C, its type is given by Charset[C]. In the method concerned with lexeme determination, one can easily skip the spaces in the input stream:

  LexmStart = LexmEnd;
  while (Charset[*LexmEnd] == LX_CHAR_SPACE)
    Fwd;
Note that the user can define any character as a space by assigning the LX_CHAR_SPACE code to it, see SetChar(). The fact is that the code assigned to any character is totally free, in case of particular needs. A default charset can be used by default, by calling SetDefaultCharset().
User-defined character codes for use in Lexer::Charset are defined like this (to prevent code duplication) in MyLexer.h:
  enum
  {
    MY_CODE_FOR_RIGHT_BRACKET = LX_CHAR_LAST+1,
    MY_CODE_FOR_UNDERSCORE,
    ...
    MY_CODE_FOR_WHATEVER,
  };
Note that it is important to use LX_CHAR_LAST+1, to avoid any code "interference".
Also note that these codes must be less than 256 (they are stored in a 8-bit integer).

Symbol table

Another facility, a symbol table, is provided in this class. One can: Be aware that this symbol table is very simple, and is not meant to fit every single needs.
In particular, there is no notion of scope, or overloaded symbols (you have to mangle the name by yourself).

Even though this class provides the user with a whole bunch of facilities, it is unlikely that this class will ever be instanciated. The LAP library provides 2 derived classes, namely:

In compiling theory, the idea to write a lexer by constructing a DFA (Deterministic Finite Automaton). This is the way Lex or its AT&T succesor Flex do the work of recognizing lexemes. Even though a DFA is a very powerful way to perform lexical analysis, it is equally possible to implement the DFA by representing each state by a procedure (or function, or method). This is how the user will have to do if a pretty complex lexical analysis is to be performed.
This topic is covered in depth in some good books dealing with compiling theory ("Dragoon" for instance).


Constructor and D estructor


Lexer()

 
      Lexer(int32 BufferLength, int32 LexemeLength) 

Creates a Lexer object and allocates the paired-buffers (2*BufferLength) bytes, and the lexeme buffer (LexemeLength bytes). As the allocations may have failed, the user should always check the status of the object after creation (see GetStatus()).


~Lexer()

 
      ~Lexer(void) 

Frees the buffers and the file descriptor, if any. Also flushes the symbols table


Member Functions


GetStatus()

 
      status_t GetStatus(void)

returns the Status of the object. It should be called right after the object instanciation to check for memory allocation failures:

  Lexer*  MyLexer = new Lexer();    // Default buffers length
  if (MyLexer->GetStatus() != B_NO_ERROR)
    // Failure
  else
    // Ok


Eof()

 
      bool Eof(void)

returns true if the end of the input file has been reached, false otherwise.


Terminate()

 
      virtual void Terminate(void)

Used when the parsing of the file is over. It mainly closes the input file. The same lexer object can be reused for any other file, if needed.


GetInputSize()

 
      void GetInputSize(off_t& )

Returns the size of the input in bytes (64 bits) integer.


GetLexeme()

 
      char * GetLexeme(void)

Returns the lexeme starting at LexmStart and ending at LexmEnd. Note however that the character pointed to by LexmEnd is not part of the lexeme. The inner mechanism is a little bit more complicated in that both lexemes that fit in the lexeme buffer and those that don't, are dealt with. The length of the lexeme is determined by calling GetLexemeLen()

From the user point of view, there is no difference. 2 other methods are useful when GetLexeme() is used:
DeleteLexeme() and DetachLexeme()


DeleteLexeme()

 
      void DeleteLexeme(char *)

This function is used in conjunction with the output of GetLexeme(), that is, a lexeme. This method is called to delete the buffer given in input. If the input is a pointer to the lexeme buffer of the object Lexer, then, it is not deleted, because it is needed for further lexemes.
Otherwise, it si actually deleted, because it corresponds to a dynamic allocation, which is no more needed.


DetachLexeme()

 
      char * DetachLexeme(char *)

This function is used in conjunction with the output of GetLexeme(), that is, a lexeme. This method is called when a lexeme which has been returned is needed a further more. In that case, one have to prevent any possibility that the buffer contents are overwritten by subsequent calls of GetLexeme(). This is exactly what this method does: it detaches the buffer given in input from the lexer.

If the input is a pointer to the lexeme buffer of the object Lexer, then, a copy is returned because it is needed for further lexemes. Otherwise, nothing is done.


Fwd()

 
      virtual void Fwd()

Advances the end-of-lexeme marker (or pointer) LexmEnd one position forward. If necessary, a disk load is performed to accomodate.

Note that no check is performed on whether the length of the lexeme exceeds the size of the main buffer, in which case, the lexeme returned will be incorrect.

Also note that a call to this function may fail if the end of the file has already been reached. In that case, false is returned, and the calling function also has to fail -at least, it's obvious. In that case, the macro FWD can be used. It has been defined as

  if (!Fwd())
    return false;
We can rewrite the part of routine to skip spaces described above in a safer manner:
  LexmStart = LexmEnd;
  while (Charset[*LexmEnd] == LX_CHAR_SPACE)
    FWD;		// returns if Fwd() fails
You are strongly encouraged to use these macros as it provides a facility to deal with errors, in a much secure way.


Bwd()

 
      virtual void Bwd()

Moves the end-of-lexeme marker (or pointer) LexmEnd one position backwards.

Note that LexmEnd will never get beyond LexmStart. If asked to do so, false is returned.

Also note that in much the same way as for Fwd(), a macro has been defined as:

  if (!Bwd())
    return false;
You are strongly encouraged to use these macros as it provides a facility to deal with errors, in a much secure way.


SetChar()

      void SetChar(char Char, uint8 Code)
      void SetChar(char CharStart, char CharEnd, uint8 Code)

The first method assigns the code of the character Char to Code. The second method sets the code of the characters whose ASCII code is between CharStart and CharEnd to Code


SetDefaultCharset()

      void SetDefaultCharset()

When called, apply the following mapping to the characters' code:

  SetChar(0, 255, LX_CHAR_ILLEGAL);

  SetChar('a', 'z', LX_CHAR_ALPHA);
  SetChar('A', 'Z', LX_CHAR_ALPHA);

  SetChar('0', '9', LX_CHAR_DIGIT);

  SetChar(' ', LX_CHAR_SPACE);
  SetChar('\t', LX_CHAR_SPACE);
  SetChar('\n', LX_CHAR_SPACE);
If needed, one can call SetDefaultCharset() and modify some other codes.

Note that these codes must fit in a 8-bit integer.

Going back to the DFA concept and its equivalent in "regular" programming (procedures), it is clear that a single charset can't handle every single situation, given that for each state of the DFA, a different charset is needed.
However, for simple input files, it can be enough.


AddSymbol()

 
      bool AddSymbol(Symbol* Sym)

This function adds the symbol given in parameter to the symbol table. If the symbol already exists, false is returned, and the symbol is not added. Otherwise, the symbol is added and true is returned.

Note that there is neither scope nor overloaded names support.


SearchSymbol()

 
      Symbol* SearchSymbol(char* SymbolName)

This function looks up for the symbol SymbolName in the symbol table. If found, the symbol is returned. Otherwise, NULL is returned.


PrintSymbol()

 
      bool PrintSymbol(Symbol* Sym)

This function prints out a few information about the symbol given in parameter.
This function always returns false, but if mustn't be considered as a failure.
For debugging purposes, a priori.


PrintSymbols()

 
      void PrintSymbols()

This function prints out a few information about the symbols stored in the symbol table
For debugging purposes, a priori.


The LAP library, V0.7, November 3rd 1997, First public release