CSC 468 Project.

MiniXBase Tokenizer Specification.
----------------------------------


1. Outline

The MiniXBase tokenizer must implement the following functionality:

- input a string;
- traverse the string and break it into individual tokens;
- construct a list of tokens corresponding to the string;
- output the list of tokens;
- provide functionality to manipulate the list of tokens.

As input, the tokenizer takes strings that specify MiniXBase commands.
These commands are described in the Project Description document.

2. Implementation issues

 The tokenizer is to be implemented in Java. While its final architecture
 may be more extensive, the following classes must be implemented.

 * public class MXBtokenizer

   The main tokenizer class. It contains the actual tokenizer/lexical
   analyzer code.

 * public class Token

   Instances of class Token represent individual tokens constructed by
   the methods of class MXBtokenizer.

 * public class TokenList

   Represents a list of Token elements. The output of the Tokenize()
   method is an instance of this class.


3. Class details

   Here we list the methods that MUST be implemented for each of the three
   classes. These methods form the interface between the MiniXBase code and
   the tokenizer and its output. The final implementation may contain other
   functionality.


   public class Token

      private <...>  Type;
      private String Value;

      public <...> GetType();
      public String GetValue();

      public int SetType(<...> T);
      public int SetValue(String V);


  Comments:
    - <...> means that the type of the attribute Type of class Token is to be
      determined by the developer. It can be an integer type or an enumeration
      of the token types (the list and description of the types are given in
      Section 4).
    - the SetType() and SetValue() methods return integer "error codes".
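
The Token interface above could be sketched as follows. This is only one
possible reading: the enumeration TokenType and the error-code constants OK
and ERR_NULL are assumptions made for illustration (the specification leaves
both the Type representation and the error-code values to the developer), and
"public" is dropped from the class so that the sketch compiles from a single
file.

```java
// Hypothetical enumeration standing in for the <...> placeholder;
// the constants mirror the six token types of Section 4.
enum TokenType { SEPARATOR, KEYWORD, AXIS, OPERATOR, FUNCTION, OTHER }

class Token {
    // Assumed error-code values; the specification does not define them.
    static final int OK = 0;
    static final int ERR_NULL = -1;

    private TokenType Type;
    private String Value;

    public TokenType GetType() { return Type; }
    public String GetValue() { return Value; }

    // Stores the type and reports success, or rejects null and leaves the token unchanged.
    public int SetType(TokenType T) {
        if (T == null) return ERR_NULL;
        Type = T;
        return OK;
    }

    // Stores the value and reports success, or rejects null and leaves the token unchanged.
    public int SetValue(String V) {
        if (V == null) return ERR_NULL;
        Value = V;
        return OK;
    }
}
```

An enumeration is used here instead of an integer so that the compiler rejects
unknown type values; an integer Type would be equally valid under the
specification.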


   public class TokenList

      public Token  Pop();
      public Token  Head();

      public int InsertToken(Token T);
      public boolean IsEmpty();


  Comments:
    - Pop() removes the head token from the TokenList object and returns it;
    - Head() returns the head token without removing it from the list;
    - overall, TokenList behaves like a stack.
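
A minimal sketch of such a list, built on java.util.ArrayDeque, is shown
below. Several details are assumptions, since the specification does not
state them: InsertToken() places the new token at the head (matching the
stack remark), Pop() and Head() return null on an empty list, and the error
codes are 0 and -1. The Token class here is a reduced stand-in so that the
sketch compiles on its own.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Reduced stand-in for the Token class, present only to keep the sketch self-contained.
class Token {
    private final String Value;
    Token(String V) { Value = V; }
    public String GetValue() { return Value; }
}

class TokenList {
    // Assumed error-code values; the specification does not define them.
    static final int OK = 0;
    static final int ERR_NULL = -1;

    private final Deque<Token> items = new ArrayDeque<>();

    // Removes and returns the head token, or returns null if the list is empty (assumption).
    public Token Pop() { return items.pollFirst(); }

    // Returns the head token without removing it, or null if the list is empty (assumption).
    public Token Head() { return items.peekFirst(); }

    // Assumption: the inserted token becomes the new head, as on a stack.
    public int InsertToken(Token T) {
        if (T == null) return ERR_NULL;
        items.addFirst(T);
        return OK;
    }

    public boolean IsEmpty() { return items.isEmpty(); }
}
```

With head insertion, a tokenizer that wants Pop() to yield tokens in command
order would have to insert them from last to first. Inserting at the tail
instead is an equally plausible design.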


    public class MXBtokenizer

       public TokenList  Tokenize(String Command);

  Comments:
    - Tokenize() takes as input a string that contains a MiniXBase command.
      It must break the string into individual tokens and return them as a
      TokenList. For each token, its type (see below) and its value, i.e.,
      the part of the string that forms the token, must be determined and
      stored in a Token instance.


4. Token types

     The following token types must be supported.


  (1) Token Type:  Separator

      Description: Tokens of type Separator separate the parts of an XPLite expression.

      Possible Values: "::"
                       "/"
                       "["
                       "]"
                       ")"
                       

  (2) Token Type: Keyword

       Description: MiniXBase command keywords.

       Possible Values: "CREATE"
                        "Create"
                        "create"
                        "INSERT"
                        "Insert"
                        "insert"
                        "DROP"
                        "Drop"
                        "drop"
                        "LIST"
                        "List"
                        "list"
                        "STATS"
                        "Stats"
                        "stats"
                        "FILE"
                        "File"
                        "file"
                        "XML"
                        "Xml"
                        "xml"
                        "CLEAR"
                        "Clear"
                        "clear"
                        "DELETE"
                        "Delete"
                        "delete"
      Note: for simplicity, convert all keyword values stored in Token instances
            to uppercase. That is, if the actual string contains "Create",
            Token.Value for it must be "CREATE", and so on.
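
The keyword rule above can be sketched as a small helper. The class name
KeywordSketch and the method normalize() are hypothetical. The sketch accepts
exactly the three spellings listed for each keyword (all uppercase, all
lowercase, first letter capitalized) and treats any other mix of cases, such
as "cReAtE", as a non-keyword; whether such mixed spellings should be
accepted is not stated in this document, so that choice is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class KeywordSketch {
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "CREATE", "INSERT", "DROP", "LIST", "STATS", "FILE", "XML", "CLEAR", "DELETE"));

    // Returns the uppercase keyword if word is one of the listed spellings, otherwise null.
    static String normalize(String word) {
        if (word == null || word.isEmpty()) return null;
        String upper = word.toUpperCase(Locale.ROOT);
        if (!KEYWORDS.contains(upper)) return null;
        String lower = word.toLowerCase(Locale.ROOT);
        String capitalized = upper.substring(0, 1) + lower.substring(1);
        boolean listed = word.equals(upper) || word.equals(lower) || word.equals(capitalized);
        return listed ? upper : null;
    }
}
```

For example, normalize("Create") returns "CREATE", while
normalize("MyRepository") returns null.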

   (3) Token Type: Axis

       Description: keywords for XPLite axes.

       Possible Values:  "self"
                         "child"
                         "parent"
                         "following-sibling"
                         "preceding-sibling"
                         "following"
                         "preceding"
                         "ancestor"
                         "descendant"
                         "attribute"

       Comments: Axis values are case-sensitive; they must be all lowercase.

   (4) Token Type: Operator

       Description: comparison operators

       Possible Values: "="
                        "<>"
                        "<"
                        ">"
                        ">="
                        "<="

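Because "<" and ">" are prefixes of "<>", "<=" and ">=", a scanner has to try
the two-character operators first. The sketch below illustrates this
longest-match rule; the class name OperatorSketch and the method matchAt()
are hypothetical.

```java
class OperatorSketch {
    // Two-character operators come first so that "<=" is not read as "<" followed by "=".
    private static final String[] OPERATORS = {"<>", ">=", "<=", "=", "<", ">"};

    // Returns the operator that starts at position pos of input, or null if none does.
    static String matchAt(String input, int pos) {
        for (String op : OPERATORS) {
            if (input.startsWith(op, pos)) return op;
        }
        return null;
    }
}
```

For the input "x<=2", matchAt(input, 1) returns "<=" rather than "<".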

    (5) Token Type: Function

        Description: an id that ends with "(". Used in the predicate and nodetest parts
        of XPLite expressions.

        Possible Values: all nodetests and standard functions:
                      "node("
                      "attribute("
                      "text("
                      "position("
                      "last("
                      "count("
                      "not("
                      "true("
                      "false("
        

    (6) Token Type: Other

        Description: identifiers, values, and more.

        Possible Values: any sequence of ASCII characters delimited by tokens of types
        (1)-(5) and/or by whitespace. Used for repository names, element and attribute
        names, and function values. E.g., "MyRepository", "x", "root", ""Hello!"", "2".


        Comment: tokens of type Other cannot match keywords or axis names. That is,
        repository names such as "Create" or "parent" are not allowed. Note that "Create"
        may be a value of a function: the token value would then be ""Create"" (that is,
        one pair of double quotes is part of the token value itself).
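
To show how the six token types fit together, here is a compact scanning
sketch. It is not the required MXBtokenizer: the class TokenizerSketch and
its helpers are hypothetical, and it returns plain "Type value" strings in
command order instead of a TokenList to stay short. Further assumptions: any
id directly followed by "(" is reported as a Function without checking it
against the list in (5), a quoted value keeps its double quotes, an
unterminated quote runs to the end of the string, and only the three listed
spellings of a keyword are recognized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class TokenizerSketch {
    // Two-character entries come first so that "<=" is not split into "<" and "=".
    static final String[] SEPARATORS = {"::", "/", "[", "]", ")"};
    static final String[] OPERATORS = {"<>", ">=", "<=", "=", "<", ">"};
    static final List<String> KEYWORDS = Arrays.asList(
        "CREATE", "INSERT", "DROP", "LIST", "STATS", "FILE", "XML", "CLEAR", "DELETE");
    static final List<String> AXES = Arrays.asList(
        "self", "child", "parent", "following-sibling", "preceding-sibling",
        "following", "preceding", "ancestor", "descendant", "attribute");

    // Returns the first entry of table that occurs at position i of s, or null.
    static String fixedAt(String s, int i, String[] table) {
        for (String t : table) {
            if (s.startsWith(t, i)) return t;
        }
        return null;
    }

    // True if a word cannot continue at position j (whitespace, quote, "(", separator, operator).
    static boolean endsWord(String s, int j) {
        char c = s.charAt(j);
        return Character.isWhitespace(c) || c == '"' || c == '('
            || fixedAt(s, j, SEPARATORS) != null || fixedAt(s, j, OPERATORS) != null;
    }

    // Classifies a non-empty bare word as Keyword (uppercased), Axis, or Other.
    static String classifyWord(String word) {
        String upper = word.toUpperCase(Locale.ROOT);
        String lower = word.toLowerCase(Locale.ROOT);
        String capitalized = upper.substring(0, 1) + lower.substring(1);
        boolean listed = word.equals(upper) || word.equals(lower) || word.equals(capitalized);
        if (KEYWORDS.contains(upper) && listed) return "Keyword " + upper;
        if (AXES.contains(word)) return "Axis " + word;
        return "Other " + word;
    }

    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            if (Character.isWhitespace(s.charAt(i))) { i++; continue; }
            String sep = fixedAt(s, i, SEPARATORS);
            String op = fixedAt(s, i, OPERATORS);
            if (sep != null) { out.add("Separator " + sep); i += sep.length(); continue; }
            if (op != null) { out.add("Operator " + op); i += op.length(); continue; }
            if (s.charAt(i) == '"') {
                // Quoted value: the quotes stay part of the token value.
                int close = s.indexOf('"', i + 1);
                int end = (close < 0) ? s.length() : close + 1;
                out.add("Other " + s.substring(i, end));
                i = end;
                continue;
            }
            int j = i;
            while (j < s.length() && !endsWord(s, j)) j++;
            if (j < s.length() && s.charAt(j) == '(') {
                j++; // the "(" belongs to the Function token
                out.add((j - i > 1 ? "Function " : "Other ") + s.substring(i, j));
            } else {
                out.add(classifyWord(s.substring(i, j)));
            }
            i = j;
        }
        return out;
    }
}
```

The input string below is made up purely to exercise the token types; actual
command syntax is defined in the Project Description document. Calling
tokenize("Insert child::x[position()=2]") yields Keyword INSERT, Axis child,
Separator ::, Other x, Separator [, Function position(, Separator ),
Operator =, Other 2, Separator ].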