Javacc Token Examples

Apr 22, 2015 - For example, when parsing a C program, the parser may put the scan. In other words the token INT is not active in the lexical state COMMENT. Compile hello.jj and generate java codes java -classpath ' java javacc-5.0 bin lib javacc.jar' javacc hello.jj Run the Hello.java.

I'm trying to parse my input partially, so that I can store certain chunks for a later parse.

In this case, the 'ANY' token will only be valid if previous tokens didn't match anything else, but assuming I have many more token definitions, the grammar above won't do.

I know that ~[] matches any character and not any token.

Further, let's say I would use token states instead (stuff they do with javadoc, pragmas etc.), I would still have a problem capturing the chunks, since I don't have any token to set my special token state. Also, setting the token state via the parser seems to be a bad practice according to JavaCC's FAQ, since the TokenManager might already have some tokens in its queue.

So I'm wondering if there's any ANY-equivilent regarding tokens. Or does someone at least have an idea how to approach my problem in a different way?

bcausebcause

1 Answer

Of course one way to do it is to make a big production that lists every kind of token except '{' and '}'.

But that's not at all elegant.

Instead, you can write a JAVACODE production that consumes tokens until the final close-brace is found. See https://javacc.java.net/doc/javaccgrm.html#JAVACODE for a similar example.

Theodore NorvellTheodore Norvell

7,6874 gold badges19 silver badges37 bronze badges

Not the answer you're looking for? Browse other questions tagged tokenjavaccany or ask your own question.

Now you're talking my language!

Introduction to JavaCC

Many Web-based projects contain ad-hoc query systems that allow end users to search for information. For this, some sort of language is needed for the end users to convey what it is they wish to search for. Sometimes, a user query language is defined quite simply. If your end user is content to have a language as simple as your most typical Google search, then Java's StringTokenizer is more than up to the task for parsing. However, if a customer wishes to have a more robust language, perhaps adding in parentheses and 'AND'/'OR' logic, then we quickly find ourselves in need of a bigger hammer. We need a means to first define the language that the user will be using and then use that definition to parse the user's entries and formulate the correct query to whatever back-end database we have.

That's where a tool like JavaCC comes in. JavaCC stands for 'Java® Compiler Compiler', a name that stands in homage to YACC, 'Yet Another Compiler Compiler', the C-based tool developed by AT&T for the purposes of building parsers for C and other high-level languages. YACC and its cohort lexical tokenizer, 'Lex', accepted as input a definition of a language in a common form known as Backus-Naur form (BNF, aka Backus Normal Form) and emitted a 'C' program that could parse input in that language and perform functions on it. JavaCC, like YACC, is designed to speed the process of developing parser logic for languages. Whereas YACC emitted C code, however, JavaCC, as you might expect, emits Java code.

JavaCC's history is storied. It began its life in Sun, under the name 'Jack'. Jack later became JavaCC which passed through several owners, notably Metamata and WebGain, before coming back to Sun, who then released it as Open Source code under the BSD license.

JavaCC's strength is in its simplicity and extensibility. To compile the Java code generated by JavaCC, no outside JAR files or directories are required. Simply compile with the base 1.2 Java compiler. The layout of the language also makes it easy to add production rules and behaviors. The Web site even describes how you can tailor exceptions to give your user targetted syntax hints.

The Problem Definition

Let us suppose that you have a customer at a video rental store that has a simple database of movies. The database consists of tables for MOVIES, ACTORS, and KEYWORDS. The MOVIES table lists the pertinent data for each movie in his shop, namely things like the title and director for each movie. The ACTORS table lists the names of actors in each movie. The KEYWORDS table lists words describing the movie, such as words like 'action', 'drama', 'adventure', and so on.

The customer would like to be able to issue somewhat sophisticated queries against this database. He would, for example, like to enter a query of the form

actor = 'Christopher Reeve' and keyword=action and keyword=adventure

and have it return the Superman movies featuring Christopher Reeve. He would also like to use parentheses to indicate order of evaluation to distinguish queries like

(actor = 'Christopher Reeve' and keyword=action) or keyword=romance

which might return movies that do not feature Christopher Reeve from

actor = 'Christopher Reeve' and (keyword=action or keyword=romance)

which should always return movies starring Christopher Reeve.

The Solution

For this exercise, you will define the solution in two stages. In the stage 1, you will define the language in JavaCC, ensuring that end-user queries are parsed correctly. In stage 2, you will add behavior to JavaCC code in order to produce DB2® SQL code that will ensure that the correct movies are returned in answer to the end-user queries.

Stage 1 - Defining the User's Query Language

The language will be defined inside of a file named UQLParser.jj. This file will be 'compiled' by the JavaCC tool into a set of Java class files of type .java. To define a language in a JJ file, you need to do five things:

Define the context of parsing
Define 'white space'
Define 'tokens'
Define the syntax of the language itself in terms of the tokens
Define the behavior that will happen at each stage of parsing

You can define your own UQLParser.jj file using the segments presented, or you can follow along using the code associated with this paper. Use the copy of UQLParser.jj in JavaCCPaper/stage1/src for steps 1 through 4. Step 5 is covered in JavaCCPaper/stage2/src. DDL for the sample database can be found in JavaCCPaper/moviedb.sql. The example works best if the database is created and the parser run using the same userid. Ant files (build.xml) are provided to speed to compilation process.

Step 1. Define the context of parsing

JavaCC .jj files are translated by the JavaCC executable into .java files. JavaCC takes the portion of the .jj file that lives between PARSER_BEGIN and ends with PARSER_END and copies it directly into the resulting java file. This is the place where you, the parser designer, can place all of your pre- and post-parsing activity related to your parser. This is also the place where you can link your java code to the parser actions you will define in steps 4 and 5.

In the example shown below, the parser is relatively simple. The constructor UQLParser takes as input a String, reading it in through Java's java.io.StringReader class, and then calls another, unseen constructor with the StringReader cast to Reader. The only other method defined here is the static main method which calls the constructor and then calls an as-yet undefined method called parse().

As you might have already guessed, JavaCC already provides a constructor from Java's Reader class. We have added the String-based constructor for ease of use and testing.

Listing 1. The parser's Java context

Step 2. Define white space

In this language, you wish to treat as delimiters but otherwise ignore spaces, tabs, carriage returns and newlines. These characters are known as white space. In JavaCC, we define these characters in the SKIP section, as indicated in Listing 2.

Listing 2. Defining white space in the SKIP section

Step 3. Define the tokens

Next, you define what tokens you will recognize in the language. A token is the smallest unit of parsed string that will have meaning to the parsing program. The process that scans an input string and determines what the tokens are is called the tokenizer. In the query,

actor = 'Christopher Reeve'

The tokens are

actor
=
'Christopher Reeve'

In your language, you will make actor and the equal sign (=) reserved tokens in the language, much as the words if and instanceof are reserved tokens with particular meaning within the Java language. By reserving words and other special tokens, the programmer promises that the parser will recognize these words literally, assigning a specific meaning to them. So long as you're reserving words, go ahead and reserve the not equal sign (<>), and, or, the left and right parentheses. Also reserve title, director, and keyword to represent specific fields for the user to search.

To define all of these, use the TOKEN directive of JavaCC. Each token definition is enclosed in angle-brackets, (< and >). The token's name is given on the left hand side of the colon ( : ) and a regular expression is given on the right hand side. A regular expression is a way of defining a portion of text to be matched. In its simplest form, regular expressions can match exact sequences of characters. Use the code below to define six tokens that match exact words and four others that match symbols. When the parser sees any of the words, it will match them symbolically as AND, OR, TITLE, ACTOR, DIRECTOR, or KEYWORD. When the symbols are matched, the parser will return LPAREN, RPAREN, EQUALS, or NOTEQUAL accordingly. Listing 3 shows the JavaCC reserved tokens definition.

Listing 3. Defining reserved tokens

For strings like 'Christopher Reeve', you cannot possibly set aside all the names of actors as reserved words in our language. Instead, you will recognize tokens of type STRING or QUOTED_STRING using a character pattern defined using regular expressions. Regular expressions are strings of characters that define a pattern to be matched. Defining the regular expression that will match all strings or quoted strings is a bit trickier than defining exact word matches.

You will define a STRING as a series of one or more characters, the valid characters being A through Z, upper and lower case, and the numerals 0 through 9. For the purposes of simplification, don't worry about movie stars or movie titles with accented characters or other anomalies. You can write this pattern as a regular expression in the following way.

<STRING : (['A'-'Z', 'a'-'z', '0'-'9'])+ >

The plus sign indicates that the pattern enclosed in the parentheses (any character from A to Z, a to z, or 0 to 9) should occur one or more times in sequence. In JavaCC, you may also use the asterisk (*) to indicate zero or more appearances of a pattern or a question mark (?) to indicate 0 or 1 repetition.

QUOTED_STRING is a little trickier. You will define a string to be a QUOTED_STRING if it begins with a quote, ends with a quote and has any other character in between. The regular expression for this is '' (~[''])+ '' which certainly is an eyeful. To simplify, it helps to understand that since the quote character itself has meaning to JavaCC, we need to escape the quote for it to be meaningful to our language instead of JavaCC. To escape the quote, we use a backward slash to precede it. The tilde character (~) means NOT to the JavaCC tokenizer. (~[''])+ is shorthand for one or more non-quote characters. Taken altogether, '' (~[''])+ '' means ?a quote followed by one or more non-quotes followed by a quote'.

You should add the tokenizer rules for STRING and QUOTED_STRING after the rules for the reserved words. It is important to preserve this ordering because the order that tokenizer rules appear in the file are the order in which token rules are applied. You want to be very sure that 'title' is taken as the reserved word rather than as a STRING. The complete STRING and QUOTED_STRING token definition appears in Listing 4.

Listing 4. Defining STRING and QUOTED_STRING

Step 4. Define the language in terms of the tokens

Now that you have the tokens defined, it is time to define the parsing rules in terms of the tokens. The user enters the query in the form of an expression. An expression is defined as a series of one or more query terms joined by the boolean operators and or or.

To express this, we need to write a parsing rule, also known as a production. Write the production in Listing 5 into the JavaCC UQLParser.JJ file.

Listing 5. The expression() production

When Javacc is run against a .jj file, productions are translated into methods. The first set of braces will enclose any declarations that will be needed by the production method. For now, this is left blank. The second set of braces encloses the production rule written in a way that JavaCC understands. Note the use of the AND and OR tokens that were defined earlier. Also note that queryTerm() is written as a method call. queryTerm() in fact, is another production method.

Now, let us define the queryTerm() production. queryTerm() is either a single criterion (such as) or an expression surrounded by parentheses. queryTerm() is defined in JavaCC recursively, via expression(), which allows you to sum up the language succinctly using the code shown in Listing 6.

Listing 6. The queryTerm() production method in JavaCC (UQLParser.jj)

That's all the rules we need. The entire language parser has been summed up in two productions.

Taking JavaCC for a test drive

At this point, you should have a valid JavaCC file. You can compile and 'run' this program to see if your parser operates correctly before moving on to step 5.

The ZIP file provided with this paper should include the sample stage 1 JavaCC file, UQLParser.jj. Unzip the entire ZIP file into an empty directory. To compile stage1/UQLParser.jj, first you'll need to download JavaCC and install according to the directions at the JavaCC Web page. For simplicity's sake, be sure to put the executable path for Javacc.bat into your PATH environment variable. Compilation is easy; cd into the directory where you've unloaded UQLParser.jj and enter the following.

javacc ?debug_parser ?output_directory=.comdemostage1 UQLParser.jj

You may also use the enclosed Ant file, build.xml if you prefer. You will have to make some adjustments to the top of the properties file to point to your JavaCC installation. JavaCC should produce the messages shown in Listing 7 the first time you run it.

Listing 7. Output from compilation of UQLParser.jj

In addition to the four files mentioned, JavaCC will also produce UQLParser.java, UQLParserConstants.java, and UQLParserTokenManager.java. All of these are written to the comdemostage1 directory. From here, you should be able to compile these files without any additions to the default runtime classpath. The Ant file default target will perform the Java compile automatically if the JavaCC step runs successfully. If not, you can compile your files from the top level directory (JavaCCPaper/stage1) with

javac ?d bin srccomdemostage1*.java

Once you have Java class files in place, you may test your new parser by entering the following sample user query into the 'main' java method you have defined. If you are using the same code, begin from the JavaCCPaper/stage1 directory and enter the following on the command line.

java ?cp bin com.demo.stage1.UQLParser 'actor = 'Tom Cruise'

The ?debug_parser option that we used during the JavaCC step ensures that the following useful trace messages are output, indicating how the user query is parsed. The output should appear as in Listing 8.

Listing 8. Output from UQLParser, query actor='Tom Cruise'

To test the recursive path for parenthesized expressions, try the following test.

java ?cp bin com.demo.stage1.UQLParser '(actor='Tom Cruise' or actor='Kelly McGillis') and keyword=drama'

This should produce the output in listing 9.

Listing 9. Output from UQL1Parser, query (actor='Tom Cruise' or actor='Kelly McGillis') and keyword=drama

This output is useful because it illustrates the recursion through queryTerm and expression. The first instance of queryTerm is actually an expression made up of two queryTerms. Figure 1 shows a pictorial view of this parse path.

Figure 1. Graphical representation of parse of user query

If you're curious as to what Java code was produced by JavaCC, by all means, have a look (but don't try to change any of it!). Here is what you'll find.

UQLParser.java ? In this file, you will find the code that you placed between PARSER_BEGIN and PARSER_END in your UQLParser.jj file. You will also discover that your JJ production methods have been mutated into Java methods.

For example, expression() rule has been expanded into the code found in Listing 10.

Listing 10. UQLParser.java

It bears some resemblance to what you originally wrote in that queryTerm(), AND, and OR make an appearance but the rest is parsing detail that JavaCC has added around it.

UQLParserConstants.java ? This file is easy to figure out. All of the tokens that you have defined are here. JavaCC has merely listed them in an array and provided integer constants to reference into that array. Listing 11 shows the contents of UQLParserConstants.java

Listing 11. UQLParserConstants.java

UQLParserTokenManager.java ? This is one scary file. JavaCC uses this class as the tokenizer. This is the piece of code that determines what the tokens are. The primary routine of interest here is GetNextToken. This is the routine used by the parser production methods to determine which path to take.

SimpleCharStream.java ? This file is used by UQLParserTokenManager to represent the ASCII stream of characters to be parsed.

Token.java ? The Token class is provided to represent the tokens themselves. The next part of this paper will demonstrate the usefulness of the Token class.

TokenMgrError.java and ParseException ? These classes represent exception conditions in the tokenizer and parser respectively.

Stage 2 - Adding behavior to JavaCC code

Note: For this part of the tutorial, refer to the stage2 subdirectory of the code. The JJ file that is presented from here on is JavaCCPaper/stage2/UQLParser.jj. In order to run your sample SQL queries, you should also create the MOVIEDB database using the enclosed moviedb.sql file. Execute the DDL using db2 -tf moviedb.sql.

Now that we've parsed, we need to take action on the individual expressions. The goal of this stage is to produce a runnable DB2 SQL query that will return what the user expects.

The process should start with a boilerplate SELECT containing a blank spot where the parser will fill in the rest. The SELECT template appears in Listing 12. The query produced by the parser may not be as optimal as a human DBA might write, but it will return the correct results expected by the end user.

Listing 12. The SELECT statement

What the parser fills in depends on the path it takes through the tokenizer. For example, if the user enters the query from above:

(actor='Tom Cruise' or actor='Kelly McGillis') and keyword=drama'

then the parser should emit text into the missing portion of the SQL query according to Figure 2. It will echo the parentheses, enter subqueries for the terminal queryTerms and substitute INTERSECT for AND and UNION for OR.

Figure 2. Parser output to SQL query

This would ensure that the SQL query emitted for

(actor = 'Tom Cruise' or actor = 'Kelly McGillis') and keyword=drama

would appear as in Listing 13.

Listing 13. The complete SELECT statement

As mentioned earlier, there are probably shorter, more optimal ways to write this particular query but this SQL will produce correct results. Generally, the DB2 optimizer can smooth out performance deficiencies.

So, what needs to be added to the JavaCC source to produce this? You must add actions and other supporting code to the defined grammar. An action is Java code that is executed in response to a particular production. Before adding actions, first add a method that will return the completed SQL to the caller. To do this, add a method called getSQL() to the top part of the JavaCC file. You should also add the private StringBuffer sqlSB to the parser's internal members. This variable will represent the current SQL string at any stage of the parsing. Listing 14 shows the PARSER_BEGIN/PARSER_END section of UQLParser.jj. Finally, add some code in your main() test method to print out and execute the generated SQL query.

Listing 14. PARSER_BEGIN/PARSER_END section

Now fill in the actions taken by the parser. We'll start with the easy one first. When an expression is being parsed, the parser should emit the word 'INTERSECT' whenever it parses 'AND' and 'UNION' whenever it parses 'OR'. To do that, insert self-contained blocks of Java code after the <AND> and <OR> tokens in the expression production. The code should append INTERSECT or UNION to the sqlSB StringBuffer. This is illustrated in Listing 15.

Listing 15. Actions taken for expressions

Several actions need to be taken within the queryTerm()production. These tasks are as follows:

Map the search names to their respective DB2 tables and columns
Save the comparator token.
Put the comparands into a form that DB2 can understand, for example remove the double quotes from QUOTED_STRING tokens
Output the proper subselect into sqlSB
For the recursive expression case, emit the parentheses as is.

For all of these tasks, you will need some local variables. These are defined between the first pair of braces in the production, as shown in Listing 16.

Listing 16. Local variables for queryTerm()

The first task can be accomplished with the code in Listing 17. Set the proper DB2 table and column associated with the token encountered.

Listing 17. Mapping search names to DB2

The second task can be accomplished with the code in Listing 18. Save the token so that it can be used in the SQL buffer.

Listing 18. Saving the comparator

The third task can be accomplished with the code in Listing 19. Set the comparand value accordingly, stripping the double quotes from the QUOTED_STRING token if necessary.

Listing 19. Preparing the comparands

The fourth task can be accomplished with the code in Listing 20. The complete query term is appended to the sql buffer.

Listing 20. Writing the SQL expressions

Finally, for the case of recursive expressions, the parser should simply echo parentheses when it sees them within the expression recursion, as illustrated in Listing 21.

Listing 21. Echoing parentheses

Listing 22 shows the completed queryTerm() production.

Listing 22. The complete queryTerms() production

Compile and run UQLParser.jj as before. Visit UQLParser.java and note how your production rules have been neatly inserted into the generated code. An example of the expanded expression() method is shown in Listing 23. Note the code after the jj_consume_token calls.

Listing 23. expression() method in UQLParser.java

Run this code as before. You will have to include db2java.zip in the CLASSPATH. This time, when you run

java ?cp bin;c:/sqllib/db2java.zip com.demo.stage2.UQLParser '(actor='Tom Cruise' or actor='Kelly McGillis') and keyword=drama'

it should produce the output in listing 24.

Listing 24. Output from UQL2Parser, query (actor='Tom Cruise' or actor='Kelly McGillis') and keyword=drama

Try a few more queries to get used to your parser. Try a query using the NOTEQUAL token, as in actor<>'Harrison Ford'. Try some illegal queries like 'title=' to see what happens. With very few lines of JavaCC code, you have produced a very effective end-user query language.

Final Points to Consider

JavaCC, in addition to providing the parser builder, also provides the JJDOC tool for documenting your grammar in Backus-Naur Form. JJDOC can make it easy for you to provide your end users with a description of the language they are using. The ant file provided in the accompanying code has a 'bnfdoc' target as an example.

JavaCC also provide a tool called JJTree. This tool provides tree and node classes which make it easy for you to partition your code into separate parsing and action classes. Moving forward with this example, you might consider writing a modest optimizer for your query to eliminate unnecessary INTERSECTs and UNIONS. You could accomplish this by visiting the nodes of the parse tree and consolidating similar adjacent nodes (actor='Tom Cruise' and actor='Kelly McGillis', for example).

JavaCC has a rich library of grammars. Before writing a parser yourself, be sure to look in JavaCC's examples directory for a possible ready-built solution.

Be sure to read the Frequently Asked Questions on the JavaCC Web page and visit the javacc newsgroup at comp.compilers.tools.javacc to get a better understanding of all the capabilities and features of JavaCC.

Conclusion

JavaCC is a robust tool that can be used to define grammars and easily incorporate the parsing and exception handling for that grammar into your Java business application. With this article, we have demonstrated that JavaCC can be used to provide a powerful, yet simple query language for end users of a database system.

Downloadable resources

JavaCCPaper.zip'>Code sample (JavaCCPaper.zip | 8.47KB)