How to Use Polyglot

Introduction

Polyglot is an extensible Java compiler front end. The base polyglot compiler, jlc (“Java language compiler”), is a mostly-complete Java front end; that is, it parses and performs semantic checking on Java source code. The compiler outputs Java source code. Thus, the base compiler implements the identity translation.

Language extensions are implemented on top of the base compiler by extending the concrete and abstract syntax and the type system, and by defining new code transformations. The end product is a Java abstract syntax tree (AST) that is output into a Java source file, which is compiled with javac. (For historical reasons, some extensions just override some portions of the Java output code to handle the extended syntax of the particular language extension being compiled rather than rewriting the AST.)

Architecture

The Polyglot compiler is structured as a set of passes over source files that ends with the output of Java source code. The passes parse the original source language and create an AST, rewrite the AST to eliminate any ambiguities, type check the AST, possibly rewrite the AST to another AST, then output the AST as Java source code.

When the compiler is invoked (through polyglot.main.Main.main()), it parses the command line (setting options in polyglot.main.Options), then creates a compiler object (an instance of polyglot.frontend.Compiler) to manage the compilation process. An important job of the command line parser is to identify the language extension (specified on the command line with -ext L, where L is the name of the extension), and load the extension (from polyglot.ext.L.ExtensionInfo). The compiler uses the extension to determine several important features of the language, including its source file extension, AST node factory, type system, and pass schedule.
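As a rough illustration of the lookup described above, the following sketch derives the ExtensionInfo class name from a -ext L option. The class ExtLookupSketch and its method are hypothetical and are not part of Polyglot; only the -ext flag, the polyglot.ext.L.ExtensionInfo naming scheme, and the jl default come from this document. The real option handling in polyglot.main.Options does considerably more.

```java
class ExtLookupSketch {
    /** Returns the class name to load for the extension named by "-ext L". */
    static String extensionInfoClassName(String[] args) {
        String ext = "jl"; // the base Java extension is the default
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].equals("-ext")) {
                ext = args[i + 1]; // the argument following -ext names the extension
            }
        }
        return "polyglot.ext." + ext + ".ExtensionInfo";
    }

    public static void main(String[] args) {
        System.out.println(extensionInfoClassName(new String[] {"-ext", "pao", "Test.pao"}));
    }
}
```

A name computed this way could then be handed to Class.forName() to load the extension reflectively.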

Parsing is done with the Java CUP parser generator and a Polyglot extension to CUP called PPG. PPG allows CUP files to be selectively extended to create parsers for extension languages by providing operations on a CUP grammar, including adding, dropping, and renaming of productions. The JFlex lexer generator is used to create a lexer for the source language. The semantic actions in the parser create an AST through a NodeFactory, which is a class containing factory methods for creating AST nodes.

After the AST has been created by the parser, a series of passes is performed upon it. The passes for a language extension, as well as the order in which they should be run, are defined in the extension's ExtensionInfo class. The compiler object runs the passes in the order specified so that dependencies between compilation units are satisfied. Most passes are implemented using a modified version of the Visitor design pattern, described later (see also TR 2002-1871). The default set of passes is:

Many of these passes are followed by “barrier” passes (implemented by BarrierPass). A barrier pass compiles all source files on which a given source depends up to the same barrier. This ensures that enough of the type information of the sources it depends on has been computed before the compilation continues.
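The effect of a barrier can be sketched with a toy scheduler. Everything here is an assumption made for illustration: the pass names, the Job class, and the runUpTo() method are hypothetical and do not reflect how BarrierPass is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BarrierSketch {
    /** Hypothetical pass names; the entry "barrier" marks a barrier pass. */
    static final List<String> SCHEDULE =
        Arrays.asList("parse", "build-types", "barrier", "type-check", "translate");

    /** A hypothetical job that compiles one source file. */
    static final class Job {
        final String name;
        final List<Job> dependsOn = new ArrayList<Job>();
        int nextPass = 0; // index into SCHEDULE of the next pass to run
        Job(String name) { this.name = name; }
    }

    /** Records the order in which passes actually ran. */
    final List<String> log = new ArrayList<String>();

    /** Runs the passes of job up to, but not including, index end. */
    void runUpTo(Job job, int end) {
        while (job.nextPass < end) {
            int i = job.nextPass++;
            if (SCHEDULE.get(i).equals("barrier")) {
                // Bring every dependency up to and past this same barrier first.
                // nextPass was already advanced, so cyclic dependencies terminate.
                for (Job dep : job.dependsOn) {
                    runUpTo(dep, i + 1);
                }
            } else {
                log.add(SCHEDULE.get(i) + ":" + job.name);
            }
        }
    }

    static List<String> demo() {
        Job a = new Job("A");
        Job b = new Job("B");
        a.dependsOn.add(b); // A refers to a class declared in B
        BarrierSketch s = new BarrierSketch();
        s.runUpTo(a, SCHEDULE.size());
        return s.log;
    }
}
```

In demo(), the types of B are built before A is type checked, even though only A was requested.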

The ambiguities referred to in the above passes are ambiguities resulting from classification of names in Java. Some names are syntactically ambiguous because their meaning cannot be determined without some semantic analysis (see JLS2 6.5.2). Extensions may also introduce new ambiguities that require resolution.

Extensions will usually insert passes before type checking to perform some initial semantic analysis or after it to do some final semantic analysis, and between exception checking and translation to rewrite the AST.
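One way to picture this is as edits to an ordered list of passes. The sketch below uses plain strings for passes; the helper methods and the pass names are hypothetical stand-ins, not the API that ExtensionInfo actually exposes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PassListSketch {
    /** Inserts newPass immediately before the pass named anchor. */
    static void insertBefore(List<String> passes, String anchor, String newPass) {
        int i = passes.indexOf(anchor);
        if (i < 0) {
            throw new IllegalArgumentException("no such pass: " + anchor);
        }
        passes.add(i, newPass);
    }

    /** Inserts newPass immediately after the pass named anchor. */
    static void insertAfter(List<String> passes, String anchor, String newPass) {
        int i = passes.indexOf(anchor);
        if (i < 0) {
            throw new IllegalArgumentException("no such pass: " + anchor);
        }
        passes.add(i + 1, newPass);
    }

    /** A base schedule (names assumed) with two extension passes added. */
    static List<String> extendedSchedule() {
        List<String> passes = new ArrayList<String>(Arrays.asList(
            "parse", "disambiguate", "type-check", "exception-check", "translate"));
        insertBefore(passes, "type-check", "pre-check");            // initial semantic analysis
        insertAfter(passes, "exception-check", "rewrite-to-java");  // AST rewriting before output
        return passes;
    }
}
```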

Source code hierarchy

All Polyglot code is in the package polyglot. The subpackages are as follows:

ast
the AST node interface files. All AST nodes implement the polyglot.ast.Node interface.
types
the type system interface.
types/reflect
class file parsing code.
visit
visitor classes which iterate over abstract syntax trees.
frontend
compiler pass scheduling code.
main
code for the main method of the compiler in the class polyglot.main.Main. It includes code for parsing command line options and for debug output.
util
utility code. This includes the parser generator in util/ppg.
lex
lexer utility code.
parse
parser utility code.
ext
code for language extensions. Source code for a language extension lives in the package polyglot.ext.<ext-name>. The default language extension is the "jl" extension which implements Java parsing and type checking. Extensions are usually implemented by inheriting from the "jl" extension code. Extensions usually have the following subpackages:
ext.<ext-name>.ast
AST nodes specific to the extension
ext.<ext-name>.extension
New extension and delegate objects specific to the extension
ext.<ext-name>.types
type objects and typing judgments specific to the extension
ext.<ext-name>.visit
visitors specific to the extension
ext.<ext-name>.parse
the parser and lexer for the language extension

In addition, an extension must define the class ext.<ext-name>.ExtensionInfo, which contains the objects which define how the language is to be parsed and type checked. There should also be a class ext.<ext-name>.Version defined, which specifies the version number of the extension. The Version class is used as a check when extracting extension-specific type information from .class files.

AST nodes, extensions, and delegates

To allow for greater flexibility in overriding the behavior of an AST node, each node has a pointer to a delegate object and a (possibly null) list of extension objects. Extension objects are useful for adding a field or a method to many different AST nodes. They provide functionality similar to mixins. Their purpose is to allow a uniform extension of many AST nodes, not to be the primary vehicle through which a language extension is implemented. Delegate objects are similar to extension objects and are used for overriding existing methods of many different AST nodes. For more details, see the tech report (Cornell CS-TR 2002-1883).

In order for the delegates to override the AST node, most calls to the AST node object should be dispatched through the delegate object. The default delegate of every AST node just calls the corresponding method in the AST node.

So for instance, to invoke the typeCheck() method on an AST node n, we do:

    n.del().typeCheck(type_checker);

instead of directly calling:

    n.typeCheck(type_checker);

To reduce the proliferation of classes, all nodes in the base compiler use the same delegate class. For each compiler pass, the delegate invokes a method in the AST node that implements the pass. Thus, in the base compiler, passes are implemented in the AST nodes themselves. Besides reducing the number of classes, this approach also permits more convenient access to instance variables of the nodes; delegates access the instance variables of their associated node through accessor methods.

In writing a language extension, the designer should avoid this approach and instead put the pass implementation in the delegates themselves; this reduces the number of AST node classes that need to be extended.
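The dispatch pattern can be sketched in miniature as follows. The classes are simplified stand-ins invented for this example (a single pass, a mutable node, string results); they are not Polyglot's Node or JL_c types.

```java
class DelegateSketch {
    /** Operations that a delegate may override; only one pass is modeled. */
    interface Ops {
        String typeCheck();
    }

    /** A stand-in for an AST node; the pass is implemented in the node itself. */
    static class Node implements Ops {
        private Ops del = new DefaultDel(this);

        Ops del() { return del; }

        /** Installs a delegate. This sketch mutates the node for brevity. */
        Node del(Ops d) { this.del = d; return this; }

        public String typeCheck() { return "node check"; }
    }

    /** Default delegate: forwards every call back to the node. */
    static class DefaultDel implements Ops {
        protected final Node node;
        DefaultDel(Node node) { this.node = node; }
        public String typeCheck() { return node.typeCheck(); }
    }

    /** An extension's delegate: replaces the pass without subclassing Node. */
    static class LenientDel extends DefaultDel {
        LenientDel(Node node) { super(node); }
        @Override public String typeCheck() { return "extension check"; }
    }
}
```

Callers that go through n.del().typeCheck() see the override once LenientDel is installed, whereas a direct call to n.typeCheck() would silently bypass it.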

In deciding whether to add functionality via inheritance, an extension object, or a delegate object, use the following guidelines:

  1. If extending the interface of many different AST node classes, including adding a member to the common base class of several classes, use an extension.
  2. If overriding an existing method of many different AST node classes, use a delegate.
  3. Otherwise, use inheritance.

If the designer chooses to use delegates or extensions, delegate factories and extension factories simplify the task of instantiating appropriate delegate and extension objects respectively. See below for more information on node, extension and delegate factories.

Writing an extension

Suppose you want to create language L that extends the Java language. First, you need to design L. Your design process should include the following tasks:

  1. Define the syntactic differences between L and Java, based on the Java grammar found in polyglot/ext/jl/parse/java12.cup.
  2. Define any new AST nodes that L requires. The existing Java nodes can be found in polyglot.ast (interfaces) and polyglot.ext.jl.ast (implementations).
  3. Define the semantic differences between L and Java. The Polyglot base compiler (jlc) implements most of the static semantics of Java as defined in the Java Language Specification 2.
  4. Define a translation from L to Java. The translation should produce a legal Java program that can be compiled by javac.

Next, you can implement L by creating a Polyglot extension. Implementing the extension will require the following tasks.

  1. Modify build.xml to add a target for the new extension. This can usually be done by copying and modifying the skel target.

    (Optionally) Begin with the skeleton extension found in polyglot/ext/skel. Run the customization script found at polyglot/ext/newext, which will copy the skeleton to polyglot/ext/L, and substitute your language's name at all the appropriate places in the skeleton.

  2. Implement a new parser using PPG. To do this, modify polyglot/ext/L/parse/L.ppg using the syntactic changes you defined above.
  3. Implement any new AST nodes. Modify the node factory polyglot/ext/L/ast/LNodeFactory_c.java to produce these nodes.
  4. Implement semantic checking for L based on the rules you defined above.
    1. If L involves changing the semantics of Java, you will probably want to implement these as part of the type check pass already defined by Polyglot.
    2. If L introduces new semantics that are orthogonal to Java, you may wish to implement an entirely new pass that runs separately from the type checker.

    Semantic changes that are localized to an AST node will probably be implemented by overriding that node's typeCheck() method. Semantic changes that affect more fundamental properties of the Java type system will probably be implemented by overriding appropriate methods in polyglot/ext/L/types/LTypeSystem_c.java.

  5. Implement the translation from L to Java based on the translation you defined above. This should be implemented as a visitor pass that rewrites the AST into an AST representing a legal Java program.

Let's make this more concrete by introducing an actual extension. We'll use the “Primitives as Objects” (Pao) extension, which extends Java 1.4 with the ability to use primitive types (e.g., int, float) as Objects via autoboxing. For example, in Pao we can write:

    Map m = new HashMap();
    m.put(1, 2);
    int x = (int) m.get(1);

The changes to Java needed to support this feature are quite minimal.

  1. We modify the grammar to allow instanceof to operate on primitive types. The existing production for instanceof in java12.cup is:

        relational_expression ::=
                ...
            |   relational_expression:a INSTANCEOF reference_type:b
            ;

    In order to allow primitives, we should change this to:

        relational_expression ::=
                ...
            |   relational_expression:a INSTANCEOF type:b
            ;
    
  2. We modify type checking so that primitive values may be used at type Object. That means for all primitive types P where P != void, P <: Object (Polyglot defines void as a primitive type, but void has no values). We'll want to use this relationship in assignments and casting, as shown in the example above. Also, we'll need to allow primitive types to appear inside an instanceof operator.
  3. We rewrite the AST to make it a legal Java program. This means that anywhere we see a primitive value being used at Object, we should box the value and insert a cast to Object. We also need to unbox primitives when casting from Object to a primitive type. For completeness, we also rewrite the operation == to have it compare boxed values by value rather than by pointer. This gives the illusion that all primitives with the same value are boxed into the same object.
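To make the target of the translation concrete, here is a sketch of plain Java that the earlier Pao example might be rewritten into. It assumes the java.lang wrapper classes as the boxes and a hand-written primitiveEquals() helper; Pao's own runtime boxing classes and its Primitive.equals(o, p) method may well differ.

```java
import java.util.HashMap;
import java.util.Map;

class PaoTranslationSketch {
    /** Hypothetical stand-in for the runtime's Primitive.equals(o, p). */
    static boolean primitiveEquals(Object o, Object p) {
        if (isBoxed(o) && isBoxed(p)) {
            return o.equals(p); // boxed primitives: compare by value
        }
        return o == p;          // everything else: compare by pointer
    }

    /** True if o is one of the wrapper objects assumed as boxes here. */
    static boolean isBoxed(Object o) {
        return o instanceof Number || o instanceof Boolean || o instanceof Character;
    }

    /** Plain-Java counterpart of: m.put(1, 2); int x = (int) m.get(1); */
    static int example() {
        Map<Object, Object> m = new HashMap<Object, Object>();
        // Box each primitive, then cast the box to Object.
        m.put((Object) Integer.valueOf(1), (Object) Integer.valueOf(2));
        // Cast from Object to the box type, then unbox.
        int x = ((Integer) m.get((Object) Integer.valueOf(1))).intValue();
        return x;
    }
}
```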

We create the extension as follows. The complete extension is in the Polyglot distribution at polyglot/ext/pao.

  1. We use the newext script to generate a skeleton for the extension.
            $ cd $POLYGLOT/polyglot/ext
            $ sh ./newext pao Pao pao
            $ cd $POLYGLOT/polyglot/ext/pao
  2. We modify parse/pao.ppg to redefine the instanceof production to allow any type to be used in an instanceof expression. This required only appending the following code to pao.ppg:

            extend relational_expression ::=
                relational_expression:a INSTANCEOF type:b
                {: RESULT = parser.nf.Instanceof(parser.util.pos(a), a, b); :}
                ;

            drop { relational_expression ::=
                relational_expression:a INSTANCEOF reference_type:b; }

    The remainder of the file is boilerplate code.
  3. We next extend the Java type system to handle Pao's semantics.
            $ cd $POLYGLOT/polyglot/ext/pao/types

    We edit PaoTypeSystem_c.java to override the factory methods for primitive types and top-level class types. We also insert methods to provide access to the runtime boxing classes. We next create a subclass of PrimitiveType that overrides the methods descendsFrom(), isImplicitCastValid(), and isCastValid() to allow primitives to be used as Objects. We also create a subclass of ParsedClassType to allow primitives to be cast to Object.
  4. We create a new extension interface, PaoExt, that extends the Ext interface. This extension interface declares a new method, rewrite(), which we will use to rewrite the new Pao code into valid Java code. We also create a class PaoExt_c which extends Ext_c and implements PaoExt. The default action of the rewrite() method is to return the node unchanged, which is the behavior that is desired for most nodes.
  5. We override type checking for the instanceof operation. To do so, we create a new delegate, PaoInstanceofDel_c, that subclasses the JL_c class in the base compiler. In it, we override typeCheck() to allow primitive types to occur in the instanceof expression. JL_c implements all other methods of the JL interface by dispatching back to the node.
  6. We define the translation that will take our Pao language to standard Java by defining the implementation of the rewrite() method.

    By the translation rules that we have defined, three things will need to be rewritten: casts, instanceof operations, and the == and != operations.

    In PaoInstanceofExt_c, we override the rewrite() method to allow for instanceof operations on primitive types.

    We also create a PaoCastExt_c which extends PaoExt_c, in which we override the rewrite() method to box and unbox primitives appropriately to allow casting to and from primitive types.

    In addition, we create a PaoBinaryExt_c that also extends PaoExt_c, which overrides the rewrite() method to rewrite == and != expressions to call Primitive.equals(o, p) when comparing two Objects or boxed primitives. This method allows boxed primitives to be compared using == and !=.

  7. We add a pass to insert explicit casts to Object when assigning a primitive to an object. We call this pass PaoBoxer and implement it as a visitor.

    PaoBoxer is a subclass of AscriptionVisitor, which contains code to locate places where expressions are used. The ascribe() method is called for each expression and is passed the type the expression is used at rather than the type the type checker assigns to it. For instance, with the following Pao code:

            Object o = 3;

    ascribe() will be called with expression 3 and type Object. We override ascribe() to insert casts when assigning a primitive to an Object. We override the visitor's leaveCall() method to call the rewrite() method if the node's extension object is an instance of PaoExt. This makes sure that all the appropriate nodes are rewritten to ensure a proper translation.
  8. We create a new NodeFactory, PaoNodeFactory_c, that extends NodeFactory_c. In this new NodeFactory we override the defaultExt() method to make PaoExt_c the default extension object, and also override the Instanceof, Cast, and Binary methods to return nodes carrying the PaoInstanceofExt_c extension (together with the PaoInstanceofDel_c delegate), the PaoCastExt_c extension, and the PaoBinaryExt_c extension, respectively.
  9. We create the ExtensionInfo that defines our extension.
            $ cd $POLYGLOT/polyglot/ext/pao

    The skeleton generator created most of the necessary code. We modify the passes() method to add our new boxing pass. We also create a Version class that defines the version of Pao that is being worked on.
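The interplay between the rewrite() hook on extension objects and the visitor that invokes it can be sketched with a toy AST. The class names PaoExt, PaoExt_c, and PaoBinaryExt_c mirror the steps above, but their bodies, the string-based Node class, and the leave() method are simplifications invented for this example and should not be read as the actual Pao sources.

```java
class RewriteSketch {
    /** A toy AST node: source text plus an optional extension object. */
    static final class Node {
        final String text;
        final Object ext;
        Node(String text, Object ext) { this.text = text; this.ext = ext; }
    }

    /** Extension interface that adds rewrite() to nodes. */
    interface PaoExt {
        Node rewrite(Node n);
    }

    /** Default behavior: return the node unchanged. */
    static class PaoExt_c implements PaoExt {
        public Node rewrite(Node n) { return n; }
    }

    /** Rewrites "a == b" into a call that compares boxed values by value. */
    static class PaoBinaryExt_c extends PaoExt_c {
        @Override public Node rewrite(Node n) {
            String[] sides = n.text.split(" == ");
            if (sides.length != 2) {
                return n; // not an equality test; leave it alone
            }
            return new Node("Primitive.equals(" + sides[0] + ", " + sides[1] + ")", n.ext);
        }
    }

    /** What a visitor might do when leaving a node (assumed shape). */
    static Node leave(Node n) {
        if (n.ext instanceof PaoExt) {
            return ((PaoExt) n.ext).rewrite(n);
        }
        return n;
    }
}
```

Because most nodes carry the default PaoExt_c, only the few node kinds with a specialized extension object change during the pass.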

Node, Extension and Delegate Factories

Node factories are used to create instances of AST nodes. Extension and delegate factories simplify the task of instantiating appropriate delegate and extension objects for the AST nodes.

Language extensions will typically implement node factories by extending the NodeFactory_c class in the package polyglot.ext.jl.ast. The NodeFactory_c class can be given a delegate factory and/or an extension factory to use. The classes AbstractDelFactory_c and AbstractExtFactory_c in the same package provide convenient base classes for language extensions to extend.

For any AST node type <node>, the node factory typically has one or more methods called <node>, to create instances of <node>. The implementation of these methods in NodeFactory_c has the following form:

    public <node> <node>(Position pos, ...) {
        <node> n = new <node>_c(pos, ...);
        n = (<node>) n.ext(extFactory.ext<node>());
        n = (<node>) n.del(delFactory.del<node>());
        return n;
    }

Note that first an object that implements the interface <node> is created: <node>_c. An extension object for the newly created AST node is obtained by calling the appropriate method on the extension factory. A delegate object is obtained by a similar call to the delegate factory. The extension object and/or the delegate object returned by these calls may be null.
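Instantiating the template for one node type gives something like the sketch below. It is a self-contained model, not Polyglot code: the Position argument is omitted, Cast is a toy class, and the two factory interfaces are reduced to a single method each.

```java
class NodeFactorySketch {
    interface Ext {}
    interface Del {}

    /** Toy node; ext(...) and del(...) return a copy with the object attached. */
    static final class Cast {
        final Ext ext;
        final Del del;
        Cast(Ext ext, Del del) { this.ext = ext; this.del = del; }
        Cast ext(Ext e) { return new Cast(e, del); }
        Cast del(Del d) { return new Cast(ext, d); }
    }

    /** Reduced factory interfaces (assumed shape). */
    interface ExtFactory { Ext extCast(); }
    interface DelFactory { Del delCast(); }

    private final ExtFactory extFactory;
    private final DelFactory delFactory;

    NodeFactorySketch(ExtFactory extFactory, DelFactory delFactory) {
        this.extFactory = extFactory;
        this.delFactory = delFactory;
    }

    /** Factory method named after the node type, following the template. */
    Cast Cast() {
        Cast n = new Cast(null, null);
        n = n.ext(extFactory.extCast()); // may attach null
        n = n.del(delFactory.delCast()); // may attach null
        return n;
    }
}
```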

The AbstractExtFactory_c class implements the ext<node> methods and provides convenient hooks for language extensions to override. The implementation of the ext<node> method in AbstractExtFactory_c has the following form:

    public final Ext ext<node>() {
        Ext e = ext<node>Impl();
        return postExt<node>(e);
    }

The ext<node>Impl() is responsible for creating an appropriate Ext object. The default implementation of these methods in AbstractExtFactory_c is simply to call the ext<super>Impl() method, where <super> is the superclass of <node>. Thus, for example, the implementation of extArrayAccessImpl in AbstractExtFactory_c is:

    protected Ext extArrayAccessImpl() {
        return extExprImpl();
    }

For example, a language extension that needs to provide extension objects for all expressions and also for class declarations would thus need to override only two methods of AbstractExtFactory_c: extExprImpl() and extClassDeclImpl(). Another example: if a language extension needs to use a single Ext class for all AST nodes, then only the single method extNodeImpl() needs to be overridden.
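The fallback chain can be modeled in a few lines. The sketch assumes a much smaller hierarchy than the real one (expressions and class declarations hang directly off Node here) and assumes that the root method returns null; BaseExtFactory stands in for AbstractExtFactory_c and is not its actual source.

```java
class ExtFactoryChainSketch {
    interface Ext {}

    /** Base factory: each ext<node>Impl() defers to its supertype's method. */
    static class BaseExtFactory {
        protected Ext extNodeImpl()        { return null; } // assumed default
        protected Ext extExprImpl()        { return extNodeImpl(); }
        protected Ext extArrayAccessImpl() { return extExprImpl(); }
        protected Ext extCastImpl()        { return extExprImpl(); }
        protected Ext extClassDeclImpl()   { return extNodeImpl(); }
    }

    /** A hypothetical extension object shared by all expressions. */
    static final class ExprExt implements Ext {}

    /** Overriding one method covers every expression node type. */
    static class MyExtFactory extends BaseExtFactory {
        @Override protected Ext extExprImpl() { return new ExprExt(); }
    }
}
```

With MyExtFactory, array accesses and casts both receive an ExprExt, while class declarations still fall through to the root default.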

The postExt<node>(Ext) methods provide hooks for subclasses to manipulate Ext objects after they have been created. The default implementation of these methods in AbstractExtFactory_c is simply to call the postExt<super>(Ext) method, where <super> is the superclass of <node>.
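A post-creation hook might be used as in the following sketch, which records every extension object created for an expression. As before, BaseExtFactory is a reduced, hypothetical model of AbstractExtFactory_c with only three node types, and TrackingExtFactory is an invented example subclass.

```java
import java.util.ArrayList;
import java.util.List;

class PostExtSketch {
    interface Ext {}
    static final class SimpleExt implements Ext {}

    static class BaseExtFactory {
        /** Public entry point: create the Ext, then run the post hook. */
        public final Ext extArrayAccess() {
            Ext e = extArrayAccessImpl();
            return postExtArrayAccess(e);
        }

        // Creation chain: each method defers to its supertype's method.
        protected Ext extNodeImpl()        { return new SimpleExt(); }
        protected Ext extExprImpl()        { return extNodeImpl(); }
        protected Ext extArrayAccessImpl() { return extExprImpl(); }

        // Post-creation chain, structured the same way.
        protected Ext postExtNode(Ext e)        { return e; }
        protected Ext postExtExpr(Ext e)        { return postExtNode(e); }
        protected Ext postExtArrayAccess(Ext e) { return postExtExpr(e); }
    }

    /** Records every Ext that was created for an expression node. */
    static class TrackingExtFactory extends BaseExtFactory {
        final List<Ext> exprExts = new ArrayList<Ext>();

        @Override protected Ext postExtExpr(Ext e) {
            exprExts.add(e);
            return super.postExtExpr(e);
        }
    }
}
```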

The structure of the delegate factory AbstractDelFactory_c class is analogous to that of AbstractExtFactory_c.