Polyglot Compiler Tutorial

Introduction

Polyglot is a highly extensible compiler front end for the Java programming language. For more than ten years, researchers have used Polyglot to develop Java language extensions. While Polyglot originally targeted Java 1.4, its own extension mechanisms have recently been used to add support for modern Java features such as generics and annotations. Polyglot has proved to be a very useful tool for experimenting with new language features and for building other language-processing tools.

One particularly useful Polyglot extension is the Accrue interprocedural analysis framework. The Accrue framework simplifies the implementation of interprocedural analyses of programs written in Java or in extensions to Java. These analyses can be used for program understanding or as part of the implementation of the language.

In this tutorial, we explore how to use Polyglot and the Accrue framework to build language extensions and program analyses.

Design philosophy

Polyglot has been successful because of its design philosophy:

It is designed to support building complex language extensions that significantly modify the behavior of the base language, Java.
Polyglot is built in Java and does not require coding in a specialized language.
Its design patterns support modular extensibility in which the extended languages can be implemented while using the original Polyglot code as an unmodified library.
Further, its extensibility is scalable: coding effort is proportional to the amount of functionality added.
Compiler extensions can be layered on top of previous extensions, allowing languages to be built up incrementally.

Support for complex language extensions

Polyglot is not just a preprocessor; it supports the development of complex language extensions that add new features to the Java language, including to its type system. The base Polyglot framework implements an extensible compiler for the base language Java 1.4. This framework, also written in Java, is by default simply a semantic checker for Java. An implementation of a language extension may extend the framework to define any necessary changes to the compilation process, such as extending the abstract syntax tree (AST) and adding new compiler passes that analyze and transform the program.

Polyglot has been used to build extensions that change Java in very significant ways, such as supporting information flow labels (Jif), aspects (abc) and distributed computation (X10, Fabric). In fact, Java 5 and Java 7 are implemented as successive extensions to the base compiler. Because Polyglot implements Java 7, it is able to compile itself.

The standard back end of Polyglot generates pretty-printed Java code. It is also possible to add new back ends. In the usual mode of use, all static checking is performed by Polyglot and the extension code, and the back end compiler is handed fully correct Java code. Error messages are generated with respect to the original source code rather than relying on the back end compiler to generate messages, which would likely be less understandable.

Developing in Java with design patterns

Unlike some other recent extensible compilers, Polyglot is built using a standard programming language, Java. It's not necessary to learn a new programming language, and existing libraries and IDEs can be used to develop a Polyglot-based compiler. Polyglot was originally implemented using Java 1.4, but it has been able to grow along with Java and to take advantage of the new language features added in later versions of the language, such as generics.

Avoiding domain-specific language support does have a price, however. Even though object-oriented languages like Java are designed to support extensibility, a compiler is a particularly challenging kind of software to build in an extensible way. Compilers contain both complex data structures and complex algorithms, both of which may need to be extended. To make this possible without relying on support from specialized language features, Polyglot is implemented using a distinctive set of design patterns.

Patterns for modular, scalable extensibility

The difficulty of extending, in a type-safe way, both a set of types and the procedures that manipulate them was observed early by Reynolds and is often called the “Expression problem”. This problem arises when extending compilers: the types to be extended are the abstract syntax tree nodes used to represent the program, and the procedures to be extended are the compiler passes that traverse and transform this AST.
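
To make the tension concrete, here is a minimal sketch in plain Java. The names (Expr, Lit, Add, Neg) are hypothetical and are not part of Polyglot. Each node class implements every operation itself, so adding a node type is easy but adding an operation is not.

```java
// Hypothetical toy AST for illustration only; these are not Polyglot classes.
interface Expr {
    int eval();       // operation 1: evaluate the expression
    String print();   // operation 2: pretty-print the expression
}

final class Lit implements Expr {
    final int value;
    Lit(int value) { this.value = value; }
    public int eval() { return value; }
    public String print() { return Integer.toString(value); }
}

final class Add implements Expr {
    final Expr left, right;
    Add(Expr left, Expr right) { this.left = left; this.right = right; }
    public int eval() { return left.eval() + right.eval(); }
    public String print() { return "(" + left.print() + " + " + right.print() + ")"; }
}

// Adding a new node type is modular: one new class, no existing code is touched.
final class Neg implements Expr {
    final Expr operand;
    Neg(Expr operand) { this.operand = operand; }
    public int eval() { return -operand.eval(); }
    public String print() { return "-" + operand.print(); }
}

// Adding a new operation (for example, a type-checking pass) is NOT modular here:
// it needs a new method in Expr and in every class that implements it.
class ExpressionProblemDemo {
    public static void main(String[] args) {
        Expr e = new Add(new Lit(1), new Neg(new Lit(2)));
        System.out.println(e.print() + " = " + e.eval());  // prints "(1 + -2) = -1"
    }
}
```

A classic Visitor design flips the trade-off: new operations become easy to add, but every new node type forces a change to every visitor. Neither layout alone gives extensibility in both directions.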

Solutions to the Expression problem are often not scalable, in the sense that the amount of code needed to construct an extension is proportional to the size of the code base being extended, rather than to the size of the change being made.

To provide modular, scalable extensibility, Polyglot uses several design patterns:

Careful separation of interfaces and implementations. For example, all AST nodes (e.g., Node) are represented by interfaces with a standard implementation (e.g., Node_c) that can be replaced.
A modified version of the Visitor pattern supports incremental, functional-style (side-effect-free) translation of ASTs. This pattern makes it convenient to split the work into a sequence of small compiler passes, each of which performs only a small, modular task.
The Abstract Factory pattern is used to avoid binding the syntax of the language to specific classes representing abstract syntax nodes. Instead, these objects are created using a NodeFactory object.
Extension objects allow Polyglot to mix additional state and operations into existing abstract syntax tree nodes. Each layer of extension may add its own layer of extension objects to each node in the language. They are created by ExtFactory factory objects.
Language dispatcher objects handle the dispatching of AST node operations to the appropriate extension object, performing the transformation that is appropriate to the current language extension.
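
The first four patterns above can be sketched together in a few dozen lines of plain Java. This is a simplified illustration, not Polyglot's actual API: every name below (ToyNode, ToyExt, ToyNodeFactory, ConstantFolder, and so on) is hypothetical, chosen only to echo the Node/Node_c, NodeFactory, and ExtFactory naming mentioned in the text. Language dispatchers are left out to keep the sketch short.

```java
// All names are hypothetical stand-ins for illustration; this is not Polyglot's API.

/** Extension object: extra per-node behavior supplied by a language extension. */
interface ToyExt { String label(); }

/** Creates extension objects, one factory method per node kind. */
interface ToyExtFactory { ToyExt extIntLit(); ToyExt extPlus(); }

/** Clients see AST nodes only through interfaces. */
interface ToyNode {
    ToyExt ext();
    /** Rewrites the children with v, then lets v rewrite this node; never mutates. */
    ToyNode visit(ToyVisitor v);
}
interface ToyIntLit extends ToyNode { int value(); }
interface ToyPlus extends ToyNode { ToyNode left(); ToyNode right(); }

/** A compiler pass: maps each already-rebuilt node to its replacement. */
interface ToyVisitor { ToyNode leave(ToyNode n); }

/** Abstract factory: nothing outside the factory names a concrete node class. */
interface ToyNodeFactory {
    ToyIntLit intLit(int value);
    ToyPlus plus(ToyNode left, ToyNode right);
}

/** Standard implementation of ToyIntLit; an extension could substitute another. */
final class ToyIntLit_c implements ToyIntLit {
    private final int value;
    private final ToyExt ext;
    ToyIntLit_c(int value, ToyExt ext) { this.value = value; this.ext = ext; }
    public int value() { return value; }
    public ToyExt ext() { return ext; }
    public ToyNode visit(ToyVisitor v) { return v.leave(this); }
}

/** Standard implementation of ToyPlus. */
final class ToyPlus_c implements ToyPlus {
    private final ToyNode left, right;
    private final ToyExt ext;
    ToyPlus_c(ToyNode left, ToyNode right, ToyExt ext) {
        this.left = left; this.right = right; this.ext = ext;
    }
    public ToyNode left() { return left; }
    public ToyNode right() { return right; }
    public ToyExt ext() { return ext; }
    public ToyNode visit(ToyVisitor v) {
        ToyNode l = left.visit(v);
        ToyNode r = right.visit(v);
        // Reuse this node if no child changed; otherwise build a fresh copy.
        ToyPlus rebuilt = (l == left && r == right) ? this : new ToyPlus_c(l, r, ext);
        return v.leave(rebuilt);
    }
}

/** Default factory: attaches an extension object to every node it creates. */
final class ToyNodeFactory_c implements ToyNodeFactory {
    private final ToyExtFactory extFactory;
    ToyNodeFactory_c(ToyExtFactory extFactory) { this.extFactory = extFactory; }
    public ToyIntLit intLit(int value) {
        return new ToyIntLit_c(value, extFactory.extIntLit());
    }
    public ToyPlus plus(ToyNode left, ToyNode right) {
        return new ToyPlus_c(left, right, extFactory.extPlus());
    }
}

/** One small pass: folds the sum of two integer literals into a single literal. */
final class ConstantFolder implements ToyVisitor {
    private final ToyNodeFactory nf;
    ConstantFolder(ToyNodeFactory nf) { this.nf = nf; }
    public ToyNode leave(ToyNode n) {
        if (n instanceof ToyPlus) {
            ToyPlus p = (ToyPlus) n;
            if (p.left() instanceof ToyIntLit && p.right() instanceof ToyIntLit) {
                int sum = ((ToyIntLit) p.left()).value() + ((ToyIntLit) p.right()).value();
                return nf.intLit(sum);
            }
        }
        return n;
    }
}

class PatternsDemo {
    static ToyNodeFactory newFactory() {
        ToyExtFactory ef = new ToyExtFactory() {
            public ToyExt extIntLit() { return () -> "int literal"; }
            public ToyExt extPlus() { return () -> "addition"; }
        };
        return new ToyNodeFactory_c(ef);
    }
    public static void main(String[] args) {
        ToyNodeFactory nf = newFactory();
        ToyNode ast = nf.plus(nf.intLit(1), nf.plus(nf.intLit(2), nf.intLit(3)));
        ToyNode folded = ast.visit(new ConstantFolder(nf));   // ast itself is unchanged
        System.out.println(folded.ext().label() + ": " + ((ToyIntLit) folded).value());
        // prints "int literal: 6"
    }
}
```

In this sketch, an extension that wants different node behavior would supply its own ToyNodeFactory or ToyExtFactory rather than edit the existing classes, and a new pass is just another ToyVisitor. That is the sense in which the effort tracks the size of the change rather than the size of the code base.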