Code Generators

This page discusses code generators, and how the DMS Software Reengineering Toolkit can be used to build sophisticated generators in a very general way. DMS has been used to build code generators from graph languages to industrial controllers, from nearly-side-effect free "functional" languages to task parallel languages, from XML DTDs to Java (extremely fast XML readers), etc.

A code generator accepts an input specification, often written in some domain-specific language, and produces code in typically a different (computer) language than that of the input specification. (It may produce code and documents in several different languages!) The key question is, what kinds of machinery are needed to read the input language, check for well-formedness, analyze for special code generation opportunities, and generate efficient, valid output code?

Which input (specification) language will the generator accept?

The first question is, what input specification language is to be processed? Obviously this depends mainly on the purpose of the code generator and the expected community of users. But properties of the input language matter, too. The code generator must be able to read the language, which it cannot do if the language isn't well defined, or if the language has some properties that exceed the code-generator's ability to parse (e.g., Natural language), If this language isn't extremely well defined, then some domain analysis will necessarily occur during the building of the code generator, or the users won't understand the meaning of what they say, and any benefits expected from code generation will likely be lost. The code generator must be able to reason about (analyze) the specification, too.

Input languages divide broadly into several categories:

  • Simple languages such as command lines, arithmetic and regular expressions
  • Classic, legacy text based languages like C, Java, SQL, HTML, COBOL, Verilog and Prolog
  • Hypermodern text based languages such as Scala and Ruby
  • XML-like languages, such as XQuery and configuration files
  • Graphical Languages such as UML, StateCharts, Simulink and LabView
  • Binary files, such as Java .class files

From our perspective, these are all specification languages, because they specify something of interest with respect to the code generation task.

Many code generators are specialized to accept only one particular kind of specification and produce only one type of generated result. For simple generators, this may be enough. But for sophisticated generation, these limitations are often caused by the narrow technical foundations, usually chosen with limited budget or timeframe, on which such generators are built. If the generation goal is ambitious, one wants strong machinery, not just what you can get easily. DMS can process any of these specification types, and produce fairly arbitrary outputs, by virtue of providing strong foundations for the variety of activities required by a code generator.

How will the specification be represented inside the generator?

Processing the raw text of the specification directly to generate code has proven be ineffective for all but the most trivial code generators. If not text, then how should the specification will be represented internally? There are two traditional choices: as a (syntax) tree, or as a graph. Syntax trees are often chosen for text-based languages because they are relatively easy to produce with string language parsers such as YACC. Graphs are often chosen for graphical languages, as being a direct representation.

DMS has a hypergraph internal representation, with value-carrying nodes and anonymous arcs that connect named ports on each node. Hypergraphs can can represent graphs whose individual node types certain relations with other node types easily. It can represent graphs with complex arcs (such as UML) by representing each arc as a hypergraph node. A specialized version of the hypergraph easily represents abstract syntax trees (ASTs), with ports representing parent and specific child links, which are the favored method of representing text-based languages after parsing. Hypergraphs are incredibly flexible for representing specifications.

What is needed to parse the specification?

Parsing means "read the (external) specification and construct the (internal) corresponding representation, diagnosing any illegal syntax in the specification".

Parsing String Languages

For simple and classic languages, one needs parsing technology that can read the source text and produce ASTs. The shape of the specification language is usually defined using some kind of BNF, sometimes an EBNF.

Many code generators that use a text-based specification are built using relatively weak parsing technology such as YACC, JavaCC or ANTLR, all of which have limitations on the nature of the BNF they will accept. YACC is limited to LALR(1) grammars, which cannot directly handle languages requiring indefinite amounts of lookahead, such as C or FORTRAN. ANTLR at least addresses bounded lookahead directly. However, JavaCC and ANTLR reject any BNF with left-recursion. Basically these tools cannot handle the defining grammars for the specification language; it is extremely unusual to find a BNF that these tools will accept directly. These issues can resolved, usually by considerable effort on the part of the domain engineer, who must either hack the grammar, hack the lookaheads, or hack around the limitations of the parser generator. (This hacking famously gives rise to the folk theorem that C++ is hard to parse. The theorem is false; what is true is that C++ is hard to parse using weak parsing technology.)

Given this headache, why do people build parsers with these tools? The classic answer: they were the only tools found looking under the lamppost, and they are free.

In contrast, DMS can easily parse any string language which can be described by a context-free grammar (no "no left recursion" constraints, no limit on lookahead, no complaints about ambiguous parses) using DMS's built-in GLR parser. The DMS lexer is designed to produce language lexemes based on regular expressions over arbitrary sequences of Unicode characters, with input streams configurable for a wide variety of character sets from ASCII to UTF8 to EBCDCI to Microsoft code pages. Chinese identifiers are fine.

It is often the case that the specification language is a production programming language, as the code generator's job is to pick out particular structures and generate corresponding code. One example is generating serialization code for data structures; one needs to be able to parse the computer language and recognize the data structure declarations. Writing a front end for a production language takes a lot of effort, to handle the actual grammar used versus the one documented by the standard, to handle dialects, to implement necessary preprocessing (C, C++, COBOL), etc. If the task is build a generator with a computer language as an input, trying to build your own, robust language parser adds a huge amount of time and effort to the process. To avoid that cost, DMS is available with robust front ends for many languages off the shelf, including C, C++, C#, Java, COBOL, JavaScript, Python, Fortran, .... If the specification language is not one of these, DMS's GLR parser will enable a short grammar definition phase and easy grammar modifications as necessary.

Most parsing engines are designed to ignore comments because to parsers are generally designed for compilers, which treat comments are just whitespace. If comments aren't needed for the code generation task this is acceptable. If comments contain annotations or need to be partly reflected into the generated code, then they need to be captured, and the parser design makes this much more difficult. The DMS lexer and parsers automatically capture comments and attach them as auxiliary data to appropriate AST nodes.

With traditional parser generators, if the parser engineer succeeds in getting the parser to "parse", she must then explicitly specify how to build the AST, typically by writing, for each grammar rule, custom (Java, C, C++) code that constructs AST nodes by assembling AST pieces from other parts of the grammar. For complex specification languages (COBOL, Verilog) with thousands of grammar rules, this means designing (and documenting) thousands of node types. One invariably discovers needs to modify the grammar, and that leads to the need to redesign subsets of the node types.

DMS automatically constructs a tree isomorphic to the grammar, and stamps each node with file source position information (file/line/column). No need to define nodes or how to build compose trees using children, or to change such definitions when the grammar changes. The grammar is the documentation for the set of AST nodes. DMS automatically optimizes the representation to drop terminal nodes that have no associated value, remove unary productions with no semantic import, and constructs list nodes for list-forming grammar rules, producing small trees. Building ASTs with DMS is trivial.

Parsing Graph Languages

Graph languages tend to be entered into tools in two ways: via a graphical UI, and via an export format. For a GUI, mouse clicks select icons, connection points and connections, and text entry provides annotations. A custom GUI editor collects this data and stores it into the chosen representation. DMS has a prototype graphical UI that has been used to enter some types of graphical languages like this, by constructing and connecting appropriate hypergraph nodes.

Export formats are basically text languages that specify nodes, arcs, and annotations. These can be parsed initially as that plain text language, followed by a post processing step that constructs the actual graph representation by interpreting the parsed text elements. This is straightforward to accomplish with DMS using conventional text parsers and a DMS-based attribute grammar evaluator (a scheme for passing information around a tree). DMS has built-in support for constructing attribute grammar evaluators. It is popular to use XML-based export formats such as XMI for UML; these basically require parsing the XML document, building the corresponding tree and processing it with an attribute grammar evaluator designed for the specific DTD or Schema. DMS provides an XML parser as a foundation for processing XML documents, whether used for graph descriptions or merely as a kind of text language with brackets.

Parsing "Binary" Languages

Sometimes a code generator will need to process binary files, such as Java .class files. No matter what technology one uses, for this kind of task one simply has to write a custom reader that implicitly understands the format of the particular class file. DMS provides a procedural language, PARLANSE, in which such arbitrary custom code can be implemented. As an example, DMS's Java domain provides .class file readers in addition to the standard Java parser definition.

Parsing Multiple Languages

Complex generation tasks may require the generator to read several (or even many) specification files, in a variety of different notations. While one can organize standard parser generators to do this, this can be awkward. DMS is perfectly happy to process an arbitrary set of specification languages concurrently. It has also been used to process specifications with thousands of files.

PrettyPrinting a representation

Most code generators don't ever print their internal representation in anything except as a data structure dump for debugging purposes. This can make it hard to understand why the code generator produced an particular output.

For every domain parser, DMS insists on the existence of a prettyprinter. This is a mechanism that converts the internal representation back into the external form from which it is normally derived. A language prettyprinter is an inverse function for the language parser.

For text languages, a prettyprinter regenerates the source text from the internal representation, This is accomplished using a DMS prettyprinter (text) box specification language, which annotates DMS grammar rules. The box specification describes how rectangular boxes of text can be composed, to produce a final text document (one giant box of text). The box language allows one to specify primitive boxes corresponding to language tokens found in the tree, to define horizontally concatenenated boxes, vertically laminated boxes, and indented boxes. This is sufficient to regenerate nicely indented source text for virtually any typical text-based computer language. If comments and formatting information is present, they are reproduced in appropriate places.

You can see examples of string parser and box prettyprinter grammers specifications for DMS on this example of algebra implemented using DMS.

The prettyprinter concept is also used to generate the final result from a code generator, which is typically in a different representation than the input specification uses.

How does the code generator know the meaning of the specification?

Theorists always want to know how the formal semantics of a specification language are captured and used in a trustworthy way. The practical answer for virtually all code generators is implicitly in what the generator does; it is simply too hard to write down the formal semantics (even if you can find them [try this for C++]). If you insisted on capturing formal semantics, DMS will let you define your semantic specification scheme as yet another domain, and let you build parsers and processors of the semantic language. We've never found this necessary.

How can the specification be analyzed for errors or special opportunities?

Many code generators are simple; if they find interesting idioms in the input, they produce an output controlled directly by that input. Some are simple because they only need to be simple, but many are simple because it is too hard to build any kind of sophisticated analysis if one has to hand-code walking over the tree/graph representation, so people just don't. But that means the full ambition for the generator isn't realized; it may not even be able to properly diagnose all the semantic faults in the specification, leading to longer debugging cycles for users of the generator.

DMS provides a wide variety of methods for analyzing the representation of a specification:

  • Write ad hoc procedural PARLANSE code to climb over the AST or the hypergraph in PARLANSE, and check the tree structure for interesting configurations. This is mostly unnecessary, because of the other techniques available below.
  • Match any part of the tree against a surface syntax pattern, specified in DMS's RSL language. This allows the generator engineer to write specification fragments of interest in the specification notation, and have DMS determine if a particular tree matches the fragment. For instance, the RSL pattern " \x * (\x +1 ) " for a string language allowing conventional expressions would match a tree with "*" as root operator, left child being an arbitrary subtree, right child having root "+" with right-left child being identical to root-left child, and right-right child matching the literal value "1". This allows the generator engineer to operate in large part without knowing the precise structure of the tree.
  • Attribute grammar evaluator, which automates visiting all tree nodes, and passing arbitrary selected data up and down the tree, calling PARLANSE procedures to collect ad hoc information or report errors
  • Construct symbol tables (often by using an attribute grammar evaluator), and use the symbol table data (definition, associated type) as an input to any analysis process, including another attribute evaluator
  • For procedural languages, construct control flow graphs, compute dominators. This is accomplished with another attribute-evaluator, and a DMS library for constructing control flow graph elements and weaving them together.
  • For procedural languages, with a control flow graph, construct def-use and use-def chains, using DMS libraries that walk control flow graphs and build such structures, to allow information flows to be analyzed. DMS can even build custom points-to analyzers and call graphs.
  • Compose any of the above analysis results to compute an arbitrary composite analysis, using PARLANSE code.

The analysis result may be reported to a user, used to print an error diagnostic, or cached to control sophisticated code generation.

How does one generate code for the specification?

Typical tree- and graph- based code generators produce code by procedurally walking the representation collected by parsing in some specific order, and spitting out raw text. This works for very simple code generation schemes. Because of the fixed order, one cannot collect complex information from "far away" in the specification to handle special cases well, or to carry out local optimations on the generated result.

If one wants sophisticated code generation, one needs to construct the code generator like a real compiler: after parsing, construct arbitrarily complicated analyses (as described in the preceding section), and then translate the specification representation into a representation appropriate for the output language. By doing this, the output representation can be further analyzed, and modified to implement optimizations, or mapped to yet another representation to allow staged translation. This is useful if one wants to input C code, generate VHDL as an intermediate representation, and then generate FPGA specifications as a final result.

DMS handles this simply by accepting a second "parser" definition that can parse the output language. Code generation is then accomplished by an arbitrary walk of the input representation, and construction of output representation language fragments which are ultimately stitched together to form a complete output representation, that may be processed further the same way. The order in which these are generated can be arbitrary, as long as they are ultimately glued together to form one or more output representation instances. Of course, one may wish to generate output representations for multiple languages; DMS can handle this easily.

One may wish to optimize assembled output representation fragments. DMS can run analyzers over those fragments to find optimization opportunities. When an optimization opportunity is found, one can write procedural code to modify the representation, or apply DMS RSL source-to-source transformations to implement them. One simple example: the transformation " \x ** 2 " --> " \x * \x " implements the strength reduction optimization that converts taking a (second) power, into the usually cheaper operation of multiplying by the same value. Writing optimizations using RSL helps the generator engineer specify complex transformations more easily, in a more visually inspectable form, and minimizes her effort to understand the precise shape of the AST, even if it changes!

When no more translation/optimization transforms can be applied and the final representation is reached, DMS can apply a language prettyprinter for the output representation instances to produce the final output(s) desired from the code generator.

DMS: Generalized Compiler Technology for Code Generation

We have sketched the needs of a code generator, and how simple code generators work compared to DMS. DMS can provide multilingual specification readers, highly complex analyis and sophisticated transformations, and produce outputs in multiple languages as desired. The DMS Software Reengineering Toolkit provides all the necessary machinery to build high quality code generators.

For more information: [email protected]    Follow us at Twitter: @SemanticDesigns

Code
Generation