Program Transformation

The purpose of this page is to generally sketch what program transformations are, and to provide an idea of how the DMS Software Reengineering Toolkit makes them available for use in custom tools.

Basics: Program Transformation (verb) is the process of converting one program to another. This process is implemented using sets of individual program transformations (noun), which are bits of knowledge that map substructures in one formal system (of which programming languages are a special case) to another (possibly the same) formal system.

A very simple example of a (source-to-source) program transformation would be the following (slightly simplified from the real DMS):

          transform additive_identity: \X + 0 -> \X

meaning transform a (structure we call an) expression having an arbitrary sub-expression designated \X, an add operator, and the constant zero, into just the sub-expression \X in the same language. The name additive_identity is used so people can name the individual transforms in a large set.

If additive_identity were applied twice to the following source code expression:

          ((Q*7+0)*Z+0)/(S+0)

the following result would be produced:

          ((Q*7)*Z)/(S+0)

Real Tools: There is usually some kind of transformation engine (e.g., DMS) that interprets the bits of knowledge to carry out their instructions on the substructures. A program transformation system is set of tools including such an engine.

Utility: Obviously, if one only has a single simple program transformation such as our example, one can only do trivial things, much like a 9th grader learning his first algebraic rule. But, just as mathmaticians with lots of algebraic knowledge can do spectacular mathematical reasoning, sets of transformations can be used for an astonishing variety of software engineering tasks, including optimizing code, translating computer languages (see B-2 Stealth Bomber software migration as a spectacular example), doing symbolic reasoning, making massive source changes, code generation etc. While generally considered thought of as unique and isolated technologies, both aspect weaving and refactoring are really just special cases of program transformation and thus can be implemented straightforwardly using a program transformation system.

Languages, structure, and meaning: In general, a formal system of notation is a language. All languages have some kind of structure, and a meaning, or "semantics" of the language, often composable from meaning of the language structures. Program transformations are usually semantics- (or "correctness-") preserving [optimization, translation, refinement, refactoring, ...] but can be used to alter the semantics of the a "program" (language instance) [add a parameter to a procedure, adjust all array bounds by one, remove all code related to a function, ...].

Application Conditions: Most individual program transformations can't be used in every possible circumstance under which they appear to apply (e.g., they might violate correctness), so many of them have an application condition which must be checked before they apply. A (source-to-source) transformation example for the C language:

          transform autoinc:
               \X = \X + 1 -> \X++
               if NoSideEffects(\X)

The application condition for this transformation is that the same exact expression \X appears on both sides of the equal sign, and the expression that \X represents doesn't have any side effects, and doesn't call any procedures that have side effects. (There's an entirely different discussion on how one should specify and implement application conditions.) If the application condition is satifiied, this would transform

          P[Q(N,7)]=P[Q(N,7)]+1

to

          P[Q(N,7)]++

Application conditions for most real programming languages require flow analysis to understand how information is passed between elements of the program. As an example, such flow analyses would be used support the determination that \X has no side effects by analyzing the procedures it calls.

Structural processing: A practical program transformation system must read in the language to be transformed, not as a simple text string, but rather according to the structure of the language, and then carry out program transformations directly on that structure. Structures of string-based languages are often represented by abstract syntax trees; the graph for a syntax tree for an expression fragment in a typical programming language such as Y = X + 1 would typically be


            =
           / \
          Y   +
             / \
            X   1

Practical systems such as DMS represent such languages internally as Abstract Syntax trees for string-like languages, or more generally as graph which allows them handle graphical languages too.

Reliability: This focus on structure makes program transformations a much more reliable mechanism for making accurate changes than string-hacking (as exemplified by PERL or SED), because the program transformations are expressed in terms of the natural structures of the language. What this means is that it is much harder to make stupid errors in syntax with program transformation systems, such as transforming the content of comments simply because it looks textually right. String hacking tools are generally attractive because of thier easy availability and easy understanding, and usually make terrible tools for source program changes.

Correctness-checking: Now, one can make mistakes in terms of preserving semantics (imagine leaving the if condition off the auto-increment transformation above). Program transformations themselves are neutral in this regard; they simply change structures to structures. Verifying that program transformations are meaning preserving is not what a program transformation system typically does. It might have auxiliary support to help with this. However, having the program transformations expressed in terms of language structures makes validating such properties much easier than validating string-to-string hacks.

Source-to-source transformations allow the expression of desired changes to be written using the syntax of the source and target languages; refer to the autoinc example. Since this makes the language structures explicit, these tend to be much easier to write and validate. Most source-to-source transformation systems express individual program transformations as pairs (both sides of the arrow) of parameterized language fragments, called patterns, where the parameters represent placeholders (such as \X that may be matched against other, appropriate language fragments. The arrow-left-side pattern used for matching, and the right pattern used for replacement after substituting matches found in the left side. An example of a source-to-source transformation mapping between a Basic-like language and an assembly language having pushdown stacks:


          transform map_for_stmt_to_assembly:
                for \v=\i to \u step \s do \b
                 ->
                     \i
                top: STORE \v
                     \b
                     PUSH \v
                     ADD  \s
                     DUP
                     CMP  \u
                     JLE top

Procedural program transformations are bits of procedural code (written in conventional languages such as C++ or LISP) that navigate around the structures (e.g., crawling over the tree graph) by explicitly specifying checks for node types and transitions across arcs, and specify individual changes to node types and arcs. A procedural transformation for the auto-increment might be coded as


   procedure autoinc(t: tree)
      begin
        if IsEqualNode(root(t).nodetype)
           and IsPlusNode(rightchild(root(t)))
           and IsNumberNode(rightchild(rightchild(root(t))))
           and NumberValue(rightchild(rightchild(root(t))))==1
           and EqualTrees(leftchild(root(t)),leftchild(rightchild(root(t))))
           and NoSideEffects(leftchild(root(t)))
        then
           nodetype(root(t))=AutoIncNode;
           delete(rightchild(root(t));
        endif
      end

It should be clear that this procedural transform will simply ignore the Y = X + 1 tree above. It should also be clear that one has to write a lot more code to express the same transformation as a procedure, and that it is a lot less readable.

There are many, many systems that implement such procedural transformations, including conventional compilers (which are extremely successful at transforming higher level languages into lower level assembly languages). These systems are common because once the language structures are implemented in data structures in a programming language, it is trivial to start coding in that programming language to navigate/modify those data structures. As a consequence, many language-specific program transformation systems such as OpenC++ are implemented this way. The good news is that the technology for these is well understood and widespread; the newly-coined term CodeDom has recently started to appear to describe software tools that represent compiler-like "abstract syntax trees" (an already perfectly good term known by the compiler community) and allow one to procedurally crawl over them. The bad news is that it is still painful to write transformations that way because the transformation engineer must know every tiny detail of the structure representation and get them all right. (We remark that many codedoms don't really represent enough detail to do this right, e.g., the C# CodeDom in MS Visual Studio 2008 doesn't provide any detail about the structure of an expression.)

Implementing source-to-source transformation systems is considerably harder than implementing procedural transformation systems, because of the need to implicitly extract the structure from the pattern, carry out the structure matching, application condition checking, and update the structures, but it should be clear that using source-to-source transformation systems is considerably easier than using a pure procedural transformation system. We expect the comparison between the procedural transform for autoinc and the source-to-source version makes this obvious. It is also painful to validate procedural transformations for semantics preservation because they are expressed in micro-procedural actions on structures. DMS provides a procedural program transformation interface, but it is generally used sparingly.

Transformation sets: Because source-to-source transformation patterns are finite, the structures they match are typically finite, as are the replacement fragments. The implication is that individual source-to-source program transformations can match against and substitute at most a limited "radius" on an underlying graph structure. An arbitrarily big change therefore cannot be carried out by a single source-to-source program transformation. Often, many program transformations are grouped together, to cooperatively achieve such larger effects. One can think of such a set as representing a more complex transform that carries out the effect of applying all of its members repeatedly until none apply. DMS can express sets of transformations directly. An example used to simplify algebraic expressions:


         transform fold_additive_constants:
               \c1 + \c2 -> \sum\(\c1,\c2\)
               if constant(\c1) & constant(\c2)

         transform division_by_self:
               \d / \d -> 1
               if NotZero(\d)

         transform eliminate_parentheses:
              ( \x ) -> \x

         set SimplifyExpression =
             { additive_identify,
               fold_additive_constants,
               division_by_self,
               eliminate_parentheses }

Metaprograms: It is often important to sequence the order in which transformations are applied. Metaprograms are the means for accomplishing this. Metaprograms are often written in procedural languages, so it is easy to see how they are implemented in procedural transformation systems. Source-to-source transformation systems sometimes have special metaprogramming languages for specifying this sequencing; the set notation above is a simple example of this. DMS provides procedural metaprogramming facilities that can invoke analyzers, procedural transforms and source-to-source transforms, in various orders depending on what is convenient and needed to accomplish complex tasks.

Language Front Ends: To apply transformations to real computer software, the transformation tool must have a high-fidelity front-end capable of parsing the computer software with the same level of detail as the compiler for that software. Builing front ends for real languages is extremely hard because of the amount of detail that has to be just right; C++ because of its sheer complexity is a kind of holy grail in the front-end building space. A good program transformation system has a means for acquiring robust front-ends, and ideally, a set of such front ends exist to allow immediate application of the program transformation tool to real software engineering tasks. DMS provides a wide range of industrial strength front ends for the most commonly used programming languages including C++, Java, C# and COBOL, including many of thier dialects.

More general details about DMS can be found here. You can also see a simple example of DMS usage, which shows how the source-source-transformations are coded and used.

A list of program transformation systems other than DMS and further discussion can be found at Program Transformation.org.

For more information: [email protected]    Follow us at Twitter: @SemanticDesigns

Program Transformation
Mapping from one Formal Language Structure to Another