DMS compared to other Analysis and Transformation Technologies

The DMS Software Reengineering Toolkit ("DMS") is used to automate program analyses and transformations. It consequently shares a number of properties with other compiler technologies, but often goes beyond them in a variety of ways. This section surveys other technologies that are available in practice and compares DMS to them. It is important to remember that DMS and the other technologies often serve different purposes, so a difference isn't necessarily bad; different tools meet different needs.

What we hope you will note is that DMS outperforms, or is at least competitive with, an amazing variety of available language-processing tools. The implication: you can have a mountain of incompatible tools, or just DMS; having only one tool minimizes your learning curve. In addition, DMS comes with the support that goes with a commercial product.

We compare DMS to the following technologies:

  • Lexer Generators (e.g., Flex)
  • Parser Generators (e.g., YACC, Bison, JavaCC, CUP, ANTLR/PCCTS)
  • Program Transformation Systems (e.g., TXL, ASF+SDF, Stratego, Fermat, CodeWorker)
  • C/C++ Language Front Ends (e.g., Edison Design Group, GCC, Columbus, GCCGXL, ... )
  • C++ Transformation Frameworks (e.g. CodeBoost, ROSE, Puma, OpenC++, Clang)
  • Compiler Frameworks (e.g. LCC, GCC, SUIF, SGI Open64, Clang, ...)
  • DSL (domain-specific language) compilers
  • Aspect Languages (e.g., AspectJ, AspectC, AspectC++)
  • Programming Languages (e.g., Lisp, Java, C#, C++, Fortran, ...)
  • Graph Rewrite Systems
  • Modelling Tools (UML, Models, ...)
  • Application Code Generators (UML, Models, ...)
  • Machine Code Generators (Burg, ...)
  • Migration Engines
  • Quality, Security, Source Code Scanning tools (e.g., ... )
  • Reverse Engineering Tools
  • Compilers (e.g. GNU, Ada, Prolog)
  • Rewriting Engines
  • Tree Walkers

The list is rather long. We believe we have characterized these other systems fairly. If you think a description is inaccurate, or could be improved, please send a note to [email protected]

Lexer Generators

Lexer Generators are used to construct finite state automata for recognizing the words of a language. Relevant issues include:

  • Expressibility of the regular expressions
  • Lexical format and comment capture
  • Lexical mode management
  • Character sets handled
  • Conversion actions performed by lexer
  • Support for include files
  • Multiplicity
  • Error reporting and recovery
  • Performance
  • Target language

Expressibility refers to what can be said in the regular expression language; most lexer generators allow character sets, Kleene star and plus, and alternation. DMS is unusual in providing complements ("not") of regular expressions. DMS also allows a token definition to extend what has been collected, or to look ahead an arbitrary distance for the presence of some trailing indicator.
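As a rough illustration of these two features (in Python's regex dialect, not DMS's lexer notation), a complement can be approximated with negative lookahead, and a trailing indicator can be checked without being consumed:

```python
import re

# "Not" of a regular expression, approximated with negative lookahead:
# match an identifier that is NOT one of these reserved words.
not_keyword_ident = re.compile(r"(?!(?:if|while|return)\b)[A-Za-z_]\w*")

# Arbitrary-distance lookahead for a trailing indicator: match digits only
# when a '%' eventually follows on the same line (the '%' is not consumed).
digits_before_percent = re.compile(r"\d+(?=[^%\n]*%)")

assert not_keyword_ident.fullmatch("count")
assert not not_keyword_ident.fullmatch("while")
assert digits_before_percent.match("42  as a rate: %")
```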

Lexical format and comment capture refers to what happens to the "shape" of the processed text. Most lexers simply toss this information away. DMS distinguishes whitespace, which is to be ignored, from comment text, which must be captured and retained with the lexemes. DMS also captures lexeme formatting information such as the radix of numbers, the type of quotes on strings, file/line/column information, etc. This information is useful for error reporting, and for regenerating the original text around a place that DMS has transformed.
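A minimal sketch of the kind of per-lexeme record such a lexer might keep (all field names here are invented, not DMS's actual representation):

```python
from dataclasses import dataclass, field

# Hypothetical record showing the formatting information a reengineering
# lexer retains so the original text can be regenerated after transformation.
@dataclass
class Lexeme:
    kind: str                 # e.g. "NUMBER", "STRING", "IDENTIFIER"
    value: object             # converted native value, e.g. an int or float
    text: str                 # exact source spelling, e.g. "0x1F"
    radix: int = 10           # 2/8/10/16 for numbers
    quote: str = '"'          # quote style for strings
    file: str = ""
    line: int = 0
    column: int = 0
    comments: list = field(default_factory=list)  # comments attached nearby

hex_lex = Lexeme(kind="NUMBER", value=31, text="0x1F", radix=16,
                 file="main.c", line=12, column=5,
                 comments=["/* mask for low 5 bits */"])
```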

Lexical mode management refers to how the lexer handles switching from lexing with one set of rules to another. Consider what happens if you have C++ with embedded assembly code: the C++ code is likely to be lexed according to one set of rules, and the assembly code according to another. Lexical mode management indicates how control can be passed from one mode to another. Many lexers offer multiple modes, with simple transfers from one mode state to another. Complicated languages such as COBOL, and HTML with embedded scripts, often require many modes with complex interactions and embeddings.
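A minimal sketch of mode management supporting both goto-style transfers and gosub/return (push/pop), the operations the comparison table below lists for DMS; the class and method names are invented:

```python
# Toy mode stack (names invented), e.g. for C++ with embedded assembly.
class ModeStack:
    def __init__(self, start="CPP"):
        self.stack = [start]

    @property
    def current(self):
        return self.stack[-1]

    def goto(self, mode):        # replace the current mode outright
        self.stack[-1] = mode

    def gosub(self, mode):       # enter a sub-language, remember where we were
        self.stack.append(mode)

    def ret(self):               # leave the sub-language
        return self.stack.pop()

modes = ModeStack("CPP")
modes.gosub("ASM")     # saw 'asm {': lex by assembly rules
assert modes.current == "ASM"
modes.ret()            # saw '}': back to C++ rules
assert modes.current == "CPP"
```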

Each lexer must operate with some input character set. Most lexers are restricted to 7-bit ASCII, possibly with a simple extension to ISO-8859-1 because of the ubiquity of 8-bit bytes in European languages. Lexing the 65,536 characters of Unicode's Basic Multilingual Plane is harder, because one cannot simply have tables indexed by an 8-bit code, but in times of globalization it is important to handle everybody's language with aplomb.

After recognizing a string, it is often useful to process it into the target system's native representation (e.g., convert "5.73E-27" into its equivalent IEEE float). Conversion support indicates what the lexer provides to help with this. In many languages, a keyword may be an identifier in certain contexts; DMS allows the lexer to ask the parser if this is possible, and thus correctly classifies the name as keyword or identifier.
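A small sketch of such conversions, plus the keyword-versus-identifier decision; `parser_accepts` is an invented stand-in for the parser-feedback hook the text describes:

```python
# Literal conversion to native values, and context-sensitive classification.
def convert_float(text: str) -> float:
    return float(text)            # "5.73E-27" -> 5.73e-27

def convert_int(text: str) -> int:
    if text.startswith(("0x", "0X")):
        return int(text, 16)
    return int(text, 10)

def classify(word: str, keywords: set, parser_accepts) -> str:
    # A keyword spelling may still be an identifier in some contexts, so we
    # ask the (hypothetical) parser whether a keyword is legal here.
    if word in keywords and parser_accepts("KEYWORD_" + word.upper()):
        return "KEYWORD"
    return "IDENTIFIER"

assert convert_float("5.73E-27") == 5.73e-27
assert convert_int("0x1F") == 31
```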

For complex inputs, one file will often "include" another; include-file support indicates what the lexer provides to handle this.

Multiplicity indicates if the lexing engine can handle multiple lexer instances (of the same lexer, or of different language lexers) in a single program execution.

Performance has to do with the rate at which the lexer processes characters. For very big inputs, this matters. For small inputs, this is irrelevant. For most source code analysis and processing tasks, very fast lexing isn't necessary; most of the time will be spent in the analysis or the transformation of the code.

Each lexer generator targets certain languages.

DMS
  • Expressibility: standard; complement ("not") of regexps; recognized-lexeme extension; arbitrary scan-ahead; Unicode character groups
  • Format capture: comments; radix; lead zeros; quote type; keyword spelling; file/line/column
  • Mode management: goto, gosub, push, pop
  • Character sets: 7-bit ASCII; ISO-8859-1 to -16; UTF-8; UTF-16; EBCDIC with 80-column breaks; Windows code pages 1250-1258; Japanese Shift-JIS; custom
  • Literal conversions: character; integer; precise float; rationals; string; escaped string; keyword/identifier; fill lexeme
  • Include files: nested
  • Multiplicity: yes
  • Error recovery: automatic
  • Performance: medium; MMAP file read
  • Target language: PARLANSE

Flex
  • Expressibility: standard
  • Format capture: no
  • Mode management: goto
  • Character sets: ASCII; ISO-8859-1
  • Literal conversions: none
  • Include files: none
  • Multiplicity: yes
  • Error recovery: automatic
  • Performance: fast
  • Target language: C

Parser Generators

Parser Generators accept grammars and produce parsers. Key issues are:

  • Notation Expressiveness
  • Grammars Available
  • Class of language parseable
  • Semantic Conditions
  • Semantic Actions
  • Substring/Pattern Parsing
  • Tree Building
  • Error reporting and recovery
  • Performance
  • Target language

Notation expressiveness refers to the type of extended BNF allowed by the parsing engine. Most parser generators allow a left-hand side to be tied to right-hand sides having at least token-sequence alternation, often Kleene star/plus, sometimes operator precedence, possibly in-lined predicates. Many allow inlined actions to enable custom code for processing on phrase recognition (e.g., tree building). DMS's BNF notation can actually accept all of these, but for most DMS work, DMS uses a remarkably simple form of BNF which meshes extremely well with the attribute-grammar evaluators that DMS can build on top of this syntax. DMS can be used to transform a more complex EBNF into the standard DMS BNF.
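For example, a Kleene-star repetition in EBNF can be rewritten mechanically into plain BNF recursion; this sketch uses generic grammar notation, not DMS's actual syntax:

```python
# Generic illustration of one EBNF-to-BNF rewrite: a repetition
#   lhs = item { sep item }
# becomes two plain BNF rules, with left recursion encoding the repetition.
def star_to_bnf(lhs: str, item: str, sep: str) -> list:
    return [f"{lhs} = {item}",
            f"{lhs} = {lhs} {sep} {item}"]

print(star_to_bnf("arglist", "arg", "','"))
# ["arglist = arg", "arglist = arglist ',' arg"]
```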

Some parser generators are used for small custom languages driving an application; DMS emphasizes arbitrary languages, with a focus on analysis and transformation. Often, one wishes to process application source code in standard programming languages such as C, COBOL, etc. What matters here is having a robust language definition, handling the variety of dialects that these languages really have, and supporting the additional infrastructure required to really "read" these languages, such as include files, macros, preprocessors, class-file readers, etc. DMS is distinguished by having a very large list of languages and dialects that have been validated against large volumes of production code, notably C++ with full preprocessing capabilities; uniquely, the DMS C++ preprocessor has the ability to NOT expand preprocessor directives, enabling transformation of the source code as seen by the programmer.

Most parser generators support a limited class of languages; YACC and CUP support LALR(1), and JavaCC supports LL(k). The good news about such parser generators is that they have precise formal characterizations of the class of languages they can process. The bad news is that most real languages do not fall into these categories. LL parser generators are often characterized as easier to work with, but they suffer from the problem that they cannot handle left-recursive BNF rules. Worse, real languages have ambiguous syntax, meaning several parses of some phrases are possible; since many of these parsing technologies accidentally choose the "first available parse", they typically cannot choose the one necessary parse unless backtracking is built in (which it often is not).
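The classic C/C++ phrase "a * b;" illustrates the point: it is a pointer declaration if "a" names a type, and a multiplication expression otherwise. A GLR-style parser can keep both parses and discard one later; this toy sketch (all names invented) shows the idea:

```python
# Keep BOTH candidate parses of "a * b;" and resolve once the symbol table
# says whether 'a' names a type. The token list is ignored in this toy.
def parse_ambiguous(tokens):
    return [("declaration", "a", "*b"),       # 'a' is a type: pointer decl
            ("expression", ("a", "*", "b"))]  # 'a' is a value: multiplication

def resolve(parses, symbol_table):
    if symbol_table.get("a") == "type":
        return [p for p in parses if p[0] == "declaration"]
    return [p for p in parses if p[0] == "expression"]

forest = parse_ambiguous(["a", "*", "b", ";"])
assert len(resolve(forest, {"a": "type"})) == 1
```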

With such parser generators, the grammar author must twist his grammar, his lexer, his brain, and code odd custom semantic conditions and actions to get a real working grammar. One consequence is that grammar definitions for individual parser generators are almost always difficult to transfer to another parser generator. DMS parses arbitrary context-free grammars, by virtue of using GLR technology. This means that DMS grammars are shaped by author convenience, not by the parsing technology, and so tend to be considerably closer to what is found in the standards documents describing the languages. One example: DMS is able to parse C++ with essentially the ISO grammar rules, which is widely considered to be impossible with LALR or LL technology. It also means that DMS is extremely agile, able to accept new language definitions quickly. The proof is the wide variety of languages that DMS accepts.

Often, semantic conditions are used to eliminate potential phrase parses, to avoid ambiguity. Most conventional parser generators depend heavily on these, especially top-down tools such as JavaCC and ANTLR. DMS can simply accept all the ambiguous parses and remove the inappropriate ones later, after type information is available. In those cases where it is easy to acquire information that can eliminate incorrect parses (such as loop-end to loop-head matching by line number in FORTRAN), DMS can use semantic predicates to eliminate inappropriate parses during parsing.

Semantic actions are often attached to grammars to carry out the key work (trivial code generation, command execution, or tree building). Virtually all parser generators provide a way to write such semantic actions, which of course means you must write them. Since the goal of DMS parsing is only to produce trees, DMS doesn't have semantic actions, so you don't have to write them. Less code means time saved and bugs not chased.

Most parsers are useful only for parsing complete source documents. For analysis and reengineering tasks, it is extremely useful to parse "patterns", which are code fragments with placeholders. DMS automatically constructs a pattern parser from the main parser. Once you've coded the main parser, you are ready to process code fragments. None of the conventional parser generators do this.
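A toy sketch of the underlying idea, matching a pattern with placeholders against an AST; this is not DMS's pattern syntax, just a generic structural matcher:

```python
# Placeholders are strings starting with '\'; each captures an arbitrary
# subtree. ASTs here are nested tuples.
def match(pattern, tree, bindings):
    if isinstance(pattern, str) and pattern.startswith("\\"):
        bindings[pattern] = tree          # placeholder captures any subtree
        return True
    if isinstance(pattern, tuple) and isinstance(tree, tuple):
        return (len(pattern) == len(tree) and
                all(match(p, t, bindings) for p, t in zip(pattern, tree)))
    return pattern == tree

# Pattern for the code fragment "\x + \y".
bindings = {}
ok = match(("+", "\\x", "\\y"), ("+", ("id", "a"), ("num", 1)), bindings)
assert ok and bindings["\\x"] == ("id", "a")
```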

Most parser generators don't construct trees, although they provide support for constructing trees by library calls from semantic actions. DMS simply constructs trees automatically. The trees are abstract in the sense that terminals and useless unary productions can be eliminated, but are otherwise isomorphic to the grammar, which is in turn a clean implementation of the language standard; the DMS engineer therefore "knows" what his trees look like by referencing the standards documents. This is very useful when you have languages like C++ with hundreds of grammar rules.

In constructing the trees, DMS automatically attaches format information from lexemes to the value-carrying leaves of the tree, and automatically attaches comments collected by the lexer to the appropriate place in the tree. Finally, each tree node is stamped with the domain (language) that produced it (this allows multiple languages in a single image, and indeed in a single tree), the tree node type, and complete information about the source file/line/column that produced it. All this information enables reconstruction of the source code, and/or accurate reporting of where analysis finds a problem or where transforms are made.
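A sketch of the kind of node record this implies (field names invented, not DMS's actual structures):

```python
from dataclasses import dataclass, field

# Hypothetical node record carrying the stamps the text lists: owning
# domain (language), node type, source position, and attached comments,
# so source can be regenerated and diagnostics located precisely.
@dataclass
class AstNode:
    domain: str               # e.g. "Cpp" or "COBOL"; mixed trees allowed
    node_type: str            # grammar-derived type, e.g. "if_statement"
    file: str
    line: int
    column: int
    children: list = field(default_factory=list)
    comments: list = field(default_factory=list)
```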

When processing lots of source, one invariably encounters malformed source code. Most parser generators provide various ad hoc means for recovering from such errors; YACC inserts an "error" token and the grammar engineer must write productions to swallow it. DMS provides automatic error recovery; it attempts to find a token it can insert that will let the parse proceed, or failing that, deletes the obstacle token and moves on. So the grammar engineer need do nothing at all to get useful error recovery.
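A minimal sketch of that recovery policy; `can_parse` is an invented oracle standing in for the real parser:

```python
# First try inserting a token that lets the parse proceed; failing that,
# delete the offending token and move on.
def recover(tokens, pos, candidates, can_parse):
    for tok in candidates:                       # try each insertable token
        repaired = tokens[:pos] + [tok] + tokens[pos:]
        if can_parse(repaired, pos + 1):
            return repaired, f"inserted {tok!r}"
    return tokens[:pos] + tokens[pos + 1:], "deleted offending token"
```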

Performance is always an issue when one tries to read a large amount of source (a 2000 line C++ compilation unit may implicitly read 250,000 lines of include file support!). Many of the conventional parser generators produce very fast parsers (although the typical bottleneck is lexing tokens and eliminating white space!). The DMS parsers are only moderately fast, because of the GLR property, automated tree building, and automated format/comment capture. This is fine for reengineering tasks; you don't actually process a million lines of code with DMS until you are ready, and then it doesn't matter if it takes a few minutes.

Parser generators typically produce parsers in some favored target language. YACC and Bison produce C-style code. JavaCC produces only Java. ANTLR produces C++ and Java. DMS parsers are really designed for the language underlying DMS, PARLANSE; however, DMS transformations can easily be applied to DMS grammars to produce recursive descent parsers in any of the languages for which DMS has a front end.

DMS
  • EBNF notation: simple BNF; token precedence; auto-list formation
  • Grammars available: production quality: preprocessor/C/C++0x, C#, Java 1.5, Fortran 77/95, Pascal, Ada, COBOL, Natural, JCL, JavaScript, PHP4/5, VB.net, VBScript, VB6, Python 2/3, HTML, XML, SQL 2011, PLSQL, Verilog, VHDL, SystemC
  • Language class: full context-free, via GLR
  • Semantic conditions: PARLANSE; check reduced-to tree or left context
  • Semantic actions: none (implicit tree building)
  • Substring/pattern parsing: Yes/Yes
  • Tree building: automatic; auto lists; unary elimination; terminal elimination; ambiguity handling with shared subtrees
  • Error recovery: automatic token insert or delete
  • Performance: medium (includes tree building); pretty fast DMS lexer
  • Pretty printer/fidelity printer: Yes/Yes
  • Target language: PARLANSE

YACC
  • EBNF notation: standard
  • Grammars available: C, but not C++; many others scattered across the web
  • Language class: LALR(1)+
  • Semantic conditions: no
  • Semantic actions: C code
  • Substring/pattern parsing: No/No
  • Tree building: hand-coded
  • Error recovery: skip to statement end?; optional error token
  • Performance: fast, usually because of fast lexer
  • Pretty printer/fidelity printer: no
  • Target language: C

BTYACC (built on YACC)
  • EBNF notation: standard
  • Grammars available: all available for YACC; not C++
  • Language class: superset of LALR(1); chooses first of ambiguous alternatives
  • Semantic conditions: yes?
  • Semantic actions: C code
  • Substring/pattern parsing: No/No
  • Tree building: hand-coded
  • Error recovery: skip to statement end; optional error token
  • Performance: a little slower than YACC
  • Pretty printer/fidelity printer: no
  • Target language: C

Bison
  • EBNF notation: standard
  • Grammars available: C (via GNU); older C++ (via old GCC); many others
  • Language class: LALR(1)+; GLR option
  • Semantic conditions: no
  • Semantic actions: C code
  • Substring/pattern parsing: No/No
  • Tree building: hand-coded
  • Error recovery: skip to statement end; optional error token
  • Performance: fast (usually because of fast lexer)
  • Pretty printer/fidelity printer: no
  • Target language: C

ANTLR
  • EBNF notation: standard
  • Grammars available: Java; C++ (draft); many others, but not production quality
  • Language class: LL(*)
  • Semantic conditions: Java
  • Semantic actions: Java
  • Substring/pattern parsing: No/No
  • Tree building: hand-coded
  • Error recovery: ad hoc
  • Performance: fast, mostly due to integrated lexer
  • Pretty printer/fidelity printer: no
  • Target language: Java

JavaCC
  • EBNF notation: standard
  • Grammars available: Java; not C++; many others of varying quality
  • Language class: LL(k)?
  • Semantic conditions: Java
  • Semantic actions: Java
  • Substring/pattern parsing: No/No
  • Tree building: hand-coded
  • Error recovery: ad hoc
  • Performance: medium (JIT-compiled code, even for the lexer)
  • Pretty printer/fidelity printer: no
  • Target language: Java

XT Scannerless GLR
  • EBNF notation: standard
  • Grammars available: Java; not C++; academic quality
  • Language class: context-free
  • Semantic conditions: C
  • Semantic actions: C (usually tree building)
  • Substring/pattern parsing: No/Yes, via ASF+SDF
  • Tree building: hand-coded
  • Error recovery: unknown
  • Performance: slow; parses using characters as tokens
  • Pretty printer/fidelity printer: yes, via ASF+SDF
  • Target language: C

Elkhound
  • EBNF notation: standard
  • Grammars available: ANSI? C++; C via C++ subset; academic quality
  • Language class: context-free
  • Semantic conditions: C++
  • Semantic actions: C++ (usually tree building)
  • Substring/pattern parsing: No/No
  • Tree building: hand-coded
  • Error recovery: unknown
  • Performance: fast (usually because of fast lexer)
  • Pretty printer/fidelity printer: no
  • Target language: C++

Program Transformation Systems

Transformation systems are intended to process source code to produce improved code in the same or a different language. Often they support some kind of analysis (to find and validate places that need improvement), so they are also often used simply as analysis tools. Issues include:

  • What kind of language, graphical or textual, does the tool process?
  • What language can the transformation system process?
  • What is the implementation language for the transformation system, and why was it chosen?
  • How are transforms encoded?
  • How are analyses encoded?
  • How are transforms and analyses sequenced to achieve the desired result? ("Metaprogramming")
  • How scalable is the tool (along various axes of scalability)
  • Has the tool been used for production purposes?

Program transformations come in two forms: procedural (arbitrary functions applied to compiler data structures) and source-to-source (maps between concrete surface syntax forms of code). Procedural transformations have enjoyed enormous success for some 50 years in classical compilers [AhoUllman] for translating source code to binary and optimizing code generators. Traditionally transformations have been considered interesting only if they were correctness preserving. Compilers also traditionally construct and cache crucial auxiliary information such as symbol tables and control and data flow analysis results to support complex context-dependent transformations that preserve correctness.
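To make the distinction concrete, here is a tool-neutral sketch (not any particular system's notation): a source-to-source rule is a map between surface-syntax patterns, while a procedural transform is arbitrary code over compiler data structures:

```python
# Source-to-source: a map between concrete surface-syntax forms, here as a
# pair of pattern strings with a placeholder.
simplify_rule = ("\\x + 0", "\\x")       # rewrite "e + 0" to "e"

# Procedural: an arbitrary function applied to compiler data structures
# (ASTs as nested tuples in this toy).
def simplify(node):
    if isinstance(node, tuple) and node[0] == "+" and node[2] == ("num", 0):
        return node[1]                    # drop the "+ 0"
    return node

assert simplify(("+", ("id", "x"), ("num", 0))) == ("id", "x")
```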

More recently, procedural transforms have been proposed to support mechanical refactoring [12] as a technique for restructuring programs. Manual versions of this have been popularized [7] as a methodology with suggestions for tool support. Tools for refactoring Smalltalk [18] and Java have started to appear. The better Java tools [23,24] do some sophisticated refactorings such as ExtractMethod; others in the market require some manual validation of the steps. Some experimental work has been done in refactoring C++ [21], and some commercial C++ refactoring tools are available [XRefactory]. All of the above tools can only apply predefined sets of transforms. OpenC++ [ ] uses a custom recursive-descent C++ parser front end that builds ASTs with customizable procedural transforms to enable some simple transforms on C++ syntax. The results of these tools must be manually (re)validated, because most of them do not construct the auxiliary information needed to ensure that the transforms are truly correctness preserving.

Many tools have been constructed that analyze source code by parsing to compiler data structures and applying procedurally encoded analyses. More recently, sophisticated reengineering tools such as the ATX Reengineering environment [11], based on parsing to compiler data structures and applying predefined procedural transforms, have started to appear.

Source-to-source program transformations were originally conceived as a method of program generation in the 1970s [2] and led to the notion of domain analysis to support transformational domain-specific program generation in the seminal Draco system [12], which accepted language ("domain") definitions and source-to-source transformations for each language, using recursive-descent parsing. The technology has continued to develop, moving into scientific computing [10], and is commercially used for sophisticated financial instruments [Kant Synapse 2005].

The idea that program transformations could be used for software maintenance and evolution by changing a specification and re-synthesizing was suggested in the early 80s [Balzer85 need paper ref]. Porting software and carrying out changes using transformations were suggested and demonstrated in the late 80s [1]. Simple source-to-source "evolution" transforms used to change specification code were suggested by Feather [9]. Evolution transforms led to the key idea of using correctness-preserving transforms to change nonfunctional properties of software, and noncorrectness-preserving transforms to change functional properties of a software system. Theory about how to modify programs incrementally using "maintenance delta" transformations and previously captured design information was suggested in 1990 [3][4]. This work defined an array of transformation support technology, which inspired the DMS Software Reengineering Toolkit.

Source-to-source transformation as a serious tool for software evolution is largely unrealized in practice, although there have been some very interesting applications. Following the Draco lead, Refine [6,17] was a ground-breaking software transformation engine originally intended for program generation from high-level specifications, but it found modest success when used for some black-box commercial automated migration projects on legacy code [Newcomb85]. Refine tended to be limited by its LALR parser generator, which does not handle most legacy languages very well.

TXL [Cordy early, CordyLCT04?] was used to find and repair Y2K flaws in reportedly 4 billion lines of legacy code, much of it COBOL, and is still available [ref]. TXL uses a backtracking parser, which makes it somewhat more robust for parsing legacy languages, but it does not handle ambiguities well. The TXL transformation philosophy gets around this partly by partitioning massive transformations into separate stages, each operating on an abstracted, and therefore less complex, grammar specific to that stage.

The ASF+SDF system [??] enables the specification of arbitrary languages, using GLR parsing technology, which can handle arbitrary context-free grammars. ASF+SDF parsers work at the character level ("scannerless" parsing), allowing convenient composition of arbitrary sets of grammar rules. Transformations are specified not in terms of concrete surface syntax, but rather in terms of abstract syntax. Stratego [Visser] augments ASF+SDF with a domain-specific language for specifying complex traversals of the abstract syntax tree and sequencing of transforms. Both ASF+SDF and Stratego have been used as porting and mass source-code-change tools in research projects, but apparently not commercially.

Two key stumbling blocks for transforming C++ are the problem of parsing it and the even more difficult engineering task of constructing an accurate symbol table, both because of the complexity of the language semantics and because of the differences among the C++ dialects in use. CodeBoost [ ] combines the Edison Design Group (EDG) commercial C++ front-end parser with Stratego to research the application of library-call-optimizing transforms to C++ code; we believe transforms are specified in terms of abstract syntax. CodeBoost does not appear to have been used to produce production code. ROSE [ Quinlan ] combines the EDG front end with recursive calls on that front end to enable writing C++ source patterns used by C++ procedural transforms, to help optimize scientific codes written in C++. Besides being a production C++ parser, the EDG front end provides a symbol table, crucial to implementing sophisticated context-sensitive transforms. AspectC++, built on top of Puma, combines a hand-constructed C++ parser and a symbol table adequate for research tasks, to enable customized procedural transforms implementing AspectJ-like transforms on C++ code. AspectC++ is not used for general C++ transformation.

The only available production-quality C++ parsers known to us are the GNU hand-coded parser, the EDG parser, and the one for DMS; all three are successful, in our opinion, not only because of the ability to parse, but especially because of the invested effort to engineer production-quality symbol tables. While GNU is very good as a compiler, it is not designed as a general program transformation system. EDG's front end is just that: it builds an AST and a symbol table, but provides nothing beyond that point to support transformation. ASF+SDF's parsing machinery, like DMS's, is GLR and can arguably be used to parse C++ directly, but that does not produce a symbol table. Of all of these systems, only DMS can accept true surface-syntax transforms and apply them. ROSE comes close, but it works only with C++, whereas DMS's source-to-source engine is independent of the particular grammar yet can still use that grammar's surface syntax. The DMS parsers, being reengineering parsers, capture macros, preprocessor conditionals, include files, comments, and lexical formatting information in a regenerable way, which none of the other tools accomplish.

One atypical aspect of the DMS Toolkit is the explicit encouragement to mix both source to source and procedural transformation. While this might sound awkward, combining the succinctness of source-to-source transforms with procedural control and the more direct capability of procedural transforms to manipulate complex structures is remarkably powerful.

We believe the Boeing Migration tool (REF) would not have been practical without the robust DMS C++ front end, the reengineering parsing machinery, and the combined power of the mixed rewrites.

DMS
  • Status: production; commercial; active enhancement
  • Grammars available: production quality: preprocessor/C/C++0x, C#, Java 1.5, Fortran 77/95, Pascal, Ada, COBOL, Natural, JCL, JavaScript, PHP4/5, VB.net, VBScript, VB6, Python 2/3, HTML, XML, SQL 2011, PLSQL, Matlab M, Simulink, Verilog, VHDL, SystemC; extensible to others
  • Transformation style: source-to-source and/or procedural; procedural metaprograms
  • Analyzers: parallel attribute grammars; control flow; data flow; local points-to; global points-to; call graph; symbolic range; custom PARLANSE code
  • Scale: multilingual; tens of thousands of files; parallel implementation
  • Performance: excellent; uses SMP
  • Implementation language: PARLANSE (parallel)

TXL
  • Status: very stable; academic; used commercially in the past
  • Grammars available: Java, but not C++; others as partial "island" grammars; extensible to others
  • Transformation style: source-to-source; functional style
  • Analyzers: implemented using transforms
  • Scale: big single files; multilingual via union grammars
  • Performance: good
  • Implementation language: Turing

ASF+SDF
  • Status: stable; academic
  • Grammars available: Java, but not C++; many of university quality; extensible to others
  • Transformation style: abstract syntax to abstract syntax
  • Analyzers: implemented via transforms; escape to C
  • Scale: unknown
  • Performance: excellent for rewrites?; modest for parsing, due to scannerless GLR
  • Implementation language: C

Stratego
  • Status: stable; academic; active enhancement; built on ASF+SDF
  • Grammars available: Java, but not C++; many of university quality; extensible to others
  • Transformation style: abstract syntax to abstract syntax; dynamic rules; metaprogramming via explicit strategies
  • Analyzers: implemented via transforms; escape to C
  • Scale: unknown
  • Performance: excellent for rewriting; modest for parsing, due to scannerless GLR
  • Implementation language: C

CodeBoost
  • Status: stable; academic; built on OpenC++ and Stratego
  • Grammars available: C++ via OpenC++
  • Transformation style: Stratego style
  • Analyzers: implemented via transforms; escape to C?
  • Scale: unknown
  • Performance: unknown; Stratego transforms likely excellent
  • Implementation language: C

OpenC++
  • Status: stable; open source
  • Grammars available: most of ANSI C++; no preprocessor; limited template handling
  • Transformation style: procedural analysis/transformation in C++ via Visitor pattern
  • Analyzers: ad hoc, coded in C++
  • Scale: single source file plus definitions from includes?
  • Performance: unknown
  • Implementation language: C++

ROSE
  • Status: stable?; research; active enhancement
  • Grammars available: C/C++ via EDG front end; Fortran 90?; not extensible
  • Transformation style: source patterns for matching and instantiating; procedural transforms in C++
  • Analyzers: procedural, in C++
  • Scale: large numbers of files
  • Performance: unknown
  • Implementation language: C++

Fermat
  • Status: stable; research; applied commercially by SMLtd
  • Grammars available: C, COBOL, of university quality; not easily extended, because of the requirement to map to WSL
  • Transformation style: encodes programs in the WSL language; WSL source patterns for matching and instantiating; procedural transforms in MetaWSL
  • Analyzers: procedural, in MetaWSL
  • Scale: tens of thousands of files (assembler)
  • Performance: several thousand lines per minute
  • Implementation language: MetaWSL, PERL

Tom
  • Status: stable; active enhancement
  • Grammars available: no conventional languages; handles small DSLs well
  • Transformation style: Tom patterns/rewrite rules embedded in the host language Java
  • Analyzers: Tom rewrite rules; escape to Java?
  • Scale: unknown
  • Performance: fast
  • Implementation language: OCAML?

CodeWorker
  • Status: stable; one-man show; active enhancement
  • Grammars available: Java?, but not C++; several partial grammars; arguably extensible to others
  • Transformation style: procedural, in the "CodeWorker" language
  • Analyzers: procedural, in "CodeWorker"
  • Scale: unknown
  • Performance: unknown
  • Implementation language: Java?

XML/XSLT
  • Status: production; active enhancement
  • Grammars available: JavaML; GCCXML (incomplete); extensible to others by using some parser generator with ad hoc XML export
  • Transformation style: XSLT functional transformation
  • Analyzers: weak analysis via functional transformation
  • Scale: poor, because of XML document sizes
  • Performance: poor
  • Implementation language: C/C++/Java

Rascal/MPL
  • Status: academic; experimental; active enhancement
  • Grammars available: several DSLs; Java?; extensible to others by using the Rascal GLL parser generator
  • Transformation style: mixed source-to-source with arbitrary Rascal procedural code
  • Analyzers: custom-coded in Rascal relational calculus
  • Scale: poor (see performance)
  • Performance: interpreted; reported "very slow" by its implementers
  • Implementation language: Java?

C++ Front Ends

C++ parsers ("front ends") are intended to read C++ source code and build compiler data structures representing that code, with the expectation that a custom tool will be build to process the compiler data structures, for analysis or C++ transformation.

Issues for C++ front ends are:

  • What dialects of C/C++ are handled?
  • Is there an integrated preprocessor?
  • Does the front end capture comments and lexical format information?
  • What compiler data structures are built? (AST, Symbol tables, ...?)
  • How can those structures be processed?
  • What kind of support for analysis is available?
  • Is the tool designed to process just one file, or entire systems of files?
  • Can the front end operate with other languages (SQL, ...?)

DMS
  • Status: production; commercial; active enhancement; integrated preprocessor with "no expand" option
  • C/C++ dialects available: production; C and C++; C++0x; ANSI, GCC, MSVC6, MS Studio 2005, SystemC; many other languages
  • Post-parsing capability: procedural tree processing; name/type resolution with or without preprocessing; source-to-source transformation; attribute evaluation; access to DMS control-flow, dataflow, and points-to analyzers; regeneration of source with comments and preprocessor directives
  • Scale: tens of thousands of source files; multilingual processing
  • Performance: parallel implementation
  • Implementation language: PARLANSE (parallel)

EDG
  • Status: production; commercial; active enhancement; integrated preprocessor
  • C/C++ dialects available: production; most C/C++ dialects
  • Post-parsing capability: procedural tree processing; name/type resolution after preprocessing
  • Scale: single source file; includes processed but lost?
  • Performance: fast, hand-tuned
  • Implementation language: C

GCC
  • Status: production; open source; active enhancement; integrated preprocessor
  • C/C++ dialects available: production; GCC, ANSI
  • Post-parsing capability: arbitrary procedural analysis and transformation, but tangled up in the GCC implementation
  • Scale: one source file plus includes
  • Performance: fast, hand-tuned
  • Implementation language: GCC C

GXLGCC (built on GCC)
  • Status: stable?; academic; integrated preprocessor
  • C/C++ dialects available: production; GCC, ANSI
  • Post-parsing capability: XML export of declarations; no information about function bodies
  • Scale: declarations from a single source file and its includes
  • Performance: poor; XML documents are huge
  • Implementation language: C

Elsa (built on Elkhound)
  • Status: stable?; academic; no preprocessor
  • C/C++ dialects available: C as C++ subset; ANSI? C++; site claims some limits on template support
  • Post-parsing capability: name/type resolution after preprocessing; procedural tree processing in C++
  • Scale: single file plus includes
  • Performance: unknown
  • Implementation language: C++

Columbus
  • Status: stable?; academic/commercial?; no preprocessor
  • C/C++ dialects available: ANSI C?/ANSI C++; limited template support?
  • Post-parsing capability: procedural tree processing
  • Scale: single file plus includes
  • Performance: unknown
  • Implementation language: C++?

OpenC++
  • Status: stable; open source; no preprocessor
  • C/C++ dialects available: most of ANSI C++; limited template handling
  • Post-parsing capability: procedural analysis/transformation in C++ via Visitor pattern; regeneration of preprocessed source
  • Scale: single source file plus definitions from includes?
  • Performance: unknown
  • Implementation language: C++

ANTLR
  • Status: no preprocessor; stale C++ definition; not used anywhere in practice
  • C/C++ dialects available: ANSI C++ (draft)
  • Post-parsing capability: some name/type resolution while parsing; procedural tree processing
  • Scale: single preprocessed file?
  • Performance: moderate?
  • Implementation language: Java

Clang/LLVM
  • Status: stable; open source; active enhancement; integrated preprocessor
  • C/C++ dialects available: ANSI C++; GCC dialect?
  • Post-parsing capability: name/type resolution while parsing; procedural tree processing; regeneration of source (with comments? preprocessor directives?); native code generation via LLVM; various flow analyses via LLVM
  • Scale: single preprocessed file?
  • Performance: fast, hand-tuned
  • Implementation language: C++

Compiler Infrastructures

Compiler infrastructures are intended to make it easier to construct compilers or compiler-like tools such as source code instrumenters.

Issues for compiler infrastructures are:

  • What languages can the infrastructure process?
  • What compiler data structures are built? (AST, Symbol tables, ...?)
  • Does the front end capture comments and lexical format information?
  • How can those structures be processed?
  • What kind of support for analysis is available?
  • How hard is it to customize the infrastructure for the new task?
  • Is the tool designed to process just one file, or entire systems of files?
  • Can the front end operate with other languages (SQL, ...?)

DMS
  • Status: production; commercial; active enhancement
  • Languages available: production quality: preprocessor/C/C++0x, C#, Java 1.5, Fortran 77/95, Pascal, Ada, COBOL, Natural, JCL, JavaScript, PHP4/5, VB.net, VBScript, VB6, Python 2/3, HTML, XML, SQL 2011, PLSQL, Matlab M, Simulink, Verilog, VHDL, SystemC; extensible to others (e.g. DSP assembly); easily extended to include DSLs for graphs, code generation, etc.
  • Post-parsing customization capability: procedural tree processing; name/type resolution with or without preprocessing; source-to-source transformation; attribute evaluation; regeneration of source with comments and preprocessor directives; control-flow, dataflow, and global points-to analyzers; used to build a wide variety of source-analysis and instrumentation tools, including code generators and vectorizers
  • Scale: tens of thousands of source files; multilingual processing
  • Performance: parallel implementation
  • Implementation language: PARLANSE (parallel)

GCC
  • Status: production; open source; active enhancement
  • Languages available: GCC; ANSI
  • Post-parsing customization capability: arbitrary procedural analysis and transformation; code-generation-focused flow analysis; tangled in "wants to be a compiler for every possible machine"
  • Scale: one source file plus includes
  • Performance: fast, hand-tuned
  • Implementation language: GCC C

LCC
  • Status: stable; academic
  • Languages available: ANSI C
  • Post-parsing customization capability: arbitrary procedural analysis and transformation; tangled in "wants to be a compiler"
  • Scale: one source file plus includes
  • Performance: fast, hand-tuned
  • Implementation language: C

SUIF
  • Status: stable; academic; no preprocessor
  • Languages available: C and FORTRAN 90?
  • Post-parsing customization capability: procedural tree processing; name/type resolution after preprocessing; dataflow analyses; regeneration of preprocessed source?
  • Scale: multiple files?
  • Performance: unknown
  • Implementation language: C++?

Open64
  • Status: stable; open source
  • Languages available: ANSI C, C++?; FORTRAN 90?
  • Post-parsing customization capability: arbitrary procedural analysis and transformation; dataflow analysis; regeneration of preprocessed source?; tangled in "wants to be an SGI compiler"
  • Scale: single source file plus definitions from includes?
  • Performance: likely fast; used as a production compiler?
  • Implementation language: C++

Clang/LLVM
  • Status: stable; open source; active enhancement; integrated preprocessor
  • Languages available: ANSI C++; GCC dialect?
  • Post-parsing customization capability: name/type resolution while parsing; procedural tree processing; regeneration of source (with comments? preprocessor directives?); native x86 code generation via LLVM; various flow analyses via LLVM; custom analyzers in C++
  • Scale: single preprocessed file?
  • Performance: fast, hand-tuned
  • Implementation language: C++

Note: This page under construction.

ToDo: References. Links. CodeGenerators.

For more information: [email protected]    Follow us at Twitter: @SemanticDesigns
