DMS compared to other Analysis and Transformation Technologies
The DMS Software Reengineering Toolkit ("DMS") is used to automate program analyses and transformations. It consequently shares a number of properties with other compiler technologies, but often goes beyond them in a variety of ways. This section explores other technologies that are available in practice and compares DMS to them. It is important to remember that DMS and other technologies often serve different purposes, so a difference isn't necessarily bad; different tools meet different needs.
What we hope you will note is that DMS functionally outperforms, or is competitive with, an amazing variety of available language processing tools. The implication is that you can have a mountain of incompatible tools, or DMS: your choice. Having only one tool minimizes your learning curve. In addition, DMS comes with the support that goes with a commercial product.
We compare DMS to the following technologies:
- Lexer Generators (e.g., Flex)
- Parser Generators (e.g., YACC, Bison, JavaCC, CUP, ANTLR/PCCTS)
- Program Transformation Systems (e.g., TXL, ASF+SDF, Stratego, Fermat, CodeWorker)
- C/C++ Language Front Ends (e.g., Edison Design Group, GCC, Columbus, GCCXML, ... )
- C++ Transformation Frameworks (e.g. CodeBoost, ROSE, Puma, OpenC++, Clang)
- Compiler Frameworks (e.g. LCC, GCC, SUIF, SGI Open64, Clang, ...)
- DSL (domain-specific language) compilers
- Aspect Languages (e.g., AspectJ, AspectC, AspectC++)
- Programming Languages (e.g., Lisp, Java, C#, C++, Fortran, ...)
- Graph Rewrite Systems
- Modelling Tools (UML, Models, ...)
- Application Code Generators (UML, Models, ...)
- Machine Code Generators (Burg, ...)
- Migration Engines
- Quality, Security, Source Code Scanning tools (e.g., ... )
- Reverse Engineering Tools
- Compilers (e.g. GNU, Ada, Prolog)
- Rewriting Engines
- Tree Walkers
The list is rather long. We believe we have characterized these other systems fairly. If you think a description is inaccurate, or could be improved, please send a note to [email protected]
Lexer Generators
Lexer Generators are used to construct finite state automata for recognizing the words of a language. Relevant issues include:
- Expressibility of the regular expressions
- Lexical format and comment capture
- Lexical mode management
- Character sets handled
- Conversion actions performed by lexer
- Support for include files
- Multiplicity
- Error reporting and recovery
- Performance
- Target language
Expressibility refers to what can be said in the regular expression language; most lexer generators allow character sets, Kleene star and plus, and alternation. DMS is unusual in providing complements ("not") of regular expressions. DMS also allows a token definition to extend what has been collected, or to look ahead an arbitrary distance for the presence of some trailing indicator.
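As an illustration, here is a small Python sketch of what such extra expressibility buys; the token rules are hypothetical, and Python's `re` lookahead assertions merely emulate what DMS expresses directly:

```python
import re

# Hypothetical token rules illustrating two capabilities DMS lexers
# provide natively; Python's lookahead assertions only emulate them.

# 1. Complement ("not") of a regular expression: match text that is
#    NOT the start of a C-style comment.
NOT_COMMENT = re.compile(r'(?:(?!/\*).)+')

# 2. Arbitrary scan-ahead for a trailing indicator: classify a digit
#    string as a LABEL only if a ':' eventually follows on the line.
LABEL = re.compile(r'\d+(?=[^:\n]*:)')

print(NOT_COMMENT.match('int x; /* c */').group())          # 'int x; '
print('LABEL' if LABEL.match('100   loop:') else 'NUMBER')  # 'LABEL'
```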
Lexical format and comment capture refers to what happens to the "shape" of the processed text. Most lexers simply toss this information away. DMS distinguishes white space, which is ignored, from comment text, which must be captured and retained with the lexemes. DMS also captures lexeme formatting information such as the radix of numbers, the type of quotes on strings, file/line/column information, etc. This information is useful for error reporting, and for regenerating the original text around a place that DMS has transformed.
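A minimal sketch (the field names are ours, not DMS's API) of the per-lexeme information a reengineering lexer must retain so that unmodified source can be regenerated exactly and problems reported precisely:

```python
from dataclasses import dataclass, field
from typing import List

# Per-lexeme format information retained for faithful regeneration.
@dataclass
class Lexeme:
    kind: str                 # e.g. 'NUMBER', 'STRING', 'IDENTIFIER'
    value: object             # converted native value, e.g. 42
    text: str                 # original spelling, e.g. '0x2A'
    radix: int = 10           # 8/10/16 ... for numbers
    quote: str = '"'          # quote style for strings
    file: str = ''            # source position for diagnostics
    line: int = 0
    column: int = 0
    comments: List[str] = field(default_factory=list)  # attached comments

tok = Lexeme('NUMBER', 42, '0x2A', radix=16, file='m.c', line=7, column=3)
# Regeneration uses tok.text and tok.radix, not just tok.value,
# so '0x2A' does not silently become '42' in the output.
```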
Lexical mode management refers to how the lexer handles switching from lexing with one set of rules to another. Consider what happens if you have C++ with embedded assembly code: the C++ code is likely to be lexed according to one set of rules, and the assembly code according to another. Lexical mode management indicates how control can be passed from one mode to another. Many lexers offer multiple modes, with simple transfers from one mode state to another. Complicated languages such as COBOL, and HTML with embedded scripts, often require many, many modes with complex interactions and embeddings.
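A minimal sketch of mode management with goto/gosub-style control, using hypothetical mode names for C++ with embedded assembly:

```python
# Mode management sketch: 'goto' replaces the current mode;
# 'gosub' pushes it so a sub-lexer can return with 'pop'.
class ModedLexer:
    def __init__(self):
        self.stack = ['CPP']          # start in the C++ mode

    @property
    def mode(self):
        return self.stack[-1]

    def goto(self, mode):             # permanent transfer
        self.stack[-1] = mode

    def gosub(self, mode):            # temporary transfer, e.g. into
        self.stack.append(mode)       # an embedded asm block

    def pop(self):                    # return to the enclosing mode
        self.stack.pop()

lexer = ModedLexer()
lexer.gosub('ASM')    # saw 'asm {' while lexing C++
assert lexer.mode == 'ASM'
lexer.pop()           # saw the matching '}' in assembly mode
assert lexer.mode == 'CPP'
```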
Each lexer must operate with some input character set. Most lexers are restricted to 7-bit ASCII, possibly with a simple extension to ISO-8859-1 because of the ubiquity of 8-bit bytes in European languages. Lexing the 65,536 characters of Unicode's Basic Multilingual Plane is harder, because one cannot simply use tables indexed by an 8-bit code, but in times of globalization it is important to handle everybody's language with aplomb.
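One standard fix, sketched below in Python, is to classify code points by binary search over sorted intervals instead of indexing a byte-sized table; the intervals shown are a tiny illustrative subset:

```python
import bisect

# Character classification by interval table rather than a 256-entry
# byte-indexed array.  Each entry marks where a class begins.
STARTS = [0x0041, 0x005B, 0x0061, 0x007B, 0x0391, 0x03AA]
CLASS  = ['LETTER', 'OTHER', 'LETTER', 'OTHER', 'LETTER', 'OTHER']

def classify(ch):
    i = bisect.bisect_right(STARTS, ord(ch)) - 1
    return CLASS[i] if i >= 0 else 'OTHER'

assert classify('Z') == 'LETTER'     # ASCII
assert classify('Ω') == 'LETTER'     # Greek capital omega, U+03A9
assert classify('+') == 'OTHER'
```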
After recognizing a string, it is often useful to process it into the target system's native representation (e.g., convert "5.73E-27" into its equivalent IEEE float). Conversion support indicates what the lexer provides to help with this. In many languages, a keyword may be an identifier in certain contexts; DMS allows the lexer to ask the parser if this is possible, and thus correctly classifies the name as keyword or identifier.
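A sketch of both ideas, with hypothetical helper names: radix-aware literal conversion, and a lexer that consults the parser before classifying a keyword:

```python
# Lexer-side literal conversion: the recognized spelling is converted
# once into a native value, so later phases never re-parse text.
def convert_number(text):
    if text.lower().startswith('0x'):
        return int(text, 16)          # radix-aware integer
    if any(c in text for c in '.eE'):
        return float(text)            # '5.73E-27' -> 5.73e-27
    return int(text, 10)

assert convert_number('5.73E-27') == 5.73e-27
assert convert_number('0x2A') == 42

# Keyword-vs-identifier: classify correctly by asking the parser
# whether a keyword can appear at the current parse position.
def classify_name(text, parser_accepts_keyword):
    KEYWORDS = {'if', 'while', 'return'}
    if text in KEYWORDS and parser_accepts_keyword(text):
        return 'KEYWORD'
    return 'IDENTIFIER'
```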
For complex inputs, one file will often "include" another. This criterion indicates what support the lexer provides for nested input streams.
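A sketch of nested include handling as a stack of open input sources; the `include "name"` directive syntax below is invented for illustration:

```python
import io

# Expand include directives recursively; the call stack acts as the
# stack of open input sources.  Directive syntax is hypothetical.
def read_nested(source, open_file=io.open, depth=0):
    assert depth < 50, "include nesting too deep (cycle?)"
    for line in source:
        if line.startswith('include "'):
            name = line.split('"')[1]
            with open_file(name) as inc:
                yield from read_nested(inc, open_file, depth + 1)
        else:
            yield line
```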
Multiplicity indicates if the lexing engine can handle multiple lexer instances (of the same lexer, or of different language lexers) in a single program execution.
Performance has to do with the rate at which the lexer processes characters. For very big inputs, this matters. For small inputs, this is irrelevant. For most source code analysis and processing tasks, very fast lexing isn't necessary; most of the time will be spent in the analysis or the transformation of the code.
Target language indicates the implementation language in which each lexer generator emits its generated lexers.
Lexer Generator | Expressibility | Format Capture | Mode Management | Character Sets | Literal Conversions | Include files | Multiplicity | Error Recovery | Performance | Target Language |
---|---|---|---|---|---|---|---|---|---|---|
DMS | standard, not regexp, recognized lexeme extension, arbitrary scan-ahead, Unicode character groups | comments, radix, lead zeros, quote type, keyword spelling, file/line/col | goto, gosub, push, pop | 7 bit ASCII, ISO8859-1 to -16, UTF-8, UTF-16, EBCDIC with 80 column breaks, Windows code pages 1250-1258, Japanese Shift-JIS, custom | character, integer, precise float, rationals, string, escaped string, keyword/identifier fill lexeme | nested | Yes | Auto | Medium; MMAP file read | PARLANSE |
Flex | standard | No | goto | ASCII, ISO8859-1 | none | none | Yes | Auto | Fast | C |
Parser Generators
Parser Generators accept grammars and produce parsers. Key issues are:
- Notation Expressiveness
- Grammars Available
- Class of language parseable
- Semantic Conditions
- Semantic Actions
- Substring/Pattern Parsing
- Tree Building
- Error reporting and recovery
- Performance
- Target language
Notation expressiveness refers to the type of extended BNF allowed by the parsing engine. Most parser generators allow a left-hand side to be tied to right-hand sides having at least token sequence alternation, often Kleene star/plus, sometimes operator precedence, possibly in-lined predicates. Many allow inlined actions to enable custom code for processing on phrase recognition (e.g., tree building). The DMS BNF grammar can actually accept all of these, but for most DMS work, DMS uses a remarkably simple form of BNF which meshes extremely well with the attribute-grammar evaluators that DMS can build on top of this syntax. DMS can be used to transform a more complex EBNF into the standard DMS BNF, as sketched below.
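To make the lowering concrete, here is a small sketch of our own, in Python, of how one EBNF repetition can be mechanically reduced to the simple left-recursive BNF over which attribute evaluation is defined:

```python
# Lower the EBNF convenience  lhs = element { separator element }
# into two plain BNF productions; attribute-grammar evaluation then
# attaches one rule per production.
def lower_repetition(lhs, element, separator):
    return [(lhs, [element]),
            (lhs, [lhs, separator, element])]

# arguments = expression { ',' expression }  becomes:
#   arguments = expression
#   arguments = arguments ',' expression
print(lower_repetition('arguments', 'expression', "','"))
```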
Some uses of a parser generator are for small custom languages driving an application. DMS's emphasis is on arbitrary languages, with a focus on analysis and transformation. Often, one wishes to process application source code in standard programming languages such as C, COBOL, etc. What matters here is having a robust language definition, handling the variety of dialects that these languages really have, and supporting the additional infrastructure required to really "read" these languages, such as include files, macros, preprocessors, class file readers, etc. DMS is distinguished by having a very large list of languages and dialects that have been validated against large volumes of production code, notably C++ with full preprocessing capabilities; uniquely, the DMS C++ preprocessor has the ability to NOT expand preprocessor directives, enabling transformation of the source code as seen by the programmer.
Most parser generators support a limited class of languages. YACC and CUP support LALR(1); JavaCC supports LL(k). The good news about such parser generators is that they have great formal characterizations of the class of languages they can process. The bad news is that most real languages do not fall into these categories. LL parser generators are often characterized as easier to understand, but they suffer from the problem that they cannot handle left-recursive BNF rules. Worse, real languages have ambiguous syntax, meaning several parses of some phrases are possible; since many of these parsing technologies choose the accidental "first available parse", they typically cannot choose the one necessary parse unless backtracking is built in (which it often is not).
With such parser generators, the grammar author must twist his grammar, his lexer, his brain, and code odd custom semantic conditions and actions to get a real working grammar. One consequence of this is that grammar definitions for individual parser generators are almost always difficult to transfer to another parser generator. DMS parses arbitrary context-free grammars, by virtue of using GLR technology. This means that DMS grammars are shaped by author convenience, not by the parsing technology, and so tend to be considerably closer to what is found in the standards documents describing the languages. One example of this: DMS is able to parse C++ with essentially the ISO grammar rules; this is widely considered to be impossible with LALR or LL technology. It also means that DMS is extremely agile, able to accept new language definitions fast. The proof is in the wide variety of languages that DMS accepts.
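The following toy Python sketch shows what "keeping all the parses" means for the ambiguous grammar `E -> E '+' E | 'a'`; it enumerates whole trees for clarity, whereas a real GLR parser represents the alternatives efficiently as a shared forest:

```python
# Return ALL parse trees for the ambiguous grammar E -> E '+' E | 'a'.
# A deterministic LALR/LL tool must commit to one parse; a GLR tool
# can carry every parse forward and let later analysis discard the
# wrong ones.
def all_parses(toks):
    if toks == ('a',):
        return ['a']
    trees = []
    for i, t in enumerate(toks):
        if t == '+':
            for left in all_parses(toks[:i]):
                for right in all_parses(toks[i + 1:]):
                    trees.append((left, '+', right))
    return trees

print(all_parses(('a', '+', 'a', '+', 'a')))
# [('a', '+', ('a', '+', 'a')), (('a', '+', 'a'), '+', 'a')]
```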
Often, semantic conditions are used to eliminate potential phrase parses, to avoid ambiguity. Most conventional parser generators depend heavily on these, especially top-down tools such as JavaCC and ANTLR. DMS can simply accept all the ambiguous parses and remove the inappropriate ones later, after type information is available. In those cases where it is easy to acquire information that can eliminate incorrect parses (such as loop-end to loop-head matching by line number in FORTRAN), DMS can use semantic predicates to eliminate inappropriate parses during parsing.
Semantic actions are often attached to grammars to carry out the key work (trivial code generation, command execution, or tree building). Virtually all parser generators provide a way to write such semantic actions, which of course means you must write them. Since the goal of DMS parsing is only to produce trees, DMS doesn't have semantic actions, so you don't have to write them. Less code means time saved and bugs not chased.
Most parsers are useful only for parsing complete source documents. For analysis and reengineering tasks, it is extremely useful to parse "patterns", which are code fragments with placeholders. DMS automatically constructs a pattern parser from the main parser. Once you've coded the main parser, you are ready to process code fragments. None of the conventional parser generators do this.
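A sketch of the idea: a pattern is a code fragment whose placeholders (written `\x` below, loosely following DMS's metavariable notation) bind to whole subtrees when matched against an AST:

```python
# Match a pattern tree with placeholders against a subject tree,
# returning the placeholder bindings, or None on mismatch.
def match(pattern, tree, bindings=None):
    bindings = {} if bindings is None else bindings
    if isinstance(pattern, str) and pattern.startswith('\\'):
        if pattern in bindings:       # repeated placeholder must bind
            return bindings if bindings[pattern] == tree else None
        bindings[pattern] = tree
        return bindings
    if isinstance(pattern, tuple) and isinstance(tree, tuple) \
            and len(pattern) == len(tree):
        for p, t in zip(pattern, tree):
            if match(p, t, bindings) is None:
                return None
        return bindings
    return bindings if pattern == tree else None

# pattern:  \x + \x        subject:  (n*2) + (n*2)
pat  = ('\\x', '+', '\\x')
tree = (('n', '*', '2'), '+', ('n', '*', '2'))
print(match(pat, tree))   # {'\\x': ('n', '*', '2')}
```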
Most parser generators don't construct trees, although they provide support for constructing trees by library calls from semantic actions. DMS simply automatically constructs trees. The trees are abstract in the sense that terminals and useless unary productions can be eliminated, but are otherwise isomorphic to the grammar, which is in turn a clean implementation of the language standard, so the DMS engineer "knows" what his trees look like by referencing the standards documents. This is very useful when you have languages like C++ with hundreds of grammar rules. In constructing the trees, DMS automatically attaches format information from lexemes to the value-carrying leaves of the tree, and automatically attaches comments collected by the lexer to the appropriate place in the tree. Finally, each tree node is stamped with the domain (language) that produced it (this allows multiple languages in a single image, and indeed, in a single tree), the tree node type, and complete information about the source file/line/column that produced it. All this information enables reconstruction of the source code, and/or reporting accurately where analysis finds a problem or where transforms are made, as appropriate.
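A sketch (field names ours) of the information stamped on every tree node, which is what makes faithful regeneration and precise reporting possible even with several languages in one tree:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Per-node stamps: the domain (language) that produced the node, the
# node type, and the source position it came from.
@dataclass
class TreeNode:
    domain: str                     # e.g. 'Cpp', 'SQL'
    node_type: str                  # grammar rule / token class
    position: Tuple[str, int, int]  # (file, line, column)
    children: List['TreeNode'] = field(default_factory=list)

embedded_sql = TreeNode('SQL', 'select_statement', ('app.cpp', 120, 9))
host_stmt    = TreeNode('Cpp', 'statement', ('app.cpp', 120, 3),
                        children=[embedded_sql])  # mixed-language tree
```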
When processing lots of source, one invariably encounters malformed source code. Most parser generators provide various ad hoc means for recovering from such errors; YACC inserts an "error" token and the grammar engineer must write productions to swallow it. DMS provides automatic error recovery: it attempts to find a token it can insert that will let the parse proceed, or, failing that, deletes the offending token and moves on. So the grammar engineer need do nothing at all to get useful error recovery.
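A sketch of that recovery strategy, with a stand-in `can_parse` oracle in place of re-running a real parser: try inserting a plausible token at the failure point; failing that, delete the offending token and continue:

```python
# Automatic error recovery by token repair: insert a candidate token
# before the failure point if that lets the parse proceed; otherwise
# delete the offending token.
def recover(tokens, fail_at, candidates, can_parse):
    for t in candidates:
        repaired = tokens[:fail_at] + [t] + tokens[fail_at:]
        if can_parse(repaired):
            return repaired, f"inserted {t!r}"
    return tokens[:fail_at] + tokens[fail_at + 1:], "deleted token"

# Example: 'if (x  y = 1;' fails at 'y'; inserting ')' repairs it.
toks = ['if', '(', 'x', 'y', '=', '1', ';']
fixed, action = recover(toks, 3, [')', ';'],
                        lambda ts: ts[:4] == ['if', '(', 'x', ')'])
print(action, fixed)  # inserted ')' ['if', '(', 'x', ')', 'y', '=', '1', ';']
```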
Performance is always an issue when one tries to read a large amount of source (a 2000 line C++ compilation unit may implicitly read 250,000 lines of include file support!). Many of the conventional parser generators produce very fast parsers (although the typical bottleneck is lexing tokens and eliminating white space!). The DMS parsers are only moderately fast, because of the GLR property, automated tree building, and automated format/comment capture. This is fine for reengineering tasks; you don't actually process a million lines of code with DMS until you are ready, and then it doesn't matter if it takes a few minutes.
Parser generators typically produce parsers in some favored target language. YACC and Bison produce C-style code. JavaCC produces only Java. ANTLR produces C++ and Java. DMS parsers are really designed for the language underlying DMS, PARLANSE; however, DMS transformations can easily be applied to DMS grammars to produce recursive descent parsers in any of the languages for which DMS has a front end.
Parser Generator | EBNF Notation | Grammars Available | Language Class | Semantic Conditions | Semantic Actions | Substring/ Pattern Parsing | Tree Building | Error Recovery | Performance | Pretty Printer/ Fidelity Printer | Target Language |
---|---|---|---|---|---|---|---|---|---|---|---|
DMS | Simple BNF; Token Precedence; Auto-List Formation | Production Quality preprocessor/C/C++0x, C#,Java 1.5, Fortran 77/95, Pascal, Ada, COBOL, Natural, JCL, JavaScript, PHP4/5, VB.net,VBScript,VB6, Python 2/3, HTML, XML, SQL 2011,PLSQL, Verilog, VHDL, SystemC | Full Context Free via GLR | PARLANSE; Check Reduced-To Tree or Left Context | None; (Implicit Tree Building) | Yes/Yes | Auto Build; Auto Lists; Unary elimination; Terminal Elimination; Ambiguity Handling with Shared Subtrees | Auto Token Insert or Delete | Medium (includes tree building); pretty fast DMS lexer | Yes/Yes | PARLANSE |
YACC | Standard | C, but not C++; many others scattered across the web | LALR(1)+ | No | C code | No/No | Hand-coded | Skip to Statement End?; Optional Error Token | Fast, usually because of fast lexer | No | C |
BTYACC (built on YACC) | Standard | All available for YACC; not C++ | Superset of LALR(1); Choose first of ambiguous alternatives | Yes? | C code | No/No | Hand-coded | Skip to Statement End; Optional Error Token | Little slower than YACC | No | C |
Bison | Standard | C (via GNU); older C++ (via old GCC); many others | LALR(1)+; GLR option | No | C code | No/No | Hand-coded | Skip to Statement End; Optional Error Token | Fast (usually because of fast lexer) | No | C |
ANTLR | Standard | Java; C++ (draft); many others, but not production quality | LL(*) | Java | Java | No/No | Hand-coded | Ad hoc | Fast, mostly due to integrated lexer | No | Java |
JavaCC | Standard | Java; not C++; many others of varying quality | LL(k)? | Java | Java | No/No | Hand-coded | Ad hoc | Medium (JIT-compiled code, even for the lexer) | No | Java |
XT Scannerless GLR | Standard | Java; not C++; academic quality | Context free | C | C (usually tree building) | No/Yes via ASF+SDF | Hand-coded | Unknown | Slow; parses using characters as tokens | Yes, via ASF+SDF | C |
Elkhound | Standard | ANSI? C++; C via C++ subset; academic quality | Context free | C++ | C++ (usually tree building) | No/No | Hand-coded | Unknown | Fast (usually because of fast lexer) | No | C++ |
Program Transformation Systems
Transformation systems are intended to process source code to produce improved code in the same or a different language. Often they support some kind of analysis (to find and validate places that need improvement), and so they are often also used simply as analysis tools. Issues include:
- What kind of language, graphical or textual, does the tool process?
- What language can the transformation system process?
- What is the implementation language for the transformation system, and why was it chosen?
- How are transforms encoded?
- How are analyses encoded?
- How are transforms and analyses sequenced to achieve the desired result? ("Metaprogramming")
- How scalable is the tool (along various axes of scalability)
- Has the tool been used for production purposes?
Program transformations come in two forms: procedural (arbitrary functions applied to compiler data structures) and source-to-source (maps between concrete surface-syntax forms of code). Procedural transformations have enjoyed enormous success for some 50 years in classical compilers [AhoUllman] for translating source code to binary and for optimizing code generators. Traditionally, transformations have been considered interesting only if they were correctness-preserving. Compilers also traditionally construct and cache crucial auxiliary information such as symbol tables and control- and data-flow analysis results to support complex context-dependent transformations that preserve correctness.
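To see the contrast concretely, here is the same algebraic simplification, `x + 0 -> x`, written both ways; the rule text is modeled loosely on DMS's rule language and is illustrative only, while the procedural version is a plain Python tree walk:

```python
# Source-to-source style: state the map in surface syntax.
# (Rule syntax modeled loosely on DMS's; illustrative only.)
SOURCE_TO_SOURCE_RULE = '''
  rule simplify_add_zero(x: expression): expression -> expression
    = " \\x + 0 " -> " \\x ".
'''

# Procedural style: walk the compiler data structures explicitly.
def simplify_add_zero(node):
    if isinstance(node, tuple):
        node = tuple(simplify_add_zero(c) for c in node)
        if len(node) == 3 and node[1] == '+' and node[2] == '0':
            return node[0]            # rewrite  x + 0  =>  x
    return node

print(simplify_add_zero((('y', '+', '0'), '*', 'z')))  # ('y', '*', 'z')
```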
More recently, procedural transforms have been proposed to support mechanical refactoring [12] as a technique for restructuring programs. Manual versions of this have been popularized [7] as a methodology with suggestions for tool support. Tools for refactoring Smalltalk [18] and Java have started to appear. The better Java tools [23,24] do some sophisticated refactorings such as ExtractMethod; others in the market require some manual validation of the steps. Some experimental work has been done in refactoring C++ [21], and some commercial C++ refactoring tools are available [XRefactory]. All of the above tools can only apply predefined sets of transforms. OpenC++ [ ] uses a custom recursive-descent C++ parser front end that builds ASTs with customizable procedural transforms to enable some simple transforms on C++ syntax. The results of these tools must be manually (re)validated, because most of them do not construct the auxiliary information needed to ensure that the transforms are truly correctness-preserving.
Many tools have been constructed that analyze source code by parsing to compiler data structures and applying procedurally encoded analyses. More recently, sophisticated reengineering tools such as the ATX Reengineering environment [11], based on parsing to compiler data structures and applying predefined procedural transforms, have started to appear.
Source-to-source program transformations were originally conceived as a method of program generation in the 1970s [2] and led to the notion of domain analysis to support transformational domain-specific program generation in the seminal Draco system [12], which accepted language ("domain") definitions and source-to-source transformations for each language, using recursive-descent parsing. The technology has continued to develop, moving into scientific computing [10], and is used commercially for sophisticated financial instruments [Kant Synapse 2005].
The idea that program transformations could be used for software maintenance and evolution by changing a specification and re-synthesizing was suggested in the early 80s [Balzer85 need paper ref]. Porting software and carrying out changes using transformations were suggested and demonstrated in the late 80s [1]. Simple source-to-source "evolution" transforms used to change specification code were suggested by Feather [9]. Evolution transforms led to the key idea of using correctness-preserving transforms to change nonfunctional properties of software, and non-correctness-preserving transforms to change functional properties of a software system. Theory about how to modify programs incrementally using "maintenance delta" transformations and previously captured design information was suggested in 1990 [3][4]. This work defined an array of transformation support technology, which inspired the DMS Software Reengineering Toolkit.
Source-to-source transformation as a serious tool for software evolution is largely unrealized in practice, although there have been some very interesting applications. Following the Draco lead, Refine [6,17] was a ground-breaking software transformation engine originally intended for program generation from high-level specifications, but it found modest success when used for some black-box commercial automated migration projects on legacy code [Newcomb85]. Refine tended to be limited by its LALR parser generator, which does not handle most legacy languages very well.
TXL [Cordy early, CordyLCT04?] was used to find and repair Y2K flaws in reportedly 4 billion lines of legacy code, much of it COBOL, and is still available [ref]. TXL uses a backtracking parser, which makes it somewhat more robust for parsing legacy languages, but it does not handle ambiguities well. The TXL transformation philosophy gets around this partly by partitioning massive transformations into separate stages, each operating on an abstracted, and therefore less complex, grammar specific to that stage.
The ASF+SDF system [??] enables the specification of arbitrary languages, using GLR parsing technology, which can handle arbitrary context-free grammars. ASF+SDF parsers work at the character level, allowing convenient composition of arbitrary sets of grammar rules. Transformations are specified not in terms of concrete surface syntax, but rather in terms of abstract syntax. Stratego [Visser] augments ASF+SDF with a domain-specific language for specifying complex traversals of the abstract syntax tree and sequencing of transforms. Both ASF+SDF and Stratego have been used as porting and mass source-code change tools in research projects, but apparently not commercially.
Two key stumbling blocks for transforming C++ are the problem of parsing it, and the even more difficult engineering task of constructing an accurate symbol table, both because of the complexity of the language semantics and because of the differences in the C++ dialects in use. CodeBoost [ ] combines the Edison Design Group (EDG) commercial front-end C++ parser with Stratego to research the application of library-call-optimizing transforms to C++ code; we believe transforms are specified in terms of abstract syntax. CodeBoost does not appear to have been used to produce production code. Rose [ Quinlan ] combines the EDG front end with recursive calls on that front end to enable writing C++ source patterns used by C++ procedural transforms, to help optimize scientific codes written in C++. Besides being a production C++ parser, the EDG front end provides a symbol table, crucial to implementing sophisticated context-sensitive transforms. AspectC++, built on top of Puma, combines a hand-constructed C++ parser and a symbol table adequate for research tasks, to enable customized procedural transforms implementing AspectJ-like transforms on C++ code. AspectC++ is not used for general C++ transformation.
The only production-quality C++ parsers available that are known to us are the GNU hand-coded parser, the EDG parser, and the one for DMS; all three are successful, in our opinion, not only because of the ability to parse, but especially because of the invested effort to engineer production-quality symbol tables. While GNU is very good as a compiler, it is not designed as a general program transformation system. EDG's front end is just that: it builds an AST and a symbol table, but provides nothing beyond that point to support transformation. ASF+SDF's parsing machinery, like DMS's, is GLR and can arguably be used to parse C++ directly, but that does not produce a symbol table. Of all of these systems, only DMS can accept true surface-syntax transforms and apply them. Rose comes close, but can only work with C++, whereas DMS's source-to-source engine is independent of the particular grammar yet can still use the grammar's syntax. The DMS parsers, being reengineering parsers, capture macros, preprocessor conditionals, include files, comments and lexical formatting information in a regenerable way, which none of the other tools accomplish.
One atypical aspect of the DMS Toolkit is its explicit encouragement to mix source-to-source and procedural transformation. While this might sound awkward, combining the succinctness of source-to-source transforms with procedural control, and the more direct capability of procedural transforms to manipulate complex structures, is remarkably powerful, as the sketch below suggests.
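A sketch of the mixed style: the individual rewrites are declarative tree-to-tree maps, while a procedural driver chooses the traversal order and applies them to a fixed point:

```python
# Declarative rewrite rules, expressed as pure tree-to-tree functions.
def add_zero(n):  # x + 0 -> x
    return n[0] if isinstance(n, tuple) and n[1:] == ('+', '0') else n

def mul_one(n):   # x * 1 -> x
    return n[0] if isinstance(n, tuple) and n[1:] == ('*', '1') else n

RULES = [add_zero, mul_one]

# Procedural driver: bottom-up traversal, rules applied to a fixpoint.
def rewrite(node):
    if isinstance(node, tuple):
        node = tuple(rewrite(c) for c in node)   # rewrite children first
    changed = True
    while changed:                               # procedural fixpoint
        changed = False
        for rule in RULES:
            new = rule(node)
            if new != node:
                node, changed = new, True
    return node

print(rewrite((('x', '+', '0'), '*', '1')))      # 'x'
```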
We believe the Boeing Migration tool (REF) would not have been practical without the robust DMS C++ front end, the reengineering parsing machinery, and the combined power of the mixed rewrites.
Transformation System | Status | Grammars Available | Transformation Style | Analyzers | Scale | Performance | Implementation Language |
---|---|---|---|---|---|---|---|
DMS | Production; Commercial; Active enhancement | Production preprocessor/C/C++0x, C#, Java1.5, Fortran 77/95, Pascal, Ada, COBOL, Natural, JCL, JavaScript, PHP4/5, VB.net,VBScript,VB6, Python 2/3, HTML, XML, SQL 2011,PLSQL, Matlab M, Simulink, Verilog, VHDL, SystemC; Extensible to others | Source-to-Source and/or Procedural; Procedural Metaprograms | Parallel Attribute Grammars; Control Flow; Data Flow; Local Points-to; Global Points-to; Call Graph; Symbolic Range; Custom PARLANSE Code | Multilingual; Tens of thousands of files; Parallel implementation | Excellent; Uses SMP | PARLANSE (parallel) |
TXL | Very stable; Academic; Used commercially in past | Java but not C++; Others as partial "island" grammars; Extensible to others | Source-to-Source; Functional style | Implement using transforms | Big single files; Multilingual via union grammars | Good | Turing |
ASF+SDF | Stable; Academic | Java but not C++; many of university quality; Extensible to others | Abstract Syntax to Abstract Syntax | Implement via transforms; Escape to C | Unknown | Excellent for rewrites? Modest for parsing due to Scannerless GLR | C |
Stratego | Stable; Academic; Active enhancement; Built on ASF+SDF | Java but not C++; many of university quality; Extensible to others | Abstract Syntax to Abstract Syntax; Dynamic Rules; Metaprogramming via explicit Strategies | Implement via transforms; Escape to C | Unknown | Excellent for rewriting; Modest for parsing due to Scannerless GLR | C |
CodeBoost | Stable; Academic; Built on OpenC++ and Stratego | C++ via OpenC++ | Stratego style | Implement via transforms; Escape to C? | Unknown | Unknown; Stratego transforms likely excellent | C |
OpenC++ | Stable; Open Source | Most of ANSI C++ No preprocessor; Limited template handling | Procedural analysis/transformation in C++ via Visitor pattern | Ad hoc coded in C++ | Single source file plus definitions from includes? | Unknown | C++ |
ROSE | Stable?; Research; Active enhancement | C/C++ via EDG front end; Fortran90?; Not extensible | Source-patterns for matching and instantiating; Procedural transforms in C++ | Procedural in C++ | Large numbers of files | Unknown | C++ |
Fermat | Stable; Research; Applied commercially by SMLtd | C, COBOL of university quality; Not easily extended because of requirement to map to WSL | Encodes programs in WSL language; WSL source-patterns for matching and instantiating; Procedural transforms in MetaWSL | Procedural in MetaWSL | Tens of thousands of files (assembler) | Several thousand lines per minute | MetaWSL, PERL |
Tom | Stable; Active enhancement | No conventional languages; Handles small DSLs well | Tom patterns/rewrite rules embedded in host language Java | Tom rewrite rules; Escape to Java? | Unknown | Fast | OCAML? |
CodeWorker | Stable; One-man show; Active enhancement | Java? but not C++; several partial grammars; Arguably extensible to others | Procedural in "CodeWorker" language | Procedural in "CodeWorker" | Unknown | Unknown | Java? |
XML/XSLT | Production; Active enhancement | JavaML; GCCXML (incomplete); Extensible to others by using some parser generator with ad hoc XML export | XSLT functional transformation | Weak analysis via functional transformation | Poor because of XML document sizes | Poor | C/C++/Java |
Rascal/MP | Academic; Experimental; Active enhancement | Several DSLs; Java?; Extensible to others by using Rascal GLL parser generator | Mixed source-to-Source with arbitrary Rascal procedural code | Custom coded in Rascal relational calculus | Poor (see performance) | Interpreted; reported "very slow" by implementers | Java? |
C++ Front Ends
C++ parsers ("front ends") are intended to read C++ source code and build compiler data structures representing that code, with the expectation that a custom tool will be built to process those data structures for analysis or transformation of C++.
Issues for C++ front ends are:
- What dialects of C/C++ are handled?
- Is there an integrated preprocessor?
- Does the front end capture comments and lexical format information?
- What compiler data structures are built? (AST, Symbol tables, ...?)
- How can those structures be processed?
- What kind of support for analysis is available?
- Is the tool designed to process just one file, or entire systems of files?
- Can the front end operate with other languages (SQL, ...?)
C++ Front Ends | Status | C/C++ Dialects Available | Post-Parsing Capability | Scale | Performance | Implementation Language |
---|---|---|---|---|---|---|
DMS | Production; Commercial; Active enhancement; Integrated Preprocessor with "no expand" option | Production; C & C++; C++0x; ANSI,GCC, MSVC6, MS Studio 2005, SystemC; Many other languages | Procedural tree processing; Name/Type resolution with or without preprocessing; Source-to-Source Transformation; Attribute Evaluation; Access to DMS control, data, points-to analyzers; Regeneration of source with comments and preprocessor directives | Tens of thousands of source files; Multilingual processing | Parallel implementation | PARLANSE (parallel) |
EDG | Production; Commercial; Active enhancement; Integrated Preprocessor | Production; Most C/C++ dialects | Procedural tree processing; Name/Type Resolution after preprocessing | Single source file, includes processed but lost? | Fast, hand tuned | C |
GCC | Production; Open Source; Active Enhancement; Integrated Preprocessor | Production; GCC, ANSI | Arbitrary procedural analysis and transformation; but tangled up in GCC implementation | One source file plus includes | Fast, hand tuned | GCC C |
GCCXML (built on GCC) | Stable?; Academic; Integrated Preprocessor | Production; GCC, ANSI | XML export of declarations; No information about function bodies | Declarations from single source file and its includes | Poor; XML documents are huge | C |
Elsa (built on Elkhound) | Stable?; No preprocessor; Academic | C as C++ subset; ANSI? C++; Site claims some limits on template support | Name/Type resolution after preprocessing; Procedural tree processing in C++ | Single file plus includes | Unknown | C++ |
Columbus | Stable?; No preprocessor; Academic/Commercial? | ANSI C?/ANSI C++; Limited template support? | Procedural tree processing | Single file plus includes | Unknown | C++? |
OpenC++ | Stable; No preprocessor; Open Source | Most of ANSI C++; Limited template handling | Procedural analysis/transformation in C++ via Visitor pattern; Regeneration of preprocessed source | Single source file plus definitions from includes? | Unknown | C++ |
ANTLR | ANSI; No preprocessor; Stale C++ definition; Not used anywhere in practice | C++ (draft); Some name/type resolution while parsing | Procedural tree processing | Single preprocessed file? | Moderate? | Java |
CLang/LLVM | Stable; Open Source; Active enhancement; Integrated preprocessor | ANSI C++; GCC dialect? | Name/type resolution while parsing; Procedural tree processing; Regeneration of source (w/ comments? preprocessor?); Native code generation via LLVM; Varying flow analyses via LLVM | Single preprocessed file? | Fast, hand tuned | C++ |
Compiler Infrastructures
Compiler infrastructures are intended to make it easier to construct compilers or compiler-like tools such as source code instrumenters.
Issues for compiler infrastructures are:
- What languages can the infrastructure process?
- What compiler data structures are built? (AST, Symbol tables, ...?)
- Does the front end capture comments and lexical format information?
- How can those structures be processed?
- What kind of support for analysis is available?
- How hard is it to customize the infrastructure for the new task?
- Is the tool designed to process just one file, or entire systems of files?
- Can the front end operate with other languages (SQL, ...?)
Compiler Infrastructures | Status | Languages Available | Post-Parsing Customization Capability | Scale | Performance | Implementation Language |
---|---|---|---|---|---|---|
DMS | Production; Commercial; Active Enhancement | Production preprocessor/C/C++0x, C#, Java1.5, Fortran 77/95, Pascal, Ada, COBOL, Natural, JCL, JavaScript, PHP4/5, VB.net,VBScript,VB6, Python 2/3, HTML, XML, SQL 2011,PLSQL, Matlab M, Simulink, Verilog, VHDL, SystemC; Extensible to others (e.g. DSP assembly); Easily extended to include DSLs for graphs, code generation, etc. | Procedural tree processing; Name/Type resolution with or without preprocessing; Source-to-Source Transformation; Attribute Evaluation; Regeneration of source with comments and preprocessor directives; Control, dataflow, and global points-to analyzers; Used to build a wide variety of source-analysis or instrumentation tools, including code generators and vectorizers. | Tens of thousands of source files; Multilingual processing | Parallel implementation | PARLANSE (parallel) |
GCC | Production; Open Source; Active Enhancement | GCC; ANSI | Arbitrary procedural analysis and transformation; Code-generation focused flow analysis; Tangled in "wants to be a compiler for every possible machine" | One source file plus includes | Fast, hand tuned | GCC C |
LCC | Stable; Academic | ANSI C | Arbitrary procedural analysis and transformation; Tangled in "wants to be a compiler" | One source file plus includes | Fast, hand-tuned | C |
SUIF | Stable; Academic; No preprocessor | C and FORTRAN 90? | Procedural tree processing; Name/Type Resolution after preprocessing; Data flow analyses; Regeneration of preprocessed source? | Multiple files? | Unknown | C++? |
Open64 | Stable; Open Source | ANSI C, C++?; FORTRAN 90? | Arbitrary procedural analysis and transformation; Data flow analysis; Regeneration of preprocessed source?; Tangled in "wants to be an SGI compiler" | Single source file plus definitions from includes? | Likely fast, used as production compiler? | C++ |
CLang/LLVM | Stable; Open Source; Active enhancement; Integrated preprocessor | ANSI C++; GCC dialect? | Name/type resolution while parsing; Procedural tree processing; Regeneration of source (w/ comments? preprocessor?); Native x86 code generation via LLVM; Varying flow analyses via LLVM; Custom analyzers in C++ | Single preprocessed file? | Fast, hand tuned | C++ |
Note: This page is under construction.
ToDo: References. Links. CodeGenerators.