===================
Accent requirements
===================

This document describes the requirements for Accent, an extensible PHP source
code highlighter.

Scope
=====

The basic scope of this component is highlighting the source code of various
programming languages. It is limited to common text-based, one-dimensional
programming languages and is not designed to work for esoteric languages.

Key requirements
================

These are the key requirements, which should be fulfilled in any case.

Customizable output formats
---------------------------

Several existing syntax highlighters directly generate HTML, which is not
feasible for some applications. Accent should offer different output formats,
which include:

- HTML

  This is useful for direct rendering. There should be options for generating
  CSS classes for the detected tokens, inline style information, or even font
  tags as HTML markup. These variants might be implemented as different
  visitors.

- Line based token stream

  Line-based annotations of the input file(s), where each line may in turn be
  split up into several tokens of different types. This would make it easier
  to combine the highlighted source with additional information, like blame
  output from version control systems.

- Abstract syntax tree

  Since some kind of annotated abstract syntax tree has to be built during
  processing anyway, a simplified version might be useful as a result for some
  applications.

Handling input data
-------------------

It should be possible to read the input to be highlighted from different
sources, like files and strings. Generic handling of PHP streams should
therefore be sufficient.

Language definitions
--------------------

Existing syntax highlighters use custom definition formats to annotate the
source files with the highlight information. Since nearly every language
definition can be described by a context-free grammar [#]_ (C++ being an
exception), EBNFs [#]_ with associated "sorts" should sufficiently describe
the highlight information.
Such highlight files would also be easy to write, at least for computer
scientists.

Sorts
^^^^^

Sorts describe the semantics of items in a transformed or annotated abstract
syntax tree. The basic abstract syntax tree can be generated directly from the
input data and the given EBNF. The plain abstract syntax tree is often not
useful for further processing, because it might contain a lot of additional
overhead, as shown in the following example::

    Number ::= OctalNumber | DecimalNumber | HexaDecimalNumber | Float
    OctalNumber ::= /0[0-7]+ ...

It has turned out that it is often useful to introduce such separate
non-terminal symbols inside the grammar. The abstract syntax tree for the
input string "012 + 0x2f" would then look something like::

               /-- Number --- OctalNumber
    Addition -+
               \-- Number --- HexaDecimalNumber

This level of detail is irrelevant to the highlighter, so we can reduce the
tree to an annotated abstract syntax tree based on the associated sorts. For
example, we can associate the sort "number" with the non-terminal symbol
"Number" and the sort "expression" with the non-terminal symbol "Addition". If
we then reduce the abstract syntax tree to contain only items with associated
sorts, we get something like::

                 /-- number
    expression -+
                 \-- number

This makes more sense for the associated highlighter.

.. [#] http://en.wikipedia.org/wiki/Context-free_language
.. [#] http://en.wikipedia.org/wiki/EBNF

Environmental requirements
==========================

Requirements which are not related to the actual functionality, but which
should be met by the software.

- Runs with PHP 5.2 and 5.3

  Older versions of PHP are irrelevant. The software should not issue any
  errors or warnings, including E_DEPRECATED and E_STRICT notices.

- Tested with PHPUnit [#]_, including a test coverage of at least 95%.

- Complete English API documentation. Each method, class and property has to
  be documented using the PHPDoc [#]_ standard.

.. [#] http://phpun.it/
.. [#] http://www.phpdoc.org/

Implementation ideas
====================

Some basic ideas on how the different parts may be implemented.

EBNF
----

The EBNF should be available as a textual definition. It should be sufficient
to include only non-terminal symbols and regular expressions. A generic
parser could handle all kinds of input EBNFs, but it has turned out that it
might be easier (and faster) to generate a parser based on the given EBNF.

NT to sort conversion
---------------------

For most cases a simple mapping of non-terminal symbols to sorts will be
sufficient. Therefore it will probably be useful to either embed this mapping
directly in the EBNF definition or provide a dedicated file with the mapping
definitions. It should still be possible to write a custom mapper, which maps
non-terminal symbols to sorts based on other aspects.

AAST visitor
------------

Since the annotated abstract syntax tree (AAST) is just a tree, the visitor
may simply iterate over it and generate the output. The AAST could implement
the RecursiveIterator interface for this purpose. For that, the AAST nodes and
leaves need to contain at least the following information:

- sort
- textual content
- line
- position in line
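
The visitor idea can be illustrated with a minimal sketch. It is written in
Python only for brevity; Accent itself targets PHP, where the traversal could
sit behind the RecursiveIterator interface. The names ``Node``, ``leaves``,
``to_html`` and ``to_line_tokens``, as well as the "operator" sort, are
assumptions made for this illustration and are not part of any existing Accent
API. The sketch shows one tree feeding two of the output formats listed under
the key requirements: HTML with CSS classes and a line based token stream.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Dict, Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical AAST node with the minimum information listed above."""
    sort: str
    text: str = ""      # textual content; left empty for inner nodes
    line: int = 1       # 1-based line number in the input
    position: int = 0   # 0-based offset within that line
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> Iterator[Node]:
    """Depth-first traversal that yields the leaves in source order."""
    if not node.children:
        yield node
        return
    for child in node.children:
        yield from leaves(child)


def to_html(root: Node) -> str:
    """HTML visitor: wrap each leaf in a span using its sort as CSS class."""
    return "".join(
        '<span class="{}">{}</span>'.format(escape(leaf.sort), escape(leaf.text))
        for leaf in leaves(root)
    )


def to_line_tokens(root: Node) -> Dict[int, List[Tuple[int, str, str]]]:
    """Line based visitor: map each line to (position, sort, text) tuples."""
    lines: Dict[int, List[Tuple[int, str, str]]] = {}
    for leaf in leaves(root):
        lines.setdefault(leaf.line, []).append(
            (leaf.position, leaf.sort, leaf.text)
        )
    return lines


# Reduced tree for the input "012 + 0x2f". The "operator" leaf is an
# assumption added here so that the complete input text can be reproduced.
tree = Node("expression", children=[
    Node("number", "012", line=1, position=0),
    Node("operator", " + ", line=1, position=3),
    Node("number", "0x2f", line=1, position=6),
])

print(to_html(tree))
print(to_line_tokens(tree))
```

Keeping the traversal (``leaves``) separate from the output functions means
that a further output format only needs another small function over the same
iteration, which is the main point of the visitor approach described above.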