Detailled Proposal

This is the detailled project proposal, updated with comments from mentors and changing requirements.

Primary Goals of the Project

  • Refactor the existing code to a standalone library
  • Polish and extend the AST
    • Extend the AST and the pretty printer, which are incomplete at the moment, so everything parsed is accurately modeled.
    • Provide an interface to the AST that allows flexible use in different kind of projects.

As Manuel pointed out, we probably do not want to change the AST dramatically, but rather improve it and provide an interface on top of it.

  • Extensively test if the parser is correct
    • Double check the existing lexing and parsing code.
    • Define an equivalence relation on the AST and test the framework by pretty-printing and parsing again.

This has already been implemented (to some extend). "Have to be non-equivalent" tests can be used to check that fragments are recorded properly. Finally, "does not compile" can be used to ensure that the parser isn't to liberal - tough this has a rather low priority.

  • Do so using some large C projects (there are plenty available) and the gcc test suites.
  • Improve the performance of the Alex lexer by switching to use ByteStrings?.

Performance testing will ensure that the parser has a predictable decent overall performance. Some improvements, like switching to ByteString? for lexing, will be included as well.

  • Extend the code's documentation.
  • Integrate the parser back into c2hs. There is also the opportunity the reuse existing analysis code from c2hs.

Secondary Goals

  • Create a testing suite verifying that c2hs computes size and offsets for marshalling right
  • Make sure the AST works well with generic programming libraries, such as the Strafunsky. For this purpose, I will create Data / Uniplate instances and write some small demonstrations of static analysis using one of the generic programming frameworks available.
  • Code Generation: Investigate if the framework, or some parts of it, could be used with the new quasi-quoting extension of ghc to provide a C code generator. Alternatively, investigate the integration of a combinator based approach for generating C ASTs.
  • Extend the parser to (externally) record comment locations, for projects which need to extract them.
  • Support checks whether the AST is included in some C - subset (C89, no gnu extensions, MISRA C).

Philippa J. Cowderoy noted that it would be nice to be able to check wheter an AST is e.g. valid C89. The first task (checking whether the AST is some C-subset) was also requested by Adrian C. Hey pointing to MISRA C.

  • Philippa also asked whether it would be possible to check if e.g. a String is a valid identifier.

Deliverables

The project should at least result in:

  • A cabalized Language.C library including
    • A parser for C99 + extensions
    • A complete pretty printer for the resulting AST
    • A test harness
    • Haddock Documentation
  • A patch for c2hs to use the library

Benefits

  • c2hs, and therefore projects which depend on it such as gtk2hs, benefit from improved maintainability and extensability, and from the testing efforts, some of which are specifically aimed at that project.
  • Writing bindings to C libraries is still the easiest way to get powerful external libraries to haskell. A standalone C parser will further push the development of tools to simplify the process of writing such bindings.
  • As a complete C parser is made available to other projects, this would make haskell more attractive for projects which depend on parsing C code, e.g. model checking C programs.