Untangling the Web: Literate Programming

Originally published  October, 2000
by Carlo Kopp
© 2000, 2005 Carlo Kopp

Literate Programming is one of the lesser known but no less exceptional creations of Prof Donald Knuth, in many respects one of the founding fathers of modern computing. While the fundamental idea of Literate Programming has existed for almost two decades, it remains a technique which has not found a large following in the industry.

In this month's issue we will explore the ideas in Literate Programming and consider some of the longer term implications of this interesting and useful paradigm.

Software and Documentation - The Established Process

The modern practice of software engineering represents in many respects a codification of what are known to be good habits or practices in coding technique and documentation.

In a well implemented environment, products are proposed in formal documents which describe the functions of the product, intended markets, platforms and such. Once approved for development, a formal functional specification for the product is produced, which details its features and if done particularly well, breaks the product structure down further.

A development specification may follow later, produced by a system architect or team thereof, who take the formal functional specification and use it to produce a detailed description of all of the functional modules of code intended to be used in the program. This can be common especially in very large projects.

At this stage, code cutters are cut loose, pun intended, and proceed to convert the specifications defining the function / internals of the product into working code modules. Concurrently, other developers produce a test specification and test environments for the emerging product, using the formal functional specification or development specification as a basis.

Each programmer, fearlessly dedicated to the greater good, long term corporate profits, ideals of highly maintainable code and his or her high personal ethos of consideration for other fellow code cutters, then dutifully and meticulously documents each code module with comments in the source file, and subsequently produces a nicely laid out and highly readable hard copy or electronic technical manual describing how the product internals actually work. Supported by committed project leaders and managers, who carefully budget out the time and assets required to achieve these lofty goals, the product development process flows through and completes successfully ....

Does this sound familiar ? Have you, the practicing code cutter, ever experienced it ?

The sad reality is that the formal and structured, well planned methodology just described is most frequently not followed. In part this is usually a result of programmers who detest any documentation task, in part it is a result of project timelines which are estimated incompletely or worse, intentionally made to appear better than they are expected to be, and in part due to the unavailability of suitable tools to integrate the process. While it is true that many good development environments now exist, they are frequently expensive and frequently also impose a learning curve delay which few companies are prepared to accept.

One of the key points of breakdown in practical software engineering is the link between the source code, comments in the source code, and the technical manual describing the inner workings of the code. Frequently comments in the code are totally rudimentary, if not non-existent, and the technical manual is whipped up in a frenzied week of writing, dedicating a page or maybe two to several thousand lines of code.

This breakdown in the idealised software engineering model can cause difficulties during the development phase of a product, especially in its latter parts. However, where it bites the most is in the latter portions of a program life cycle, after the original development team has largely scattered to the winds, lured by better salaries, newer and more interesting projects, or simply sacked as an unwanted overhead dragging the company down. The result is that a hapless newbie programmer, or code cutter experienced in another product, has to wade through tens or hundreds of thousands of lines of code to attempt to isolate bugs or introduce a new feature desperately insisted upon by the intrepid marketeers.

With a rudimentary technical manual and scant or non-existent comments in the code, this type of exercise can become very expensive indeed, since vast numbers of person-hours (to be PC) are absorbed by the process of reading the code and attempting to divine its function.

So serious was this issue considered to be in the defence contracting game, where a piece of code may have a life cycle of decades, that the much maligned Ada language was from the outset designed to favour readability and maintainability over ease of writing. That Ada has produced a schism in the programming community is a well known fact.

Are there other alternatives in tackling this painful issue ?

Literate Programming

Literate Programming (LP) is one technique which is designed to avoid or bypass the tendency to divide the program development process into coding and documenting.

Prof Donald Knuth at Stanford is widely regarded as one of the key architects of the modern programming paradigm; indeed his series of textbooks, The Art of Computer Programming, has been an integral part of most Computer Science undergraduate courses over the last three decades. The typesetting language TeX, which Knuth created, and its offspring LaTeX, remain the preferred medium for the production of academic papers and many scientific texts. Knuth's fundamentals texts and TeX have produced a major impact in the computing community, especially in teaching.

LP is yet another of Don Knuth's great contributions to the discipline, but as yet it has not conquered the world in the manner that his texts and TeX have done.

The central idea of LP is to merge the processes of coding and documentation into a single task, using a single source file which encapsulates both the source code and the technical manual.

Knuth's philosophical argument is a very simple one: a programmer should be writing a program as a document to be read by other humans, rather than a series of instructions to be executed by a machine.

Indeed, in his original 1984 paper in The Computer Journal, entitled Literate Programming, Knuth argues persuasively that a program should be a work of literature first and foremost, aimed at human readers. The toolset should perform the conversion into something which a compiler can digest.

Knuth's position is that the programmer should become an essayist, whose main concern is with excellence and exposition of style in the practice of programming.

The first LP toolset was produced by Knuth himself. Called WEB, the tool comprised two basic components, programs called tangle and weave. Source files in WEB syntax combined a detailed commentary by the program author with Pascal statements. Running tangle extracted the Pascal source for compiling; running weave produced a TeX formatted technical manual explaining the working of the program.
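
The division of labour between the two tools can be sketched in a few lines. The Python below is a toy illustration only and is not Knuth's WEB: the "@doc" and "@code" markers, the function names and the sample source are all invented for this sketch, and the woven output is plain text rather than TeX.

```python
# Toy illustration of the tangle/weave split. This is NOT WEB syntax: the
# "@doc" and "@code" markers are hypothetical, invented for this sketch.

SOURCE = """\
@doc
Compute the greatest common divisor using Euclid's algorithm.
@code
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a
@doc
A quick check of the routine.
@code
result = gcd(12, 18)
"""


def split_sections(source):
    """Return a list of (kind, text) pairs, where kind is 'doc' or 'code'."""
    sections = []
    kind, lines = None, []
    for line in source.splitlines():
        if line in ("@doc", "@code"):
            # A marker closes the section in progress and opens a new one.
            if kind is not None:
                sections.append((kind, "\n".join(lines)))
            kind, lines = line[1:], []
        else:
            lines.append(line)
    if kind is not None:
        sections.append((kind, "\n".join(lines)))
    return sections


def tangle(source):
    """Extract only the code, ready to hand to the interpreter."""
    code = [text for kind, text in split_sections(source) if kind == "code"]
    return "\n".join(code) + "\n"


def weave(source):
    """Produce a readable document: prose as is, code set off as a listing."""
    out = []
    for kind, text in split_sections(source):
        if kind == "doc":
            out.append(text)
        else:
            out.append("\n".join("    | " + l for l in text.splitlines()))
    return "\n\n".join(out) + "\n"


if __name__ == "__main__":
    namespace = {}
    exec(tangle(SOURCE), namespace)   # run the tangled program
    print(namespace["result"])
    print(weave(SOURCE))              # print the woven document
```

Both outputs are derived from the one source text, so a change to either the commentary or the code is made in a single place. A real LP tool does considerably more, such as reordering code sections and building indexes.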

The name WEB was chosen intentionally by Knuth. He argues, and with much substance, that traditional structured models of program behaviour cannot always accurately convey the interrelationships between various portions of a program. A top-down model, which shows the subroutine or function calling relationships may not in itself quickly illustrate the mutual relationships of important pieces of code buried in the bowels of this structure. Knuth sees the model as being rather one of a web, hence the name of his original toolset.

There is much to be said for Knuth's perspective on this issue. In many programs, routines will operate on common data structures and the relationships between calls may tell us very little about what is really going on.

WEB was the beginning of the paradigm. As Pascal waned and C emerged as the language of choice in commercial computing, WEB was adapted for C and thus CWEB was born. In the almost two decades since Knuth's original paper was published, a plethora of different LP tools have emerged, many designed from the outset to support the use of arbitrary programming languages.

The Case for Literate Programming

Probably one of the best papers arguing the case for LP was written fairly recently by Mike Gradman (Literate Programming). He argues in essence the following case:

Since the 1950s programming languages and technique have been evolving steadily ahead with the aim of making programs easier to understand in structure and function. FORTRAN incorporated control statements and syntax which were essentially mnemonics of commonly used algebraic operators. COBOL used a highly verbose syntax, with the intent of being self documenting. Both failed to meet their original aims of easily understood languages, FORTRAN due to its permissiveness in allowing GOTO rich spaghetti code, COBOL in its verbosity which may frequently obscure the actual function of the code.

The next important phase in this evolution was procedural programming languages. Algol introduced FOR and WHILE loops and IF-ELSE control statements, replacing the FORTRAN DO loop construct. Pascal, and later C and Ada, introduced structures, objects which can be handled using very simple syntax despite their possibly high internal complexity. These features did indeed improve the readability of much code, but by the same token produced other difficulties. Consider a situation where a C program contains many nested switch() statements, each containing a fragment of code with multiple statements. The result is often spaghetti which rivals the most spiderweb-like FORTRAN construct.

To this very day a persistent techno-religious schism exists between programmers favouring goto statements and those favouring structured code. Suffice it to say that injudicious use of either construct can wonderfully obfuscate any piece of code, and the larger it is, the more thoroughly !

The introduction of the Object Oriented programming model, best typified by Stroustrup's C with classes, which became C++, was intended in part to further improve the structure of code and promote the use of reusable code modules. In the OO model, objects contain both functions and data, and may be dynamically created or destroyed at runtime. Like its predecessors, the OO model produced what clearly amounts to major improvements in dealing with many structural problems in program understanding. However, like its predecessors, it has also created opportunities for code cutters to get into difficulty. Very complex inheritance hierarchies between objects can confuse even the most determined reader.

Programming style guidelines, which impose conventions upon the programmer, such as using intuitive naming for variables and functions, or objects, or code indenting conventions, have proved to be less than entirely successful in the quest for comprehensible code. Since they are not enforced structurally by the language or the tools, they are frequently abused or ignored by lazy programmers. Inevitably, they fail to impose the order which is needed in the process.

Gradman's arguments are indeed valid, and my experience using some of the languages in question would support his position here. In addition, it is worth noting that in a large part the crux of the problem is dealing with ever growing complexity in software products. With every improvement in the ability of languages and tools to cope with complexity, inevitably programs become more complex, until they hit a bound determined by the ability of the programmer to overcome that complexity. This appears to be a recurring theme in the development of programming languages and techniques.

Gradman reiterates Knuth's central arguments for the use of LP, and summarises a number of key points in favour of using the LP model:

  • Integrating Design Strategies: whether the code is constructed using top-down or bottom-up techniques, LP accommodates both effectively.

  • Divide and Conquer Strategy: LP promotes the division of a program into small chunks, each of which is easy to comprehend.

  • Cognitive Reinforcement: because LP produces well formatted and readable typeset documentation automatically, a programmer has a simple to read document to reinforce his or her understanding of the code.

  • Readability: cross referencing information or contents tables produced automatically by LP tools emulates the effect of hypertext, making it easier to navigate the code.

  • Alternatives in Design: the model encourages commentary on choices in the code design and structure, facilitating understanding and maintenance.

  • Augmentation of Languages: a well designed LP environment provides opportunities to add features to the programming language being used, by embedding macros.

  • Powerful Environments: if properly integrated with development tools such as editors, typesetters and debuggers, LP techniques can bypass many difficulties seen in the use of conventional techniques.
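
Two of the points above, the division of a program into small chunks and the automatic generation of cross references, can be illustrated with a short sketch. The "<<name>>" notation is loosely modelled on that of the noweb tool and is much simplified; the chunk names, the expand and cross_reference helpers and the sample program are hypothetical, and no check is made for chunks which refer to themselves.

```python
import re

# Hypothetical program held as named chunks; "<<name>>" refers to another chunk.
CHUNKS = {
    "main": "<<read input>>\n<<compute total>>\nprint(total)",
    "read input": "values = [3, 4, 5]",
    "compute total": "total = 0\nfor v in values:\n    <<accumulate>>",
    "accumulate": "total += v",
}

# A reference is a line holding only "<<name>>", possibly indented.
REF = re.compile(r"^(\s*)<<(.+?)>>\s*$")


def expand(name, chunks, indent=""):
    """Recursively replace chunk references with their bodies, keeping indentation."""
    out = []
    for line in chunks[name].splitlines():
        match = REF.match(line)
        if match:
            # Nested chunks inherit the indentation of the referring line.
            out.extend(expand(match.group(2), chunks, indent + match.group(1)))
        else:
            out.append(indent + line)
    return out


def cross_reference(chunks):
    """Map each chunk name to the sorted list of chunks that use it."""
    used_by = {name: [] for name in chunks}
    for name, body in chunks.items():
        for line in body.splitlines():
            match = REF.match(line)
            if match:
                used_by[match.group(2)].append(name)
    return {name: sorted(users) for name, users in used_by.items()}


if __name__ == "__main__":
    program = "\n".join(expand("main", CHUNKS))
    print(program)                    # the flattened source a compiler would see
    exec(program)                     # prints 12
    print(cross_reference(CHUNKS))    # which chunk is used where
```

Each chunk is small enough to be explained in a sentence or two of accompanying prose, while the tool, rather than the reader, takes on the job of assembling the chunks in the order the compiler requires.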

An important point not articulated by Gradman is that of the maintenance of consistency between documentation revisions and code revisions. The LP model inherently imposes this consistency, since the top level source file encapsulates both the documentation and the source code. Once it is locked into a revision management system such as RCS or SCCS, the revisions of manuals and the software product are inherently in lockstep.

Gradman does point out some excellent aspects of LP in the industry software engineering context:

  • The model promotes a process whereby a detailed design specification can be directly evolved into an LP source file and thus code. In this manner, the specification becomes a direct template for the code with little scope for interpretation or variation by industrious code cutters downstream.

  • LP directly facilitates communication between code developers, testers and maintainers.

  • The model allows, ideally, technical and user manual components to be directly embedded into the master document.

  • Where the LP toolset is language independent, a single development environment can be used for a range of different projects, facilitating the use of a common style and documentation format for all.

Why Isn't Literate Programming Widely Used in Industry?

The central question which arises from an exploration of the LP paradigm is that of why it has not seen the explosive growth of other programming paradigms, such as OO, SQL or even C ?

Gradman argues that in part it is a result of the LP paradigm not being well known or understood outside University Computer Science schools, and in part a result of its imposing a bigger overhead in development when compared to pure source code with the odd scattered comment.

Both of his arguments are true, but there are other factors which may also play into this situation.

There is no doubt that without the loud support of commercial marketeers, the LP paradigm will have much difficulty being seen against the highly visible exhortations of "Use Blogs Development Environment XXX and Save Millions on Your Development Cycle", or "Cut Development Time by Half Using Our YYY Tool" produced by commercial tool vendors. However, languages such as C++ started with equally humble beginnings in academia, and later blossomed into world beaters.

Perhaps the two biggest handicaps the LP model has to deal with are both inherent reflections of the industry and its professional culture.

The first is that LP is a global cost minimising technique, which maximises profitability over the whole product life cycle, at the expense of slightly higher development overheads in the early phase of the life cycle. Classical code, crafted using conventional languages and the odd programmer comment thrown in, is a local cost minimising technique, which maximises short term profitability at the beginning of the product life cycle, at the expense of much higher long term costs in maintaining the product.

In an industry culture permeated by the idea of the quick buck being everything, the imperative is always to satisfy hungry shareholders and money merchants. Time to market and initial market yields being accorded such high value, any technique which maximises long term profit will always be sidelined by techniques which are perceived to maximise short term profit. Whether the latter perception is accurate is entirely beside the point.

If the product is sufficiently complex, it may tie itself into knots well before the development cycle is complete, and the argument for short term cost minimisation collapses entirely. Perceptions, however, can be much stronger drivers than facts.

Another important reason why LP has not attracted the enthusiasm of other contemporary paradigms is that many code cutters are challenged in English literacy skills, to be PC about the issue. This problem has deeper underlying causes, but is typically manifested by poor communications and English language skills seen in Computing, Engineering and Science graduates and undergraduate students. My experience in a University teaching role, and in various industry positions, is that many programmers or aspiring programmers have the writing skill set of a 12 year old!

Since English language skills are not a focus of Computing, Engineering and Science courses in universities, whatever difficulties the aspiring programmer may have upon entering his or her professional education are not remedied. Indeed, funding models for the university system largely assume that the high school/secondary education system has imparted these skills. Alas, nothing could be further from the truth. The undergraduate, unless naturally talented in writing, is handicapped by a woeful prior education and carries this handicap through his or her professional life. Without formal English grammar being taught in high schools, those without natural talent are denied the means of overcoming their handicap.

Since LP demands that the programmer include robust commentary, explanations and descriptions into the source file, for many programmers this amounts to the equivalent of being burned at the proverbial medieval stake.

Other techniques could be adopted to circumvent the problem, such as the inclusion of a project team member with good writing skills, who polishes the commentary in source files and acts as a target for the programmer to aim his or her explanations at. However, this will be perceived to be an unnecessary overhead in many organisations.

The LP paradigm highlights many weaknesses which exist in our industrial model for software production, and in the overall process via which professional code cutters are trained.

The big long term question which must be grappled with is that of how far the existing coding paradigm can be pushed before it collapses under the weight of complexity in software products. The growth in complexity is a fact of life and we cannot escape it; programs are being expected to do more and more over time. Whether this is good or bad is immaterial, since the trend exists.

I will leave this final point for the reader to contemplate.

$Revision: 1.1 $
Last Updated: Sun Apr 24 11:22:45 GMT 2005
Artwork and text © 2005 Carlo Kopp
