Investigating the use of Software Engineering in Computer Science Research

The problems addressed

1 Difficulties in communication and understanding research

Design documents are clearly of benefit in communicating ideas to researchers wishing to use and build upon the work of others. As observed in the CDMS project, there are two basic types of users for research software: those wanting to use the software and obtain a result from it, and those wanting to extend, review or re-implement parts of it.

The CDMS study showed the usefulness of design documents when new researchers take over an existing project. The benefit occurs during the initial learning phase and is later outgrown. As noted by Lim (2002b) in the GIFT project, if one doesn't know the design language used, such documents may be of little use to begin with. In the case of GIFT, a user manual seems to be the most useful form of documentation, although it does not replace the theses for technical research detail. In the CDMS project many people have asked for a manual, and the designers recognise the need for one. Like GIFT, they expect their manual to be a "how to" document. A user manual already exists for CaMML despite the lack of a product. This reverse approach to design is aimed at producing a consistent interface right from the start.

In the Australian survey, 18 of 30 respondents answered "yes" to the question "Given more time and/or money, would you go back and create or update design documents after coding?" The correlations between the use of tools such as interaction diagrams and use case diagrams and the willingness to go back and create or update design documents are of interest, and an explanation might be found in the case studies. The high correlations may reflect the use of such tools early on, followed by a period of neglect in maintaining them. The CDMS case study suggests the neglect may be due to developers outgrowing the use of such tools: the information in the diagrams is already understood by the developers.

Prof Meyer suggested that most ideas in research fail. As the software hinges on these ideas, constant change to the software is to be expected. If updated regularly, as was tried for a while in CDMS, the documentation becomes a burden. Until a working idea is found, effort spent on documentation may be wasted. The correlation between a willingness to document afterwards and a desire to have used more software engineering earlier suggests many researchers may not be aware of this, and are perhaps basing their views on those expressed in industry. The research environment is different: with a small team working in one room, design documentation of existing code may be of little value. Worse, design documentation developed during the more active part of the research may do harm by wasting time and increasing frustration. The correlations between a willingness to produce design documents after the research and a willingness to use technical and code reviews can, however, be seen as a much more positive step. With the code and researchers in one room, little more is needed to check design and implementation ideas. It was pointed out by Dr Allison (Alison, 2002) that this was practically feasible.

While some researchers obviously see the benefit of reviews (as shown in the survey and interviews), others see a review as implying a lack of trust in the coder. Some work may need to be done with staff and research students to ensure reviews, like thesis drafting, are seen in a positive light.

The GIFT and CDMS case studies address the benefit of user documentation and design documentation for people other than the authors. The benefit for the authors after a significant break is also suggested. Our research suggests that design documents should be produced after an intense research period rather than during it. Using more software engineering earlier is often not possible due to a lack of time and resources. The majority of people would like to use more software engineering earlier, but our research suggests that with the exception of reviews, there may be little benefit in this.

2 Loss of knowledge

Wallace's code, as discussed in the interview results, is a good example of the difficulties faced by many researchers. While the code can be understood, it may take significant effort before this is achieved. Tim Wilkin's experience shows the penalty someone else may incur when knowledge goes undocumented and is lost.

Loss of knowledge may also occur when the ideas behind a design are lost. This appears to have happened in CaMML. One of the developers explained that he never designed his code, but rather tended to just start coding data structures. This approach has limitations. For one thing, it requires the developers to know in advance all future uses of the system, and every issue that must be considered to keep it extensible. Mr Fitzgibbon's comments on the need to explain not only how but, more importantly, why a particular design was chosen show insight. The misapplication of MML, as discussed by Prof Wallace, again shows a lack of understanding by those re-implementing it. There is a clear need to explain not only code and design but implementation choices as well.

While documentation may not be needed during the development of a project, it is invaluable to future researchers who further develop that code.

3 Inefficient and ineffective reinvention

Although rewriting code is part of the nature of research (as already discussed), code that is not part of the research and is available in standard libraries should be used. Many researchers repeatedly recode things such as basic search algorithms. Variations on an algorithm, or adjustments to better suit the data, are clearly appropriate. More often than not, however, the code is being produced exactly as found in the library, or as found in the library but with the addition of bugs. Interviews suggested that this practice, while inefficient, was an effective way of familiarising oneself with important concepts, hence there may be a benefit in certain cases. The issue of recoding is of more importance for larger bodies of code, as discussed in the interview results.
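As a minimal illustration of the point about basic search algorithms, the sketch below uses Python's standard `bisect` module instead of a hand-written binary search. The helper name `contains` and the sample data are hypothetical and are not taken from any of the projects studied.

```python
import bisect
from typing import List


def contains(sorted_values: List[int], target: int) -> bool:
    """Return True if target occurs in sorted_values (which must be sorted).

    bisect_left performs the binary search, so there is no hand-coded
    midpoint arithmetic in which to introduce off-by-one bugs.
    """
    index = bisect.bisect_left(sorted_values, target)
    return index < len(sorted_values) and sorted_values[index] == target


if __name__ == "__main__":
    data = [1, 3, 5, 7]
    print(contains(data, 5))
    print(contains(data, 4))
```

Only the thin wrapper is new code; the search itself is delegated to the library, which is the division of effort the paragraph above argues for.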

4 Repeated framework rebuilding

In the interview results we discussed how new research ideas often require code to be rewritten. The correlation between publications and rewrites shown in the survey results, combined with the interview data, suggests that results worth publishing may only occur after significant research (as shown in the planning time) and a number of software rewrites. It also suggests that rewrites, which according to the interviews are usually caused by new ideas, increase the number of publications resulting from a project. This seems quite plausible and, if true, provides a strong case for ensuring modularity in long-term research projects.

The plug-in approach to modularity (used in both the GIFT and CDMS case studies) seems to work well and has many advantages. This approach involves separating the program into modules and ensuring that the algorithms of prime research interest become their own highly cohesive modules, loosely coupled to the rest of the system. A new idea, or a change to one of these algorithms, then only requires altering or replacing the one module. This is a key design idea behind both GIFT and CDMS.

The internal commenting of code also plays a significant role in the decision to reuse or rewrite code. This can be seen in the CaMML project and was also discussed by Wilkin (2002).

While repeated framework building is a danger, at least one framework must be built. The shortage of time for both researchers and higher degree by research students makes this a difficult task. There appears to be a view that software of use to people is clearly not research software. Both CaMML and CDMS require some form of grants to enable them to employ full time developers to complete the skeleton of the programs. Further research work will then be possible.

5 Authenticity

Incorrect implementations of the MML concept, such as those Wallace described in the interview results, are probably unavoidable to some extent. These are problems of human nature and, beyond careful diplomacy, not much can be done.

The need for both source code and data to be available is of prime importance to the authenticity of the experiment conducted. Wallace's claim in the interview results that source code does not need to be readable is at best doubtful. If the source code for both the incorrect and correct MML software were readable, reviewers and editors would find it easier to check claims of incorrect implementation.

6 Intermediate software

The CaMML project has a number of documented features that have been left for a future version of the software. These are described and marked as future work in the product specification. The design as written allows for all future changes or additions that have been listed. This is an attempt to avoid the second system effect. GIFT, through its use of plug-ins, avoids the problem by allowing incremental development and looser coupling between parts of the system. CDMS is currently experiencing some difficulties in this respect. Researchers are unsure if papers should be published now or after major changes have been incorporated (Alison, 2002). When the stability of the current core code was discussed, the developers' accounts proved inconsistent, both internally and with each other (Alison, 2002; Comley, 2002; Fitzgibbon, 2002).

The issue of intermediate software needs to be controlled. Some view the complete rewriting from scratch as inevitable, while others see it as ludicrous. Research software can be designed to avoid this, but it requires an overall architecture and careful management.

7 Creative Problem

While a direct impact cannot be made on the creativity of researchers, the barrier to implementing ideas can be lowered. As discussed in the interview results, many ideas are flawed and it is often only by trying them that this is discovered. Where a framework exists, as demonstrated by GIFT, it becomes easier for researchers, and particularly research students, to try out ideas (as plug-ins) in a short period of time. This allows more creative ideas to be tried. CDMS and CaMML have been designed to allow this, but the benefit will only be realised when completed, stable versions of their core code become available.

References above can be seen in the Bibliography.