<?xml version="1.0"?>

<!DOCTYPE document [
<!ENTITY unix "U<sc>nix</sc>">
]>

<document>
  <meta>
    <title>Structured process input/output</title>
    <subtitle>Reformulating the &unix; command line environment to generate and process XML documents</subtitle>
    <student-name>Cameron McCormack</student-name>
    <student-id>12793086</student-id>
    <supervisor>Prof. John Hurst</supervisor>
    <abstract>
      <para>
        One of the main tenets of the philosophy of the &unix; operating
        system is that it is more effective to write small, simple programs that do
        specific tasks very well, as opposed to having one integrated program
        that does it all. Users compose these small programs at the command
        line, by way of a pipeline, to perform more complex computations.
        These simple programs have typically focussed on processing flat
        text files. Often, programs generate output that was destined to be
        displayed to the user and as such, the data is formatted for display.
        This necessitates the use of text processing programs to filter out
        irrelevant information and to select the appropriate data. However, this
        imposes an unwanted dependency on the formatting of the original output.
        It is clear that a separation of form and content must be made.
        This thesis explores the idea of using structured data for the
        input and output of some standard &unix; utilities. In particular,
        versions of these programs that parse XML input and generate an XML
        document as their output will be presented. An environment in which
        to execute these programs and format their output for display is also
        given.
      </para>
    </abstract>
  </meta>

  <sections>
    
    <section title="Introduction">
      <p>
        The &unix; operating system <cite ref="unix"/> is an interactive time-sharing
        system developed in 1969 while its authors were working for
        Bell Laboratories.  It was created with the aim of being an
        operating system for programmers.  Since its inception, the &unix; command
        line environment has focussed on processing flat text files.
      </p>
      <p>
        Flat text files are files whose data are organised into records and fields.
        Each record in the file is stored on one line.  Each field within that record
        is separated from the other fields by a variable amount of horizontal
        whitespace.
        None of the &unix; systems developed over the last 30 years has moved away from the
        flat text file format to which the &unix; command line holds so firmly.  Gancarz (1995)
        notes that data files should contain only streams of bytes separated by newline
        characters.  Authors of &unix; systems have thus far adhered to this.
      </p>
      <p>
        The reason for the success of the flat text file format is its simplicity.  It is
        indeed an easy to use, useful format in which to store some types of data.  Thirty years,
        however, is a very long time in the world of computers, and requirements for
        data storage have changed.
      </p>
      <p>
        Of interest is the possibility of reworking the &unix; command line environment
        to use some other, more structured file format instead of flat text.
        If programs generate and process a more user friendly file format, there
        is the possibility that commands constructed at the shell are easier
        to devise and understand.
      </p>
      
      <subsection title="Previous research">
        <p>
          Ever since the early days of &unix;, little research has been done into
          reworking the command line and even less into the idea that leaving behind
          the flat text file paradigm would be prudent.  This is because keeping
          data stored in such a simple format has proven both easy and powerful.
          Not all data fits neatly into the record/field organisation of the flat
          text file, however.  It is for this reason that a more structured format for
          data files and process input and output, such as the Extensible Markup
          Language (XML) <cite ref="xml"/>, needs to be investigated.
        </p>
      </subsection>

      <subsection title="Project outline">
        <p>
          The first stage of this project involved looking at existing &unix; systems'
          command line environments and determining the problems they have.  This
          review of traditional &unix; data processing is given in section 2.  Following
          this, a new model was developed for process input and output, which is
          presented in section 3.  An overview of XML is also given here.
        </p>
        <p>
          The programs developed for the new command line environment are divided into
          three categories &#x2014; data generating programs, data filtering programs
          and the interface to the environment.  These are discussed in sections
          4, 5 and 6, respectively.
        </p>
        <p>
          Finally, in section 7, some example applications of the command line environment
          are given, comparing the techniques used in a traditional &unix; environment
          and those used in the new model.
        </p>
      </subsection>
    </section>

    <section title="Existing systems">
      <subsection>
        <title>&unix; command line environment</title>
        <p>
          The command line environment of &unix; has four main aspects: user programs,
          input/output redirection, command pipelines and the shell.  Together, these provide
          the user with a powerful means of composing and executing commands to
          manipulate data files.
        </p>
        <subsubsection title="User programs">
          <p>
            The selection of programs available for the user to run is at the heart of the
            command line environment.  From the outset, the focus of user programs was on
            file manipulation.  The first user programs to be written for &unix; were
            simple file system commands.  After having just implemented the file system,
            Thompson (1979) wrote a few small programs to manage it &#x2014; programs to
            let the user create, copy, edit, list and delete files.  These commands alone
            gave the user a simplistic electronic word processor, but not much more.
            As more user programs were added to the system, three general categories of
            program emerged: commands which performed a task with no output (such as
            copying or deleting a file), those which produced output (such as the
            directory listing program) and those which transformed their input (filters, like the program
            to sort a file's contents).
          </p>
          <p>
            The user programs that were developed for &unix; were all small programs that
            performed a single task, and only that task, well.  Kernighan and Plauger (1976) expounded
            on that idea and coined the term <em>software tool</em>.  By writing programs
            to deal with generalities rather than specifics, focussing them on performing
            a single task well, the user can build up a toolkit from which they can assemble
            commands to perform complex tasks.
          </p>
          <p>
            These programs were all written with a certain data format in mind, namely,
            flat text.  It was held that, to maximise portability and usefulness, data
            should be stored in plain ASCII (the American Standard Code for Information
            Interchange &#x2014; a 7-bit code defining a set of alphabetic, numeric,
            punctuation and control characters), arranged into fields separated by horizontal
            whitespace and records separated by newline characters <cite ref="phil"/>.
          </p>
        </subsubsection>
        <subsubsection title="Input/output redirection">
          <p>
            One idea which increases the usefulness of the user programs in &unix;
            is that of input/output (I/O) redirection.  General redirection of process
            input and output was already available in Multics <cite ref="multics"/>
            and it was from here that the idea was borrowed.  One of the tenets
            of &unix; philosophy that Gancarz (1995) mentions is that all programs
            should be written as filters.  Processes under &unix; have the notion
            of a <em>standard input</em> and <em>standard output</em>.  If a program
            has not been given a filename on the command line, the standard
            input is where a program typically reads its input from.  The standard output,
            analogously, is where a program would write its output to.  Any program
            that reads data from the standard input, transforms it somehow, then writes
            that data out to the standard output, is considered to be a filter.
          </p>
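          <p>
            The notion of a filter can be illustrated with a minimal sketch in Python.
            It is not taken from any &unix; utility; the function names
            <code>sort_filter</code> and <code>main</code> are assumptions made for
            this illustration.  The function reads records from one stream, sorts
            them and writes them to another, so it works unchanged whether the
            streams are the terminal, a file or a pipe.
          </p>
          <listing><![CDATA[
```python
import io
import sys


def sort_filter(source, sink):
    """Read lines from source, sort them, and write them to sink."""
    for line in sorted(source.readlines()):
        # Make sure every record ends with a newline, even the last one.
        sink.write(line if line.endswith("\n") else line + "\n")


def main():
    # As a command line filter, the streams are the standard input and output.
    sort_filter(sys.stdin, sys.stdout)


# Demonstration with in-memory streams standing in for the real ones.
demo_out = io.StringIO()
sort_filter(io.StringIO("pear\napple\nfig\n"), demo_out)
```
]]></listing>
          <p>
            Calling <code>main()</code> from a script would make the same function
            usable in a pipeline, because it names no file of its own.
          </p>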
          <p>
            Normally, both the standard input and output of a process are the terminal.
            Thus, program input comes from the keyboard and program output is written
            to the screen.  In &unix;, both the standard input and output of a process
            can be redirected to some other file or device.  With this, a program
            can be written to handle just a single input source (the standard input)
            and be able to handle any file implicitly, by having its input redirected
            before the program is run; similarly for a program's output.  This redirection
            is performed by the <em>shell</em>, the command interpreter that elicits
            commands to run from the user.  Programs which did not use the standard input and
            output, and instead relied on hard coded filenames, were definitely of limited
            use.
          </p>
          <p>
            While Multics could also handle such redirection, the method used for enacting
            the redirection in &unix; was far more elegant.  According to Ritchie (1979),
            Multics required the user to use an explicit command to reassign a program's
            output.  For example, to run the <code>list</code> command and redirect its
            output to a file called <code>listing</code>, the following three commands were issued:
          </p>
          <listing>iocall attach user_output file listing
list
iocall attach user_output syn user_i/o</listing>
          <p>
            To accomplish the same task in &unix;, running the <code>ls</code> program
            to generate the listing, just a single command was needed:
          </p>
          <listing>ls > listing</listing>
          <p>
            Input redirection could be achieved with a similarly simple command:
          </p>
          <listing>ed &lt; cmds</listing>
          <p>
            This command runs the <code>ed</code> editor, taking commands from the file
            <code>cmds</code>.  The syntax of I/O redirection, using '&lt;' and '>',
            has not changed since its original inclusion in &unix; in 1970.
          </p>
        </subsubsection>
        <subsubsection title="Command pipelines" break="true">
          <p>
            The mechanism for composing commands is the <em>command pipeline</em>.  The concept,
            which was first seen in the &unix; operating system, is attributed to McIlroy <cite ref="evolution"/>.
            In 1972 McIlroy proposed the concept of connecting a series of processes together such
            that the output of one process became the input of the next, the analogy used being
            that connecting these processes was like screwing together garden hoses.
            The command pipeline is what enables the reuse of the software tools available to
            the &unix; user to create complex commands in a single line.
          </p>
          <p>
            The original syntax used for the pipe concept was similar to the I/O redirection
            syntax that already existed.  A pipeline to sort a file and take the first 10 lines
            to send to the printer could be written as:
          </p>
          <listing>sort file >head>lpr</listing>
          <p>
            Thus, either a file or a command could be placed after a '>' character to indicate
            where the output of the previous command in the pipeline was to go.  This syntax,
            while consistent with the redirection syntax, did allow ambiguous commands to
            be written.  For example, with this syntax there was no way to specify that you
            wished the output of the <code>sort</code> command to be written to a file called <code>head</code>,
            as the shell would determine that <code>head</code> was a program and that <code>sort</code>'s
            output should be sent there instead.  The syntax was therefore changed to use
            the vertical bar (or the <em>pipe symbol</em> as it has become known).
            The above command now could be written as:
          </p>
          <listing>sort file | head | lpr</listing>
          <p>
            It was at this time, after having implemented the pipeline concept, that the developers
            of &unix; came to the realisation that this simple extension of I/O redirection was the
            vehicle for program reuse <cite ref="tools"/>.
          </p>
        </subsubsection>
        <subsubsection title="The shell">
          <p>
            The program that provides an interface to the command line environment is the shell.
            It is this program that repeatedly prompts the user for a command and then executes
            that command.  The shell is perhaps the one aspect of the &unix; command line
            environment that has undergone a number of changes since its inception.
          </p>
          <p>
            The original &unix; shell created at the start of 1970 was very simplistic <cite ref="evolution"/>.
            All the shell did was read a single command from the terminal, load the
            specified program over the top of itself, and jump to that new program.  The
            operating system relied on the user program to invoke the <em>exit</em> system call,
            which would load a fresh copy of the shell into memory and begin its execution.
            This was not a true multiprogrammed system, though.  Only one process could run
            at one time.
          </p>
          <p>
            Subsequent developments to the operating system included the <em>fork</em>/<em>exec</em>
            process control that still exists in today's &unix;es.  The Berkeley time-sharing system
            developed in 1969 already incorporated this method of process control.  With this,
            the &unix; shell could now handle background processes easily.  Background processes
            are those started with a command such as:
          </p>
          <listing>program &amp;</listing>
          <p>
            The ampersand indicates that the shell should execute the program in the background
            and immediately ask the user for another command to run.
          </p>
          <p>
            From the release of &unix; First Edition in 1971 until the Sixth Edition in 1975, very
            little changed in the operation of the shell.  However, during the development of
            the Seventh Edition, Bourne (1983) created his own shell, to be known as the Bourne shell.
            The Bourne shell brought to &unix; shell programming constructs such as iteration
            and selection that allowed the user to
            write complex shell scripts.  From that point onward, most new shells developed for &unix;,
            such as the Kornshell <cite ref="korn"/> and the Z shell <cite ref="zsh"/>, maintained a 
            syntax compatible with the Bourne shell.  One popular shell that was significantly different
            from the Bourne shell was the C shell, csh, written by Bill Joy for inclusion into
            the Berkeley Software Distribution of &unix; <cite ref="bsd"/>.  The C shell, as its
            name implies, lets the user run commands with a syntax reminiscent of the C programming
            language <cite ref="c"/>.
          </p>
          <p>
            Attempts to vastly redesign the &unix; shell have been made.  One notable design is that
            of a functional shell, where commands are entered in a form similar to those used
            in a functional programming language.  These shells, such as McDonald's (1987) fsh
            and the Scheme shell <cite ref="scsh"/>, scsh, allow the user to compose commands as
            if they were functions.
          </p>
        </subsubsection>
      </subsection>

      <subsection title="Problems">
          <p>
            While certainly an appropriate design decision for the early 1970s, the
            7-bit ASCII flat text file format used by the &unix; user programs
            is not always the best today.  ASCII is able to represent only
            a handful of Western, Latin-based languages.  There was not much cause
            for writing with internationalisation and localisation in mind when
            the standard &unix; user programs were being created.  Today, with global
            networks such as the Internet, languages other than English have to be
            considered.  A character set such as Unicode <cite ref="unicode"/> or
            the Universal Character Set (UCS) <cite ref="iso10646"/> would certainly
            solve this problem if it were routinely used and had support from the
            operating system and its user programs.
          </p>
          <p>
            The rigid record/field format of flat text files also does not
            accommodate all types of data.  A particular example would be information
            arranged hierarchically.  Such information could of course be stored
            in flat text, but only by resorting to an inelegant kludge such as
            recording each node's depth in the tree in a field of its own.  It is
            far better for the relationships within hierarchical data to be evident
            from the file format itself.  A data file format such as XML could
            quite easily be used to store such data, being intrinsically hierarchical.
            XML also has the advantage of using named fields.  A user reading
            an XML document could easily determine the use of certain fields by their names,
            whereas in a flat text file, such names are absent, and the user would
            have to rely on some external documentation.  To be maximally useful
            all text generating programs would have to write their output in XML
            so they could be used in conjunction with XML filtering programs.
          </p>
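          <p>
            The difference between the two representations can be sketched as follows.
            The flat records, the <code>tree</code> and <code>node</code> element names
            and the function <code>flat_to_tree</code> are all assumptions made for
            illustration.  Each flat record carries its depth in a field of its own,
            and a program must reconstruct the hierarchy from it; in the XML form
            the nesting is the hierarchy.
          </p>
          <listing><![CDATA[
```python
import xml.etree.ElementTree as ET

# Flat text: each record is "depth name"; the hierarchy is only implied.
FLAT = """\
0 usr
1 bin
1 share
2 man
"""


def flat_to_tree(text):
    """Build an element tree from depth-annotated flat records."""
    root = ET.Element("tree")
    # stack[d] holds the parent for a record at depth d.
    stack = [root]
    for line in text.splitlines():
        depth_field, name = line.split()
        depth = int(depth_field)
        del stack[depth + 1:]  # forget branches deeper than this record
        node = ET.SubElement(stack[depth], "node", name=name)
        stack.append(node)
    return root


tree = flat_to_tree(FLAT)
xml_text = ET.tostring(tree, encoding="unicode")
```
]]></listing>
          <p>
            Once the tree has been built, questions such as "which nodes lie directly
            beneath <code>usr</code>" need no depth arithmetic at all.
          </p>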
          <p>
            All of the shells mentioned earlier, which have been developed for &unix;, have been
            designed with the assumption that the
            majority of user programs that are to be composed use flat text for their input and 
            output.  The shells can let program output be written verbatim to the terminal, since
            flat text is eminently human readable.  If, however, user programs are to be
            designed to read and write XML documents, program output is not immediately
            appropriate for the user to read.  XML documents are readable by humans &#x2014; they
            do not contain any illegible control codes as some binary formats do, but
            the information in these documents should be converted into a form that
            is easy for the user to read.  Since the user will be interacting with the shell
            using a text based terminal, program output destined for the terminal should be
            transformed into plain text by the shell.  Making this final transformation for
            the user the responsibility of the shell means that user programs do not have
            to worry about checking whether their output is to be piped into another program,
            redirected into a file or written to the terminal.
          </p>
      </subsection>
    </section>

    <section title="Structured input/output">
      <p>
        To address the problems mentioned in the previous section, a new model
        for the &unix; command line environment was devised.  This model describes the
        general format programs should accept on their standard input and
        produce on their standard output, as well as the techniques that are to
        be used for handling the presentation of this data.
      </p>
      <p>
        In the new model, programs will use a more structured data format
        for their input and output.  Specifically, they will read and write
        Extensible Markup Language (XML) documents.  There are a number
        of advantages to using a markup language such as XML instead of
        flat text.  The principal advantage is the ease of parsing to extract
        particular information.  Formatting output for display on the terminal
        is sometimes an information losing process.  It is this formatting
        for display which often makes it difficult for the user to construct
        a command to select the relevant fields or records.  If a program generates
        its output in XML, human factors such as "information overload" do not
        have to be considered.  All of the information can be kept
        in the output document in a form that is easy to extract.
      </p>
      <p>
        XML documents also are inherently self documenting.  Records and fields
        (elements and attributes in the document, respectively) must all be named explicitly.
        This helps the user inspecting the document determine the nature of the
        contents of each field.  Comments can also be placed in the XML documents
        without causing disruption to programs that parse them.
      </p>
      <p>
        As the name suggests, XML documents are extensible.  What this means is that
        a user or a program can annotate a document by adding extra attributes to
        an element, again without confusing existing programs which parse the
        document.
      </p>
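      <p>
        A small sketch may make this concrete.  It assumes a consumer written
        against a document with <code>record</code> elements carrying a
        <code>field1</code> attribute; the extra <code>reviewed</code> attribute
        and the function name <code>field1_values</code> are hypothetical.
        Because the consumer asks for fields by name, the annotation passes
        through unnoticed.
      </p>
      <listing><![CDATA[
```python
import xml.etree.ElementTree as ET

ORIGINAL = '<datafile><record field1="val1" field2="val2"/></datafile>'
# The same document after another program has annotated the record.
ANNOTATED = ('<datafile><record field1="val1" field2="val2" '
             'reviewed="yes"/></datafile>')


def field1_values(document):
    """Return field1 of every record; any other attributes are ignored."""
    root = ET.fromstring(document)
    return [record.get("field1") for record in root.findall("record")]
```
]]></listing>
      <p>
        A flat text consumer that selects fields by position would, by contrast,
        be at risk whenever a column is added.
      </p>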
      <p>
        All conforming XML parsers also must be capable of reading in documents using
        the Unicode character set.  Unicode, which encompasses a wide variety
        of scripts and symbols used around the world, allows a single document
        to include data in many languages using only a single encoding.  A user
        can include, say, text written in both Traditional Chinese and in
        Hebrew, without worrying that the programs manipulating such documents
        could not handle them.  Of course, the display of such languages is another
        issue.
      </p>
      <p>
        Since programs which generate output must no longer format it themselves,
        this formatting must be deferred somehow.  The formatting should be
        delayed for as long as possible, so that the original information is
        always available to be parsed.  It makes sense, then, that formatting
        should take place just before the output is to be written to the
        terminal for the user to read.  Thus this model requires that the
        shell transform the XML document resulting from the last program
        in a pipeline into a form suitable for writing to the terminal.
      </p>
      <subsection title="Extensible Markup Language">
        <p>
          XML documents are essentially data organised into a tree structure.
          The syntax of XML is reminiscent of the Standard Generalised Markup Language
          (SGML) <cite ref="sgmlstd"/>, being derived from it.  XML is basically a stricter form of
          SGML with a couple of additions.
        </p>
        <p>
          Every well formed XML document is a rooted tree of elements.  An element
          can contain other elements, as well as attributes and character data.
          A very simple XML document is shown below:
        </p>
        <listing><![CDATA[<datafile>
  <record field1="val1" field2="val2"/>
  <record field1="val1" field2="val2">
    <details>Some text</details>
  </record>
</datafile>]]></listing>
        <p>
          The <code>datafile</code> element is the root element of this document.
          It opens with the start tag <code>&lt;datafile&gt;</code> and closes
          with the end tag <code>&lt;/datafile&gt;</code>.  Note that the first
          <code>record</code> element in the document does not have an end tag.
          Since the element has no content, it is opened and closed in the one
          tag, by having the slash just before the closing angle bracket.
        </p>
        <p>
          Every element can have attributes.  These can be seen in the example document
          as <code>field1</code> and <code>field2</code> in the <code>record</code>
          elements.  Every attribute must have a value, surrounded by either double
          or single quotes.  Each element can also have character data as its content.
          In the example, the <code>details</code> element has the text "Some text"
          as its content.
        </p>
        <p>
          Every element and attribute is also within a particular namespace.  Namespaces
          are used in XML documents to help prevent name collisions, and also
          to designate the type of an element or document.  A namespace is
          identified by a Uniform Resource Identifier <cite ref="rfc1630"/>.  Each element
          in an XML document can be given a namespace, such that every
          element within that element (unless overridden by another namespace)
          will be in that namespace.  This is achieved by using the
          special <code>xmlns</code> attribute on an element, as demonstrated below:
        </p>
        <listing><![CDATA[<courses xmlns="http://myuni.edu/">
  <subject>
    ...
  </subject>
</courses>]]></listing>
        <p>
          Now both the <code>courses</code> and <code>subject</code> elements are
          identified as coming from the namespace with URI <code>http://myuni.edu/</code>.
          The URI is not required to refer to a physical web page.  It is simply
          used as a unique identifier.  Instead of assigning a namespace
          to the <code>courses</code> element and all its descendants, the namespace can be given a
          prefix and used explicitly.
        </p>
        <listing><![CDATA[<myu:courses xmlns:myu="http://myuni.edu/">
  <myu:subject>
    ...
  </myu:subject>
</myu:courses>]]></listing>
        <p>
          The <code>myu</code> prefix is used to identify the namespace.  Namespace prefixes
          are often used when more than one namespace is needed in the one
          document.  Elements which do not have an associated namespace are said
          to be within the default namespace.
        </p>
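        <p>
          That the two spellings above name the same elements can be checked with
          a short sketch using Python's standard <code>xml.etree.ElementTree</code>
          module, which reports each element name together with its namespace URI
          in braces.  The function name <code>qualified_names</code> is an
          assumption made for this illustration.
        </p>
        <listing><![CDATA[
```python
import xml.etree.ElementTree as ET

DEFAULT_FORM = '<courses xmlns="http://myuni.edu/"><subject/></courses>'
PREFIXED_FORM = ('<myu:courses xmlns:myu="http://myuni.edu/">'
                 '<myu:subject/></myu:courses>')


def qualified_names(document):
    """Return the namespace-qualified tag of every element, in order."""
    return [element.tag for element in ET.fromstring(document).iter()]
```
]]></listing>
        <p>
          Both documents yield the same list of qualified names, so a program
          that compares namespace URIs rather than prefixes treats them alike.
        </p>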
        <p>
          The Extensible Stylesheet Language Transformations (XSLT) specification <cite ref="xslt"/>
          was developed alongside XML to be used as a means of specifying how XML documents
          can be translated into other XML documents or into plain text.  XSLT stylesheets
          are XML documents themselves, and provide the user with a means of
          specifying declaratively how particular elements of an XML document are to be transformed.
        </p>
        <p>
          A language to address parts of an XML document was developed for use with XSLT, called
          XPath <cite ref="xpath"/>.  XPath locations look similar to &unix; pathnames in that
          components are separated by a slash.  A component in an XPath location can be an
          element name, an attribute name preceded by an <code>@</code> symbol or a node test
          such as <code>text()</code> to refer to a text node.  For example, the XPath
          location <code>/courses/subject</code> refers to all of the <code>subject</code>
          elements which are directly beneath the <code>courses</code> element at the top
          of the document.  Square brackets can be used to specify a constraint on
          selecting particular nodes.  So, <code>/courses/subject[@code='CSE1301']/@name</code>
          will select the <code>name</code> attribute of the subject whose <code>code</code> attribute equals the string 'CSE1301'.
          Quite complex expressions can be built up with XPath to refer to parts of an
          XML document.
        </p>
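        <p>
          The subject lookup described above can be approximated with the limited
          XPath support in Python's standard <code>xml.etree.ElementTree</code>
          module.  This is a sketch only: the sample subjects and their names are
          invented, and <code>subject_name</code> is a hypothetical helper.
          ElementTree evaluates paths relative to an element and cannot end a
          path in an attribute step, so the attribute is read from the matched
          element instead.
        </p>
        <listing><![CDATA[
```python
import xml.etree.ElementTree as ET

# Invented sample data for illustration.
COURSES = """\
<courses>
  <subject code="CSE1301" name="Computer programming"/>
  <subject code="CSE1303" name="Computer science"/>
</courses>
"""


def subject_name(document, code):
    """Return the name of the subject with the given code, or None."""
    root = ET.fromstring(document)  # root is the <courses> element
    # Relative equivalent of /courses/subject[@code='...'].
    subject = root.find(f"subject[@code='{code}']")
    return None if subject is None else subject.get("name")
```
]]></listing>
        <p>
          A full XPath implementation would accept the absolute expression,
          including the trailing <code>/@name</code> step, directly.
        </p>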
      </subsection>
    </section>

    <section title="Data generating programs">
      <p>
        The first category of programs in the new model for the &unix; command line
        environment is the data generating programs.
        These programs do not take
        any input, though they may have command line arguments.  Whereas in the
        traditional &unix; command line environment such programs may format
        their output for human readability, in the new model these data
        generating programs must not.  Instead, the output must be a well
        formed XML document.
      </p>

      <subsection title="Generating XML output">
        <p>
          So that the output can be identified as having been generated by
          a particular program, each data generating program will produce
          an XML document with a namespace specific for that program.
          This namespace will be used to determine how to transform the output
          to a human readable form later.
        </p>
        <p>
          In general, the structure of the output document will be the same.
          Namely, there will be an element for each record underneath the
          root element.  The data filtering programs described later default
          to processing documents structured in this manner.  Of course,
          not all documents generated will have this format.  The filters
          have options to inspect different parts of the document to handle
          these cases.
        </p>
        <p>
          Users are accustomed to specifying formatting options to data
          generating programs.  For example, they will give the <code>-l</code>
          switch to the <code>ls</code> program to instruct it to give a long
          listing of the files.  Since the new model requires the data
          generating programs not to act on these formatting flags, they should
          store this formatting preference as part of the resulting output
          document.  Adding an attribute to the root element of the XML
          document is sufficient for recording the formatting option.  Later on,
          when the document is to be transformed for output to the terminal,
          this attribute can be checked for and acted upon.
        </p>
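        <p>
          The division of labour described here might look as follows.  This is a
          sketch under stated assumptions and not the implementation presented
          later: the namespace URI, the <code>ls</code> and <code>file</code>
          element names, the <code>format</code> attribute and all function names
          are hypothetical.  The generating function records the long listing
          preference without acting on it, and a separate rendering step, of the
          kind the shell would perform, acts on it at the last moment.
        </p>
        <listing><![CDATA[
```python
import os
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, element names and attribute name.
LS_NS = "http://example.org/ns/ls"
ET.register_namespace("", LS_NS)  # serialise LS_NS as the default namespace


def scan(directory):
    """Yield (name, size) pairs for the entries of a directory."""
    with os.scandir(directory) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            yield entry.name, entry.stat(follow_symlinks=False).st_size


def ls_document(entries, long_format=False):
    """Generate the XML document; the preference is stored, not applied."""
    root = ET.Element(f"{{{LS_NS}}}ls")
    if long_format:
        root.set("format", "long")
    for name, size in entries:
        ET.SubElement(root, f"{{{LS_NS}}}file", name=name, size=str(size))
    return ET.tostring(root, encoding="unicode")


def render(document):
    """Format an ls document as plain text, as the shell might."""
    root = ET.fromstring(document)
    files = root.findall(f"{{{LS_NS}}}file")
    if root.get("format") == "long":
        return "\n".join(f"{f.get('size'):>8} {f.get('name')}" for f in files)
    return "  ".join(f.get("name") for f in files)
```
]]></listing>
        <p>
          For instance, <code>render(ls_document(scan("."), long_format=True))</code>
          would produce one line per entry, while any filter placed between the
          two calls would still see every field regardless of the preference.
        </p>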
      </subsection>

      <subsection title="Important programs">
        <p>
          Inspecting a typical &unix; system, the programs in the table below were
          identified as being important data generating programs which should
          be included in the new environment.
        </p>
        <table>
          <col width="8em"/>
          <col width="25em"/>
          <tr type="header">
            <td>Program</td>
            <td>Purpose</td>
          </tr>
          <tr>
            <td>date</td>
            <td>Reports the current date and time</td>
          </tr>
          <tr>
            <td>df</td>
            <td>Reports the disk space free on each mounted filesystem</td>
          </tr>
          <tr>
            <td>du</td>
            <td>Reports disk usage</td>
          </tr>
          <tr>
            <td>echo</td>
            <td>Outputs some text</td>
          </tr>
          <tr>
            <td>hostname</td>
            <td>Reports the current host's name</td>
          </tr>
          <tr>
            <td>ls</td>
            <td>Lists file details</td>
          </tr>
          <tr>
            <td>netstat</td>
            <td>Reports currently open network connections</td>
          </tr>
          <tr>
            <td>ps</td>
            <td>Lists details of processes running on the system</td>
          </tr>
          <tr>
            <td>uname</td>
            <td>Reports operating system details</td>
          </tr>
        </table>
      </subsection>

      <subsection title="Implementation examples">
        <p>
          Three examples are now given of new, XML-outputting versions of
          &unix; data generating programs.
        </p>

        <subsubsection title="ls">
          <p>
            The traditional <code>ls</code> program takes a list of files on the
            command line and displays information about those files.  If a file
            on the command line is a directory, information about the files inside
            that directory is given instead.  If no files are given on the
            command line, just information about the files in the current
            directory is given.
          </p>
          <p>
            Without any formatting options, <code>ls</code> outputs just a list of
            the filenames.  If the output is destined for a terminal, these files
            would typically be lined up in columns to save space on the screen.  If
            the output is going to a file or another process' standard input, the
            filenames would be listed one per line.  Such output is good for parsing,
            as there is no real formatting going on here.
          </p>
          <p>
            One of the most commonly used options, however, does force some formatting
            to be done.  The <code>-l</code> option causes <code>ls</code> to output
            a long listing with details about each file, such as its permissions, size
            and last modified date.  This information is arranged into fixed width columns,
            regardless of whether the output is destined for the terminal or another
            process' standard input.  An example of this output is shown below:
          </p>
          <listing><![CDATA[total 8
-rw-r--r--   1 cameron cameron   253 Oct 22 23:40 phonebook
-rw-r--r--   1 cameron cameron   763 Sep  5  2001 resume]]></listing>
          <p>
            The first thing output is the number of disk blocks taken up by
            the files which are listed.  Then comes a line for each file, detailing
            the permissions in symbolic form, number of hard links,
            owner, group, size, last modified time and name of the file.
            One particularly nasty problem with this output is that the custom
            for using whitespace to separate fields has been broken.  The
            last modified time actually has spaces embedded in the field.
            "Oct 22 23:40" should be just the one field of this record.
            This means that it is somewhat difficult to extract this information
            from the record.  One way to do it would be to use the <code>cut</code>
            program, which can select the text from each record between two
            given columns.  Obviously this is going to need the user to
            count the characters in this output to determine where the last modified
            time begins and where it ends.  Really the user would prefer to
            issue a command to just "extract the last modified time" rather than
            "extract characters in columns 40 through 52".
          </p>
          <p>
            It can also be seen that the <code>ls</code> program changes the
            format of the last modified date depending on the actual date.  If
            the last modified date is within about ±6 months of the
            current date, the month, day of month and time will be displayed.  But if
            the last modified date falls outside this range, the month,
            day of month and year will be displayed instead.  This is perhaps
            not such a bad idea when considering what the user wants to see.
            If a document was changed over a year ago, it is more important
            to mention the year of modification than the time of day.  For
            extracting information, however, this is not so good.  Ideally, the full
            time would be listed for every file, so that no information is lost.
          </p>
          <p>
            The <code>ls</code> program also has another mode, enabled by using
            the <code>-R</code> switch.  This causes <code>ls</code> to recurse
            through directories to list information about files in a whole
            directory tree.  An example of the traditional <code>ls -R</code>
            output is listed below:
          </p>
          <listing>.:
documents

./documents:
personal
shared

./documents/personal:
phonebook
resume

./documents/shared:
housekeeping</listing>
          <p>
            The output for the recursive directory listing is split into sections.
            Each section is for one directory, and it starts with a header comprising
            the name of the directory followed by a colon.  Each file in the directory
            is then listed after this header.  A blank line separates each directory.
            This output format breaks the customary flat text format even more than the
            previous example did.  Now instead of there being a file per line, any
            program parsing this output must be careful of the headers and the blank
            lines.  For a human, this output format is fine, as it is a reasonable
            way to flatten the hierarchy of the directory tree for display on the
            terminal.  The inherent structure in the directory tree has been lost,
            though, and it is non-trivial to extract information about a file's
            position in the directory tree.
          </p>
          <p>
            To overcome these shortcomings in the traditional <code>ls</code> program,
            the new <code>ls</code> will produce an XML document with an element
            for each file being listed.  Each <code>file</code> element in the output
            will have a number of attributes, used to hold the details of that file.
            Also, each <code>file</code> element can contain other <code>file</code>
            elements, thereby preserving the structure of the directory tree.
          </p>
          <p>
            Here is the above <code>ls</code> example with the new output format
            (some attributes have been elided for brevity):
          </p>
          <listing><![CDATA[<files xmlns="urn:ns:clm.xml-unix.ls" rec="true">
  <file name="documents" path="." uid="1000" gid="1000"
        size="4096" type="directory" mode="0755"
        mtime="914037321" human-mtime="Dec 19 1998">
    <file name="personal" path="documents" uid="1000" gid="1000"
          size="4096" type="directory" mode="0755" mtime="1022074845"
          human-mtime="May 22 23:40">
      <file name="phonebook" path="documents/personal" uid="1000"
            gid="1000" size="253" type="regular" mode="0644"
            mtime="1022074845" human-mtime="May 22 23:40"/>
      <file name="resume" path="documents/personal" uid="1000"
            gid="1000" size="763" type="regular" mode="0644"
            mtime="1022078771" human-mtime="May 23 00:46"/>
    </file>
    <file name="shared" path="documents" uid="1000" gid="1000"
          size="4096" type="directory" mode="0755" mtime="1022074911"
          human-mtime="May 22 23:41">
      <file name="housekeeping" path="documents/shared" uid="1000"
            gid="1000" size="0" type="regular" mode="0644"
            mtime="1022074911" human-mtime="May 22 23:41"/>
    </file>
  </file>
</files>]]></listing>
          <p>
            The output document uses the namespace <code>urn:ns:clm.xml-unix.ls</code>.
            The specific namespace that was chosen here does not matter, though
            it is important that it is unique.  The namespace will be used by the
            shell after the <code>ls</code> program is run to determine the correct
            transformation to perform on this output to make it human readable.
            The <code>rec</code> attribute is used in the root element to inform the
            transformer that a recursive directory listing format is required when
            it comes time to display it on the terminal.
          </p>
          <p>
            The last modified time is no longer split across several
            whitespace-separated fields, which facilitates its extraction.
            Two attributes are used to represent it.  The first, <code>mtime</code>,
            holds the &unix; time, that is, the number of seconds since
            the &unix; epoch, 00:00:00 UTC on 1 January 1970.  The second attribute is
            <code>human-mtime</code>, and this holds the human readable form
            of the last modified time.  It is this field which will be output
            to the terminal in the directory listing, as it is much easier
            for a human to understand than the &unix; time.
          </p>
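          <p>
            As a rough illustration of how this format eases extraction, the
            following sketch reads an abbreviated copy of the document above and
            retrieves each file's last modified time by attribute name.  It is
            written in Python with the standard <code>xml.etree.ElementTree</code>
            module, which is an assumption made for brevity and is not the
            implementation language of this project; the function name
            <code>modification_times</code> is likewise hypothetical.
          </p>
          <listing><![CDATA[
```python
import xml.etree.ElementTree as ET

# ElementTree spells a namespaced tag as "{namespace}localname".
NS = "{urn:ns:clm.xml-unix.ls}"

# Abbreviated copy of the listing above (most attributes dropped).
LS_OUTPUT = """
<files xmlns="urn:ns:clm.xml-unix.ls" rec="true">
  <file name="documents" path="." mtime="914037321">
    <file name="personal" path="documents" mtime="1022074845">
      <file name="phonebook" path="documents/personal"
            mtime="1022074845"/>
      <file name="resume" path="documents/personal"
            mtime="1022078771"/>
    </file>
  </file>
</files>
"""


def modification_times(xml_text):
    """Return (path/name, mtime) pairs for every file, at any depth."""
    root = ET.fromstring(xml_text)
    result = []
    # iter() walks the whole tree in document order, so nested
    # directories need no special handling.
    for f in root.iter(NS + "file"):
        full_name = f.get("path") + "/" + f.get("name")
        result.append((full_name, int(f.get("mtime"))))
    return result


for name, mtime in modification_times(LS_OUTPUT):
    print(name, mtime)
```
]]></listing>
          <p>
            The field is selected by name, so no character columns have to be
            counted, and the value is the same kind of number for every file
            regardless of how old the file is.
          </p>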
        </subsubsection>

        <subsubsection title="ps">
          <p>
            The <code>ps</code> program reports the status of processes running on the
            system.  <code>ps</code>, while differing in behaviour across different
            &unix; systems, typically has two modes of process selection.  By default,
            the <code>ps</code> command will only report processes which are running
            in the same session as itself.  Usually this results in the processes
            which have been started under the current login shell being listed.
            The <code>-e</code> option instructs <code>ps</code> to list every process
            running.  (There exist other criteria by which to select processes to list,
            such as by process ID or by user; however, these are not as interesting
            as the major two mentioned, and can be reproduced using a filter on the
            data.)
          </p>
          <p>
            Which properties of processes to report is the other main aspect of <code>ps</code>.
            By default, just four fields are shown &#x2014; process ID, controlling terminal,
            running time and command.  Using the <code>-f</code> flag asks for a full listing.
            A full listing comprises the username, process ID, parent process ID, percentage
            CPU usage, start time,
            controlling terminal, running time and command fields.  If neither of these two
            alternatives is appropriate, the <code>-o</code> argument may be used to specify
            which fields should be displayed.
          </p>
          <p>
            Two of these fields contain embedded spaces which make them harder to extract
            from the flat text output &#x2014; the <code>lstart</code> field, which contains
            a long format date, and the <code>command</code> field, which holds the full
            command line for the process.  Also, the <code>command</code> field and the
            <code>wchan</code> field, which contains the system call that the
            process is currently running in, will be truncated if they are not the last
            column in the output.
          </p>
          <p>
            A simple XML file is used to store all of the relevant information about a process
            in the new model <code>ps</code> program.  An example output document
            from running <code>ps -f</code> is given below (many attributes
            have been omitted):
          </p>
          <listing><![CDATA[<processes xmlns="urn:ns:clm.xml-unix.ps" sid="1925" full="true">
  <process pid="1" lstart='Mon Nov  4 10:59:05 2002' vsize='1264' uid='0'
           state='S' command='init' tty='?' time='00:00:04' sid='0' c='0'
           nice='0' ppid='0' user='root' wchan='select'/>
  <process pid="240" lstart='Mon Nov  4 10:59:24 2002' vsize='2156' uid='0'
           state='S' command='/sbin/dhclient-2.2.x -q eth0' tty='?'
           time='00:00:00' sid='240' c='0' nice='0' ppid='1' user='root'
           wchan='select'/>
  <process pid="244" lstart='Mon Nov  4 10:59:24 2002' vsize='1372' uid='1'
           state='S' command='/sbin/portmap' tty='?' time='00:00:00' sid='244'
           c='0' nice='0' ppid='1' user='daemon' wchan='poll'/>
  <process pid="347" lstart='Mon Nov  4 10:59:29 2002' vsize='2028' uid='0'
           state='S' command='/sbin/syslogd' tty='?' time='00:00:00' sid='347'
           c='0' nice='0' ppid='1' user='root' wchan='select'/>
  <process pid="350" lstart='Mon Nov  4 10:59:29 2002' vsize='1936' uid='0'
           state='S' command='/sbin/klogd' tty='?' time='00:00:00' sid='350'
           c='0' nice='0' ppid='1' user='root' wchan='syslog'/>
</processes>]]></listing>
          <p>
            To signify that the "full" listing format is required, a
            <code>full</code> attribute is added to the root element of the
            document.  Also, an <code>sid</code> attribute is added.  This
            contains the process ID of the session leader of the <code>ps</code>
            process.  This is helpful for the transformation process the shell
            will perform later, in determining which processes should be
            displayed.
          </p>
          <p>
            Thus, although only some processes will be written to the terminal,
            all process details are included in the output.  Selecting
            which processes to display is a formatting issue, and should be
            delayed until as late as possible.
          </p>
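          <p>
            The late selection described above might be sketched as follows.
            This is only an illustration in Python using the standard
            <code>xml.etree.ElementTree</code> module; in the model presented
            here the selection is performed by the shell's transformation step,
            and the function name <code>processes_to_display</code> and the two
            processes in session 1925 are invented for the example.
          </p>
          <listing><![CDATA[
```python
import xml.etree.ElementTree as ET

NS = "{urn:ns:clm.xml-unix.ps}"

# Cut-down document in the format shown above.  The entries with
# pid 1925 and 2001 are made up so that the session is not empty.
PS_OUTPUT = """
<processes xmlns="urn:ns:clm.xml-unix.ps" sid="1925" full="true">
  <process pid="1" sid="0" command="init"/>
  <process pid="240" sid="240" command="/sbin/dhclient-2.2.x -q eth0"/>
  <process pid="1925" sid="1925" command="sh"/>
  <process pid="2001" sid="1925" command="ps -f"/>
</processes>
"""


def processes_to_display(xml_text):
    """Return the process elements belonging to the session of ps."""
    root = ET.fromstring(xml_text)
    # The root element records the session leader of the ps process.
    session = root.get("sid")
    return [p for p in root.findall(NS + "process")
            if p.get("sid") == session]


for proc in processes_to_display(PS_OUTPUT):
    print(proc.get("pid"), proc.get("command"))
```
]]></listing>
          <p>
            Because every process is still present in the document, a filter
            placed earlier in the pipeline can select on any other criterion
            without <code>ps</code> having to be run again.
          </p>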
        </subsubsection>

        <subsubsection title="df">
          <p>
            The <code>df</code> program is a relatively simple tool to report the
            amount of free disk space on each of the currently mounted
            filesystems.  Some systems, such as Linux, have a number of
            virtual filesystems which don't actually take up any space.  These
            filesystems are by default not listed by <code>df</code>, unless
            the <code>-a</code> option is given.
          </p>
          <p>
            <code>df</code> does not have much wrong with its output.  It is
            straightforward and easy to parse.  The only thing the user
            must be wary of is the header on the first line.
          </p>
          <p>
            An XML version of the output of running <code>df -a</code> is shown below:
          </p>
          <listing><![CDATA[<filesystems xmlns="urn:ns:clm.xml-unix.df" all="true">
  <filesystem dev="/dev/hde2" blocks="29249536" free="1884160" use="94"
              mountpoint="/"/>
  <filesystem dev="proc" blocks="0" free="0" use="0" mountpoint="/proc"/>
  <filesystem dev="devpts" blocks="0" free="0" use="0" mountpoint="/dev/pts"/>
  <filesystem dev="/dev/hde1" blocks="10481664" free="3760128" use="65"
              mountpoint="/winxp"/>
  <filesystem dev="/dev/hdf3" blocks="28409856" free="1523712" use="95"
              mountpoint="/storage"/>
  <filesystem dev="usb" blocks="0" free="0" use="0" mountpoint="/proc/bus/usb"/>
</filesystems>]]></listing>
          <p>
            As with both <code>ls</code> and <code>ps</code>, a formatting option
            (in this case, <code>-a</code>) is handled by storing it as
            an attribute in the root element of the document.
          </p>
        </subsubsection>

      </subsection>
    </section>

    <section title="Data filtering programs">
      <p>
        The second category of programs in this new model of the
        &unix; command line environment is the data filtering programs.
        Filters are programs that take some data from standard input,
        manipulate it somehow, and then write that manipulated data
        to standard output.  Since data generating programs now
        produce well formed XML documents on their standard output,
        the filtering programs will have to parse and manipulate
        those documents.
      </p>
      
      <p>
        Parsing XML documents correctly is not a trivial task.  Fortunately
        there are many freely available XML parsing libraries in
        existence.  In the implementations presented in this project,
        the GNOME project's libxml2 library was used (www.xmlsoft.org).
      </p>

      <subsection title="Transforming the input">
        <p>
          Once the input XML document has been parsed and a document tree
          constructed in memory, the filter can transform the document.
          At first glance, it may seem as if the majority of text filters
          in the traditional &unix; command line system can be converted
          to operate on XML documents without much paradigm shift.
          However, the fundamental difference in structure between flat
          text files and XML documents causes the XML filters to
          behave somewhat differently.
        </p>
        <p>
          The most prominent difference in the behaviour of XML filters is
          that they need to know where in the document they will be
          operating.  According to the model presented in this project,
          most XML documents generated by a data generating process will
          have an element for each record, whose parent element is the
          document element.  This is a nice correspondence to
          the line-based records in a flat text file.  XML filters, though,
          cannot rely on that structure being adhered to.  Some processes
          will generate output with a different structure.  XML data files
          on disk could have a completely different structure.  The XML
          filters must therefore be able to handle the main data of an
          XML document being stored in a different location.
        </p>
        <p>
          To achieve this, most XML filters presented here have a command line
          argument which instructs the filter to
          act upon certain elements in the document.  By default, to
          concur with the description of a typical program output in section 3,
          the filters will act upon each element directly under the
          root element of the document.
        </p>
      </subsection>

      <subsection title="Important filters" break="true">
        <p>
          The programs presented in the following table have been identified
          as being important, and should be present in a complete implementation
          of an XML based &unix; command line environment.
        </p>
        <table>
          <col width="8em"/>
          <col width="25em"/>
          <tr type="header">
            <td>Program</td>
            <td>Purpose</td>
          </tr>
          <tr>
            <td>cat</td>
            <td>Concatenates files</td>
          </tr>
          <tr>
            <td>grep</td>
            <td>Searches for text which matches a regular expression</td>
          </tr>
          <tr>
            <td>awk</td>
            <td>Record/field matching scripting language</td>
          </tr>
          <tr>
            <td>cut</td>
            <td>Selects fields from a file</td>
          </tr>
          <tr>
            <td>head</td>
            <td>Selects the first records from a file</td>
          </tr>
          <tr>
            <td>tail</td>
            <td>Selects the last records from a file</td>
          </tr>
          <tr>
            <td>sort</td>
            <td>Sorts records in a file</td>
          </tr>
        </table>
      </subsection>

      <subsection title="Implementation examples">
        <p>
          Presented now are three examples of XML filtering programs
          for the new &unix; command line environment.
        </p>

        <subsubsection title="cat">
          <p>
            The <code>cat</code> program is a very simple one in the
            traditional &unix; command line environment.  Its job
            is simply to concatenate files and write them out to
            standard output.  The concatenation is trivial; since
            flat text files use line based records, the program
            needs only to write out each file in succession to
            standard output for them to be joined together.
          </p>
          <p>
            Unfortunately, such simple behaviour will not work with
            XML documents.  This is because outputting one document
            after another will not result in a well formed document.
            Well formed documents must contain exactly one root element.
            Thus, <code>cat</code> must do something else.
          </p>
          <p>
            If it is assumed that records are stored as elements directly
            under the root element of the document, as outlined
            in the model description in section 3, then there is a
            fairly simple means of concatenating two XML documents together.
            All that must be done is to take the second level elements
            from the source documents and insert them just before the
            closing tag of the destination document's root element.  For
            example, if the file <code>doc1.xml</code> contains:
          </p>
          <listing><![CDATA[<data>
  <record id="1"/>
  <record id="2"/>
</data>]]></listing>
          <p>
            and <code>doc2.xml</code> contains:
          </p>
          <listing><![CDATA[<data>
  <record id="3"/>
  <record id="4"/>
</data>]]></listing>
          <p>
            then running the command <code>cat doc1.xml doc2.xml</code>
            should result in the following document:
          </p>
          <listing><![CDATA[<data>
  <record id="1"/>
  <record id="2"/>
  <record id="3"/>
  <record id="4"/>
</data>]]></listing>
          <p>
            Note that the implementation here does not provide any checking
            to ensure both documents are in the same namespace or have
            the same root element.  Traditional &unix; <code>cat</code>
            doesn't check for similar file formats either; it just
            outputs the files with no regard to whether they "should" be
            allowed to be joined together.
          </p>
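          <p>
            The default concatenation just described can be approximated in a
            few lines.  The sketch below is in Python with the standard
            <code>xml.etree.ElementTree</code> module, not the libxml2-based
            implementation of this project, and the name <code>xml_cat</code>
            is hypothetical.  As a simplification it copies only the child
            elements of each source root, ignoring any text or comments that
            sit directly under the root.
          </p>
          <listing><![CDATA[
```python
import xml.etree.ElementTree as ET


def xml_cat(first, *rest):
    """Append the child elements of each later root to the first root."""
    dest = ET.fromstring(first)
    for text in rest:
        src = ET.fromstring(text)
        # No check is made that the roots or namespaces agree,
        # mirroring the behaviour described above.
        dest.extend(list(src))
    return dest


DOC1 = '<data><record id="1"/><record id="2"/></data>'
DOC2 = '<data><record id="3"/><record id="4"/></data>'

merged = xml_cat(DOC1, DOC2)
print(ET.tostring(merged, encoding="unicode"))
```
]]></listing>
          <p>
            The root element of the first document is reused as the root of the
            result, so the output always has exactly one root element.
          </p>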
          <p>
            If the important data is elsewhere in the source document,
            however, there needs to be some way of specifying which 
            elements are to be selected.  Similarly, if the destination
            document shouldn't have its root element appended to, 
            there must be a way of specifying the insertion point for
            the new data.  The <code>cat</code> program presented here
            has a command line argument <code>--from</code> which
            allows the user to give an XPath location specifying
            which elements from the source document are to be copied.
            Analogously, there exists a <code>--to</code> argument
            which should be followed by an XPath location indicating
            where in the destination document the source documents'
            elements are to be inserted.
          </p>
          <p>
            The default <code>--from</code> XPath location is <code>/*/node()</code>,
            which refers to all of the child nodes (that is, elements, text nodes,
            comments and processing instructions, but not attributes) directly
            underneath the document's root element.  The default <code>--to</code>
            XPath location is <code>/*</code>, which refers to the root element
            of the destination document.
          </p>
          <p>
            To demonstrate these options, consider the file <code>doc3.xml</code>
            to contain:
          </p>
          <listing><![CDATA[<html>
  <head>
    <title>Document title</title>
  </head>
  <body>
    <p>This is from the first document.</p>
  </body>
</html>]]></listing>
          <p>
            and <code>doc4.xml</code> to contain:
          </p>
          <listing><![CDATA[<html>
  <head>
    <title>Second document title</title>
  </head>
  <body>
    <p>This is from the second document.</p>
  </body>
</html>]]></listing>
          <p>
            If the command <code>cat --to /html/body doc3.xml --from '/html/body/*' doc4.xml</code>
            is then issued, the resulting document should be:
          </p>
          <listing><![CDATA[<html>
  <head>
    <title>Document title</title>
  </head>
  <body>
    <p>This is from the first document.</p>
    <p>This is from the second document.</p>
  </body>
</html>]]></listing>
          <p>
            As encoding is an important issue with XML documents, unlike with
            traditional 7-bit ASCII flat text files, the <code>cat</code>
            program also has a command line argument to specify which
            encoding should be used for output.  Thus, to transcode
            an XML document to EBCDIC-US, the command <code>cat --encoding EBCDIC-US</code>
            is used.  libxml2 can be compiled with support for the libiconv
            internationalisation library.  Running the command <code>iconv -l</code>
            at the command line produces a list of encodings which <code>cat</code>
            should therefore be able to handle.
          </p>
        </subsubsection>

        <subsubsection title="grep">
          <p>
            The <code>grep</code> program is one of the most useful of the
            traditional &unix; command line environment filters.  Its purpose
            is to select records (i.e., lines, in the flat text model)
            which match a given regular expression.  As with <code>cat</code>,
            <code>grep</code> must be told where in the XML document it should
            be operating.
          </p>
          <p>
            The <code>grep</code> program presented here takes two parameters
            to control where the search takes place.  Firstly, it has
            a <code>--context</code> command line argument.  This tells
            <code>grep</code> what each record of the XML document is.
            By default, this will be the XPath location <code>/*/*</code>,
            the elements directly underneath the document's root element.
            Secondly, <code>grep</code> can be passed a <code>--node</code>
            argument.  This controls, relative to each node that matches
            the context XPath location, which nodes should be compared
            against the regular expression.  The default node XPath
            location is <code>//text()</code>, that is any text
            node underneath the context node.
          </p>
          <p>
            For the regular document structure mentioned earlier, the default
            XPath locations work well.  For example, assuming that the file
            <code>doc5.xml</code> contains:
          </p>
          <listing><![CDATA[<data>
  <entry name="entry1">
    Hello there.
  </entry>
  <entry name="entry2">
    How <em>are</em> you?
  </entry>
  <entry name="entry3">
    I am well.
  </entry>
</data>]]></listing>
          <p>
            then issuing the command <code>grep 'a[^ ]' doc5.xml</code> will
            produce the following XML document:
          </p>
          <listing><![CDATA[<data>
  <entry name="entry2">
    How <em>are</em> you?
  </entry>
  <entry name="entry3">
    I am well.
  </entry>
</data>]]></listing>
          <p>
            To demonstrate the use of the <code>--context</code> and
            <code>--node</code> arguments, assume the file <code>doc6.xml</code>
            contains the following:
          </p>
          <listing><![CDATA[<data>
  <entries>
    <entry name="entry1">
      Hello there.
    </entry>
    <entry name="entry2">
      How <em>are</em> you?
    </entry>
    <entry name="entry3">
      I am well.
    </entry>
  </entries>
</data>]]></listing>
          <p>
            Then, the command <code>grep --context /data/entries/entry --node @name 'entry[12]' doc6.xml</code>
            can be used to select the first two entries:
          </p>
          <listing><![CDATA[<data>
  <entries>
    <entry name="entry1">
      Hello there.
    </entry>
    <entry name="entry2">
      How <em>are</em> you?
    </entry>
  </entries>
</data>]]></listing>
          <p>
            Notice that the two <code>entry</code> elements are returned still contained in the
            <code>entries</code> element.  This is because the elements which match the regular
            expression are checked to see if they have a common parent, and since they do,
            they are placed within it in the output document.
          </p>
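          <p>
            A much reduced sketch of this behaviour is given below for the
            regular document structure only.  It is written in Python with the
            standard <code>xml.etree.ElementTree</code> and <code>re</code>
            modules rather than libxml2, it treats the children of the root
            element as the records instead of accepting arbitrary XPath
            locations, and it uses Python's regular expression dialect.  The
            name <code>xml_grep</code> and its <code>attribute</code>
            parameter are hypothetical stand-ins for the real program and its
            <code>--node</code> argument.
          </p>
          <listing><![CDATA[
```python
import re
import xml.etree.ElementTree as ET


def xml_grep(xml_text, pattern, attribute=None):
    """Keep only the records (children of the root) matching pattern.

    With attribute=None every text node beneath a record is tested on
    its own; otherwise the named attribute of the record is tested.
    """
    regex = re.compile(pattern)
    root = ET.fromstring(xml_text)
    for record in list(root):
        if attribute is None:
            candidates = record.itertext()
        else:
            candidates = [record.get(attribute, "")]
        if not any(regex.search(text) for text in candidates):
            root.remove(record)
    return root


DOC5 = """<data>
  <entry name="entry1">
    Hello there.
  </entry>
  <entry name="entry2">
    How <em>are</em> you?
  </entry>
  <entry name="entry3">
    I am well.
  </entry>
</data>"""

by_text = xml_grep(DOC5, "a[^ ]")
by_name = xml_grep(DOC5, "entry[12]", attribute="name")
print([e.get("name") for e in by_text])
print([e.get("name") for e in by_name])
```
]]></listing>
          <p>
            Records that do not match are removed from the parsed tree, so the
            surviving records stay inside their original parent element.
          </p>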
        </subsubsection>

        <subsubsection title="sort">
          <p>
            The traditional <code>sort</code> program rearranges the records in a file, sorting them
            by a particular field.  The sort key is specified by giving a field number
            or range of characters.  The <code>-r</code> command line argument is used
            to reverse the sort order, and <code>-n</code> is used to signify a numeric
            (rather than lexicographic) sort.
          </p>
          <p>
            Just as with <code>cat</code> and <code>grep</code>, the new version of <code>sort</code>
            defaults to sorting the elements directly beneath the document's root element.
            If this is not appropriate, <code>--context</code> and <code>--node</code>
            command line arguments can be given to select which elements are to be sorted,
            and on what they are to be sorted, respectively.  By default, sorting is done
            just on the text nodes beneath each context node.
          </p>
          <p>
            Below is a demonstration of how the new model <code>sort</code> works.
            If the file <code>doc7.xml</code> contains:
          </p>
          <listing><![CDATA[<foods>
  <staple id="4">Bread</staple>
  <fruit id="1">Apple</fruit>
  <fruit id="2">Starfruit</fruit>
  <staple id="3">Rice</staple>
</foods>]]></listing>
          <p>
            and the command <code>sort doc7.xml</code> is issued, the resulting
            XML document will be:
          </p>
          <listing><![CDATA[<foods>
  <fruit id="1">Apple</fruit>
  <staple id="4">Bread</staple>
  <staple id="3">Rice</staple>
  <fruit id="2">Starfruit</fruit>
</foods>]]></listing>
          <p>
            A different sort key can be specified.  Running the command
            <code>sort --context '/foods/*' --node @id doc7.xml</code> results in:
          </p>
          <listing><![CDATA[<foods>
  <fruit id="1">Apple</fruit>
  <fruit id="2">Starfruit</fruit>
  <staple id="3">Rice</staple>
  <staple id="4">Bread</staple>
</foods>]]></listing>
          <p>
            The <code>sort</code> command can also take an XPath expression
            (not just an XPath location) to sort on.  This makes it possible to,
            for example, sort based on the name of the element being sorted.
            Executing <code>sort --context '/foods/*' --expr 'name()' doc7.xml</code>
            produces:
          </p>
          <listing><![CDATA[<foods>
  <fruit id="1">Apple</fruit>
  <fruit id="2">Starfruit</fruit>
  <staple id="4">Bread</staple>
  <staple id="3">Rice</staple>
</foods>]]></listing>
          <p>
            As can be seen from this example, the sort is stable.  The two
            <code>fruit</code> elements are considered equal in the sorting,
            as are the two <code>staple</code> elements.  Their relative
            ordering is the same as their relative ordering in the original
            document.
          </p>
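          <p>
            The effect of a stable sort on this document can be reproduced with
            the short sketch below.  It uses Python and the standard
            <code>xml.etree.ElementTree</code> module, not the implementation
            of this project, and it sorts only the children of the root
            element.  The name <code>xml_sort</code> and the use of a key
            function in place of the <code>--node</code> and
            <code>--expr</code> arguments are assumptions made for the
            example.
          </p>
          <listing><![CDATA[
```python
import xml.etree.ElementTree as ET


def xml_sort(xml_text, key):
    """Reorder the children of the root element by the given key."""
    root = ET.fromstring(xml_text)
    # sorted() is stable: records with equal keys keep their
    # original relative order.
    root[:] = sorted(root, key=key)
    return root


DOC7 = ('<foods>'
        '<staple id="4">Bread</staple>'
        '<fruit id="1">Apple</fruit>'
        '<fruit id="2">Starfruit</fruit>'
        '<staple id="3">Rice</staple>'
        '</foods>')

# Default behaviour: sort on the text beneath each record.
by_text = xml_sort(DOC7, key=lambda e: "".join(e.itertext()))
# Counterpart of sorting on the expression name().
by_name = xml_sort(DOC7, key=lambda e: e.tag)

print([e.get("id") for e in by_text])
print([e.get("id") for e in by_name])
```
]]></listing>
          <p>
            When sorting on the element name, the two <code>staple</code>
            elements compare equal, so the one with <code>id="4"</code> stays
            ahead of the one with <code>id="3"</code>, as in the output above.
          </p>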
        </subsubsection>
      </subsection>
    </section>

    <section title="User interface to the environment">
      <subsection title="The shell">
        <p>
          The final piece in the new model of the &unix; command line
          environment is the shell.  The shell is the program which
          interacts with the user via the terminal, allowing the user
          to execute commands and view their output.
          The new shell behaves exactly like the Bourne shell in
          most respects.  However, it also has the ability to
          transform the XML output of a program to a human
          readable format suitable for display on the terminal.
        </p>
      </subsection>

      <subsection title="Human readable output">
        <p>
          At the end of a command pipeline, an XML document
          will be produced.  This document, while still readable,
          is not intended for viewing by the user.  Instead, a formatted
          representation of the data should be printed to the terminal.
          To determine which transformation should be
          applied to the output document, however, the shell must first
          read in and parse that document.
        </p>
        <p>
          The shell uses a configuration file, named <code>/etc/transforms.xml</code>
          in the implementation given here, to decide which XSLT stylesheet
          should be used to transform the XML output document.  An example of this
          file is given below:
        </p>
        <listing><![CDATA[<transforms>
  <transform namespace="urn:ns:clm.xml-unix.df">
    <target type="terminal" stylesheet="transforms/df-terminal.xsl"/>
  </transform>
  <transform namespace="urn:ns:clm.xml-unix.ps">
    <target type="terminal" stylesheet="transforms/ps-terminal.xsl"/>
  </transform>
  <transform namespace="urn:ns:clm.xml-unix.ls">
    <target type="terminal" stylesheet="transforms/ls-terminal.xsl"/>
  </transform>
</transforms>]]></listing>
        <p>
          The configuration file associates an XSLT stylesheet whose
          output is appropriate for displaying on a terminal with
          each type of document that can be produced by the
          data generating programs.  The namespace of the XML output
          document is compared against that stored in the <code>namespace</code>
          attribute of each <code>transform</code> element in the
          <code>transforms.xml</code> file.  If a match is found,
          the relevant stylesheet is used to transform the output,
          which is then written to the terminal.  If no match is found,
          the XML output document is written to the terminal verbatim.
        </p>
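        <p>
          The lookup described above might be sketched in Python as follows.
          This is only an illustration under stated assumptions, not the shell's
          actual code: it uses Python's standard <code>xml.etree.ElementTree</code>
          module, the function names <code>root_namespace</code> and
          <code>find_stylesheet</code> are hypothetical, and the configuration is
          held in a string rather than read from <code>/etc/transforms.xml</code>.
        </p>
        <listing><![CDATA[import xml.etree.ElementTree as ET

TRANSFORMS = """<transforms>
  <transform namespace="urn:ns:clm.xml-unix.df">
    <target type="terminal" stylesheet="transforms/df-terminal.xsl"/>
  </transform>
  <transform namespace="urn:ns:clm.xml-unix.ps">
    <target type="terminal" stylesheet="transforms/ps-terminal.xsl"/>
  </transform>
</transforms>"""

def root_namespace(xml_text):
    """Return the namespace URI of the document element, or None."""
    tag = ET.fromstring(xml_text).tag
    # ElementTree spells a namespaced tag as "{uri}localname".
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return None

def find_stylesheet(config_text, output_text, target_type="terminal"):
    """Return the stylesheet registered for the output document, or None."""
    namespace = root_namespace(output_text)
    if namespace is None:
        return None
    for transform in ET.fromstring(config_text).iter("transform"):
        if transform.get("namespace") != namespace:
            continue
        for target in transform.iter("target"):
            if target.get("type") == target_type:
                return target.get("stylesheet")
    return None  # no match: the caller would write the XML verbatim

df_output = '<filesystems xmlns="urn:ns:clm.xml-unix.df"/>'
print(find_stylesheet(TRANSFORMS, df_output))   # transforms/df-terminal.xsl
print(find_stylesheet(TRANSFORMS, "<foods/>"))  # None]]></listing>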
        <p>
          However, the user will not want this automatic transformation
          to occur all the time.  This is especially true when the
          user is building up a command pipeline.  In a traditional
          &unix; system, pipelines are generally built up incrementally.
          First, the user would try one command.  Then, upon inspecting
          the output of this command, they would pipe that output
          through a filter.  If the new shell always converted
          the XML output documents into a human readable form, the
          user would have no way of determining through which command
          the output should be piped.  For this reason, the shell
          introduces a new character that instructs it to
          dump the XML output document to the terminal without converting it.
          This character, the caret (<code>^</code>), is placed at the
          end of the pipeline to ensure that the markup is kept.
        </p>
        <p>
          To exemplify the behaviour of the shell, below is some sample
          output from the shell after entering some commands:
        </p>
        <listing><![CDATA[$ df
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/hde2             29249536  27398144   1851392  94% /
/dev/hde1             10481664   6721536   3760128  65% /winxp
/dev/hdf3             28409856  26886144   1523712  95% /storage
$ df ^
<filesystems xmlns="urn:ns:clm.xml-unix.df">
  <filesystem dev="/dev/hde2" blocks="29249536" free="1851392" use="94"
              mountpoint="/"/>
  <filesystem dev="proc" blocks="0" free="0" use="0" mountpoint="/proc"/>
  <filesystem dev="devpts" blocks="0" free="0" use="0" mountpoint="/dev/pts"/>
  <filesystem dev="/dev/hde1" blocks="10481664" free="3760128" use="65"
              mountpoint="/winxp"/>
  <filesystem dev="/dev/hdf3" blocks="28409856" free="1523712" use="95"
              mountpoint="/storage"/>
  <filesystem dev="usb" blocks="0" free="0" use="0" mountpoint="/proc/bus/usb"/>
</filesystems>]]></listing>
        <p>
          The presence of the caret after the command prevents the shell
          from using the XSLT stylesheet to produce the formatted output.
        </p>
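        <p>
          One simple way a shell could recognise the trailing caret is sketched
          below in Python.  How the shell presented here actually tokenises its
          input is not described, so this is an assumption: the function name
          <code>split_raw_marker</code> is hypothetical, and quoting and escaping
          of the caret are deliberately ignored.
        </p>
        <listing><![CDATA[def split_raw_marker(command_line):
    """Return (pipeline, raw); raw is True if the line ends with a caret."""
    stripped = command_line.rstrip()
    if stripped.endswith("^"):
        # Drop the caret and ask for the XML to be printed unconverted.
        return stripped[:-1].rstrip(), True
    return stripped, False

print(split_raw_marker("df ^"))  # ('df', True)
print(split_raw_marker("df"))    # ('df', False)]]></listing>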
      </subsection>
    </section>

    <section title="Evaluating the model">
      <p>
        Presented in this section are some examples of the new command
        line environment in use, and a discussion of some problems
        that were encountered with the model.
      </p>
      
      <subsection title="Common applications">
        <p>
          To determine the efficacy of the new model for the &unix; command line
          environment, some common tasks are performed in both the traditional
          model and the new model.  The commands used to perform the tasks
          are then compared.
        </p>
        
        <subsubsection title="Determine filesystem with most free space">
          <p>
            The goal of this task is to determine which filesystem has
            the most free disk space.  The <code>df</code> command
            supplies this information.
            Both the traditional and the new <code>df</code> print the
            same output to the terminal, but the data passed down the
            pipeline differs: flat text in the traditional model, and an
            XML document in the new one.
          </p>
          <p>
            To select the filesystem with the largest amount of free space,
            the <code>sort</code> command can be used to arrange the
            records such that the last record is the one with the most
            free space.  In the traditional system, the user must be
            mindful of the header <code>df</code> also outputs.  Since it
            is part of the flat text output, it must be avoided.  The
            <code>tail +2</code> command can be used to remove that
            first line.  Subsequently, the records can be sorted.
            Since the traditional <code>sort</code> command uses field
            numbers to select the key, the <code>sort -n +3</code>
            command is used to sort the records numerically by the fourth
            column.  Finally, the <code>tail -1</code> command
            can be used to select the last record from the sorted list.
            That record will be the one with the highest amount of
            free space.
          </p>
          <p>
            In the new model, the header does not form part of <code>df</code>'s
            output, so it does not have to be avoided as must be done in the
            traditional system.  The output of <code>df</code> can be piped
            straight into <code>sort</code>.  Inspecting the output format of
            <code>df</code>, it is apparent that the <code>free</code> attribute
            contains the number of free blocks on a filesystem.  Thus, the
            output of <code>df</code> must be piped through
            <code>sort --numeric --node @free</code>.  Once the records
            are sorted, the last one can be chosen as with the traditional system,
            using <code>tail -1</code>.
          </p>
          <p>
            In summary, the traditional &unix; command to find the filesystem
            with the most free space is:
          </p>
          <listing>df | tail +2 | sort -n +3 | tail -1</listing>
          <p>
            The command needed in the new model is:
          </p>
          <listing>df | sort --numeric --node @free | tail -1</listing>
          <p>
            The new model command, while taking more keystrokes to type, uses one fewer pipeline stage
            to generate the output.  Also, it is more intuitive than the traditional &unix;
            command.  The traditional &unix; command requires a numeric reference to the sort key,
            while the new model command uses a symbolic key name.  The <code>sort</code>
            command should thus be easier for the user to write.
          </p>
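          <p>
            The effect of the new model pipeline can be mimicked in Python to
            show what selecting on a symbolic key amounts to.  The sketch below
            assumes Python's standard <code>xml.etree.ElementTree</code> module
            and uses a subset of the <code>df ^</code> output shown earlier; it
            is an illustration only, not the code of the new <code>sort</code>
            or <code>tail</code>.
          </p>
          <listing><![CDATA[import xml.etree.ElementTree as ET

NS = "urn:ns:clm.xml-unix.df"
DF_OUTPUT = """<filesystems xmlns="urn:ns:clm.xml-unix.df">
  <filesystem dev="/dev/hde2" blocks="29249536" free="1851392" use="94"
              mountpoint="/"/>
  <filesystem dev="proc" blocks="0" free="0" use="0" mountpoint="/proc"/>
  <filesystem dev="/dev/hde1" blocks="10481664" free="3760128" use="65"
              mountpoint="/winxp"/>
  <filesystem dev="/dev/hdf3" blocks="28409856" free="1523712" use="95"
              mountpoint="/storage"/>
</filesystems>"""

root = ET.fromstring(DF_OUTPUT)
records = root.findall("{%s}filesystem" % NS)

# sort --numeric --node @free: compare the free attribute as a number
ordered = sorted(records, key=lambda fs: int(fs.get("free")))

# tail -1: keep only the last record
most_free = ordered[-1]
print(most_free.get("dev"), most_free.get("free"))  # /dev/hde1 3760128]]></listing>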
        </subsubsection>

        <subsubsection title="Determine which process has been running the longest">
          <p>
            The goal of this task is to display only the process ID and the command
            line of the process which has used the most CPU time.
          </p>
          <p>
            The <code>ps</code> command will need to be used in this
            command pipeline.  In the traditional system, the first step is to
            have <code>ps</code> select all processes for display, along with
            the relevant fields (process ID, CPU time and command line).
            As with the previous task, the header must be stripped off.  Luckily,
            the GNU <code>ps</code> program being used has a <code>--no-headers</code>
            command line option.
            Thus the <code>ps</code> command used is <code>ps -e -o pid,time,command --no-headers</code>.
          </p>
          <p>
            Once all of the records are available, they must be sorted based on their CPU time.
            This is done with the <code>sort</code> command, specifying the second field
            as the sort key.  The command used is <code>sort +1</code>.
            The last record in the output will now be for the process with the greatest CPU
            time.  It can be selected with the <code>tail -1</code> command.
          </p>
          <p>
            Finally, since the goal of this task is to display only the process ID
            and the command line, the CPU time must be removed.  Unfortunately,
            since the command line could have spaces in its field, a field based
            solution (such as using <code>awk</code>) is not possible.  Instead,
            <code>sed</code> can be used to match the time and remove it from the
            line.  The command to do this is <code>sed 's/ ..:..:..//'</code>.
          </p>
          <p>
            Building the pipeline in the new system starts similarly.  However,
            when selecting which fields to display, we can omit the CPU time.
            This wasn't possible in the traditional model, since if the CPU time
            wasn't included in the initial <code>ps</code> output, its value
            could not be used as the sort key.  With the new model, however,
            all of the information about each process is included in <code>ps</code>'s
            output document.  It is just that the process ID and command line
            can be registered as the only fields to be displayed when the
            shell finally transforms the document for output to the terminal.
            Just like in the previous task, the header is part of the
            formatting and thus is not present in the XML output document of
            <code>ps</code>.  Thus it does not need to be explicitly
            excluded.  The initial command in the pipeline is therefore
            <code>ps -e -o pid,command</code>.
          </p>
          <p>
            Looking at the raw XML output
            from the <code>ps</code> command, it can be seen that the
            <code>time</code> attribute holds the CPU time field to
            be sorted on.  So, the next command in the pipeline is 
            <code>sort --node @time</code>.  The last record in the
            sorted list, which has the greatest CPU time, can be 
            selected with <code>tail -1</code>.  Since the formatting
            options given to <code>ps</code> initially preclude the CPU
            time from being displayed, there is no need to remove it as in
            the traditional model. 
          </p>
          <p>
            Comparing the two models, the traditional model solution resulted in
            the command:
          </p>
          <listing>ps -e -o pid,time,command --no-headers | sort +1 | tail -1 | sed 's/ ..:..:..//'</listing>
          <p>
            while the new model solution produced the command:
          </p>
          <listing>ps -e -o pid,command | sort --node @time | tail -1</listing>
          <p>
            The new model gives the user a much cleaner and shorter command to
            produce the correct output.  Again, the command is more intuitive
            (especially the <code>sort</code> command) than the equivalent
            traditional model command.  The traditional model command
            needs more text processing to massage the data into an acceptable form.
          </p>
        </subsubsection>
      </subsection>

      <subsection title="Issues with the model">
        <p>
          While developing the model, the idea that user programs need not read in
          and output complete, well formed XML documents was considered.  Since some
          programs might extract individual attributes or just text from an XML document,
          the question was raised of whether such fragments should be valid output and input
          to other programs.  An implementation of a model allowing such
          data was begun, but it soon lost the simplicity of the original model.
          Some programs could perform sensible tasks with these sub-documents, while
          others had to reject incomplete documents.
        </p>
        <p>
          A particular example which illustrates the problem is that of XInclude
          <cite ref="xinclude"/>.  XInclude allows the user to include another XML document
          in the current document by referring to its URL.  This inclusion is a direct
          substitution, in much the same way as a <code>#include</code> is performed in C.
          The problem with XInclude is that it allows the user to include a file which
          contains just a collection of elements without a single root element.
          This included file is not a well formed XML document.  Thus the question of
          how such a file should be processed in the new model of the command line environment
          arose.
        </p>
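        <p>
          The difficulty with such a file can be demonstrated with any conforming
          XML parser.  The Python sketch below, which assumes the standard
          <code>xml.etree.ElementTree</code> module and a hypothetical helper
          named <code>is_well_formed</code>, shows a collection of elements
          without a single root element being rejected.
        </p>
        <listing><![CDATA[import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

complete = "<foods><fruit>Apple</fruit><staple>Rice</staple></foods>"
fragment = "<fruit>Apple</fruit><staple>Rice</staple>"  # no single root

print(is_well_formed(complete))  # True
print(is_well_formed(fragment))  # False]]></listing>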
        <p>
          The idea of allowing the processing of incomplete XML documents was then abandoned,
          and the XML processing programs returned to processing only complete XML documents.
        </p>
      </subsection>
    </section>

    <section title="Conclusion">
      <p>
        During this project a comprehensive model for &unix; process input and
        output using XML documents was developed.  This model specified the
        general format to be used for a process' standard input and output.
        The requirement that formatting should be kept separate from the
        content of a process' output was the driving force behind the rest
        of the model.
      </p>
      <p>
        Once the model was devised, implementations of a few &unix; programs
        conforming to the new model were created.  These programs,
        while not forming a complete system, demonstrated the concept
        the model is trying to convey.
      </p>
      <p>
        Some sample tasks were also performed with both a traditional &unix;
        command line environment and the environment afforded by the new model.
        Comparing the commands used to perform the tasks in both models,
        the claim that the new model provides the user with a more intuitive
        environment in which to work is, at least for the examples given, vindicated.
      </p>

      <subsection title="Future work">
        <p>
          A number of areas exist where further research can be undertaken.
          Firstly, there are a few programs which need to be written
          to complete the system.  Also, the programs need to be set in
          a context where support for the XML file format is
          found throughout the system.  Specifically, a &unix;-like operating
          system could be developed with native support for the XML file
          format.  Configuration and most other files in the system would have
          to use XML too, for the programs to prove most useful.
        </p>
        <p>
          Proper analysis of the interface to determine if users actually
          find the environment more intuitive than the traditional environment
          could be performed.  This might involve surveying users' opinions
          of the environment and developing a better human-computer interface
          for it.
        </p>
        <p>
          Finally, the model currently deals only with XML documents and not other
          file formats.  The model could be reviewed to determine if
          support for other file formats should be written into the
          implemented programs or if they should be kept solely for processing
          XML documents.  Having only one version of the <code>cat</code> program,
          for example, which could concatenate XML documents or flat text
          documents, could be advantageous.
        </p>
      </subsection>
    </section>

    <bibliography>
      <book id="c">
        <authors>
          <author>
            <surname>Kernighan</surname>
            <initials>B.W.</initials>
          </author>
          <author>
            <surname>Ritchie</surname>
            <initials>D.M.</initials>
          </author>
        </authors>
        <year>1978</year>
        <title>The C Programming Language</title>
        <publisher>Prentice Hall</publisher>
      </book>
      
      <book id="korn">
        <authors>
          <author>
            <surname>Bolsky</surname>
            <initials>M.I.</initials>
          </author>
          <author>
            <surname>Korn</surname>
            <initials>D.G.</initials>
          </author>
        </authors>
        <year>1995</year>
        <title>The New KornShell Command and Programming Language, 2nd ed.</title>
        <publisher>Prentice Hall</publisher>
        <place>Englewood Cliffs</place>
      </book>
      
      <book id="bourne">
        <authors>
          <author>
            <surname>Bourne</surname>
            <initials>S.R.</initials>
          </author>
        </authors>
        <year>1983</year>
        <title>The &unix; system</title>
        <publisher>Addison-Wesley</publisher>
        <place>Reading, Massachusetts</place>
      </book>
      
      <standard id="iso10646">
        <organisation>International Organization for Standardization</organisation>
        <year>2000</year>
        <title>Information technology &#x2014; Universal Multiple-Octet Coded Character Set (UCS) &#x2014; Part 1: Architecture and Basic Multilingual Plane</title>
        <number>ISO 10646-1:2000</number>
        <place>Geneva</place>
      </standard>

      <book id="unicode">
        <authors>
          <author>
            <surname>The Unicode Consortium</surname>
          </author>
        </authors>
        <year>2000</year>
        <title>The Unicode standard, version 3.0</title>
        <publisher>Addison-Wesley</publisher>
        <place>Reading, Massachusetts</place>
      </book>
      
      <book id="phil">
        <authors>
          <author>
            <surname>Gancarz</surname>
            <initials>M.</initials>
          </author>
        </authors>
        <year>1995</year>
        <title>The &unix; Philosophy</title>
        <publisher>Digital Press</publisher>
        <place>Boston</place>
      </book>
      
      <book-section id="bsd">
        <authors>
          <author>
            <surname>McKusick</surname>
            <initials>K.</initials>
          </author>
        </authors>
        <year>1999</year>
        <section-title>Chapter 2 &#x2014; Twenty Years of Berkeley Unix: From AT&amp;T-Owned to Freely Redistributable</section-title>
        <title>Open Sources: Voices from the Open Source Revolution</title>
        <pages>
          <first>31</first>
          <last>46</last>
        </pages>
        <editors>
          <editor>
            <surname>DiBona</surname>
            <initials>C.</initials>
          </editor>
          <editor>
            <surname>Ockman</surname>
            <initials>S.</initials>
          </editor>
          <editor>
            <surname>Stone</surname>
            <initials>M.</initials>
          </editor>
        </editors>
        <publisher>O'Reilly &amp; Associates</publisher>
        <place>Sebastopol, California</place>
      </book-section>
      
      <web id="zsh">
        <year>2001</year>
        <date>2001, Oct. 24</date>
        <title>The Z Shell Manual</title>
        <site-title>ZSH - THE Z SHELL</site-title>
        <url>http://zsh.sunsite.dk/Doc/zsh_a4.ps.gz</url>
        <authors>
          <author>
            <surname>Falstad</surname>
            <initials>P.</initials>
          </author>
        </authors>
        <accessed-date>30 July 2002</accessed-date>
      </web>
      
      <web id="scsh">
        <year>2002</year>
        <date>2002, May</date>
        <title>Scsh Reference Manual</title>
        <site-title>Scsh - The Scheme Shell</site-title>
        <url>ftp://ftp.scsh.net/pub/scsh/0.6/scsh-manual.ps.gz</url>
        <authors>
          <author>
            <surname>Shivers</surname>
            <initials>O.</initials>
          </author>
          <author>
            <surname>Carlstrom</surname>
            <initials>B.D.</initials>
          </author>
          <author>
            <surname>Gasbichler</surname>
            <initials>M.</initials>
          </author>
          <author>
            <surname>Sperber</surname>
            <initials>M.</initials>
          </author>
        </authors>
        <accessed-date>30 July 2002</accessed-date>
      </web>
      
      <web id="xinclude">
        <year>2002</year>
        <date>2002, Sep. 17</date>
        <title>XML Inclusions (XInclude)</title>
        <site-title>The World Wide Web Consortium</site-title>
        <url>http://www.w3.org/TR/2002/CR-xinclude-20020917</url>
        <authors>
          <author>
            <surname>Marsh</surname>
            <initials>J.</initials>
          </author>
          <author>
            <surname>Orchard</surname>
            <initials>D.</initials>
          </author>
        </authors>
        <accessed-date>5 Nov 2002</accessed-date>
      </web>
      
      <web id="xslt">
        <year>1999</year>
        <date>1999, Nov. 16</date>
        <title>XSL Transformations (XSLT)</title>
        <site-title>The World Wide Web Consortium</site-title>
        <url>http://www.w3.org/TR/1999/REC-xslt-19991116</url>
        <authors>
          <author>
            <surname>Clark</surname>
            <initials>J.</initials>
          </author>
        </authors>
        <accessed-date>2 May 2002</accessed-date>
      </web>
      
      <web id="xpath">
        <year>1999</year>
        <date>1999, Nov. 16</date>
        <title>XML Path Language (XPath)</title>
        <site-title>The World Wide Web Consortium</site-title>
        <url>http://www.w3.org/TR/1999/REC-xpath-19991116</url>
        <authors>
          <author>
            <surname>Clark</surname>
            <initials>J.</initials>
          </author>
          <author>
            <surname>DeRose</surname>
            <initials>S.</initials>
          </author>
        </authors>
        <accessed-date>5 Nov 2002</accessed-date>
      </web>
      
      <!--web id="soap">
        <year>2000</year>
        <date>2000, May 8</date>
        <title>Simple Object Access Protocol (SOAP) 1.1</title>
        <site-title>The World Wide Web Consortium</site-title>
        <url>http://www.w3.org/TR/2000/NOTE-SOAP-20000508</url>
        <authors>
          <author>
            <surname>Box</surname>
            <initials>D.</initials>
          </author>
          <author>
            <surname>Ehnebuske</surname>
            <initials>D.</initials>
          </author>
          <author>
            <surname>Kakivaya</surname>
            <initials>G.</initials>
          </author>
          <author>
            <surname>Layman</surname>
            <initials>A.</initials>
          </author>
        </authors>
        <accessed-date>2 May 2002</accessed-date>
      </web-->
      
      <web id="xml">
        <year>2000</year>
        <date>2000, Oct. 6</date>
        <title>Extensible Markup Language (XML) 1.0 (Second Edition)</title>
        <site-title>The World Wide Web Consortium</site-title>
        <url>http://www.w3.org/TR/REC-xml</url>
        <authors>
          <author>
            <surname>Bray</surname>
            <initials>T.</initials>
          </author>
          <author>
            <surname>Sperberg-McQueen</surname>
            <initials>C.M.</initials>
          </author>
          <author>
            <surname>Paoli</surname>
            <initials>J.</initials>
          </author>
          <author>
            <surname>Maler</surname>
            <initials>E.</initials>
          </author>
        </authors>
        <accessed-date>30 July 2002</accessed-date>
      </web>
      
      <web id="rfc1630">
        <year>1994</year>
        <date>1994, Jun.</date>
        <title>Uniform Resource Identifiers in WWW</title>
        <site-title>The Internet Engineering Taskforce</site-title>
        <authors>
          <author>
            <surname>Berners-Lee</surname>
            <initials>T.</initials>
          </author>
        </authors>
        <url>http://www.ietf.org/rfc/rfc1630.txt</url>
        <accessed-date>4 Nov 2002</accessed-date>
      </web>
      
      <!--web id="xmlwd">
        <year>1996</year>
        <date>1996, Nov. 14</date>
        <title>Extensible Markup Language (XML)</title>
        <site-title>The World Wide Web Consortium</site-title>
        <url>http://www.w3.org/TR/WD-xml-961114.html</url>
        <authors>
          <author>
            <surname>Bray</surname>
            <initials>T.</initials>
          </author>
          <author>
            <surname>Sperberg-McQueen</surname>
            <initials>C.M.</initials>
          </author>
        </authors>
        <accessed-date>2 May 2002</accessed-date>
      </web>
      
      <web id="html">
        <year>1995</year>
        <date>1995, Nov.</date>
        <title>Hypertext Markup Language &#x2014; 2.0</title>
        <site-title>The Internet Engineering Taskforce</site-title>
        <authors>
          <author>
            <surname>Berners-Lee</surname>
            <initials>T.</initials>
          </author>
          <author>
            <surname>Connolly</surname>
            <initials>D.</initials>
          </author>
        </authors>
        <url>http://www.ietf.org/rfc/rfc1866.txt</url>
        <accessed-date>2 May 2002</accessed-date>
      </web>
      
      <book id="env">
        <authors>
          <author>
            <surname>Kernighan</surname>
            <initials>B.W.</initials>
          </author>
          <author>
            <surname>Pike</surname>
            <initials>R.</initials>
          </author>
        </authors>
        <year>1984</year>
        <title>The &unix; Programming Environment</title>
        <publisher>Prentice Hall, Inc.</publisher>
        <place>Englewood Cliffs, New Jersey</place>
      </book-->
      
      <standard id="sgmlstd">
        <organisation>International Organization for Standardization</organisation>
        <year>1986</year>
        <title>Information processing &#x2014; Text and office systems &#x2014; Standard Generalized Markup Language (SGML)</title>
        <number>ISO 8879:1986</number>
        <place>Geneva</place>
      </standard>

      <book id="tools">
        <authors>
          <author>
            <surname>Kernighan</surname>
            <initials>B.W.</initials>
          </author>
          <author>
            <surname>Plauger</surname>
            <initials>P.J.</initials>
          </author>
        </authors>
        <year>1976</year>
        <title>Software Tools</title>
        <publisher>Addison-Wesley</publisher>
        <place>Reading, Massachusetts</place>
      </book>
      
      <journal-article id="unix">
        <authors>
          <author>
            <surname>Ritchie</surname>
            <initials>D.M.</initials>
          </author>
          <author>
            <surname>Thompson</surname>
            <initials>K.</initials>
          </author>
        </authors>
        <year>1974</year>
        <article-title>The &unix; time-sharing system</article-title>
        <title>Communications of the ACM</title>
        <volume>17</volume>
        <issue>7</issue>
        <pages>
          <first>365</first>
          <last>375</last>
        </pages>
      </journal-article>
      
      <journal-article id="fsh">
        <authors>
          <author>
            <surname>McDonald</surname>
            <initials>C.S.</initials>
          </author>
        </authors>
        <year>1987</year>
        <article-title>fsh &#x2014; A Functional &unix; Command Interpreter</article-title>
        <title>Software &#x2014; Practice and Experience</title>
        <volume>17</volume>
        <issue>10</issue>
        <pages>
          <first>687</first>
          <last>700</last>
        </pages>
      </journal-article>

      <journal-article id="multics">
        <authors>
          <author>
            <surname>Corbat&#xf3;</surname>
            <initials>F.J.</initials>
          </author>
          <author>
            <surname>Vyssotsky</surname>
            <initials>V.A.</initials>
          </author>
        </authors>
        <year>1965</year>
        <article-title>Introduction and Overview of the Multics System</article-title>
        <title>Proceedings of the AFIPS Fall Joint Computer Conference</title>
        <volume>27</volume>
        <issue>1</issue>
        <pages>
          <first>185</first>
          <last>196</last>
        </pages>
      </journal-article>

      <journal-article id="multics-vm">
        <authors>
          <author>
            <surname>Bensoussan</surname>
            <initials>A.</initials>
          </author>
          <author>
            <surname>Clingen</surname>
            <initials>C.T.</initials>
          </author>
          <author>
            <surname>Daley</surname>
            <initials>R.C.</initials>
          </author>
        </authors>
        <year>1972</year>
        <article-title>The Multics Virtual Memory: Concepts and Design</article-title>
        <title>Communications of the ACM</title>
        <volume>15</volume>
        <issue>5</issue>
        <pages>
          <first>308</first>
          <last>318</last>
        </pages>
      </journal-article>
      
      <journal-article id="multics-proc">
        <authors>
          <author>
            <surname>Daley</surname>
            <initials>R.C.</initials>
          </author>
          <author>
            <surname>Dennis</surname>
            <initials>J.B.</initials>
          </author>
        </authors>
        <year>1968</year>
        <article-title>Virtual memory, processes, and sharing in Multics</article-title>
        <title>Communications of the ACM</title>
        <volume>11</volume>
        <issue>5</issue>
        <pages>
          <first>306</first>
          <last>312</last>
        </pages>
      </journal-article>

      <journal-article id="multics-fs">
        <authors>
          <author>
            <surname>Daley</surname>
            <initials>R.C.</initials>
          </author>
          <author>
            <surname>Neumann</surname>
            <initials>P.G.</initials>
          </author>
        </authors>
        <year>1965</year>
        <article-title>A General Purpose File System for Secondary Storage</article-title>
        <title>Proceedings of the AFIPS Fall Joint Computer Conference</title>
        <volume>27</volume>
        <issue>1</issue>
        <pages>
          <first>213</first>
          <last>229</last>
        </pages>
      </journal-article>
      
      <journal-article id="multicsio">
        <authors>
          <author>
            <surname>Feiertag</surname>
            <initials>R.J.</initials>
          </author>
          <author>
            <surname>Organick</surname>
            <initials>E.I.</initials>
          </author>
        </authors>
        <year>1971</year>
        <article-title>The Multics input-output system</article-title>
        <title>Proceedings of the Third Symposium on Operating Systems Principles</title>
        <pages>
          <first>35</first>
          <last>41</last>
        </pages>
      </journal-article>
      
      <conference-paper id="evolution">
        <authors>
          <author>
            <surname>Ritchie</surname>
            <initials>D.M.</initials>
          </author>
        </authors>
        <year>1979</year>
        <paper-title>The Evolution of the &unix; Time-sharing System</paper-title>
        <title>Language Design and Programming Methodology</title>
        <conference-place>Sydney, Australia</conference-place>
        <conference-date>September 1979</conference-date>
        <publisher>Springer-Verlag</publisher>
      </conference-paper>
      
    </bibliography>

    <appendix title="Code">
      <p>
        The code for a few of the implementations presented in this report
        is given here.  The full code will be available from
        <code>http://www.csse.monash.edu.au/~clm/uni/honours/</code>.
      </p>
      <subsection title="ls.pl">
        <listing><![CDATA[#!/usr/bin/perl -w
# ls.pl
#
# A version of ls that outputs in XML.

use File::stat;
use Fcntl ':mode';

# Return just the filename from a pathname.
sub basename {
  my $f = shift;
  return '.' if !defined $f || $f eq '';
  return '/' if $f eq '/';
  return $f unless $f =~ m[^.*/(.*)];
  return $1;
}

# Return just the path from a pathname.
sub basepath {
  my $f = shift;
  return '.' if !defined $f || $f eq '';
  return '.' unless $f =~ m[^(.*)/];
  return '/' if $1 eq '';
  return $1;
}

sub usage {
  print <<EOF;
Usage: ls [--all] [--long] [--recursive] [FILE ...]

Options/Arguments:
  --all        Even display dotfiles.
  --long       Display detailed file listing.
  --recursive  Recurse into subdirectories.
EOF
  exit 1;
}

# Convert a permission to its symbolic form.
sub mde {
  my $m = shift;
  my $c = '';
  $c .= ($m & 4 ? 'r' : '-');
  $c .= ($m & 2 ? 'w' : '-');
  $c .= ($m & 1 ? 'x' : '-');
  return $c;
}

# Format a time just like "ls -l" does.
@month = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
sub hmt {
  my $t = shift;
  my @l = localtime($t);
  my $now = time;
  my $threshold = 6 * 30 * 24 * 60 * 60;
  if ($t < $now - $threshold || $t > $now + $threshold) {
    return sprintf('%s %2d  %d', $month[$l[4]], $l[3], $l[5] + 1900);
  } else {
    return sprintf('%s %2d %02d:%02d', $month[$l[4]], $l[3], $l[2], $l[1]);
  }
}

# Recursive function to handle a file.
sub dofile {
  my ($sb, $name, $path, $ind) = @_;
  if (defined $sb) {
    print ' ' x $ind, '<file';
    print " name=\"$name\"";
    print " path=\"$path\"";
    printf ' device="%X"', $sb->dev;
    printf ' inode="%d"', $sb->ino;
    printf ' nlink="%d"', $sb->nlink;
    printf ' uid="%d"', $sb->uid;
    printf ' gid="%d"', $sb->gid;
    printf ' rdev="%X"', $sb->rdev;
    printf ' size="%d"', $sb->size;
    printf ' blksize="%d"', $sb->blksize;
    printf ' blocks="%d"', $sb->blocks;
    $type = 'unknown';
    if (S_ISREG($sb->mode)) {
      $type = 'regular';
    } elsif (S_ISDIR($sb->mode)) {
      $type = 'directory';
    } elsif (S_ISLNK($sb->mode)) {
      $type = 'symlink';
    } elsif (S_ISBLK($sb->mode)) {
      $type = 'block';
    } elsif (S_ISCHR($sb->mode)) {
      $type = 'character';
    } elsif (S_ISFIFO($sb->mode)) {
      $type = 'fifo';
    } elsif (S_ISSOCK($sb->mode)) {
      $type = 'socket';
    } elsif (S_ISWHT($sb->mode)) {
      $type = 'whiteout';
    }
    print " type=\"$type\"";
    printf ' mode="%04o"', $sb->mode & 07777;
    printf ' atime="%d"', $sb->atime;
    printf ' mtime="%d"', $sb->mtime;
    printf " ctime=\"%d\"", $sb->ctime;

    # Look up the owner's username.
    @pw = getpwuid($sb->uid);
    if (@pw) {
      $u = $pw[0];
    } else {
      $u = $sb->uid;
    }
    print " username=\"$u\"";

    # Look up the group's name.
    @g = getgrgid($sb->gid);
    if (@g) {
      $g = $g[0];
    } else {
      $g = $sb->gid;
    }
    print " group=\"$g\"";

    # Construct the symbolic mode.
    if ($type eq 'regular') {
      $modechars = '-';
    } elsif ($type eq 'directory') {
      $modechars = 'd';
    } elsif ($type eq 'symlink') {
      $modechars = 'l';
    } elsif ($type eq 'block') {
      $modechars = 'b';
    } elsif ($type eq 'character') {
      $modechars = 'c';
    } elsif ($type eq 'fifo') {
      $modechars = 'p';
    } elsif ($type eq 'socket') {
      $modechars = 's';
    } else {
      $modechars = '?';
    }

    $modechars .= mde(($sb->mode & 0700) >> 6);
    $modechars .= mde(($sb->mode & 0070) >> 3);
    $modechars .= mde($sb->mode & 0007);

    print " modechars=\"$modechars\"";

    $hmt = hmt($sb->mtime);
    print " human-mtime=\"$hmt\"";

    $modechars = '';
    if ($rec && $type eq 'directory') {
      # Recurse.
      print ">\n";
      opendir DIR, "$path/$name";
      for (grep { $_ ne '.' && $_ ne '..' } readdir DIR) {
        dofile(lstat("$path/$name/$_"), $_, ($path eq '.' ? $name
            : "$path/$name"), $ind + 2);
      }
      closedir DIR;
      print ' ' x $ind, "</file>\n";
    } else {
      print "/>\n";
    }
  } else {
    print STDERR "  ${name}: file not found\n";
  }
}

# Parse command line.
for ($i = 0; $i <= $#ARGV; $i++) {
  if ($ARGV[$i] eq '--help' || $ARGV[$i] eq '-h') {
    usage();
  } elsif ($ARGV[$i] eq '--recursive' || $ARGV[$i] eq '-R') {
    $rec = 1;
  } elsif ($ARGV[$i] eq '--all' || $ARGV[$i] eq '-a') {
    $all = 1;
  } elsif ($ARGV[$i] eq '--long' || $ARGV[$i] eq '-l') {
    $long = 1;
  } elsif ($ARGV[$i] =~ /^-/) {
    usage();
  } else {
    push @files, $ARGV[$i];
  }
}

print "<files xmlns=\"urn:ns:clm.xml-unix.ls\"";
print " all=\"true\"" if $all;
print " long=\"true\"" if $long;
print " rec=\"true\"" if $rec;
print ">\n";

# Work on the files in the current directory if no files were given
# on the command line.
unless (@files) {
  if ($rec) {
    @files = ('.');
  } else {
    opendir DIR, '.';
    @files = readdir DIR;
    closedir DIR;
  }
}

# Handle each file.
for (@files) {
  dofile(lstat($_), basename($_), basepath($_), 2);
}

print "</files>\n";]]></listing>
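        <p>
          The symbolic mode string that <code>dofile</code> builds can also
          be sketched on its own. The following C++ fragment is only an
          illustration and is not part of the implementation: the names
          <code>triad</code> and <code>mode_chars</code> are hypothetical,
          and the file type character is assumed to be supplied by the
          caller rather than derived from the mode.
        </p>
        <listing><![CDATA[#include <string>

// Convert one three-bit permission group (0-7) to its "rwx" form.
std::string triad(unsigned m)
{
  std::string c;
  c += (m & 4) ? 'r' : '-';
  c += (m & 2) ? 'w' : '-';
  c += (m & 1) ? 'x' : '-';
  return c;
}

// Build a string such as "drwxr-xr-x" from a type character and the
// permission bits of a mode.  (Hypothetical helper, for illustration.)
std::string mode_chars(char type, unsigned mode)
{
  return std::string(1, type)
       + triad((mode & 0700) >> 6)
       + triad((mode & 0070) >> 3)
       + triad(mode & 0007);
}]]></listing>
        <p>
          For example, <code>mode_chars('d', 0755)</code> yields
          <code>drwxr-xr-x</code>.
        </p>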
      </subsection>

      <subsection title="ps.pl">
        <listing><![CDATA[#!/usr/bin/perl
# ps.pl
#
# A version of ps that outputs in XML.

use Getopt::Long qw(:config bundling);

# Well behaved fields.
@fields = qw(alarm blocked c caught cp pcpu drs dsiz egid egroup eip etime esp
    euid euser flags fgid fgroup fsgid fsgroup fsuid fsuser fuid fuser gid
    group ignored lim majflt pmem minflt nice pagein pending pgid pgrp ppid
    pri intpri priority psr rgid rgroup rss ruid ruser state sgid sgroup sid
    spid stackp suid suser svgid svgroup svuid svuser sz thcount tid time
    timeout tpgid trss tty uid user vsize nwchan);
# Fields which require special treatment.
@specialfields = qw(lstart wchan command);

# Escape value to fit in a single-quoted attribute.
sub escape {
  my $x = shift;
  $x =~ s/&/&amp;/g;
  $x =~ s/'/&apos;/g;
  $x =~ s/</&lt;/g;
  return $x;
}

sub usage {
  print <<EOF;
Usage: ps [--all] [--full] [--format field,field,...]

Options/Arguments:
  --all        Display all processes, not just those in this session.
  --full       List full details of each process.
  --format     Display given fields in output.
EOF
  exit 1;
}

# Parse command line.
for ($i = 0; $i <= $#ARGV; $i++) {
  $_ = $ARGV[$i];
  usage() if $_ eq '--help' || $_ eq '-h';
  $all = 1 if $_ eq '--all' || $_ eq '-a' || $_ eq '-e';
  $full = 1 if $_ eq '--full' || $_ eq '-f';
  if ($_ eq '--format' || $_ eq '-o') {
    $i++;
    $format = $ARGV[$i];
  }
}

$fieldlist = 'pid';
for (@fields) {
  $fieldlist .= ",$_";
}

# Run the system ps
%p = ();
open PS, "/bin/ps --no-header -ewo $fieldlist |";
for (<PS>) {
  s/^\s*//;
  @f = split /\s+/;
  $n = 0;
  $pid = shift @f;
  # Extract the fields from the output
  $p{$pid} = {};
  for (@f) {
    $p{$pid}{$fields[$n]} = $_;
    $n++;
  }
}

# Run the system ps again for the badly behaved fields
for $field (@specialfields) {
  open PS, "/bin/ps --no-header -ewo pid,$field |";
  for (<PS>) {
    # Extract that field from the output
    /(\d+)\s+(.*)/;
    $p{$1}{$field} = $2;
  }
}

close PS;

print "<processes xmlns=\"urn:ns:clm.xml-unix.ps\"";
print " sid=\"", $p{$$}{sid}, "\"";
print " all=\"true\"" if $all;
print " full=\"true\"" if $full;
print " format=\"$format\"" if $format ne '';
print ">\n";
# Dump processes in order of PID
for (sort { $a <=> $b } keys %p) {
  $proc = $p{$_};
  print "  <process pid=\"$_\"";
  for (keys %$proc) {
    print " $_='", escape($proc->{$_}), "'";
  }
  print "/>\n";
}
print "</processes>\n";]]></listing>
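        <p>
          The escaping needed before a value is written into a
          single-quoted attribute can be sketched separately. The C++
          function below is an illustrative assumption, not code from the
          implementation, and the name <code>escape_attr</code> is
          hypothetical. It replaces the ampersand, the less-than sign and
          the apostrophe with their predefined XML entities.
        </p>
        <listing><![CDATA[#include <string>

// Escape a value so that it can be placed inside a single-quoted XML
// attribute.  (Hypothetical helper, for illustration.)
std::string escape_attr(const std::string& in)
{
  std::string out;
  for (std::string::size_type i = 0; i < in.size(); i++)
  {
    switch (in[i])
    {
      case '&':  out += "&amp;";  break;
      case '<':  out += "&lt;";   break;
      case '\'': out += "&apos;"; break;
      default:   out += in[i];
    }
  }
  return out;
}]]></listing>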
      </subsection>

      <subsection title="cat.cpp">
        <listing><![CDATA[#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

#include <libxml/parser.h>
#include <libxml/xpath.h>

using std::cerr;
using std::endl;
using std::string;
using std::vector;

const string DEFAULT_DEST_XPATH = "/*";
const string DEFAULT_SOURCE_XPATH = "/*/node()";

// Class to encapsulate command line arguments.
class Args
{
  int _argc;
  vector<string> _args;
public:
  typedef vector<string>::iterator iterator;
  typedef vector<string>::const_iterator const_iterator;
  
  Args(int argc, char *argv[])
    : _argc(argc)
  {
    _args.resize(argc);
    for (int i = 0; i < argc; i++)
      _args[i] = argv[i];
  }

  const string& prog_name() const
  {
    return _args[0];
  }

  iterator begin()
  {
    return _args.begin();
  }

  iterator end()
  {
    return _args.end();
  }
};

// Source document filename and XPath
struct Path
{
  string _file;
  string _xpath;
};

bool is_option(const string& s)
{
  return s[0] == '-';
}

void usage()
{
  cerr << "Usage: cat [--format] [--encoding enc] [[--to xpath] destdoc]\n"
      "           [[--from xpath] sourcedoc ...]\n"
      "\n"
      "Options/Arguments:\n"
      "\t--format        adds whitespace to the output document to indent it.\n"
      "\t--encoding enc  writes the output document in the given encoding.\n"
      "\tdestdoc         XML document to concatenate on to.\n"
      "\t--to xpath      XPath expression indicating where in destdoc\n"
      "\t                the concatenated nodes should be inserted.\n"
      "\tsourcedoc       XML document to be concatenated on to destdoc.\n"
      "\t--from xpath    XPath expression indicating which nodes in\n"
      "\t                sourcedoc should be used.\n"
      "\n"
      "The source XPath expression defaults to \"/*/node()\".\n"
      "The destination XPath expression defaults to \"/*\".\n"
      ;
  exit(1);
}

void parse_commandline(Args& args, Path& dest, vector<Path>& source,
    bool& format, string& encoding)
{
  int state = -3;
  Args::const_iterator i = args.begin() + 1;
  Path next_source;
  while (i != args.end())
  {
    if (*i == "--help" || *i == "-h")
      usage();

    switch (state)
    {
      case -3:
        if (*i == "--format" || *i == "-f")
        {
          format = true;
          i++;
        }
        state = -2;
        break;

      case -2:
        if (*i == "--encoding" || *i == "-e")
        {
          state = -1;
          i++;
        }
        else
          state = 0;
        break;

      case -1:
        if (is_option(*i))
          usage();
        encoding = *i;
        state = 0;
        i++;
        break;

      case 0:
        if (*i == "--to" || *i == "-t")
        {
          state = 1;
          i++;
        }
        else if (*i == "--from" || *i == "-f")
        {
          state = 4;
          next_source._xpath = DEFAULT_SOURCE_XPATH;
          i++;
        }
        else
          state = 2;
        break;
        
      case 1:
        if (is_option(*i))
          usage();
        dest._xpath = *i;
        state = 2;
        i++;
        break;

      case 2:
        if (*i == "--from" || *i == "-f")
        {
          state = 4;
          next_source._xpath = DEFAULT_SOURCE_XPATH;
          i++;
        }
        else
        {
          // Anything else here must be the destination document
          if (is_option(*i))
            usage();
          dest._file = *i;
          state = 3;
          i++;
        }
        break;

      case 3:
        next_source._xpath = DEFAULT_SOURCE_XPATH;
        if (*i == "--from" || *i == "-f")
        {
          state = 4;
          i++;
        }
        else
          state = 5;
        break;

      case 4:
        if (is_option(*i))
          usage();
        next_source._xpath = *i;
        state = 5;
        i++;
        break;

      case 5:
        if (is_option(*i))
          usage();
        next_source._file = *i;
        source.push_back(next_source);
        state = 3;
        i++;
        break;
    }
  }

  if (state == -1 || state == 1 || state == 4 || state == 5)
    usage();
}

void error(const string& msg)
{
  cerr << "cat: " << msg << endl;
  exit(2);
}

int main(int argc, char *argv[])
{
  Args args(argc, argv);

  Path dest;
  vector<Path> source;
  bool format = false;
  string encoding;

  dest._file = "-";
  dest._xpath = DEFAULT_DEST_XPATH;

  parse_commandline(args, dest, source, format, encoding);

  xmlSubstituteEntitiesDefault(1);
  xmlLoadExtDtdDefaultValue = 1;

  // Read in the destination document
  xmlDocPtr xml_dest = xmlParseFile(dest._file.c_str());
  if (xml_dest == NULL)
    error("Could not parse destination document");

  // Evaluate the XPath expression to find the destination document's
  // insertion point
  xmlXPathContextPtr xml_dest_context = xmlXPathNewContext(xml_dest);

  xmlXPathObjectPtr insertion_point_nodeset = xmlXPathEval(
      (const xmlChar*) dest._xpath.c_str(), xml_dest_context);
  if (insertion_point_nodeset == NULL)
    error("Destination XPath must be valid");
  if (insertion_point_nodeset->type != XPATH_NODESET)
    error("Destination XPath must evaluate to a nodeset");
  if (insertion_point_nodeset->nodesetval == 0 ||
    insertion_point_nodeset->nodesetval->nodeNr == 0)
    error("Destination XPath must evaluate to a nodeset with at least one node");
  xmlNodePtr insertion_point = insertion_point_nodeset->nodesetval->nodeTab[0];

  xmlXPathFreeObject(insertion_point_nodeset);
  xmlXPathFreeContext(xml_dest_context);

  // Insert each source document's nodes into the destination document at the
  // insertion point
  for (vector<Path>::const_iterator i = source.begin(); i != source.end(); i++)
  {
    // Read in the source document
    xmlDocPtr xml_source = xmlParseFile(i->_file.c_str());
    if (xml_source == NULL)
      error("Could not parse source document");

    // Evaluate the XPath expression to find the source document's nodes
    xmlXPathContextPtr xml_source_context = xmlXPathNewContext(xml_source);

    xmlXPathObjectPtr source_nodeset = xmlXPathEval(
        (const xmlChar*) i->_xpath.c_str(), xml_source_context);
    if (source_nodeset == NULL)
      error("Source XPath must be valid");
    if (source_nodeset->type != XPATH_NODESET)
      error("Source XPath must evaluate to a nodeset");

    // Loop through each node in the nodeset and add them to the
    // destination document (nodesetval may be NULL for an empty nodeset)
    if (source_nodeset->nodesetval != NULL)
      for (int j = 0; j < source_nodeset->nodesetval->nodeNr; j++)
        xmlAddChild(insertion_point, xmlCopyNode(
            source_nodeset->nodesetval->nodeTab[j], 1));

    xmlXPathFreeObject(source_nodeset);
    xmlXPathFreeContext(xml_source_context);
    xmlFreeDoc(xml_source);
  }

  int save_ret;
  if (encoding == "")
    save_ret = xmlSaveFormatFile("-", xml_dest, format ? 1 : 0);
  else
    save_ret = xmlSaveFormatFileEnc("-", xml_dest, encoding.c_str(),
        format ? 1 : 0);

  if (save_ret < 0)
    error("Error writing output document");

  xmlFreeDoc(xml_dest);
  xmlCleanupParser();
}]]></listing>
      </subsection>

      <subsection title="sort.cpp">
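        <listing><![CDATA[#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

#include <libxml/parser.h>
#include <libxml/xpath.h>

using std::cerr;
using std::endl;
using std::min;
using std::string;
using std::vector;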

const string DEFAULT_CONTEXT_XPATH = "/*/*";
const string DEFAULT_NODE_XPATH = ".//text()";

// Class to encapsulate command line arguments.
class Args
{
  int _argc;
  vector<string> _args;
public:
  typedef vector<string>::iterator iterator;
  typedef vector<string>::const_iterator const_iterator;
  
  Args(int argc, char *argv[])
    : _argc(argc)
  {
    _args.resize(argc);
    for (int i = 0; i < argc; i++)
      _args[i] = argv[i];
  }

  const string& prog_name() const
  {
    return _args[0];
  }

  iterator begin()
  {
    return _args.begin();
  }

  iterator end()
  {
    return _args.end();
  }
};

bool is_option(const string& s)
{
  return s[0] == '-';
}

void usage()
{
  cerr << "Usage: sort [--context xpath] [--node xpath | --expr xpathexpr]\n"
      "            [--reverse] [--numeric] [sourcedoc ...]\n"
      "\n"
      "Options/Arguments:\n"
      "\t--context xpath  XPath location indicating the context of the \"match\"\n"
      "\t                 path.\n"
      "\t--node xpath     XPath location indicating which nodes in the context\n"
      "\t                 nodes should be used for sorting.\n"
      "\t--expr xpathexpr XPath expression evaluated in the given context to\n"
      "\t                 be used for sorting.\n"
      "\t--reverse        sorts nodes in reverse order.\n"
      "\t--numeric        indicates that sorting should be performed numerically, not\n"
      "\t                 lexicographically.\n"
      "\tsourcedoc        XML document to be sorted.\n"
      "\n"
      "The context XPath location defaults to \"/*/*\".\n"
      "The node XPath location defaults to \".//text()\".\n"
      ;
  exit(1);
}

void parse_commandline(Args& args, vector<string>& source, string& context,
    string& node, string& expr, bool& numeric, bool& reverse)
{
  int state = 0;
  Args::const_iterator i = args.begin() + 1;
  
  node = "";
  expr = "";

  while (i != args.end())
  {
    if (*i == "--help" || *i == "-h")
      usage();

    switch (state)
    {
      case 0:
        if (*i == "--context" || *i == "-c")
        {
          state = 1;
          i++;
        }
        else if (*i == "--node" || *i == "-n")
        {
          state = 2;
          i++;
        }
        else if (*i == "--expr" || *i == "-e")
        {
          state = 3;
          i++;
        }
        else if (*i == "--reverse" || *i == "-r")
        {
          reverse = true;
          i++;
        }
        else if (*i == "--numeric" || *i == "-N")
        {
          numeric = true;
          i++;
        }
        else
        {
          if (is_option(*i))
            usage();
          state = 4;
        }
        break;

      case 1:
        if (is_option(*i))
          usage();
        context = *i;
        state = 0;
        i++;
        break;

      case 2:
        if (is_option(*i) || !expr.empty())
          usage();
        node = *i;
        state = 0;
        i++;
        break;

      case 3:
        if (is_option(*i) || !node.empty())
          usage();
        expr = *i;
        state = 0;
        i++;
        break;
        
      case 4:
        if (is_option(*i))
          usage();
        source.push_back(*i);
        i++;
        break;
    }
  }

  if (state == 1 || state == 2 || state == 3)
    usage();
}

void error(const string& msg)
{
  cerr << "sort: " << msg << endl;
  exit(2);
}

string context = DEFAULT_CONTEXT_XPATH;
string node;
string expr;
bool numeric = false;
bool reverse = false;

// Compare two nodes in a document.
bool lt(xmlNodePtr a, xmlNodePtr b, xmlXPathContextPtr c)
{
  if (!node.empty())
  {
    // --node used on command line
    
    // Evaluate node in the context of a
    c->node = a;
    xmlXPathObjectPtr an = xmlXPathEval((xmlChar*) node.c_str(), c);
    // Evaluate node in the context of b
    c->node = b;
    xmlXPathObjectPtr bn = xmlXPathEval((xmlChar*) node.c_str(), c);
    // Loop over the nodes in the result nodesets until a difference is found
    int r = 0;
    if (an && bn && an->type == XPATH_NODESET && bn->type == XPATH_NODESET
        && an->nodesetval && bn->nodesetval)
      for (int i = 0; r == 0
          && i < min(an->nodesetval->nodeNr, bn->nodesetval->nodeNr); i++)
      {
        // Cast the nodes to a string
        xmlChar* s1 = xmlXPathCastNodeToString(an->nodesetval->nodeTab[i]);
        xmlChar* s2 = xmlXPathCastNodeToString(bn->nodesetval->nodeTab[i]);
        // Compare the strings
        r = xmlStrcmp(s1, s2);
        xmlFree(s1);
        xmlFree(s2);
      }
    // Free the result objects before returning
    xmlXPathFreeObject(an);
    xmlXPathFreeObject(bn);
    if (r < 0)
      return !reverse;
    else if (r > 0)
      return reverse;
    return false;
  }
  else
  {
    // --expr used on command line

    // Evaluate expr in the context of a
    c->node = a;
    xmlXPathObjectPtr an = xmlXPathEvalExpression((xmlChar*) expr.c_str(), c);
    // Evaluate expr in the context of b
    c->node = b;
    xmlXPathObjectPtr bn = xmlXPathEvalExpression((xmlChar*) expr.c_str(), c);
    double r;
    if (numeric)
    {
      // Compare the two nodes numerically
      double d1 = xmlXPathCastToNumber(an);
      double d2 = xmlXPathCastToNumber(bn);
      r = d1 - d2;
    }
    else
    {
      // Compare the two nodes lexicographically
      xmlChar* s1 = xmlXPathCastToString(an);
      xmlChar* s2 = xmlXPathCastToString(bn);
      r = xmlStrcmp(s1, s2);
      xmlFree(s1);
      xmlFree(s2);
    }
    xmlXPathFreeObject(an);
    xmlXPathFreeObject(bn);
    // Reverse the sort order if required; equal keys are never "less than"
    if (r < 0)
      return !reverse;
    else if (r > 0)
      return reverse;
    return false;
  }
}

int main(int argc, char* argv[])
{
  Args args(argc, argv);

  vector<string> source;

  parse_commandline(args, source, context, node, expr, numeric, reverse);

  if (node.empty() && expr.empty())
    node = DEFAULT_NODE_XPATH;

  xmlSubstituteEntitiesDefault(1);
  xmlLoadExtDtdDefaultValue = 1;

  if (source.empty())
    source.push_back("-");

  // First thing to do is cat the source documents together.
  // If the default cat behaviour is unacceptable, the user should cat the
  // documents manually before running sort.

  // Read in the destination document
  xmlDocPtr xml_dest = xmlParseFile(source[0].c_str());
  if (xml_dest == NULL)
    error("Could not parse source document");

  // Evaluate the XPath expression to find the destination document's
  // insertion point
  xmlXPathContextPtr xml_dest_context = xmlXPathNewContext(xml_dest);

  xmlXPathObjectPtr insertion_point_nodeset = xmlXPathEval(
      (const xmlChar*) "/*", xml_dest_context);
  if (insertion_point_nodeset == NULL)
    error("Destination XPath must be valid");
  if (insertion_point_nodeset->type != XPATH_NODESET)
    error("Destination XPath must evaluate to a nodeset");
  if (insertion_point_nodeset->nodesetval == 0 ||
    insertion_point_nodeset->nodesetval->nodeNr == 0)
    error("Destination XPath must evaluate to a nodeset with at least one node");
  xmlNodePtr insertion_point = insertion_point_nodeset->nodesetval->nodeTab[0];

  xmlXPathFreeObject(insertion_point_nodeset);
  xmlXPathFreeContext(xml_dest_context);

  // Insert each source document's nodes into the destination document at the
  // insertion point
  for (vector<string>::const_iterator i = source.begin() + 1; i != source.end(); i++)
  {
    // Read in the source document
    xmlDocPtr xml_source = xmlParseFile(i->c_str());

    // Evaluate the XPath expression to find the source document's nodes
    xmlXPathContextPtr xml_source_context = xmlXPathNewContext(xml_source);

    xmlXPathObjectPtr source_nodeset = xmlXPathEval(
        (const xmlChar*) "/*/node()", xml_source_context);
    if (source_nodeset == NULL)
      error("Source XPath must be valid");
    if (source_nodeset->type != XPATH_NODESET)
      error("Source XPath must evaluate to a nodeset");

    // Loop through each node in the nodeset and add them to the destination
    // document
    for (int j = 0; j < source_nodeset->nodesetval->nodeNr; j++)
      xmlAddChild(insertion_point, xmlCopyNode(
          source_nodeset->nodesetval->nodeTab[j], 1));

    xmlXPathFreeObject(source_nodeset);
    xmlXPathFreeContext(xml_source_context);
    xmlFreeDoc(xml_source);
  }

  // Now that the documents have been concatenated, the sorting can be done
  
  // Evaluate the context XPath location
  xmlXPathContextPtr xml_dest_ctx = xmlXPathNewContext(xml_dest);
  xmlXPathObjectPtr context_nodeset = xmlXPathEval(
      (const xmlChar*) context.c_str(), xml_dest_ctx);
  
  // Make sure all context nodes have the same parent
  xmlNodePtr parent;
  int num = (context_nodeset && context_nodeset->nodesetval)
      ? context_nodeset->nodesetval->nodeNr : 0;
  if (num > 0)
  {
    parent = context_nodeset->nodesetval->nodeTab[0]->parent;
    for (int i = 1; i < num; i++)
      if (parent != context_nodeset->nodesetval->nodeTab[i]->parent)
        error("Context nodes must all have the same parent");
    
    // Collect the parent's children in document order, remembering which
    // slots are occupied by context nodes.  This relies on the nodeset
    // being in document order.
    vector<xmlNodePtr> kids;
    vector<int> slot;
    vector<xmlNodePtr> pos;
    for (xmlNodePtr n = parent->children; n; n = n->next)
    {
      if ((int) pos.size() < num
          && context_nodeset->nodesetval->nodeTab[pos.size()] == n)
      {
        slot.push_back(kids.size());
        pos.push_back(n);
      }
      kids.push_back(n);
    }
    num = pos.size();

    // Insertion sort (stable as long as lt is false for equal keys)
    for (int i = 1; i < num; i++)
      for (int j = i; j > 0 && lt(pos[j], pos[j - 1], xml_dest_ctx); j--)
      {
        // swap
        xmlNodePtr no = pos[j];
        pos[j] = pos[j - 1];
        pos[j - 1] = no;
      }

    // Put the sorted context nodes back into their slots and relink the
    // sibling list, leaving all other children where they were.
    for (int i = 0; i < num; i++)
      kids[slot[i]] = pos[i];
    for (int i = 0; i < (int) kids.size(); i++)
    {
      kids[i]->prev = (i > 0 ? kids[i - 1] : NULL);
      kids[i]->next = (i + 1 < (int) kids.size() ? kids[i + 1] : NULL);
    }
    if (!kids.empty())
    {
      parent->children = kids.front();
      parent->last = kids.back();
    }

    // Write result document
    xmlSaveFormatFile("-", xml_dest, 0);
  }

  xmlFreeDoc(xml_dest);
  xmlCleanupParser();
}]]></listing>
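        <p>
          The idea of reordering only the context nodes, while every other
          sibling keeps its position, can be sketched independently of
          libxml. The C++ fragment below is an illustrative assumption and
          not part of the implementation: it works on plain strings, and
          the names <code>KeyLess</code> and <code>sort_slots</code> are
          hypothetical. The comparison never reports equal keys as less
          than one another, which lets a stable sort keep their original
          order.
        </p>
        <listing><![CDATA[#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Strict "less than" that honours a reverse flag.  Equal keys never
// compare as less, so a stable sort preserves their relative order.
struct KeyLess
{
  bool reverse;
  explicit KeyLess(bool r) : reverse(r) {}
  bool operator()(const std::string& a, const std::string& b) const
  {
    return reverse ? b < a : a < b;
  }
};

// Sort only the items found at the given slots (ascending indices),
// leaving every other item where it was.
void sort_slots(std::vector<std::string>& items,
                const std::vector<std::size_t>& slots, bool reverse)
{
  std::vector<std::string> picked;
  for (std::size_t k = 0; k < slots.size(); k++)
    picked.push_back(items[slots[k]]);
  std::stable_sort(picked.begin(), picked.end(), KeyLess(reverse));
  for (std::size_t k = 0; k < slots.size(); k++)
    items[slots[k]] = picked[k];
}]]></listing>
        <p>
          In this sketch the selected strings stand in for the context
          nodes, and the unselected strings stand in for siblings such as
          whitespace text nodes between elements.
        </p>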
      </subsection>
    </appendix>

  </sections>
</document>

