<?xml version="1.0"?>

<!DOCTYPE document [
<!ENTITY unix "U<sc>nix</sc>">
]>

<document>
  <meta>
    <title>Structured process input/output</title>
    <subtitle>Reformulating the &unix; command line environment to generate and process XML documents</subtitle>
    <student-name>Cameron McCormack</student-name>
    <student-id>12793086</student-id>
    <supervisor>Prof. John Hurst</supervisor>
  </meta>

  <sections>
    
    <section title="Introduction">
      <p>
        One of the strengths of the &unix; operating system <cite ref="unix"/> is its philosophy of having
        many small, simple tools which perform specific tasks well.  Users can compose commands to perform
        more complex computations by combining these small tools at the command line, by way of a pipeline.
        This is especially true of text processing, where each tool can act as a "filter", consuming some
        text from its input and producing a modified version of the text on its output.
      </p>

      <p>
        Back in the early days of &unix;, text processing typically consisted of manipulating flat text files.
        This data file format is simply a collection of ASCII-based records, separated by newline characters.
        The standard text processing tools, such as <code>grep</code>, <code>awk</code> and <code>sed</code>,
        all focus on this flat text file format.
      </p>

      <p>
        There are, however, problems with dealing solely with flat text.  Firstly, not all data
        easily transform into this format; kludges must be made to fit the data into records.
        As an example, the hierarchical structure of a directory tree would benefit from a
        textual representation that could accommodate the nesting of directory entries in
        other directory entries.
      </p>

      <p>
        Secondly, some programs ignore the inherent record-based nature of their output and instead format it
        to be friendly for the user.  Processing such output requires text filtering programs whose sole
        purpose is to extract the relevant information from the formatted text.  A prime example of this is
        &unix;'s <code>ls -l</code> command.  The output from this command contains a date field whose
        format depends on how far the date lies from the current date.  Such conditional formatting
        makes it difficult to parse the information consistently.
      </p>
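
      <p>
        As a rough illustration of this difficulty, the following Python sketch parses the date field of two
        long-format lines.  The sample lines, the user and file names, and the function
        <code>parse_ls_line</code> are hypothetical.  The sketch assumes the traditional POSIX-locale layout,
        in which recently modified files show a time of day and older files show a year in the same column.
      </p>

```python
import re

# Two hypothetical lines in the style of long-format ls output (POSIX locale).
# Recently modified files show a time of day; older files show a year instead.
RECENT = "-rw-r--r--  1 cam  users  1024 Apr 29 14:07 proposal.xml"
OLD = "-rw-r--r--  1 cam  users  2048 Nov 16  1999 notes.txt"

# Matches the time-of-day form of the last date column, such as 14:07.
TIME_FORM = re.compile(r"^\d{1,2}:\d{2}$")


def parse_ls_line(line):
    """Split one long-format line into a file name and a date description.

    Assumes exactly eight whitespace-separated fields before the file
    name (so no spaces in owner or group names); other ls variants and
    locales may lay the line out differently.
    """
    fields = line.split(None, 8)
    month, day, last = fields[5], fields[6], fields[7]
    if TIME_FORM.match(last):
        # Recent file: the year is implied and must be inferred by the reader.
        date = {"month": month, "day": int(day), "time": last}
    else:
        # Older file: the time of day has been dropped in favour of the year.
        date = {"month": month, "day": int(day), "year": int(last)}
    return {"name": fields[8], "date": date}


if __name__ == "__main__":
    for line in (RECENT, OLD):
        print(parse_ls_line(line))
```

      <p>
        Even this small parser needs a special case for the date column, and neither branch recovers a
        complete timestamp: one lacks the year and the other lacks the time of day.
      </p>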

      <p>
        This project aims to look at the use of structured text for standard program input and output, in particular
        the Extensible Markup Language (XML), as a means of solving these problems.  Following on from this, an
        environment in which to run these programs will also be devised.
      </p>

    </section>

    <section title="Research Context">
      <p>
        The idea of shifting the focus of &unix; process I/O from flat text to a more structured format has
        received little investigation.  Kernighan &amp; Plauger (1976) introduced the idea of the "software tool":
        the small program which performs its task, and only its task.  Descriptions of the standard &unix; text
        processing tools, and of how to combine them using the shell, are presented <cite ref="env"/>.  However, all of
        these tools are designed for flat text processing.
      </p>

      <p>
        The beginnings of structured, marked up document processing lie in the Standard Generalized Markup Language (SGML)
        <cite ref="sgmlstd"/>.  SGML found itself a following in the publishing industry, as well as a killer application
        in the form of the Hypertext Markup Language (HTML) <cite ref="html"/>.  It was not until the formulation
        of a simplified subset of SGML, the Extensible Markup Language (XML), that a standard, structured
        document format was used for general computer interchange of information <cite ref="xmlwd"/>.
      </p>

      <p>
        Recently, XML has been proposed as the data format for many tasks, such as Remote Procedure Call (RPC)
        messages <cite ref="soap"/>; however, XML has not yet been seriously considered for general process I/O.
      </p>
    </section>

    <section title="Research Plan and Methods">
      <subsection title="Methodology">
        <p>
          For the XML processing environment to be successful, there must be three main components:
          tools that generate XML documents, tools that process and filter XML documents, and
          an interface to compose these tools.  The project will look at each of these three components
          in turn.
        </p>

        <p>
          The first stage of the project will involve identifying the &unix; tools which are text producers,
          and adapting these tools to produce XML output instead of flat text.  Two important text producing
          tools will be studied: <code>ls</code> and <code>ps</code>.  The techniques for converting these
          programs to generate XML will be discussed and then applied to the other text producing tools.
        </p>
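
        <p>
          To give a feel for what an XML-producing directory lister might emit, here is a minimal Python
          sketch.  It is not the design proposed here: the function <code>listing_element</code> and the
          element and attribute names (<code>directory</code>, <code>file</code>, <code>name</code>,
          <code>size</code>, <code>modified</code>) are invented purely for illustration.
        </p>

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def listing_element(path):
    """Build a directory element describing the entries found in path.

    The element and attribute names used here are assumptions made for
    this sketch, not a proposed schema.
    """
    name = os.path.basename(os.path.abspath(path))
    root = ET.Element("directory", name=name)
    for entry in sorted(os.scandir(path), key=lambda e: e.name):
        if entry.is_dir(follow_symlinks=False):
            # Nested directories become nested elements.
            root.append(listing_element(entry.path))
        else:
            info = entry.stat(follow_symlinks=False)
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            ET.SubElement(
                root,
                "file",
                name=entry.name,
                size=str(info.st_size),
                # One fixed timestamp form, whatever the age of the file.
                modified=modified.strftime("%Y-%m-%dT%H:%M:%SZ"),
            )
    return root


if __name__ == "__main__":
    # Build a small throwaway tree so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as top:
        os.mkdir(os.path.join(top, "src"))
        with open(os.path.join(top, "src", "main.c"), "w") as handle:
            handle.write("int main(void) { return 0; }\n")
        print(ET.tostring(listing_element(top), encoding="unicode"))
```

        <p>
          In this sketch the nesting of directories is carried by the nesting of elements, and every
          timestamp is written in a single fixed form, so a consumer needs no special cases to read it.
        </p>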

        <p>
          The second stage will look at the range of standard text filtering tools available to the &unix;
          user.  Two commonly used filters, <code>grep</code> and <code>sed</code>, will be examined in detail.
          Again, versions of these tools which use XML documents as their input and output will
          be developed.
        </p>
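
        <p>
          As one possible reading of what a <code>grep</code>-like filter over XML could do, the Python
          sketch below prunes a document so that only the elements whose attribute matches a regular
          expression remain.  The function names <code>xml_grep</code> and <code>sample_listing</code>, and
          the <code>directory</code> and <code>file</code> vocabulary, are hypothetical and are used only to
          make the example self-contained.
        </p>

```python
import copy
import re
import xml.etree.ElementTree as ET


def xml_grep(element, pattern, tag="file", attribute="name"):
    """Return a pruned copy of element.

    Child elements called tag are kept only when their attribute
    matches pattern; every other child is kept and searched in turn.
    Text between elements is not preserved in this sketch.
    """
    regex = re.compile(pattern)
    result = ET.Element(element.tag, dict(element.attrib))
    for child in element:
        if child.tag == tag:
            if regex.search(child.get(attribute, "")):
                result.append(copy.deepcopy(child))
        else:
            result.append(xml_grep(child, pattern, tag, attribute))
    return result


def sample_listing():
    """Build a small hypothetical directory listing to filter."""
    listing = ET.Element("directory", name="project")
    ET.SubElement(listing, "file", name="main.c")
    ET.SubElement(listing, "file", name="README")
    docs = ET.SubElement(listing, "directory", name="docs")
    ET.SubElement(docs, "file", name="util.c")
    ET.SubElement(docs, "file", name="guide.txt")
    return listing


if __name__ == "__main__":
    # Keep only the files whose name ends in .c
    filtered = xml_grep(sample_listing(), r"\.c$")
    print(ET.tostring(filtered, encoding="unicode"))
```

        <p>
          Unlike a line-oriented filter, the result is still a complete document: the enclosing
          <code>directory</code> elements survive, so a later tool in the pipeline can tell where each
          matching entry came from.
        </p>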
        
        <p>
          Lastly, a shell to handle the composition, execution and output presentation of these XML tools will
          be devised.  The emphasis for this shell will be on connecting the XML tools together using
          pipes, as the regular &unix; shell does, and on providing a way of automatically converting
          the output of these tools to a form suitable for display to the user.  The composition of processes
          using pipes will not be substantially different from the Bourne shell; however, issues of
          document metadata will be considered.  The formatting of process output documents will be handled
          using XSL Transformations (XSLT) <cite ref="xslt"/>.
        </p>
        
        <p>
          The overall XML tool environment is expected to benefit the user, mostly through
          programs whose output makes it easy to extract relevant information.
        </p>
      </subsection>

      <subsection title="Proposed Thesis Section Headings">
        <!--p>
          These are the tentative thesis section headings:
        </p-->
        <ol>
          <li>Introduction</li>
          <li>The state of &unix; tools</li>
          <li>Text producing tools</li>
          <li>Text processing and filtering tools</li>
          <li>A user interface for the XML tools</li>
          <li>Conclusions and future work</li>
          <li>Bibliography</li>
          <li>Appendix A: Software developed</li>
        </ol>
      </subsection>

      <subsection title="Timetable">
        <table>
          <col width="4cm"/>
          <col width="10cm"/>
          <tr type="header">
            <td>Date</td>
            <td>Task</td>
          </tr>
          <tr>
            <td>March 25</td>
            <td>Begin reading literature</td>
          </tr>
          <tr>
            <td>April 1</td>
            <td>Begin coding for stage 1</td>
          </tr>
          <tr>
            <td>April 15</td>
            <td>Start writing research proposal</td>
          </tr>
          <tr>
            <td>April 29</td>
            <td>Show draft proposal to supervisor</td>
          </tr>
          <tr>
            <td>May 2</td>
            <td>Submit research proposal</td>
          </tr>
          <tr>
            <td>May 13</td>
            <td>Prepare for interim presentation</td>
          </tr>
          <tr>
            <td>May 20</td>
            <td>Begin literature review</td>
          </tr>
          <tr>
            <td>May 23</td>
            <td>Interim presentation</td>
          </tr>
          <tr>
            <td>May 27</td>
            <td>Begin coding for stage 2</td>
          </tr>
          <tr>
            <td>June 13</td>
            <td>Submit literature review draft</td>
          </tr>
          <tr>
            <td>July 8</td>
            <td>Begin coding for stage 3</td>
          </tr>
          <tr>
            <td>August 1</td>
            <td>Submit literature review</td>
          </tr>
          <tr>
            <td>August 22</td>
            <td>Finalise coding, begin thesis</td>
          </tr>
          <tr>
            <td>September 12</td>
            <td>Submit thesis draft</td>
          </tr>
          <tr>
            <td>October 10</td>
            <td>Prepare final presentation</td>
          </tr>
          <tr>
            <td>October 17</td>
            <td>Final presentation</td>
          </tr>
          <tr>
            <td>November 1</td>
            <td>Submit research log book</td>
          </tr>
          <tr>
            <td>November 5</td>
            <td>Submit final thesis</td>
          </tr>
          <tr>
            <td>November 11</td>
            <td>Finalise project web site</td>
          </tr>
        </table>
      </subsection>

      <subsection title="Special Facilities">
        <p>
          No facilities besides those made available to all honours students at
          the School of Computer Science and Software Engineering at Monash
          University will be necessary for the completion of the research
          as described in this proposal.
        </p>
      </subsection>
    </section>

    <bibliography>
      <web id="xslt">
        <year>1999</year>
        <date>1999, Nov. 16</date>
        <title>XSL Transformations (XSLT)</title>
        <site-title>The World Wide Web Consortium</site-title>
        <url>http://www.w3.org/TR/1999/REC-xslt-19991116</url>
        <authors>
          <author>
            <surname>Clark</surname>
            <initials>J.</initials>
          </author>
        </authors>
        <accessed-date>2 May 2002</accessed-date>
      </web>
      
      <web id="soap">
        <year>2000</year>
        <date>2000, May 8</date>
        <title>Simple Object Access Protocol (SOAP) 1.1</title>
        <site-title>The World Wide Web Consortium</site-title>
        <url>http://www.w3.org/TR/2000/NOTE-SOAP-20000508</url>
        <authors>
          <author>
            <surname>Box</surname>
            <initials>D.</initials>
          </author>
          <author>
            <surname>Ehnebuske</surname>
            <initials>D.</initials>
          </author>
          <author>
            <surname>Kakivaya</surname>
            <initials>G.</initials>
          </author>
          <author>
            <surname>Layman</surname>
            <initials>A.</initials>
          </author>
        </authors>
        <accessed-date>2 May 2002</accessed-date>
      </web>
      
      <web id="xmlwd">
        <year>1996</year>
        <date>1996, Nov. 14</date>
        <title>Extensible Markup Language (XML)</title>
        <site-title>The World Wide Web Consortium</site-title>
        <url>http://www.w3.org/TR/WD-xml-961114.html</url>
        <authors>
          <author>
            <surname>Bray</surname>
            <initials>T.</initials>
          </author>
          <author>
            <surname>Sperberg-McQueen</surname>
            <initials>C.M.</initials>
          </author>
        </authors>
        <accessed-date>2 May 2002</accessed-date>
      </web>
      
      <web id="html">
        <year>1995</year>
        <date>1995, Nov.</date>
        <title>Hypertext Markup Language &#x2014; 2.0</title>
        <site-title>The Internet Engineering Task Force</site-title>
        <authors>
          <author>
            <surname>Berners-Lee</surname>
            <initials>T.</initials>
          </author>
          <author>
            <surname>Connolly</surname>
            <initials>D.</initials>
          </author>
        </authors>
        <url>http://www.ietf.org/rfc/rfc1866.txt</url>
        <accessed-date>2 May 2002</accessed-date>
      </web>
      
      <standard id="sgmlstd">
        <organisation>International Organization for Standardization</organisation>
        <year>1986</year>
        <title>Information processing &#x2014; Text and office systems &#x2014; Standard Generalized Markup Language (SGML)</title>
        <number>ISO 8879:1986</number>
        <place>Geneva</place>
      </standard>

      <book id="env">
        <authors>
          <author>
            <surname>Kernighan</surname>
            <initials>B.W.</initials>
          </author>
          <author>
            <surname>Pike</surname>
            <initials>R.</initials>
          </author>
        </authors>
        <year>1984</year>
        <title>The &unix; Programming Environment</title>
        <publisher>Prentice Hall, Inc.</publisher>
        <place>Englewood Cliffs, New Jersey</place>
      </book>
      
      <book id="tools">
        <authors>
          <author>
            <surname>Kernighan</surname>
            <initials>B.W.</initials>
          </author>
          <author>
            <surname>Plauger</surname>
            <initials>P.J.</initials>
          </author>
        </authors>
        <year>1976</year>
        <title>Software Tools</title>
        <publisher>Addison-Wesley</publisher>
        <place>Reading, Massachusetts</place>
      </book>
      
      <journal-article id="unix">
        <authors>
          <author>
            <surname>Ritchie</surname>
            <initials>D.M.</initials>
          </author>
          <author>
            <surname>Thompson</surname>
            <initials>K.</initials>
          </author>
        </authors>
        <year>1974</year>
        <article-title>The &unix; time-sharing system</article-title>
        <title>Communications of the ACM</title>
        <volume>17</volume>
        <issue>7</issue>
        <pages>
          <first>365</first>
          <last>375</last>
        </pages>
      </journal-article>
      
      
    </bibliography>

  </sections>
</document>

