Creating a corpus

tOKo can handle documents in XML (including HTML) and plain text. If the documents exist in some other format (e.g. PDF or a Microsoft format), they must be converted to one of these supported formats as a pre-processing step. This pre-processing is not discussed further.

tOKo requires that the corpus be represented in XML, but it does not require a specific organisation of the XML input. This flexibility is achieved by associating a knowledge base (KB) with the corpus. The KB describes how tOKo should interpret the content of the corpus. A KB must specify at least the following:

  • Document. The XML element that represents a single document (e.g. an email or section). There can be several such elements, and they can be nested.
  • Identifier. A unique identifier for the document (used internally).
  • Title. The title of the document as the user sees it.
  • Body. The body or content of the document.

Example

A corpus containing emails might look like this:

<?xml version="1.0" encoding="UTF-8"?>

<maildatabase id="db1" title="All recent mails">              
  <mail id="m1" subject="Hello world">
    <content>
      ... This is the content of the email ...
    </content>
  </mail>

  ... many more mail elements go here ...
</maildatabase>

The corresponding knowledge base should then contain:

<?xml version="1.0" encoding="UTF-8"?>

<kb>
  <document tag="maildatabase" type="folder">
    <id attribute="id"/>
    <title attribute="title"/>
  </document>

  <document tag="mail" type="file">
    <id attribute="id"/>
    <title attribute="subject"/>
    <body tag="content"/>
  </document>
</kb>

The mail database contains two types of documents: maildatabase for the entire database and mail for a single message. The line <title attribute="subject"/> states that tOKo should interpret the subject attribute of a mail document as the title to be displayed in the corpus. Similarly, <body tag="content"/> states that the content element of a mail is the body.

Specification

The screenshot below illustrates how the rendering of a specific type of corpus (a weblog) can be customised by specifying the appropriate KB. Posts in this weblog are marked with an icon that looks like a letter, comments on posts are yellow, and trackbacks are blue. Metadata associated with a comment (email, IP and URL) is displayed when the user clicks on the comment.

Below are the detailed steps for specifying the KB associated with a corpus. A corpus consists of documents, and the KB should contain one document element for each type of document.

The possible attributes of a document element in the KB are:

tag="Element"
Tag of the Element as it appears in the corpus.
type="folder | file"
Whether the document type is a file or a folder. Optional.
corpus="yes | no"
If yes (the default), the content of this document type is included in the corpus. If no, the content is not part of the corpus but is still displayed.
sigmund="yes | no"
Whether Sigmund should look for interesting terms in this document type. Defaults to yes.
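
As a sketch of how these attributes combine, the fragment below declares a document type whose content is displayed but kept out of the corpus and skipped by Sigmund. The element name trackback and the attribute names id and url are assumptions chosen for illustration; they are not prescribed by tOKo.

```xml
<!-- Hypothetical document type: still displayed, but excluded from
     the corpus and not searched by Sigmund for interesting terms. -->
<document tag="trackback" type="file" corpus="no" sigmund="no">
  <id attribute="id"/>
  <title attribute="url"/>
</document>
```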

The possible sub-elements of a document element are given below. In all cases, attribute= and tag= are interchangeable. For example:

<mail id="m1" title="Hello"> ... </mail>
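
One reading of this interchangeability, sketched here under the assumption that attribute= selects an XML attribute of the document element while tag= selects one of its child elements, is that the same identifier and title could equally be stored as child elements. The child element names id and title are illustrative only.

```xml
<!-- Corpus: identifier and title stored as child elements (assumed layout) -->
<mail>
  <id>m1</id>
  <title>Hello</title>
</mail>

<!-- KB: the same values selected with tag= instead of attribute= -->
<document tag="mail" type="file">
  <id tag="id"/>
  <title tag="title"/>
</document>
```
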
<id attribute="Attribute"/>
Attribute uniquely identifies the document with respect to the entire corpus. Mandatory.
<title attribute="Attribute"/>
Attribute holds the title of the document. This title is shown to the user. Mandatory.
<body tag="Tag"/>
The body of the document is inside Tag. The body normally contains plain text, HTML, or arbitrary XML. Optional.
<icon src="File"/>
File contains the icon for this type of document. The default icon is a folder icon or a file icon, as determined by the type attribute. The icon should ideally be a 16x16 image (gif, xpm) stored in the icons folder of the application. Optional.
<person attribute="Attribute"/>
States that the author of a document is identified by Attribute.
<date attribute="Attribute"/>
Attribute holds the date associated with this type of document, written in ISO 8601 format (for example, 2006-07-20 for 20 July 2006). See ISO 8601 for the full format. Optional.
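
Putting the sub-elements together, a KB for a simple weblog corpus might look like the sketch below. The element names (weblog, post, text), the attribute names (name, author, posted) and the icon file post.gif are all hypothetical and only serve to show where each sub-element goes.

```xml
<?xml version="1.0" encoding="UTF-8"?>

<kb>
  <!-- The weblog itself acts as a folder. -->
  <document tag="weblog" type="folder">
    <id attribute="id"/>
    <title attribute="name"/>
  </document>

  <!-- A post: custom icon, with author and ISO 8601 date read from attributes. -->
  <document tag="post" type="file">
    <id attribute="id"/>
    <title attribute="title"/>
    <body tag="text"/>
    <icon src="post.gif"/>
    <person attribute="author"/>
    <date attribute="posted"/>
  </document>
</kb>
```

Under these assumptions, a matching corpus element would be written as <post id="p1" title="First post" author="alice" posted="2006-07-20">, with the text of the post inside a text child element.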