Creating a corpus
tOKo can handle documents in XML (HTML) and plain text. Documents in any other format (e.g. PDF or a Microsoft format) must first be converted to one of these formats in a pre-processing step. This pre-processing is not discussed further here.
tOKo requires that the corpus is represented in XML, but it does not require a specific organisation of the XML input. This flexibility is achieved by associating a knowledge base (KB) with the corpus. The KB describes how tOKo should interpret the content of the corpus. The following are the minimal requirements for a KB:
A corpus containing emails might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<maildatabase id="db1" title="All recent mails">
  <mail id="m1" subject="Hello world">
    <content>
      ... This is the content of the email ...
    </content>
  </mail>
  ... many more mail elements go here ...
</maildatabase>
The corresponding knowledge base should then contain:
<?xml version="1.0" encoding="UTF-8"?>
<kb>
  <document tag="maildatabase" type="folder">
    <id attribute="id"/>
    <title attribute="title"/>
  </document>
  <document tag="mail" type="file">
    <id attribute="id"/>
    <title attribute="subject"/>
    <body tag="content"/>
  </document>
</kb>
The mail database contains two types of documents: maildatabase for the entire database and mail for a single message. The line <title attribute="subject"/> states that tOKo should interpret the subject attribute of a mail document as the title to be displayed in the corpus. Similarly, <body tag="content"/> states that the content element of a mail is the body.
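The mapping described above can be illustrated with a small Python sketch. This is not how tOKo itself is implemented; it only mimics the lookup the KB describes, using the standard library's xml.etree.ElementTree. The function names read_field and describe are hypothetical, the sample corpus holds a single mail, and the XML declarations are left out of the embedded strings for brevity.

```python
import xml.etree.ElementTree as ET

# Sample corpus and KB, taken from the example above (one mail only).
CORPUS = """<maildatabase id="db1" title="All recent mails">
  <mail id="m1" subject="Hello world">
    <content>This is the content of the email</content>
  </mail>
</maildatabase>"""

KB = """<kb>
  <document tag="maildatabase" type="folder">
    <id attribute="id"/>
    <title attribute="title"/>
  </document>
  <document tag="mail" type="file">
    <id attribute="id"/>
    <title attribute="subject"/>
    <body tag="content"/>
  </document>
</kb>"""


def read_field(element, rule):
    """Resolve one KB rule (<id>, <title> or <body>) against a corpus element.

    Hypothetical helper: a rule with attribute= reads an XML attribute,
    a rule with tag= reads the text of a child element.
    """
    if rule is None:
        return None
    if "attribute" in rule.attrib:
        return element.get(rule.get("attribute"))
    if "tag" in rule.attrib:
        child = element.find(rule.get("tag"))
        if child is not None and child.text:
            return child.text.strip()
    return None


def describe(corpus_xml, kb_xml):
    """Return one dict per corpus element that the KB declares as a document."""
    kb = ET.fromstring(kb_xml)
    # Index the KB's <document> elements by the corpus tag they describe.
    rules = {doc.get("tag"): doc for doc in kb.findall("document")}
    found = []
    # iter() walks the corpus in document order, starting with the root.
    for element in ET.fromstring(corpus_xml).iter():
        doc_rule = rules.get(element.tag)
        if doc_rule is None:
            continue  # not declared as a document type in the KB
        found.append({
            "type": doc_rule.get("type"),
            "id": read_field(element, doc_rule.find("id")),
            "title": read_field(element, doc_rule.find("title")),
            "body": read_field(element, doc_rule.find("body")),
        })
    return found


if __name__ == "__main__":
    for doc in describe(CORPUS, KB):
        print(doc)
```

Under these assumptions the sketch reports the maildatabase element as a folder titled "All recent mails" and the mail element as a file titled "Hello world" whose body is the text of its content element.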
The screenshot below illustrates how the rendering of a specific type of corpus (a weblog) can be customised by specifying the appropriate KB. Posts in this weblog are shown after the icon that looks like a letter, comments on posts are yellow and trackbacks are blue. Metadata associated with a comment (email, IP and URL) are displayed when the user clicks on the comment.
Below are the detailed steps for specifying a KB associated with a corpus. A corpus consists of documents, and for each "type" of document a document element should be present in the KB.
The possible attributes of a document element in the KB are:
The possible sub-elements of a document element are given below. In all cases, attribute= and tag= are interchangeable. For example:
<mail id="m1" title="Hello"> ... </mail>