Basic text analysis

This page contains descriptions of the basic text analysis and search functions.

Most of the functions described below produce different results depending on the mode. When ontology mode is active, common words with little content are not included in the results. For example, Corpus / Words will not show words like the, is, and on when ontology mode is active. The same applies when searching for compounds: in ontology mode, prefixing cheese will find goat cheese, but it will not show the cheese.

In language mode, all words and/or terms are included in the results.
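
The difference between the two modes can be pictured as a stop-word filter applied before counting. The following is a minimal sketch, not the tool's actual implementation: the STOP_WORDS set and the word_frequencies function are hypothetical names, and the real list of common words is not documented on this page.

```python
from collections import Counter

# Hypothetical stop-word list; the list used by the tool is an assumption.
STOP_WORDS = {"the", "is", "on"}

def word_frequencies(tokens, ontology_mode):
    """Count tokens, skipping common words when ontology_mode is True."""
    if ontology_mode:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return Counter(tokens)

tokens = "the cheese is on the table".split()
ontology_counts = word_frequencies(tokens, ontology_mode=True)
language_counts = word_frequencies(tokens, ontology_mode=False)
```

Here ontology_counts contains only cheese and table, while language_counts contains every word, including the.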


Words in the corpus

The Corpus menu provides access to options that summarize the frequencies of words and of subsets of words in the corpus.

The screenshot illustrates the user interface after the user has selected Corpus / Lemmas, which displays all lemmas in the second browser. The user then clicked on one of the lemmas (egg). The third browser now displays the inflections (e.g. the plural eggs), and the first browser displays the documents that contain the lemma egg.

  • Corpus / Words: all words in the corpus and their frequency. The results are case sensitive. Selecting a result shows the case variations in the third browser.
  • Corpus / Plains: all plain versions of the words in the corpus. A plain word is a word without uppercase letters and without diacritics. For example, the plain version of Rôle is role.
  • Corpus / Lemmas: all headwords of the lemmas in the corpus. A lemma is the base or uninflected form of the word. Selecting a lemma shows the inflected variations in the third browser.
  • Corpus / Capitalized: all capitalized words in the corpus. Useful for finding names of people and places.
  • Corpus / All capitals: words that contain only capitals. These are often abbreviations.
  • Corpus / Diacritics: words that contain at least one diacritic.
  • Corpus / Adjectives: all words that can be an adjective (little, half, new).
  • Corpus / Adverbs: all words that can be an adverb.
  • Corpus / Nouns: all words that can be a noun.
  • Corpus / Numerals: all words that can be a numeral (one, ten, hundred).
  • Corpus / Pronouns: all words that can be a pronoun.
  • Corpus / Verbs: all words that can be a verb.
  • Corpus / Unknowns: all words not in the dictionary.
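
Some of the subsets above can be approximated with the Python standard library. This is a sketch under assumptions: the function names are hypothetical, and the tool may define plain words and diacritics differently (for instance, letters such as ø do not decompose into a base letter plus a combining mark and are left unchanged here).

```python
import unicodedata

def strip_diacritics(word):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def plain(word):
    """Plain version: no uppercase letters and no diacritics."""
    return strip_diacritics(word).lower()

def has_diacritic(word):
    """True if stripping combining marks changes the word."""
    return strip_diacritics(word) != unicodedata.normalize("NFD", word)

def is_all_capitals(word):
    """True if the word consists only of uppercase letters."""
    return word.isalpha() and word.isupper()

plain_role = plain("Rôle")
```

With these definitions, plain("Rôle") gives role, has_diacritic("Rôle") is true, and is_all_capitals("NATO") is true.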

Prefixing, infixing and postfixing terms

Prefixing, infixing and postfixing (also called suffixing) are simple and powerful methods to find compound terms. In English, the general rule is that the last word of a compound determines what it is. For example, chocolate cake is a cake (and not a kind of chocolate).

Enter a term into the text entry field and then click on the prefix, infix or postfix icon. The results are shown in the third browser. Prefix looks before the term: prefixing cheese could result in compounds like goat cheese or fresh cheese. Postfix looks after the term: postfixing cheese results in cheese platter. Infix looks both before and after the term; it is rarely used.
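
Prefixing and postfixing can be illustrated on a list of tokens by looking at the word directly before or after the term. The sketch below is a simplification that only handles two-word compounds and treats punctuation as a boundary; the function names are hypothetical and the tool's own matching rules may differ.

```python
def prefix_compounds(tokens, term):
    """Two-word compounds in which term is the last word."""
    return [
        f"{prev} {cur}"
        for prev, cur in zip(tokens, tokens[1:])
        if cur == term and prev.isalpha()
    ]

def postfix_compounds(tokens, term):
    """Two-word compounds in which term is the first word."""
    return [
        f"{cur} {nxt}"
        for cur, nxt in zip(tokens, tokens[1:])
        if cur == term and nxt.isalpha()
    ]

tokens = ["goat", "cheese", ",", "fresh", "cheese", ",", "cheese", "platter"]
prefixed = prefix_compounds(tokens, "cheese")
postfixed = postfix_compounds(tokens, "cheese")
```

For these tokens, prefixed holds goat cheese and fresh cheese, and postfixed holds cheese platter.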

When the third browser is showing a list of "fix" terms, a drag-and-drop from any browser results in the "fix" being applied to the term dropped.

Sub-strings in words

The drop-down menu provides a Sub-string function which searches for all words that contain the sub-string provided in the text entry field. For example, with the sub-string xt, possible results are next and extra. The results are displayed in the second browser.

The special character ^ matches the beginning of a word and $ matches the end of a word. Thus, xt$ matches all words ending in xt. English word endings are fairly regular: many verbs can be turned into a noun by adding the suffix er (work, worker; read, reader; etc.).
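
The behaviour of ^ and $ described above resembles regular-expression anchors. The following sketch assumes that only these two characters are special and that everything else in the query is matched literally; substring_search is a hypothetical name, not a function of the tool.

```python
import re

def substring_search(words, query):
    """Return the words containing query; ^ and $ anchor to word start and end."""
    starts = query.startswith("^")
    ends = query.endswith("$")
    core = query[1:] if starts else query
    core = core[:-1] if ends else core
    # Escape the remaining text so it is matched literally.
    pattern = ("^" if starts else "") + re.escape(core) + ("$" if ends else "")
    regex = re.compile(pattern)
    return [w for w in words if regex.search(w)]

words = ["next", "extra", "text", "worker"]
contains_xt = substring_search(words, "xt")
ends_in_xt = substring_search(words, "xt$")
ends_in_er = substring_search(words, "er$")
```

In this example, xt matches next, extra and text, xt$ matches only next and text, and er$ matches worker.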

Popups in the browsers

All five browsers contain a popup menu under the right mouse button. The options in these popups are described below.

  • Save as / Text .... Saves the content of the browser in a text file.
  • Save as / CSV (Excel) .... Saves the content of the browser in a comma-separated file, suitable for loading into a program that can read such files (e.g. Microsoft's Excel or OpenOffice).
  • Save as / XML .... Saves the content in an XML format.
  • Sort / Terms. Sorts the entries in the browser in lexicographic order.
  • Sort / Frequency. Sorts the entries in the browser by frequency (shown between square brackets).
  • Sort / Score. Sorts the entries in the browser by score (shown between angle brackets).
  • Sort / Reverse. Reverses the order in which the entries are displayed.
  • Hide ontology terms. Removes all entries in the browser which already appear in the ontology.
  • Highlight [first 100]. The next time a document is selected, the first 100 terms in the browser are highlighted in the document. Highlighted words appear with a light blue background, as shown below.

  • Shrink head. If the browser contains compound terms, the first word (the head) of each such term is removed. This is mainly intended for use in combination with postfix. For example, after postfixing cheese, a subsequent Shrink head removes the cheese part, leaving all words that come after cheese in the corpus.
  • Shrink tail. Rather than removing the first word of compounds it removes the last word (the tail). Useful after a prefix.
  • Shrink spaces. Removes the spaces in compounds.
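
The three Shrink options amount to simple string operations on each entry. The sketch below assumes that the words of a compound are separated by single spaces and that entries without a space are left unchanged; the function names are hypothetical.

```python
def shrink_head(term):
    """Remove the first word of a compound term."""
    _, sep, rest = term.partition(" ")
    return rest if sep else term

def shrink_tail(term):
    """Remove the last word of a compound term."""
    rest, sep, _ = term.rpartition(" ")
    return rest if sep else term

def shrink_spaces(term):
    """Remove the spaces in a compound term."""
    return term.replace(" ", "")

after_postfix = [shrink_head(t) for t in ["cheese platter", "cheese knife"]]
after_prefix = [shrink_tail(t) for t in ["goat cheese", "fresh cheese"]]
```

Here after_postfix holds platter and knife, and after_prefix holds goat and fresh.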