Pattern search

The pattern search facility in tOKo provides an intuitive and powerful mechanism to search for (short) phrases in a corpus. Pattern search is different from typing a query to a search engine (e.g. Google). A search engine returns documents related to the query, normally documents that contain the words in the query. Pattern search returns phrases that "precisely" match what you are looking for.

On this page a pattern looks like this: This is a pattern. In the user interface pattern search is available through Pattern search. The results are shown in the fourth browser.

  • Overview. An introduction to the pattern search language.
  • Reference. A description of all pattern elements.
  • API. Initial description of obtaining the results of a pattern search for researchers.
  • Technical issues. Some technical issues associated with pattern search.

Overview

To illustrate the various possibilities of the pattern search facility we consider searching for fruit. The pattern fruit matches the word fruit, including case variations, and nothing else. When we take fruit as a lemma in the English language, we may also want it to match inflections such as fruits. The pattern then becomes (fruit), brackets are the notational convention stating we are searching for a lemma. The next level is to consider fruit as a concept in a user-defined ontology. The notion of fruit may then have sub-concepts called apple, pear, banana and so forth. The pattern that refers to fruit as a concept is [fruit], and it matches all lexical variations of fruit and its sub-concepts.

Grammar-based patterns can be created from primitives that represent a word class. The notation is <word class>. For example <noun> matches all words in the dictionary that can be a noun. When we want all matches for certain amounts of fruit, the pattern <numeral> [fruit] would match two apples. The space is itself also a pattern element, it matches white space.

Syntax and special characters

Syntax follows common practice by using a simple quoting scheme and the \ (backslash) as the escape character. The pattern "(a)" matches precisely (a), a quote itself can be escaped: "\"" matches a double quote.

The pattern language defines several special characters. When you want to match such a character literally, it needs to be escaped with a backslash. The special characters are:

Content: # * @ _
Brackets: () [] {} $ <>
Operators: ! ^ |
Strings and escape: " \

Curly brackets are used for grouping, and largely determine how a pattern is parsed. For example, a|b c is potentially ambiguous. It could be interpreted as: {a|b} c or as a|{b c}.

Tokens and white space handling

tOKo uses a special definition of what a token is. In most text analysis tools a space is used to delineate a token. A sentence like BBC1 starts at 12 o'clock then contains five tokens. tOKo defines the following kinds of tokens:

  1. word: a consecutive sequence of letters. The pattern symbol is *.
  2. integer: a consecutive sequence of digits. The pattern symbol is #.
  3. space: a consecutive sequence of spaces or tabs.
  4. newline: a consecutive sequence of newlines.
  5. character: a single character not implied by any of the above.

For example, *# matches a word immediately followed by an integer: BBC1, and * # matches BBC 1.

The pattern for any single token is _ (an underscore). The pattern (a space) matches any amount of white space and may include both spaces and newlines.

Sequences, disjunctions and optional patterns

A pattern is normally matched from left to right against the corpus. For example, if the pattern is a b (were a and b are arbitrary patterns) then matches can only occur if the corpus contains a phrase that matches a, immediately followed by white space, immediately followed by phrases that match b.

The only construct to modify this order is a disjunction (also called alternation) with the | (vertical bar) operator. For example, apple|pear matches both apple and pear. More complex patterns are constructed by using curly brackets for grouping: one {apple|pear} or I like {[fruit]|[drink]}.

There is no special symbol for optional patterns. They are constructed by a disjunction of the pattern itself and the empty pattern. For example, one {*|} apple matches one apple as well as one tasty apple.

Syntactic glue: near and further

The pattern language provides two constructs to cope with the enormous flexibility language provides. The syntactic glue operators are an alternative for extensive grammatical rules that try to take all possibilities that could possibly occur into account.

The near operator ... (ellipses, three dots), for example <verb> ... <noun> matches eat an apple. The near operator matches at most five tokens, and would thus also match eat a tasty apple. In general, for a high recall you should use the near operator lavishly. There is a subtle difference between a...b and a ... b. In the first pattern, spaces between a and b have to be matched by the near operator, in the second pattern they need not be. The near operator may be suffixed with a number indicating the maximum number of tokens it can match: an ...1 apple.

The further operator ,,, (three comma's) is specifically for natural language analysis: it matches until the end of the sentence if necessary. See below.

Neither near nor further can start a pattern. They can end a pattern however. In that case, near matches as many tokens as it can, whereas further matches until the end of the sentence. Both are useful for exploration purposes: Fred....

Sentences and other boundaries

In most cases phrases that match a pattern should not run over sentence boundaries. Unfortunately it is not clear where sentences start and end. The pattern language provides a solution for two special cases.

Firstly, if the input is HTML a special null token is inserted whenever a natural language phrase cannot be not continued (e.g. over paragraph boundaries). The following HTML tags insert null tokens before and after them: blockquote, br, br, dd, div, dl, dt, h1, h2, h3, h4, h5, h6, hr, li, ol, p, table, tbody, td, tr, ul.

Secondly, the most common way to end a sentence and start a new one is the pattern: {.|"!"|?} <word;capitalized>. That is, a sentence ends with a dot, exclamation mark or question mark and starts with an capitalized word.

In both cases, <sos> matches the start of a sentence and <eos> matches the end of sentence. This implies that one way to find question sentences is <sos>,,,?<eos>.

The automatically inserted null token can be matched explicitely with <null>.

Set operations on phrases

Consider the phrases that is good and that was not good. Both match that...good. In cases such as this one it makes sense to think of matching phrases as a set and introduce operators that operate on sets. The operators are ! which excludes some elements and ^ which selects (or includes) some elements. Applied to the example: {that...good}!not excludes the phrases that contain not, whereas {that...good}^not selects the phrases that contain not.

Language is subtle at times. In one application, the distinction between A influences B and B was influenced by A. The two sentences are semantically identical, although A and B are reversed. In this case the by can be used as determiner to decide on the direction.

Named patterns

A (set of) pattern which refers to a single notion can be given a symbolic name to make re-use easier. An example is given in the screenshot on the right where the notion of a weekday is defined by listing the patterns for the days of the week. The weekday named pattern can be referred to in other patterns as follows: $Weekday$. The $ starts and ends a named pattern.

Named patterns are created as follows:

  1. Select the Resources tab.
  2. Select toko:Pattern (this is the root pattern resource).
  3. Enter the name of the named pattern in the text-entry field.
  4. Click the "Add a sub-concept", the name entered is now a named pattern.
  5. Enter a pattern into the text-entry field.
  6. Click the "Add a pattern" button.
  7. Repeat the previous two steps for additional patterns.
  8. Try it.

Sigmund terms

If Sigmund has extracted a set of meaningful terms from the corpus, @ can be used to refer to any of these terms. This can be very useful as Sigmund terms can be compound and, consequently, the results might be interesting to include in an ontology. For example @ and @ applied to a cooking weblog yields salt and pepper as the most frequent match, but also olive oil and balsamic vinegar.

Lists and other recursive structures

A pattern that should match a list of something can be defined as a named pattern. For example, a list of nouns is either a noun or a noun followed by a space followed by a list of nouns. The named $noun_list$ pattern is therefore: {<noun>|<noun> $nounlist$}.

Overlapping matches, disjointness

In an ontology about food, the concepts apple and apple pie overlap. The default is to consider both matches in the search. In many cases this is not desirable and the disjoint option for concept and named patterns prevents matching of overlaps. For example, in our food ontology, [food;disjoint] the phrase ``an apple pie is a pie made of apples'' will match apple only once.

Reference

The elements that can make up a pattern are:

word
Matches precisely that word. If word is lowercase then all case variants of the word also match. Otherwise the case match has to be exact. For example, hello also matches Hello, whereas Hello only matches Hello.
integer
Matches precisely that integer.
character
Matches precisely that character, provided it is not a meta-character.
\character
Matches precisely that character, even if it is a meta-character.
"literal"
Matches precisely that literal, escape double quotes inside the literal with a backslash.
  (space)
Matches a space, or more precisely a word break. That is any sequence of spaces and/or newlines. A word break sequence may start with a hyphen.
(lemma)
Matches lemma and all its inflections. For example, (car) matches both car and cars. Case is ignored.
(lemma;word class)
Matches lemma for the given word class. (cook;verb) matches cooked, whereas (cook;noun) does not (it matches both cook and cooks).
<word class>
Matches all words that can belong to the word class. The <> notation is also used for several special cases which are not strictly word classes. Common abbreviations used by part-of-speech taggers are included as short hands. Currently, the longer (adjective) and shorter notation (jj) are equivalent. Possible values of word class are listed in the table below:

<adjective> or <jj>Matches an adjective.
<numeral> or <cd>Matches a numeral.
<adverb> or <rb>Matches an adverb.
<article> or <at>Matches an article.
<noun> or <nn>Matches a noun.
<pronoun> or <pn>Matches a pronoun.
<verb> or <vb>Matches a verb.
<preposition> or <in>Matches a preposition.
<interjection> or <uh>Matches an interjection.
<character> or <chr>Matches any character.
<eos>Matches the end of a sentence.
<integer> or <int>Matches any integer, the same as #.
<nl>Matches a newline.
<null>Matches a virtual token that separates two sections of running text. A null token has to be matched explicitly, ... and ,,, stop when they encounter a null token. See also null tokens.
<sos>Matches the start of a sentence.
<spc>Matches a space.
<unknown> or <unk>Matches any word not in the dictionary.
<word>Matches any word, the same as *.

<word class;options>
The options provide a mechanism to restrict the match of the given word class. Possible options are given in the table below:
capitalizedMatches only capitalized words.
chars=NMatches only words of precisely N characters.
chars=N-MMatches only words or integers that contain between N and M characters. The value of M can be inf or infinite.
frequency=NMatches only words which occur with a frequency of N.
frequency=N-MMatches only words which occur with a frequency between N and M. The value of M can be inf or infinite.
upperMatches only uppercase words.
lowerMatches only lowercase words.
mixedMatches only mixed-case words.
[concept]
Matches all lexical variations of concept and, recursively, its sub-concepts and instances.
# (hash symbol)
Matches any integer (sequence of digits).
* (asterisk)
Matches any word (sequence of letters).
*integer
Matches integer consecutive words separated by space breaks. For example, * * * and *3 are identical.
_ (underscore)
Matches any token.
_integer
Matches integer tokens.
...
Matches at most five arbitrary tokens.
...integer
Matches at most integer tokens. For example, I ...10 apples would match I really don't like these apples.
@
Matches any Sigmund term that appears in the corpus.
@integer
Matches any Sigmund term that appears with a frequency of at least integer times in the corpus.
@"term"
Matches precisely the given Sigmund term. For example, @"ice cream".
{pattern}
Matches pattern, used for grouping.
left|right
Matches either the left or the right pattern.
left^right
Matches left provided it also matches right.
left!right
Matches left provided it does not match right.

API

This is an initial description of how the pattern search facility can be accessed by researchers.

Running pattern search

pattern_search_corpus(+Pattern, -Matches, +Options)
Runs a pattern search for the Pattern, the matching phrases are returned in Matches. The list of Options is normally empty for non-interactive searches. An example is:
:- use_module(load).
:- use_module(toko(pattern_search)).

?- pattern_search_corpus('a ... apple', Matches, []).

Matches = [tf('a warm apple', 1),
           tf('a splash of apple', 1),
           tf('a regular apple', 1),
           tf('a truly delicious apple', 1)] 

Examining the results

As may be obvious from the example it is not entirely trivial to relate matches to pattern elements (e.g. that the dots matched warm). A provisional predicate relating a pattern and its results is:

pattern_search_match(-LinearPattern, -Match)
Unifies a match of the previous pattern search with the pattern itself. LinearPattern is a pattern in the internal list representation that contains no disjunctions. Match is a list of the matching atoms in the corpus. The pattern elements in LinearPattern and the atoms in Match are aligned:
:- use_module(load).
:- use_module(toko(pattern_search)).

?- pattern_search_match(LP, Match).

LP =    [plain(a), word_break, near(5), plain(apple)]
Match = [a,        ' ',        'warm ', apple]
For example, the third element of the pattern is near(5) (internal for ...), and the third element in the match is warm.

Technical issues

Depending on the size and complexity of the ontology, the initial call to pattern search can take some time (indexing the ontology). Thereafter, pattern search is extremely fast.

We are looking for suggestions on how to present the results of pattern search both to end-users and for researchers who want to analyse the results.