Searching in #LancsBox X

#LancsBox X offers powerful searches at different levels of corpus annotation using 1) simple searches, 2) wildcard searches, 3) smart searches, and 4) CQL searches. This article provides a brief introduction to each of these search types.

 

1) Simple Searches

 

A literal search for a specific word or phrase is a simple search in #LancsBox X. It finds every occurrence of the word or phrase that matches the search string exactly, apart from letter case. Simple searches are case-insensitive, so the search ‘new york’ also finds the capitalized ‘New York’.

 

2) Wildcard Searches

 

The asterisk (*) is used as a special character in #LancsBox X referred to as a ‘wildcard’. When this asterisk is included in a search, it can have two meanings:

 

Meaning                   Example of use

0 or more characters      new* [new, news, newly, newspaper…]

any word [with space]     new * [new car, New York, new ideas…]
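To make the two meanings concrete, the following Python sketch emulates them with regular expressions. It only illustrates the behaviour described above and is not how #LancsBox X is implemented. The function name wildcard_to_regex and the sample text are hypothetical, and words are simply assumed to be separated by whitespace.

```python
import re

def wildcard_to_regex(query: str) -> re.Pattern:
    """Translate a wildcard query into a case-insensitive regex (illustrative only)."""
    parts = []
    for term in query.split():
        if term == "*":
            # A standalone asterisk stands for any single word.
            parts.append(r"\S+")
        else:
            # An asterisk attached to a word stands for 0 or more characters.
            pieces = [re.escape(piece) for piece in term.split("*")]
            parts.append(r"\S*".join(pieces))
    # Anchor the match on word boundaries defined by whitespace.
    return re.compile(r"(?<!\S)" + r"\s+".join(parts) + r"(?!\S)", re.IGNORECASE)

text = "The news about New York is new to newly arrived readers"

prefix_hits = wildcard_to_regex("new*").findall(text)
phrase_hits = wildcard_to_regex("new *").findall(text)

print(prefix_hits)  # ['news', 'New', 'new', 'newly']
print(phrase_hits)  # ['New York', 'new to']
```

In this sketch, new* stays within a single word, whereas new * always spans two words.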

 

3) Smart Searches

 

Smart searches are a feature unique to #LancsBox X which allows users to run complex searches more easily. Certain words and phrases typed in all capital letters act as shortcuts for complex searches which #LancsBox X can carry out. Users can search for word classes (e.g. NOUN, VERB, ADJECTIVE), complex grammatical patterns (e.g. PASSIVE, SPLIT_INFINITIVE), and semantic categories (e.g. PLACE_ADVERB). A full list of available smart searches can be found in Section 4 of the #LancsBox X manual.

 

4) Corpus Query Language (CQL) Searches

 

#LancsBox X supports powerful searches using CQL. These can be used for defining complex searches at different levels of annotation.

The available levels of annotation and their syntax depend on how the corpus is tagged, but XML corpora commonly have i) word, ii) headword/lemma (hw), iii) part-of-speech (pos), and iv) a user-defined tag (usas in the example here). For example, a single token can be searched in CQL with:

 

[word="goes" hw="go" pos="V.*" usas="M1"]

 

This will match every instance of the word goes with the headword go, a part-of-speech tag matching V.* (verb) and the usas tag M1 (Moving, coming and going). If a level of annotation is not specified, no restriction is applied at that level. Everything in double quotes is interpreted as a case-insensitive regular expression.
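The matching rule for a single token can be sketched in Python as follows. This is a minimal illustration, not the #LancsBox X implementation: the function token_matches, the dictionary representation of tokens and the sample tag values are all hypothetical, and the sketch assumes that each pattern must match the whole annotation value.

```python
import re

def token_matches(token: dict, constraints: dict) -> bool:
    """Return True if every specified annotation level matches the token."""
    for level, pattern in constraints.items():
        value = token.get(level, "")
        # Assumption: the pattern must cover the whole value, ignoring case.
        if re.fullmatch(pattern, value, flags=re.IGNORECASE) is None:
            return False
    # Levels that are not mentioned in the query impose no restriction.
    return True

# Hypothetical tokens with four annotation levels each.
tokens = [
    {"word": "Goes", "hw": "go", "pos": "VVZ", "usas": "M1"},
    {"word": "goes", "hw": "go", "pos": "NN2", "usas": "M1"},
    {"word": "went", "hw": "go", "pos": "VVD", "usas": "M1"},
]

query = {"word": "goes", "hw": "go", "pos": "V.*", "usas": "M1"}
hits = [t["word"] for t in tokens if token_matches(t, query)]
lemma_hits = [t["word"] for t in tokens if token_matches(t, {"hw": "go"})]

print(hits)        # ['Goes']
print(lemma_hits)  # ['Goes', 'goes', 'went']
```

The second query restricts only the hw level, so it accepts all three tokens regardless of their word form or part-of-speech tag.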

 

Multiple tokens can be placed in sequence. An empty pair of square brackets [] will match any token. A token can be repeated exactly X times using the syntax {X}, or anywhere between Y and Z times using the syntax {Y,Z}. The shorthand for {0,1} is a question mark. Thus, for instance, the following CQL expression:

 

[pos="VB.*"] []{0,3} [pos="V.N"]?

 

is interpreted as a form of the verb to be (VB.*), followed by between 0 and 3 tokens without restriction ([]{0,3}), optionally followed by a past participle (V.N).
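One way to picture how such a sequence is evaluated is a small backtracking matcher, sketched below in Python. It is an assumption-laden illustration rather than the actual #LancsBox X algorithm: the functions token_ok and match_from, the representation of the query as (constraints, minimum, maximum) triples, and the tagged sample sentence are all hypothetical.

```python
import re

def token_ok(token, constraints):
    # An empty constraint dictionary corresponds to [] and accepts any token.
    return all(
        re.fullmatch(pattern, token.get(level, ""), flags=re.IGNORECASE)
        for level, pattern in constraints.items()
    )

def match_from(tokens, start, query):
    """Return the end index of a match of `query` beginning at `start`, or None."""
    if not query:
        return start
    (constraints, low, high), rest = query[0], query[1:]
    # Count how many consecutive tokens satisfy this element, up to `high`.
    count = 0
    while (count < high and start + count < len(tokens)
           and token_ok(tokens[start + count], constraints)):
        count += 1
    # Greedy with backtracking: try the largest repetition first.
    for n in range(count, low - 1, -1):
        end = match_from(tokens, start + n, rest)
        if end is not None:
            return end
    return None

# Hypothetical tagged sentence: "The letter was very quickly written by hand".
words = ["The", "letter", "was", "very", "quickly", "written", "by", "hand"]
tags = ["AT", "NN1", "VBDZ", "RG", "RR", "VVN", "II", "NN1"]
tokens = [{"word": w, "pos": p} for w, p in zip(words, tags)]

# [pos="VB.*"] []{0,3} [pos="V.N"]?
query = [({"pos": "VB.*"}, 1, 1), ({}, 0, 3), ({"pos": "V.N"}, 0, 1)]

matches = []
for start in range(len(tokens)):
    end = match_from(tokens, start, query)
    if end is not None:
        matches.append(" ".join(t["word"] for t in tokens[start:end]))

print(matches)  # ['was very quickly written']
```

Because the final element is optional, the query also matches when no past participle follows; in this sketch the unrestricted []{0,3} greedily takes the three tokens after "was".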

 

Parts of a query can also be wrapped in parentheses (), allowing a quantifier such as {2} or {1,2} to apply to a sequence of tokens, e.g. ([pos="N.*"] [word="and"]){2}. Words, phrases and smart searches can be used anywhere CQL tokens can, e.g. very{2} ADJECTIVE{1,2} [hw="year"].

 

CQL also supports searching XML structure. The query <u/> matches every <u></u> element, which here represents an utterance. The following query matches every utterance whose n attribute is 1 and whose nationality attribute is British or American:

 

<u n="1" nationality="British|American"/>
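The effect of such an element query can be imitated in Python with the standard xml.etree.ElementTree module, as in the sketch below. The sample XML, the function element_matches and the whole-value, case-insensitive treatment of attribute patterns are assumptions made for illustration; they are not taken from #LancsBox X itself.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature corpus with four utterances.
sample = """
<text>
  <u n="1" nationality="British">Hello there</u>
  <u n="2" nationality="British">How are you</u>
  <u n="1" nationality="American">Hi</u>
  <u n="1" nationality="Irish">Good morning</u>
</text>
""".strip()

def element_matches(element, attributes):
    """Return True if every listed attribute matches its regular expression."""
    return all(
        re.fullmatch(pattern, element.get(name, ""), flags=re.IGNORECASE)
        for name, pattern in attributes.items()
    )

root = ET.fromstring(sample)

# Counterpart of <u n="1" nationality="British|American"/>
query = {"n": "1", "nationality": "British|American"}
hits = [u.text for u in root.iter("u") if element_matches(u, query)]

print(hits)  # ['Hello there', 'Hi']
```

The second utterance is excluded because its n attribute is 2, and the fourth because its nationality matches neither alternative.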

 

These element queries can be combined with the other types of queries using the within syntax:

 

[pos="D.*"] green NOUN within <text genre="newspapers"/>

 

This query matches every instance of a determiner followed by “green” followed by a noun within newspaper texts. The left-hand and right-hand sides of a within query can be any query, including other within queries:

 

(<emoji/> within please) within (<e/> within <text genre="elanguage"/>) 
