Guide to database

This guide explains how to use the database. It describes the conventions that we followed when entering the data, so that you can search the data effectively.

Go to the database.

How to search for data

To search the corpus for data, you can first optionally set the filters and enter search items (words, glosses, POS tags). Then press the "Search" button and wait until all data are displayed below. Finally, the results can be downloaded in different file formats (xml, tex, docx and pdf).

Note that the data contain Unicode characters, which might cause problems in LaTeX or Word. For LaTeX, use xelatex or lualatex as the compiler, or replace the problematic characters with the corresponding LaTeX commands. For Word, make sure that the encoding is set to UTF-8.

The print to PDF button will lead you to the print option in your browser. Make sure that your settings are set to "Save to PDF" if you would like to download a PDF.

Note that when you press the "Search" button again (possibly after altering the filters and search items), the previous search results are overwritten on the server. We therefore recommend that you save your search results between searches. For example, you can first retrieve and save the intransitive examples from a certain language and afterwards the transitive ones.

Make sure that you have the newest version of your browser installed. Some functions might not be fully available if the browser version is too old.

Searching for words, glosses or POS tags

The database can be searched for certain words, glosses and part-of-speech tags (henceforth "search items"). Enter them in the field "Free search". The free search is case-insensitive.

Tones in the data are marked by an acute accent (high tone) and a grave accent (low tone). If you do not use accents, the search results will include both toned and untoned versions of the search item. If you use accents, you will only find the search item with exactly the tones you entered. For example, you can search for only the high-toned version of a particle "la" by entering the search item "lá".
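The tone behaviour described above can be illustrated with a short Python sketch. It is not the database's actual implementation; the function names are hypothetical, and it assumes that tones are stored as acute and grave accents that Unicode normalization can separate from the base letter. It also compares whole words only, whereas the real free search may match differently.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent (high tone)
GRAVE = "\u0300"  # combining grave accent (low tone)


def strip_tones(text: str) -> str:
    """Remove acute and grave tone marks, keeping all other characters."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch not in (ACUTE, GRAVE))
    return unicodedata.normalize("NFC", stripped)


def matches(search_item: str, word: str) -> bool:
    """Case-insensitive match; tones matter only if the search item has them."""
    item = unicodedata.normalize("NFC", search_item).casefold()
    target = unicodedata.normalize("NFC", word).casefold()
    if strip_tones(item) == item:
        # Untoned search item: compare with the tones removed from the word.
        return item == strip_tones(target)
    # Toned search item: require exactly the same tones.
    return item == target
```

With this sketch, the untoned item "la" matches "la", "lá" and "là", while the toned item "lá" matches only "lá".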

If you want to use multiple search items, these must be connected by the symbol "*". Using a simple space will lead to finding a word together with its gloss. If you are looking for a specific English sentence to translate, simply enter the sentence as is.

A search option for sequences of words/glosses/POS tags and for regular expressions is in progress.

Filtering the data for certain constructions

Additionally, the data can currently be filtered according to the following properties:

Filter              Description
Language            finds data only from a certain language
Audio               finds either data that come with an audio file or data that do not have one
Sentence embedding  finds data which are either simple clauses or a certain type of complex clause, e.g. object clause, adverbial clause etc.
Sentence type       finds data which are either declaratives or questions or relative clauses etc.
Focus type          finds data that have no focus, or new information focus or contrastive focus etc.
Wh/rel/foc element  finds the data where wh-movement, relativization or focalization targets only the chosen constituent
Transitivity        finds data with intransitive, transitive, or ditransitive verbs or serial verb constructions
Aspect              finds data with a certain type of aspect (e.g. imperfective, perfective)
Tense               finds data in a certain tense (e.g. past, present, future)
Polarity            finds either affirmative or negative sentences
Tone                finds either data that do not contain tonal marking or data that are at least partially toned

If a filter is set to "Select all", the data are not filtered for the specific property.
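As a rough illustration of how filters combine, here is a minimal Python sketch. It assumes that each datapoint can be represented as a dictionary of properties; the key names ("language", "transitivity", "tense") and the values are hypothetical and only mirror the filter names above. A filter left at "Select all" is simply ignored.

```python
SELECT_ALL = "Select all"


def apply_filters(datapoints, filters):
    """Keep the datapoints whose properties equal every active filter value."""
    # Filters set to "Select all" do not restrict the result.
    active = {name: value for name, value in filters.items() if value != SELECT_ALL}
    return [
        dp for dp in datapoints
        if all(dp.get(name) == value for name, value in active.items())
    ]


# Hypothetical data: two examples from one language.
data = [
    {"language": "Likpakpaanl", "transitivity": "intransitive", "tense": "past"},
    {"language": "Likpakpaanl", "transitivity": "transitive", "tense": "past"},
]

intransitives = apply_filters(
    data,
    {"language": "Likpakpaanl", "transitivity": "intransitive", "tense": SELECT_ALL},
)
```

Here `intransitives` contains only the first datapoint, because the tense filter is left at "Select all" and therefore has no effect.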

In the future, further filters might be added.

Audio files


Some examples in the database come with audio files in which speakers were recorded pronouncing the respective example. If your browser allows it, a media player is displayed with the respective example (next to "date"). There you can play the file and download it. All audio files are mp3 files named after the respective example key (e.g. "Likpakpaanl-24.mp3").

Glossing



Some closed-class words can be glossed by abbreviations or by English translations. The following lists the conventions that are used in the database.

  1. All glosses including functional glosses are in lower case. Formatting to small caps is done through formatting commands if needed.
  2. Pronouns are NOT glossed by English translations (i.e. "I", "you", "he" etc.), but by the phi-features they express (i.e. 1sg, 2sg, 3sg, etc.).
  3. If a language differentiates different types of locatives (e.g. in vs. on), the English translations are used. If there is no distinction, the gloss is "loc".
  4. Particles that are used for marking focus are glossed as "foc" throughout, even if they are used in non-focus contexts.

List of glosses

When entering the data and searching for data, the glosses are required to follow the Leipzig Glossing Rules (LGR). Glosses that are not in the LGR should be listed here under "own convention". Importantly, glosses should not diverge from this list.

Gloss   Meaning                     Source
1       first person                Leipzig Glossing Rules
2       second person               Leipzig Glossing Rules
3       third person                Leipzig Glossing Rules
acc     accusative                  Leipzig Glossing Rules
anim    animate                     own convention
comp    complementizer              Leipzig Glossing Rules
compl   completive                  Leipzig Glossing Rules
cj      conjoined                   own convention
conj    conjunction, conjoined      own convention
cop     copula                      Leipzig Glossing Rules
def     definite                    Leipzig Glossing Rules
dem     demonstrative               Leipzig Glossing Rules
dir     directional                 own convention
dj      disjoined                   own convention
emph    emphatic                    own convention
foc     focus                       Leipzig Glossing Rules
fut     future                      Leipzig Glossing Rules
hest    hesternal                   own convention
hum     human                       own convention
ipfv    imperfective                Leipzig Glossing Rules
loc     locative                    Leipzig Glossing Rules
nom     nominative                  Leipzig Glossing Rules
nc      noun class                  own convention
neg     negation, negative          Leipzig Glossing Rules
pfv     perfective                  Leipzig Glossing Rules
poss    possessive                  Leipzig Glossing Rules
pro     pronoun                     own convention
prog    progressive                 Leipzig Glossing Rules
pst     past                        Leipzig Glossing Rules
pl      plural                      Leipzig Glossing Rules
ptcl    particle                    own convention
q       question particle/marker    Leipzig Glossing Rules
rel     relative                    Leipzig Glossing Rules
sg      singular                    Leipzig Glossing Rules
tns     tense                       own convention
top     topic                       Leipzig Glossing Rules
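To show how the list can be used to read a complex gloss, here is a small Python sketch. It makes two assumptions that this guide does not state explicitly: that the parts of a complex gloss are joined by a period (as in "hest.pst") and that person and number are written together (as in "1sg"). The dictionary holds only a subset of the list above, and the function name is hypothetical.

```python
import re

# Subset of the gloss list above, for illustration only.
GLOSSES = {
    "1": "first person",
    "2": "second person",
    "3": "third person",
    "sg": "singular",
    "pl": "plural",
    "hest": "hesternal",
    "pst": "past",
    "fut": "future",
    "neg": "negation, negative",
    "foc": "focus",
}


def expand_gloss(gloss: str) -> list[str]:
    """Spell out the functional labels in a gloss; other parts stay unchanged."""
    meanings = []
    for piece in gloss.strip().split("."):
        # Person and number are assumed to be written together, e.g. "1sg".
        person_number = re.fullmatch(r"([123])(sg|pl)", piece)
        parts = list(person_number.groups()) if person_number else [piece]
        meanings.extend(GLOSSES.get(part, part) for part in parts)
    return meanings
```

For instance, `expand_gloss("hest.pst")` gives the meanings "hesternal" and "past", while a lexical gloss such as "work" is returned as it is.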

Part-of-speech tagging

Part-of-speech tagging works similarly to glossing. Users are instructed to adhere to the list of POS tags below. The tags correspond to abbreviations commonly used in linguistics.


  1. All POS tags are uppercase throughout.

List of POS tags

POS   Meaning                                     Example
ADJ   adjective                                   red
ADV   adverb                                      yesterday
ART   article                                     the, this, a
ASP   independent aspect particle
COMP  (subordinating) complementizer              that, whether, because
CONJ  (coordinating) conjunction                  and, or, but
COP   copula                                      is
DEM   demonstrative                               this, that, those, these
FOC   independent focus particle
N     noun (incl. proper names)                   house, Adam
NEG   negation                                    not
P     preposition                                 in
PART  (tense, focus, aspect, or other) particle
POSS  possessive pronoun                          his
PRO   personal pronoun                            he
Q     question particle, question tag             right?
REL   relative pronoun, relative marker           that, which
SW    sentence word                               yes, no
TNS   independent tense particle
V     verb                                        slaughter
WH    wh-pronoun or determiner                    which, who
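Because the POS list is closed, tags can be checked mechanically before data are entered or searched. The following Python sketch is only an illustration of that idea; the function name is hypothetical, and stripping surrounding whitespace before comparison is an assumption.

```python
# The POS tags from the list above.
POS_TAGS = {
    "ADJ", "ADV", "ART", "ASP", "COMP", "CONJ", "COP", "DEM", "FOC", "N",
    "NEG", "P", "PART", "POSS", "PRO", "Q", "REL", "SW", "TNS", "V", "WH",
}


def invalid_pos_tags(tags):
    """Return the tags that are not in the list (the check is case-sensitive)."""
    return [tag for tag in tags if tag.strip() not in POS_TAGS]
```

A lowercase tag such as "n" is reported as invalid, in line with the convention that all POS tags are uppercase.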

Word and morpheme separation


  1. Focus particles are separate words throughout. They are not marked as suffixes to the words.
  2. Elements that are clearly suffixes are separated by a hyphen from the stem (e.g. conjoint, disjoint marking).

How the data are stored in the database

The database is XML-based, which means that the data are stored in XML format. The following shows how an example (i.e. a "datapoint") is stored:

<datapoint ID="Likpakpaanl-1">
    <embedding>simple cl.</embedding>
    <word><or>Adam </or><gl>Adam </gl><pos>N </pos></word>
    <word><or>fe </or><gl>hest.pst </gl><pos>TNS </pos></word>
    <word><or>tun </or><gl>work </gl><pos>V </pos></word>
    <word><or>(fenna) </or><gl>yesterday </gl><pos>ADV </pos></word>
    <translation>Adam worked yesterday.</translation>
</datapoint>

An example (a.k.a. datapoint) consists of three parts:

  1. Some metadata: language, dialect, speaker, date of elicitation and possibly an audio file
  2. Information about the construction: see filters above.
  3. The example itself: It can consist of multiple sentences (e.g. question-answer pair and/or grammatical and ungrammatical versions of the same sentence). A sentence consists of triples for each word: the original word in the respective language, the gloss and the POS tag. After this a translation is added.
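If you work with downloaded XML, the word triples can be read with a few lines of Python. The sketch below uses the standard library module `xml.etree.ElementTree` and the element names from the fragment above. It assumes that the fragment is closed with a matching `</datapoint>` tag and that whitespace around the text can be stripped; the actual files may contain further elements, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The datapoint from this guide, closed so that it is well-formed XML.
SAMPLE = """<datapoint ID="Likpakpaanl-1">
    <embedding>simple cl.</embedding>
    <word><or>Adam </or><gl>Adam </gl><pos>N </pos></word>
    <word><or>fe </or><gl>hest.pst </gl><pos>TNS </pos></word>
    <word><or>tun </or><gl>work </gl><pos>V </pos></word>
    <word><or>(fenna) </or><gl>yesterday </gl><pos>ADV </pos></word>
    <translation>Adam worked yesterday.</translation>
</datapoint>"""


def read_datapoint(xml_text: str) -> dict:
    """Extract the key, the word triples and the translation of one datapoint."""
    root = ET.fromstring(xml_text)
    words = [
        (
            word.findtext("or", default="").strip(),   # original word
            word.findtext("gl", default="").strip(),   # gloss
            word.findtext("pos", default="").strip(),  # POS tag
        )
        for word in root.iter("word")
    ]
    return {
        "key": root.get("ID"),
        "embedding": root.findtext(".//embedding"),
        "words": words,
        "translation": root.findtext(".//translation"),
    }


datapoint = read_datapoint(SAMPLE)
```

Each entry in `datapoint["words"]` is a triple of original word, gloss and POS tag, for example `("fe", "hest.pst", "TNS")`.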

Each datapoint has a unique key to identify it.

The database uses XML, XSL, PHP, JavaScript, HTML, and CSS.

Contributors to this page: admin.
Page last modified on Wednesday March 15, 2023 14:17:38 CET by admin.