This is a guide to using the database. It documents the conventions that we followed when entering the data, so that users can search the data as effectively as possible.
Go to the database.
How to search for data
To search the corpus for data, the user can first optionally set the filters and enter certain search items (words, glosses, POS tags). After this, the user has to press the "Search" button and wait until all data are displayed below. Finally, the results can be downloaded in different file formats (xml, tex, docx and pdf).
Note that the data contain Unicode characters which might cause problems for LaTeX or Word. For LaTeX, use XeLaTeX or LuaLaTeX as a compiler, or replace the problematic characters by their LaTeX commands. For Word, make sure that your encoding is set to UTF-8.
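If you prefer to replace the problematic characters rather than switch compilers, the replacement can be scripted. The following is a minimal Python sketch, not part of the database itself: the function name `to_latex` and the mapping `LATEX_ACCENTS` are our own illustrative names, and only the two tone marks used in this guide (acute and grave accent) are covered. All other characters are passed through unchanged.

```python
import unicodedata

# Assumed mapping from combining marks to LaTeX accent commands.
# Only the two tone marks described in this guide are covered.
LATEX_ACCENTS = {
    "\u0301": "\\'",  # combining acute accent (high tone)
    "\u0300": "\\`",  # combining grave accent (low tone)
}


def to_latex(text: str) -> str:
    """Replace letters carrying an acute/grave accent by LaTeX accent commands."""
    out = []
    # NFD splits each accented letter into base letter + combining mark.
    for ch in unicodedata.normalize("NFD", text):
        prev = out[-1] if out else ""
        # Only convert when the mark directly follows a single base letter;
        # stacked diacritics are left untouched.
        if ch in LATEX_ACCENTS and len(prev) == 1 and prev.isalpha():
            out[-1] = f"{LATEX_ACCENTS[ch]}{{{prev}}}"
        else:
            out.append(ch)
    # Recompose whatever was not converted (e.g. other accented letters).
    return unicodedata.normalize("NFC", "".join(out))


print(to_latex("lá"))  # prints l\'{a}
```

Characters outside this mapping still need XeLaTeX or LuaLaTeX, or a manual replacement.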
The print to PDF button will lead you to the print option in your browser. Make sure that your settings are set to "Save to PDF" if you would like to download a PDF.
Note that when you press the "Search" button again (possibly after the filters and search items have been altered), the previous search results are overwritten on the server. Thus, we recommend that you save your search results in between. For example, you can first retrieve and save the intransitive examples from a certain language and afterwards the transitive ones.
Make sure that you have the newest version of your browser installed. Some functions might not be fully available if the browser version is too old.
Searching for words, glosses or POS tags
The database can be searched for certain words, glosses and part of speech tags (henceforth "search item"). These can be searched for in the field "Free search". The free search is case-insensitive.
Tones in the data are marked by an acute accent (high tone) and a grave accent (low tone). If you do not use these, the search result will include toned and untoned versions of the search item. If you use accents, you will only find the search item with exactly the tone you entered. For example, you can search for only high-toned versions of a particle "la" by entering the search item "lá".
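The matching behaviour described above can be illustrated with a short Python sketch. This is not the code that runs on the server; the function names are hypothetical, and comparing whole words (rather than substrings) is an assumption made for the sake of the example.

```python
import unicodedata

# Combining acute (high tone) and grave (low tone) accents.
TONE_MARKS = {"\u0301", "\u0300"}


def strip_tones(text: str) -> str:
    """Remove acute and grave accents, keeping all other characters."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if c not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def has_tones(text: str) -> bool:
    """Return True if the text contains at least one tone mark."""
    return any(c in TONE_MARKS for c in unicodedata.normalize("NFD", text))


def matches(query: str, word: str) -> bool:
    """Case-insensitive match; tones only matter if the query contains them."""
    q = unicodedata.normalize("NFC", query).lower()
    w = unicodedata.normalize("NFC", word).lower()
    if has_tones(q):
        return q == w              # toned query: exact tones required
    return q == strip_tones(w)     # untoned query: tones in the data are ignored


print(matches("la", "lá"))  # True: untoned query finds the toned word
print(matches("lá", "là"))  # False: high tone does not match low tone
```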
If you want to use multiple search items, these must be connected by the symbol "*". Using a simple space will lead to finding a word together with its gloss. If you are looking for a specific English sentence to translate, simply enter the sentence as is.
A search option for sequences of words/glosses/POS tags and for regular expressions is in progress.
Filtering the data for certain constructions
Additionally, the data can currently be filtered according to the following properties:
Filter | Description | |
---|---|---|
Language | finds data only from a certain language | |
Audio | finds either data that come with an audio file or data that do not have one | |
Sentence embedding | finds data which are either simple clauses or a certain type of complex clause, e.g. object clause, adverbial clause etc. | |
Sentence type | finds data which are either declaratives or questions or relative clauses etc. | |
Focus type | finds data that have no focus, or new information focus or contrastive focus etc. | |
Wh/rel/foc element | finds the data where wh-movement, relativization or focalization targets only the chosen constituent | |
Transitivity | finds data with intransitive, transitive, or ditransitive verbs or serial verb constructions | |
Aspect | finds data with a certain type of aspect (e.g. imperfective, perfective) | |
Tense | finds data in a certain tense (e.g. past, present, future) | |
Polarity | finds either affirmative or negative sentences | |
Tone | finds either data that do not contain tonal marking or data that are at least partially toned |
If a filter is set to "Select all", the data are not filtered for the specific property.
In the future, further filters might be added.
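To illustrate how the filters combine, here is a minimal Python sketch. It assumes that each datapoint is represented as a dictionary of construction properties and that all active filters must match at the same time; both are assumptions for illustration, and the function name `apply_filters` is hypothetical. The property names and values are taken from the stored format of the datapoints.

```python
from typing import Dict, List

SELECT_ALL = "Select all"  # filter value meaning "do not filter on this property"


def apply_filters(
    datapoints: List[Dict[str, str]], filters: Dict[str, str]
) -> List[Dict[str, str]]:
    """Keep the datapoints that match every filter not set to "Select all"."""
    active = {name: value for name, value in filters.items() if value != SELECT_ALL}
    return [
        dp
        for dp in datapoints
        if all(dp.get(name) == value for name, value in active.items())
    ]


data = [
    {"key": "ex-1", "transitivity": "intransitive", "polarity": "affirmative"},
    {"key": "ex-2", "transitivity": "transitive", "polarity": "affirmative"},
]
hits = apply_filters(data, {"transitivity": "intransitive", "polarity": SELECT_ALL})
print([dp["key"] for dp in hits])  # ['ex-1']
```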
Audiofiles
Some examples in the database come with audiofiles in which speakers were recorded pronouncing the respective example. If your browser allows it, a media player is displayed at the respective example (next to "date"). Here you can play the file and download it. The audiofiles are all mp3 files named after the respective example key (e.g. "Likpakpaanl-24.mp3").
Glosses
Conventions
Some closed-class words can be glossed by abbreviations or by English translations. The following lists the conventions that are used in the database.
- All glosses including functional glosses are in lower case. Formatting to small caps is done through formatting commands if needed.
- Pronouns are NOT glossed by English translations (i.e. "I", "you", "he" etc.), but by the phi-features they express (i.e. 1sg, 2sg, 3sg, etc.).
- If a language differentiates different types of locatives (e.g. in vs. on), the English translations are used. If there is no distinction, the gloss is "loc".
- Particles that are used for marking focus are glossed as "foc" throughout, even if they are used in non-focus contexts.
List of glosses
When entering the data and searching for data, the glosses are required to follow the Leipzig Glossing Rules (LGR). Glosses that are not in the LGR should be listed here under "own convention". Importantly, there should be no divergences from the list.
Gloss | Meaning | Source | |
---|---|---|---|
1 | first person | Leipzig Glossing Rules | |
2 | second person | Leipzig Glossing Rules | |
3 | third person | Leipzig Glossing Rules | |
acc | accusative | Leipzig Glossing Rules | |
anim | animate | own convention | |
comp | complementizer | Leipzig Glossing Rules | |
compl | completive | Leipzig Glossing Rules | |
cj | conjoined | own convention | |
conj | conjunction, conjoined | own convention | |
cop | copula | Leipzig Glossing Rules | |
def | definite | Leipzig Glossing Rules | |
dem | demonstrative | Leipzig Glossing Rules | |
dir | directional | own convention | |
dj | disjoined | own convention | |
emph | emphatic | own convention | |
foc | focus | Leipzig Glossing Rules | |
fut | future | Leipzig Glossing Rules | |
hest | hesternal | own convention | |
hum | human | own convention | |
ipfv | imperfective | Leipzig Glossing Rules | |
loc | locative | Leipzig Glossing Rules | |
nom | nominative | Leipzig Glossing Rules | |
nc | noun class | own convention | |
neg | negation, negative | Leipzig Glossing Rules | |
pfv | perfective | Leipzig Glossing Rules | |
poss | possessive | Leipzig Glossing Rules | |
pro | pronoun | own convention | |
prog | progressive | Leipzig Glossing Rules | |
pst | past | Leipzig Glossing Rules | |
pl | plural | Leipzig Glossing Rules | |
ptcl | particle | own convention | |
q | question particle/marker | Leipzig Glossing Rules | |
rel | relative | Leipzig Glossing Rules | |
sg | singular | Leipzig Glossing Rules | |
tns | tense | own convention | |
top | topic | Leipzig Glossing Rules |
Part-of-speech tagging
Part-of-speech tagging works similarly to glossing. The users are instructed to adhere to the list of POS tags below. The tags correspond to abbreviations commonly used in linguistics.
Conventions
- All POS tags are uppercase throughout.
List of POS tags
POS | Meaning | Example | |
---|---|---|---|
ADJ | adjective | red | |
ADV | adverb | yesterday | |
ART | article | the, this, a | |
ASP | independent aspect particle | ||
COMP | (subordinating) complementizer | that, whether, because | |
CONJ | (coordinating) conjunction | and, or, but | |
COP | copula | is | |
DEM | demonstrative | this, that, those, these | |
FOC | independent focus particle | ||
N | noun (incl. proper names) | house, Adam | |
NEG | negation | not | |
P | preposition | in | |
PART | (tense, focus, or aspect, and other) particle | ||
POSS | possessive pronoun | his | |
PRO | personal pronoun | he | |
Q | question particle, question tag | right? | |
REL | relative pronoun, relative marker | that, which | |
SW | sentence word | yes, no | |
TNS | independent tense particle | ||
V | verb | slaughter | |
WH | wh-pronoun or determiner | which, who |
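If you prepare data outside the database, the tags can be checked against the list above before entering them. The following Python sketch is only an illustration; the function name `check_pos` and its return values are hypothetical and not part of the database.

```python
# The POS tags from the table above.
POS_TAGS = {
    "ADJ", "ADV", "ART", "ASP", "COMP", "CONJ", "COP", "DEM", "FOC", "N", "NEG",
    "P", "PART", "POSS", "PRO", "Q", "REL", "SW", "TNS", "V", "WH",
}


def check_pos(tag: str) -> str:
    """Classify a tag as 'ok', 'not uppercase' or 'unknown tag'."""
    tag = tag.strip()  # ignore surrounding whitespace
    if tag in POS_TAGS:
        return "ok"
    if tag.upper() in POS_TAGS:
        return "not uppercase"  # violates the uppercase convention
    return "unknown tag"


for tag in ["TNS", "adv", "DET"]:
    print(tag, "->", check_pos(tag))
```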
Word and morpheme separation
Conventions
- Focus particles are separate words throughout. They are not marked as suffixes to the words.
- Elements that are clearly suffixes are separated by a hyphen from the stem (e.g. conjoint, disjoint marking).
How the data are stored in the database
The database is XML-based, i.e. the data are stored in XML format. The following shows how a single example (i.e. "datapoint") is stored:
    <datapoint ID="Likpakpaanl-1">
      <language>Likpakpaanl</language>
      <dialect>--</dialect>
      <speaker>SA</speaker>
      <date>2021-10-28</date>
      <audio>audio</audio>
      <audiofile>Likpakpaanl-1.mp3</audiofile>
      <construction>
        <embedding>simple cl.</embedding>
        <type>declarative</type>
        <focus>no.focus</focus>
        <target>no.target</target>
        <transitivity>intransitive</transitivity>
        <aspect>perfective</aspect>
        <tense>past</tense>
        <polarity>affirmative</polarity>
        <tone>no.tone</tone>
      </construction>
      <example>
        <sentence>
          <name/>
          <judgment/>
          <or_gloss>
            <word><or>Adam </or><gl>Adam </gl><pos>N </pos></word>
            <word><or>fe </or><gl>hest.pst </gl><pos>TNS </pos></word>
            <word><or>tun </or><gl>work </gl><pos>V </pos></word>
            <word><or>(fenna) </or><gl>yesterday </gl><pos>ADV </pos></word>
          </or_gloss>
          <translation>Adam worked yesterday.</translation>
        </sentence>
      </example>
    </datapoint>
An example (a.k.a. datapoint) consists of three parts:
- Some metadata: language, dialect, speaker, date of elicitation and possibly an audiofile
- Information about the construction: see filters above.
- The example itself: It can consist of multiple sentences (e.g. question-answer pair and/or grammatical and ungrammatical versions of the same sentence). A sentence consists of triples for each word: the original word in the respective language, the gloss and the POS tag. After this a translation is added.
Each datapoint has a unique key to identify it.
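If you download search results as xml, a datapoint in this format can be read with standard tools. The following Python sketch uses only the standard library and the element names from the datapoint format described above; the function name `read_datapoint` and the shape of the returned dictionary are our own illustrative choices. Surrounding whitespace in words, glosses and POS tags is stripped.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<datapoint ID="Likpakpaanl-1">
  <language>Likpakpaanl</language>
  <example>
    <sentence>
      <or_gloss>
        <word><or>Adam </or><gl>Adam </gl><pos>N </pos></word>
        <word><or>fe </or><gl>hest.pst </gl><pos>TNS </pos></word>
        <word><or>tun </or><gl>work </gl><pos>V </pos></word>
        <word><or>(fenna) </or><gl>yesterday </gl><pos>ADV </pos></word>
      </or_gloss>
      <translation>Adam worked yesterday.</translation>
    </sentence>
  </example>
</datapoint>"""


def read_datapoint(xml_text: str) -> dict:
    """Extract key, language and (word, gloss, POS) triples from one datapoint."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sentence in root.iter("sentence"):
        triples = [
            tuple((word.findtext(tag) or "").strip() for tag in ("or", "gl", "pos"))
            for word in sentence.iter("word")
        ]
        sentences.append(
            {"words": triples, "translation": sentence.findtext("translation")}
        )
    return {
        "key": root.get("ID"),  # the unique key of the datapoint
        "language": root.findtext("language"),
        "sentences": sentences,
    }


dp = read_datapoint(SAMPLE)
print(dp["key"], dp["sentences"][0]["words"][1])
# Likpakpaanl-1 ('fe', 'hest.pst', 'TNS')
```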
The database uses XML, XSL, PHP, JavaScript, HTML, and CSS.