Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008


Content [Slide 1]

1. Empirical linguistics
2. Text corpora and corpus linguistics
3. Concordances
4. Application I: The German progressive
5. Part-of-speech tagging
6. Frequency analysis
7. Application II: Compounds
8. Co-occurrence analysis
9. Application III: Word senses in lexicography
10. Keyword analysis

What is a corpus? [Slide 2]

A corpus is a collection of written or spoken utterances. Corpus data are usually digitized, i.e. stored on computers and machine-readable. A corpus consists of the data themselves, i.e. the texts, and possibly of metadata, which describe these data, and linguistic annotations, which are assigned to these data.

Lemnitzer, Lothar, and Heike Zinsmeister. Korpuslinguistik. Eine Einführung. Tübingen: Narr, p. 7.
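The three components named in this definition (texts, metadata, annotations) can be pictured as a small data structure. The following is only a minimal sketch: the class and field names (`CorpusText`, `Annotation`, `layer`, and so on) are hypothetical and do not correspond to any actual corpus format.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A linguistic label assigned to a span of tokens (hypothetical layout)."""
    start: int   # index of the first token covered
    end: int     # index one past the last token covered
    layer: str   # kind of annotation, e.g. "pos" for part of speech
    label: str   # the label itself, e.g. "NN"


@dataclass
class CorpusText:
    """One text: the primary data plus its metadata and annotations."""
    tokens: list[str]
    metadata: dict[str, str] = field(default_factory=dict)
    annotations: list[Annotation] = field(default_factory=list)


@dataclass
class Corpus:
    texts: list[CorpusText] = field(default_factory=list)

    def size(self) -> int:
        """Number of running words (tokens) in the whole corpus."""
        return sum(len(t.tokens) for t in self.texts)


# Invented toy example, not real corpus data.
text = CorpusText(
    tokens=["Das", "Buch", "liegt", "hier"],
    metadata={"author": "unknown", "year": "2008"},
    annotations=[Annotation(1, 2, "pos", "NN")],
)
corpus = Corpus([text])
print(corpus.size())
```

Keeping annotations separate from the token list, as assumed here, means that the primary data stay unchanged when further annotation layers are added.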

What is corpus linguistics? [Slide 3]

Corpus linguistics is the description of utterances of natural languages, their elements and structures, on the basis of analyses of corpora, together with the theory formation that builds on these descriptions.

Lemnitzer, Lothar, and Heike Zinsmeister. Korpuslinguistik. Eine Einführung. Tübingen: Narr, p. 9.

Advantages of the use of corpora [Slide 4]

Authentic data: Corpora contain authentic language data.
Quantification: Corpora make it possible to quantify linguistic phenomena.
Context: Corpora provide context for linguistic phenomena (which often renders acceptable utterances that would be judged ungrammatical in isolation).
Unbiased: Corpora are unbiased with respect to questions of linguistic norms (and thus help to uncover linguistic phenomena which native speakers reject because they are considered deviations from the norm).
Diachrony: Corpora make it easy to add a historical dimension to theoretical inquiries.

Disadvantages of the use of corpora [Slide 5]

The problem of representativity: For a sample (i.e. a set of texts) to be representative (i.e. of a language as a whole), the basic population (i.e. all texts in a language) has to be known. Since it never is, text corpora are never truly representative.
The problem of relevance: A text corpus includes a lot of data that are irrelevant to the theoretical question with which it is approached.
The problem of incompleteness: Corpora do not provide negative evidence. From the absence of a particular phenomenon in the corpus it cannot be inferred that it does not occur in the language.
The problem of reliability: Text corpora also contain utterances that native speakers consider ungrammatical.

Lemnitzer, Lothar, and Heike Zinsmeister. Korpuslinguistik. Eine Einführung. Tübingen: Narr, p. 27.

Applications [Slide 6]

Corpora are used in different areas of linguistics:

Theoretical linguistics: from testing of hypotheses to automatic extraction of grammatical regularities.
Lexicography: word frequencies, collocations, typical contexts of use, authentic examples.
Grammaticography: evidence for grammatical structures, their frequency and distribution.
Second language acquisition: analysis of learners' errors, determination of frequency of usage, authentic examples.
Translation: investigating translation techniques in parallel corpora.
Computational linguistics: machine translation, information retrieval, speech recognition, etc.

Scherer, Carmen. Korpuslinguistik. Heidelberg: Winter.

Types of corpora [Slide 7]

Corpora can be classified according to a number of criteria:

Medium: corpora of written / spoken language
Coverage: reference corpora (representing a language) / special corpora
Competence of speakers: learner corpora / corpora of first language acquisition, ...
Corpus editing: (grammatically / semantically) annotated vs. non-annotated corpora
Language stage: historical / modern corpora
Number of languages: monolingual corpora / parallel corpora / comparative corpora

Scherer, Carmen. Korpuslinguistik. Heidelberg: Winter.

Editing of corpus texts [Slide 8]

Metadata: data on the text (e.g. author, time of origin, title, place of publication, ...)
Annotations: linguistic descriptions of utterances (e.g. part-of-speech tagging)

Encoded text (in XML format) from GerManC (German newspaper corpus); annotation example:

<s> sentence </s>
<foreign> foreign word </foreign>
<rs> name </rs>
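Annotations of this kind can be read out automatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the sample sentences and the enclosing `<text>` element are invented for illustration and are not taken from GerManC. Only the three tag names (`<s>`, `<foreign>`, `<rs>`) come from the slide.

```python
import xml.etree.ElementTree as ET

# Invented sample; only the tag names <s>, <foreign>, <rs> are from the slide.
sample = (
    "<text>"
    "<s><rs>Wien</rs> meldet ein <foreign>Rendezvous</foreign> der Minister.</s>"
    "<s>Der Brief kam aus <rs>Berlin</rs>.</s>"
    "</text>"
)

root = ET.fromstring(sample)

# Plain sentence text: join all text fragments inside each <s> element.
sentences = ["".join(s.itertext()) for s in root.iter("s")]

# Annotated names and foreign words, collected by tag.
names = [rs.text for rs in root.iter("rs")]
foreign_words = [f.text for f in root.iter("foreign")]

print(sentences)
print(names)
print(foreign_words)
```

Because the annotation is stored as markup around the text, the same file yields both the unannotated sentences and the lists of annotated items.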

Some German corpora [Slide 9]

Deutsches Referenzkorpus (Institut für Deutsche Sprache): 3,400 million running words; newspapers, fiction, specialized texts, ..., from 1950 on; available online.
German corpus from the Leipzig Corpus Collection: more than 15 million running words; sentences from newspapers; can be downloaded.
DWDS-Kernkorpus (Berlin-Brandenburgische Akademie): 100 million running words, equally distributed over the decades from 1900 on; newspapers, fiction, specialized texts, spoken language.
Historisches Korpus (Institut für Deutsche Sprache): ca. 45 million running words (growing); newspapers, fiction, specialized texts, ...; [...] century.
TIGER-Korpus (Potsdam, Stuttgart, Saarbrücken): 0.9 million running words; treebank (sentences with descriptions of grammatical structure).
FALKO (Humboldt-Univ. Berlin): error-annotated learner corpus of German as a foreign language (DaF).
Archiv für gesprochenes Deutsch (Institut für Deutsche Sprache): corpus of spoken language; text-sound alignment.

Cf. the overviews in: Lemnitzer, Lothar, and Heike Zinsmeister. Korpuslinguistik. Eine Einführung. Tübingen: Narr / Scherer, Carmen. Korpuslinguistik. Heidelberg: Winter.

DeReKo: Deutsches Referenzkorpus [Slide 10]

The IDS corpora

Size: more than 3.4 billion running words (2008)
Acquisition: size, variability, quality and currency taken into consideration; contractually secured with respect to copyright
Content: fiction, newspapers, scientific texts and others

Archive of written language (2007) (publicly accessible) [Slide 11]

Fiction of the 20th and 21st centuries; various authors (loz-div-pub)
Fiction of the 20th century; Martin Walser (loz-wam)
Berliner Morgenpost (bmp)
Bonner Zeitungskorpus (bzk)
COMPUTER ZEITUNG (cz; German)
Die Presse (dpr; Austrian)
Fachsprachen-Korpus 1 (fsp-pub)
Frankfurter Rundschau (ffr)
Goethe-Korpus (goe)
Grammatik-Korpus (gr1)
GRIMM-Korpus (gri)
Hamburger Morgenpost (hmp05, hmp06 / 04/[...]/2006)
Handbuchkorpora (hbk)
Kleine Zeitung (klz; Austrian)
LIMAS-Korpus (lim / also morphosyntactically annotated)
Korpus-Kartei der Gesellschaft für deutsche Sprache, Wiesbaden (gfds)
Korpus Magazin Lufthansa Bordbuch (mld)
Mannheimer Korpora (mk)
Mannheimer Morgen (mmm / 1989, 1991, [...] / partly morphosyntactically annotated)
Marx-Engels-Korpora
Neue Kronen-Zeitung (nkz; Austrian)
Oberösterreichische Nachrichten (oon)
Speeches and interviews
Salzburger Nachrichten (sbn)
St. Galler Tagblatt (sgt; Swiss)
Tiroler Tageszeitung (ttz)
VDI Nachrichten (vdi06 / 02/[...]/2006)
Vorarlberger Nachrichten (van)
Wendekorpus (wk)
Wikipedia - Die freie Enzyklopädie (wpd / as of 03/2005)
Züricher Tagesanzeiger (zta)

Archive of written language (2007) (accessible only within the IDS) [Slide 12]

Fiction of the 20th and 21st centuries; various authors (loz-div)
Fiction of the 20th century; Stefan Heym (loz-hes)
Fiction of the 20th century; Siegfried Lenz (loz-les)
Berliner Zeitung (b97-b04)
Biographical literature (bio)
Der Spiegel (s93, s94 / also morphosyntactically annotated)
Die Zeit (z94-z04 / partly online edition only)
die tageszeitung (t86-t06 / [...]/2006)
Reports of the Deutsche Presse-Agentur (dpa06 / 2006)
Fachsprachen-Korpus 1 (fsp)
Fachsprachen-Korpus 2: Gentechnologie (dkg)
Frankfurter Allgemeine (f93, f95 / 1993 and 1995)
Editors' texts for the bio corpus (bih)
Historisches Korpus 1 (hi1)
Historisches Korpus 2 (hi2)
Interview-Korpus (iko)
Süddeutsche Zeitung (u95-u99)
Thomas-Mann-Korpus (thm)
Wendekorpus Vereinigung (wkv)

Corpus analysis step by step [Slide 13]

1. formulation of the research question
2. operationalization of the research question (cf. the following slides)
3. choice of corpus (cf. the preceding slides)
4. choice of corpus analysis methods
5. further processing and storing of data
6. analysis and description of data

Definitions [Slide 14]

The formulation of a research question involves well-defined terms. Definitions usually come in two types:

A descriptive definition ("Realdefinition") determines how an expression is to be used by precisely capturing the observations, the knowledge, and the intuitions that are connected with the referent of the expression. A descriptive definition has an empirical basis.

A stipulative definition ("Nominaldefinition") determines how a (newly introduced) expression is to be used by recourse to other known, well-defined expressions. A stipulative definition is normative.

Descriptive definitions often have the form: "An xxx is a ...", "xxx refers to a ..."
Stipulative definitions often have the form: "xxx shall mean in the following ...", "Take xxx to mean ..."

Examples [Slide 15]

Examples of descriptive definitions ("Realdefinitionen"):

An internet dictionary is a lexicographic resource that is kept available on one or more servers connected to the internet. [Engelberg, Stefan & Lothar Lemnitzer: Einführung in die Lexikographie und Wörterbuchbenutzung. Tübingen: Stauffenburg 2001, 232.]

A "come around" is a stone that is played narrowly past a guard and comes to rest hidden behind that guard. [Deutscher Curling-Verband e.V.: Lexikon.]

Examples of stipulative definitions ("Nominaldefinitionen"):

"Wear combination" shall be understood in the following as the combination of a solid base body, a solid counter body, and a gaseous or liquid intermediate substance, possibly mixed with solid particles, where base body and counter body move relative to each other and compressive forces are transmitted between base body and counter body. [Community Patent Review.]

"Neighborhood" shall be understood in the following as a congruence of interests conditioned by spatial proximity [...]. [Weber, Max: Wirtschaft und Gesellschaft. Studienausgabe. 5th ed. Tübingen 1972, 215.]

Operational procedures [Slide 16]

The bridging problem: from hypothesis to data acquisition.

Theoretical constructs have to be transformed into instructions for observation and measurement.

An operational definition is a definition of a theoretical construct in the form of a number of operations and/or a measure that are appropriate for identifying the state of affairs captured by the theoretical construct.

Bortz, Jürgen & Nicola Döring: Forschungsmethoden und Evaluation für Human- und Sozialwissenschaftler. 4th rev. ed. Heidelberg: Springer 2006, pp. 60ff.

[Slide 17]

An operational definition always follows a descriptive definition: first capture the basic characteristics of the concept, then operationalize it.

Question: How can you operationalize the following concepts?

modern German word
neologism
archaism
productivity
sentence
passive (in German)

The operationalization depends on the empirical method you choose.
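As an illustration of what such an operationalization might look like for "neologism", here is a minimal sketch. It assumes one possible operational definition, which the slides themselves do not propose: a word form counts as a neologism candidate if its first attestation in a dated corpus falls in or after a chosen cut-off year. The function names, the toy corpus and the cut-off year are all hypothetical.

```python
def first_attestations(dated_texts):
    """Map each (lowercased) word form to the earliest year in which it occurs."""
    first_seen = {}
    for year, tokens in dated_texts:
        for token in tokens:
            word = token.lower()
            if word not in first_seen or year < first_seen[word]:
                first_seen[word] = year
    return first_seen


def neologism_candidates(dated_texts, since_year):
    """Word forms whose first attestation is in or after since_year."""
    first_seen = first_attestations(dated_texts)
    return sorted(w for w, y in first_seen.items() if y >= since_year)


# Invented toy corpus: (year of publication, tokens).
toy_corpus = [
    (1995, ["das", "Telefon", "klingelt"]),
    (2003, ["das", "Handy", "klingelt"]),
    (2007, ["das", "Blog", "und", "das", "Handy"]),
]

print(neologism_candidates(toy_corpus, since_year=2000))
```

On this toy corpus the result contains "blog" and "handy" but also "und", which is simply not attested in the single 1995 text. The measure therefore depends directly on the size and coverage of the corpus: a first attestation in the corpus is not a first use in the language, which is the problem of incompleteness noted under the disadvantages of corpora.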