Improving Part-Of-Speech Tagging for Social Media via Automatic Spelling Error Correction



1 Improving Part-Of-Speech Tagging for Social Media via Automatic Spelling Error Correction
Presentation of the AI student project for the summer semester 2019
Benedikt Tobias Bönninghoff
Cognitive Signal Processing Group, Institute of Communication Acoustics


2 SecHuman: Security for People in Cyberspace
Research group: Linguistic Imitation and Obfuscation Strategies
Motivation: linguistic obfuscation in incriminated texts
- Extortion letters, letters claiming responsibility
- Hate speech in social media
- CEO fraud ("Chef-Masche"): e-mails with payment requests
- Forums for children and adolescents
Goal: development of methods for the automated analysis of authorship
- Can obfuscation strategies be detected?
SecHuman | Motivation | Project goal | Task description | Organisation | B. Bönninghoff 2 / 8

3 Part-Of-Speech (POS) Tagging: Written Standard Language vs. Online Language Use
ParZu: tool for POS tagging (word-class recognition) and dependency parsing [1]

Input in standard spelling
Dependency labels shown in the figure: subj, aux, objd, obja, subj, adv, adv, objp, obji, det, pn, det

No.  Token    Lemma    POS     Morphology
1    Ich      ich      PPER    1 Sg Nom
2    will     wollen   VMFIN   1 Sg Pres Ind
3    mir      ich      PRF     1 Sg Dat
4    das      die      ART     Def Neut Sg
5    Bier     Bier     NN      Neut Sg
6    nicht    nicht    PTKNEG
7    nochmal  nochmal  ADV
8    durch    durch    APPR    Acc
9    den      die      ART     Def Masc Acc Sg
10   Kopf     Kopf     NN      Masc Acc Sg
11   gehen    gehen    VVINF
12   lassen   lassen   VVINF

Input in online spelling
Dependency labels shown in the figure: subj, aux, objd, adv, obji, objd, adv, adv, adv, adv

No.  Token    Lemma    POS     Morphology
1    Ich      ich      PPER    1 Sg Nom
2    will     wollen   VMFIN   1 Sg Pres Ind
3    mir      ich      PRF     1 Sg Dat
4    dat      dat      ADJD
5    bier     bi       ADJA    Pos Fem Dat Sg St
6    nich     nich     PTKNEG
7    nochmal  nochmal  ADV
8    durchn   durchn   ADJD
9    kopf     kopf     ADJD
10   gehen    gehen    VVINF
11   lassen   lassen   VVINF

[1] Rico Sennrich, Martin Volk, and Gerold Schneider. Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. In: RANLP
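The two ParZu outputs on this slide can be compared mechanically. The following is a minimal sketch that aligns the two POS tag sequences with Python's standard difflib module and lists the spans where the tags differ. The token and tag pairs are copied from the slide; the function name tag_mismatches and the choice of difflib for the alignment are assumptions made for illustration and are not part of the project.

```python
from difflib import SequenceMatcher

# (token, POS tag) pairs as shown on the slide for the two spellings.
standard = [("Ich", "PPER"), ("will", "VMFIN"), ("mir", "PRF"), ("das", "ART"),
            ("Bier", "NN"), ("nicht", "PTKNEG"), ("nochmal", "ADV"),
            ("durch", "APPR"), ("den", "ART"), ("Kopf", "NN"),
            ("gehen", "VVINF"), ("lassen", "VVINF")]
social = [("Ich", "PPER"), ("will", "VMFIN"), ("mir", "PRF"), ("dat", "ADJD"),
          ("bier", "ADJA"), ("nich", "PTKNEG"), ("nochmal", "ADV"),
          ("durchn", "ADJD"), ("kopf", "ADJD"),
          ("gehen", "VVINF"), ("lassen", "VVINF")]


def tag_mismatches(reference, hypothesis):
    """Align two tagged sentences on their tag sequences and
    return the (reference span, hypothesis span) pairs that differ."""
    ref_tags = [tag for _, tag in reference]
    hyp_tags = [tag for _, tag in hypothesis]
    matcher = SequenceMatcher(a=ref_tags, b=hyp_tags, autojunk=False)
    mismatches = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            mismatches.append((reference[i1:i2], hypothesis[j1:j2]))
    return mismatches


for ref_span, hyp_span in tag_mismatches(standard, social):
    print(ref_span, "->", hyp_span)
```

The alignment works on tags only, so it tolerates the different token counts caused by the contraction "durchn".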


4 The German Writing System
What is a grapheme?
- The graphematic form of a word can be derived from its phonological structure
- Grapheme = written representation of a phoneme (sound)
- Excerpt from the grapheme inventory [Thomé & Thomé (2017)]
Orthography vs. graphematics (an asterisk marks forms that are not orthographically correct):
<W> <a> <l>
*<V> <a> <l>
*<W> <aa> <l>
*<W> <ah> <l>
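The split of a written word into graphemes can be sketched as a greedy longest-match segmentation. The inventory below is a tiny hypothetical excerpt chosen only to cover the examples on the slides, not the Thomé & Thomé inventory, and the function name segment is an assumption. Greedy matching ignores phonology, so it is only a rough approximation of a real grapheme analysis.

```python
# Hypothetical multi-letter graphemes; a real inventory would be far larger.
MULTI_LETTER_GRAPHEMES = {"sch", "ch", "ie", "aa", "ah", "eh", "uh", "nn"}


def segment(word, inventory=MULTI_LETTER_GRAPHEMES):
    """Split a word into graphemes, preferring the longest match at each position."""
    max_len = max(len(g) for g in inventory)
    graphemes, pos = [], 0
    while pos < len(word):
        for size in range(max_len, 0, -1):
            chunk = word[pos:pos + size]
            # A single letter is always accepted as a fallback grapheme.
            if size == 1 or chunk.lower() in inventory:
                graphemes.append(chunk)
                pos += len(chunk)
                break
    return graphemes


for word in ["Wal", "Val", "Waal", "Wahl"]:
    print(word, segment(word))
```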

5 Project Goal: Improving POS Tagging with Neural Networks [2]
Recognising parts of speech is essential for syntactic and semantic analyses.
Problems:
(P1) Out of vocabulary: words, word forms, and sentence endings not in the training corpus
(P2) Ambiguities: words carry different POS labels
(P3) Novel POS labels (e.g. emoticons)
(P4) Deviating word orders not in the training corpus (including grammatical errors)
(P5) Spelling errors not in the training corpus
Approaches:
(L1) Probabilistic framework
(L2) Inclusion of the word context (long-term dependencies)
(L2) Consideration of POS-relevant word properties (suffixes, prefixes)
(L4/L5) Semi-supervised training
(L4/L5) Data augmentation
(L4/L5) Text normalization (i.e. automatic spelling error correction)
[Figure: tagger architecture with character representation and word embedding as inputs, a forward LSTM and a backward LSTM, and a CRF layer on top; example input "We are playing soccer" with output tags PRP VBP VBG NN]
[2] Xuezhe Ma and Eduard H. Hovy. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers
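The role of the CRF layer in the figure can be illustrated with Viterbi decoding: instead of choosing the best tag for each word separately, it selects the tag sequence with the highest total of per-word scores and tag-to-tag transition scores. The following is a minimal pure-Python sketch. All scores are invented for illustration, the function name viterbi is an assumption, and in the actual architecture the per-word scores would come from the bidirectional LSTM.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(tags)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at position t.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [tags[j] for j in path]


# Toy example with two hypothetical tags: the middle position slightly prefers
# "B", but switching tags is penalised, so the joint optimum stays with "A".
toy_tags = ["A", "B"]
toy_emissions = [[1.0, 0.0], [0.0, 0.6], [1.0, 0.0]]
toy_transitions = [[0.0, -1.0], [-1.0, 0.0]]
print(viterbi(toy_emissions, toy_transitions, toy_tags))
```

A per-position argmax would pick "B" for the middle position here; the transition scores are what allow the decoder to overrule it.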

6 Project Goal: Automatic Spelling Error Correction (ASEC)
Approach: Neural Machine Translation (NMT)
Example: encoder-decoder model with attention mechanism
[Figure: the encoder reads the input graphemes <d> <r> <eh> <n> as x1 to x4 and produces the states h1 to h4; the decoder, with states s1 to s5, emits the outputs y1 to y5, i.e. the corrected graphemes <d> <r> <eh> <e> <n>. Framework taken from the EACL 2017 Tutorial on Practical NMT.]
Grapheme <eh>, where the h is treated as a lengthening h (Dehnungs-h)
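The attention step of such an encoder-decoder model can be sketched as follows: for one decoder state, each encoder state receives a score, the scores are normalised with a softmax, and the context vector is the weighted sum of the encoder states. The slide does not say which scoring function is used, so the dot product below is an assumption, as are the function names and the toy vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(decoder_state, encoder_states):
    """Dot-product attention: return (weights, context vector)."""
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[k] for w, state in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context


# Four toy encoder states (standing in for h1 to h4) and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
weights, context = attention([1.0, 0.0], encoder_states)
print(weights, context)
```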

7 Task Description
(1) Familiarisation and literature review
- Spelling error analysis of the German language
- Neural Machine Translation
- Optional: end-to-end sequence labeling for POS tagging (an in-house implementation is available)
(2) Programming
- Development of a tool for generating pseudo training data
- Implementation of a method for ASEC
- Optional: integrate data augmentation into the POS framework
(3) Evaluation and error analysis
- Qualitative (POS tagger) and quantitative (ASEC)
- Interdisciplinary collaboration with a doctoral student in linguistics
(4) Documentation of the results and presentation
Example grapheme sequences (an asterisk marks a misspelling): <H> <u> <n> <d>, *<H> <u> <n> <t>, *<h> <u> <n> <d>, *<H> <uh> <nn> <t>, ...
[Figure: encoder-decoder model mapping the input graphemes <d> <r> <eh> <n> (x1 to x4, encoder states h1 to h4) to the output graphemes <d> <r> <eh> <e> <n> (y1 to y5, decoder states s1 to s5). Framework taken from the EACL 2017 Tutorial on Practical NMT.]
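One conceivable way to generate pseudo training data is to start from a correctly spelled grapheme sequence and apply a single error at a time, which yields (misspelled, correct) pairs for training. The sketch below does this with a small confusion table. The table entries are taken from the "Hund" example on the slide, while the function name pseudo_errors and the restriction to one error per variant are assumptions, not the method used in the project.

```python
def pseudo_errors(graphemes, confusions):
    """Return misspelled variants that differ from `graphemes` by one error."""
    variants = []
    for i, grapheme in enumerate(graphemes):
        # Replace one grapheme by each of its confusable alternatives.
        for alternative in confusions.get(grapheme, []):
            variants.append(graphemes[:i] + [alternative] + graphemes[i + 1:])
    # Capitalisation error: lowercase the first grapheme if that changes it.
    if graphemes and graphemes[0] != graphemes[0].lower():
        variants.append([graphemes[0].lower()] + graphemes[1:])
    return variants


# Hypothetical confusion table based on the slide's example.
confusions = {"u": ["uh"], "n": ["nn"], "d": ["t"]}
correct = ["H", "u", "n", "d"]
for wrong in pseudo_errors(correct, confusions):
    print("".join(wrong), "->", "".join(correct))
```

Variants with several errors, such as the last example on the slide, could be produced by applying the function again to its own output.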

8 Organisation and Contact
Key data:
- Start:
- End:
- Weekly meetings
- Number of participants: 2-3 (Bachelor or Master)
Desirable knowledge and skills:
- Interdisciplinary work
- Python / TensorFlow
- Machine learning / deep learning
- Linux (Ubuntu), Git, LaTeX
Very well suited for: students specialising in computational linguistics or machine learning
Contact for questions: benedikt.boenninghoff[at]rub.de
Room: ID
