OpenNLP: Introduction, Examples, Conclusion, Links. Nico Beierle, Irina Glushanok


OpenNLP. Nico Beierle, Irina Glushanok. 15.11.2012

Contents
1 Introduction: General, Capabilities, Installation
2 Examples: Command-line tools (Tokenizer, Name Finder, Training, My Model, Evaluation); API (Tokenizer, Tokenizer (File Stream), Sentence Detector (File Stream))
3 Conclusion
4 Links

Introduction: General
"The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text."
Version 1 was released in 2004.
License: Apache 2.0. The software may be used, modified, and distributed freely in any setting, provided the license is retained.
Current version: 1.5.2, released November 2011.
Currently 6 to 11 developers.

Capabilities: What OpenNLP can do
Approach: the algorithms are learning-based, so training data is required. What has been learned is stored in models.
Models: sentence detection, tokenization, named entity recognition (person names, dates, locations, monetary amounts, organization names), part-of-speech tagging, parsing.

Capabilities: Example, the sentence detection model
Components:
Sentence Detection: Sentence Detection Tool, Sentence Detection API
Sentence Detector Training: Training Tool, Training API
Evaluation: Evaluation Tool

Installation
Download apache-opennlp-1.5.2-incubating-bin.tar.gz from the website and unpack it.

Installation
Download pre-trained models from the website: http://opennlp.sourceforge.net/models-1.5/

Command-line tools: Tokenizer example
Input:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
Command:
./opennlp TokenizerME en-token.bin < ../input/input > ../output/outputtokens
Output:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.

Command-line tools: Name Finder (person names)
Input:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate. Rudolf Bayer has been Professor (emeritus) of Informatics at the Technical University of Munich since 1972. Alan Mathison Turing was a British mathematician, logician, cryptanalyst, and computer scientist. John Backus was an American computer scientist. Gottfried Wilhelm Leibniz was a German mathematician and philosopher. Konrad Zuse was a German civil engineer, inventor and computer pioneer.
Command:
./opennlp TokenNameFinder en-ner-person.bin < ../input/input1 > ../output/outputnamefinder

Command-line tools: Name Finder (person names)
Output:
<START:person> Pierre Vinken, <END> 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. <START:person> Rudolph Agnew, <END> 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate. Rudolf Bayer has been Professor (emeritus) of Informatics at the Technical University of Munich since 1972. <START:person> Alan Mathison Turing <END> was a British mathematician, logician, cryptanalyst, and computer scientist. <START:person> John Backus <END> was an American computer scientist. <START:person> Gottfried Wilhelm Leibniz <END> was a German mathematician and philosopher. Konrad Zuse was a German civil engineer, inventor and computer pioneer.
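The annotated output above can be post-processed with plain Java. The following is only a minimal sketch: the class PersonSpans and its method extract are hypothetical helpers, not part of OpenNLP, and the sketch assumes that spans are never nested and never cross a line break.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of OpenNLP): collects the text between
// <START:person> and <END> markers in name finder output.
class PersonSpans {
    // Lazy match so that each span ends at the nearest <END> marker.
    private static final Pattern SPAN =
            Pattern.compile("<START:person>\\s*(.*?)\\s*<END>");

    static List<String> extract(String annotated) {
        List<String> names = new ArrayList<>();
        Matcher m = SPAN.matcher(annotated);
        while (m.find()) {
            names.add(m.group(1)); // text inside the markers, trimmed
        }
        return names;
    }

    public static void main(String[] args) {
        String line = "<START:person> John Backus <END> was an American computer scientist.";
        System.out.println(extract(line)); // prints [John Backus]
    }
}
```

A regular expression is enough here because the marker syntax is flat; a different entity type would only require changing the type name in the pattern.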

Command-line tools: Training
Training file:
<START:person> Pierre Vinken, <END> 61 years old, will join the board as a nonexecutive director Nov. 29. <START:person> Mr. Vinken <END> is chairman of Elsevier N.V., the Dutch publishing group. <START:person> Rudolph Agnew, <END> 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate. <START:person> Rudolf Bayer <END> has been Professor (emeritus) of Informatics at the Technical University of Munich since 1972. <START:person> Alan Mathison Turing <END> was a British mathematician, logician, cryptanalyst, and computer scientist. <START:person> John Backus <END> was an American computer scientist. <START:person> Gottfried Wilhelm Leibniz <END> was a German mathematician and philosopher. <START:person> Konrad Zuse <END> was a German civil engineer, inventor and computer pioneer.
Command:
./opennlp TokenNameFinderTrainer -encoding UTF-8 -lang en -data ../input/en-namefinder-training -model my-en-ner-person.bin
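Training lines in this format can also be generated from already tokenized text. The sketch below is an illustration only: TrainingLine and annotate are hypothetical names, and it assumes one sentence per line with tokens separated by single spaces, which is how the OpenNLP documentation describes the name finder training format.

```java
// Hypothetical helper (not part of OpenNLP): builds one training line
// by wrapping tokens[start..end) in <START:type> ... <END> markers.
class TrainingLine {
    static String annotate(String[] tokens, int start, int end, String type) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            if (i == start) {
                sb.append("<START:").append(type).append("> ");
            }
            sb.append(tokens[i]);
            if (i == end - 1) {
                sb.append(" <END>"); // end index is exclusive
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"John", "Backus", "was", "an", "American",
                           "computer", "scientist", "."};
        // prints: <START:person> John Backus <END> was an American computer scientist .
        System.out.println(annotate(tokens, 0, 2, "person"));
    }
}
```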

Command-line tools: My Model
Input:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate. Rudolf Bayer has been Professor (emeritus) of Informatics at the Technical University of Munich since 1972. Alan Mathison Turing was a British mathematician, logician, cryptanalyst, and computer scientist. John Backus was an American computer scientist. Gottfried Wilhelm Leibniz was a German mathematician and philosopher. Konrad Zuse was a German civil engineer, inventor and computer pioneer. I don't know John Zuse. Konrad Smith's visit to Munich last summer was awesome. Mrs. Quatsch lives in Saint Moritz. Where has Mr. Vinken been so long?
Command:
./opennlp TokenNameFinder my-en-ner-person.bin < ../input/input4 > ../output/outputnamefinder4

Command-line tools: My Model
Output:
<START:person> Pierre Vinken, <END> 61 years old, will join the board as a nonexecutive director Nov. 29. <START:person> Mr. Vinken <END> is chairman of Elsevier N.V., the Dutch publishing group. <START:person> Rudolph Agnew, <END> 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate. <START:person> Rudolf Bayer <END> has been Professor (emeritus) of Informatics at the Technical University of Munich since 1972. <START:person> Alan Mathison Turing <END> was a British mathematician, logician, cryptanalyst, and computer scientist. <START:person> John Backus <END> was an American computer scientist. <START:person> Gottfried Wilhelm Leibniz <END> was a German mathematician and philosopher. <START:person> Konrad Zuse <END> was a German civil engineer, inventor and computer pioneer. I don't know <START:person> John Zuse. <END> <START:person> Konrad Smith's <END> visit to Munich last summer was awesome. <START:person> Mrs. Quatsch <END> lives in Saint Moritz. Where has Mr. Vinken been so long?

Command-line tools: Evaluation
Command:
./opennlp TokenNameFinderEvaluator -encoding UTF-8 -model my-en-ner-person.bin -data ../input/en-person-finder-test

Command-line tools: Evaluation
Test file:
<START:person> Pierre Vinken, <END> 61 years old, will join the board as a nonexecutive director Nov. 29. <START:person> Mr. Vinken <END> is chairman of Elsevier N.V., the Dutch publishing group. <START:person> Rudolph Agnew, <END> 55 years...... I don't know <START:person> John Zuse. <END> <START:person> Konrad Smith's <END> visit to Munich last summer was awesome. <START:person> Mrs. Quatsch <END> lives in Saint Moritz. Where has <START:person> Mr. Vinken <END> been so long?

Command-line tools: Evaluation
Output:
Loading Token Name Finder model... done (0,030s)
Average: 155,6 sent/s
Total: 14 sent
Runtime: 0.09s
Precision: 1.0
Recall: 0.9166666666666666
F-Measure: 0.9565217391304348
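The reported F-measure is consistent with the balanced F1 score, the harmonic mean of precision and recall: F = 2PR / (P + R). A minimal sketch of that arithmetic follows; the class and method names are chosen for illustration and are not OpenNLP's own.

```java
// Illustrative helper (names are not from OpenNLP): balanced F-measure.
class FMeasure {
    // Harmonic mean of precision and recall; 0 when both are 0.
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0;
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Precision and recall as reported by the evaluator above.
        System.out.println(f1(1.0, 0.9166666666666666));
    }
}
```

With the values from the evaluation run this gives approximately 0.9565, matching the evaluator's output up to floating-point rounding.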

API: Prerequisite
Add opennlp-tools-1.5.2-incubating.jar to the project's Java libraries.

API: Attaching the API docs (optional)

API: Attaching the sources (optional)

API: Tokenizer
Output: [An, input, sample, sentence, .]

API: Tokenizer (File Stream)

API: Sentence Detector (File Stream)
Output: [First sentence., Second sentence.]

API: Problem
At which points may the input stream be split so that sentence detection can still be applied correctly? In general this is not known, so the entire document has to be read into a single String. This is problematic for large contiguous texts.
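One possible workaround, sketched here under a strong assumption that the slides do not make: if the text contains blank-line paragraph breaks and no sentence ever spans such a break, the input can be handed on paragraph by paragraph instead of as one String. ParagraphChunker and forEachParagraph are hypothetical names; the consumer stands in for whatever per-chunk processing (for example sentence detection) would follow.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical helper: splits a character stream at blank lines and passes
// each paragraph to a consumer, so only one paragraph is held in memory.
// Assumption: no sentence spans a blank line.
class ParagraphChunker {
    static void forEachParagraph(Reader in, Consumer<String> sink) {
        StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    // Blank line: the paragraph collected so far is complete.
                    if (current.length() > 0) {
                        sink.accept(current.toString());
                        current.setLength(0);
                    }
                } else {
                    // Join wrapped lines of one paragraph with a single space.
                    if (current.length() > 0) {
                        current.append(' ');
                    }
                    current.append(trimmed);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            sink.accept(current.toString()); // last paragraph without trailing blank line
        }
    }

    public static void main(String[] args) {
        String text = "First sentence. Second\nsentence.\n\nThird sentence.";
        forEachParagraph(new StringReader(text), System.out::println);
    }
}
```

For texts without reliable paragraph breaks this does not help, and the problem described above remains.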

Conclusion
Performance:
Tokenizer: about 40 s for 20 MB of text
SentenceDetector: about 2 s for 20 MB of text
Advantages:
Documentation is well structured and clear.
Convenient to use from the command line.
Simple tasks are simple to solve.
Disadvantages:
API documentation is minimal.
The API offers no streams, only Strings and String arrays.
The official training corpora are not available, so further training based on them is not possible.
The algorithms are not described.

Links
Binaries and sources, models:
http://opennlp.apache.org/cgi-bin/download.cgi
http://opennlp.sourceforge.net/models-1.5/
Forum posts:
http://stackoverflow.com/questions/12895145/opennlp-sentencedetection-api-for-entire-text-file
http://stackoverflow.com/questions/12081915/how-to-trainpostagger-opennlp-and-append-the-result-back-to-the-old-model