OpenNLP
Nico Beierle, Irina Glushanok
15.11.2012
Contents
1 Introduction
    General
    Capabilities
    Installation
2 Examples
    Command-line tools
        Tokenizer
        Name Finder
        Training
        My Model
        Evaluation
    API
        Tokenizer
        Tokenizer (File Stream)
        Sentence Detector (File Stream)
3 Conclusion
4 Links
General: Introduction
"The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text."
- Version 1 released in 2004
- License: Apache 2.0. May be used, modified, and distributed freely in any setting; the license must be retained.
- Current version: 1.5.2, released November 2011
- Currently 6 to 11 developers
Capabilities: What OpenNLP can do
Approach
- Learning-based algorithms, so training data is required
- What has been learned is stored in models
Models
- Sentence detection
- Tokenization
- Named entity recognition (person names, dates, locations, monetary amounts, organization names)
- Part-of-speech tagging
- Parsing
Capabilities: Example: the sentence detection model
Components
- Sentence Detection
    - Sentence Detection Tool
    - Sentence Detection API
- Sentence Detector Training
    - Training Tool
    - Training API
- Evaluation
    - Evaluation Tool
Installation
- Download apache-opennlp-1.5.2-incubating-bin.tar.gz from the website and unpack it
- Download prebuilt models from the website: http://opennlp.sourceforge.net/models-1.5/
Command-line tools: Example: Tokenizer
Input
    Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
    Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
    Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
Command
    ./opennlp TokenizerME en-token.bin < ../input/input > ../output/outputtokens
Output
    Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
    Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
    Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
Command-line tools: Name Finder (person names)
Input
    Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
    Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
    Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
    Rudolf Bayer has been Professor (emeritus) of Informatics at the Technical University of Munich since 1972.
    Alan Mathison Turing was a British mathematician, logician, cryptanalyst, and computer scientist.
    John Backus was an American computer scientist.
    Gottfried Wilhelm Leibniz was a German mathematician and philosopher.
    Konrad Zuse was a German civil engineer, inventor and computer pioneer.
Command
    ./opennlp TokenNameFinder en-ner-person.bin < ../input/input1 > ../output/outputnamefinder
Output
    <START:person> Pierre Vinken, <END> 61 years old, will join the board as a nonexecutive director Nov. 29.
    Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
    <START:person> Rudolph Agnew, <END> 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
    Rudolf Bayer has been Professor (emeritus) of Informatics at the Technical University of Munich since 1972.
    <START:person> Alan Mathison Turing <END> was a British mathematician, logician, cryptanalyst, and computer scientist.
    <START:person> John Backus <END> was an American computer scientist.
    <START:person> Gottfried Wilhelm Leibniz <END> was a German mathematician and philosopher.
    Konrad Zuse was a German civil engineer, inventor and computer pioneer.
Command-line tools: Training
Training file
    <START:person> Pierre Vinken, <END> 61 years old, will join the board as a nonexecutive director Nov. 29.
    <START:person> Mr. Vinken <END> is chairman of Elsevier N.V., the Dutch publishing group.
    <START:person> Rudolph Agnew, <END> 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
    <START:person> Rudolf Bayer <END> has been Professor (emeritus) of Informatics at the Technical University of Munich since 1972.
    <START:person> Alan Mathison Turing <END> was a British mathematician, logician, cryptanalyst, and computer scientist.
    <START:person> John Backus <END> was an American computer scientist.
    <START:person> Gottfried Wilhelm Leibniz <END> was a German mathematician and philosopher.
    <START:person> Konrad Zuse <END> was a German civil engineer, inventor and computer pioneer.
Command
    ./opennlp TokenNameFinderTrainer -encoding UTF-8 -lang en -data ../input/en-namefinder-training -model my-en-ner-person.bin
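The training file marks each name with <START:person> and <END> tags that are separated from the surrounding tokens by whitespace. Before training, it can be useful to list the names a file actually annotates. The following is a minimal sketch in plain Java, independent of OpenNLP; the class AnnotatedSpans and the method extract are hypothetical names chosen for illustration, and the regular expression assumes that spans are never nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper (not part of OpenNLP): lists the text of all
// <START:type> ... <END> spans found in one line of a training file.
final class AnnotatedSpans {

    // Group 1 is the entity type, group 2 the annotated tokens.
    // The lazy quantifier stops at the first <END>, so nesting is not supported.
    private static final Pattern SPAN =
            Pattern.compile("<START:(\\w+)>\\s+(.*?)\\s+<END>");

    static List<String> extract(String line, String type) {
        List<String> result = new ArrayList<>();
        Matcher m = SPAN.matcher(line);
        while (m.find()) {
            if (m.group(1).equals(type)) {
                result.add(m.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "<START:person> John Backus <END> was an American computer scientist.";
        System.out.println(extract(line, "person"));
    }
}
```

Running extract over every line of a training or test file gives a quick count of annotated names per type, which helps when interpreting the evaluation figures later on.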
Command-line tools: My Model
Input
    Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
    Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
    Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
    Rudolf Bayer has been Professor (emeritus) of Informatics at the Technical University of Munich since 1972.
    Alan Mathison Turing was a British mathematician, logician, cryptanalyst, and computer scientist.
    John Backus was an American computer scientist.
    Gottfried Wilhelm Leibniz was a German mathematician and philosopher.
    Konrad Zuse was a German civil engineer, inventor and computer pioneer.
    I don't know John Zuse.
    Konrad Smith's visit to Munich last summer was awesome.
    Mrs. Quatsch lives in Saint Moritz.
    Where has Mr. Vinken been so long?
Command
    ./opennlp TokenNameFinder my-en-ner-person.bin < ../input/input4 > ../output/outputnamefinder4
Command-line tools: My Model
Output
    <START:person> Pierre Vinken, <END> 61 years old, will join the board as a nonexecutive director Nov. 29.
    <START:person> Mr. Vinken <END> is chairman of Elsevier N.V., the Dutch publishing group.
    <START:person> Rudolph Agnew, <END> 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
    <START:person> Rudolf Bayer <END> has been Professor (emeritus) of Informatics at the Technical University of Munich since 1972.
    <START:person> Alan Mathison Turing <END> was a British mathematician, logician, cryptanalyst, and computer scientist.
    <START:person> John Backus <END> was an American computer scientist.
    <START:person> Gottfried Wilhelm Leibniz <END> was a German mathematician and philosopher.
    <START:person> Konrad Zuse <END> was a German civil engineer, inventor and computer pioneer.
    I don't know <START:person> John Zuse. <END>
    <START:person> Konrad Smith's <END> visit to Munich last summer was awesome.
    <START:person> Mrs. Quatsch <END> lives in Saint Moritz.
    Where has Mr. Vinken been so long?
Command-line tools: Evaluation
Command
    ./opennlp TokenNameFinderEvaluator -encoding UTF-8 -model my-en-ner-person.bin -data ../input/en-person-finder-test
Test file
    <START:person> Pierre Vinken, <END> 61 years old, will join the board as a nonexecutive director Nov. 29.
    <START:person> Mr. Vinken <END> is chairman of Elsevier N.V., the Dutch publishing group.
    <START:person> Rudolph Agnew, <END> 55 years...
    ...
    I don't know <START:person> John Zuse. <END>
    <START:person> Konrad Smith's <END> visit to Munich last summer was awesome.
    <START:person> Mrs. Quatsch <END> lives in Saint Moritz.
    Where has <START:person> Mr. Vinken <END> been so long?
Output
    Loading Token Name Finder model... done (0,030s)
    Average: 155,6 sent/s
    Total: 14 sent
    Runtime: 0.09s
    Precision: 1.0
    Recall: 0.9166666666666666
    F-Measure: 0.9565217391304348
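The three figures in the evaluator output are related: the F-measure is the harmonic mean of precision and recall, F = 2PR / (P + R). The reported values are consistent with, for example, 11 of 12 annotated names being found and no false hits (the untagged "Mr. Vinken" in the last sentence of the My Model output would be the one miss). These counts are an inference from the slides, not something the tool prints. A minimal sketch of the arithmetic, with hypothetical class and method names:

```java
// Illustrative arithmetic only; FMeasureSketch is not an OpenNLP class.
final class FMeasureSketch {

    // Share of predicted names that are correct.
    static double precision(int truePositives, int falsePositives) {
        int predicted = truePositives + falsePositives;
        return predicted == 0 ? 0.0 : (double) truePositives / predicted;
    }

    // Share of annotated names that were found.
    static double recall(int truePositives, int falseNegatives) {
        int annotated = truePositives + falseNegatives;
        return annotated == 0 ? 0.0 : (double) truePositives / annotated;
    }

    // Harmonic mean of precision and recall.
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Assumed counts: 11 names found, 0 false hits, 1 name missed.
        double p = precision(11, 0);
        double r = recall(11, 1);
        System.out.println("Precision: " + p);
        System.out.println("Recall: " + r);
        System.out.println("F-Measure: " + fMeasure(p, r));
    }
}
```

Because precision is 1.0 here, the F-measure is pulled below 1 only by the missed name; a model that tagged extra non-names would lower precision instead.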
API: Prerequisite
Add opennlp-tools-1.5.2-incubating.jar to the project's Java libraries.
API: Attaching the API docs (optional)
API: Attaching the sources (optional)
API: Tokenizer
Output
    [An, input, sample, sentence, .]
API: Tokenizer (File Stream)
API: Sentence Detector (File Stream)
Output
    [First sentence., Second sentence.]
API: Problem
At which points may we split the input stream so that sentence detection can still be applied correctly?
In general we do not know, so the entire document has to be read into a single String.
This is problematic for large contiguous texts.
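One possible workaround, sketched here as a heuristic rather than a guaranteed solution, is to read the text in chunks and hold back the last detected sentence of each chunk, since only that one may have been cut off by the chunk boundary. The held-back tail is prepended to the next chunk. The sketch assumes that the detector returns trimmed substrings of its input and that its decisions on earlier sentences do not change once more text follows. The class ChunkedSentences and the stand-in naiveSentDetect are hypothetical; in practice the detector argument could be OpenNLP's SentenceDetectorME.sentDetect(String), which also returns a String[].

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative sketch: chunk-wise sentence detection with a carried-over tail.
final class ChunkedSentences {

    // Stand-in detector (NOT OpenNLP): splits after '.', '!' or '?' plus whitespace.
    static String[] naiveSentDetect(String text) {
        List<String> out = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {
            String t = s.trim();
            if (!t.isEmpty()) out.add(t);
        }
        return out.toArray(new String[0]);
    }

    static List<String> detect(Reader in, int chunkSize,
                               Function<String, String[]> detector) {
        List<String> sentences = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        char[] chunk = new char[chunkSize];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.append(chunk, 0, n);
                String text = buffer.toString();
                String[] found = detector.apply(text);
                if (found.length < 2) continue; // no sentence is safely complete yet
                // Locate the raw start of the last sentence so whitespace is preserved.
                int tailStart = text.lastIndexOf(found[found.length - 1]);
                if (tailStart < 0) continue;    // detector altered the text; keep buffering
                for (int i = 0; i < found.length - 1; i++) sentences.add(found[i]);
                buffer.setLength(0);
                buffer.append(text.substring(tailStart)); // carry the tail over
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // End of input: whatever is left is complete by definition.
        for (String s : detector.apply(buffer.toString())) sentences.add(s);
        return sentences;
    }

    public static void main(String[] args) {
        Reader in = new StringReader("First sentence. Second sentence.");
        // prints [First sentence., Second sentence.]
        System.out.println(detect(in, 7, ChunkedSentences::naiveSentDetect));
    }
}
```

The memory needed is bounded by the chunk size plus the longest sentence, but the tail is re-analysed with every new chunk, so very long passages without a sentence boundary still degrade towards the read-everything case described on the slide.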
Conclusion
Performance
- Tokenizer: 20 MB of text in about 40 s
- SentenceDetector: 20 MB of text in about 2 s
Advantages
- Documentation is well structured and clear
- Easy to use from the command line
- Simple tasks are simple to solve
Disadvantages
- API documentation is minimal
- API: no streams, only Strings and String arrays
- Official training corpora are not available, so the provided models cannot be trained further
- Algorithms are not described
Links
Binaries, sources, and models
- http://opennlp.apache.org/cgi-bin/download.cgi
- http://opennlp.sourceforge.net/models-1.5/
Forum posts
- http://stackoverflow.com/questions/12895145/opennlp-sentencedetection-api-for-entire-text-file
- http://stackoverflow.com/questions/12081915/how-to-trainpostagger-opennlp-and-append-the-result-back-to-the-old-model