Information Retrieval: Vector Space Model

Claes Neuefeind, Fabian Steeg
3 December 2009

Seminar topics
- Boolean retrieval model (IIR 1)
- Data structures (IIR 2)
- Tolerant retrieval (IIR 3)
- Vector space model (IIR 6)
- Evaluation (IIR 8)
- Web retrieval (IIR 19-21)

Recap: Boolean retrieval
- Find all documents that contain the term(s) of the query
- All or nothing: a document either matches or it does not
- Good for experts and applications, less good for end users
- Extensions: positional index (phrases, proximity); permuterm or k-gram index (fuzzy matching, corrections)
- Ranking?

Ranking
- Basic idea: rate term/document pairs with a score that reflects the relevance of the term for the document
- Approaches: parameters and zones; term weighting

Outline
- Parameters and zones
- Weighting
- Vector space model
- VSM vs. Boolean model
- Literature

Parameters
- Use of metadata: structured information about the document, controlled vocabulary
- A plain inverted index is not sufficient
- Extension: include the parameters in the index (mapping between documents and fields)

Document zones
- Document zones containing free text
- Figure: www.informationretrieval.org
- Extended index: zones as attributes of terms

Document zones
- Better: zones as attributes of documents
- Figure: www.informationretrieval.org
- The dictionary stays (relatively) small
- Simplifies computation (cf. postings intersection)

Weighted zone scoring
- Scoring by weighting zones ("ranked Boolean retrieval")
- Score: $\sum_{i=1}^{l} g_i s_i$, where $l$ = number of zones, $g_i$ = weight of zone $i$, $s_i$ = Boolean score (1/0) of zone $i$
- Weights are either set by hand or computed
- Alternative: learn the weights inductively
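The zone score above can be sketched in a few lines of Python. This is only an illustration: the zone names, the weights, and the rule that a zone matches when it contains every query term are assumptions, not taken from the slides.

```python
def weighted_zone_score(query_terms, doc_zones, zone_weights):
    """Sum g_i * s_i over all zones; s_i is 1 if every query term occurs in zone i."""
    score = 0.0
    for zone, weight in zone_weights.items():
        tokens = set(doc_zones.get(zone, "").lower().split())
        matches = all(term in tokens for term in query_terms)
        score += weight * (1 if matches else 0)
    return score


# Hypothetical zones and weights, chosen only for illustration.
zone_weights = {"title": 0.5, "abstract": 0.3, "body": 0.2}
doc = {
    "title": "vector space model",
    "abstract": "ranking documents by similarity",
    "body": "the vector space model ranks documents",
}

# "vector" occurs in the title and the body, but not in the abstract.
print(weighted_zone_score(["vector"], doc, zone_weights))
```

Because each s_i is Boolean, the score can only take as many distinct values as there are subsets of zones, which is why this is a coarse form of ranking.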

So far: matrix with binary values

Term       Anthony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth  ...
Anthony    1                      1              0            0       0        1
Brutus     1                      1              0            1       0        0
Caesar     1                      1              0            1       1        1
Calpurnia  0                      1              0            0       0        0
Cleopatra  1                      0              0            0       0        0
mercy      1                      0              1            1       1        1
worser     1                      0              1            1       1        0
...

Documents as binary vectors $\in \{0, 1\}^{|V|}$.

Alternative: using the term frequency

Term       Anthony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth  ...
Anthony    157                    73             0            0       0        1
Brutus     4                      157            0            2       0        0
Caesar     232                    227            0            2       1        0
Calpurnia  0                      10             0            0       0        0
Cleopatra  57                     0              0            0       0        0
mercy      2                      0              3            8       5        8
worser     2                      0              1            1       1        5
...

Documents as vectors of natural numbers $\in \mathbb{N}^{|V|}$.

Term weighting and ranking
- Ideas for term weighting come from text statistics
- Thesis of [Luhn, 1957]: the distribution of terms reflects the content of documents
- Term frequency and term density are factors of significance
- Term distribution as the basis for representation

Term weighting: tf
- Term frequency (tf): number of occurrences of a term within a document
- Score for a query/document pair: $\mathrm{Score}(q, d) = \sum_{t \in q \cap d} \mathrm{tf}_{t,d}$
- Problems: frequency and relevance are not directly related (long documents); not all terms are equally important ("stop words")

Term weighting: tf
- Way out: dampen the frequency with a logarithm: $w_{t,d} = 1 + \log_{10} \mathrm{tf}_{t,d}$ if $\mathrm{tf}_{t,d} > 0$, and $w_{t,d} = 0$ otherwise
- Narrower range of values: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- But: frequent terms do not necessarily describe a document better
- Further measures are needed
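As a small sketch of the two tf variants discussed so far, the following Python snippet computes the raw-tf score and the log-dampened weight. The function names are illustrative choices, not part of the slides.

```python
import math


def raw_tf_score(query_terms, doc_tf):
    """Score(q, d): sum of raw term frequencies over terms shared by query and document."""
    return sum(doc_tf.get(term, 0) for term in set(query_terms))


def log_tf_weight(tf):
    """Log-dampened weight: 1 + log10(tf) for tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


# Reproduces the value range listed on the slide.
for tf in (0, 1, 2, 10, 1000):
    print(tf, "->", round(log_tf_weight(tf), 1))
```

A term that occurs 1000 times thus counts only four times as much as a term that occurs once, instead of a thousand times as much.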

Term weighting: cf, df
- Corpus frequency (cf): number of occurrences of a term in the corpus
- Document frequency (df): number of documents in which a term occurs

Word       cf     df
try        10422  8760
insurance  10440  3997

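Term weighting: idf
- Inverse document frequency (idf): informativeness of a term, based on its distribution over the corpus
- $\mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t}$, where $N$ is the number of documents in the corpus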

Examples for idf
Computation: $\mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t} = \log_{10} \frac{1{,}000{,}000}{\mathrm{df}_t}$

term       df_t       idf_t
calpurnia  1          6
animal     100        4
sunday     1000       3
fly        10,000     2
under      100,000    1
the        1,000,000  0
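The idf values in the table can be checked with a short Python sketch. The dictionary below simply restates the df column of the slide; the function name is an illustrative choice.

```python
import math


def idf(df, n_docs):
    """Inverse document frequency with base-10 logarithm: log10(N / df)."""
    return math.log10(n_docs / df)


N = 1_000_000  # corpus size used on the slide
doc_freqs = {
    "calpurnia": 1,
    "animal": 100,
    "sunday": 1_000,
    "fly": 10_000,
    "under": 100_000,
    "the": 1_000_000,
}

for term, df in doc_freqs.items():
    print(f"{term:10s} {df:>9,d} {idf(df, N):.0f}")
```

A term that occurs in every document ("the") receives an idf of 0 and therefore contributes nothing to a tf-idf score.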

Term weighting: tf × idf
- $w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \cdot \log_{10} \frac{N}{\mathrm{df}_t}$
- idf scales the term frequency: it relates how often a term occurs to how informative it is
- The weight increases when $t$ occurs in few documents and/or frequently within a document
- Overlap score measure for queries: $\mathrm{Score}(q, d) = \sum_{t \in q \cap d} \text{tf-idf}_{t,d}$
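A minimal Python sketch of the tf-idf weight and the overlap score follows. The document frequencies are borrowed from the car-insurance example later in these slides, and N = 1,000,000 is assumed as in the idf examples; the function names are illustrative.

```python
import math


def tf_idf(tf, df, n_docs):
    """w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term does not occur."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)


def overlap_score(query_terms, doc_tf, doc_freqs, n_docs):
    """Sum the tf-idf weights of all terms that occur in both query and document."""
    return sum(
        tf_idf(doc_tf[term], doc_freqs[term], n_docs)
        for term in set(query_terms)
        if term in doc_tf and term in doc_freqs
    )


N = 1_000_000  # assumed corpus size
doc_freqs = {"best": 50_000, "car": 10_000, "insurance": 1_000}
doc_tf = {"car": 1, "insurance": 2}  # term counts of one document

score = overlap_score(["best", "car", "insurance"], doc_tf, doc_freqs, N)
print(round(score, 2))
```

Note that the query term "best" does not occur in the document and therefore adds nothing, while the rarer term "insurance" dominates the score.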

Weight matrix

Term       Anthony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth  ...
Anthony    5.25                   3.18           0.0          0.0     0.0      0.35
Brutus     1.21                   6.10           0.0          1.0     0.0      0.0
Caesar     8.59                   2.54           0.0          1.51    0.25     0.0
Calpurnia  0.0                    1.54           0.0          0.0     0.0      0.0
Cleopatra  2.85                   0.0            0.0          0.0     0.0      0.0
mercy      1.51                   0.0            1.90         0.12    5.25     0.88
worser     1.37                   0.0            0.11         4.15    0.25     1.95
...

Documents as real-valued vectors of tf-idf values $\in \mathbb{R}^{|V|}$.

The vector space model
- High-dimensional vector space
- Number of dimensions = size of the vocabulary
- Figure: www.informationretrieval.org

Documents as vectors
- Documents are points in the vector space
- Terms define the axes of the vector space
- Features = terms, values = weights
- The numerical representation in the vector space gives access to comparison metrics

Queries as vectors
- Queries are treated as small documents
- Processing by vector comparison
- Result: ranked list of similar vectors
- Options: distance or angle

Similarity as distance
- Euclidean distance, the length of the difference between two vectors: $|\vec{q} - \vec{d}| = \sqrt{\sum_{i=1}^{n} (q_i - d_i)^2}$
- Problematic for vectors of different lengths: the distance is very large even when the term distribution is similar
- Normalize by the Euclidean length: $|\vec{d}_j| = \sqrt{\sum_{i=1}^{n} d_{i,j}^2}$
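The length problem can be made concrete with a small Python sketch. The two vectors are made-up values: d2 has the same term distribution as d1 but twice the length, as if the document had been concatenated with itself.

```python
import math


def euclidean_distance(q, d):
    """Length of the difference vector q - d."""
    return math.sqrt(sum((qi - di) ** 2 for qi, di in zip(q, d)))


def euclidean_length(v):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    length = euclidean_length(v)
    return [x / length for x in v] if length > 0 else list(v)


d1 = [3.0, 4.0]  # hypothetical term weights
d2 = [6.0, 8.0]  # same distribution, twice as long

print(euclidean_distance(d1, d2))                        # large despite identical distribution
print(euclidean_distance(normalize(d1), normalize(d2)))  # vanishes after normalization
```

After length normalization both documents map to the same point, so only the direction of the vector, that is the relative term distribution, matters.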

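Similarity as angle: the cosine similarity
- $\cos(\vec{q}, \vec{d}) = \mathrm{sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$
- $q_i$ is the tf-idf value of term $i$ in the query
- $d_i$ is the tf-idf value of term $i$ in the document
- $|\vec{q}|$ and $|\vec{d}|$ are the lengths of $\vec{q}$ and $\vec{d}$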

Cosine for normalized vectors
- The cosine similarity of $\vec{q}$ and $\vec{d}$ is equivalent to the cosine of the angle between $\vec{q}$ and $\vec{d}$
- For vectors that are already normalized, the cosine equals the dot product of the vectors: $\cos(\vec{q}, \vec{d}) = \vec{q} \cdot \vec{d} = \sum_i q_i d_i$
- Here $\vec{q}$ and $\vec{d}$ are normalized by their Euclidean length

Cosine similarity
- Figure: www.informationretrieval.org

Example calculation for cosine similarity
Comparison of the novels Sense and Sensibility (SaS), Pride and Prejudice (PaP) and Wuthering Heights (WH)

Raw term frequency

Term       SaS  PaP  WH
affection  115  58   20
jealous    10   7    11
gossip     2    0    6
wuthering  0    0    38

Example calculation for cosine

Log-tf weighting

Term       SaS   PaP   WH
affection  3.06  2.76  2.30
jealous    2.0   1.85  2.04
gossip     1.30  0     1.78
wuthering  0     0     2.58

Log-tf weighting and cosine normalization

Term       SaS    PaP    WH
affection  0.789  0.832  0.524
jealous    0.515  0.555  0.465
gossip     0.335  0.0    0.405
wuthering  0.0    0.0    0.588

- cos(SaS, PaP) ≈ 0.789 · 0.832 + 0.515 · 0.555 + 0.335 · 0.0 + 0.0 · 0.0 ≈ 0.94
- cos(SaS, WH) ≈ 0.79
- cos(PaP, WH) ≈ 0.69
- Why is cos(SaS, PaP) > cos(SaS, WH)? SaS and PaP are by Jane Austen; WH is by Emily Brontë
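The three cosine values can be recomputed from the raw term frequencies with a short Python sketch. The counts are those of the example; the helper names are illustrative choices.

```python
import math

# Raw term frequencies, in the order: affection, jealous, gossip, wuthering.
raw_tf = {
    "SaS": [115, 10, 2, 0],
    "PaP": [58, 7, 0, 0],
    "WH": [20, 11, 6, 38],
}


def log_tf(tf):
    """1 + log10(tf) for tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def normalize(v):
    """Divide each component by the Euclidean length of the vector."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Log-tf weighting followed by cosine normalization.
vectors = {name: normalize([log_tf(tf) for tf in counts]) for name, counts in raw_tf.items()}

# For unit-length vectors the cosine is just the dot product.
for a, b in (("SaS", "PaP"), ("SaS", "WH"), ("PaP", "WH")):
    print(f"cos({a}, {b}) = {dot(vectors[a], vectors[b]):.2f}")
```

Since the vectors are normalized once up front, each pairwise comparison reduces to a single dot product.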

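Components of tf-idf weighting

Term frequency
- n (natural): $\mathrm{tf}_{t,d}$
- l (logarithm): $1 + \log(\mathrm{tf}_{t,d})$
- a (augmented): $0.5 + \frac{0.5 \cdot \mathrm{tf}_{t,d}}{\max_t(\mathrm{tf}_{t,d})}$
- b (boolean): 1 if $\mathrm{tf}_{t,d} > 0$, 0 otherwise
- L (log ave): $\frac{1 + \log(\mathrm{tf}_{t,d})}{1 + \log(\mathrm{ave}_{t \in d}(\mathrm{tf}_{t,d}))}$

Document frequency
- n (no): 1
- t (idf): $\log \frac{N}{\mathrm{df}_t}$
- p (prob idf): $\max\{0, \log \frac{N - \mathrm{df}_t}{\mathrm{df}_t}\}$

Normalization
- n (none): 1
- c (cosine): $\frac{1}{\sqrt{w_1^2 + w_2^2 + \ldots + w_M^2}}$
- u (pivoted unique): $1/u$
- b (byte size): $1/\mathit{CharLength}^{\alpha}$, $\alpha < 1$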

Best-known combination of weighting options: l (logarithm) for term frequency, t (idf) for document frequency, c (cosine) for normalization.

Default: no weighting, that is n (natural) for term frequency, n (no) for document frequency, n (none) for normalization.

Components of tf-idf weighting
- Query and document are often weighted differently
- SMART notation as used here: qqq.ddd (note that IIR writes the triplets in the order ddd.qqq)
- Example: ltn.lnc
  - Query: logarithmic tf, idf, no normalization
  - Document: logarithmic tf, no idf, cosine normalization
- Example: query "best car insurance", document "car insurance auto insurance"

Example calculation for tf-idf with the combination ltn.lnc: cosine normalization of the document weights
- Length of the document vector (auto, best, car, insurance): $\sqrt{1^2 + 0^2 + 1^2 + 1.3^2} \approx 1.92$
- Normalized weights: $1/1.92 \approx 0.52$ and $1.3/1.92 \approx 0.68$

Example calculation for tf-idf with the combination ltn.lnc
Query: best car insurance. Document: car insurance auto insurance.

           query                                  document                           product
Word       tf-raw  tf-wght  df     idf  weight    tf-raw  tf-wght  weight  n'lized
auto       0       0        5000   2.3  0         1       1        1       0.52      0
best       1       1        50000  1.3  1.3       0       0        0       0         0
car        1       1        10000  2.0  2.0       1       1        1       0.52      1.04
insurance  1       1        1000   3.0  3.0       2       1.3      1.3     0.68      2.04

Key to columns: tf-raw: raw (unweighted) term frequency; tf-wght: logarithmically weighted term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n'lized: document weights after cosine normalization; product: the product of final query weight and final document weight.

Score for the query/document pair: $\sum_i w_{q_i} \cdot w_{d_i} = 0 + 0 + 1.04 + 2.04 = 3.08$
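The ltn.lnc calculation can be retraced with the following Python sketch. The corpus size N = 1,000,000 is an assumption: it is not stated in the table, but it is the value used in the idf examples and it reproduces the idf column. Variable names are illustrative.

```python
import math

N = 1_000_000  # assumed corpus size, consistent with the idf column
df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "insurance": 2, "auto": 1}


def log_tf(tf):
    """1 + log10(tf) for tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


# Query side (ltn): logarithmic tf, idf, no normalization.
query_w = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()}

# Document side (lnc): logarithmic tf, no idf, cosine normalization.
doc_raw = {t: log_tf(tf) for t, tf in doc_tf.items()}
doc_len = math.sqrt(sum(w * w for w in doc_raw.values()))
doc_w = {t: w / doc_len for t, w in doc_raw.items()}

# Final score: dot product over the terms of the query.
score = sum(w * doc_w.get(t, 0.0) for t, w in query_w.items())
print(f"document length = {doc_len:.2f}, score = {score:.2f}")
```

With unrounded intermediate values the score comes out at about 3.07; the 3.08 in the table results from multiplying the already rounded document weights 0.52 and 0.68.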

Summary: the vector space model
- Advantages:
  - Compact representation of document properties
  - Numerical representation
  - Comparison metrics yield graded similarities
  - Ranking of documents relative to the query
- Problems:
  - Bag of words
  - Wildcards / fuzzy matching
  - Dimensionality / sparseness
  - Polysemy / homonymy

VSM vs. Boolean model
- VSM:
  - Accumulated evidence: term frequency increases the score
  - Only suited to free-text queries
- Boolean model:
  - Selective evidence
  - True if the weight is non-zero
- Combination:
  - Implicit AND
  - Further operators for refined queries

VSM and wildcards
- No direct querying possible
- Index structures are not compatible (matrix vs. tree)
- Can be combined by means of a k-gram index and query expansion:
  - Fetch the matching terms from the k-gram index
  - Construct the query vector from them
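The combination of k-gram index and query expansion might look roughly like the Python sketch below. Everything concrete here is an assumption made for illustration: the vocabulary, k = 2, the use of `$` as a word-boundary marker, the post-filtering with `fnmatchcase`, and the uniform weight of 1.0 for every expanded term.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def kgrams(text, k=2):
    """All k-grams of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def build_kgram_index(vocabulary, k=2):
    """Map each k-gram of '$term$' to the set of terms that contain it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(f"${term}$", k):
            index[gram].add(term)
    return index


def expand_wildcard(pattern, index, k=2):
    """Return the vocabulary terms that match a wildcard pattern such as 'ins*'."""
    grams = set()
    for piece in f"${pattern}$".split("*"):
        grams |= kgrams(piece, k)  # only k-grams that do not span the wildcard
    if not grams:
        return set()
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Post-filter: sharing all k-grams does not guarantee that the pattern matches.
    return {term for term in candidates if fnmatchcase(term, pattern)}


vocabulary = ["insurance", "instant", "inside", "mainstream", "car", "best"]
index = build_kgram_index(vocabulary)
expanded = expand_wildcard("ins*", index)

# Query vector over the expanded terms (uniform weights as a simplification).
query_vector = {term: 1.0 for term in sorted(expanded)}
print(query_vector)
```

The expanded terms can then be weighted and compared with the document vectors like any other free-text query.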

VSM and phrase queries
- The VSM is not suited to position-dependent search
- For multi-word queries, the axes of the individual terms are always activated as well
- Can be combined by means of query parsing

What comes next?
- Evaluation (IIR 8)
- Web retrieval (IIR 19-21)

Literature
- Luhn, H. P. (1957). A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4):309-317.
- Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

Further reading: [Manning et al., 2008], chapter 6 (see www.informationretrieval.org)