Information Retrieval: Vector Space Model

Transcript

1 Information Retrieval: Vector Space Model. Claes Neuefeind, Fabian Steeg. 3 December 2009

2 Seminar topics: Boolean retrieval model (IIR 1); data structures (IIR 2); tolerant retrieval (IIR 3); vector space model (IIR 6); evaluation (IIR 8); web retrieval (IIR 19-21)

3 Recap: Boolean retrieval. Find all documents that contain the query term(s). All or nothing. Good for experts and applications, less good for end users. Extensions: positional index (phrases, proximity); permuterm or k-gram index (fuzzy matching, corrections). Ranking?

4 Ranking. Basic idea: assess term/document pairs with a score that reflects the relevance of the term to the document. Approaches: parameters and zones; term weighting

5 Parameters and zones. Weighting. Vector space model. VSM vs. Boolean. Literature

6 Parameters. Use of metadata: structured information about the document; controlled vocabulary. An inverted index alone is insufficient. Extension: include the parameters in the index; mapping between documents and fields

7 Document zones. Document zones containing free text. Figure (not preserved): extended index with zones as attributes of terms

8 Document zones. Better: document zones as attributes of documents (figure not preserved). The dictionary stays (relatively) small. Simplifies computation (cf. postings intersection)

9 Weighted zone scoring. Scoring by weighting zones. Ranked Boolean retrieval: $\sum_{i=1}^{l} g_i s_i$, where $l$ = number of zones, $g_i$ = weight of zone $i$, $s_i$ = Boolean score (1/0) for zone $i$. Weights are either fixed by hand or computed. Alternatively: learn the weights inductively
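The weighted zone score can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the zone names and weights below are hypothetical, chosen only to show how the sum $\sum_i g_i s_i$ is formed.

```python
def weighted_zone_score(zone_matches, zone_weights):
    """Return sum_i g_i * s_i for one query/document pair.

    zone_matches: dict zone name -> bool (Boolean score s_i of the zone)
    zone_weights: dict zone name -> float (weight g_i of the zone)
    """
    return sum(
        g * (1 if zone_matches.get(zone, False) else 0)
        for zone, g in zone_weights.items()
    )


# Hypothetical zones and weights (assumptions for illustration only).
weights = {"title": 0.5, "abstract": 0.3, "body": 0.2}
matches = {"title": False, "abstract": True, "body": True}

score = weighted_zone_score(matches, weights)
print(score)
```

A document that matches the query only in the abstract and body is scored by the sum of those two zone weights; a match in every zone yields the sum of all weights.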

10 So far: matrix with binary values. Term-document matrix (cell values not preserved): columns Anthony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth, ...; rows Anthony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser. Documents as binary vectors $\in \{0, 1\}^{|V|}$.

12 Alternative: using the term frequency. Term-document matrix (cell values not preserved): columns Anthony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth, ...; rows Anthony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser. Documents as vectors of natural numbers $\in \mathbb{N}^{|V|}$.
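As a rough sketch of how such count vectors arise, the following Python snippet builds one term-frequency vector per document over a shared vocabulary. The two toy documents are hypothetical and do not reproduce the matrix from the slide.

```python
from collections import Counter


def term_frequency_vectors(docs):
    """Map each tokenized document to a count vector over the shared vocabulary."""
    vocabulary = sorted({term for tokens in docs.values() for term in tokens})
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        # One dimension per vocabulary term; absent terms get count 0.
        vectors[name] = [counts[term] for term in vocabulary]
    return vocabulary, vectors


# Hypothetical toy documents (already tokenized and lowercased).
docs = {
    "d1": ["caesar", "brutus", "caesar"],
    "d2": ["brutus", "mercy"],
}

vocabulary, vectors = term_frequency_vectors(docs)
print(vocabulary)
print(vectors)
```

Replacing each count by `1 if count > 0 else 0` would give the binary vectors of the Boolean model.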

14 Term weighting and ranking. Ideas for term weighting come from text statistics. Thesis of [Luhn, 1957]: the distribution of terms reflects the content of documents. Term frequency and term density are factors of significance. Term distribution as the basis for representation

15 Term weighting: tf. Term frequency (tf): the number of occurrences of a term within a document. Score for a query/document pair: $\mathrm{Score}(q, d) = \sum_{t \in q \cap d} \mathrm{tf}_{t,d}$. Problems: no direct relationship between frequency and relevance (long documents); not all terms are equally important (stop words)

16 Term weighting: tf. Remedy: dampening with the logarithm: $w_{t,d} = 1 + \log_{10} \mathrm{tf}_{t,d}$ if $\mathrm{tf}_{t,d} > 0$, otherwise $0$. Narrower range of values: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, etc. But: frequent terms do not necessarily describe a document better. Further measures are needed
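The dampened weight is easy to check numerically. A minimal Python sketch, assuming the base-10 logarithm stated in the formula:

```python
import math


def log_tf_weight(tf):
    """Log-dampened term frequency: 1 + log10(tf) for tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


for tf in (0, 1, 2, 10):
    print(tf, round(log_tf_weight(tf), 1))
```

The loop prints the mapping listed on the slide: a tenfold increase in raw frequency adds only 1 to the weight.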

17 Term weighting: cf, df. Collection frequency (cf): number of occurrences of a term in the corpus. Document frequency (df): number of documents in which a term occurs. Table (values not preserved): columns word, cf, df; rows try, insurance

18 Term weighting: idf. Inverse document frequency (idf): informativeness of a term, based on its distribution across the corpus. $\mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t}$

19 Examples of idf. Computation: $\mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t} = \log_{10} \frac{1{,}000{,}000}{\mathrm{df}_t}$
term         df_t         idf_t
calpurnia    1            6
animal       100          4
sunday       1,000        3
fly          10,000       2
under        100,000      1
the          1,000,000    0
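The idf values follow directly from the formula with N = 1,000,000 and a base-10 logarithm. A short Python sketch that recomputes them (the function name `idf` is an illustrative choice):

```python
import math


def idf(df, n_docs):
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df)


N = 1_000_000
document_frequencies = {
    "calpurnia": 1,
    "animal": 100,
    "sunday": 1_000,
    "fly": 10_000,
    "under": 100_000,
    "the": 1_000_000,
}

for term, df in document_frequencies.items():
    print(f"{term:10s} {df:>9,d} {idf(df, N):.0f}")
```

A term that occurs in every document, such as "the", gets idf 0 and therefore contributes nothing to a tf-idf score.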

21 Term weighting: tf × idf. $w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \cdot \log_{10} \frac{N}{\mathrm{df}_t}$. Scales the term frequency by relating term frequency to informativeness. The weight increases when $t$ occurs in few documents and/or frequently within a document. Overlap score measure for queries: $\mathrm{Score}(q, d) = \sum_{t \in q} \text{tf-idf}_{t,d}$
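A minimal Python sketch of the tf-idf weight and the overlap score. The document frequencies and the collection size N below are assumptions made up for illustration; only the formulas come from the slide.

```python
import math
from collections import Counter


def tf_idf(tf, df, n_docs):
    """(1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)


def overlap_score(query_terms, doc_terms, df_by_term, n_docs):
    """Sum the tf-idf weights of the distinct query terms in the document."""
    counts = Counter(doc_terms)
    return sum(
        tf_idf(counts[t], df_by_term.get(t, 0), n_docs)
        for t in set(query_terms)
    )


# Assumed collection statistics (hypothetical values).
n_docs = 1_000_000
example_df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}

score = overlap_score(
    ["best", "car", "insurance"],
    ["car", "insurance", "auto", "insurance"],
    example_df,
    n_docs,
)
print(round(score, 3))
```

Query terms that do not occur in the document contribute 0, so the sum over the query terms effectively runs over the terms shared by query and document.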

22 Weight matrix. Term-document matrix (cell values not preserved): columns Anthony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth, ...; rows Anthony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser. Documents as real-valued vectors of tf-idf values $\in \mathbb{R}^{|V|}$.

24 The vector space model. High-dimensional vector space. Number of dimensions = size of the vocabulary. Figure (not preserved)

25 Documents as vectors. Documents are points in the vector space. Terms define the axes of the vector space. Features = terms; values = weights. The numerical representation in the vector space makes comparison metrics available

26 Queries as vectors. Queries are treated as small documents. Processing by vector comparison. Result: a ranked list of similar vectors. Options: distance; angle

27 Similarity as distance. Euclidean distance: the difference between vectors, $|\vec{q} - \vec{d}| = \sqrt{\sum_{i=1}^{n} (q_i - d_i)^2}$. Problematic for vectors of different lengths: the distance is very large even when the term distributions are similar. Normalize by the Euclidean length: $|\vec{d}_j| = \sqrt{\sum_{i=1}^{n} d_{i,j}^2}$
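The length problem can be made concrete with a small Python sketch. The two vectors are hypothetical: the second has the same term distribution as the first but ten times the length.

```python
import math


def euclidean_distance(q, d):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((qi - di) ** 2 for qi, di in zip(q, d)))


def euclidean_length(v):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    length = euclidean_length(v)
    return [x / length for x in v] if length > 0 else list(v)


d1 = [1.0, 2.0, 2.0]
d2 = [10.0, 20.0, 20.0]  # same distribution, ten times as long

print(euclidean_distance(d1, d2))
print(euclidean_distance(normalize(d1), normalize(d2)))
```

Before normalization the two vectors are far apart; after normalization they coincide, which is the behaviour one would want for documents with the same term distribution.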

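28 Similarity as angle: the cosine similarity. $\cos(\vec{q}, \vec{d}) = \mathrm{sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$. $q_i$ is the tf-idf value of term $i$ in the query; $d_i$ is the tf-idf value of term $i$ in the document; $|\vec{q}|$ and $|\vec{d}|$ are the lengths of $\vec{q}$ and $\vec{d}$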

29 Cosine for normalized vectors. The cosine similarity of $\vec{q}$ and $\vec{d}$ is equivalent to the cosine of the angle between $\vec{q}$ and $\vec{d}$. For vectors that are already normalized, the cosine equals their dot product: $\cos(\vec{q}, \vec{d}) = \vec{q} \cdot \vec{d} = \sum_i q_i d_i$, where $\vec{q}$ and $\vec{d}$ are normalized by their Euclidean length
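The equivalence between the general cosine formula and the dot product of length-normalized vectors can be checked with a short Python sketch. The example vectors are hypothetical.

```python
import math


def dot(q, d):
    """Dot product of two equal-length vectors."""
    return sum(qi * di for qi, di in zip(q, d))


def cosine_similarity(q, d):
    """q . d / (|q| |d|); defined as 0 here if either vector is all zeros."""
    norm_q = math.sqrt(dot(q, q))
    norm_d = math.sqrt(dot(d, d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot(q, d) / (norm_q * norm_d)


def normalize(v):
    """Scale a non-zero vector to Euclidean length 1."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


q = [1.0, 2.0, 0.0]
d = [2.0, 1.0, 2.0]

print(cosine_similarity(q, d))
print(dot(normalize(q), normalize(d)))
```

Both lines print the same value up to floating-point rounding. In practice this means document vectors can be normalized once at indexing time, so that scoring reduces to a dot product.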

30 Cosine similarity. Figure (not preserved)

31 Worked example for cosine similarity. Comparison of the novels Sense and Sensibility (SaS), Pride and Prejudice (PaP), and Wuthering Heights (WH)

32 Worked example for cosine similarity. Raw term frequency of the terms affection, jealous, gossip, and wuthering in SaS, PaP, and WH (table values not preserved)

41 Worked example for cosine. Two tables for the terms affection, jealous, gossip, and wuthering in SaS, PaP, and WH (table values not preserved): log-tf weighting, and log-tf weighting with cosine normalization. cos(SaS, PaP): only the first factor, 0.789, is preserved. cos(SaS, WH) ≈ 0.79. cos(PaP, WH) ≈ 0.69. Why is cos(SaS, PaP) > cos(SaS, WH)? SaS and PaP are by Jane Austen; WH is by Emily Brontë
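The calculation can be retraced in Python. The raw term counts below are an assumption: they are taken from the corresponding example in [Manning et al., 2008], Chapter 6, because the table values are missing from this transcript. With these counts the sketch yields the two similarities that survive on the slide (about 0.79 and 0.69) and the first normalized weight 0.789.

```python
import math

TERMS = ["affection", "jealous", "gossip", "wuthering"]

# ASSUMPTION: raw counts as in the IIR Chapter 6 example, not from the transcript.
counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH": {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}


def log_tf(tf):
    """Log-dampened term frequency (base 10)."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def normalized_vector(term_counts):
    """Log-tf weights divided by their Euclidean length."""
    weights = [log_tf(term_counts[t]) for t in TERMS]
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights]


vectors = {name: normalized_vector(c) for name, c in counts.items()}


def cos(a, b):
    """Cosine similarity = dot product of the normalized vectors."""
    return sum(x * y for x, y in zip(vectors[a], vectors[b]))


for pair in (("SaS", "PaP"), ("SaS", "WH"), ("PaP", "WH")):
    print(pair, round(cos(*pair), 2))
```

No idf weighting is applied here, matching the log-tf weighting named on the slide.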

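42 Components of tf-idf weighting
Term frequency:
n (natural): $\mathrm{tf}_{t,d}$
l (logarithm): $1 + \log(\mathrm{tf}_{t,d})$
a (augmented): $0.5 + \frac{0.5 \cdot \mathrm{tf}_{t,d}}{\max_t(\mathrm{tf}_{t,d})}$
b (boolean): $1$ if $\mathrm{tf}_{t,d} > 0$, otherwise $0$
L (log ave): $\frac{1 + \log(\mathrm{tf}_{t,d})}{1 + \log(\mathrm{ave}_{t \in d}(\mathrm{tf}_{t,d}))}$
Document frequency:
n (no): $1$
t (idf): $\log \frac{N}{\mathrm{df}_t}$
p (prob idf): $\max\{0, \log \frac{N - \mathrm{df}_t}{\mathrm{df}_t}\}$
Normalization:
n (none): $1$
c (cosine): $\frac{1}{\sqrt{w_1^2 + w_2^2 + \dots + w_M^2}}$
u (pivoted unique): $1/u$
b (byte size): $1/\mathrm{CharLength}^{\alpha}$, $\alpha < 1$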

43 Components of tf-idf weighting (same table as on slide 42). Best-known combination

44 Components of tf-idf weighting (same table as on slide 42). Default: no weighting

45 Components of tf-idf weighting. Queries and documents are often weighted differently. SMART notation: qqq.ddd. Example: ltn.lnc. Query: logarithmic tf, idf, no normalization. Document: logarithmic tf, no idf, cosine normalization. Example: query: best car insurance; document: car insurance auto insurance

46 Worked example for tf-idf with the combination ltn.lnc. Query: best car insurance. Document: car insurance auto insurance. Table with one row per word (auto, best, car, insurance) and the columns: query (tf-raw, tf-wght, df, idf, weight), document (tf-raw, tf-wght, weight, n'lized), product. Key to columns: tf-raw: raw (unweighted) term frequency; tf-wght: logarithmically weighted term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n'lized: document weights after cosine normalization; product: the product of final query weight and final document weight

48 Worked example for tf-idf with the combination ltn.lnc. Query: best car insurance. Document: car insurance auto insurance. Raw term frequencies (the remaining columns are still empty on this slide):
word         query tf-raw   document tf-raw
auto         0              1
best         1              0
car          1              1
insurance    1              2

58 Worked example for tf-idf with the combination ltn.lnc. Query: best car insurance. Document: car insurance auto insurance. Score for the query/document pair: $\sum_i w_{q_i} \cdot w_{d_i} = 3.08$
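The ltn.lnc score can be retraced in Python. The raw term frequencies come from the slides; the document frequencies and the collection size N = 1,000,000 are assumptions taken from the corresponding example in [Manning et al., 2008], Chapter 6, since the df column is missing from this transcript. Computed without intermediate rounding, the score comes out at about 3.07; the 3.08 on the slide is consistent with rounding each weight to two decimals first.

```python
import math

N_DOCS = 1_000_000  # ASSUMPTION: collection size as in the idf example
# ASSUMPTION: document frequencies from the IIR Chapter 6 example.
DF = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}


def log_tf(tf):
    """'l' component: 1 + log10(tf) for tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def ltn_weights(tf_by_term):
    """Query side: log tf ('l') times idf ('t'), no normalization ('n')."""
    return {t: log_tf(tf) * math.log10(N_DOCS / DF[t]) for t, tf in tf_by_term.items()}


def lnc_weights(tf_by_term):
    """Document side: log tf ('l'), no idf ('n'), cosine normalization ('c')."""
    raw = {t: log_tf(tf) for t, tf in tf_by_term.items()}
    length = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / length for t, w in raw.items()}


query_tf = {"auto": 0, "best": 1, "car": 1, "insurance": 1}
doc_tf = {"auto": 1, "best": 0, "car": 1, "insurance": 2}

q = ltn_weights(query_tf)
d = lnc_weights(doc_tf)
score = sum(q[t] * d[t] for t in q)
print(round(score, 2))
```

Only "car" and "insurance" contribute: "auto" is missing from the query and "best" from the document, so their products are 0.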

59 Summary: the vector space model. Advantages: compact representation of document properties; numerical representation; comparison metrics yield graded similarities; ranking of documents relative to the query. Problems: bag of words; wildcards / fuzzy matching; dimensionality / sparseness; polysemy / homonymy

60 VSM vs. Boolean model. VSM: accumulated evidence (term frequency increases the score); suitable only for free-text queries. Boolean model: selective evidence; true if the weight is nonzero. Combination: implicit AND; further operators for refined queries

61 VSM and wildcards. No direct querying possible. The index structures are not compatible (matrix vs. tree). Can be combined via a k-gram index and query expansion: fetch the matching terms from the k-gram index, then construct the query vector from them
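One way to picture this combination is the following Python sketch. It is an illustration under stated assumptions, not the method from the slides: it uses bigrams (k = 2) with `$` as a word-boundary marker, a hypothetical toy vocabulary, and a simple post-filter with `fnmatchcase` to discard false positives. Patterns too short to yield any k-gram are not handled.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def kgrams(s, k=2):
    """Set of all character k-grams of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def build_kgram_index(vocabulary, k=2):
    """Map each k-gram to the vocabulary terms that contain it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(f"${term}$", k):
            index[gram].add(term)
    return index


def expand_wildcard(pattern, index, k=2):
    """Return the vocabulary terms matching a pattern that uses '*' wildcards."""
    pieces = f"${pattern}$".split("*")
    grams = set().union(*(kgrams(p, k) for p in pieces))
    if not grams:
        return set()  # pattern too short for this sketch
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing all k-grams is necessary but not sufficient, so filter again.
    return {t for t in candidates if fnmatchcase(t, pattern)}


vocabulary = ["insurance", "insure", "insult", "car", "auto"]  # hypothetical
index = build_kgram_index(vocabulary)

expanded = expand_wildcard("insur*", index)
query_vector = {term: 1.0 for term in expanded}  # unweighted query vector
print(sorted(query_vector))
```

The expanded terms then act like an ordinary free-text query, and the resulting vector can be weighted and compared with document vectors in the usual way.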

62 VSM and phrase queries. The VSM is not suited to position-dependent search. For multi-word queries, the axes of the individual terms are always activated as well. Can be combined via query parsing

63 What comes next? Evaluation (IIR 8). Web retrieval (IIR 19-21)

64 Luhn, H. P. (1957). A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4):
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Further reading: [Manning et al., 2008], Chapter 6 (see