Bayesian Learning (3)

Universität Potsdam, Institut für Informatik, Lehrstuhl Maschinelles Lernen
Christoph Sawade / Niels Landwehr, Jules Rasetaharison, Tobias Scheffer

Overview
Probabilities, expected values, variance.
Basic concepts of Bayesian learning.
(Bayesian) parameter estimation for probability distributions.
Bayesian linear regression, Naive Bayes.

Overview
Bayesian linear regression.
Model-based classification learning: Naive Bayes.

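Review: Regression
Regression problem: training data $L = (\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$ with feature vectors $\mathbf{x}_i \in \mathbb{R}^m$ and a real-valued target attribute $y_i \in \mathbb{R}$.
Matrix notation: the feature vectors form the columns of $X = (\mathbf{x}_1 \dots \mathbf{x}_N) = \begin{pmatrix} x_{11} & \cdots & x_{N1} \\ \vdots & & \vdots \\ x_{1m} & \cdots & x_{Nm} \end{pmatrix}$, and the associated labels (values of the target attribute) form the vector $\mathbf{y} = (y_1, \dots, y_N)^T$.
Prediction problem: given $L$ and a new test example $\mathbf{x}$, find the optimal prediction $y$ for $\mathbf{x}$.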

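Review: Linear Regression
Model space of linear regression: $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T \mathbf{x} = w_0 + \sum_{i=1}^{m} w_i x_i$, where $\mathbf{w}$ is the parameter vector (weight vector) and $x_0 = 1$ is an additional constant attribute.
$f_{\mathbf{w}}(\mathbf{x})$ depends linearly on the parameters $\mathbf{w}$.
$f_{\mathbf{w}}(\mathbf{x})$ depends linearly on the inputs $\mathbf{x}$.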

Bayesian Regression: Data
Model assumption in Bayesian learning: the process that generates the data.
The true model $f_*$ is drawn from the prior distribution $P(\mathbf{w})$.
The feature vectors $\mathbf{x}_1, \dots, \mathbf{x}_N$ are drawn independently of one another (this step is not modeled).
For each $\mathbf{x}_i$, the label $y_i$ is drawn from the distribution $P(y_i \mid \mathbf{x}_i, f_*)$ (intuition: $y_i \approx f_*(\mathbf{x}_i)$).
The data $L$ are then fully generated.
What does $P(y \mid \mathbf{x}, f_*)$ look like for regression problems?

Bayesian Regression: Data
The assumption that there is a true model $f_*(\mathbf{x}) = \mathbf{x}^T \mathbf{w}_*$ that explains the data perfectly is unrealistic: data never follow a regression line or plane exactly.
Alternative assumption: the data follow $f_*(\mathbf{x})$ up to small random deviations (noise).

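Bayesian Regression: Data
Alternative assumption: the data follow $f_*(\mathbf{x})$ up to small random deviations (noise).
Model assumption: the target attribute $y_i$ is generated as $f_*(\mathbf{x}_i)$ plus normally distributed noise, $y_i = f_*(\mathbf{x}_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(\varepsilon \mid 0, \sigma^2)$.
For a given input $\mathbf{x}_0$ this yields $P(y \mid \mathbf{x}_0, f_*) = \mathcal{N}(y \mid f_*(\mathbf{x}_0), \sigma^2)$.
The parameter $\sigma$ models the strength of the noise.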
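
The generative process above can be sketched in a few lines. This is a minimal illustration only: the true weights `w_star`, the noise level `sigma`, the input range and the function name `generate_data` are all assumptions made for the example, none of them is given on the slides.

```python
import random

def generate_data(w_star, sigma, n, seed=0):
    """Draw n pairs (x, y) with y = f*(x) + eps, eps ~ N(0, sigma^2).

    Each x is (1, x_1); the leading 1 is the constant attribute x_0.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (1.0, rng.uniform(-1.0, 1.0))        # inputs are drawn, but not modeled
        f_star = sum(w * x_i for w, x_i in zip(w_star, x))
        y = f_star + rng.gauss(0.0, sigma)       # add normally distributed noise
        data.append((x, y))
    return data

# Hypothetical true model f*(x) = 0.5 - x and noise level sigma = 0.2.
data = generate_data(w_star=(0.5, -1.0), sigma=0.2, n=5)
```

With `sigma=0.0` every point lies exactly on the line, which is the unrealistic noise-free case discussed before.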

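Bayesian Regression: Predictive Distribution
Goal: the Bayesian prediction $y_* = \arg\max_y P(y \mid \mathbf{x}, L)$.
Reminder: it is computed by Bayesian model averaging, $P(y \mid \mathbf{x}, L) = \int P(y \mid \mathbf{x}, \theta)\, P(\theta \mid L)\, d\theta$, where $P(y \mid \mathbf{x}, \theta)$ is the prediction given the model and $P(\theta \mid L)$ is the model given the training data.
The posterior is $P(\theta \mid L) = \frac{1}{Z} P(L \mid \theta)\, P(\theta)$, with the likelihood $P(L \mid \theta)$ (training data given the model) and the prior $P(\theta)$ over models.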

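Bayesian Regression: Likelihood
Likelihood of the data $L$:
$P(\mathbf{y} \mid X, \mathbf{w}) = P(y_1, \dots, y_N \mid X, \mathbf{w})$
$= \prod_{i=1}^{N} P(y_i \mid \mathbf{x}_i, \mathbf{w})$ (examples are independent)
$= \prod_{i=1}^{N} \mathcal{N}(y_i \mid \mathbf{x}_i^T \mathbf{w}, \sigma^2)$ (using $f_{\mathbf{w}}(\mathbf{x}_i) = \mathbf{x}_i^T \mathbf{w}$)
$= \mathcal{N}(\mathbf{y} \mid X^T \mathbf{w}, \sigma^2 I)$,
where $X^T \mathbf{w} = (\mathbf{x}_1^T \mathbf{w}, \dots, \mathbf{x}_N^T \mathbf{w})^T$ is the vector of predictions.
To verify: this is a multivariate normal distribution with covariance matrix $\sigma^2 I$, where $I$ is the $N \times N$ identity matrix.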
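
The verification step left to the reader can be written out as follows. This is a sketch of the standard argument, using only the symbols defined on this slide.

```latex
\begin{align*}
\prod_{i=1}^{N} \mathcal{N}(y_i \mid \mathbf{x}_i^T \mathbf{w}, \sigma^2)
  &= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left( -\frac{(y_i - \mathbf{x}_i^T \mathbf{w})^2}{2\sigma^2} \right) \\
  &= (2\pi\sigma^2)^{-N/2}
     \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mathbf{x}_i^T \mathbf{w})^2 \right) \\
  &= (2\pi\sigma^2)^{-N/2}
     \exp\!\left( -\frac{1}{2} (\mathbf{y} - X^T \mathbf{w})^T (\sigma^2 I)^{-1} (\mathbf{y} - X^T \mathbf{w}) \right) \\
  &= \mathcal{N}(\mathbf{y} \mid X^T \mathbf{w}, \sigma^2 I),
\end{align*}
```

since $\det(\sigma^2 I) = \sigma^{2N}$, so the normalization constant $(2\pi)^{-N/2} \det(\sigma^2 I)^{-1/2}$ equals $(2\pi\sigma^2)^{-N/2}$.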

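Bayesian Regression: Prior
Bayesian learning places a prior over models $f$. Models are parameterized by the weight vector $\mathbf{w}$, so this is a prior over weight vectors.
The normal distribution is conjugate to itself: for fixed variance, a normally distributed prior and a normally distributed likelihood again yield a normally distributed posterior.
Therefore $\mathbf{w} \sim \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \Sigma_p)$, where $\Sigma_p$ is the covariance matrix, often $\Sigma_p = \sigma_p^2 I$.
$\sigma_p^2$ controls the strength of the prior: we expect small attribute weights, i.e. small $|\mathbf{w}|^2$, and the smaller $\sigma_p^2$ is, the more strongly the weights are pulled toward zero.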

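Bayesian Regression: Posterior
Posterior distribution over models given the data:
$P(\mathbf{w} \mid L) = \frac{1}{Z} P(L \mid \mathbf{w})\, P(\mathbf{w})$ (Bayes' rule)
$= \frac{1}{Z} \mathcal{N}(\mathbf{y} \mid X^T \mathbf{w}, \sigma^2 I)\, \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \Sigma_p)$
$= \mathcal{N}(\mathbf{w} \mid \bar{\mathbf{w}}, A^{-1})$ (stated without proof),
with $\bar{\mathbf{w}} = \sigma^{-2} A^{-1} X \mathbf{y}$ and $A = \sigma^{-2} X X^T + \Sigma_p^{-1}$.
The posterior is again normally distributed, with new mean $\bar{\mathbf{w}}$ and covariance matrix $A^{-1}$.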
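
The two formulas for $A$ and $\bar{\mathbf{w}}$ translate almost directly into NumPy. The following is a minimal sketch, not a reference implementation: the function name `posterior`, the toy data and the values of `sigma2` and `Sigma_p` are assumptions made for illustration.

```python
import numpy as np

def posterior(X, y, sigma2, Sigma_p):
    """Posterior N(w | w_bar, A^-1) of Bayesian linear regression.

    X has shape (m, N): one column per training instance, as on the slides.
    """
    A = X @ X.T / sigma2 + np.linalg.inv(Sigma_p)   # A = sigma^-2 X X^T + Sigma_p^-1
    w_bar = np.linalg.solve(A, X @ y) / sigma2      # w_bar = sigma^-2 A^-1 X y
    return w_bar, A

# Toy data (assumed): 20 points around the line 0.5 - x with noise sigma = 0.2.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
X = np.vstack([np.ones_like(x), x])                 # constant attribute x_0 = 1
y = 0.5 - 1.0 * x + rng.normal(0, 0.2, size=20)
w_bar, A = posterior(X, y, sigma2=0.2 ** 2, Sigma_p=np.eye(2))
```

`np.linalg.solve` is used instead of forming $A^{-1}$ explicitly, which is the usual choice for numerical stability. With a very broad prior the result approaches the ordinary least-squares solution.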

Bayesian Regression: Posterior
Posterior: $P(\mathbf{w} \mid L) = \mathcal{N}(\mathbf{w} \mid \bar{\mathbf{w}}, A^{-1})$.
MAP hypothesis: $\mathbf{w}_{MAP} = {?}$

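Bayesian Regression: Posterior
Posterior: $P(\mathbf{w} \mid L) = \mathcal{N}(\mathbf{w} \mid \bar{\mathbf{w}}, A^{-1})$.
MAP hypothesis: $\mathbf{w}_{MAP} = \bar{\mathbf{w}} = \sigma^{-2} A^{-1} X \mathbf{y}$, because a normal density attains its maximum at its mean.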
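
A short worked step shows how the prior strength enters the MAP solution. It assumes the isotropic prior $\Sigma_p = \sigma_p^2 I$ mentioned on the prior slide; the symbol $\lambda$ is introduced here only as shorthand.

```latex
\begin{align*}
\mathbf{w}_{MAP}
  &= \sigma^{-2} \left( \sigma^{-2} X X^T + \sigma_p^{-2} I \right)^{-1} X \mathbf{y} \\
  &= \left( X X^T + \lambda I \right)^{-1} X \mathbf{y},
  \qquad \lambda = \frac{\sigma^2}{\sigma_p^2}.
\end{align*}
```

Under this assumption the MAP hypothesis has the form of a regularized least-squares solution: a small prior variance $\sigma_p^2$ gives a large $\lambda$ and shrinks the weights, while $\sigma_p^2 \to \infty$ recovers ordinary least squares.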

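Sequential Update of the Posterior
The instances are independent, so the posterior can be computed as a sequential update by multiplying in the likelihoods of the individual instances:
$P(\mathbf{w} \mid L) \propto P(\mathbf{w})\, P(L \mid \mathbf{w}) = P(\mathbf{w}) \prod_{i=1}^{N} P(y_i \mid \mathbf{x}_i, \mathbf{w})$.
The likelihood of each $y_i$ is multiplied onto the prior individually.
Let $P_0(\mathbf{w}) = P(\mathbf{w})$, and let $P_n(\mathbf{w})$ be the posterior obtained when only the first $n$ instances in $L$ are used:
$P(\mathbf{w} \mid L) \propto P(\mathbf{w})\, P(y_1 \mid \mathbf{x}_1, \mathbf{w})\, P(y_2 \mid \mathbf{x}_2, \mathbf{w})\, P(y_3 \mid \mathbf{x}_3, \mathbf{w}) \cdots P(y_N \mid \mathbf{x}_N, \mathbf{w})$,
where the first $n + 1$ factors give $P_n(\mathbf{w})$ up to normalization, i.e. $P_n(\mathbf{w}) \propto P_{n-1}(\mathbf{w})\, P(y_n \mid \mathbf{x}_n, \mathbf{w})$ and $P_N(\mathbf{w}) = P(\mathbf{w} \mid L)$.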

Sequential Update of the Posterior
Sequential update: look at the data points one after another.
New information (data points) changes the distribution over $\mathbf{w}$ step by step.
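
For the normal model the sequential update can be carried out in closed form by accumulating the precision matrix and the precision-weighted mean, one instance at a time. The sketch below assumes the posterior formulas $A = \sigma^{-2} X X^T + \Sigma_p^{-1}$ and $\bar{\mathbf{w}} = \sigma^{-2} A^{-1} X \mathbf{y}$; the function names and the toy data are illustrative assumptions.

```python
import numpy as np

def batch_posterior(X, y, sigma2, Sigma_p):
    """Posterior mean and precision computed from all N instances at once."""
    A = X @ X.T / sigma2 + np.linalg.inv(Sigma_p)
    return np.linalg.solve(A, X @ y) / sigma2, A

def sequential_posterior(X, y, sigma2, Sigma_p):
    """Same posterior, folding in one instance at a time; X has shape (m, N)."""
    A = np.linalg.inv(Sigma_p)           # precision of P_0(w) = P(w)
    b = np.zeros(X.shape[0])             # precision-weighted mean; zero for the zero-mean prior
    for x_n, y_n in zip(X.T, y):
        # P_n(w) is proportional to P_{n-1}(w) * P(y_n | x_n, w)
        A = A + np.outer(x_n, x_n) / sigma2
        b = b + x_n * y_n / sigma2
    return np.linalg.solve(A, b), A

# Toy data (assumed): 10 points around the line 0.5 - x with noise sigma = 0.2.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=10)
X = np.vstack([np.ones_like(x), x])
y = 0.5 - 1.0 * x + rng.normal(0, 0.2, size=10)
w_seq, A_seq = sequential_posterior(X, y, 0.04, np.eye(2))
w_bat, A_bat = batch_posterior(X, y, 0.04, np.eye(2))
```

Both routes give the same mean and precision, because $X X^T = \sum_n \mathbf{x}_n \mathbf{x}_n^T$ and $X \mathbf{y} = \sum_n \mathbf{x}_n y_n$.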

Example: Bayesian Regression
$f(x) = w_0 + w_1 x$ (one-dimensional regression).
Sequential update: $P_0(\mathbf{w}) = P(\mathbf{w})$.
[Figure: the prior $P_0(\mathbf{w}) = P(\mathbf{w})$ and samples from $P_0(\mathbf{w})$.]

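Example: Bayesian Regression
$f(x) = w_0 + w_1 x$ (one-dimensional regression).
Sequential update: $P_1(\mathbf{w}) \propto P_0(\mathbf{w})\, P(y_1 \mid x_1, \mathbf{w})$.
The likelihood $P(y_1 \mid x_1, \mathbf{w})$ of the data point $(x_1, y_1)$ follows from $y_1 = f(x_1) + \varepsilon_1 = w_0 + w_1 x_1 + \varepsilon_1$, that is, $w_0 = -w_1 x_1 + y_1 - \varepsilon_1$. The likelihood is therefore high for weight vectors close to a line in the $(w_0, w_1)$ plane.
[Figure: likelihood $P(y_1 \mid x_1, \mathbf{w})$, data point $(x_1, y_1)$, and samples from $P_1(\mathbf{w})$.]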

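Example: Bayesian Regression
$f(x) = w_0 + w_1 x$ (one-dimensional regression).
Sequential update: $P_1(\mathbf{w}) \propto P_0(\mathbf{w})\, P(y_1 \mid x_1, \mathbf{w})$.
[Figure: likelihood $P(y_1 \mid x_1, \mathbf{w})$, posterior $P_1(\mathbf{w})$, and samples from $P_1(\mathbf{w})$.]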

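Example: Bayesian Regression
$f(x) = w_0 + w_1 x$ (one-dimensional regression).
Sequential update: $P_2(\mathbf{w}) \propto P_1(\mathbf{w})\, P(y_2 \mid x_2, \mathbf{w})$.
[Figure: likelihood $P(y_2 \mid x_2, \mathbf{w})$, posterior $P_2(\mathbf{w})$, and samples from $P_2(\mathbf{w})$.]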

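Example: Bayesian Regression
$f(x) = w_0 + w_1 x$ (one-dimensional regression).
Sequential update: $P_N(\mathbf{w}) \propto P_{N-1}(\mathbf{w})\, P(y_N \mid x_N, \mathbf{w})$.
[Figure: likelihood $P(y_N \mid x_N, \mathbf{w})$, posterior $P_N(\mathbf{w})$, and samples from $P_N(\mathbf{w})$.]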

Bayesian Regression: Predictive Distribution
Bayes hypothesis: the most probable $y$, $y_* = \arg\max_y P(y \mid \mathbf{x}, L)$.
Reminder: it is computed by Bayesian model averaging, $P(y \mid \mathbf{x}, L) = \int P(y \mid \mathbf{x}, \theta)\, P(\theta \mid L)\, d\theta$, where $P(y \mid \mathbf{x}, \theta)$ is the prediction given the model and $P(\theta \mid L)$ is the model given the training data.
Bayesian prediction: average the prediction over all models, weighting each model by how probable it is a posteriori.

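Bayesian Regression: Predictive Distribution
The predictive distribution is again normal (stated without proof):
$P(y \mid \mathbf{x}, L) = \int P(y \mid \mathbf{x}, \mathbf{w})\, P(\mathbf{w} \mid L)\, d\mathbf{w}$
$= \int \mathcal{N}(y \mid \mathbf{x}^T \mathbf{w}, \sigma^2)\, \mathcal{N}(\mathbf{w} \mid \bar{\mathbf{w}}, A^{-1})\, d\mathbf{w}$
$= \mathcal{N}(y \mid \mathbf{x}^T \bar{\mathbf{w}}, \mathbf{x}^T A^{-1} \mathbf{x})$,
with $\bar{\mathbf{w}} = \sigma^{-2} A^{-1} X \mathbf{y}$ and $A = \sigma^{-2} X X^T + \Sigma_p^{-1}$.
Optimal prediction: the input vector $\mathbf{x}$ is multiplied by $\bar{\mathbf{w}}$, i.e. $y_* = \mathbf{x}^T \bar{\mathbf{w}}$.
Strictly, $\mathbf{x}^T A^{-1} \mathbf{x}$ is the posterior variance of the noise-free value $f(\mathbf{x}) = \mathbf{x}^T \mathbf{w}$; the variance of a noisy observation $y$ is $\sigma^2 + \mathbf{x}^T A^{-1} \mathbf{x}$.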

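Bayesian Regression: Predictive Distribution
Bayesian regression yields not only the regression function $f(\mathbf{x}) = \mathbf{x}^T \bar{\mathbf{w}}$ but also a density for $y$, $\mathcal{N}(y \mid \mathbf{x}^T \bar{\mathbf{w}}, \mathbf{x}^T A^{-1} \mathbf{x})$, and therefore a confidence corridor.
[Figure: regression function $f(\mathbf{x}) = \mathbf{x}^T \bar{\mathbf{w}}$ with a confidence corridor, e.g. 95% confidence.]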
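
The predictive mean, the variance $\mathbf{x}^T A^{-1} \mathbf{x}$ and a 95% corridor can be sketched as follows. The function name `predictive`, the toy data, the prior $\Sigma_p = I$ and the noise level are assumptions for illustration; the factor 1.96 is the usual two-sided 95% quantile of the standard normal distribution.

```python
import numpy as np

def predictive(x_new, w_bar, A, z=1.96):
    """Mean, variance and corridor of N(y | x^T w_bar, x^T A^-1 x).

    z = 1.96 covers about 95% of a normal distribution.
    """
    mean = x_new @ w_bar
    var = x_new @ np.linalg.solve(A, x_new)       # x^T A^-1 x, as on the slide
    half_width = z * np.sqrt(var)
    return mean, var, (mean - half_width, mean + half_width)

# Toy data (assumed): 20 points around the line 0.5 - x with noise sigma = 0.2.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
X = np.vstack([np.ones_like(x), x])               # shape (m, N), x_0 = 1
y = 0.5 - 1.0 * x + rng.normal(0, 0.2, size=20)
sigma2 = 0.2 ** 2
A = X @ X.T / sigma2 + np.eye(2)                  # Sigma_p = I, so Sigma_p^-1 = I
w_bar = np.linalg.solve(A, X @ y) / sigma2

mean_near, var_near, corridor_near = predictive(np.array([1.0, 0.0]), w_bar, A)
mean_far, var_far, corridor_far = predictive(np.array([1.0, 5.0]), w_bar, A)
```

The corridor widens for inputs far from the training data (`var_far` exceeds `var_near`), which is the behavior the confidence corridor is meant to show. Adding `sigma2` to `var` would give the variance of a noisy observation instead of the noise-free function value.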

Nonlinear Basis Functions
Limitation of the models so far: only linear dependencies between $\mathbf{x}$ and $f(\mathbf{x})$.
[Figure: linear data and nonlinear data.]
In many cases a nonlinear dependency is useful: a larger class of functions can be represented.

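Nonlinear Basis Functions
Simplest approach: linear regression on nonlinear basis functions.
Idea: do not work on the original $\mathbf{x}$ but on a nonlinear transformation $\phi(\mathbf{x})$.
This is simple: inference for learning and prediction remains unchanged in principle.
Basis functions $\phi_1, \dots, \phi_d : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ is the instance space (usually $\mathcal{X} = \mathbb{R}^m$).
$\phi(\mathbf{x}) = (\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \dots, \phi_d(\mathbf{x}))^T \in \mathbb{R}^d$, so $\phi : \mathbb{R}^m \to \mathbb{R}^d$, usually with $d \gg m$.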

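Nonlinear Basis Functions
Linear regression in the basis functions: $f(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i \phi_i(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x})$.
$f(\mathbf{x})$ is a linear combination of basis functions.
Intuition: map the data into the higher-dimensional space $\phi(\mathcal{X})$ and perform linear regression there.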

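Nonlinear Basis Functions: Example
Example: $\mathcal{X} = \mathbb{R}$, $\phi_1(x) = x$, $\phi_2(x) = x^2$, and $f(x) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x)$.
A nonlinear function in $\mathcal{X}$ can be represented as a linear function in $\phi(\mathcal{X})$, for example the quadratic $f(x) = 1 - 3x + x^2$.
[Figure: $y$ plotted against $x$, and $y$ plotted against $(x, x^2)$.]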
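
The point of the example is that the function is nonlinear in $x$ but linear in the weights. A minimal sketch, in which the weight values are chosen only for illustration and the leading constant 1 in `phi` is an assumed convention that pairs with $w_0$:

```python
def phi(x):
    """Map a scalar x to (1, phi_1(x), phi_2(x)) = (1, x, x^2)."""
    return (1.0, x, x * x)

def f(x, w):
    """f(x) = w^T phi(x): a linear function of the transformed input."""
    return sum(w_i * p_i for w_i, p_i in zip(w, phi(x)))

w = (1.0, -3.0, 1.0)     # illustrative weights: f(x) = 1 - 3x + x^2
values = [f(x, w) for x in (0.0, 1.0, 2.0, 3.0)]
print(values)            # [1.0, -1.0, -1.0, 1.0]
```

The printed values trace a parabola in $x$, although `f` only ever computes a weighted sum of the three features.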

Nonlinear Basis Functions
Examples of nonlinear basis functions: polynomials, $\phi_j(x) = x^j$.

Nonlinear Basis Functions
Examples of nonlinear basis functions: Gaussian curves, $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)$, with centers $\mu_1, \dots, \mu_d$ and fixed variance $s^2$.

Nonlinear Basis Functions
Examples of nonlinear basis functions: sigmoids, $\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)$ with $\sigma(a) = \frac{1}{1 + \exp(-a)}$, centers $\mu_1, \dots, \mu_d$ and fixed scaling $s$.
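
The three families of basis functions can be written down directly from their definitions. The function names below are illustrative; the formulas are the ones given on the slides.

```python
import math

def polynomial_basis(x, j):
    """phi_j(x) = x^j."""
    return x ** j

def gaussian_basis(x, mu_j, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), centered at mu_j with width s."""
    return math.exp(-(x - mu_j) ** 2 / (2.0 * s ** 2))

def sigmoid_basis(x, mu_j, s):
    """phi_j(x) = sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a))."""
    a = (x - mu_j) / s
    return 1.0 / (1.0 + math.exp(-a))

print(polynomial_basis(2.0, 3))         # 8.0
print(gaussian_basis(0.5, 0.5, 0.1))    # 1.0 (maximum at the center)
print(sigmoid_basis(0.5, 0.5, 0.1))     # 0.5 (midpoint at the center)
```

Gaussian basis functions are local (they vanish far from $\mu_j$), whereas polynomials are global: changing one weight affects the function everywhere.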

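Regression with Basis Functions
The function $\phi$ maps the $m$-dimensional input vector $\mathbf{x}$ to a $d$-dimensional feature vector $\phi(\mathbf{x})$.
Regression model: $f(\mathbf{x}) = \phi(\mathbf{x})^T \mathbf{w}$.
The optimal prediction is as before, with $\phi(\mathbf{x})$ in place of $\mathbf{x}$:
$P(y \mid \mathbf{x}, L) = \mathcal{N}(y \mid \phi(\mathbf{x})^T \bar{\mathbf{w}}, \phi(\mathbf{x})^T A^{-1} \phi(\mathbf{x}))$,
$y_* = \arg\max_y P(y \mid \mathbf{x}, L) = \phi(\mathbf{x})^T \bar{\mathbf{w}}$,
where $\phi(\mathbf{x})$ is the transformed test instance, $A = \sigma^{-2} \Phi \Phi^T + \Sigma_p^{-1}$, $\bar{\mathbf{w}} = \sigma^{-2} A^{-1} \Phi \mathbf{y}$, and $\Phi = \phi(X)$ is the transformed data matrix.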

Example: Regression with Nonlinear Basis Functions
Generate $N = 25$ nonlinear data points by $y = \sin(2\pi x) + \varepsilon$, with $\varepsilon \sim \mathcal{N}(\varepsilon \mid 0, \sigma^2)$ and $x \in [0, 1]$.
Use 9 Gaussian basis functions $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)$ with $\mu_1 = 0.1, \dots, \mu_9 = 0.9$.
What do the posterior $P(\mathbf{w} \mid L)$ and the predictive distribution $P(y \mid x, L)$ look like?
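
The experiment can be reproduced approximately with the sketch below. The slides do not state the noise level $\sigma$, the basis width $s$ or the prior covariance, so `sigma = 0.2`, `s = 0.1` and `Sigma_p = I` are assumed values, as is the random seed.

```python
import numpy as np

def gaussian_features(x, centers, s):
    """Rows: basis functions phi_j; columns: inputs. Shape (d, len(x))."""
    return np.exp(-(x[None, :] - centers[:, None]) ** 2 / (2.0 * s ** 2))

rng = np.random.default_rng(0)
N, sigma, s = 25, 0.2, 0.1                 # sigma and s are assumed values
centers = np.linspace(0.1, 0.9, 9)         # mu_1 = 0.1, ..., mu_9 = 0.9
x = rng.uniform(0.0, 1.0, size=N)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=N)

Phi = gaussian_features(x, centers, s)     # transformed data matrix, shape (9, N)
Sigma_p = np.eye(9)                        # assumed prior covariance
A = Phi @ Phi.T / sigma ** 2 + np.linalg.inv(Sigma_p)
w_bar = np.linalg.solve(A, Phi @ y) / sigma ** 2

x_test = np.linspace(0.0, 1.0, 101)
Phi_test = gaussian_features(x_test, centers, s)       # shape (9, 101)
pred_mean = Phi_test.T @ w_bar                         # phi(x)^T w_bar per test input
pred_var = np.einsum("ij,ij->j", Phi_test, np.linalg.solve(A, Phi_test))  # phi(x)^T A^-1 phi(x)
```

Plotting `pred_mean` with a band of width proportional to `np.sqrt(pred_var)` for growing subsets of the data shows how the predictive distribution tightens as more points are observed.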

Predictive Distribution
[Figure: predictive distribution of $f(x)$ for $N = 1$, $N = 2$, $N = 4$ and $N = 25$ data points generated by $y = \sin(2\pi x) + \varepsilon$.]

Samples from the Posterior
[Figure: functions sampled from the posterior for $N = 1$, $N = 2$, $N = 4$ and $N = 25$.]

Overview
Bayesian linear regression.
Model-based classification learning: Naive Bayes.

Bayesian Classification
Optimal prediction of $y$ given a new $\mathbf{x}$: Bayesian model averaging, $y_* = \arg\max_y P(y \mid \mathbf{x}, L) = \arg\max_y \int P(y \mid \mathbf{x}, \theta)\, P(\theta \mid L)\, d\theta$.
Regression: closed-form solution, the predictive distribution is normal.
Classification: no closed-form solution for the predictive distribution.
Second-best approach: the MAP hypothesis.
Sometimes the posterior has a closed-form solution (naive Bayes: yes, logistic regression: no).
The MAP hypothesis can be computed (numerically if necessary).

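Classification Problems
Training data: $L = (\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$ with feature vectors $\mathbf{x}_i$ and discrete class labels $y_i$.
Matrix notation for the training data: the feature vectors form the columns of $X = (\mathbf{x}_1 \dots \mathbf{x}_N) = \begin{pmatrix} x_{11} & \cdots & x_{N1} \\ \vdots & & \vdots \\ x_{1m} & \cdots & x_{Nm} \end{pmatrix}$, and the associated class labels form the vector $\mathbf{y} = (y_1, \dots, y_N)^T$.
Learning: the MAP model $\theta_{MAP} = \arg\max_\theta P(\theta \mid L) = \arg\max_\theta P(L \mid \theta)\, P(\theta)$.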

Model-Based and Discriminative Learning
Likelihood $P(L \mid \theta)$: which part of the data $L$ is modeled?
Discriminative learning: $\theta_{MAP} = \arg\max_\theta P(\theta)\, P(\mathbf{y} \mid X, \theta)$ (discriminative likelihood). $\theta$ is chosen so that it models the values of the class variable $y$ in the data well. The classifier only has to predict $y$ well for each $\mathbf{x}$, so why should a good model of $X$ matter?
Generative (model-based) learning: $\theta_{MAP} = \arg\max_\theta P(\theta)\, P(\mathbf{y}, X \mid \theta)$ (generative likelihood). $\theta$ is chosen so that it models both the feature vectors $X$ and the values of the class variable $y$ in the data well.

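Model-Based: Naive Bayes
Naive Bayes is model-based classification: $\theta_{MAP} = \arg\max_\theta P(\theta)\, P(\mathbf{y}, X \mid \theta)$.
Likelihood of the data $L$, consisting of $N$ independent instances with class labels:
$P(L \mid \theta) = P(\mathbf{x}_1, \dots, \mathbf{x}_N, y_1, \dots, y_N \mid \theta) = \prod_{i=1}^{N} P(\mathbf{x}_i, y_i \mid \theta)$.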

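Model-Based: Naive Bayes
How do we model $P(\mathbf{x}, y \mid \theta)$? As a joint distribution, using the product rule: $P(\mathbf{x}, y \mid \theta) = P(y \mid \theta)\, P(\mathbf{x} \mid y, \theta)$.
$P(y \mid \theta)$ is the class probability, e.g. $P(\text{spam})$ vs. $P(\text{not spam})$.
$P(\mathbf{x} \mid y, \theta)$ is the distribution of $\mathbf{x}$ given the class, e.g. the word distribution in spam emails.
How do we model $P(\mathbf{x} \mid y, \theta)$? The vector $\mathbf{x} = (x_1, \dots, x_m)^T$ is high-dimensional and takes $2^m$ different values (for binary $x_i$).
This motivates the naive independence assumption.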

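Naive Bayes: Independence Assumption
Conditional independence assumption: $P(\mathbf{x} \mid y, \theta) = \prod_{i=1}^{m} P(x_i \mid y, \theta)$, i.e. the attributes $x_i$ are independent given the class $y$.
Assumption: two classes, binary attributes.
Modeled distributions (model parameters):
$P(y \mid \theta)$ is Bernoulli with parameter $\theta^{y} = P(y = 1 \mid \theta)$.
For $i \in \{1, \dots, m\}$ (attributes) and $c \in \{0, 1\}$ (classes): $P(x_i \mid y = c, \theta)$ is Bernoulli with parameter $\theta^{x_i \mid c} = P(x_i = 1 \mid \theta, y = c)$.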

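Naive Bayes: Likelihood
Likelihood of the data $L$ under the assumptions so far:
$P(L \mid \theta) = \prod_{j=1}^{N} P(\mathbf{x}_j, y_j \mid \theta)$ (independence of instances)
$= \prod_{j=1}^{N} P(y_j \mid \theta)\, P(\mathbf{x}_j \mid y_j, \theta)$ (product rule)
$= \prod_{j=1}^{N} P(y_j \mid \theta^{y}) \prod_{i=1}^{m} P(x_{ji} \mid y_j, \theta^{x_i \mid y_j})$ (conditional independence of attributes; only the responsible model parameters appear)
$= \left( \prod_{j=1}^{N} P(y_j \mid \theta^{y}) \right) \left( \prod_{i=1}^{m} \prod_{j:\, y_j = 0} P(x_{ji} \mid y_j, \theta^{x_i \mid 0}) \right) \left( \prod_{i=1}^{m} \prod_{j:\, y_j = 1} P(x_{ji} \mid y_j, \theta^{x_i \mid 1}) \right)$.
The three factors are the label likelihood, the feature likelihood of the negative instances, and the feature likelihood of the positive instances.
Here $y_j$ is the class label of the $j$-th instance and $x_{ji}$ is the value of the $i$-th feature of the $j$-th instance.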

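Naive Bayes: Posterior
Prior distribution, independent for the individual parameters:
$P(\theta) = P(\theta^{y}) \left( \prod_{i=1}^{m} P(\theta^{x_i \mid 0}) \right) \left( \prod_{i=1}^{m} P(\theta^{x_i \mid 1}) \right)$,
i.e. a label prior, a feature prior for the negative instances, and a feature prior for the positive instances.
Posterior: independent blocks of the form "prior times coin-toss likelihood":
$P(\theta \mid L) = \frac{1}{Z} \left( P(\theta^{y}) \prod_{j=1}^{N} P(y_j \mid \theta^{y}) \right) \left( \prod_{i=1}^{m} P(\theta^{x_i \mid 0}) \prod_{j:\, y_j = 0} P(x_{ji} \mid y_j, \theta^{x_i \mid 0}) \right) \left( \prod_{i=1}^{m} P(\theta^{x_i \mid 1}) \prod_{j:\, y_j = 1} P(x_{ji} \mid y_j, \theta^{x_i \mid 1}) \right)$.
Each product over $j$ is the likelihood of a sequence of coin tosses.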

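Naive Bayes: Posterior
Conjugate prior: the Beta distribution.
$P(\theta^{y}) \sim \mathrm{Beta}(\theta^{y} \mid \alpha_1, \alpha_0)$.
For $i \in \{1, \dots, m\}$ (attributes) and $c \in \{0, 1\}$ (classes): $P(\theta^{x_i \mid c}) \sim \mathrm{Beta}(\theta^{x_i \mid c} \mid \alpha_{x_i \mid c}, \alpha_{\bar{x}_i \mid c})$, where $\theta^{x_i \mid c} = P(x_i = 1 \mid \theta, y = c)$.
Because the prior is conjugate, the posterior distribution in each block is again a Beta distribution.
A-posteriori distribution for naive Bayes: the standard solution for the coin-toss scenario.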

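Model-Based: Naive Bayes
A-posteriori distribution for the parameter $\theta^{y}$:
$P(\theta^{y} \mid L) = \frac{1}{Z} P(\theta^{y}) \prod_{j=1}^{N} P(y_j \mid \theta^{y})$ (Beta prior times coin-toss likelihood)
$= \mathrm{Beta}(\theta^{y} \mid \alpha_1 + N_1, \alpha_0 + N_0)$ (Beta-distributed posterior),
with $N_0$: number of examples with class 0 in $L$, and $N_1$: number of examples with class 1 in $L$.
MAP estimate: $\theta^{y}_{MAP} = \arg\max_{\theta^{y}} P(\theta^{y} \mid L) = \frac{N_1 + \alpha_1 - 1}{N_0 + \alpha_0 + N_1 + \alpha_1 - 2}$.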
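
The MAP estimate is the mode of the Beta posterior and needs only the counts and the prior parameters. A minimal sketch with exact fractions; the function name `beta_map` and the example counts are assumptions for illustration, and the formula requires both posterior parameters to be greater than 1.

```python
from fractions import Fraction

def beta_map(n_pos, n_neg, alpha_pos, alpha_neg):
    """Mode of Beta(theta | alpha_pos + n_pos, alpha_neg + n_neg)."""
    return Fraction(n_pos + alpha_pos - 1,
                    n_pos + alpha_pos + n_neg + alpha_neg - 2)

print(beta_map(2, 1, 2, 2))   # 3/5: two positives, one negative, prior alpha = 2
print(beta_map(2, 1, 1, 1))   # 2/3: the uniform prior gives the relative frequency
```

With $\alpha = 2$ the prior acts like one pseudo-observation per outcome, which keeps the estimate away from 0 and 1 even when one outcome was never observed.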

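Model-Based: Naive Bayes
A-posteriori distribution for the parameters $\theta^{x_i \mid c}$, for $i \in \{1, \dots, m\}$ (attributes) and $c \in \{0, 1\}$ (classes):
$P(\theta^{x_i \mid c} \mid L) = \frac{1}{Z} P(\theta^{x_i \mid c}) \prod_{j:\, y_j = c} P(x_{ji} \mid y_j, \theta^{x_i \mid c})$ (Beta prior times coin-toss likelihood)
$= \mathrm{Beta}(\theta^{x_i \mid c} \mid \alpha_{x_i \mid c} + N_{x_i \mid c}, \alpha_{\bar{x}_i \mid c} + N_{\bar{x}_i \mid c})$ (Beta-distributed posterior),
with $N_{x_i \mid c}$: number of examples with $x_i = 1$ and class $c$ in $L$, and $N_{\bar{x}_i \mid c}$: number of examples with $x_i = 0$ and class $c$ in $L$.
MAP estimate: $\theta^{x_i \mid c}_{MAP} = \frac{N_{x_i \mid c} + \alpha_{x_i \mid c} - 1}{N_{x_i \mid c} + \alpha_{x_i \mid c} + N_{\bar{x}_i \mid c} + \alpha_{\bar{x}_i \mid c} - 2}$.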

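Naive Bayes: Learning Algorithm
Input: $L = (\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$.
Estimate the class distribution: $\theta^{y}_{MAP} = \frac{N_1 + \alpha_1 - 1}{N_0 + \alpha_0 + N_1 + \alpha_1 - 2}$, with $N_1$: number of examples with class 1 in $L$, and $N_0$: number of examples with class 0 in $L$.
For the classes $y = 0$ and $y = 1$ and for all attributes $i$, estimate: $\theta^{x_i \mid y}_{MAP} = \frac{N_{x_i \mid y} + \alpha_{x_i \mid y} - 1}{N_{x_i \mid y} + \alpha_{x_i \mid y} + N_{\bar{x}_i \mid y} + \alpha_{\bar{x}_i \mid y} - 2}$, with $N_{x_i \mid y}$: number of examples with $x_i = 1$ and class $y$ in $L$, and $N_{\bar{x}_i \mid y}$: number of examples with $x_i = 0$ and class $y$ in $L$.
All model parameters are now learned.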
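
The learning algorithm amounts to counting. The sketch below assumes one shared prior parameter `alpha` for all Beta distributions (with `alpha` greater than 1 so that every estimate is defined); the function name `train_naive_bayes` and the three toy instances are illustrative assumptions.

```python
from fractions import Fraction

def train_naive_bayes(X, y, alpha=2):
    """MAP estimates for binary naive Bayes with Beta(alpha, alpha) priors.

    X: list of binary attribute tuples, y: list of labels in {0, 1}.
    Returns theta_y = P(y=1) and theta_x[c][i] = P(x_i=1 | y=c).
    """
    N, m = len(y), len(X[0])
    n_class = {c: sum(1 for label in y if label == c) for c in (0, 1)}
    theta_y = Fraction(n_class[1] + alpha - 1, N + 2 * alpha - 2)
    theta_x = {}
    for c in (0, 1):
        theta_x[c] = []
        for i in range(m):
            # Count instances of class c whose i-th attribute equals 1.
            n_ones = sum(1 for x, label in zip(X, y) if label == c and x[i] == 1)
            theta_x[c].append(Fraction(n_ones + alpha - 1, n_class[c] + 2 * alpha - 2))
    return theta_y, theta_x

# Toy data (assumed for illustration): three instances with two binary attributes.
X = [(1, 1), (1, 0), (0, 1)]
y = [1, 1, 0]
theta_y, theta_x = train_naive_bayes(X, y)
print(theta_y)       # 3/5
print(theta_x[1])    # [Fraction(3, 4), Fraction(1, 2)]
```

Each instance is visited once per class and attribute, which matches a learning cost that grows with the number of instances, classes and attributes.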

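Naive Bayes: Classification
Input: $\mathbf{x} = (x_1, \dots, x_m)^T$.
Output: $y_* = \arg\max_y P(y \mid \mathbf{x}, \theta_{MAP}) = \arg\max_y P(y \mid \theta^{y}_{MAP}) \prod_{i=1}^{m} P(x_i \mid y, \theta^{x_i \mid y}_{MAP})$, i.e. the class distribution times the product of the attribute distributions given the class.
Running time for classification: $O(|Y| \cdot m)$, where $m$ is the number of attributes and $|Y| = 2$.
Running time for learning: $O(N \cdot |Y| \cdot m)$, where $N$ is the number of training instances.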
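
The classification rule can be sketched as below. As a design choice that is not part of the slides, the product is evaluated as a sum of logarithms, which avoids numerical underflow when the number of attributes $m$ is large. The function name `classify` and the parameter values are illustrative assumptions; all parameters must lie strictly between 0 and 1.

```python
import math

def classify(x, theta_y, theta_x):
    """Return argmax_y P(y) * prod_i P(x_i | y), computed with log-probabilities.

    theta_y = P(y=1); theta_x[c][i] = P(x_i=1 | y=c); x is a tuple of 0/1 values.
    """
    log_scores = {}
    for c in (0, 1):
        log_score = math.log(theta_y if c == 1 else 1.0 - theta_y)
        for x_i, theta in zip(x, theta_x[c]):
            log_score += math.log(theta if x_i == 1 else 1.0 - theta)
        log_scores[c] = log_score
    return max(log_scores, key=log_scores.get)

# Illustrative parameters: P(y=1) = 0.6 and P(x_i=1 | y=c) for two attributes.
theta_y = 0.6
theta_x = {0: [1 / 3, 2 / 3], 1: [0.75, 0.5]}
print(classify((0, 0), theta_y, theta_x))   # 0
print(classify((1, 0), theta_y, theta_x))   # 1
```

The two nested loops run over the classes and the attributes, which is the $O(|Y| \cdot m)$ cost stated above.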

Naive Bayes: Example
Training data (Schufa pos.: positive record at the Schufa credit bureau):
Instance   x_1: Schufa pos.   x_2: Student   y: repayment OK?
1          1                  1              1
2          1                  0              1
3          0                  1              0
Prior: all parameters $\alpha$ of the Beta distributions are set to $\alpha = 2$ (pseudocounts: $\alpha - 1 = 1$).
Learned parameters/hypothesis?

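Naive Bayes: Example
Learned parameters/hypothesis (to be filled in):
Class distribution $P(y)$: $P(y = 0) = {?}$, $P(y = 1) = {?}$.
Feature distributions $P(x_i \mid y)$:
$P(x_1 = 0 \mid y = 0) = {?}$, $P(x_1 = 1 \mid y = 0) = {?}$.
$P(x_2 = 0 \mid y = 0) = {?}$, $P(x_2 = 1 \mid y = 0) = {?}$.
$P(x_1 = 0 \mid y = 1) = {?}$, $P(x_1 = 1 \mid y = 1) = {?}$.
$P(x_2 = 0 \mid y = 1) = {?}$, $P(x_2 = 1 \mid y = 1) = {?}$.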

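Naive Bayes: Example
Learned parameters/hypothesis:
Class prior $P(y)$: $P(y = 0) = 2/5$, $P(y = 1) = 3/5$.
Feature distributions $P(x_i \mid y)$:
$P(x_1 = 0 \mid y = 0) = 2/3$, $P(x_1 = 1 \mid y = 0) = 1/3$.
$P(x_2 = 0 \mid y = 0) = 1/3$, $P(x_2 = 1 \mid y = 0) = 2/3$.
$P(x_1 = 0 \mid y = 1) = 1/4$, $P(x_1 = 1 \mid y = 1) = 3/4$.
$P(x_2 = 0 \mid y = 1) = 2/4$, $P(x_2 = 1 \mid y = 1) = 2/4$.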

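Naive Bayes: Example
Test query: $\mathbf{x} = (\text{Schufa pos.} = 0, \text{Student} = 0)$.
Prediction: $y_* = \arg\max_y P(y \mid \mathbf{x}, \theta_{MAP}) = \arg\max_y P(y \mid \theta_{MAP}) \prod_{i=1}^{m} P(x_i \mid y, \theta_{MAP})$.
$P(y = 0)\, P(\mathbf{x} \mid y = 0) = P(y = 0)\, P(x_1 = 0 \mid y = 0)\, P(x_2 = 0 \mid y = 0) = \frac{2}{5} \cdot \frac{2}{3} \cdot \frac{1}{3} = \frac{4}{45}$.
$P(y = 1)\, P(\mathbf{x} \mid y = 1) = P(y = 1)\, P(x_1 = 0 \mid y = 1)\, P(x_2 = 0 \mid y = 1) = \frac{3}{5} \cdot \frac{1}{4} \cdot \frac{2}{4} = \frac{3}{40}$.
Since $\frac{4}{45} > \frac{3}{40}$, the prediction is $y_* = 0$.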
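
The arithmetic of this example can be checked with exact fractions. The parameter values are the ones from the example; the variable names are chosen here for illustration.

```python
from fractions import Fraction as F

# MAP parameters from the example.
p_y = {0: F(2, 5), 1: F(3, 5)}             # P(y)
p_x1_is_0 = {0: F(2, 3), 1: F(1, 4)}       # P(x_1 = 0 | y)
p_x2_is_0 = {0: F(1, 3), 1: F(2, 4)}       # P(x_2 = 0 | y)

# Unnormalized class scores for the query x = (0, 0).
scores = {c: p_y[c] * p_x1_is_0[c] * p_x2_is_0[c] for c in (0, 1)}
y_star = max(scores, key=scores.get)
print(scores[0], scores[1], y_star)        # 4/45 3/40 0

# Dividing by the sum of the scores gives the class posterior.
posterior_0 = scores[0] / (scores[0] + scores[1])
print(posterior_0)                         # 32/59
```

The two scores are close, so the normalized posterior for class 0 is only 32/59, roughly 0.54: the prediction is $y_* = 0$, but with little confidence.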

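Naive Bayes: Multi-Valued Attributes
Parameters: $\theta^{x_i = v \mid y_j}_{MAP}$ for all values $v$ of $x_i$ and all classes $y_j$.
Prior: Dirichlet instead of Beta.
Estimating the parameters: $\theta^{x_i = v \mid y_j}_{MAP} = \frac{N_{x_i = v \mid y_j} + \alpha_{x_i = v \mid y_j} - 1}{\sum_{v'} \left( N_{x_i = v' \mid y_j} + \alpha_{x_i = v' \mid y_j} - 1 \right)}$, with $N_{x_i = v \mid y_j}$: number of examples with value $v$ for attribute $x_i$ and class $y_j$ in $L$.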
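
For one attribute and one class, the multi-valued estimate is the mode of a Dirichlet posterior. The sketch assumes a symmetric Dirichlet prior with a single parameter `alpha` greater than 1; the function name `dirichlet_map` and the example values are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction

def dirichlet_map(values, domain, alpha=2):
    """MAP estimate of P(x_i = v | y) from the attribute values of one class.

    values: observed values of attribute x_i among the instances of that class.
    domain: all possible values v; alpha: symmetric Dirichlet parameter (> 1).
    """
    counts = Counter(values)
    total = sum(counts[v] + alpha - 1 for v in domain)
    return {v: Fraction(counts[v] + alpha - 1, total) for v in domain}

estimate = dirichlet_map(["a", "a", "b"], ["a", "b", "c"])
print(estimate["a"], estimate["b"], estimate["c"])   # 1/2 1/3 1/6
```

For a two-valued attribute this reduces to the Beta case: `dirichlet_map([1, 1], [0, 1])` gives 3/4 for the value 1, the same as $\frac{N + \alpha - 1}{N + 2\alpha - 2}$ with $N = 2$ and $\alpha = 2$.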

Naive Bayes: Properties
Simple to implement, efficient, popular.
Works acceptably when the attributes really are independent given the class, but this is often not the case.
The independence assumption and model-based training often lead to poor results.
Logistic regression, Winnow and the perceptron are usually better.