Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Zusammenfassung: Entscheidungsbäume. Paul Prasse

Transkript

1 Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Zusammenfassung: Entscheidungsbäume Paul Prasse

2 Entscheidungsbäume Eine von vielen Anwendungen: Kreditrisiken nein abgelehnt nein Kredit - Sicherheiten > 2 x verfügbares Einkommen? ja nein Arbeitslos? ja angenommen Länger als 3 Monate beschäftigt? abgelehnt ja Ja Schufa-Auskunft positiv? nein abgelehnt abgelehnt Kredit - Sicherheiten > 5 x verfügbares Einkommen? nein nein angenommen ja Student? ja abgelehnt 2

3 Entscheidungsbäume Warum? Einfach zu interpretieren. Liefern Klassifikation plus Begründung. Abgelehnt, weil weniger als 3 Monate beschäftigt und Kredit-Sicherheiten < 2 x verfügbares Einkommen. Können aus Beispielen gelernt werden. Einfacher Lernalgorithmus. Effizient, skalierbar. Verschiedene Lernprobleme werden abgedeckt. Klassifikations- und Regressionsbäume. Klassifikations-, Regressions-, Modellbäume häufig Komponenten komplexer (z.b. Risiko-)Modelle. 3

4 Anwendung von Entscheidungsbäumen Testknoten: Führe Test aus, wähle passende Verzweigung, rekursiver Aufruf. Terminalknoten: Liefere Wert als Klasse zurück. nein abgelehnt nein Kredit - Sicherheiten > 2 x verfügbares Einkommen? ja nein Arbeitslos? ja angenommen Länger als 3 Monate beschäftigt? abgelehnt ja Ja Schufa-Auskunft positiv? nein abgelehnt abgelehnt Kredit - Sicherheiten > 5 x verfügbares Einkommen? nein nein angenommen ja Student? ja abgelehnt 4

5 Entscheidungsbäume Wann sinnvoll? Entscheidungsbäume sinnvoll, wenn: Instanzen durch Attribut-Werte-Paare beschrieben werden. Zielattribut einen diskreten Wertebereich hat. Bei Erweiterung auf Modellbäume auch kontinuierlicherer Wertebereich möglich. Interpretierbarkeit der Vorhersage gewünscht ist. Anwendungsgebiete: Medizinische Diagnose, Kreditwürdigkeit, Teilsystem von komplexeren Systemen. 5

6 Lernen von Entscheidungsbäumen Eleganter Weg: Unter den Bäumen, die mit den Trainingsdaten konsistent sind, wähle einen möglichst kleinen Baum (möglichst wenige Knoten). Kleine Bäume sind gut, weil: sie leichter zu interpretieren sind; sie in vielen Fällen besser generalisieren. Es gibt mehr Beispiele pro Blattknoten. Klassenentscheidungen in den Blättern stützen sich so auf mehr Beispiele. 6

7 Greedy-Algorithmus Top-Down Konstruktion Solange die Trainingsdaten nicht perfekt klassifiziert sind: Wähle das beste Attribut A, um die Trainingsdaten zu trennen. Erzeuge für jeden Wert des Attributs A einen neuen Nachfolgeknoten und weise die Trainingsdaten dem so entstandenen Baum zu. Problem: Welches Attribut ist das beste? [29 +, 35 -] ja x 1 x 2 nein [29 +, 35 -] ja nein [21+, 5 -] [8+, 30 -] [18+, 33 -] [11+, 2 -] 7

8 Information Gain / Gain Ratio Infornation Gain: Unsicherheit bzgl. der Zielklasse: Informationsgewinn (IG): Problem: Informationsgehalt riesig bei stark verzweigenden Tests. 8 log ), ( 2 L L L L x L SplitInfo v x v v x ), ( ), ( ), ( x L SplitInfo x L IG x L GainRatio v v x L y L H v x p y L H x L IG ), ( ) ( ˆ ), ( ), ( ' 2 ) ' ( ˆ )log ' ( ˆ ), ( ), ( v L L v x v x v y p v x v y p v x y L H y L H

9 Algorithmus ID3 Voraussetzung: Klassifikationslernen, Alle Attribute haben festen, diskreten Wertebereich. Idee: rekursiver Algorithmus. Wähle das Attribut, welches die Unsicherheit bzgl. der Zielklasse maximal verringert Information Gain, Gain Ratio, Gini Index Dann rekursiver Aufruf für alle Werte des gewählten Attributs. Solange, bis in einem Zweig nur noch Beispiele derselben Klasse sind. Originalreferenz: J.R. Quinlan: Induction of Decision Trees

10 Kontinuierliche Attribute Idee: Binäre Entscheidungsbäume, -Tests. Problem: Unendlich viele Werte für binäre Tests. Idee: Nur endlich viele Werte kommen in Trainingsdaten vor. Beispiel: Kontinuierliches Attribut hat folgende Werte: 0,2; 0,4; 0,7; 0,9 Mögliche Splits: 0,2; 0,4; 0,7; 0,9 Andere Möglichkeiten: Attribute in einen diskreten Wertebereich abbilden. 10

11 Algorithmus C4.5 Weiterentwicklung von ID3 Verbesserungen: auch kontinuierliche Attribute behandelt Trainingsdaten mit fehlenden Attributwerten behandelt Attribute mit Kosten Pruning Nicht der letzte Stand: siehe C5.0 Originalreferenz: J.R. Quinlan: C4.5: Programs for Machine Learning

12 Pruning Problem: Blattknoten, die nur von einem (oder sehr wenigen) Beispielen gestützt werden, liefern häufig keine gute Klassifikation(Generalisierungsproblem). Pruning: Entferne Testknoten, die Blätter mit weniger als einer Mindestzahl von Beispielen erzeugen. Dadurch entstehen Blattknoten, die dann mit der am häufigsten auftretenden Klasse beschriftet werden. 12

13 Pruning mit Schwellwert Für alle Blattknoten: Wenn weniger als r Trainingsbeispiele in den Blattknoten fallen Entferne darüber liegenden Testknoten. Erzeuge neuen Blattknoten, sage Mehrheitsklasse voraus. Regularisierungsparameter r. Einstellung mit Cross Validation. 13

14 Reduced Error Pruning Aufteilen der Trainingsdaten in Trainingsmenge und Pruningmenge. Nach dem Aufbau des Baumes mit der Trainingsmenge: Versuche, zwei Blattknoten durch Löschen eines Tests zusammenzulegen, Solange dadurch die Fehlerrate auf der Pruningmenge verringert (nicht vergrößert) wird. 14

15 Entscheidungsbäume aus großen Datenbanken: SLIQ C4.5 iteriert häufig über die Trainingsmenge Wenn die Trainingsmenge nicht in den Hauptspeicher passt, wird das Swapping nicht mehr praktikabel! SLIQ: Vorsortieren der Werte für jedes Attribut Baum breadth-first aufbauen, nicht depth-first. Originalreferenz: M. Mehta et. al.: SLIQ: A Fast Scalable Classifier for Data Mining

16 SLIQ- Pruning Minimum Description Length (MDL): Bestes Modell für einen gegebenen Datensatz minimiert Summe aus Länge der im Modell kodierten Daten plus Länge des Modells. cost M, D = cost D M + cost M cost M = Kosten des Modells (Länge). Wie groß ist der Entscheidungsbaum. cost D M = Kosten, die Daten durch das Modell zu beschreiben. Wie viele Klassifikationsfehler entstehen. 16

17 Modellbäume Entscheidungsbaum, aber lineares Regressionsmodell in Blattknoten. ja ja x <.9 x <.5 x < 1.1 nein nein nein

18 Entscheidungsbaum - Vorteile Einfach zu interpretieren. Können effizient aus vielen Beispielen gelernt werden: SLIQ, SPRINT, C4.5, ID3. Viele Anwendungen. Hohe Genauigkeit. 18

19 Entscheidungsbaum - Nachteile Nicht robust gegenüber Rauschen. Tendenz zum Overfitting. Instabil. 19

20 Zusammenfassung Entscheidungsbäume Klassifikation : Vorhersage diskreter Werte. Diskrete Attribute: ID3. Kontinuierliche Attribute: C4.5. Skalierbar für große Datenbanken: SLIQ, SPRINT. Regression: Vorhersage kontinuierlicher Werte. Regressionsbäume: konstante Werte in Blättern. Modellbäume: lineare Modelle in den Blättern. 20

21 Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Zusammenfassung: Bayes sches Lernen, Evaluierung Niels Landwehr

22 Überblick Bayes sches Lernen Evaluierung 22

23 Diskrete und kontinuierliche Zufallsvariablen Diskrete Zufallsvariablen Endliche, diskrete Menge von Werten D möglich Verteilung beschrieben durch diskrete Wahrscheinlichkeiten Kontinuierliche Zufallsvariablen Kontinuierliche, unendliche Menge von Werten möglich Verteilung beschrieben durch kontinuierliche Dichtefunktion px ( ) 0 p( x) [0, 1] mit p( x) 1 mit p( x) dx 1 xd Wahrscheinlichkeit f ür x[ a, b]: p( x[ a, b]) p( x) dx b a 23

24 Diskrete Verteilungen Binomialverteilung: Anzahl Köpfe bei N Münzwürfen X X i ~ Bern(X ) N i1 ~ Bin( X N, ) N Bin( X N, ) (1 ) X i X X, X {0,..., N} i X N X N 10,

25 Kontinuierliche Verteilungen Normalverteilung (Gaußverteilung) Beispiel: Körpergröße X Annähernd normalverteilt: X 2 ~ ( x, ) Dichte der Normalverteilung z.b. 170, 10 25

26 Erwartungswert und Varianz Erwartungswert einer Zufallsvariable Gewichteter Mittelwert E( X ) xp( X x) Varianz einer Zufallsvariable E( X ) xp( x) dx x X diskrete ZV Erwartete quadrierte Abweichung der Zufallsvariable von ihrem Erwartungswert Mass für die Stärke der Streuung Var X E X E X 2 ( ) (( ( )) ) X kontinuierliche ZV mit Dichte p(x) Verschiebungssatz: Var( X ) E( X ) E( X )

27 Beispiel: Erwartungswert Normalverteilung zx Erwartungswert Normalverteilung X 2 ~ ( x, ) 2 E( X ) x ( x, ) dx x exp ( ) 2 1/2 2 (2 ) 2 x dx ( z ) exp 2 1/2 2 (2 ) 2 z dz exp exp 2 1/ /2 2 (2 ) 2 z dz z (2 ) 2 z dz

28 Bayessche Lernen Bayes sches Lernen: Anwendung probabilistischer Überlegungen auf Modelle, Daten, und Vorhersagen Bayes sche Regel: Modellwahrscheinlichkeit gegeben Daten und Vorwissen p( Modell Daten) Posterior: wie ist die Wahrscheinlichkeit für Modelle, gegeben Evidenz der Trainingsdaten? Likelihood: Wie hoch ist die Wahrscheinlichkeit, bestimmte Daten zu sehen, unter der Annahme dass Modell das korrekte Modell ist? p( Daten Modell ) p( Modell ) p( Daten) Wahrscheinlichkeit der Daten, unabhängig von Modell A-priori Verteilung über Modelle: Vorwissen

29 Parameter von Verteilungen schätzen Erste Anwendung Bayes scher Überlegungen: Parameterschätzung in Münzwurfexperimenten Münze mit unbekanntem Parameter wird N Mal geworfen Daten L in Form von N K Kopfwürfen und N Z Zahlwürfen Was sagen uns diese Beobachtungen über den echten Münzwurfparameter? Wissen über echten Münzwurfparameter abgebildet in a- posteriori Verteilung p( L) Ansatz mit Bayesscher Regel: p( L) p( L p( pl ( ) 29

30 Parameterschätzungen von Wahrscheinlichkeitsverteilungen: Binomial Geeignete Prior-Verteilung: Beta-Verteilung Kontinuierliche Verteilung über Konjugierter Prior: a-posteriori Verteilung wieder Beta K 5, 5 1, 1 4, 2 p( ) Beta( Z k k z k (1 k z z 1 1 z K Z 1 0 Beta(, ) d 1 K Z K Z 30

31 Parameterschätzungen von Wahrscheinlichkeitsverteilungen: Binomial Mit Hilfe der Bayes schen Regel werden Vorwissen (Beta- Verteilung) und Beobachtungen (N K, N Z ) zu neuem Gesamtwissen P(θ L) integriert A-priori-Verteilung Beta P( ) Beta( 5 5 Daten L: N K =50x Kopf, N z =25x Zahl Konjugierter Prior: A-posteriori Verteilung wieder Beta P( L) Beta(

32 Bayessche Lineare Regression Bayes sche Lineare Regression: Bayes sches Modell für Regressionsprobleme Lineares Modell T f ( x w ) w x w0 Annahme über datengenerierenden Prozess: echtes lineares Modell f ( x) * plus Gauß sches Rauschen y ( ) i f* xi mit m i1 w x i ~ ( 0, ) i f ( x) * 32

33 Bayessche Lineare Regression: Prior Ziel zunächst Berechnung der a-posteriori Verteilung mit Bayes scher Regel P( w L) P( L w) P( w) pl ( ) Geeignete (konjugierte) Prior-Verteilung: Normalverteilung über Parametervektoren w Größte Dichte bei w=0 Erwarten kleine Attributgewichte 33

34 Bayessche Lineare Regression: Posterior Posterior-Verteilung über Modelle gegeben Daten 1 P( w L) P( L w) P( w) Bayessche Regel Z 1 Z T ( y X w, I) ( w 0, p) 1 ( w w, A ) mit w 1 A X Posterior ist wieder normalverteilt, mit neuem Mittelwert w und Kovarianzmatrix y A XX T A 1 p 34

35 Bayes sche Vorhersage Nicht auf MAP-Modell festlegen, solange noch Unsicherheit über Modelle besteht Stattdessen Bayes sche Vorhersage: berechne direkt wahrscheinlichstes Label für Testinstanz Vorhersage y arg max p( y x, L) Bayes sche Vorhersage: Neue Testinstanz x * arg max p( y w, xp( w L) dw Bayesian Model Averaging Mitteln der Vorhersage über alle Modelle. y y Vorhersage, gegeben Modell Modell gegeben Trainingsdaten Gewichtung: wie gut passt Modell zu Trainingsdaten. 35

36 Bayes sche Vorhersage für die Lineare Regression Für die Bayes sche lineare Regression kann die Bayesoptimale Vorhersage direkt berechnet werden: P( y x, L) P( y x, w) P( w L) dw mit T 2 1 ( y x w, ) ( w w, A ) dw 1 A T w Xy A XX Vorhersageverteilung wieder normalverteilt y, T 2 T 1 x w x A x p 36

37 Lineare Regression auf nichtlinearen Basisfunktionen y Nichtlineare Zusammenhänge in Daten darstellbar durch lineare Regression in nichtlinearen Basisfunktionen f ( x) w T ( x) w w ( x) 0 : Abbildung von in höherdimensionalen Raum Lineare Regression in entspricht nichtlinearer Regression in. f ( x) 1 3x x d i1 ( ) 2 x i i y x ( ) 2 x 37

38 Beispiel nichtlineare Regression: Vorhersageverteilung f( x) N=1 N=4 Datenpunkt y sin(2 x) N=2 N=25 38

39 Beispiel nichtlineare Regression: Samples aus dem Posterior N=1 N=4 N=2 N=25 39

40 Naive Bayes Klassifikator (Übungen) Naive Bayes Klassifikator als einfaches probabilistisches Klassifikationsmodell (binäre Attribute, binäre Klassen) Definiert eine gemeinsame Verteilung über x und y durch Münzwürfe p( x, y ) p( y ) p( x y, ) Münzwurf für Klasse y p( y ) p( x y, Konjugierter Prior Betaverteilung, Lösung für MAP Parameter wie bei Münzwurfexperimenten m i1 i x y Münzwurf für Attribut, gegeben Klasse i 40

42 Modellevaluierung Problem der Risikoschätzung für gelernte Modelle Risiko eines Modells: erwarteter Verlust auf Testbeispielen R( f ) E[ ( y, f ( x))] ( y, f ( x)) p( x, y) dxdy (, x y) ~ p( x, y) Testbeispiel Risiko lässt sich aus Daten schätzen 1 m m Rˆ( f ) ( y, f ( x )) "Risikoschätzer, empirisches Risiko" 1 1 j1 j j T ( x, y ),...,( x, y ) ( x, y ) ~ p( x, y) m m i i 42

43 Der Risikoschätzer als Zufallsvariable Wert Rˆ ( f) Risikoschätzer ist Zufallsvariable Charakterisiert durch Bias und Varianz: systematischer Fehler und zufällige Streuung R R Bias dominiert Varianz dominiert Fehler des Schätzers lässt sich zerlegen in Bias und Varianz ˆ 2 ˆ 2 ˆ 2 [( ( ) ) ] [ ( ) 2 ( ) ] E R f R E R f RR f R ˆ ˆ 2 2 E[ R( f ) ] 2 RE[ R( f )] R E[ Rˆ( f )] 2 RE[ Rˆ( f )] R E[ Rˆ( f ) ] E[ Rˆ( f )] ˆ ( E[ R( f )] R) Var[ R( f )] Bias Rˆ f 2 [ ( )] Var[ R( f ) ˆ ˆ ] 43

44 Risikoschätzung auf Trainingsdaten? Risikoschätzung kann nicht auf Trainingsdaten erfolgen: optimistischer Bias E[ Rˆ ( f )] R( f ) L Problem ist die Abhängigkeit von gewähltem Modell und zur Fehlerschätzung verwendeten Daten Ansatz: Testdaten verwenden, die von den Trainingsdaten unabhängig sind. 44

45 Holdout-Testing Gegeben Daten Teile Daten auf in Trainingsdaten L ( x1, y1),...,( xm, ym) und Testdaten Starte Lernalgorithmus mit Daten L, gewinne so Modell. ˆ ( ) Ermittle empirisches Risiko R f auf Daten T. Starte Lernalgorithmus auf Daten D, gewinne so Modell. f D Ausgabe: Modell, benutze RT fl als Schätzer für das Risiko von f D D ( x, y ),...,( x, y ) 1 1 T ( x, y ),...,( x, y ) m1 m1 d d T L ˆ ( ) L T d d f L f D 45

46 Cross-Validation Gegeben: Daten Teile D in n gleich große Blöcke Wiederhole für i=1..n Trainiere f i mit L i =D \ D i. Bestimme empirisches Risiko auf D i. D Fehlerschätzung ( x, y1),...,( x R Starte Lernalgorithmus auf Daten D, gewinne so Modell., y 1 d d 1 n n i 1 D1 D2 D3 D4 ˆ ) D,..., 1 Dn R i Rˆ D ( f ) i i ( D f i ) Training examples f D 46

47 Bias und Varianz in Holdout-Testing und Cross-Validation Bias und Varianz der Fehlerschätzung aus Holdout-Testing und Cross-Validation? Fehlerschätzungen aus Holdout-Testing und Cross-Validation jeweils leicht pessimistisch Aber im Gegensatz zum Trainingsfehler in der Praxis brauchbar Cross-Validation hat geringere Varianz als Holdout-Testing, weil wir über mehrere Holdout-Experimente mitteln 47

49 Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Linear Classifiers - Summary Blaine Nelson, Christoph Sawade, Tobias Scheffer

50 Classification Input: an instance x X E.g., X can be a vector space over attributes The Instance is then an assignment of attributes. x = x 1 x m is a feature vector Output: Class y Y; where Y is a finite set. The class is also referred to as the target attribute y is also referred to as the (class) label x classifier y 50

51 Major Themes Classification Problem Linear Classifiers Bayesian Classifier Decision, MAP Models Regularized Empirical Risk Minimization Logistic Regression, Perceptron, Support Vector Machine Dual Form Linear Models & Kernels Representer Theorem & Dualized Learning Algorithms Kernel as Inner Product in Feature Space Kernels for Structured Data Spaces String Kernels, Graph Kernels Learning with Structured Input & Output Taxonomy, Sequences, Ranking, Decoder, Cutting Plane Algorithm 51

52 LINEAR CLASSIFICATION MODELS 52

53 Linear Models Hyperplane given by normal vector & displacement: H θ = x f θ x = x T θ + b = 0 Decision function: f θ x = x T θ + b x 2 b θ θ f θ x f θ x > 0 f θ x < 0 = 0 x 1 x 2 p x y = 1, θ f θ x = 0 x 1 53

54 Linear Classification Data is mapped by φ x to feature space. The offsets b are appended to the end of the vector θ & a 1 is added to the end of each φ x i. In the binary case, the linear classifier has a decision function: f θ x = φ x T θ & a classifier: y x = sign f θ x In the multi-class case, the linear classifier has a decision function: f θ x, y = φ x T θ y & a classifier: y x = argmax z Y f θ x, z 54

55 Linear Classifier Learning Separable data There are several good linear classifiers How do we choose? Nonseparable data We must choose a best classifier for the data How do we minimize error? 55

56 BAYESIAN APPROACH 56

57 Empirical Inference Inference of the probability of y given instance x and training data T n? p y x, T n Inference of the most likely class y = argmax y p y x, T n We must make assumptions about the process by which the data is generated to be able to calculate the most probable class. We assume all data are independent given model θ. 57

58 Empirical Inference Inference of the probability of y given instance x and training data? Likelihood of the Class p y x, T n = p y x, θ p θ T n dθ where θ MAP = argmax θ p y x, θ MAP p y X, θ p θ Classification through the most probable single model instead of a sum over all models. Prior over model parameters 58

59 Graphical Model for Classification A graphical model defines a stochastic process constituting our modeling assumptions about data n i=1 Independence: p y X, θ = p y i x i, θ Discriminative model: class probabilities p y i x i, θ are directly specified by the model. x i θ y i n x y Generative model: apply Bayes Rule, p y i x i, θ = p x i y i,θ p y i θ y Y p x i z,θ p z θ where p x i y i, θ & p y i θ are model specific. x i θ y i n x y 59

60 Discriminative Logistic Regression We model p y x, θ directly as a function of φ x T θ: p y x, θ = q y φ x T θ Binary classification: p y = +1 x, θ = 1 1+exp φ x T θ This is called the logistic regression model. The set x x T θ = 0 forms a separating plane between classes -1 & +1. Multiclass case: probability for class y given by p y x, θ = exp φ x T θ y z Y exp φ x T θ z Parameters θ y estimated (e.g. gradient descent). 60

61 Generative Logistic Regression Bayes Rule applied to class conditional: p x y, θ p y θ p y x, θ = p x z, θ p z θ z Y Using the generative approach & assumptions p y θ = π y exponential family: conditional probability of x is We arrived at this conditional distribution for y: p y x, θ = exp φ x T θ y z Y exp φ x T θ z Parameters θ y estimated (e.g. gradient descent). 61

62 EMPIRICAL RISK MINIMIZATION 62

63 Reg. Empirical Risk Minimization (ERM) Regularized empirical risk minimization - choose θ to minimize the sum of empirical risk & a regularizer. argmin θ n i=1 l f θ x i, y i + c Ω θ Loss function l f θ x, y measures the cost of predicting f θ x when the true class is y n Empirical risk is R n θ = i=1 l f θ x i, y i Regularizer Ω θ penalizes complexity & gives stability Optimization Approaches Gradient descent & stochastic gradient methods. Choice of loss & regularizer gives different methods Logistic regression, perceptron, SVM 63

64 Solution for the Optimization Problem Goal: for loss l and regularizer Ω, minimize Gradient method: n L θ = l f θ x i, y i i=1 RegERM(Data: x 1, y 1,, x n, y n ) Set θ 0 = 0 and t = 0 DO Compute gradient L θ t Compute step size α t Set θ t+1 = θ t α t L θ t Set t = t + 1 WHILE θ t θ t+1 RETURN θ t > ε + c Ω θ L θ = Step size computed L θ θ 1 1 L θ θ m 1 L θ θ m k α t = argmin L θ t α L θ t α>0 α t = α t /2 until L θ t α t L θ t L θ t etc. 64

65 Stochastic Gradient Method Idea: Determine the gradient for a random subset of the samples (e.g. a single sample). RegERM-Stoch(Data: x 1, y 1,, x n, y n ) Set θ 0 = 0 and t = 0 DO Shuffle data randomly FOR i = 1,, n END WHILE θ t θ t+1 RETURN θ t Compute subset gradient xi L θ t Compute step size α t Set θ t+1 = θ t α t xi L θ t Set t = t + 1 Converges to optimum if: > ε t=1 α t = and t=1 α t 2 < 65

66 ERM: Loss Functions for Classification Zero-one loss: l 0/1 f θ x i, y i = I sign f θ x i y i Logistic loss: l log f θ x i, y i = log 1 + e y if θ x i Perceptron loss: l p f θ x i, y i = max 0, y i f θ x i Hinge loss: l h f θ x i, y i = max 0,1 y i f θ x i Zero-one loss is not convex difficult to minimize! y i f θ x i Logistic, perceptron, & hinge losses are convex approximations!

67 ERM: Logistic Regression The MAP-(Maximum-A-Posteriori-)parameters of logistic regression (binary case) are the minimum of the sum of A sum over all examples of their logistic loss, l f θ x i, y i = log 1 + exp y i f θ x i. An L2-Regularizer Ω θ = 1 2 θt Σ 1 θ For the Multi-class case, the loss is changed: l f θ x i,, y i = log exp f θ x i, z f θ x i, y i z Y An implementation of logistic regression must identify the minimum of this sum. 67

68 ERM: Perceptron Algorithm Perceptron algorithm minimizes the sum of the perceptron loss over all samples (no regularizer) Perceptron(Instances x i, y i ) Set θ = 0 DO FOR i = 1,, n END IF y i f θ x i 0 THEN θ = θ + y i x i WHILE θ changes RETURN θ Stochastic gradient method with ε = 0 & α t = 1. Terminates when data is linearly separable. 68

69 ERM: Support Vector Machine (SVM) The SVM minimizes hinge loss l h f θ x i, y i = max 0,1 y i f θ x i & L2-norm θ 2 = θ T θ of the parameter vector. It can be solved as convex optimization The SVM is a large margin classifier it separates most samples of the classes with a margin of 1 θ 2. Here regularized ERM trades off between maximizing the margin 1 and minimizing the sum of the slack errors Points that lie 1 θ 2 n i=1 ξ i θ 2 from the plane are support vectors 69

70 ERM: Regularizer L1-Regularization also leads to sparse solutions (= solution with many zero entries in θ). Optimization problem min l f θ x i, y i θ i=1 is equivalent to min θ R n θ d θ 1 Regularization parameter: the larger c is, the sparser the solution is. n R n θ θ 2 θ 1 Larger c shifts the intersection of the contours with a smaller scaling. + c θ 1 θ 1 Solution for c

71 FEATURE SPACES: MAPPINGS, DUAL MODELS, & KERNELS 71

72 Review: Feature Mappings All considered linear methods can be made nonlinear by means of feature mapping φ. Better separations can be obtained in feature space φ x 1, x 2 = x 1 x 2, x 1 2, x 2 2 Hyperplane in feature space corresponds to a nonlinear surface in original space 72

73 Feature Mappings 1 for displacement Linear Mapping: φ x i = 1 x i Quadratic Mapping: φ x i = Polynomial Mapping: φ x i = 1 x i x i x i 1 x i x i x i x i x i p factors Tensor product Feature mappings may not have a closed form expression, but can be specified by an inner product E.g., RBF kernel, Hash kernel functions 73

74 Dual Linear Model: Motivation The feature mapping φ x can be high dimensional. The size θ depends on the dimensionality of φ! Computation of φ x is expensive. φ is computed for each training x i & each prediction x. The dual form α has as many parameters α i as there are training samples. Primal: f x = θ T φ x θ has as 1 parameter per dimension of φ x. Good if #samples >> # features. n Dual: f x = i=1 α i φ x T i φ x 1 parameter α i per sample. Good if #samples << # features. 74

75 Dual Form Linear Model Representer Theorem: If g is strictly monotonically increasing, then the θ that minimizes has the form θ = n L θ = l f θ x i, y i i=1 n i=1 α i φ x i n + g f θ 2 f θ x = α i φ x i T φ x i=1, with α i R and Will be expressed as a kernel k x, x = φ x T φ x 75

76 Dualized Perceptron Perceptron loss, no regularizer Dual form of the decision function: f α x = α i k x i, x i=1 Dual form of the perceptron: Perceptron(Instances x i, y i ) Set α = 0 DO n FOR i = 1,, n END IF y i f α x i 0 THEN α i = α i + y i WHILE α changes RETURN α 76

77 Dualized Support Vector Machine Optimization criterion of the dual SVM: max β n i=1 1 β i 2 n i,j=1 such that 0 β i λ β i β j y i y j k x i, x j Optimization β solvable with QP-Solver in O n 2. Sparse solution: Lagrange multipliers > 0 when their corresponding constraint is fulfilled with equality. β i = 0: distance to the hyperplane exceeds the margin β i = λ: sample violates the margin 0 < β i < λ: sample lies on the margin Dual decision function: f β x = β i y i k x i, x x i SV 77

78 Kernel Functions Kernel function k x, x = φ x T φ x computes the inner product of the feature mapping of 2 instances. The kernel function can often be computed without an explicit representation φ x. Eg, polynomial kernel: k poly x i, x j = x i T x j + 1 p Infinite-dimensional feature mappings are possible Eg, RBF kernel: k RBF x i, x j = e γ x i x j 2 For every positive definite kernel there is a feature mapping φ x such that k x, x = φ x T φ x. For a given kernel matrix, the Mercer map provides an equivalent feature mapping. 78

79 Properties of Kernels Theorem: For every positive definite function k there exists a mapping φ x such that k x, x = φ x T φ x for all x and x. This mapping is not unique. Gram matrix or kernel matrix K; with K ij = k x i, x j Matrix of inner products = pairwise similarity between samples; an n n matrix. k x, x ist PSD iff K is a PSD matrix for every dataset 79

80 Mercer Map Mercer Map: given a PSD kernel matrix, we can construct feature mapping Φ for learning/prediction. Every symmetric matrix K has an eigendecomposition given by K = UΛU 1. Feature mapping for used training data can is then φ x 1 φ x n = UΛ 1/2 T Kernel matrix between training and test data K test = Φ X T train Φ X test = UΛ 1/2 Φ X test Equation results in a mapping of the test data: Φ X test = Λ 1/2 U T K test 80

81 String Kernels Here we have seen a number of string kernels that can be efficiently computed (using dynamic programming, tries, etc.) Bag-of-word kernel p-spectrum kernel All-subsequences kernel Many other variants exist (fixed length subsequence, gap-weighted subsequence, mismatch, etc.) Choice of kernel depends on notion of similarity appropriate for application domain 81

82 Graph Kernels 1 st Possibility: number of isomorphisms between all (sub-) graphs NP-Hard to compute 2 nd Possibility: number of common paths in their product graph (with adjacency matrix A ). Number of paths of length k: n i,j=1 A k ij = 1 T A k 1 With cycles, there may be infinitely many paths so we downweight the influence of long paths. Random Walk Kernels: k G 1, G 2 = 1T 1 I λa 1 k G V 1 V 1, G 2 = 1T exp λa 1 2 V 1 V 2 These kernels can be calculated in O n 3. Additional kernels: Shortest-Path & Subtree Kernels 82

83 MULTICLASS CLASSIFICATION 83

84 Linear Classification for Several Classes Binary classifier f θ x = φ x T θ y x = sign f θ x Multiclass classifier f θ x, y = φ x T θ y y x = argmax z Y f θ x, z θ 84

85 Multiclass Classification We extend classification to multiclass: Y = 1,, k Each class y has a separate linear function f θ x, y used to predict how likely y is given x. We predict class y with the highest score for x. Classification for more than two classes: y x f θ x, y = argmax y f θ x, y = Φ x, y T θ A parameter vector for each class: θ = θ 1,, θ k T x 2 f θ x, y 1 > 0 θ y1 f θ x, y 2 θ y3 θ y2 f θ x, y 3 > 0 x 1 > 0 85

86 Multiclass SVM Idea: use large-margin learning for multiclass problems Large margin gives a reasonable separation of score of true class & all other classes with slack error Optimization problem: min θ,ξ λ such that n i=1 ξ i θt θ ξ i 0 and y y i f θ x i, y i f θ x i, y + 1 ξ i 86

87 Classification with Information over Classes Each class has its own parameter vector: y x = argmax y f θ x, y where f θ x, y = Φ x, y T θ. Joint feature map Φ x, y = φ x Λ y captures the information from both input & class structure The structure of the output class Y is encoded by defining Λ y. This defines a multiclass kernel on input & output: k x i, y i, x j, y j = k x i, x j Λ y i T Λ y j correspondence between classes Correspondence of the classes Λ y i T Λ y j act as an inner product for class descriptions 87

88 Complex Structured Output Suppose output space Y contains complex objects. E.g., Part-of-speech tagging, natural language parsing, & sequence alignment A multistage process propagates its errors, but if each structure is a separate class, there are exponentially many classes leading to exponentially-many parameters to estimate We must develop an efficient feature encoding inefficient predictions We must develop an efficient decoder for the structure (e.g. Viterbi) an inefficient learning algorithm. We use the decoder via the cutting plane algorithm 88

89 Cutting Plane Algorithm Idea: Negative constraints are added when an error occurs during training. Given: L = x 1, y 1,, x n, y n Repeat until all sequences are correctly predicted. Iterate over all examples x i, y i. Compute y = argmax y y i Φ x i, y T θ If Φ x i, y i T θ < Φ x i, y T θ + 1 (Margin-Violation) then add the constraint Φ x i, y i T θ < Φ x i, y T θ + 1 ξ i to the working set of constraints. Solve the optimization problem for input x i, output y i and negative pseudo-examples y (in the working set) Return the learned θ. 89

90 Conclusion: Fundamental Ideas Linear models are simple, they extend to arbitrary input / output spaces, & they are efficiently learned. Feature Map allows non-linear learning in X. Loss Function penalizes errors. Regularizier controls complexity, sparcity, & stability. Convex learning problems have unique solutions (Stochastic) gradient descent, Perceptron, SVM, & Cutting Plane Algorithms Kernel-based learning separates data & learning Learner is built to find a reasonable class separation. Kernel function expresses a pairwise notion of similarity in a domain-specific feature space. 90

91 91

92 ERM: Logistic Regression Gradients for binary logistic regression: n L θ θ = y i 1 + exp y i f θ x i i=1 n L θ b = y i i=1 1 + exp y i f θ x i φ x i Gradient for multiclass logistic regression L θ = n exp f θ x i,y θ y i=1 z Y exp f θ x i,z L θ = n exp f θ x i,y b y i=1 z Y exp f θ x i,z I y i = y I y i = y + Σ 1 y, θ φ x i + Σ 1 y, θ 92

93 Example: All-Subsequences Kernel Dynamic programming gives efficient computation of all matching subsequences between pairs of strings. Ignore last character of s # matching subsequences of s 1:n 1 and t κ s 1:n 1, t κ s, t # matching subsequences of s and t + # matching subsequences of s 1:n 1 and t 1:k 1 k:t k =s n Match last character of s to k th character of t κ s 1:n 1, t 1:k 1 AllSubseqKernel( s, t ) FOR j = 0: t : DP[0,j] = 1; FOR i = 1: s : last = 0; cache[0]=0; FOR k = 1: t : cache[k] = cache[last]; IF t k = s i THEN cache[k] += DP[i-1,k-1]; last = k; FOR k = 0: t : DP[i,k] = DP[i-1,k] + cache[k]; RETURN DP[ s, t ]; 93

94 Classification with a Taxonomy Classes in a tree structure (depth d): Λ y = Λ y 1 Λ y d Φ x, y = φ x Λ y = φ x Φ x, y = I y 1 = v 1 1 x I y 2 = v 1 2 x I y 2 = v 2 2 x I y 3 = v 1 3 x I y 3 = v 2 3 x I y 3 = v 3 3 x = Λ y i = x 0 x 0 0 x Λ y 1 Λ y d v 1 3 I y i = v 1 i v 1 2 v 2 2 v 2 3 I y i = v ni v 1 1 i = φ x v 3 3 I y 1 1 = v 1 I y 1 1 = v n1 I y d d = v 1 d I y d = v nd 94

95 Example: Sequential Input & Output (Feature Mapping / Prediction / Decoding ) Feature-Mapping: Entries for adjacent states φ N,V y t, y t+1 & stateobservations φ cat,n x t, y t. To classify a sequence, we must compute y x = argmax f θ x, y y Using dynamic programming, the argmax can be computed in linear time (Viterbi Algorithm). Idea: we can compute the maximizing subsequences at time t given the maximizing subsequences at t 1 Once γ t is computed for entire sequence, maximize γ T and follow pointers p t back to find max sequence 95