Unit 14: Decision Trees

Knowledge-Based Methods in Image Processing and Pattern Recognition; Ulrich Bodenhofer
Introduction

- Nearest prototype classifiers are computationally expensive black-box models.
- Decisions can be made in a more structured way, e.g. by asking questions successively.
- A decision tree is a classifier which makes classifications by asking questions successively; each level corresponds to a question, and each leaf corresponds to a final classification.
Construction of Decision Trees

- The top-down construction of a decision tree is, more or less, straightforward.
- To construct a decision tree from data, we have to determine which questions to ask in order to achieve an acceptable result.
- In the popular ID3 method, this is done by considering the gain of information at each node.
- In the following, for convenience, we adopt the convention X_{p+1} = Y.
The ID3 Algorithm

1. Given: data set X = {x^i | i = 1, ..., n}; assume that all p + 1 variables are categorical, i.e. X_i = {1, ..., C_i}.
2. Call ID3(X, Root, {1, ..., p}).
3. ID3(X, N, I):
   (a) If all x in X belong to the same output class, exit.
   (b) Determine the component i ∈ I for which the gain of information g_i(X) is maximal.
   (c) Divide X into disjoint subsets (for j = 1, ..., C_i):

       $X_{ji} = \{x \in X \mid x_i = j\}$   (1)

   (d) For all j such that X_{ji} ≠ ∅: generate a new node N_j and call ID3(X_{ji}, N_j, I \ {i}).
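The recursion above can be sketched in Python. This is a minimal sketch under assumptions not fixed by the slide: samples are tuples with the class label in the last position, all attributes are categorical, and `gain` implements the entropy-based gain of information defined on the following slide.

```python
from collections import Counter
from math import log2

def entropy(X):
    """Entropy of the output-class distribution (class = last tuple component)."""
    n = len(X)
    return -sum(c / n * log2(c / n) for c in Counter(x[-1] for x in X).values())

def gain(X, i):
    """Gain of information g_i(X) for splitting X on component i."""
    subsets = {}
    for x in X:
        subsets.setdefault(x[i], []).append(x)   # disjoint subsets X_ji
    return entropy(X) - sum(len(S) / len(X) * entropy(S) for S in subsets.values())

def id3(X, I):
    """Build a tree as nested dicts; leaves are class labels."""
    if len({x[-1] for x in X}) == 1 or not I:    # step (a): all samples in one class
        return Counter(x[-1] for x in X).most_common(1)[0][0]
    i = max(I, key=lambda j: gain(X, j))         # step (b): maximal gain of information
    subsets = {}
    for x in X:                                  # step (c): disjoint subsets X_ji
        subsets.setdefault(x[i], []).append(x)
    return {"attr": i,                           # step (d): recurse without attribute i
            "children": {j: id3(Xji, I - {i}) for j, Xji in subsets.items()}}

def classify(tree, x):
    """Follow the questions from the root down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["children"][x[tree["attr"]]]
    return tree

# Hypothetical data set: two categorical inputs, class label in the last position
X = [(1, 1, "A"), (1, 2, "B"), (2, 1, "B"), (2, 2, "A")]
tree = id3(X, {0, 1})
```

On this XOR-like toy set, both attributes are needed, so the tree asks two questions before reaching a leaf.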
Computing the Gain of Information

$$H(Y) = -\sum_{i=1}^{C_{p+1}} \frac{|Y_{i\,p+1}|}{|Y|} \log_2 \frac{|Y_{i\,p+1}|}{|Y|}$$

$$g_i(X) = H(X) - \sum_{j=1}^{C_i} \frac{|X_{ji}|}{|X|}\, H(X_{ji})$$
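To make the formulas concrete, consider a hypothetical four-sample set with one attribute x_1 ∈ {1, 2} and two output classes split 3:1, so H(X) = -(3/4 log2 3/4 + 1/4 log2 1/4) ≈ 0.811; splitting on x_1 yields subsets with entropies 0 and 1, hence g_1(X) ≈ 0.811 - (2/4)·0 - (2/4)·1 ≈ 0.311. In code:

```python
from math import log2

# Hypothetical data set: (x_1, y) with x_1 in {1, 2} and output class y in {1, 2}
X = [(1, 1), (1, 1), (2, 1), (2, 2)]

def H(S):
    """Entropy of the output-class distribution in S."""
    n = len(S)
    counts = [sum(1 for x in S if x[-1] == y) for y in {x[-1] for x in S}]
    return -sum(c / n * log2(c / n) for c in counts)

X11 = [x for x in X if x[0] == 1]   # subset X_11: both samples in class 1, H = 0
X21 = [x for x in X if x[0] == 2]   # subset X_21: one sample per class, H = 1
g1 = H(X) - len(X11) / len(X) * H(X11) - len(X21) / len(X) * H(X21)
```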
Fuzzy Decision Trees

- Classical decision trees can only process crisp categorical attributes.
- There are extensions that can process real-valued attributes (CART, C4.5), but they all split the real line into crisp intervals with artificially sharp boundaries; therefore, no interpolative behavior can be modeled.
- Working with fuzzy instead of crisp predicates overcomes this problem.
- The FS-ID3 algorithm is an efficient variant that also accommodates classical decision trees.
The Basic Setting

- Data samples (i = 1, ..., n): $x^i = (x^i_1, \dots, x^i_p, x^i_{p+1}) \in X_1 \times \dots \times X_p \times X_{p+1}$
- A fuzzy predicate in this setting is a mapping $X_1 \times \dots \times X_{p+1} \to [0, 1]$
- The dummy mapping t(.) gives the actual truth value (from [0, 1]) for a given linguistic expression
Crisp Categorical Attributes

Assume that X_r = {L_{r,1}, ..., L_{r,N_r}}; then the following two predicates can be defined:

$$t(x \text{ is } L_{r,j}) = \begin{cases} 1 & \text{if } x_r = L_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$t(x \text{ is not } L_{r,j}) = \begin{cases} 1 & \text{if } x_r \neq L_{r,j} \\ 0 & \text{otherwise} \end{cases}$$
Fuzzy Categorical Attributes

For a fuzzy categorical attribute r we have an unstructured set of N_r labels {L_{r,1}, ..., L_{r,N_r}}. The attribute domain is

$$X_r = \mathcal{F}\big(\{L_{r,1}, \dots, L_{r,N_r}\}\big) \cong [0, 1]^{N_r},$$

so a value is a tuple of truth values $x_r = (t_{r,1}, \dots, t_{r,N_r})$, and

$$t(x \text{ is } L_{r,j}) = t_{r,j} \qquad t(x \text{ is not } L_{r,j}) = 1 - t_{r,j}$$
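A minimal sketch of these two predicates (the label set and truth values are hypothetical):

```python
# Hypothetical fuzzy categorical attribute with N_r = 3 labels
labels = ("red", "green", "blue")
x_r = (0.9, 0.3, 0.0)            # x_r = (t_r1, t_r2, t_r3) in [0, 1]^3

def t_is(x_r, j):
    """t(x is L_rj) = t_rj"""
    return x_r[j]

def t_is_not(x_r, j):
    """t(x is not L_rj) = 1 - t_rj"""
    return 1.0 - x_r[j]
```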
Fuzzy Attributes

Given a set of linguistic labels M_{r,1}, ..., M_{r,N_r} and their corresponding semantics modeled by fuzzy sets, we can define 4·N_r atomic fuzzy predicates:

$$t(x \text{ is } M_{r,j}) = \mu_{M_{r,j}}(x_r)$$
$$t(x \text{ is not } M_{r,j}) = 1 - \mu_{M_{r,j}}(x_r)$$
$$t(x \text{ is at least } M_{r,j}) = \sup\{\mu_{M_{r,j}}(u) \mid u \le x_r\}$$
$$t(x \text{ is at most } M_{r,j}) = \sup\{\mu_{M_{r,j}}(u) \mid u \ge x_r\}$$
Default Predicate for Missing Values

$$t(x \text{ is } \mathrm{NA}_r) = \begin{cases} 1 & \text{if } x_r \text{ is missing} \\ 0 & \text{otherwise} \end{cases}$$
Compound Fuzzy Predicates

$$t\big(\neg p(x)\big) = 1 - t(p(x))$$
$$t\big((p \wedge q)(x)\big) = T\big(t(p(x)), t(q(x))\big)$$
$$t\big((p \vee q)(x)\big) = S\big(t(p(x)), t(q(x))\big)$$

where T is a t-norm and S is a t-conorm.
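With the minimum t-norm and maximum t-conorm (one common choice of T and S; product and probabilistic sum would work analogously), the connectives can be sketched as follows. The atomic predicates and the sample are hypothetical.

```python
# Minimum t-norm T and maximum t-conorm S, one standard choice
T = min
S = max

def NOT(p):
    return lambda x: 1.0 - p(x)

def AND(p, q):
    return lambda x: T(p(x), q(x))

def OR(p, q):
    return lambda x: S(p(x), q(x))

# Hypothetical atomic predicates returning truth values in [0, 1]
is_tall = lambda x: x["tall"]
is_old = lambda x: x["old"]

sample = {"tall": 0.8, "old": 0.3}
t_and = AND(is_tall, is_old)(sample)       # min(0.8, 0.3) = 0.3
t_or = OR(is_tall, NOT(is_old))(sample)    # max(0.8, 0.7) = 0.8
```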
Binary Fuzzy Decision Trees

- Each decision tree has a root node and child nodes.
- Each child node is either a leaf node or the root node of a subtree.
- Each non-leaf node has exactly two child nodes.
- Each non-leaf node is associated with a fuzzy predicate.
- Each leaf node is associated with a class assignment.
The FS-ID3 Algorithm

Input: (fuzzy) goal predicates C = {C_1, ..., C_R}, fuzzy set of samples X_cur, set of test predicates P
Output: tree node N_cur

IF stopping criterion is fulfilled THEN BEGIN
    compute class assignment C_cur
    N_cur is a leaf node with class assignment C_cur
END
ELSE BEGIN
    find best predicate $p^* = \mathrm{argmax}_{p \in P}\, G(p, X_{cur})$
    compute new memberships for the left branch:
        $\mu_{X'}(x^i) = t\big((x^i \text{ is } X_{cur}) \wedge p^*(x^i)\big)$
    compute left branch N' = FS-ID3(C, X', P)
    compute new memberships for the right branch:
        $\mu_{X''}(x^i) = t\big((x^i \text{ is } X_{cur}) \wedge \neg p^*(x^i)\big)$
    compute right branch N'' = FS-ID3(C, X'', P)
    N_cur is a parent node with children N' and N''
END
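The two membership updates in the ELSE branch amount to a pointwise conjunction of the current memberships with the (negated) truth values of the best predicate. A minimal sketch, assuming the minimum t-norm and hypothetical numbers:

```python
# Fuzzy sample set: membership degree mu_Xcur(x^i) per sample index i
mu_cur = [1.0, 0.9, 0.4, 0.7]

# Hypothetical truth values t(p*(x^i)) of the best test predicate
t_p = [0.8, 0.2, 1.0, 0.5]

# Left branch:  mu_X'(x^i)  = T(mu_Xcur(x^i), t(p*(x^i)))
mu_left = [min(m, t) for m, t in zip(mu_cur, t_p)]

# Right branch: mu_X''(x^i) = T(mu_Xcur(x^i), 1 - t(p*(x^i)))
mu_right = [min(m, 1.0 - t) for m, t in zip(mu_cur, t_p)]
```

Unlike crisp ID3, a sample is not routed to exactly one child: it belongs to both branches with possibly nonzero degrees.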
Computing the Gain of Information

$$|N| = \sum_{x \in X} \mu_N(x) \qquad |p(X)| = \sum_{x \in X} t\big(p(x)\big)$$

$$G(p, X) = E(\{p_i(X) \mid i = 1, \dots, R\}) - \Big( r'(X)\, E(\{p'_i(X) \mid i = 1, \dots, R\}) + r''(X)\, E(\{p''_i(X) \mid i = 1, \dots, R\}) \Big)$$

$$E(P) = -\sum_{q \in P} q \log_2 q$$
Computing the Gain of Information (cont'd)

$$p_i(X) = \frac{|c_i(X)|}{\sum_{j=1}^{R} |c_j(X)|} \qquad
p'_i(X) = \frac{|(p \wedge c_i)(X)|}{\sum_{j=1}^{R} |(p \wedge c_j)(X)|} \qquad
p''_i(X) = \frac{|(\neg p \wedge c_i)(X)|}{\sum_{j=1}^{R} |(\neg p \wedge c_j)(X)|}$$

$$r'(X) = \frac{\sum_{j=1}^{R} |(p \wedge c_j)(X)|}{\sum_{j=1}^{R} |c_j(X)|} \qquad
r''(X) = \frac{\sum_{j=1}^{R} |(\neg p \wedge c_j)(X)|}{\sum_{j=1}^{R} |c_j(X)|}$$
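These quantities can be computed directly from truth-value tables. A sketch under assumptions (minimum t-norm for the conjunction, 1 - t for the negation, hypothetical truth values), with a crisp sanity check: if the test predicate separates two classes perfectly, the gain is 1 bit.

```python
from math import log2

def E(P):
    """E(P) = -sum q log2 q (entries q = 0 contribute nothing)."""
    return -sum(q * log2(q) for q in P if q > 0)

def fuzzy_gain(t_p, t_c):
    """G(p, X) from truth values t_p[k] = t(p(x^k)) and t_c[i][k] = t(c_i(x^k)).
    Sigma-counts |.| are sums of truth values; a full implementation would
    also guard against empty (zero-count) branches."""
    R = len(t_c)
    c = [sum(t_c[i]) for i in range(R)]                                   # |c_i(X)|
    pc = [sum(min(a, b) for a, b in zip(t_p, t_c[i])) for i in range(R)]  # p AND c_i
    nc = [sum(min(1 - a, b) for a, b in zip(t_p, t_c[i])) for i in range(R)]
    p0 = [c[i] / sum(c) for i in range(R)]     # p_i(X)
    p1 = [pc[i] / sum(pc) for i in range(R)]   # p'_i(X)
    p2 = [nc[i] / sum(nc) for i in range(R)]   # p''_i(X)
    r1, r2 = sum(pc) / sum(c), sum(nc) / sum(c)   # r'(X), r''(X)
    return E(p0) - (r1 * E(p1) + r2 * E(p2))

t_p = [1.0, 1.0, 0.0, 0.0]                     # test predicate truth values
t_c = [[1.0, 1.0, 0.0, 0.0],                   # class 1 truth values
       [0.0, 0.0, 1.0, 1.0]]                   # class 2 truth values
```

A constant predicate with truth value 0.5 everywhere yields a gain of 0 on the same data, as it should.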
Stopping Criteria

- No more samples: the number of samples falls below a certain threshold.
- Only one class remains: most samples belong to the same class.
- Maximum depth reached: the depth of the tree reaches a predefined maximum.
- No new rules found: no new rule that increases the classification quality is found.
Applying FS-ID3

- Goal predicates: R = N_{p+1}, t(c_j(x)) = t(x is L_{p+1,j}) (with j = 1, ..., N_{p+1})
- Test predicates: the set of all predicates defined for variables 1, ..., p
- Sample set: initialized with X_cur = X
Class Assignments (1/3)

- At each leaf node, a fuzzy sample set X_cur remains.
- The class assignment C_cur can either be crisp (the leaf node is assigned to one goal predicate) or fuzzy (the leaf node is fuzzily assigned to the goal predicates).
- Crisp majority decision: the leaf node is assigned to the goal predicate C_j for which

  $$\sum_{x \in X} t\big(c_j(x_{p+1})\big)$$

  is maximal.
Class Assignments (2/3)

- Proportional assignment: the leaf node is assigned to each goal predicate C_j with a degree of

  $$\frac{\sum_{x \in X} t\big(c_j(x_{p+1})\big)}{\sum_{i=1}^{R} \sum_{x \in X} t\big(c_i(x_{p+1})\big)}$$

- Normalized assignment: the leaf node is assigned to each goal predicate C_j with a degree of

  $$\frac{\sum_{x \in X} t\big(c_j(x_{p+1})\big)}{\max_{i=1}^{R} \sum_{x \in X} t\big(c_i(x_{p+1})\big)}$$
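Starting from the per-class sigma-counts s_j = Σ_x t(c_j(x_{p+1})) at a leaf, the three strategies can be sketched as follows (the numbers are hypothetical):

```python
# Hypothetical per-class sums s_j at one leaf node, for R = 3 goal predicates
s = [6.0, 3.0, 1.0]

# Crisp majority decision: index of the maximal sum
crisp = s.index(max(s))

# Proportional assignment: degrees sum to 1
proportional = [v / sum(s) for v in s]

# Normalized assignment: the maximal degree is 1
normalized = [v / max(s) for v in s]
```

The proportional degrees form a distribution over the classes, while the normalized degrees keep the strongest class at full membership; the next slide discusses how the proportional variant should be interpreted.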
Class Assignments (3/3)

- Note that, although the approach seems plausible at first glance, the results of proportional assignment cannot be understood as fuzzy membership degrees (since relative frequencies are not truth-functional).
- More correctly, one can understand the relative frequencies in the leaf nodes (proportional assignment) as the probabilities with which a sample potentially belongs to the respective classes.
Fuzzy Decision Trees vs. Fuzzy Rules

- Every fuzzy decision tree can be interpreted as a set of fuzzy rules.
- Each leaf node corresponds to one rule.
- The antecedent (i.e. IF part) of each rule is the conjunction of the predicates corresponding to the path from the root node to the respective leaf node.
- The consequent (i.e. THEN part) is determined by the class assignment of the respective leaf node.
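Extracting one rule per leaf can be sketched as a root-to-leaf traversal. The nested-dict tree layout here is an assumption for illustration (a predicate string plus "true"/"false" children, with class labels at the leaves), loosely in the spirit of the Iris example below:

```python
def tree_to_rules(node, path=()):
    """Collect (antecedent, consequent) pairs, one per leaf.
    The antecedent is the conjunction of predicates along the path."""
    if isinstance(node, str):                    # leaf: class assignment
        return [(" AND ".join(path) or "TRUE", node)]
    rules = []
    rules += tree_to_rules(node["true"], path + (node["pred"],))
    rules += tree_to_rules(node["false"], path + (f"NOT {node['pred']}",))
    return rules

# Hypothetical fuzzy decision tree
tree = {
    "pred": "petal_width is at least high",
    "true": "Iris virginica",
    "false": {
        "pred": "petal_length is at least low",
        "true": "Iris versicolor",
        "false": "Iris setosa",
    },
}
rules = tree_to_rules(tree)
```

Each returned pair reads as one fuzzy rule, e.g. "IF NOT petal_width is at least high AND petal_length is at least low THEN Iris versicolor".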
Example: FS-ID3 Decision Tree for the Iris Data Set

[Figure: FS-ID3 decision tree for the Iris data set. Inner nodes test the predicates petal_length_IsAtLeast_L and petal_width_IsAtLeast_H; the leaves assign the classes Iris setosa, Iris versicolor, and Iris virginica.]
Example: FS-ID3 Decision Tree for the Wine Data Set

[Figure: FS-ID3 decision tree for the Wine data set. Inner nodes test the predicates Flavanoids_IsAtLeast_M, Alcohol_IsAtLeast_M, and ColorIntensity_IsAtLeast_L; the leaves assign the classes C1, C2, and C3.]