Unit 14: Decision Trees

Knowledge-Based Methods in Image Processing and Pattern Recognition; Ulrich Bodenhofer
Introduction

- Nearest prototype classifiers are computationally expensive black-box models.
- Decisions can be made in a more structured way, e.g. by asking questions successively.
- A decision tree is a classifier which makes classifications by asking questions successively; each level corresponds to a question, and each leaf corresponds to a final classification.
Construction of Decision Trees

- The top-down construction of a decision tree is, more or less, straightforward.
- To construct a decision tree from data, we have to determine which questions to ask in order to achieve an acceptable result.
- In the popular ID3 method, this is done by considering the gain of information at each node.
- In the following, for convenience, we adopt the convention X_{p+1} = Y.
The ID3 Algorithm

1. Given: data set X = {x^i | i = 1, ..., n}; assume that all p + 1 variables are categorical, i.e. X_i = {1, ..., C_i}.
2. Call ID3(X, Root, {1, ..., p}).
3. ID3(X, N, I):
   (a) If all x in X belong to the same output class, exit.
   (b) Determine the component i ∈ I for which the gain of information g_i(X) is maximal.
   (c) Divide X into disjoint subsets (for j = 1, ..., C_i):

       $X_{ji} = \{x \in X \mid x_i = j\}$   (1)

   (d) For all j such that X_{ji} ≠ ∅: generate a new node N_j and call ID3(X_{ji}, N_j, I \ {i}).
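The recursion above can be sketched in Python. This is a minimal sketch under assumptions not fixed by the slide: samples are tuples with the class label in the last position, all attributes are categorical, and `gain` implements the entropy-based gain of information defined on the following slide.

```python
from collections import Counter
from math import log2

def entropy(X):
    """Entropy of the output-class distribution (class = last tuple component)."""
    n = len(X)
    return -sum(c / n * log2(c / n) for c in Counter(x[-1] for x in X).values())

def gain(X, i):
    """Gain of information g_i(X) for splitting X on component i."""
    subsets = {}
    for x in X:
        subsets.setdefault(x[i], []).append(x)   # disjoint subsets X_ji
    return entropy(X) - sum(len(S) / len(X) * entropy(S) for S in subsets.values())

def id3(X, I):
    """Build a tree as nested dicts; leaves are class labels."""
    if len({x[-1] for x in X}) == 1 or not I:    # step (a): all samples in one class
        return Counter(x[-1] for x in X).most_common(1)[0][0]
    i = max(I, key=lambda j: gain(X, j))         # step (b): maximal gain of information
    subsets = {}
    for x in X:                                  # step (c): disjoint subsets X_ji
        subsets.setdefault(x[i], []).append(x)
    return {"attr": i,                           # step (d): recurse without attribute i
            "children": {j: id3(Xji, I - {i}) for j, Xji in subsets.items()}}

def classify(tree, x):
    """Follow the questions from the root down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["children"][x[tree["attr"]]]
    return tree

# Hypothetical data set: two categorical inputs, class label in the last position
X = [(1, 1, "A"), (1, 2, "B"), (2, 1, "B"), (2, 2, "A")]
tree = id3(X, {0, 1})
```

On this XOR-like toy set, both attributes are needed, so the tree asks two questions before reaching a leaf.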
Computing the Gain of Information

$$H(Y) = -\sum_{i=1}^{C_{p+1}} \frac{|Y_{i\,p+1}|}{|Y|} \log_2 \frac{|Y_{i\,p+1}|}{|Y|}$$

$$g_i(X) = H(X) - \sum_{j=1}^{C_i} \frac{|X_{ji}|}{|X|}\, H(X_{ji})$$
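To make the formulas concrete, consider a hypothetical four-sample set with one attribute x_1 ∈ {1, 2} and two output classes split 3:1, so H(X) = -(3/4 log2 3/4 + 1/4 log2 1/4) ≈ 0.811; splitting on x_1 yields subsets with entropies 0 and 1, hence g_1(X) ≈ 0.811 - (2/4)·0 - (2/4)·1 ≈ 0.311. In code:

```python
from math import log2

# Hypothetical data set: (x_1, y) with x_1 in {1, 2} and output class y in {1, 2}
X = [(1, 1), (1, 1), (2, 1), (2, 2)]

def H(S):
    """Entropy of the output-class distribution in S."""
    n = len(S)
    counts = [sum(1 for x in S if x[-1] == y) for y in {x[-1] for x in S}]
    return -sum(c / n * log2(c / n) for c in counts)

X11 = [x for x in X if x[0] == 1]   # subset X_11: both samples in class 1, H = 0
X21 = [x for x in X if x[0] == 2]   # subset X_21: one sample per class, H = 1
g1 = H(X) - len(X11) / len(X) * H(X11) - len(X21) / len(X) * H(X21)
```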
Fuzzy Decision Trees

- Classical decision trees can only process crisp categorical attributes.
- There are extensions that can process real-valued attributes (CART, C4.5), but they all split the real line into crisp intervals with artificially sharp boundaries; therefore, no interpolative behavior can be modeled.
- Working with fuzzy instead of crisp predicates overcomes this problem.
- The FS-ID3 algorithm is an efficient variant that also accommodates classical decision trees.
The Basic Setting

- Data samples (i = 1, ..., n): $x^i = (x^i_1, \dots, x^i_p, x^i_{p+1}) \in X_1 \times \dots \times X_p \times X_{p+1}$
- A fuzzy predicate in this setting is a mapping $X_1 \times \dots \times X_{p+1} \to [0, 1]$
- The dummy mapping t(.) gives the actual truth value (from [0, 1]) for a given linguistic expression
Crisp Categorical Attributes

Assume that X_r = {L_{r,1}, ..., L_{r,N_r}}; then the following two predicates can be defined:

$$t(x \text{ is } L_{r,j}) = \begin{cases} 1 & \text{if } x_r = L_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$t(x \text{ is not } L_{r,j}) = \begin{cases} 1 & \text{if } x_r \neq L_{r,j} \\ 0 & \text{otherwise} \end{cases}$$
Fuzzy Categorical Attributes

For a fuzzy categorical attribute r we have an unstructured set of N_r labels {L_{r,1}, ..., L_{r,N_r}}. The attribute domain is

$$X_r = \mathcal{F}\big(\{L_{r,1}, \dots, L_{r,N_r}\}\big) \cong [0, 1]^{N_r},$$

so a value is a tuple of truth values $x_r = (t_{r,1}, \dots, t_{r,N_r})$, and

$$t(x \text{ is } L_{r,j}) = t_{r,j} \qquad t(x \text{ is not } L_{r,j}) = 1 - t_{r,j}$$
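A minimal sketch of these two predicates (the label set and truth values are hypothetical):

```python
# Hypothetical fuzzy categorical attribute with N_r = 3 labels
labels = ("red", "green", "blue")
x_r = (0.9, 0.3, 0.0)            # x_r = (t_r1, t_r2, t_r3) in [0, 1]^3

def t_is(x_r, j):
    """t(x is L_rj) = t_rj"""
    return x_r[j]

def t_is_not(x_r, j):
    """t(x is not L_rj) = 1 - t_rj"""
    return 1.0 - x_r[j]
```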
Fuzzy Attributes

Given a set of linguistic labels M_{r,1}, ..., M_{r,N_r} and their corresponding semantics modeled by fuzzy sets, we can define 4·N_r atomic fuzzy predicates:

$$t(x \text{ is } M_{r,j}) = \mu_{M_{r,j}}(x_r)$$
$$t(x \text{ is not } M_{r,j}) = 1 - \mu_{M_{r,j}}(x_r)$$
$$t(x \text{ is at least } M_{r,j}) = \sup\{\mu_{M_{r,j}}(u) \mid u \le x_r\}$$
$$t(x \text{ is at most } M_{r,j}) = \sup\{\mu_{M_{r,j}}(u) \mid u \ge x_r\}$$
Default Predicate for Missing Values

$$t(x \text{ is } \mathrm{NA}_r) = \begin{cases} 1 & \text{if } x_r \text{ is missing} \\ 0 & \text{otherwise} \end{cases}$$
Compound Fuzzy Predicates

$$t\big(\neg p(x)\big) = 1 - t(p(x))$$
$$t\big((p \wedge q)(x)\big) = T\big(t(p(x)), t(q(x))\big)$$
$$t\big((p \vee q)(x)\big) = S\big(t(p(x)), t(q(x))\big)$$

where T is a t-norm and S is a t-conorm.
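With the minimum t-norm and maximum t-conorm (one common choice of T and S; product and probabilistic sum would work analogously), the connectives can be sketched as follows. The atomic predicates and the sample are hypothetical.

```python
# Minimum t-norm T and maximum t-conorm S, one standard choice
T = min
S = max

def NOT(p):
    return lambda x: 1.0 - p(x)

def AND(p, q):
    return lambda x: T(p(x), q(x))

def OR(p, q):
    return lambda x: S(p(x), q(x))

# Hypothetical atomic predicates returning truth values in [0, 1]
is_tall = lambda x: x["tall"]
is_old = lambda x: x["old"]

sample = {"tall": 0.8, "old": 0.3}
t_and = AND(is_tall, is_old)(sample)       # min(0.8, 0.3) = 0.3
t_or = OR(is_tall, NOT(is_old))(sample)    # max(0.8, 0.7) = 0.8
```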
Binary Fuzzy Decision Trees

- Each decision tree has a root node and child nodes.
- Each child node is either a leaf node or the root node of a subtree.
- Each non-leaf node has exactly two child nodes.
- Each non-leaf node is associated with a fuzzy predicate.
- Each leaf node is associated with a class assignment.
The FS-ID3 Algorithm

Input: (fuzzy) goal predicates C = {C_1, ..., C_R}, fuzzy set of samples X_cur, set of test predicates P
Output: tree node N_cur

IF stopping criterion is fulfilled THEN BEGIN
    compute class assignment C_cur
    N_cur is a leaf node with class assignment C_cur
END
ELSE BEGIN
    find best predicate $p^* = \mathrm{argmax}_{p \in P}\, G(p, X_{cur})$
    compute new memberships for the left branch:
        $\mu_{X'}(x^i) = t\big((x^i \text{ is } X_{cur}) \wedge p^*(x^i)\big)$
    compute left branch N' = FS-ID3(C, X', P)
    compute new memberships for the right branch:
        $\mu_{X''}(x^i) = t\big((x^i \text{ is } X_{cur}) \wedge \neg p^*(x^i)\big)$
    compute right branch N'' = FS-ID3(C, X'', P)
    N_cur is a parent node with children N' and N''
END
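The two membership updates in the ELSE branch amount to a pointwise conjunction of the current memberships with the (negated) truth values of the best predicate. A minimal sketch, assuming the minimum t-norm and hypothetical numbers:

```python
# Fuzzy sample set: membership degree mu_Xcur(x^i) per sample index i
mu_cur = [1.0, 0.9, 0.4, 0.7]

# Hypothetical truth values t(p*(x^i)) of the best test predicate
t_p = [0.8, 0.2, 1.0, 0.5]

# Left branch:  mu_X'(x^i)  = T(mu_Xcur(x^i), t(p*(x^i)))
mu_left = [min(m, t) for m, t in zip(mu_cur, t_p)]

# Right branch: mu_X''(x^i) = T(mu_Xcur(x^i), 1 - t(p*(x^i)))
mu_right = [min(m, 1.0 - t) for m, t in zip(mu_cur, t_p)]
```

Unlike crisp ID3, a sample is not routed to exactly one child: it belongs to both branches with possibly nonzero degrees.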
Computing the Gain of Information

$$|N| = \sum_{x \in X} \mu_N(x) \qquad |p(X)| = \sum_{x \in X} t\big(p(x)\big)$$

$$G(p, X) = E(\{p_i(X) \mid i = 1, \dots, R\}) - \Big( r'(X)\, E(\{p'_i(X) \mid i = 1, \dots, R\}) + r''(X)\, E(\{p''_i(X) \mid i = 1, \dots, R\}) \Big)$$

$$E(P) = -\sum_{q \in P} q \log_2 q$$
Computing the Gain of Information (cont'd)

$$p_i(X) = \frac{|c_i(X)|}{\sum_{j=1}^{R} |c_j(X)|} \qquad
p'_i(X) = \frac{|(p \wedge c_i)(X)|}{\sum_{j=1}^{R} |(p \wedge c_j)(X)|} \qquad
p''_i(X) = \frac{|(\neg p \wedge c_i)(X)|}{\sum_{j=1}^{R} |(\neg p \wedge c_j)(X)|}$$

$$r'(X) = \frac{\sum_{j=1}^{R} |(p \wedge c_j)(X)|}{\sum_{j=1}^{R} |c_j(X)|} \qquad
r''(X) = \frac{\sum_{j=1}^{R} |(\neg p \wedge c_j)(X)|}{\sum_{j=1}^{R} |c_j(X)|}$$
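These quantities can be computed directly from truth-value tables. A sketch under assumptions (minimum t-norm for the conjunction, 1 - t for the negation, hypothetical truth values), with a crisp sanity check: if the test predicate separates two classes perfectly, the gain is 1 bit.

```python
from math import log2

def E(P):
    """E(P) = -sum q log2 q (entries q = 0 contribute nothing)."""
    return -sum(q * log2(q) for q in P if q > 0)

def fuzzy_gain(t_p, t_c):
    """G(p, X) from truth values t_p[k] = t(p(x^k)) and t_c[i][k] = t(c_i(x^k)).
    Sigma-counts |.| are sums of truth values; a full implementation would
    also guard against empty (zero-count) branches."""
    R = len(t_c)
    c = [sum(t_c[i]) for i in range(R)]                                   # |c_i(X)|
    pc = [sum(min(a, b) for a, b in zip(t_p, t_c[i])) for i in range(R)]  # p AND c_i
    nc = [sum(min(1 - a, b) for a, b in zip(t_p, t_c[i])) for i in range(R)]
    p0 = [c[i] / sum(c) for i in range(R)]     # p_i(X)
    p1 = [pc[i] / sum(pc) for i in range(R)]   # p'_i(X)
    p2 = [nc[i] / sum(nc) for i in range(R)]   # p''_i(X)
    r1, r2 = sum(pc) / sum(c), sum(nc) / sum(c)   # r'(X), r''(X)
    return E(p0) - (r1 * E(p1) + r2 * E(p2))

t_p = [1.0, 1.0, 0.0, 0.0]                     # test predicate truth values
t_c = [[1.0, 1.0, 0.0, 0.0],                   # class 1 truth values
       [0.0, 0.0, 1.0, 1.0]]                   # class 2 truth values
```

A constant predicate with truth value 0.5 everywhere yields a gain of 0 on the same data, as it should.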
Stopping Criteria

- No more samples: the number of samples falls below a certain threshold.
- Only one class remains: most samples belong to the same class.
- Maximum depth reached: the depth of the tree reaches a predefined maximum.
- No new rules found: no new rule that increases the classification quality is found.
Applying FS-ID3

- Goal predicates: R = N_{p+1}, t(c_j(x)) = t(x is L_{p+1,j}) (with j = 1, ..., N_{p+1})
- Test predicates: the set of all predicates defined for variables 1, ..., p
- Sample set: initialized with X_cur = X
Class Assignments (1/3)

- At each leaf node, a fuzzy sample set X_cur remains.
- The class assignment C_cur can either be crisp (the leaf node is assigned to one goal predicate) or fuzzy (the leaf node is fuzzily assigned to the goal predicates).
- Crisp majority decision: the leaf node is assigned to the goal predicate C_j for which

  $$\sum_{x \in X} t\big(c_j(x_{p+1})\big)$$

  is maximal.
Class Assignments (2/3)

- Proportional assignment: the leaf node is assigned to each goal predicate C_j with a degree of

  $$\frac{\sum_{x \in X} t\big(c_j(x_{p+1})\big)}{\sum_{i=1}^{R} \sum_{x \in X} t\big(c_i(x_{p+1})\big)}$$

- Normalized assignment: the leaf node is assigned to each goal predicate C_j with a degree of

  $$\frac{\sum_{x \in X} t\big(c_j(x_{p+1})\big)}{\max_{i=1}^{R} \sum_{x \in X} t\big(c_i(x_{p+1})\big)}$$
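Starting from the per-class sigma-counts s_j = Σ_x t(c_j(x_{p+1})) at a leaf, the three strategies can be sketched as follows (the numbers are hypothetical):

```python
# Hypothetical per-class sums s_j at one leaf node, for R = 3 goal predicates
s = [6.0, 3.0, 1.0]

# Crisp majority decision: index of the maximal sum
crisp = s.index(max(s))

# Proportional assignment: degrees sum to 1
proportional = [v / sum(s) for v in s]

# Normalized assignment: the maximal degree is 1
normalized = [v / max(s) for v in s]
```

The proportional degrees form a distribution over the classes, while the normalized degrees keep the strongest class at full membership; the next slide discusses how the proportional variant should be interpreted.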
Class Assignments (3/3)

- Note that, although the approach seems plausible at first glance, the results of proportional assignment cannot be understood as fuzzy membership degrees (since relative frequencies are not truth-functional).
- More correctly, one can understand the relative frequencies in the leaf nodes (proportional assignment) as the probabilities with which a sample potentially belongs to the respective classes.
Fuzzy Decision Trees vs. Fuzzy Rules

- Every fuzzy decision tree can be interpreted as a set of fuzzy rules.
- Each leaf node corresponds to one rule.
- The antecedent (i.e. IF part) of each rule is the conjunction of the predicates corresponding to the path from the root node to the respective leaf node.
- The consequent (i.e. THEN part) is determined by the class assignment of the respective leaf node.
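Extracting one rule per leaf can be sketched as a root-to-leaf traversal. The nested-dict tree layout here is an assumption for illustration (a predicate string plus "true"/"false" children, with class labels at the leaves), loosely in the spirit of the Iris example below:

```python
def tree_to_rules(node, path=()):
    """Collect (antecedent, consequent) pairs, one per leaf.
    The antecedent is the conjunction of predicates along the path."""
    if isinstance(node, str):                    # leaf: class assignment
        return [(" AND ".join(path) or "TRUE", node)]
    rules = []
    rules += tree_to_rules(node["true"], path + (node["pred"],))
    rules += tree_to_rules(node["false"], path + (f"NOT {node['pred']}",))
    return rules

# Hypothetical fuzzy decision tree
tree = {
    "pred": "petal_width is at least high",
    "true": "Iris virginica",
    "false": {
        "pred": "petal_length is at least low",
        "true": "Iris versicolor",
        "false": "Iris setosa",
    },
}
rules = tree_to_rules(tree)
```

Each returned pair reads as one fuzzy rule, e.g. "IF NOT petal_width is at least high AND petal_length is at least low THEN Iris versicolor".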
Example: FS-ID3 Decision Tree for the Iris Data Set

[Figure: FS-ID3 decision tree for the Iris data set. Inner nodes test the predicates petal_length_IsAtLeast_L and petal_width_IsAtLeast_H; the leaves assign the classes Iris setosa, Iris versicolor, and Iris virginica.]
Example: FS-ID3 Decision Tree for the Wine Data Set

[Figure: FS-ID3 decision tree for the Wine data set. Inner nodes test the predicates Flavanoids_IsAtLeast_M, Alcohol_IsAtLeast_M, and ColorIntensity_IsAtLeast_L; the leaves assign the classes C1, C2, and C3.]