Name and Section:
Status: Final
Points: 10
Deadline: Tuesday, Nov. 23, end of lab
In this lab exercise, you will use another tool from the Computational Intelligence Lab at UBC in Vancouver (http://www.cs.ubc.ca/labs/lci/CIspace). As with the previous exercise, you can use the online version, or download the code and run it locally on your machine as a Java application.
The topic of the lab is learning with decision trees. The tool allows you to experiment with predefined examples or your own data sets.
Invoke the Decision Tree (dTree) applet, and load the sample file "CarExample.txt". With the "Step" button, the program creates the decision tree for you step by step; with "Auto-Solve" the full tree is created at once. You can select the information to be displayed when a node is clicked on by choosing one of the "View Node Info," "View Mapped Examples," and "View Histogram" buttons. The "Split Node" button allows you to determine which property to use as the decision criterion by clicking on a node that has not been expanded yet (shown in blue). After a tree has been created, the "Test" and "Test New Example" buttons can be used to see whether the decision tree makes the right choice.
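For orientation, here is a minimal Python sketch of the kind of recursive construction the applet steps through. The data format, function names, and default split rule are illustrative assumptions, not the applet's actual implementation; choosing the attribute for the current node corresponds to the "Split Node" button, and a test example that reaches a branch value never seen in training ends up with "no prediction."

from collections import Counter

def build_tree(examples, attributes, choose_attribute=lambda exs, attrs: attrs[0]):
    """Build a tree from (features_dict, label) pairs by splitting recursively."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:            # all remaining examples agree: make a leaf
        return labels[0]
    if not attributes:                   # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    attr = choose_attribute(examples, attributes)        # the "Split Node" step
    branches = {}
    for value in {features[attr] for features, _ in examples}:
        subset = [(f, lbl) for f, lbl in examples if f[attr] == value]
        rest = [a for a in attributes if a != attr]
        branches[value] = build_tree(subset, rest, choose_attribute)
    return {"attribute": attr, "branches": branches}

def classify(tree, features):
    """Walk the tree; return None ("no prediction") for unseen branch values."""
    while isinstance(tree, dict):
        value = features.get(tree["attribute"])
        if value not in tree["branches"]:
            return None
        tree = tree["branches"][value]
    return tree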
Answer the following questions based on some experiments with the decision tree tool. You will have to compare different results against each other, so it will be helpful to print out or copy and paste them.
Reset the graph, and use the "Auto-Create" button to generate the tree automatically from the original data set.
Results: When you press the "Test" button, a panel comes up showing the results predicted correctly, incorrectly, and the ones with no prediction. Why does the tool generate a tree that gets a significant number of examples wrong, and can't handle others?
The numbers I got were as follows:
First attribute: Price
correct: 71%
no prediction: 3%
incorrect: 26%
Attribute | Correct % | Undetermined % | Incorrect % |
Price | 79 | 3 | 18 |
There are several methods to rank the attributes, and select the best one:
Attribute that matches best with the outcome: the one with the highest number of cases that map directly onto the outcome (True/False), and the smallest number of unresolved cases.
Information gain: the attribute that provides the highest amount of information about the outcome. This relies on calculations that are tedious and error-prone to do by hand; a small sketch follows this list.
Balanced information gain: Choose the node with an information gain closest to 0.5.
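Below is a small Python sketch of the information-gain calculation, applied to a made-up handful of examples (not the applet's CarExample.txt); the attribute names and values are assumptions for illustration.

from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of outcome labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attr):
    """Entropy of the outcomes minus the weighted entropy after splitting on attr."""
    labels = [label for _, label in examples]
    gain = entropy(labels)
    for value in {features[attr] for features, _ in examples}:
        subset = [lbl for features, lbl in examples if features[attr] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

# A made-up handful of examples: (attribute values, acceptable?)
examples = [
    ({"price": "high", "safety": "low"},  "no"),
    ({"price": "high", "safety": "high"}, "yes"),
    ({"price": "low",  "safety": "high"}, "yes"),
    ({"price": "low",  "safety": "low"},  "no"),
    ({"price": "med",  "safety": "high"}, "yes"),
]
for attr in ("price", "safety"):
    print(attr, round(information_gain(examples, attr), 3))

With this in place, an attribute chooser for the tree-building sketch above could simply be max(attributes, key=lambda a: information_gain(examples, a)).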
For humans, a frequent strategy is to rank the attributes according to their priority, and construct the tree accordingly. "Safety" or "Price" would frequently be the one to start from, irrespective of its information gain or the shape of the resulting tree. A common fallacy is to impose this human perspective on automatically generated trees, and to interpret the higher levels of the tree as more "important" than lower ones. The algorithm, however, does not know about "importance" (which is probably subjective anyway), and simply selects the nodes based on a given strategy.
Here are some problems:
Not enough samples. Ideally, there should be at least one sample for each decision point in the tree.
Samples may not be representative of the domain. In this case, the distribution of test cases seems to be significantly different from that of the samples.
The decisions captured in the samples may have been biased. For example, it is plausible that "safety" was a more important criterion for the person making the decisions than "trunk size," and the samples seem to suggest that high "maintenance costs" were quite acceptable.
Inconsistencies (contradictions) in the training set.
There may also be problems with misspellings, and the use of undefined values for attributes (e.g., "llow", "vhigh", "med" for price). The sketch below shows how such issues could be checked automatically.
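As a rough illustration, the sketch below checks a training set for two of the problems above: contradictory examples (identical attribute values, different outcomes) and attribute values outside a declared domain. The domain table and data layout are assumptions, not the actual format of the applet's data files.

from collections import defaultdict

# Hypothetical value domains; the real data set may define these differently.
DOMAINS = {"price": {"low", "med", "high"},
           "safety": {"low", "med", "high"}}

def find_problems(examples):
    """examples: list of (features_dict, label). Returns (contradictions, bad_values)."""
    labels_by_case = defaultdict(set)
    bad_values = []
    for features, label in examples:
        labels_by_case[tuple(sorted(features.items()))].add(label)
        for attr, value in features.items():
            if attr in DOMAINS and value not in DOMAINS[attr]:
                bad_values.append((attr, value))         # e.g. a misspelled "llow"
    contradictions = [dict(case) for case, labels in labels_by_case.items()
                      if len(labels) > 1]                # same inputs, different outcomes
    return contradictions, bad_values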
Reset the graph, and use the "Split Node" button to manually control the generation of the tree. After activating that button, click on the blue rectangle. Then select the "safety" attribute, and generate the rest of the tree via the "Step" or "Auto-Solve" buttons. Make sure that you are using the original data set, in case you changed some values or added new examples.
Results: Press the "Test" button for this tree, and compare the results against the auto-generated one. Why do you think the automatically generated tree performs differently?
First attribute: Safety
correct: 71%
no prediction: 21%
incorrect: 8%
Attribute | Correct % | Undetermined % | Incorrect % |
Safety | 71 | 12 | 17 |
Safety | 82 | 0 | 18 |
This tree has fewer incorrect test results, but more with no prediction. This again reflects how the test set relates to the training set.
Here are some problems:
Not enough samples.
Samples may not be representative of the domain. In this case, the distribution of test cases seems to be significantly different from that of the samples.
Reset the graph, and use the "Split Node" button to manually control the generation of the tree again. Try to find a criterion that results in a tree with the highest percentage of correctly predicted examples.
Results: What is the best first-choice criterion that you could find? Can you describe a strategy to generate such a high-performance tree?
The attribute that seems to generate the highest percentage of correct predictions is "Price." This is somewhat difficult to tell, since the tool scrambles the samples, so different attributes may yield the best results in different runs.
The table below lists some of the results that I got in different runs, and others that students reported. Since they are based on different training sets and arrangements, comparisons should be made with care.
Attribute | Correct % | Undetermined % | Incorrect % |
Price | 85 | ? | ? |
Safety | 79 | 9 | 12 |
Price | 79 | 3 | 18 |
Maintenance | 79 | 0 | 21 |
Persons | 76 | 6 | 18 |
Doors | 76 | 3 | 21 |
Doors | 76 | 0 | 24 |
Safety | 71 | 12 | 17 |
Persons | 71 | 6 | 23 |
Finding a good strategy here is not necessarily straightforward. While the ranking methods listed above help, they are not guaranteed to find a tree with high performance, nor one that reflects human priorities. Many students used a trial-and-error strategy, and ended up checking all possible attributes as initial choices. One student reported finding a tree with 80% correct predictions by randomly picking nodes.
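The trial-and-error strategy can also be mimicked programmatically. The sketch below scores every attribute as a possible first split, using a one-level tree (a decision "stump") as a cheap stand-in for the full tree the applet would build; the data format and function names are assumptions for illustration, and unseen values simply count as misses here.

from collections import Counter, defaultdict

def stump_accuracy(train, test, attr):
    """Predict each test example by the majority outcome among training
    examples that share its value of attr; return the fraction correct."""
    by_value = defaultdict(list)
    for features, label in train:
        by_value[features[attr]].append(label)
    majority = {value: Counter(labels).most_common(1)[0][0]
                for value, labels in by_value.items()}
    hits = sum(1 for features, label in test
               if majority.get(features[attr]) == label)   # unseen value -> miss
    return hits / len(test)

def rank_first_splits(train, test, attributes):
    """Order candidate first splits from most to least promising."""
    return sorted(attributes,
                  key=lambda a: stump_accuracy(train, test, a),
                  reverse=True)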
Reset the graph, and use the "Split Node" button to manually control the generation of the tree again. Try to find a criterion that results in a tree with a small percentage of correctly predicted examples.
Results: What is the least suitable first-choice criterion that you could find? Can you describe a strategy to generate such a low-performance tree?
Attribute | Correct % | Undetermined % | Incorrect % |
Trunk Size | 32 | 53 | 15 |
Doors | 41 | ? | ? |
Trunk Size | 41 | ? | ? |
Trunk Size | 47 | 32 | 21 |
Trunk Size | 50 | 21 | 29 |
Trunk Size | 53 | ? | ? |
Doors | 59 | 21 | 20 |
Doors | 65 | 6 | 29 |
Strategies to find the lowest-performance tree can be derived from the ones used for highest performance by selecting the worst node according to the ranking chosen. So, for example, this could be the node with the lowest information gain, or the one that leaves the most unresolved cases in a given step.
Franz Kurfess