Name and Section:
Status: Final
Points: 10
Deadline: Tuesday, Nov. 23, end of lab
In this lab exercise, you will use another tool from the Computational Intelligence Lab at UBC in Vancouver (http://www.cs.ubc.ca/labs/lci/CIspace). As with the previous exercise, you can use the online version, or download the code and run it locally on your machine as a Java application.
The topic of the lab is learning with decision trees. The tool allows you to experiment with predefined examples or your own data sets.
Invoke the Decision Tree (dTree) applet, and load the sample file "CarExample.txt". With the "Step" button, the program creates the decision tree for you step by step; with "Auto-Solve" the full tree is created at once. You can select the information to be displayed when a node is clicked on by choosing one of the "View Node Info," "View Mapped Examples," and "View Histogram" buttons. The "Split Node" button allows you to determine which property to use as the decision criterion by clicking on a node that has not been expanded yet (shown in blue). After a tree has been created, the "Test" and "Test New Example" buttons can be used to see whether the decision tree makes the right choice.
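For orientation, here is a minimal Python sketch of the kind of recursive construction the applet steps through. The data format, function names, and default split rule are illustrative assumptions, not the applet's actual implementation; choosing the attribute for the current node corresponds to the "Split Node" button, and a test example that reaches a branch value never seen in training ends up with "no prediction."

from collections import Counter

def build_tree(examples, attributes, choose_attribute=lambda exs, attrs: attrs[0]):
    """Build a tree from (features_dict, label) pairs by splitting recursively."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:            # all remaining examples agree: make a leaf
        return labels[0]
    if not attributes:                   # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    attr = choose_attribute(examples, attributes)        # the "Split Node" step
    branches = {}
    for value in {features[attr] for features, _ in examples}:
        subset = [(f, lbl) for f, lbl in examples if f[attr] == value]
        rest = [a for a in attributes if a != attr]
        branches[value] = build_tree(subset, rest, choose_attribute)
    return {"attribute": attr, "branches": branches}

def classify(tree, features):
    """Walk the tree; return None ("no prediction") for unseen branch values."""
    while isinstance(tree, dict):
        value = features.get(tree["attribute"])
        if value not in tree["branches"]:
            return None
        tree = tree["branches"][value]
    return tree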
Answer the following questions based on some experiments with the decision tree tool. You will have to compare different results against each other, so it will be helpful to print out or copy and paste them.
Reset the graph, and use the "Auto-Create" button to generate the tree automatically from the original data set.
Results: When you press the "Test" button, a panel comes up showing the results predicted correctly, incorrectly, and the ones with no prediction. Why does the tool generate a tree that gets a significant number of examples wrong, and can't handle others?
The numbers I got were as follows:
First attribute: Price
correct: 71%
no prediction: 3%
incorrect: 26%
Attribute | Correct % | Undetermined % | Incorrect % |
Price | 79 | 3 | 18 |
There are several methods to rank the attributes, and select the best one:
Attribute that matches best with the outcome: the one with the highest number of cases that map directly onto the outcome (True/False), and the smallest number of unresolved cases.
Information gain: the attribute that provides the highest amount of information about the outcome. This relies on calculations that are tedious and error-prone to do by hand; a small sketch follows this list.
Balanced information gain: Choose the node with an information gain closest to 0.5.
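Below is a small Python sketch of the information-gain calculation, applied to a made-up handful of examples (not the applet's CarExample.txt); the attribute names and values are assumptions for illustration.

from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of outcome labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attr):
    """Entropy of the outcomes minus the weighted entropy after splitting on attr."""
    labels = [label for _, label in examples]
    gain = entropy(labels)
    for value in {features[attr] for features, _ in examples}:
        subset = [lbl for features, lbl in examples if features[attr] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

# A made-up handful of examples: (attribute values, acceptable?)
examples = [
    ({"price": "high", "safety": "low"},  "no"),
    ({"price": "high", "safety": "high"}, "yes"),
    ({"price": "low",  "safety": "high"}, "yes"),
    ({"price": "low",  "safety": "low"},  "no"),
    ({"price": "med",  "safety": "high"}, "yes"),
]
for attr in ("price", "safety"):
    print(attr, round(information_gain(examples, attr), 3))

With this in place, an attribute chooser for the tree-building sketch above could simply be max(attributes, key=lambda a: information_gain(examples, a)).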
For humans, a frequent strategy is to rank the attributes according to their priority, and construct the tree accordingly. "Safety" or "Price" would frequently be the one to start from, irrespective of its information gain or the shape of the resulting tree. A common fallacy is to impose this human perspective on automatically generated trees, and to interpret the higher levels of the tree as more "important" than lower ones. The algorithm, however, does not know about "importance" (which is probably subjective anyway), and simply selects the nodes based on a given strategy.
Here are some problems:
Not enough samples. Ideally, there should be at least one sample for each decision point in the tree.
Samples may not be representative of the domain. In this case, the distribution of test cases seems to be significantly different from that of the samples.
The decisions captured in the samples may have been biased. For example, it is plausible that "safety" was a more important criterion for the person making the decisions than "trunk size," and the samples seem to suggest that high "maintenance costs" were quite acceptable.
Inconsistencies (contradictions) in the training set.
There may also be problems with misspellings, and the use of undefined values for attributes (e.g., "llow", "vhigh", "med" for price). The sketch below shows how such issues could be checked automatically.
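As a rough illustration, the sketch below checks a training set for two of the problems above: contradictory examples (identical attribute values, different outcomes) and attribute values outside a declared domain. The domain table and data layout are assumptions, not the actual format of the applet's data files.

from collections import defaultdict

# Hypothetical value domains; the real data set may define these differently.
DOMAINS = {"price": {"low", "med", "high"},
           "safety": {"low", "med", "high"}}

def find_problems(examples):
    """examples: list of (features_dict, label). Returns (contradictions, bad_values)."""
    labels_by_case = defaultdict(set)
    bad_values = []
    for features, label in examples:
        labels_by_case[tuple(sorted(features.items()))].add(label)
        for attr, value in features.items():
            if attr in DOMAINS and value not in DOMAINS[attr]:
                bad_values.append((attr, value))         # e.g. a misspelled "llow"
    contradictions = [dict(case) for case, labels in labels_by_case.items()
                      if len(labels) > 1]                # same inputs, different outcomes
    return contradictions, bad_values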
Reset the graph, and use the "Split Node" button to manually control the generation of the tree. After activating that button, click on the blue rectangle. Then select the "safety" attribute, and generate the rest of the tree via the "Step" or "Auto-Solve" buttons. Make sure that you are using the original data set, in case you changed some values or added new examples.
Results: Press the "Test" button for this tree, and compare the results against the auto-generated one. Why do you think the automatically generated tree performs differently?
First attribute: Safety
correct: 71%
no prediction: 21%
incorrect: 8%
Attribute | Correct % | Undetermined % | Incorrect % |
Safety | 71 | 12 | 17 |
Safety | 82 | 0 | 18 |
This tree has fewer incorrect test results, but more with no prediction. This again reflects how the test set relates to the training set.
Here are some problems:
Not enough samples.
Samples may not be representative of the domain. In this case, the distribution of test cases seems to be significantly different from that of the samples.
Reset the graph, and use the "Split Node" button to manually control the generation of the tree again. Try to find a criterion that results in a tree with the highest percentage of correctly predicted examples.
Results: What is the best first-choice criterion that you could find? Can you describe a strategy to generate such a high-performance tree?
The attribute that seems to generate the highest percentage of correct predictions is "Price." This is somewhat difficult to tell, since the tool scrambles the samples, so different attributes may yield the best results in different runs.
The table below lists some of the results that I got in different runs, and others that students reported. Since they are based on different training sets and arrangements, comparisons should be made with care.
Attribute | Correct % | Undetermined % | Incorrect % |
Price | 85 | ? | ? |
Safety | 79 | 9 | 12 |
Price | 79 | 3 | 18 |
Maintenance | 79 | 0 | 21 |
Persons | 76 | 6 | 18 |
Doors | 76 | 3 | 21 |
Doors | 76 | 0 | 24 |
Safety | 71 | 12 | 17 |
Persons | 71 | 6 | 23 |
Finding a good strategy here is not necessarily straightforward. While the ranking methods listed above help, they are not guaranteed to find a tree with high performance, nor one that reflects human priorities. Many students used a trial-and-error strategy, and ended up checking all possible attributes as initial choices. One student reported finding a tree with 80% correct predictions by randomly picking nodes.
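The trial-and-error strategy can also be mimicked programmatically. The sketch below scores every attribute as a possible first split, using a one-level tree (a decision "stump") as a cheap stand-in for the full tree the applet would build; the data format and function names are assumptions for illustration, and unseen values simply count as misses here.

from collections import Counter, defaultdict

def stump_accuracy(train, test, attr):
    """Predict each test example by the majority outcome among training
    examples that share its value of attr; return the fraction correct."""
    by_value = defaultdict(list)
    for features, label in train:
        by_value[features[attr]].append(label)
    majority = {value: Counter(labels).most_common(1)[0][0]
                for value, labels in by_value.items()}
    hits = sum(1 for features, label in test
               if majority.get(features[attr]) == label)   # unseen value -> miss
    return hits / len(test)

def rank_first_splits(train, test, attributes):
    """Order candidate first splits from most to least promising."""
    return sorted(attributes,
                  key=lambda a: stump_accuracy(train, test, a),
                  reverse=True)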
Reset the graph, and use the "Split Node" button to manually control the generation of the tree again. Try to find a criterion that results in a tree with a small percentage of correctly predicted examples.
Results: What is the least suitable first-choice criterion that you could find? Can you describe a strategy to generate such a low-performance tree?
Attribute | Correct % | Undetermined % | Incorrect % |
Trunk Size | 32 | 53 | 15 |
Doors | 41 | ? | ? |
Trunk Size | 41 | ? | ? |
Trunk Size | 47 | 32 | 21 |
Trunk Size | 50 | 21 | 29 |
Trunk Size | 53 | ? | ? |
Doors | 59 | 21 | 20 |
Doors | 65 | 6 | 29 |
Strategies to find the lowest-performance tree can be derived from the ones used for highest performance by selecting the worst node according to the ranking chosen. So, for example, this could be the node with the lowest information gain, or the one that leaves the most unresolved cases in a given step.
Franz Kurfess