CSc 103 Programming Project

Famous Names

The goal of this project is to write a Java implementation of a generic Binary Search Tree and write a client application that uses it.

Overview - Frequency Analysis of Famous Names

Using an online database, the instructor has created a file of nearly 34,000 names of famous people.  Here's a brief excerpt:

Linus Torvalds
Arturo Toscanini
Peter Tosh
Nina Totenberg
Blanche M. Touhill
Marie J. Toulantis
Henri de Toulouse-Lautrec

Let's imagine someone is curious to know what the most common first names are. 

We want a program that can read a text file of names and analyze it to determine the frequency of occurrence of first names. 
Assume that there is only one name per line and that the person's first name starts at the beginning of the line and ends at the first blank. Valid first names contain only alphabetic characters and have more than one letter. Ignore invalid first names such as J'Nae. Disregard differences between lower and upper case letters.
The output report should show the number of times each name was found in the text, ordered by frequency, then by upper case name (alphabetically).  Each line contains the name, the count, and the percent of the total number of names (to 2 decimal places).

For example, an analysis of this data file would yield:
BOB 3  20.00
MARY 3 20.00
TOM 3 20.00
ED 2 13.33
JILL 1 6.67
MARK 1 6.67
PRINCE 1 6.67
REVEREND 1 6.67

Note that even though "REVEREND" is a title, because it appears at the start of the line we will treat it as a first name.

The application should obtain the input text from standard input.  (Do not print any prompt to the user). 

Software Design

Click on a class in the diagram to view the javadocs.

Class Diagram


You are to implement all the classes in the above design, except for FamousNameApp, BinaryNode and Natural.

BinarySearchTree   This is a generic version of StringSearchTree that we developed in lab.
NameCount, an ADT for a name count.
NameCountComparator, implements java.util.Comparator<NameCount>, to enable sorting on frequency.
FamousName, an ADT for a special kind of string that represents a famous name.
Document, process a stream of english text to count the famous names.
NameFrequencyReport, generate the finished report.

Here are the .class files for the instructor components in a JAR formatted file:     FamousNameComponents.jar
DO NOT UNZIP THIS FILE. Add it as a custom library to your BlueJ project. It will be considered cheating if you unzip the jar file, submit the jar file to Web-CAT, or submit the unzipped classes with your project.


For your convenience, skeletons.zip contains the skeletons for the files you need to implement.

Testing

You must write your own JUnit tests.

Your components will be tested with instructor reference tests.  The reference tests are JUnit tests of your classes (shown in green in the class diagram above).  There will be a performance test: the 34,000 names from the online database. There's a time limit of 10 seconds.

On this project your solution must conform to the class coding standard. Your unit tests must demonstrate "statement" coverage (they cause every statement in your program to be executed.)


Handing in Your Source Electronically

GRADING

Your score on Web-CAT contributes to your project score as follows:
Results from Running Your Tests (100)
Estimate of Problem Coverage (100)
Both of these must be 100 or your project score is zero.
60%
Code Coverage from Your Tests  (0-100)
20%
Style/Coding (0-10)  
10%
Design/Readability (inspection by grader)
10%




FAQ

Q: Web-CAT keeps deducting style points from me because my equals() methods in NameCount and FamousName are too similar, but generally speaking most equals() methods are rather similar looking so I see no way around it.
A: Yes, you've interpreted the error correctly. I agree, in this case Checkstyle is being a bit too sensitive. It's annoying, but we have to be creative and rewrite equals() in a different manner so as to fool Checkstyle. If you need ideas, try this reference.

Q: In the NameCount class, are you sure that we are not to have a getName() method? Otherwise, we would have to do string parsing when we're trying to implement the compareTo() method.
A: There is no getName() method and you don't have to do any string parsing. I suspect you are forgetting that compareTo() has access to all of an object's fields. Perhaps this reference will help.

Q: Does the number returned by the countNames() method of Document include duplicates or not?
A: Yes, a legal name is counted every time it appears.

Q: Is Collections.sort() a stable sort ?
A: Yes.


Binary Search Tree extra credit (30%)


This extra credit has two parts.  The first part is described here.  When you submit your results to the instructor he will evaluate your work and if it is adequate he will present you with part 2.   Each part is worth 50% of the extra credit.

In theory the Binary Search Tree used in this problem should be faster than if we stored the names in an ArrayList, because searching an ArrayList is an O(N) operation, whereas searching a BST should be O(logN).   You are to perform an experiment to verify this theoretical result. 

There is a hidden command line flag in FamousNameApp that will display the elapsed time.
java -cp +libs/FamousNameComponents.jar:. FamousNameApp famousnames.txt -t
Elapsed time (secs): 2.179
Report length is 5721


Create a file of at least 30,000 names.   
You can use this file of over 170,000 names:  170Knames.zip  (0.5MB zipped, 2.3MB unzipped)

Run the app and record the elapsed time.

Modify the program so it uses an ArrayList instead of a BinarySearchTree.
Run the modified app with the same data file and record the elapsed time.
Our theoretical prediction is that it should be significantly slower.

Create a short report that includes the experimental observations you made and an explanation of the results.
You may email your report as a PDF to the instructor or bring it to office hours.

Part 1 is due 3 days after the Project deadline.
If your part 1 is accepted, part 2 is due 3 days later.