CSc 103 Programming Project
Famous Names
The goal of this project is to write a Java implementation of a
generic Binary Search Tree and write a client application that uses
it.
Overview - Frequency Analysis of Famous Names
Using an online database, the instructor has created a file of
nearly
34,000 names of famous people. Here's a brief excerpt:
Linus Torvalds
Arturo Toscanini
Peter Tosh
Nina Totenberg
Blanche M. Touhill
Marie J. Toulantis
Henri de Toulouse-Lautrec
Let's imagine someone is curious to know
what the most common first names are.
We want a program that can read a text file of names and analyze it
to determine the frequency
of occurrence of first names.
Assume that there is only one name per line and that the person's first name starts at the beginning of the line and ends at the first blank. Valid first names contain only alphabetic characters and have more than one letter. Ignore invalid first names such as J'Nae. Disregard differences between lower and upper case letters.
The output report should show the number of times
each name was found in the text, ordered by frequency, then by upper case name
(alphabetically). Each line contains the name, the count, and
the
percent of the total number of names (to 2 decimal places).
For example, an analysis of this data
file
would yield:
BOB 3 20.00
MARY 3 20.00
TOM 3 20.00
ED 2 13.33
JILL 1 6.67
MARK 1 6.67
PRINCE 1 6.67
REVEREND 1 6.67
Note that even though "REVEREND" is a title, because it appears at
the
start of the line we will treat it as a first name.
The application should obtain the input text from standard
input.
(Do not print any prompt to the user).
Software Design
Click on a class in the diagram to view the javadocs.

You are to implement all the classes in the above design, except for
FamousNameApp,
BinaryNode
and Natural.
BinarySearchTree This is a generic
version
of StringSearchTree that we developed in lab.
NameCount, an ADT for a name count.
NameCountComparator, implements java.util.Comparator<NameCount>,
to
enable sorting on frequency.
FamousName, an ADT for a special kind of string
that
represents a famous name.
Document, process a stream of english text to count
the famous names.
NameFrequencyReport, generate the finished report.
Here are the .class files for the instructor components in
a
JAR formatted file: FamousNameComponents.jar
DO NOT UNZIP THIS FILE. Add it as a custom library
to your BlueJ project. It will be considered cheating if you unzip
the jar file, submit the jar file to Web-CAT, or submit the unzipped
classes with your project.
For your convenience, skeletons.zip
contains the skeletons for the files you need to implement.
Testing
You must write your own JUnit tests.
Your
components
will be tested with instructor reference
tests. The reference tests are JUnit tests of your classes
(shown
in green in the class diagram above). There will be a
performance
test: the 34,000 names from the online
database. There's a time limit of 10 seconds.
On this project your solution must conform to the class coding
standard. Your unit tests
must demonstrate "statement" coverage (they cause every
statement in your program to be executed.)
Handing in Your Source
Electronically
- Submit a zip file containing your source code and unit tests
to
Web-CAT. You do not need to submit FamousNameComponents.jar.
- There is a limit of 20
submissions.
GRADING
Your score on Web-CAT contributes to your project score as follows:
Results from Running Your Tests (100)
Estimate of Problem Coverage (100)
Both of these must be 100 or your project score is zero. |
60%
|
Code Coverage from Your Tests (0-100)
|
20%
|
Style/Coding (0-10)
|
10%
|
Design/Readability (inspection by grader)
|
10%
|
FAQ
Q: Web-CAT keeps deducting style points from me because my equals()
methods
in NameCount and FamousName are too similar, but generally speaking
most
equals() methods are rather similar looking so I see no way around
it.
A: Yes, you've interpreted the error correctly.
I agree, in this case Checkstyle is being a bit too sensitive.
It's annoying, but we have to be creative and rewrite equals()
in a different manner so as to fool Checkstyle. If you
need ideas, try this reference.
Q: In the NameCount class, are you sure that we are not to have a
getName() method? Otherwise, we would have to do string parsing when
we're
trying to implement the compareTo() method.
A: There is no getName() method and you don't have to do any string
parsing. I suspect you are forgetting that compareTo() has access
to all of an object's fields. Perhaps this
reference
will help.
Q: Does the number returned by the countNames() method of Document
include duplicates or not?
A: Yes, a legal name is counted every time it appears.
Q: Is Collections.sort() a
stable sort
?
A: Yes.
Binary Search Tree extra credit (30%)
This extra credit has two parts. The first part is described
here. When you submit your results to the instructor he will
evaluate your work and if it is adequate he will present you with
part
2. Each part is worth 50% of the extra credit.
In theory the Binary Search Tree used in this problem should be
faster
than if we stored the names in an ArrayList, because searching an
ArrayList is an O(N) operation, whereas searching a BST should be
O(logN). You are to perform an experiment to verify this
theoretical result.
There is a hidden command line flag in FamousNameApp that will
display
the elapsed time.
java -cp +libs/FamousNameComponents.jar:. FamousNameApp
famousnames.txt -t
Elapsed time (secs): 2.179
Report length is 5721
Create a file of at least 30,000 names.
You can use this file of over 170,000 names: 170Knames.zip
(0.5MB
zipped, 2.3MB unzipped)
Run the app and record the elapsed time.
Modify the program so it uses an ArrayList instead of a
BinarySearchTree.
Run the modified app with the same data file and record the elapsed
time.
Our theoretical prediction is that it should be significantly
slower.
Create a short report that includes the experimental observations
you
made and an explanation of the results.
You may email your report as a PDF to the instructor or bring it to
office hours.
Part 1 is due 3 days after the Project deadline.
If your part 1 is accepted, part 2 is due 3 days later.