CSC 357 Programming Assignment 2
nwc -- A Numbered Word Counter



ISSUED: Wednesday, 11 April 2007
DUE: On or before 11:59:59PM Monday 23 April, via handin on falcon/hornet
POINTS POSSIBLE: 100
WEIGHT: 6% of total class grade
READING: Lecture Notes Weeks 2 and 3, K&R Chapters 5-7, selected parts of Stevens,
various cited man pages

Specification

This programming assignment is like the first in that it's an implementation of a modified version of an existing UNIX utility. In this case, you are implementing a program that operates similarly to UNIX wc, for word counting. Your program is named "nwc", for "numbered word counter".

The nwc program reads one or more files from the command line, and outputs a count of the n most frequent words in the files. The output is to stdout, and is sorted by decreasing order of word frequency. In the case of two or more words of the same frequency, the sub-sorting order is alphabetic by word.
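
One way to get this ordering is a comparator passed to qsort. The sketch below assumes a hypothetical struct entry holding a word and its count (these names are illustrative, not part of the assignment) and breaks ties with strcmp, which gives alphabetic order for the lowercase words nwc reports. Tied words in the sample run in this handout appear in the opposite order, so check the posted expected outputs before settling on the tie-break direction.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical table entry: one word and how many times it was seen. */
struct entry {
    const char *word;
    int count;
};

/* qsort comparator: higher counts first; equal counts fall back to strcmp. */
int compare_entries(const void *a, const void *b)
{
    const struct entry *x = a;
    const struct entry *y = b;

    if (x->count != y->count)
        return (x->count < y->count) ? 1 : -1;
    return strcmp(x->word, y->word);
}

/* Sort an array of n entries by decreasing count, then by word. */
void sort_entries(struct entry *entries, size_t n)
{
    qsort(entries, n, sizeof entries[0], compare_entries);
}
```

After sorting, the first n entries of the array (or all of them, if there are fewer than n) are the ones to print.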

The default value of n is 10. The program accepts an optional "-n" command-line argument that specifies the number of words in the output list. There is always at least one space between the "-n" flag and the following numeric argument. If no input files are given on the command line, input is from stdin.
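
A minimal sketch of the argument handling described above follows. It assumes that "-n", when present, is the first argument (as the usage line suggests), and the function name parse_args and its out-parameters are hypothetical. It uses strtol so that input such as "xxx", "3x", "0", or a negative number is rejected.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look for an optional leading "-n num" in argv.
 * On success, stores the count in *n and the index of the first file
 * name in *first_file, and returns 0. Returns -1 when the caller
 * should print the usage message instead.
 */
int parse_args(int argc, char *argv[], int *n, int *first_file)
{
    *n = 10;            /* default from the specification */
    *first_file = 1;

    if (argc > 1 && strcmp(argv[1], "-n") == 0) {
        char *end;
        long value;

        if (argc < 3)
            return -1;  /* "-n" with nothing after it */
        errno = 0;
        value = strtol(argv[2], &end, 10);
        if (errno != 0 || end == argv[2] || *end != '\0'
                || value <= 0 || value > INT_MAX)
            return -1;  /* not a positive integer */
        *n = (int)value;
        *first_file = 3;
    }
    return 0;
}
```

If *first_file equals argc after a successful call, no files were named and input comes from stdin.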

The definition of "word" is a string of characters delimited by one or more whitespace characters. "Whitespace" is any non-alphabetic character, i.e., any character c for which isalpha(c) is false.
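
Under this definition, a word is simply a maximal run of characters for which isalpha is true. The following sketch reads one such run at a time with fgetc; the function name next_word and the growth strategy are assumptions made for illustration. It also folds each letter to lowercase, since word counting is case insensitive.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read the next word from fp: a maximal run of alphabetic characters,
 * folded to lowercase. Returns a malloc'd string that the caller must
 * free, or NULL at end of input or if memory runs out.
 */
char *next_word(FILE *fp)
{
    size_t len = 0, cap = 16;
    char *buf, *tmp;
    int c;

    /* Skip delimiters, i.e. every non-alphabetic character. */
    while ((c = fgetc(fp)) != EOF && !isalpha(c))
        ;
    if (c == EOF)
        return NULL;

    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    do {
        if (len + 1 >= cap) {       /* keep room for the '\0' */
            cap *= 2;
            tmp = realloc(buf, cap);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
        }
        buf[len++] = (char)tolower(c);
    } while ((c = fgetc(fp)) != EOF && isalpha(c));

    buf[len] = '\0';
    return buf;
}
```

With this rule, input such as "x2y" yields the two words "x" and "y", because the digit acts as a delimiter.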

Output of the program is in the following form:

The top n words (out of t) are:
c1 w1
c2 w2
...
cn wn
where n is the program input count, t is the total count of words in the file input(s), ci are the counts, and wi are the words. Each count is right justified in an output column of ten characters, followed by a single space, followed by the word corresponding to that count.
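
The output format above maps directly onto printf-style conversions. As a sketch (the function names print_header and print_entry are hypothetical), "%10d" right justifies the count in a ten-character column:

```c
#include <stdio.h>

/* Print the header line; n is the requested count, total is t. */
void print_header(FILE *out, int n, long total)
{
    fprintf(out, "The top %d words (out of %ld) are:\n", n, total);
}

/* Print one result line: count in a 10-wide column, a space, the word. */
void print_entry(FILE *out, int count, const char *word)
{
    fprintf(out, "%10d %s\n", count, word);
}
```

Passing stdout as the first argument sends the report to standard output, as the specification requires.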

If the value of the -n argument is not a positive integer, the program outputs the following usage message:

usage: nwc [-n num] [ file1 [ file 2 ...] ]
During the course of file processing, if the program encounters a file that cannot be opened or read, it outputs an error message for that file and proceeds with the remaining files, if any. The "Usage ..." and "non-existent file" error messages go to stderr, not stdout.
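
A sketch of the error reporting follows; the helper names open_input and usage are hypothetical. The standard perror function writes its argument, a colon, and the system's description of the failure to stderr, which appears to match the "name: reason" form of the file error message in the sample runs.

```c
#include <stdio.h>

/*
 * Open one input file for reading. On failure, report the reason on
 * stderr and return NULL so the caller can move on to the next file.
 */
FILE *open_input(const char *name)
{
    FILE *fp = fopen(name, "r");

    if (fp == NULL)
        perror(name);   /* e.g. "xxx: No such file or directory" */
    return fp;
}

/* Print the usage message on stderr. */
void usage(void)
{
    fprintf(stderr, "usage: nwc [-n num] [ file1 [ file 2 ...] ]\n");
}
```

A caller would loop over the file names, skip any for which open_input returns NULL, and fclose each file before opening the next.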

Word counting is case INsensitive. I.e., "In" and "in" are considered the same word. The reported word counts are given for the lowercase spelling of each word.

Program Performance

There is no O-notation requirement for the amount of time your program takes to run. As a specific empirical performance measure, your implementation of nwc must process /usr/dict/words in ten seconds or less, running on falcon/hornet under normal operating conditions.

In terms of filespace and memory resources, the program must not have more than ten files open at one time, nor misuse memory. "Misuse" means that it must not free memory it has not allocated, nor use an amount of memory that is more than a constant amount larger than any input file.

Sample Inputs and Outputs

% nwc /usr/man/*/*
The top 10 words (out of 28390) are:
    193966 the
    191535 para
    156692 literal
    104872 refentrytitle
    104865 manvolnum
     97672 citerefentry
     90060 listitem
     88800 entry
     85343 term
     83944 varlistentry

% nwc nonexistant
nonexistant: No such file or directory
The top 10 words (out of 0) are:

% nwc -n xxx
usage: nwc [-n num] [ file1 [ file 2 ...] ]

% nwc -n 1 main.c
The top 1 words (out of 101) are:
        13 infile

% nwc xxx main.c
xxx: No such file or directory
The top 10 words (out of 101) are:
        13 infile
        13 fileidx
        12 the
        11 s
        11 argv
        10 words
        10 if
         7 n
         7 int
         7 argc

The complete set of inputs is in:


http://www.csc.calpoly.edu/~gfisher/classes/357/programs/2/testing/inputs

with corresponding correct outputs in

http://www.csc.calpoly.edu/~gfisher/classes/357/programs/2/testing/expected-output



Implementation Suggestions

The following C library functions may be particularly useful in your implementation. You can read about these in K&R, Stevens, and the man pages.

Function   Description
malloc     allocate memory
calloc     allocate memory, clearing it to zeros
realloc    re-allocate memory, after having malloc'd
free       free no-longer needed memory
fopen      open a file
fclose     close a file
fgetc      get a character from a file stream
putc       output a character to a file stream
feof       test to see if a given stream has encountered an end of file
strxxx     string handling functions, documented in man string(3C)
isxxx      character handling functions, documented in man ctype(3C)

A key program design decision for this assignment is the data structure to use for the frequency count table. A hash table is strongly recommended.
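
To make the hash table suggestion concrete, here is a minimal sketch of a chained hash table keyed by word. Everything in it is an assumption for illustration: the fixed bucket count, the struct and function names, and the multiplier-31 string hash (in the style of the table-lookup example in K&R). It is not a required design.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096   /* assumed fixed size; a real table might grow */

struct node {
    char *word;
    int count;
    struct node *next;
};

/* Must start out zeroed (e.g. a static or calloc'd object). */
struct table {
    struct node *buckets[NBUCKETS];
    long total;         /* every word occurrence added */
    long distinct;      /* number of different words */
};

/* Simple multiplicative string hash, reduced to a bucket index. */
unsigned hash_word(const char *s)
{
    unsigned h = 0;

    while (*s != '\0')
        h = 31 * h + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Record one occurrence of word. Returns 0, or -1 if out of memory. */
int table_add(struct table *t, const char *word)
{
    unsigned h = hash_word(word);
    struct node *p;

    for (p = t->buckets[h]; p != NULL; p = p->next) {
        if (strcmp(p->word, word) == 0) {
            p->count++;
            t->total++;
            return 0;
        }
    }
    p = malloc(sizeof *p);
    if (p == NULL)
        return -1;
    p->word = malloc(strlen(word) + 1);
    if (p->word == NULL) {
        free(p);
        return -1;
    }
    strcpy(p->word, word);
    p->count = 1;
    p->next = t->buckets[h];    /* push onto the front of the chain */
    t->buckets[h] = p;
    t->total++;
    t->distinct++;
    return 0;
}

/* Return the count for word, or 0 if it has not been seen. */
int table_count(const struct table *t, const char *word)
{
    const struct node *p;

    for (p = t->buckets[hash_word(word)]; p != NULL; p = p->next)
        if (strcmp(p->word, word) == 0)
            return p->count;
    return 0;
}

/* Free every node; only memory allocated by table_add is released. */
void table_free(struct table *t)
{
    size_t i;

    for (i = 0; i < NBUCKETS; i++) {
        struct node *p = t->buckets[i];

        while (p != NULL) {
            struct node *next = p->next;

            free(p->word);
            free(p);
            p = next;
        }
        t->buckets[i] = NULL;
    }
    t->total = t->distinct = 0;
}
```

Lookups and insertions take roughly constant time as long as the chains stay short, and the table stores each distinct word only once, so its memory use grows with the vocabulary and not with the total word count.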

Deliverables

You must submit a set of .c and .h files plus a Makefile that compiles your program into an executable named nwc. The files must follow the 357 coding conventions.

Scoring Details

The testing plan file in the 357 program2 directory has the precise point breakdown.

The design convention document specifies deductions for violations of the conventions.

Collaboration

NO collaboration is allowed on this assignment. Everyone must do their own individual work.

How to Submit the Deliverable

Submit your deliverable using the handin program on falcon/hornet. The command is

handin gfisher prog2 ... Makefile
where "..." are your program files.


