CSC 357 Programming Assignment 2
nwc -- A Numbered Word Counter
This programming assignment is like the first in that it's an implementation of a modified version of an existing UNIX utility. In this case, you are implementing a program that operates similarly to UNIX wc, for word counting. Your program is named "nwc", for "numbered word counter".
The nwc program reads one or more files from the command line, and outputs a count of the n most frequent words in the files. The output is to stdout, and is sorted by decreasing order of word frequency. In the case of two or more words of the same frequency, the sub-sorting order is alphabetic by word.
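The ordering rule above can be sketched as a qsort comparator. This is only an illustration: the struct word_count type and the function names are hypothetical, not part of the assignment. It follows the stated rule (ties in alphabetic order); the sample runs later in this handout appear to list tied words in the opposite order, so it is worth checking the direction of the strcmp against the expected-output files.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type: one entry per distinct word. */
struct word_count {
    const char *word;  /* lowercase spelling of the word */
    long count;        /* number of occurrences seen */
};

/* qsort comparator: higher counts first; equal counts alphabetic by word. */
int compare_entries(const void *a, const void *b)
{
    const struct word_count *x = a;
    const struct word_count *y = b;

    if (x->count != y->count)
        return (x->count < y->count) ? 1 : -1;
    return strcmp(x->word, y->word);
}

/* Sort an array of entries into report order. */
void sort_entries(struct word_count *entries, size_t n)
{
    qsort(entries, n, sizeof entries[0], compare_entries);
}
```

Comparing the counts explicitly, rather than subtracting them, avoids overflow when the counts are large.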
The default value of n is 10. The program accepts an optional "-n" command-line argument that specifies the number of words in the output list. There is always at least one space between the "-n" flag and the following numeric argument. If no input files are given on the command line, input is from stdin.
The definition of "word" is a string of characters delimited by one or more whitespace characters. "Whitespace" is any non-alphabetic character, i.e., any character c for which isalpha(c) is false.
Output of the program is in the following form:
The top n words (out of t) are:
c1 w1
c2 w2
...
cn wn
where n is the program input count, t is the total count of words in the file input(s), ci are the counts, and wi are the words. Each count is right justified in an output column of ten characters, followed by a single space, followed by the word corresponding to that count.
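The ten-character count column described above maps directly onto a printf field width. The sketch below assumes the words and counts are already sorted and held in two parallel arrays; the function name print_report and its parameters are hypothetical.

```c
#include <stdio.h>

/*
 * Print the header line, then up to n "count word" lines.
 * `available` is how many sorted entries actually exist; it may be
 * smaller than n, in which case fewer lines are printed.
 */
void print_report(FILE *out, const char *const words[], const long counts[],
                  size_t available, long n, long total)
{
    size_t i;

    fprintf(out, "The top %ld words (out of %ld) are:\n", n, total);
    for (i = 0; i < available && (long)i < n; i++)
        fprintf(out, "%10ld %s\n", counts[i], words[i]);  /* width 10, right justified */
}
```

Because %10ld right-justifies by default, no manual padding is needed.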
If the value of the -n argument is not a positive integer, the program outputs the following usage message:
usage: nwc [-n num] [ file1 [ file 2 ...] ]
During the course of file processing, if the program encounters a file that cannot be opened or read, it outputs an error message for that file and proceeds with the remaining files, if any. The "Usage ..." and "non-existent file" error messages go to stderr, not stdout.
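One possible way to validate the -n argument is with strtol, rejecting anything that is not entirely a positive decimal number. The helper name parse_count is hypothetical. Note that strtol itself tolerates leading whitespace and a plus sign; the handout does not say whether such input should be accepted, so treat that behavior as an assumption of this sketch.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Return 1 and store the value in *out if `text` is a positive
 * decimal integer that fits in a long; return 0 otherwise.
 */
int parse_count(const char *text, long *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);

    /* Reject: no digits, trailing junk, overflow, or a non-positive value. */
    if (end == text || *end != '\0' || errno == ERANGE || value <= 0)
        return 0;

    *out = value;
    return 1;
}
```

When parse_count returns 0, the caller would print the usage message to stderr and exit.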
Word counting is case INsensitive. I.e., "In" and "in" are
considered the same word. The reported word counts are given for the lowercase
spelling of each word.
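The word definition and the case rule can be combined in a single reader built on fgetc, isalpha, and tolower. The sketch below is one way to do it; the name next_word and the fixed-size buffer are assumptions, and silently dropping the letters of an over-long word is a simplification that a real solution may want to replace with a growing buffer.

```c
#include <ctype.h>
#include <stdio.h>

/*
 * Read the next word from `in` into `buf` (capacity `size`, at least 2).
 * Returns 1 if a word was stored, 0 at end of input.
 * Letters beyond size - 1 are dropped (an assumption of this sketch).
 */
int next_word(FILE *in, char *buf, size_t size)
{
    size_t len = 0;
    int c;

    /* Skip delimiters: any character for which isalpha() is false. */
    while ((c = fgetc(in)) != EOF && !isalpha(c))
        ;
    if (c == EOF)
        return 0;

    /* Collect consecutive letters, folding each to lowercase. */
    do {
        if (len + 1 < size)
            buf[len++] = (char)tolower(c);
    } while ((c = fgetc(in)) != EOF && isalpha(c));

    buf[len] = '\0';
    return 1;
}
```

Under this definition, input such as "in-file" yields two words, "in" and "file", and digits never appear in a word.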
There is no O-notation requirement for the amount of time your program takes to run. As a specific empirical performance measure, your implementation of nwc must process /usr/dict/words in ten seconds or less, running on falcon/hornet under normal operating conditions.
In terms of filespace and memory resources, the program must not have more than
ten files open at one time, nor misuse memory. "Misuse" means that it must not
free memory it has not allocated, nor use an amount of memory that is more than
a constant amount larger than any input file.
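A simple way to stay well under the open-file limit is to open, process, and close each input before moving to the next, so that at most one input file is open at a time. The sketch below also shows the stdin fallback and the per-file error message on stderr. The names scan_inputs and consume are hypothetical, and consume merely counts characters as a stand-in for the real word-counting pass.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Placeholder for the real word-counting pass: just counts characters. */
static void consume(FILE *in, long *chars)
{
    while (fgetc(in) != EOF)
        (*chars)++;
}

/*
 * Read every named file in turn, or stdin when no names are given.
 * Returns the number of files that could not be opened.
 */
int scan_inputs(int nfiles, char *const names[], long *chars)
{
    int i, failures = 0;

    if (nfiles == 0) {
        consume(stdin, chars);
        return 0;
    }
    for (i = 0; i < nfiles; i++) {
        FILE *in = fopen(names[i], "r");

        if (in == NULL) {
            /* Report on stderr and carry on with the remaining files. */
            fprintf(stderr, "%s: %s\n", names[i], strerror(errno));
            failures++;
            continue;
        }
        consume(in, chars);
        fclose(in);  /* close before opening the next file */
    }
    return failures;
}
```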
% nwc /usr/man/*/*
The top 10 words (out of 28390) are:
    193966 the
    191535 para
    156692 literal
    104872 refentrytitle
    104865 manvolnum
     97672 citerefentry
     90060 listitem
     88800 entry
     85343 term
     83944 varlistentry
% nwc nonexistant
nonexistant: No such file or directory
The top 10 words (out of 0) are:
% nwc -n xxx
usage: nwc [-n num] [ file1 [ file 2 ...] ]
% nwc -n 1 main.c
The top 1 words (out of 101) are:
        13 infile
% nwc xxx main.c
xxx: No such file or directory
The top 10 words (out of 101) are:
        13 infile
        13 fileidx
        12 the
        11 s
        11 argv
        10 words
        10 if
         7 n
         7 int
         7 argc
The complete set of inputs is in:
http://www.csc.calpoly.edu/~gfisher/classes/357/programs/2/testing/inputs
with corresponding correct outputs in
http://www.csc.calpoly.edu/~gfisher/classes/357/programs/2/testing/expected-output
The following C library functions may be particularly useful in your
implementation. You can read about these in K&R, Stevens, and the man pages.
Function   Description
malloc     allocate memory
calloc     allocate memory, clearing it to zeros
realloc    re-allocate memory, after having malloc'd
free       free no-longer needed memory
fopen      open a file
fclose     close a file
fgetc      get a character from a file stream
putc       output a character to a file stream
feof       test to see if a given stream has encountered an end of file
strxxx     string handling functions, documented in man string(3C)
isxxx      character handling functions, documented in man ctype(3C)
A key program design decision for this assignment is the data structure to use
for the frequency count table. A hash table is strongly recommended.
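As a rough illustration of the recommended data structure, here is a minimal chained hash table keyed by word. Everything in it is an assumption made for the sketch: the type and function names, the fixed bucket count, and the multiply-by-33 string hash. The table must start out zero-initialized, for example by declaring it static.

```c
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 4096  /* arbitrary size chosen for this sketch */

struct entry {
    char *word;           /* private copy of the lowercase word */
    long count;           /* occurrences seen so far */
    struct entry *next;   /* next entry in the same bucket */
};

struct table {
    struct entry *buckets[NUM_BUCKETS];
    long distinct;        /* number of distinct words stored */
};

/* Simple multiply-by-33 string hash, reduced to a bucket index. */
static unsigned long hash_word(const char *s)
{
    unsigned long h = 5381;

    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h % NUM_BUCKETS;
}

/* Count one occurrence of `word`. Returns the new count, or -1 if out of memory. */
long table_add(struct table *t, const char *word)
{
    unsigned long i = hash_word(word);
    struct entry *e;

    for (e = t->buckets[i]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return ++e->count;

    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return -1;
    }
    strcpy(e->word, word);
    e->count = 1;
    e->next = t->buckets[i];  /* push onto the front of the chain */
    t->buckets[i] = e;
    t->distinct++;
    return 1;
}

/* Release every entry; frees only memory that table_add allocated. */
void table_free(struct table *t)
{
    size_t i;

    for (i = 0; i < NUM_BUCKETS; i++) {
        struct entry *e = t->buckets[i];

        while (e != NULL) {
            struct entry *next = e->next;

            free(e->word);
            free(e);
            e = next;
        }
        t->buckets[i] = NULL;
    }
    t->distinct = 0;
}
```

Once all input has been read, the entries can be copied into an array and sorted to produce the report.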
You must submit a set of .c and .h files plus a
Makefile that compiles your program into an executable named
nwc. The files must follow the 357 coding conventions.
The testing plan file in the 357 program2 directory has the precise point breakdown.
The design convention document specifies deductions for violations of the
conventions.
NO collaboration is allowed on this assignment. Everyone must do their own
individual work.
Submit your deliverable using the handin program on falcon/hornet. The command is
handin gfisher prog2 ... Makefile
where "..." are your program files.