All The Little Phrases Say, "Here I Am!"

Topical Information

The purpose of this project is to test your ability to use files, strings, classes, and operator overloading effectively in program design.

Program Information

Create a program to build a concordance from a file specified by the user. In essence, a concordance is like an index of all words used in a document. Once a concordance is built, we can of course use it to speed up searching a document for keywords. But we can also use it to study the use of words within different contexts (useful in linguistics).

To be useful, a concordance would not include trivial words like articles (a, an, the) — as these would just get in the way of studying keywords. In addition, the concordance might even contain references to common phrases used within the document.

Your concordance should provide the following capabilities in a nice menu-driven interface:

Allow the user to indicate what words are to be considered trivial (at least entering new words and possibly removing words no longer considered trivial).
Build the concordance index on the user's file. If an index already exists, ask if they want to rebuild it. The concordance should include line numbers and position within the line for each occurrence of each word not in the trivial words list. Use a minimal amount of space by grouping all the occurrence information about a word with a single copy of that word.
Search for words using the concordance. Report to the user the locations where the word is found within the document and offer ~20 surrounding characters of context to give them a better idea of what they've found. (Surrounding context should be allowed to wrap across line boundaries, but don't display it on multiple lines.)
Give a statistical summary of what words occurred most frequently. Report both the frequency of the words and the percentage of the text made up of the words. (Does the percentage account for trivial words?) Default the list to the top 5 most frequent words but allow the user to specify a different number either when this option is chosen or through an 'Options' menu setting.
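
For the search item in the list above, here is one possible way to pull single-line context around a hit. It is only a sketch: the name contextFor, the in-memory vector of lines, and the choice of 10 characters on each side (about 20 in total) are all assumptions, not requirements.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns the hit plus up to `radius` characters on each side, on one line.
// `line` is a 0-based index into `lines`; `column` is 0-based within that line.
// (Hypothetical helper; the document is assumed to be held in memory.)
std::string contextFor(const std::vector<std::string>& lines,
                       std::size_t line, std::size_t column,
                       std::size_t length, std::size_t radius = 10)
{
    // Join the lines with single spaces so context can cross line
    // boundaries without ever printing a line break.
    std::string flat;
    std::size_t hit = 0;   // offset of the word inside `flat`
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (i == line) hit = flat.size() + column;
        flat += lines[i];
        if (i + 1 < lines.size()) flat += ' ';
    }
    if (hit > flat.size()) hit = flat.size();   // guard against a bad position

    std::size_t begin = hit > radius ? hit - radius : 0;
    std::size_t end = std::min(flat.size(), hit + length + radius);
    return flat.substr(begin, end - begin);
}
```

Joining the whole document on every search is wasteful for large files; reading only the neighboring lines would do the same job with less work.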

Tip: Perhaps a map data structure would be of use here? Mapping words to lists of page/line/position references? *shrug*

This assignment is (Level 6).

Options

Add (Level 3) to include 2-word 'phrases' in your concordance index. Of course, make sure they can then search for these 2-word phrases... (Be careful how you handle the trivial words in this context!) (Also watch for 2-word phrases which cross line boundaries...but not paragraph boundaries; those words would be from separate concepts and not to be considered part of a common phrase.) (Can you imagine if both those weird circumstances happened at once?!)
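
One way to handle those boundary cases is sketched below. It rests on assumptions you are free to reject: words are split on spaces and tabs (punctuation stays attached), a blank line marks a paragraph boundary, and a trivial word breaks the chain so that no phrase spans it. The names PhraseStart, PhraseIndex, and buildPairs are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical location of a phrase: where its first word starts (1-based line).
struct PhraseStart {
    std::size_t line;
    std::size_t column;
};

using PhraseIndex = std::map<std::string, std::vector<PhraseStart>>;

PhraseIndex buildPairs(const std::vector<std::string>& lines,
                       const std::set<std::string>& trivial)
{
    PhraseIndex index;
    std::string prev;            // previous non-trivial word, empty if none
    PhraseStart prevAt{0, 0};
    for (std::size_t n = 0; n < lines.size(); ++n) {
        const std::string& text = lines[n];
        // A blank line ends the paragraph, so the chain is broken.
        if (text.find_first_not_of(" \t") == std::string::npos) {
            prev.clear();
            continue;
        }
        std::size_t i = 0;
        while (true) {
            std::size_t start = text.find_first_not_of(" \t", i);
            if (start == std::string::npos) break;
            std::size_t stop = text.find_first_of(" \t", start);
            if (stop == std::string::npos) stop = text.size();
            std::string word = text.substr(start, stop - start);
            i = stop;
            // Assumed policy: a trivial word breaks the chain.
            if (trivial.count(word) != 0) { prev.clear(); continue; }
            if (!prev.empty())
                index[prev + " " + word].push_back(prevAt);
            prev = word;
            prevAt = {n + 1, start};
        }
    }
    return index;
}
```

Note that prev is deliberately not cleared at the end of a line, which is what lets a phrase run from the last word of one line to the first word of the next.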

I'd recommend that you keep the 1-word concordance index in a separate file from the 2-word concordance index. They are separate conceptually, after all; why not separate them physically? Maybe document.name--1.word.ndx and document.name--2.word.ndx or some such clever file names...
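
A small helper could build names in that style. The sketch below also zero-fills the number to a width taken from the maximum phrase length, which only matters once the variable phrase length option is in play. The function names are hypothetical, and the ".word.ndx" pattern is just the one suggested above.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Width needed to print maxLength in decimal, e.g. 9 -> 1, 12 -> 2.
int digitsIn(unsigned maxLength)
{
    int width = 1;
    while (maxLength >= 10) { maxLength /= 10; ++width; }
    return width;
}

// Builds a name such as "essay.txt--02.word.ndx" (phraseLength 2, maxLength 12).
std::string indexFileName(const std::string& document,
                          unsigned phraseLength, unsigned maxLength)
{
    std::ostringstream name;
    // setw applies only to the next item written; right-justified is the default.
    name << document << "--"
         << std::setw(digitsIn(maxLength)) << std::setfill('0') << phraseLength
         << ".word.ndx";
    return name.str();
}
```

With a maximum of 2, the names come out as essay.txt--1.word.ndx and essay.txt--2.word.ndx, with no padding needed.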

In fact, can you build the 2-word concordance from the 1-word index somehow? That should be faster than re-processing the whole document. It might even take account of those weird fringe issues mentioned above...*hmm*

When doing the most frequent list, give separate lists for single words and 2-word phrases.
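
Each of those lists could come from one helper, called once per phrase length. This is a sketch only: WordStat and topWords are made-up names, the counts are assumed to be tallied already, and the caller passes in the total so that the question of whether trivial words count toward the percentage stays yours to answer.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record for one row of the frequency report.
struct WordStat {
    std::string word;
    std::size_t count;
    double percent;   // share of totalWords, in percent
};

// Returns the n most frequent entries, most frequent first.
// totalWords is supplied by the caller, who decides whether
// trivial words are part of the denominator.
std::vector<WordStat> topWords(const std::map<std::string, std::size_t>& counts,
                               std::size_t totalWords, std::size_t n = 5)
{
    std::vector<WordStat> stats;
    for (const auto& entry : counts) {
        double pct = totalWords == 0 ? 0.0
                   : 100.0 * static_cast<double>(entry.second)
                           / static_cast<double>(totalWords);
        stats.push_back({entry.first, entry.second, pct});
    }
    // Ties are broken alphabetically so the report is stable.
    std::sort(stats.begin(), stats.end(),
              [](const WordStat& a, const WordStat& b) {
                  if (a.count != b.count) return a.count > b.count;
                  return a.word < b.word;
              });
    if (stats.size() > n) stats.resize(n);
    return stats;
}
```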

In fact, add another (Level 2) to add a menu 'option' to change the maximum phrase length to use in the concordance. All phrase lengths up to and including the user's specified maximum length should be included in the concordance index for later searching. (As with the 2-word concordance index, keep the index for each phrase length in a separate file. Try to justify the numbering within the file name: 0-filled and right-justified with an appropriate width based on the maximum phrase length. Perhaps you can even utilize the smaller phrases to build the new index somehow.)
Add (Level 2) to make searches case insensitive. In fact, this would make an ideal option for that 'Options' [sub]menu.
Add (Level 2) to manage all these configuration issues in an actual Options sub-menu: amount of context when displaying search results; case sensitivity when performing searches; number of most frequent items for each phrase length category; maximum phrase length to concord in index; and any other items you think should be generally configurable. (Note: The trivial words list probably merits its own sub-menu since they can add words to it; view it; and probably remove words from it.)

When the program is exiting, save the configuration to a file specific to this project. (Even though the trivial words list has its own sub-menu, it should be saved with the other configuration options for this project.) When they next mention using a particular document, look for and load this configuration information so they are ready to pick up where they left off.
Add (Level 4) to add a menu selection to rebuild a document from its concordance. And, of course, implement this feature. (Don't let the loss of articles and other trivial words worry you. This loss is expected and can be fixed by forensic linguists — given the trivial words list, of course.)
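
As a rough sketch of the rebuild, each word can be written back at its recorded line and column, with spaces left wherever the index has nothing. It assumes the index stores a 1-based line and a 0-based column for every occurrence; the Occurrence and Concordance types and the rebuild function are illustrative names only.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical occurrence record: 1-based line, 0-based column.
struct Occurrence {
    std::size_t line;
    std::size_t column;
};

using Concordance = std::map<std::string, std::vector<Occurrence>>;

// Rebuilds the text line by line. Positions once held by trivial words,
// punctuation, or anything else the index skipped are left as spaces.
std::vector<std::string> rebuild(const Concordance& index)
{
    std::vector<std::string> lines;
    for (const auto& entry : index) {
        const std::string& word = entry.first;
        for (const Occurrence& at : entry.second) {
            if (at.line == 0) continue;   // not a valid 1-based line
            if (lines.size() < at.line) lines.resize(at.line);
            std::string& text = lines[at.line - 1];
            // Pad with spaces so the word fits at its column.
            if (text.size() < at.column + word.size())
                text.resize(at.column + word.size(), ' ');
            text.replace(at.column, word.size(), word);
        }
    }
    return lines;
}
```

No sorting is needed because every word is dropped directly into its slot, whatever order the map is visited in.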