Searching the Concordances of a Book

Modern computer science has produced many applications that are used by hundreds of millions of people every day. One of the most well-known is the search engine. In this assignment, you'll build a simple search engine to search the contents of a book.

Note: this is a difficult assignment, perhaps the most difficult of this course. Beware! However, it is also one of the most instructive assignments, in that it will allow you to see how real applications are designed.

This assignment will give you practice with files, Strings, arrays, 2D arrays, sorting, and binary search.

To begin with, the user will supply the name of a text file, which your program will then process to build a search engine.

The first step of processing is to build something called an ``index'' (which is what makes this assignment hard). For the purposes of this assignment, the index will consist of 5 different arrays, one of which will be discarded partway through. Let's consider these one at a time.

The first array will store all of the distinct words (also called "word types") that are contained in a text file. That means, the array will not store the same word twice. To create this array, you'll need to do several steps, and in the process create a second array that eventually will be discarded:

At this point, your program has created the vocabArray, which should contain a single copy of every different word in the input file. That's the first part of the index. You can forget about the wordArray at this point; we only needed it to create the vocabArray.

The second part of the index is called the concordanceArray. It will contain every line from the input file exactly once. These lines are the concordances of the book. Again, this will take a few steps to create. As before, I recommend writing print statements saying "Creating concordanceArray" and "Finished creating concordanceArray" before and after these steps, to help debug.

The last two parts of the index are called the invertedIndex and wordCountArray. The invertedIndex will be a two-dimensional array of integers. This is the most important part of the index, and it will enable you to perform your searches very quickly. The wordCountArray is a one-dimensional array of integers that stores how many times every word in the vocabArray appears in the input file. As before, include print statements to say when this process has started and ended. Here's what you'll need to do:

When you're done with this step, each row of the invertedIndex will contain up to 10 numbers, indicating ten lines from the concordanceArray that contain the same word. Also, each element of wordCountArray will contain the count of a word from the vocabArray.

That's the end of building the index for your search engine. Now you can ask the user to enter a search term, and use the index to help find concordances for the search term. Your program should repeatedly prompt the user to enter a search term. For each search term, the program should display the number of times the word appears in the file, and the first 10 concordances (lines) for that word. The user should be able to quit searching (and end the program) by entering "\\Quit".

For example, if the user supplies the text of The Adventures of Huckleberry Finn and then searches for the token "scrambled", the program should print

scrambled (2):
  184  scrambled out of the window on to the shed.  Then I slipped down to the
 2638  So they started, and I lit out, all in a cold sweat, and scrambled

To use the index to do a search, your program should do the following:

Style

DO NOT PUT ALL OF YOUR CODE IN THE main() METHOD! To get full credit for this assignment, you must break your code up into manageable sections, and write a method for each one of these sections. Use your own judgment to decide what belongs together in a method.

Suggestions

This is optional, but will probably make it much easier to debug your program. I recommend writing a displayArraySection(String [] arr, int numElements) method, and perhaps also a displayArraySection(int [][] arr, int numRows) method. These methods would display the first numElements cells of the array (or first numRows rows of the array) on the screen. Since you are creating some very large arrays in this program, it is difficult to see if those arrays contain the right thing, unless you can display some portion of those arrays on the screen and inspect whether they contain the right thing. These methods would help you do that.