The computer revolution has produced a society that feeds on information. Yet much of the information is in its raw form: data. There is no shortage of this raw material. It is created in vast quantities by financial transactions, legal proceedings, and government activities; reproduced in an overwhelming flood of reports, magazines, and newspapers; and dumped wholesale into filing cabinets, libraries, and computers. The challenge is to manage the stuff efficiently and effectively, so that pertinent items can be located and information extracted without undue expense or inconvenience.
The traditional method of storing documents on paper is expensive in terms of both storage space and, more importantly, the time it takes to locate and retrieve information when it is required. It is becoming ever more attractive to store and access documents electronically. The text in a stack of books hundreds of feet high can be held on just one computer disk, which makes electronic media astonishingly efficient in terms of physical space. In addition, the information can be accessed using keywords drawn from the text itself. Compared with manual document-indexing schemes, this approach provides both flexibility (all words are keywords) and reliability (because indexing is accomplished without any human interpretation or intervention). Moreover, organizations nowadays have to cope with diverse sources of electronic information such as machine-readable text, fax and other scanned documents, and digitized graphics. All these can be stored and accessed efficiently using electronic media rather than paper.
This book discusses how to manage large numbers of documents -- gigabytes of data. A gigabyte is approximately one thousand million bytes, enough to store the text of a thousand books, about the size of an office wall packed floor to ceiling. The term has gained currency only recently, as the capacity of mass storage devices has grown. Just two decades ago, requirements measured in megabytes (one million bytes) seemed extravagant, even fanciful. Now personal computers come with gigabytes of storage, and it is commonplace for even small organizations to store many gigabytes of data. Since the first edition of this book, the explosion of the World Wide Web has made terabytes (one trillion bytes) of data available to the public, making even more people aware of the problems involved in handling this quantity of data.
There are two challenges in managing such huge volumes of data, both of which are addressed in this book. The first is storing the data efficiently, which is done by compressing it. The second is providing fast access through keyword searches, for which a tailor-made electronic index must be constructed. Traditional methods of compression and searching need to be adapted to meet both requirements. The end result of applying the techniques described here is a computer system that can store millions of documents and retrieve the documents that contain any given combination of keywords in a matter of seconds, or even in a fraction of a second.
Here is an example to illustrate the power of the methods described in this book. With them, you can create a database from a few gigabytes of text and use it to answer a query like "retrieve all documents that include paragraphs containing the two words 'managing' and 'gigabytes'" in just a few seconds on an office workstation. In truth, given an appropriate index to the text, this is not such a remarkable feat. What is impressive, though, is that the database that needs to be created, which includes the index and the complete text (both compressed, of course), is less than half the size of the original text alone. In addition, the time it takes to build this database on a workstation of moderate size is just a few hours. And perhaps most amazing of all, the time required to answer the query is less than if the database had not been compressed.
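To make the idea of such a query concrete, here is a minimal sketch of keyword retrieval through an inverted index, the kind of structure alluded to above. It is purely illustrative: a real system of the sort described in this book compresses both the index and the text, and indexes at a much finer granularity (the document collection, word splitting, and function names here are invented for the example).

```python
def build_index(docs):
    """Map each word to the set of ids of documents containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def query(index, *words):
    """Return ids of documents containing every one of the given words,
    by intersecting the per-word document sets."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

# A toy three-document collection.
docs = [
    "managing gigabytes of text requires compression and indexing",
    "full text retrieval locates documents by keyword",
    "compressing and indexing documents for managing gigabytes",
]
index = build_index(docs)
print(sorted(query(index, "managing", "gigabytes")))  # -> [0, 2]
```

Answering a conjunctive query then costs only a few set intersections, which is why, given the index, retrieval in seconds is not the remarkable part; the remarkable part is doing it from a database smaller than the original text.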
Many of the techniques described in this book have been invented and tested recently and are only now being put into practice. Ways to index the text for rapid search and retrieval are thoroughly examined; this material forms the core of the book. Topics covered include text compression and modeling, methods for the compression of images, and page layout recognition to separate pictures and diagrams from text.
Full-text indexes are inevitably very large and therefore potentially expensive. However, this book shows how a complete index to every word, and, if desired, every number, in the text can be provided with minimal storage overhead and extremely rapid access.
The objective of this book is to introduce a new generation of techniques for managing large collections of documents and images. After reading it, you will understand what these techniques are and appreciate their strengths and applicability.