Limitations of the Keyword Search

A number of the criminal and civil cases we have been involved in have relied on keywords to locate relevant nuggets of data in large data sets. Keyword searching involves selecting a set of words that are likely to appear in relevant documents and using them to search a data set.

There are several factors that can limit the effectiveness of keyword searching.

Keyword Selection

Keyword selection is crucial to effectively identifying relevant material in a data set. Keywords can be overly inclusive, capturing predominantly irrelevant data and limiting the usefulness of the results. Examples include first names, generic terms or an address that appears in the signature of every email. Keywords like these can produce thousands upon thousands of search hits, making the review process painful. On the other hand, keywords can be overly restrictive, missing valuable relevant data. Striking a balance between the two is key to an effective search.
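One way to spot overly inclusive or overly restrictive keywords before a full review is to measure how much of a sample each keyword hits. The following is a minimal sketch, not a description of any particular review tool. It assumes the documents are already available as plain text strings, and the function names, sample data and 60% threshold are all hypothetical choices for illustration.

```python
import re

def keyword_hit_rates(documents, keywords):
    """Map each keyword to the fraction of documents that contain it."""
    rates = {}
    for keyword in keywords:
        # Case-insensitive whole-term match; the lookarounds stop a keyword
        # from matching inside a longer word.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", re.IGNORECASE
        )
        hits = sum(1 for text in documents if pattern.search(text))
        rates[keyword] = hits / len(documents) if documents else 0.0
    return rates

def flag_keywords(rates, broad_threshold=0.6):
    """Split out keywords that hit too much of the sample or nothing at all.

    broad_threshold is an arbitrary illustrative cut-off, not a standard.
    """
    too_broad = sorted(k for k, r in rates.items() if r >= broad_threshold)
    no_hits = sorted(k for k, r in rates.items() if r == 0.0)
    return too_broad, no_hits

# Hypothetical sample data for illustration only.
sample_documents = [
    "John sent the invoice to Acme",
    "John called about lunch",
    "Meeting notes from john",
    "Acme invoice 1042 attached",
]
sample_keywords = ["john", "invoice", "escrow"]

rates = keyword_hit_rates(sample_documents, sample_keywords)
too_broad, no_hits = flag_keywords(rates)
print(rates, too_broad, no_hits)
```

A first name that hits most of the sample is a candidate for narrowing, while a keyword with no hits at all may be too restrictive or may point to data that is not searchable in its current state.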

Process

Often, parties will agree to a search process that consists of a one-time review of the data, with no opportunity to return to the original data. A thorough review for relevant material can seldom be completed under these terms. Reviewing the initial keyword hits often identifies higher-quality keywords with greater potential to locate relevant data. If you are in fact stuck with such an agreement, the initial keyword selection becomes all the more important. Ideally, the process should allow for cycling through keyword selection, search and analysis repeatedly until all relevant responsive material is sufficiently identified.

Collections limited to extracting documents that contain keywords will inevitably miss related material that does not contain keywords but is readily identifiable through context. For instance, a set of documents may be found within a folder that is relevant to the matter at hand. Only the documents responsive to the keywords are extracted for review, while other records within the same relevant folder are left behind. In some cases, the name of the folder containing the documents may be the only item that responds to a keyword. Are its enclosed documents producible when none of them responded to keywords? Generally, agreements and orders specify that documents containing the keywords are producible, and there is no provision for material related to the keywords in the fashion described.
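The folder scenario can at least be made visible so that it can be raised with the parties. The sketch below is illustrative only: it lists files that did not respond to keywords but either share a folder with a responsive document or sit under a folder whose name contains a keyword. The function name and sample paths are hypothetical, paths are assumed to use forward slashes, and whether such files are producible still depends on the agreement or order in place.

```python
from pathlib import PurePosixPath

def context_candidates(all_paths, hit_paths, keywords):
    """List non-responsive files that are related to hits through folder context."""
    hits = {PurePosixPath(p) for p in hit_paths}
    hit_folders = {p.parent for p in hits}
    lowered = [k.lower() for k in keywords]

    candidates = []
    for raw in all_paths:
        path = PurePosixPath(raw)
        if path in hits:
            continue  # already extracted as a keyword hit
        # True when any folder in the path has a keyword in its name.
        folder_named = any(
            k in part.lower() for part in path.parent.parts for k in lowered
        )
        if path.parent in hit_folders or folder_named:
            candidates.append(str(path))
    return sorted(candidates)

# Hypothetical file listing for illustration only.
sample_paths = [
    "shares/Project Falcon/budget.xlsx",
    "shares/Project Falcon/notes.txt",
    "shares/HR/policy.docx",
    "shares/Legal/memo.docx",
    "shares/Legal/draft.docx",
]
sample_hits = ["shares/Legal/memo.docx"]

print(context_candidates(sample_paths, sample_hits, ["falcon"]))
```

Here the two files under "Project Falcon" are listed because only the folder name responds to the keyword, and "draft.docx" is listed because it sits beside a responsive document.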

Frequently, the state of the data is overlooked or not thoroughly explored before a process is selected and put in place. The following conditions can often hinder an efficient review of the data for relevant material.

  • The data is in a format unresponsive to keywords, such as non-searchable PDF files, TIFFs, audio clips or scanned images. In one case we were puzzled to find that a data set supposed to be made up primarily of spreadsheets was entirely unresponsive to keywords. A closer look revealed that the spreadsheets had been scanned onto the system as JPG pictures. This was problematic in two ways. First, JPGs, being pictures, are not responsive to keywords. Second, JPG files are rarely selected for review in e-discovery collections.
  • It is not uncommon to encounter data that is rendered unsearchable by compression, encryption, password protection or other methods of obfuscation.

While these issues do not, in most cases, entirely prevent keyword searching, they do slow the process down, because the data must be rendered searchable before the actual search can take place. This becomes a significant issue when on-site data searches are prescribed without considering these possibilities.
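A quick survey of file signatures can reveal some of these conditions before a process is agreed. The following is a rough sketch under stated assumptions: it checks only a handful of well-known leading byte signatures, the labels and function names are hypothetical, and it cannot detect encryption or tell a searchable PDF from a scanned one. It also reads file content rather than trusting the extension, since a file's extension may not reflect what it actually contains.

```python
# Leading byte signatures for a few common formats. The list is
# deliberately short and is not exhaustive.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG image: needs OCR before keyword search"),
    (b"\x89PNG\r\n\x1a\n", "PNG image: needs OCR before keyword search"),
    (b"II*\x00", "TIFF image: needs OCR before keyword search"),
    (b"MM\x00*", "TIFF image: needs OCR before keyword search"),
    (b"%PDF", "PDF: check whether it has a text layer"),
    (b"PK\x03\x04", "ZIP container: archive or Office file, expand first"),
    (b"\x1f\x8b", "gzip data: decompress first"),
]

def classify_header(header):
    """Return a triage label based on the first bytes of a file."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "no known signature: possibly plain text, review manually"

def triage_file(path):
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return classify_header(handle.read(8))
```

Run across a collection, a tally of these labels would have shown early on that the "spreadsheets" in the case described above were in fact pictures.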

The limitations of keyword searching can be overcome with effective planning and execution. However, keyword searching to triage media on-site for collection should be avoided and relied upon only when the option of collecting all media for later review is not available. Keyword searching on-site can result in relevant material being overlooked, as there is usually insufficient time and equipment to overcome the limitations listed above.

A requirement to search data for responsive keywords prior to collection is far more disruptive to the targeted business than simply collecting the data and identifying relevant material afterward.