The last word on keyword searches
In my last two posts on keyword searching, I’ve discussed what makes a good keyword, how to use keywords to locate the evidence you’re after, and the concept of precise versus fuzzy searching. In this post, I want to round all of this off with a few final thoughts and some strategies that can really help you home in on the material you’re looking for.
One helpful tool available to your forensic analysts, and one you should be aware of, is ‘grep’ (rhymes with ‘prep’). I don’t want to get into a discussion of how grep works or its syntax; there are great tutorials available for that. But I do want to make sure you’re aware that powerful search tools are available to the technical analysts you instruct to carry out your searches. In a nutshell, grep allows for very fuzzy searching.
Say, for example, that we know a victim of our subject was called John Douglas. This was told to us verbally, so we can’t be sure of the spelling; to capture ‘John’ and ‘Jon’, and ‘Douglas’ and ‘Douglass’, we would need four different searches to cover all the possible combinations. Using grep, we can define a single search which looks for ‘jo’, optionally followed by an ‘h’, followed by an ‘n’ and a space. We do the same for ‘douglas’, mark an additional optional ‘s’ on the end, and mark the whole thing as case insensitive. We now have a single search expression which covers every eventuality, including upper and lower case.
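As a sketch of what such an expression looks like, here is the John Douglas search written with Python’s `re` module rather than grep itself (the exact syntax varies slightly between tools, but the idea is identical):

```python
import re

# joh?n     -> "jo", an optional "h", then "n" (matches both John and Jon)
# \s+       -> one or more whitespace characters between the names
# douglass? -> "douglas" with an optional trailing "s"
# re.IGNORECASE makes the whole expression case insensitive
pattern = re.compile(r"joh?n\s+douglass?", re.IGNORECASE)

for candidate in ["John Douglas", "jon douglass", "JOHN DOUGLAS", "Joan Douglas"]:
    print(candidate, "->", bool(pattern.search(candidate)))
```

All four spelling variations (in any mix of upper and lower case) hit on the single expression, while a near-miss like ‘Joan Douglas’ does not.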
Grep expressions can really make the difference when deadlines are tight. Had we not used grep when searching for the four iterations of John Douglas above, our target dataset would have needed to be searched for four terms instead of one. The time saving isn’t as simple as a four-fold speed-up, as most tools are optimised for search functions like this; however, if you’re throwing a large number of search terms at a case, reducing the number of terms by 60% will have a significant impact on search times without any reduction in the number of results you obtain. You do need to take a moment to consider all the variations each keyword expression should cover, so that nothing falls through the cracks.
Another great use of grep expressions is the ability to carry out proximity searches. This means we can look for all occurrences of the phrase ‘funds transfer’ within, say, 20 words (or a set number of characters) of ‘bank account’. This allows us to filter out the false positives around legitimate mentions of ‘funds transfer’ and of ‘bank account’; we only want to see the instances where the two appear in close proximity to each other.
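A proximity search can itself be expressed as a single pattern. The sketch below assumes a character window rather than a word window (many tools let you specify either), and the 100-character figure is illustrative:

```python
import re

# Match "funds transfer" and "bank account" within `window` characters of
# each other, in either order. re.DOTALL lets the gap span line breaks.
window = 100
proximity = re.compile(
    rf"funds transfer.{{0,{window}}}bank account"
    rf"|bank account.{{0,{window}}}funds transfer",
    re.IGNORECASE | re.DOTALL,
)

hit = "Please action the funds transfer into my bank account by Friday."
miss = "The funds transfer failed." + " padding text. " * 50 + "Check your bank account."
print(bool(proximity.search(hit)))   # phrases are close together: a hit
print(bool(proximity.search(miss)))  # phrases are too far apart: no hit
```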
The final point I’d like to touch on relates to keyword strategy: a concept called ‘de-NISTing’. It takes its name from data made available by the US National Institute of Standards and Technology (NIST), which produces a library of hash values, or digital fingerprints, for the system and application files used in standard operating systems and software.
These digital fingerprints allow us to filter out all the operating system files from Windows, Mac or Linux, along with the Office programs and other standard software that we simply don’t need to search. By reliably filtering out, or de-NISTing, our data, we can usually reduce the amount of material we need to search by around 40%. In large corporate environments that 40% can account for days of wasted search time; we are never going to find anything of relevance in these binary program files.
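In essence, de-NISTing is a set-membership test: hash each file and drop it if the hash appears in the known-file library. The sketch below assumes the NIST hash values have already been loaded into a Python set called `known_hashes`; real tools handle this at scale, but the principle is the same:

```python
import hashlib
from pathlib import Path

def sha1_of_file(path: Path) -> str:
    """Compute the SHA-1 digital fingerprint of a file, reading in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def denist(paths, known_hashes):
    """Return only the files whose hashes are NOT in the known-file set."""
    return [p for p in paths if sha1_of_file(p) not in known_hashes]
```

Any file whose fingerprint matches a standard operating system or application file is excluded before searching begins; everything else stays in the search population.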
The second part of this filtering activity is to create a population of files that you do want to search: for example, all Office documents (Word, Excel, PowerPoint), Adobe Acrobat PDF files, and so on. Take care to consider whether the data may contain scans of documents, such as faxes. These are essentially pictures and are not editable in a word processor, so they can’t be keyword searched unless they are first run through an Optical Character Recognition (OCR) process to convert the images into searchable text.
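The inclusion side of the filter can be sketched as a simple triage by file extension. The extension lists here are illustrative only; the real lists should be agreed with your analysts:

```python
from pathlib import Path

# Illustrative lists: directly searchable document types, and image types
# (e.g. scanned faxes) that need OCR before they can be keyword searched.
SEARCHABLE = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".pdf"}
NEEDS_OCR = {".tif", ".tiff", ".jpg", ".png"}

def triage(paths):
    """Split files into the search population and the OCR queue."""
    searchable = [p for p in paths if p.suffix.lower() in SEARCHABLE]
    ocr_queue = [p for p in paths if p.suffix.lower() in NEEDS_OCR]
    return searchable, ocr_queue
```

Anything in the OCR queue joins the search population only once its text has been extracted; anything in neither list (program binaries, for instance) is left out entirely.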
Selecting good keywords to provide to your technical team, so they can use their tools and expertise to find the material you’re interested in, is less a science and more an art form. Hopefully, having read this series of posts, you’re more aware of the types of terms that will bear fruit and of the jargon your technical analysts will use and understand.
One final thought: the technical analysts you’re instructing generally have a good understanding of these strategies and want to provide you with useful results, so talk to them and discuss your provisional search term list. Be flexible and prepared to modify your terms based on their advice. The last thing anyone needs is to sit reviewing a mountain of false positives.
By John Douglas – Technical Director, First Response