Need to review 650,000 emails in eight days? Easy with a computer

News flash: Computers can do things really quickly.

Here's a wake-up call for anyone who hasn't heard that technology can do things a heck of a lot quicker than humans shuffling paper.

Like this chap...

Firstly, it's simply not the case that 650,000 Hillary Clinton emails need to be looked at.

The emails belonged to avid sexter Anthony Weiner. That means the vast majority of them were unlikely to have anything connected to Hillary Clinton or even Weiner's (now estranged) wife Huma Abedin.

How do you find the emails of interest? Simple - just apply some email filters, as Robert Graham explains:

The point is this. Computer geeks have tools that make searching the emails extremely easy. Given those emails, and a list of known email accounts from Hillary and associates, and a list of other search terms, it would take me only a few hours to reduce the workload from 650,000 emails to only a couple hundred, which a single person can read in less than a day.

The question isn't whether the FBI could review all those emails in 8 days, but why the FBI couldn't have reviewed them all in one or two days. Or even why they couldn't have reviewed them before Comey made that horrendous announcement that they were reviewing the emails.
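As a rough illustration of the kind of filtering Graham describes, here is a minimal Python sketch. It is not the FBI's or Graham's actual tooling. The address list, the search terms, and the function names `is_of_interest` and `filter_messages` are all hypothetical, and the sketch only looks at the headers and the plain-text body of each message.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical watch lists: the real accounts and terms are not given in the article.
KNOWN_ADDRESSES = {"person.a@example.com", "person.b@example.org"}
SEARCH_TERMS = ("classified", "secret")


def is_of_interest(raw_message: str) -> bool:
    """Return True if a message involves a known address or mentions a search term."""
    msg = message_from_string(raw_message)

    # Collect every address found in the From, To and Cc headers.
    headers = []
    for name in ("From", "To", "Cc"):
        headers.extend(msg.get_all(name, []))
    addresses = {addr.lower() for _, addr in getaddresses(headers)}
    if addresses & KNOWN_ADDRESSES:
        return True

    # Otherwise do a case-insensitive term match on the subject and body.
    # Multipart bodies are ignored in this sketch; only plain string payloads are searched.
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""
    text = (msg.get("Subject", "") + "\n" + body).lower()
    return any(term in text for term in SEARCH_TERMS)


def filter_messages(raw_messages):
    """Keep only the messages worth a human reviewer's time."""
    return [m for m in raw_messages if is_of_interest(m)]
```

Run over the whole archive, a filter along these lines is what turns hundreds of thousands of messages into a pile small enough for a person to read.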

The numbers can be reduced even further when you remove messages that you have already examined in earlier stages of the investigation.
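Dropping messages that have already been examined can be sketched in a few lines of Python too. This assumes each message can be identified by its Message-ID header, falling back to a hash of the raw text when that header is missing. The function names `fingerprint` and `drop_already_examined` are hypothetical, and real investigators may well match messages differently.

```python
import hashlib
from email import message_from_string


def fingerprint(raw_message: str) -> str:
    """Identify a message by its Message-ID, or by a hash of its content if none is set."""
    msg = message_from_string(raw_message)
    message_id = msg.get("Message-ID")
    if message_id:
        return message_id.strip()
    return hashlib.sha256(raw_message.encode("utf-8")).hexdigest()


def drop_already_examined(raw_messages, examined_fingerprints):
    """Return messages not seen before, also dropping duplicates within the batch."""
    seen = set(examined_fingerprints)
    fresh = []
    for raw in raw_messages:
        fp = fingerprint(raw)
        if fp not in seen:
            seen.add(fp)
            fresh.append(raw)
    return fresh
```

Given the fingerprints collected in the earlier stages of an investigation, only the genuinely new messages come out the other end.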

Frankly, I'm not convinced of the IT savviness of either Donald Trump or Hillary Clinton (whose use of a personal email server was clearly ill-advised).

Let's hope that whoever gets the top job gets some sensible advice on computer security, and that the media doesn't perpetuate the myth that it was impossible for the FBI to have completed its hunt through the Weiner email archive.

Read more in Robert Graham's article or in Wired's exploration of the issue.

9 Responses

  1. Bruce

    November 7, 2016 at 8:20 pm #

    If you've ever actually had to do it, it's not so easy. Search words, addresses, etc. do help but it takes time to develop a decent set of search words. Let's say they had a good start list. Probably did. That list gives you hits that need to be sorted through manually or via an additional search. Refining the list and turning the list into specific topical lists as you really begin to target interesting data also takes time. You have to get into the mindset of the email writers to refine the lists. Hits get broken into categories and those categories take you down paths (some interesting, some not). Even when you feel that you have data of interest there is no guarantee that you haven't missed key data or topics.

    650,000 emails is a lot of emails, computer or not. I've been managing users' emails for a while and my worst pack rats never came close to that individually. I'd love to hear where that volume is common. I'm sure there are industries where it is common. Maybe when a legal requirement is in play.

    • Chris in reply to Bruce.

      November 8, 2016 at 12:14 pm #

      In the world of Electronic Discovery 650k emails is chicken feed. I have personally run jobs where I have collected, indexed and examined tens of millions of emails in a timeframe of days, not weeks, in addition to hundreds of Gigabytes of electronic and scanned docs etc. This is a totally different beast to being a company sysadmin (I have worn that hat, too). I believe that the emails of interest would have been classified/protectively marked and/or contained fairly specific content/topics so it's trivial to get a much smaller dataset for review. Keyword searching is just one of the tools, which is typically used in addition to deduplication/(fuzzy) hashing, predictive coding, indexing/threading of emails and so on. Once you've drilled down from 650k to, for example, 10k, you then get a team or teams of reviewers looking at specific datasets (possibly with relevant translation skills) around the globe so that the data is in constant review (follow the sun). In this way you can get through 650k emails in a matter of hours, all being well.

      • CyberNemesis in reply to Chris.

        November 8, 2016 at 9:53 pm #

        If you as the author of said email are dumb enough to write in keywords that happen to be the same keywords the FBI would search for, then sure this would work. Misdirected intent, subversive messages, code words and obfuscation can easily defeat a keyword system. Thus real due diligence on something this important shouldn't be left to keyword indices and fuzzing in my opinion.

    • Adrian Barrett in reply to Bruce.

      November 8, 2016 at 2:42 pm #

      Chris makes a good point: this is quite small scale. In fact, reviewing billions is perfectly practical now. However, you ask a good question regarding volume of data. I can't speak to emails specifically, but in the discovery work we at Exonar conduct within corporates we find:

      – An average of 10GB of documents per employee (that's a 700m high stack of 10 point plain text on A4),
      – 1% of them contain passwords,
      – 46% is duplicated,
      – 9% contains personal data,
      – There is very little sign that people are securing it (15% of sensitive data broadly available, vs 20% of all data).

      So yes, it is easy to review and understand this number of emails (or documents) and figure what you should care about. Surely it makes sense to do this in advance of the information being stolen?

      • CyberNemesis in reply to Adrian Barrett.

        November 8, 2016 at 9:54 pm #

        Reading email for conspiracy isn't the same as running regex looking for SSNs or PII data, in my opinion.

        • Adrian Barrett in reply to CyberNemesis.

          November 9, 2016 at 12:06 am #

          Very true. You need something much more sophisticated to look for things other than simple number patterns (like SSNs). Machine learning and natural language processing help, but aren't there quite yet.

  2. Tom

    November 8, 2016 at 12:09 pm #

    If the FBI can filter even tens of thousands of emails in 8 days, why does the State Department require a year (and counting) to do the same thing? The reality is the AG won't invoke a grand jury, and is restricting what the FBI may or may not do. I think he just gave up trying to do the right thing because of politics.

  3. Robert Shapiro

    November 8, 2016 at 1:35 pm #

    Excuse me – but it took the FBI over a year to review the 30,000 emails that they received from the State Dept (in digital form). If even one email was included in the 650K trove that was not previously turned over – Clinton would then have lied to the FBI and to Congress, and would be guilty of a felony.

  4. Lisa B.

    November 8, 2016 at 2:39 pm #

    Programs search for specific words and phrases. They don't understand the nuances and intricacies of the human language and are likely to miss emails with information.

    I have yet to see any AI that is truly realistic and fools a human into believing it's real. Why? The AI cannot understand the human mind completely. If it did, we would turn over a lot of functions to AI. Any cop on the beat will tell you his/her instincts are worth more than any program out there. They can tell when a person is lying, when they are hiding something and when they're going to run, break down, etc. These are things a computer/AI is still not able to comprehend or predict.

    I would wager there are quite a few emails in there that have vital and damning information in them. Unfortunately, the programs searching for such information failed to recognize their significance and without humans actually reading the emails, we'll never know for certain.

    This was a political ploy, plain and simple. Hillary Clinton should have been indicted before and was given a special pass due to her status and financial holdings. I suspect quite a few folks were threatened and/or paid off for this to all disappear.

    While these emails may have been from Weiner, it is entirely possible Huma had access to his email and/or that Weiner slipped some information into emails. Either way, the public should be outraged.
