De-duping emails for devs

The question isn't how long it takes to review 650,000 emails; it's how long it takes to review the new emails.

by Dave Winer Tuesday, November 8, 2016

You hear Repubs saying it's impossible to scan 650,000 emails in nine days looking for confidential information. Now there must be some Republican developers out there who can explain to their marketing people that they're wrong about this.

The algorithm

I'm going to describe a simple algorithm that does it.

Assume you have two folders, call them A and B.

You want to eliminate the duplicates in folder B.

You need to set up a database, and since the data set is relatively small, you can probably do it in-memory.

First create an object called hashcodes.

var hashcodes = new Object ();

Loop over folder A. For each file, read it,

var fileContent = file.readWholeFile (f);

and compute a hash code from its contents.

var theCode = CryptoJS.MD5 (fileContent).toString ();

Add it to the object.

hashcodes [theCode] = true;

Note that the value doesn't matter. It could be false, or 0, whatever. We never read it; we only check whether the key is there.

Now loop over folder B. Compute the hash code for each file, call it x, and see if a file with the same contents exists in folder A, as follows:

var fileExists = hashcodes [x] !== undefined;

You could delete it if it exists, because it's a duplicate.

If they're all dupes, folder B will be empty after the script runs.
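
Here's one way the pieces above might fit together as a single script. It's a sketch, not the only way to do it. I'm assuming Node.js and its built-in fs and crypto modules instead of file.readWholeFile and CryptoJS, a Set in place of the hashcodes object, and flat folders with no subfolders. The names hashFile, listFiles and removeDuplicates are made up for this example.

```javascript
// Assumes Node.js; uses only built-in modules.
const crypto = require("crypto");
const fs = require("fs");
const path = require("path");

// Return the MD5 hex digest of a file's contents.
function hashFile(filePath) {
  return crypto.createHash("md5").update(fs.readFileSync(filePath)).digest("hex");
}

// List the regular files directly inside a folder (no recursion).
function listFiles(folder) {
  return fs.readdirSync(folder, { withFileTypes: true })
    .filter((entry) => entry.isFile())
    .map((entry) => path.join(folder, entry.name));
}

// Delete every file in folderB whose contents also appear in folderA.
// Returns the paths that were deleted.
function removeDuplicates(folderA, folderB) {
  const hashcodes = new Set();
  for (const f of listFiles(folderA)) {
    hashcodes.add(hashFile(f));
  }
  const deleted = [];
  for (const f of listFiles(folderB)) {
    if (hashcodes.has(hashFile(f))) {
      fs.unlinkSync(f); // it's a duplicate
      deleted.push(f);
    }
  }
  return deleted;
}
```

You'd call it as removeDuplicates ("A", "B"). Keep in mind that a hash of the whole file only matches when two files are byte-for-byte identical.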

I'd be surprised if it takes more than a minute on an average laptop.
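
If you want to check that on your own machine, here's a rough sketch that times just the hashing part. It assumes Node.js, and it hashes made-up in-memory messages, so it leaves out the time to read files from disk. The function name timeHashing and the message size you pass in are assumptions for illustration, not facts about the actual emails.

```javascript
// Assumes Node.js; uses only the built-in crypto module.
const crypto = require("crypto");

// Hash `count` distinct in-memory messages of `sizeBytes` each (at least 4 bytes)
// and report how many unique hashes we saw and how long it took.
function timeHashing(count, sizeBytes) {
  const message = Buffer.alloc(sizeBytes);
  const seen = new Set();
  const start = process.hrtime.bigint();
  for (let i = 0; i < count; i++) {
    message.writeUInt32LE(i, 0); // change the first 4 bytes so each message is distinct
    seen.add(crypto.createHash("md5").update(message).digest("hex"));
  }
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { unique: seen.size, elapsedMs: elapsedMs };
}
```

Something like timeHashing (650000, 10 * 1024) would simulate 650,000 messages of 10K each. The 10K is a guess, so try a few sizes.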

Bottom line

The question isn't how long it takes to review 650,000 emails; it's how long it takes to review the new emails.