Federal government data mining EPIC FAIL

by: Dirty D

Wed Oct 08, 2008 at 17:24


First time diarist, here. This was cross-posted at Overdetermined.

The Ever Amazing Cory Doctorow managed to unearth this little beauty on how government usage of data mining is best described as a catastrophic failure:

They admit that far more Americans live their lives online, using everything from VoIP phones to Facebook to RFID tags in automobiles, than a decade ago, and the databases created by those activities are tempting targets for federal agencies. And they draw a distinction between subject-based data mining (starting with one individual and looking for connections) compared with pattern-based data mining (looking for anomalous activities that could show illegal activities).
But the authors conclude the type of data mining that government bureaucrats would like to do--perhaps inspired by watching too many episodes of the Fox series 24--can't work. "If it were possible to automatically find the digital tracks of terrorists and automatically monitor only the communications of terrorists, public policy choices in this domain would be much simpler. But it is not possible to do so."

There are several points in this to discuss.

Let's start by conceding that the massive intrusion of the government into our daily lives by means of data collection and profiling is absolutely terrifying. We all agree on this, and it's non-controversial. Of course, it's also non-trivial, but I think we can take it as axiomatic that all of us find it pretty scary. The question, though, is how scary it can be when they can't even do it correctly, and the answer is: even more terrifying. The fact that an aggressive lunatic with a rifle has terrible aim does nothing to reassure the people standing around his target that they won't get hit. The phrase "false positive", while used correctly, somewhat euphemises what's actually happening. What it means is that the government are arresting and charging the wrong people.

Now, obviously, the government frequently make mistakes in their charges, but this is different for two reasons:

  1. These aren't cases of a local police department picking up the wrong homeless guy off the street and charging him with selling marijuana to college kids. These techniques are used by top-level federal government agencies to try and find terrorists and violent conspirators. To borrow some words from De La Soul, "Stakes is high".
  2. Because the grounds for the charges are algorithmic rather than the product of ordinary human judgment, people place much more confidence in them than they would in other charges, regardless of the number of false positives.

The problem here is that even though these charges rest on algorithmic grounds, people forget that it takes a human being to write the algorithms and interpret the results. What is under indictment here is neither subject-based nor pattern-based data mining. What is under indictment is the capacity of the Bush administration to use these methods effectively.

If it were as simple as just conjoining lists of consumer data, political data and assorted other data and then running CHAID after CHAID, any idiot with a lot of money and a lot of processor power could do it. All you'd have to do is buy the lists and watch the machines work.  The simple fact of the matter is that any result is only as good as the algorithm that produced it, and the algorithm is only as good as the programmer who produced it.

Let's take a very simple example. Let's suppose that I have a poll, and as part of that poll, I'm trying to determine whether or not someone is a techie. The way that I do this is by asking the following questions:

I am now going to ask you a series of questions. Please answer yes or no, and if you need me to repeat these instructions, please let me know.

Do you/Did you:

  1. Work in a technical capacity?
  2. Spend your personal time pursuing technology-related hobbies?
  3. Adopt new technologies before most people?
  4. Study a technology-related subject when you were in school or at college?
  5. Do you frequently eat Red Vines, gummi bears and pre-packaged pastry products?
  6. Do you read XKCD?
  7. Are you fond of comic books?
  8. Do you spend more than ten hours a week playing video games?
  9. Is Chris Matthews a tool?
  10. Is the greatest threat to American sovereignty the proliferation of Amero loving liberals?
  11. Are blogs ruining journalism?
  12. Is Jeet Kune Do the baddest martial art of all time?

Once I have the responses to these questions, let's say that I'm operating in SPSS, and I choose to define a techie as anyone who answered yes to at least four out of twelve of these questions. For those of you who are interested, here's the syntax:

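* Count how many of Q1 through Q12 were answered Yes.
COUNT TECHIE = Q1 TO Q12 ("Yes").

VAR LAB TECHIE "TECHIES".

* Collapse the count so that 0 to 3 becomes 1 and 4 or more becomes 2.
RECODE TECHIE (LO THRU 3=1) (4 THRU HI=2).

VAL LAB TECHIE 1 "Not a techie" 2 "Techie".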
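
For anyone who doesn't have SPSS handy, here is roughly the same rule as a minimal Python sketch. It is not a translation SPSS itself would produce; the function name `classify`, the dict layout and the sample respondent are all invented for illustration, and it assumes answers are stored as the strings "Yes" and "No".

```python
# Hypothetical Python rendering of the COUNT / RECODE rule above.
QUESTIONS = [f"Q{i}" for i in range(1, 13)]  # Q1 .. Q12


def classify(responses, threshold=4):
    """Return (yes_count, label) for one respondent.

    `responses` maps question names to "Yes" or "No". The names and the
    dict layout are assumptions made for this sketch.
    """
    # Equivalent of COUNT: how many of the twelve answers are "Yes".
    yes_count = sum(1 for q in QUESTIONS if responses.get(q) == "Yes")
    # Equivalent of RECODE plus the value labels: 4 or more means "Techie".
    label = "Techie" if yes_count >= threshold else "Not a techie"
    return yes_count, label


# An invented respondent who says yes to Q1, Q2, Q3 and Q6 only.
respondent = {q: "No" for q in QUESTIONS}
for q in ("Q1", "Q2", "Q3", "Q6"):
    respondent[q] = "Yes"

print(classify(respondent))
```

Note that the rule never looks at which questions were answered yes, only at how many.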

The problem here is that I'm getting a bad measurement. In the battery of questions above, 1 through 4 accurately measure things that make people techies, 5 through 8 measure whether or not someone participates in tech geek culture (which is not identical to being a techie - not all techies are tech geeks), and 9 through 12 are completely non-germane. If you define a techie as any person who answered "Yes" to any four of those twelve questions, you're going to get a whole bunch of false positives.
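
To make that concrete, here is a small Python sketch of where the "Yes" answers come from for a few respondents. The respondents and their nicknames are invented, and the group names are just my labels for the three blocks of questions described above. The last few lines count how many yes/no answer patterns get flagged as techies despite a "No" on every one of questions 1 through 4; that counts patterns, not people, since real respondents won't pick all patterns equally often.

```python
from itertools import product

# Question groups as described above (1-based question numbers).
GROUPS = {
    "techie": range(1, 5),        # Q1-Q4
    "geek culture": range(5, 9),  # Q5-Q8
    "unrelated": range(9, 13),    # Q9-Q12
}


def flagged(yes_answers, threshold=4):
    """The rule from the post: any `threshold` "Yes" answers out of twelve."""
    return len(yes_answers) >= threshold


def breakdown(yes_answers):
    """Count the "Yes" answers that fall in each question group."""
    return {name: sum(1 for q in qs if q in yes_answers)
            for name, qs in GROUPS.items()}


# Invented respondents: each set holds the questions answered "Yes".
respondents = {
    "pundit watcher": {9, 10, 11, 12},
    "comic fan": {5, 6, 7, 8},
    "quiet engineer": {1, 2, 4},
}

for name, yes in respondents.items():
    print(name, flagged(yes), breakdown(yes))

# With Q1-Q4 all "No", try every yes/no pattern over the other eight
# questions and count the ones the rule still flags.
no_core_flagged = sum(
    1 for bits in product([False, True], repeat=8) if sum(bits) >= 4
)
print(no_core_flagged, "of", 2 ** 8, "patterns flagged with no yes on Q1-Q4")
```

The first two respondents are flagged without a single yes on the questions that actually measure techiness, while the third answers yes to three of those four and is not flagged at all.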

This is an extreme example, granted, but it illustrates the point: any algorithm is only as good as the person who wrote it. Frankly, is it any surprise that the Bush government just can't get it right?

Dirty D

