Using Digit Distribution to Signal Fraud

[This post originally ran on June 20, 2009 on my previous blog]

Today’s Washington Post has a great op-ed from two statisticians regarding the controversial Iran elections. Their analysis is similar to Benford’s Law, which the IRS uses to detect tax filing fraud.

The punchline, in their words:

The probability that a fair election would produce both too few non-adjacent digits and the suspicious deviations in last-digit frequencies described earlier is less than .005. In other words, a bet that the numbers are clean is a one in two-hundred long shot.
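
To make the two checks in that quote concrete, here is a minimal Python sketch of what such tests might look like. It is an illustration, not the authors' code: the function names and the sample vote counts are made up, and it assumes two digits count as adjacent when they differ by one, with 0 and 9 treated as neighbors. Under that assumption, uniformly random digit pairs are distinct and non-adjacent 70 percent of the time.

```python
from collections import Counter


def last_digit_chi_square(counts):
    """Chi-square statistic comparing last-digit frequencies with an even spread."""
    digits = [n % 10 for n in counts]
    observed = Counter(digits)
    expected = len(digits) / 10  # a fair count should end in each digit about 10% of the time
    return sum((observed.get(d, 0) - expected) ** 2 / expected for d in range(10))


def non_adjacent_share(counts):
    """Share of counts whose last two digits are distinct and non-adjacent."""
    def is_non_adjacent(n):
        last, second = n % 10, (n // 10) % 10
        gap = abs(last - second)
        # gap 0 = same digit, gap 1 = neighbors, gap 9 = the pair 0 and 9
        # (treating 0 and 9 as neighbors is an assumption of this sketch)
        return gap not in (0, 1, 9)

    eligible = [n for n in counts if n >= 10]  # need at least two digits
    return sum(is_non_adjacent(n) for n in eligible) / len(eligible)


# Hypothetical vote counts, invented purely for illustration.
sample_counts = [13285, 47017, 2155, 90443, 31872, 6627, 58210, 74936]
print("chi-square:", last_digit_chi_square(sample_counts))
print("non-adjacent share:", non_adjacent_share(sample_counts))
```

A statistic above roughly 16.9, the 5 percent cutoff for 9 degrees of freedom, would suggest the last digits are less evenly spread than chance alone would explain. A handful of numbers like the sample above is far too few to conclude anything.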



I posted the article to Twitter and to my Facebook page, and got back the following concern:

intriguing. The only caveat is that the number of provinces in Iran is small - i.e., 30, for statistical analysis to be significant. Perhaps a district-wise analysis (if available) could nail the fraud. BTW, Iran coverage in India is pretty minimal - it shows up on page 20 in a newspaper like Times of India. International newspapers are the main source of good info (such as this article).



The comment made sense: inferences drawn from small sample sizes can be extremely misleading. In fact, just today, Vibha and I celebrated that our zip code was one of the few in DC to appreciate in 2008, only to find that the sample size was 3 home sales!

I decided to write to the authors, and got a solid, reasonable reply:



We’re using the vote counts for all of the four candidates, and we have data for 29 provinces (Lorestan was omitted from the results posted on Press TV, so we didn’t use it either), so that’s 4 * 29 = 116 numbers. A sample of 116 is large enough for us to be confident in our conclusion.

The probabilities we report are, in any case, computed for the number of observations that we have: The smaller the sample, the higher the probability that digit frequencies in fair elections vary widely, so we take this into account in calculating the likelihood that Iran’s vote counts were manipulated.

There’s additional information on the methods we use in an annotated version of the op-ed, which is posted on my web site. Here’s the direct link to the annotated version:

http://www.columbia.edu/~bhb2102/files/Beber_Scacco_The_Devil_Is_in_the_Digits.pdf

Best,
Bernd
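
The point about sample size can be checked with a quick simulation. The sketch below is an illustration rather than the authors' method: it draws fair, uniformly random last digits and reports how large a share the most common digit takes on average, for 30 numbers, for the 4 * 29 = 116 numbers mentioned in the reply, and for 1,000. The function name, trial count, and seed are arbitrary choices.

```python
import random


def simulate_max_digit_share(n_numbers, trials=500, seed=0):
    """Average share held by the most common last digit when digits are fair."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    total = 0.0
    for _ in range(trials):
        tally = [0] * 10
        for _ in range(n_numbers):
            tally[rng.randrange(10)] += 1  # one fair last digit, 0 through 9
        total += max(tally) / n_numbers
    return total / trials


n_iran = 4 * 29  # four candidates, 29 provinces, as in the reply
for n in (30, n_iran, 1000):
    print(n, round(simulate_max_digit_share(n), 3))
```

The share should drift toward 10 percent as the sample grows, which is why a test has to calibrate its probabilities to the number of observations, as the authors say they do.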



In my mind, this is the cleanest way to identify fraud when observers are kept out and normal tools like exit polling are not available. The human mind can’t escape its own limitations when making up numbers.