Wednesday, November 10, 2010

benford's law

In Statistics recently we discussed Benford's Law, which says that data which are not dimensionless (so... monomensionless?) have first (most significant) digits which are not uniformly distributed. In other words, there are nine possible first digits {1, 2, 3, 4, 5, 6, 7, 8, 9}, and we might expect a nice distribution of values to have 1/9 or 11.1% of the numbers starting with 1, but in fact around 30% start with 1. The other digits become steadily less frequent as they get larger. An interesting implication is that a careless attempt at making up data could produce an observable departure from a known pattern that is not very intuitive. Of course the cat's out of the bag, so now fraud is just taking on a more authentic appearance.
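
To put numbers on "decreasingly frequent", here is a small Python sketch of the shares Benford's Law predicts for each first digit, using the base-10 logarithm formula log10(1 + 1/d). The function name `benford_expected` is just an illustrative choice, not anything from the class or the Census data.

```python
import math

def benford_expected(digit):
    """Share of values expected to start with `digit` (1-9) under Benford's Law."""
    return math.log10(1 + 1 / digit)

# Print the predicted share for each possible first digit.
for d in range(1, 10):
    print(d, f"{benford_expected(d):.1%}")
```

The nine shares add up to 1, because the sum of the logs is the log of the product, and the product of (d + 1)/d for d from 1 to 9 is exactly 10.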

Data that are not dimensionless are numbers whose values depend on the units used in measurement. That part confuses me.

Anyway, I wanted to take this data from the US Census Bureau on world populations by country and see if it follows Benford's Law. You should try it.

I did observe a distribution that Benford would have more or less predicted. I wondered if I had gotten lucky. The two biggest populations were in the 1 billions, and I thought that the frequency of 1's would benefit from spilling over into the billions but not reaching the other-than-1 billions, so I wondered what would happen to the distribution of first digits if the population of each country doubled. Or tripled. Or quadrupled. Guess what happened?
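
If you'd rather run the doubling experiment in code than in a spreadsheet, here is a rough Python sketch. It does not use the Census data: the `populations` list is synthetic, drawn at random so the values spread evenly across orders of magnitude from thousands to billions. That spread is my assumption, chosen only so there is something to multiply. The helper `digit_shares` is a made-up name too.

```python
import random
from collections import Counter

def digit_shares(values):
    """Fraction of positive integers in `values` starting with each digit 1-9."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = len(values)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

random.seed(0)  # fixed seed so reruns give the same synthetic data

# Synthetic stand-in for country populations (an assumption, not real data).
populations = [int(10 ** random.uniform(3, 9)) for _ in range(5000)]

# Multiply every value by 1, 2, 3, 4 and tabulate the first digits each time.
for factor in (1, 2, 3, 4):
    shares = digit_shares([p * factor for p in populations])
    print(f"x{factor}:", " ".join(f"{d}={shares[d]:.0%}" for d in range(1, 10)))
```

Swap in the real population figures for `populations` to repeat the experiment on the Census numbers.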

The part of the task that I most enjoyed was finding an Excel formula for the first digit of x as a function of x. Can you find such a formula?
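
Spoiler for anyone who wants to puzzle it out first: one possible approach, sketched in Python rather than Excel, is to divide x by the largest power of ten that does not exceed it and throw away the remainder. This assumes x is a positive integer such as a population count, and `first_digit` is just an illustrative name.

```python
import math

def first_digit(x):
    """Leading digit of a positive integer x, found arithmetically."""
    if x <= 0:
        raise ValueError("x must be positive")
    # Number of digits after the first one, e.g. 3 for 4500.
    exponent = math.floor(math.log10(x))
    # Integer-divide by that power of ten to drop everything but the first digit.
    return x // 10 ** exponent

print(first_digit(4500))
```

For enormous values sitting right next to a power of ten, floating-point rounding in `log10` could misplace the exponent, so treat this as a sketch and not a bulletproof routine.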

5 comments:

  1. Nate,

    Well, if it follows Benford's Law, the proportion will be log10(1 + 1/I), where I is the leading digit... The shape of a distribution shouldn't change if you multiply by a constant, so I would think Benford's Law prevailed under those cases too... Actually, history suggests that Simon Newcomb was an earlier observer of the relationship...
    I have some historical info here for those who are interested... and on my blog from a couple of years ago.
    Pat

  2. My favorite part would be the Excel formula also. Would
    =int(x/10^int(log(x))) have worked?

  3. @Rocky
    That function works. I enjoy the occasional task like that and I'm always pleased when I'm able to make it work.

  4. @Pat
    Yes, Benford's Law did prevail, I suppose that's what makes it a law... but I nonetheless marvel to think that if my distribution starts with mostly 1's, I can double the numbers and get a different distribution that still has mostly 1's... shouldn't the doubles of the 1's be mostly 2's? And where did all the new 1's come from? Leading digits which were previously minorities... I know it works, but hmmm. Genius, those Newcomb/Benford folks.

  5. Here's an Excel spreadsheet to investigate Benford's Law. Copy in your data, and Excel does the heavy lifting. I used the spreadsheet for a stats assignment.
