kottke.org - home of fine hypertext products

Oh, what a year

One year ago today, I asked the readers of kottke.org to become micropatrons and support my efforts in producing the site for a year.

So the year is up and I've decided to take the site in a new direction.

Kottke.org will now start advertising to generate revenue. I have become increasingly addicted to crank in this past year[1], and the micropatron model was incapable of generating enough revenue to support my habit and still allow the time to actually consume all the crank I require to stay effed-up enough to maintain.

Starting today, kottke.org will become an ad sponsored site.

I feel the advertising model will provide the revenue and the connections[2] to acquire and consume all the crystal I need.

I'd like to take this moment to thank the people at the Golden Palace internet casino, providing quality service for internet gamblers everywhere[3].

[1] Actually, I have been addicted for the past three years, but my habit has really stepped up in the past nine months, and I have been unable to supplement enough of the revenue I need by turning tricks.

[2] All ad and marketing people are meth addicts.

[3] This has been a paid advertisement by Golden Palace.

Catching cheaters with Benford's Law

Benford's Law describes a curious phenomenon about the counterintuitive distribution of numbers in sets of non-random data:

A phenomenological law also called the first digit law, first digit phenomenon, or leading digit phenomenon. Benford's law states that in listings, tables of statistics, etc., the digit 1 tends to occur with probability ~30%, much greater than the expected 11.1% (i.e., one digit out of 9). Benford's law can be observed, for instance, by examining tables of logarithms and noting that the first pages are much more worn and smudged than later pages (Newcomb 1881). While Benford's law unquestionably applies to many situations in the real world, a satisfactory explanation has been given only recently through the work of Hill (1996).

I first heard of Benford's Law in connection with the IRS using it to detect tax fraud. If you're cheating on your taxes, you might fill in amounts of money somewhat at random, the distribution of which would not match that of actual financial data. So if the digit "1" shows up on Al Capone's tax return about 15% of the time (as opposed to the expected 30%), the IRS can reasonably assume they should take a closer look at Mr. Capone's return.
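For reference, the "expected 30%" comes from the usual statement of Benford's Law: the leading digit d shows up with probability log10(1 + 1/d). Here's a minimal Python sketch of that formula. The function name is made up for illustration, and this says nothing about how the IRS actually runs its checks.

```python
import math

def benford_expected(digit):
    """Expected share of leading digit `digit` (1-9) under Benford's Law."""
    return math.log10(1 + 1 / digit)

# Print the expected share for each possible leading digit.
for d in range(1, 10):
    print(f"{d}: {benford_expected(d):.1%}")
```

The digit 1 comes out at about 30.1% and the digit 9 at about 4.6%, which is why a return where 1 leads only 15% of the time would stand out.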

Since I installed Movable Type 3.15 back in March 2005, I have been using its "post to the future" option pretty regularly to post my remaindered links...and have been using it almost exclusively for the last few months[1]. That means I'm saving the entries in draft, manually changing the dates and times, and then setting the entries to post at some point in the future. For example, an entry with a timestamp like "2006-02-20 22:19:09" when I wrote the draft might get changed to something like "2006-02-21 08:41:09" for future posting at around 8:41 am the next morning. The point is, I'm choosing basically random numbers for the timestamps of my remaindered links, particularly for the hours and minutes digits. I'm "cheating"...committing post timestamp fraud.

That got me thinking...can I use the distribution of numbers in these post timestamps to detect my cheating? Hoping that I could (or this would be a lot of work wasted), I whipped up an MT template that produced two long strings of numbers: 1) one of all the hours and minutes digits from the post timestamps from May 2005 to the present (i.e. the cheating period), and 2) one of all the hours and minutes digits from Dec 2002 - Jan 2005 (i.e. the control group). Then I used a PHP script to count the digits in each string, dumped the results into Excel, and graphed the two distributions together. Here's what they look like, followed by a table of the values used to produce the chart:

[Chart: Catching cheaters]

Digit   5/05-now   12/02-1/05   Difference
1       31.76%     33.46%       1.70%
2       11.76%     14.65%       2.89%
3       10.30%     9.96%        0.34%
4       10.44%     9.58%        0.86%
5       10.02%     10.52%       0.51%
6       4.83%      5.40%        0.57%
7       5.66%      4.96%        0.70%
8       7.62%      4.65%        2.97%
9       7.60%      6.81%        0.79%

As expected, 1 & 2 show up less than they should during the cheating period, but not overly so[2]. The real fingerprint of the crime lies with the 8s. The number 8 shows up during the cheating period ~64% more often than in the control period (7.62% vs. 4.65%). After thinking about it for a while, I came up with an explanation for the abundance of 8s. I often schedule posts between 8am and 9am so that there's stuff on the site for the early-morning browse and I usually finish off the day with something between 6pm and 7pm (18:00 - 19:00). Not exactly the glaring evidence I was expecting, but you can still tell.
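The counting step is simple enough to sketch. The original was a PHP script that isn't shown, so what follows is a Python stand-in, not the actual code. It assumes zeros are skipped, which is what the table implies since its 1-9 columns add up to roughly 100%. Both function names are hypothetical.

```python
from collections import Counter

def digit_distribution(digits):
    """Return {digit: share} for the digits 1-9 in a string, ignoring zeros."""
    counts = Counter(ch for ch in digits if ch in "123456789")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no nonzero digits to count")
    return {d: counts[d] / total for d in "123456789"}

def relative_change(cheating, control):
    """How much more (or less) often a digit appears versus the control."""
    return cheating / control - 1

# Shares for the digit 8, taken from the table above.
print(f"{relative_change(7.62, 4.65):.0%}")  # prints 64%
```

Feeding `digit_distribution` a string like "08411819" (the hours and minutes of two timestamps run together) gives the share of each digit, and `relative_change` turns two such shares into the "~64% more" figure.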

The obvious next question is, can this technique be utilized for anything useful? How about detecting comment, trackback, or ping spam? I imagine IPs and timestamps from these types of spam are forged to at least some extent. The difficulties are getting enough data to be statistically significant (one forged timestamp isn't enough to tell anything) and having "clean" data to compare it against. In my case, I knew when and where to look for the cheating...it's unclear if someone who didn't know about the timestamp tampering would have been able to detect it. I bet companies with services that deal with huge amounts of spam (Gmail, Yahoo Mail, Hotmail, TypePad, Technorati) could use this technique to filter out the unwanted emails, comments, trackbacks, or pings...although there are probably better methods for doing so.
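One common way to put a number on "enough data to be statistically significant" is Pearson's chi-square goodness-of-fit statistic. To be clear, this is a technique swapped in for illustration and not something the experiment above used. The function name and the sample counts below are made up.

```python
def chi_square_statistic(observed_counts, expected_shares):
    """Pearson's chi-square statistic for observed counts vs. expected shares.

    observed_counts: {category: count}
    expected_shares: {category: share}, with the shares summing to 1.
    """
    total = sum(observed_counts.values())
    stat = 0.0
    for category, share in expected_shares.items():
        expected = total * share
        stat += (observed_counts.get(category, 0) - expected) ** 2 / expected
    return stat

# Made-up illustration: the same 60/40 skew against an expected 50/50 split,
# first with 10 samples and then with 1,000.
even_split = {"a": 0.5, "b": 0.5}
print(f"{chi_square_statistic({'a': 6, 'b': 4}, even_split):.1f}")      # prints 0.4
print(f"{chi_square_statistic({'a': 600, 'b': 400}, even_split):.1f}")  # prints 40.0
```

The same skew barely registers with 10 samples and is glaring with 1,000, which is the small-sample problem in a nutshell. For the nine-digit case there are 8 degrees of freedom, and the conventional 5% cutoff for the statistic is about 15.5.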

[1] I've been doing this to achieve a more regular publishing schedule for kottke.org. I typically do a lot of work in the evening and at night and instead of posting all the links in a bunch from 10pm to 1am, I space them out over the course of the next day. Not a big deal because increasingly few of the links I feature are time-sensitive and it's better for readers who check back several times a day for updates...they've always got a little something new to read.

[2] You'll also notice that the distributions don't quite follow Benford's Law either. Because of the constraints on which digits can appear in timestamps (e.g. you can never have a timestamp of 71:95), some digits appear proportionally more or less than they would in statistical data. Here's the distribution of digits of every possible time from 00:00 to 23:59:

1 - 25.33
2 - 17.49
3 - 12.27
4 - 10.97
5 - 10.97
6 - 5.74
7 - 5.74
8 - 5.74
9 - 5.74
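The list above can be reproduced by brute force: run through all 1,440 times of day, collect the nonzero digits of each HH:MM, and tally them. A short Python sketch (the function name is hypothetical, and skipping zeros is an assumption that happens to match the figures listed):

```python
from collections import Counter

def timestamp_digit_baseline():
    """Percentage share of each nonzero digit across every HH:MM, 00:00-23:59."""
    counts = Counter()
    for hour in range(24):
        for minute in range(60):
            # Zero-pad so 8:05 contributes the digits of "0805".
            counts.update(ch for ch in f"{hour:02d}{minute:02d}" if ch != "0")
    total = sum(counts.values())
    return {d: 100 * counts[d] / total for d in "123456789"}

for digit, pct in timestamp_digit_baseline().items():
    print(f"{digit} - {pct:.2f}")
```

The 6s through 9s tie because they can only ever appear as the last digit of the hour or the last digit of the minute.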

rating: 4.5 stars

Pride and Prejudice

As far as I'm concerned, Will Ferrell et al., Jon Stewart, or the Farrelly brothers have nothing on Jane Austen when it comes to humor (or would that be humour?). And this latest film adaptation of Pride & Prejudice sticks closely to the book and strikes just the right tone. The book, of course, has lots more goodies in it, but as an abridged version the film couldn't be better. Also, since it was published back in 1813, P&P is in the public domain and can be read online for free.

You're Safired!

Wes Felter calls for the ass fact-checking of William Safire over the latter's article in the NY Times about blog jargon and he's not wrong. Wes correctly notes the etymology of "weblog" and "blog" and hopefully the people responsible for things like the AP Style Guide, English dictionaries, and influential columns like On Language will, at some point, do the 20 minutes of research necessary to convince them and the unwashed journalist masses that "blog" is not and was never short for "web log".

Safire also gets tripped up on where the word "blogosphere" came from. While William Quick's usage in 2002 popularized the term, Brad Graham first used the term in 1999.

The secret to Web 2.0: what do Flickr, Ning, Kiko, Vimeo, Shadows, YouTube, Furl, NewsGator, Shutterfly, Mefeedia, Feedster, Planzo, Zazzle, Tailrank, Yakalike, Qoop, Lulu, Blish, Flagr, FireAnt, Odeo, Measure Map, EVDB, Gather, Oyogi, Last.fm...

...Jotspot, Frappr, Yedda, Writeboard, Kanoodle, Memeorandum, SuprGlu, 43 Things, Findory, Clipmarks, Wayfaring, AllPeers, Zoozio, Ziggs, Wink, Reddit, Digg, Gumshoo, Ta-da List, Wikipedia, Pubsub, Ookles, YubNub, Bloop, FeedBurner, Bloglines, Gabbr, Gcast, Blinkx, Openomy, Riffs, Myspace, Pandora, LookLater, 30 Boxes, Rollyo, Squishr, Plazes, Noodly, Wondir, Protopage, Blummy, Jots, Vizu, Del.icio.us, Tagyu, Writely, Simpy, Gtalkr, Truveo, EgoSurf, Mozy, Quimble, Basecamp, Squidoo, NewsVine, Clipfire, Lookster, Netvibes, Facebook, Goowy, Yelp, Magnolia, Technorati, Gmail, Feedmarker, Mercora, StumbleUpon, and SpinSpy all have in common?

They're all web sites. The truth was staring us right in the face all this time.

ps. Damn Movable Type and its restriction on the number of characters I can put in the title of a post. varchar(255) my ass.

Not fit to print

Earlier today I posted a link to Frank Bruni's new food blog over at the NY Times. At the same time, I added a comment to this post about how restaurant reservations work here in NYC. I went back to see if there was any further conversation and my comment had been deleted (or had otherwise disappeared). Not such a good start. I've resubmitted the comment...we'll see how long it lasts.

Local competition

Church of the Customer takes a look at how a Northern California restaurant called Cyrus competes with The French Laundry in attracting local customers, particularly those from wineries with big expense accounts for entertaining clients:

1. Match your competitor's exceptional quality.
The food at both restaurants was cooked perfectly and beautifully presented. Both delivered flawless service. By matching the quality of its better-known competitor, Cyrus removes the primary barriers of opposition.

2. Allow your customers to customize.
The French Laundry offers three prix-fixe menus of nine courses each. Cyrus allows its customers to choose their number of courses and the dishes.

Local competition still matters. You usually think of restaurants like The French Laundry as competing on a national or international level. Over the years, Thomas Keller's flagship has made several short lists of the best restaurants in the world. But as this article demonstrates, having to compete for the same pool of local customers can drive competitors to achieve a high level of excellence, higher perhaps than they would have achieved without that competition, and that excellence could lead to wider recognition. Even companies like Google, Yahoo, Microsoft, and Amazon that compete on a global level and don't interact with their customers face-to-face still have to vie with each other for local resources, particularly employees.

Keynoting(!) at SXSW 2006

Through an improbable series of clerical errors, I am scheduled to participate in a "keynote conversation" about professional blogging with Heather Armstrong at SXSW in Austin, Texas next month. Armstrong, so the story goes, got fired for blogging at work and was rewarded with a loving husband, cutie-pie daughter, photogenic dog, several television appearances, hundreds of media mentions, and a new job -- talking about poop all day -- that supports her entire family. And so but by the way, she's also headlining the entire SXSW Festival along with Rock and Roll Hall of Famer Neil Young. Which makes me approximately chopped liver. When I told Meg about the headlining thing, she said, "boy, that conversation had better be good". Pressure's on, Heather.

To sum up, a piece of chopped liver will be having a chat with a nice lady from Utah next month about blogging for groceries. Should be fun.

Skiing videos

I did some skiing last week up in Vermont and took some videos with my phone on the slopes. The quality isn't great, but hopefully you'll get the gist.

A short clip of me skiing through the trees:

Riding the chair lift:

And one of me skiing behind Meg:

The motion in the last one reminds me of Quake...like I'm chasing after her with a railgun or something.

You're visiting kottke.org. All content by Jason Kottke (contact me) unless otherwise noted, with some restrictions on its use. Good luck will come to those who dig around in the archives. If you've reached this point by accident, I suggest panic.