Paul Graham, the co-creator of what is now the Yahoo Store, has published a strikingly effective new method of filtering
spam.
I myself have long been critical of anti-spam and "family" filters, most of which are ineffective at best, and brain-dead
at worst. One filter I evaluated stopped the user's PC when the word "bomb" was encountered, supposedly to stop children from
using the Internet to learn how to make bombs. Unfortunately, this also stopped students from researching the Unabomber or
other legitimate topics. Dumb.
Graham's new method, by contrast, is an intelligent application of probability theory. In his latest iteration,
his filter correctly flags 99.5 percent of spam, with 0.0 percent "false positives."
False positives are legitimate messages, often important personal ones, that anti-spam efforts incorrectly filter out. This is a huge problem for
ordinary methods. As Graham explains, "For most users, missing legitimate email is an order of magnitude worse than receiving
spam, so a filter that yields false positives is like an acne cure that carries a risk of death to the patient."
Like many anti-spam crusaders, Graham himself started with an ordinary filter approach, looking for specific "bad words."
This initially showed some promise. Simply filtering out all e-mails that contain the word "click," he says, correctly eliminates
79.7 percent of spam messages, while wrongly trashing only 1.2 percent of legitimate mail.
Those success rates, however, quickly degrade as more and more words are added to the "bad" list. This makes the crude filtering
approach unusable.
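To see why a growing "bad word" list goes wrong so quickly, here is a rough back-of-the-envelope sketch in Python. It is
my own illustration, not Graham's arithmetic. It assumes, hypothetically, that every word on the list wrongly matches 1.2
percent of legitimate mail (the figure quoted above applies only to "click") and that the words turn up independently of
one another, which real mail does not strictly obey.

```python
def combined_false_positive_rate(per_word_rate, n_words):
    """Chance that a legitimate message contains at least one of n_words "bad" words.

    Assumption for illustration only: every word has the same false-positive
    rate and the words appear independently of each other.
    """
    return 1.0 - (1.0 - per_word_rate) ** n_words


# Hypothetical list sizes; 0.012 is the 1.2 percent figure quoted for "click".
for n in (1, 5, 10, 25):
    rate = combined_false_positive_rate(0.012, n)
    print(f"{n:>2} bad words -> {rate:.1%} of legitimate mail trashed")
```

Under those assumptions, the share of good mail thrown away climbs with every word added, even though each word looks
harmless on its own.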
The solution, Graham found, was to expand the technique in a statistically sophisticated way. Because all spam is trying
to hype something, certain words have a high probability of indicating a spam message. Other words almost never appear in
spam.
Words such as "though" and "apparently," for example, increase the probability that a message is legitimate, because spam
isn't big on subtlety. At the same time, a genuine message isn't rejected simply because it uses a single instance of a term
that might also appear in an adult-oriented spam message.
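The idea of giving each word its own spam probability can be sketched in a few lines of Python. This is a simplified
illustration, not Graham's published code: the six sample messages are invented, and the tokenizer, the neutral value of
0.5 for unseen words, and the clamping to the range 0.01 to 0.99 are all assumptions of mine.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase words (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(messages):
    """Count how many messages each word appears in."""
    counts = Counter()
    for msg in messages:
        counts.update(set(tokenize(msg)))
    return counts


def spam_probability(word, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate how strongly a word suggests spam, from 0.01 to 0.99."""
    spam_rate = spam_counts[word] / n_spam
    ham_rate = ham_counts[word] / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # assumption: a word never seen before is treated as neutral
    p = spam_rate / (spam_rate + ham_rate)
    # Assumption: clamp so that no single word is ever treated as absolute proof.
    return min(0.99, max(0.01, p))


# Invented sample messages, for illustration only.
spam = [
    "Click here for a free offer",
    "Free money, click now",
    "Click to claim your free prize",
]
ham = [
    "Apparently the meeting moved, though I may still be free",
    "Lunch tomorrow, though not before noon",
    "Apparently the report is late",
]

spam_counts = word_counts(spam)
ham_counts = word_counts(ham)
for w in ("click", "though", "free"):
    p = spam_probability(w, spam_counts, ham_counts, len(spam), len(ham))
    print(w, round(p, 2))
```

In this toy sample, "click" scores near the top of the range, "though" near the bottom, and "free" lands in between
because it shows up in both kinds of mail.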
Instead of mere "dumb" filtering, Graham's elegant method analyzes the 15 "most interesting" words in each message. Through
a technique known as Bayesian analysis, the weights of these 15 words are then used to compute the probability that a message
is spam. This analysis is where his 99.5 percent accuracy rate comes from.
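For the curious, the combining step might look something like the Python sketch below. It assumes each word already has
a spam probability between 0 and 1, treats "most interesting" as "farthest from a neutral 0.5," and combines the words
with the standard naive Bayes rule for independent evidence with equal prior odds. Those choices are my assumptions for
illustration; this column does not spell out Graham's exact formula.

```python
def combined_spam_probability(word_probs, n_interesting=15):
    """Combine per-word spam probabilities into one score for the message.

    Assumptions: words are treated as independent evidence, spam and
    legitimate mail are equally likely beforehand, and every probability
    lies strictly between 0 and 1.
    """
    # "Most interesting" here means farthest from the neutral value 0.5.
    interesting = sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)
    interesting = interesting[:n_interesting]

    spam_product = 1.0
    ham_product = 1.0
    for p in interesting:
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)


# Hypothetical per-word probabilities for two messages.
hype_message = [0.99, 0.99, 0.20, 0.50]
chatty_message = [0.01, 0.01, 0.75]
print(combined_spam_probability(hype_message))
print(combined_spam_probability(chatty_message))
```

Notice that one spam-flavored word (the 0.75 in the second message) is easily outvoted by words that almost never
appear in spam, which is why a single suspicious term does not doom a genuine message.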
To get the weights, Graham ran the analysis on 4,000 spam messages and 4,000 legitimate ones. That may not seem like a large
sample, but it has proved sufficient to produce highly accurate results.
Graham proposes that his research be used to create a "seed filter" that would become part of users' e-mail programs. Users
would also be equipped with two Delete commands. One would be the regular Delete key, for genuine messages, while the other
would be a Delete-As-Spam key, to be used when deleting spam messages. After a short time, each user would have an even more
accurate filter, and because every user's filter would differ, spammers would have no single seed file to work around.
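Here is one way the two Delete commands might feed a personal filter, sketched in Python. The class name, method names,
and seed numbers are hypothetical, invented for this example; no real e-mail program's interface is being described.

```python
import re
from collections import Counter


class PersonalFilter:
    """Hypothetical per-user filter that starts from seed counts and keeps learning."""

    def __init__(self, seed_spam_counts=None, seed_ham_counts=None, n_spam=0, n_ham=0):
        self.spam_counts = Counter(seed_spam_counts or {})
        self.ham_counts = Counter(seed_ham_counts or {})
        self.n_spam = n_spam  # spam messages seen so far
        self.n_ham = n_ham    # legitimate messages seen so far

    @staticmethod
    def _words(message):
        return set(re.findall(r"[a-z']+", message.lower()))

    def delete(self, message):
        """Regular Delete: the message was genuine, so its words count as legitimate."""
        self.ham_counts.update(self._words(message))
        self.n_ham += 1

    def delete_as_spam(self, message):
        """Delete-As-Spam: the message was spam, so its words count as spam."""
        self.spam_counts.update(self._words(message))
        self.n_spam += 1

    def word_probability(self, word):
        """Spam probability for one word; 0.5 and the 0.01-0.99 clamp are assumptions."""
        spam_rate = self.spam_counts[word] / self.n_spam if self.n_spam else 0.0
        ham_rate = self.ham_counts[word] / self.n_ham if self.n_ham else 0.0
        if spam_rate + ham_rate == 0:
            return 0.5
        return min(0.99, max(0.01, spam_rate / (spam_rate + ham_rate)))


# Invented seed data: "click" appeared in 8 of 10 spams and 2 of 10 legitimate messages.
f = PersonalFilter({"click": 8}, {"click": 2}, n_spam=10, n_ham=10)
before = f.word_probability("click")
f.delete("Click the link in my earlier note for the agenda")
f.delete("Please click through the slides before Friday")
after = f.word_probability("click")
print(before, after)
```

A user whose colleagues routinely write "click" in genuine mail would see that word's spam weight drift downward, so no
two users' filters would end up quite alike.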
I've long been an advocate of suing spammers out of existence, using state laws that prohibit false identities (employed
by almost all spammers). Graham, too, supports the anti-spam laws, but mainly because they make spam easier to identify (by
making certain terms predictably appear as spammers deny that their messages fall under the laws). Meanwhile, Graham has made
a believer of me.
Probability theory finally makes a filter that works:
http://www.paulgraham.com http://bri.li/?4e68
- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Livingston's Top 10 News Picks o' the Week
1. E-business is enjoying double-digit growth again
http://www.ecommercetimes.com http://bri.li/?430
2. New keyword tool calculates return on investment
http://www.acws.com http://bri.li/?818
3. Increase sales by handling your Web failures well
http://www.newarchitectmag.com http://bri.li/?c00
4. Lessig on tech: Ours is less and less a free society
http://www.oreillynet.com http://bri.li/?fe8
5. Hollywood's DVD "region coding" system is collapsing
http://news.bbc.co.uk http://bri.li/?13d0
6. Hackers find it easy to get into military PCs
http://www.washingtonpost.com http://bri.li/?17b8
7. Ten rules for writing for a dynamic World Wide Web
http://www.alistapart.com http://bri.li/?1ba0
8. Pros share secret tricks of the new Dreamweaver MX
http://www.intranetjournal.com http://bri.li/?1f88
9. HTML tips: Importance of font sizing for usability
http://www.useit.com http://bri.li/?2370
10. See great Flash animations: Solemates (plays music)
http://www.centuryinshoes.com http://bri.li/?2758
- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Wacky Web Week: Apple's "Ellen Feiss" video rocks
The latest hilarious video making the rounds of the Net involves a spacey -- some say stoned -- teenage girl explaining
that Windows ate her homework.
The work is a TV spot in Apple's "Switch" series, but it can easily be enjoyed by Windows users as well (most of whom can
probably relate).
As explained by Ellen Feiss, the young woman in the ad, "It was, like, beep beep beep beep beep beep beep, and then, like,
half of my paper was gone." She adds it was "kind of a bummer."
A small cult has sprung up around the subject, with numerous fan sites worshipping the non-actress, complete with video
clips of her at Macworld and altered versions of the Apple original. Wired News, below, has the best links to the classic
video and its many imitators.
"Windows ate my homework, you know, like, buy an Apple":
http://www.wired.com http://bri.li/?c398
- - - - - - - - - - - - - - - - - - - - - - - - - - - -
E-Business Secrets: Our mission is to bring you such useful and thought-provoking information about the Web that you actually
look forward to reading your e-mail.
About the Author: E-Business Secrets is written by InfoWorld contributing editor Brian Livingston (http://SecretsPro.com).
Research director is Vickie Stevens. Brian has published 10 books, including:
Windows Me Secrets:
http://www.amazon.com http://bri.li/?0764534939
Windows 2000 Secrets:
http://www.amazon.com http://bri.li/?0764534130