Home :: About InfoWorld :: Advertise :: Subscribe :: Contact Us :: Awards :: Events
InfoWorld Home :: News :: Test Center :: Opinions :: TechIndex

E-Business Secrets

Paul Graham provides stunning answer to spam e-mails
Probability theory shows impressive results

By Brian Livingston, August 20, 2002

Paul Graham, the co-creator of what is now the Yahoo Store, has published a strikingly effective new method of filtering spam.


I have long been critical of anti-spam and "family" filters, most of which are ineffective at best and brain-dead at worst. One filter I evaluated halted the user's PC whenever it encountered the word "bomb," supposedly to keep children from using the Internet to learn how to make bombs. Unfortunately, it also kept students from researching the Unabomber and other legitimate topics. Dumb.

Graham's new method, by contrast, is an intelligent application of the science of probability theory. In his latest iteration, his filter correctly flags 99.5 percent of spam, with 0.0 percent "false positives."

False positives are legitimate messages, often important personal ones, that an anti-spam filter incorrectly discards. This is a huge problem for ordinary methods. As Graham explains, "For most users, missing legitimate email is an order of magnitude worse than receiving spams, so a filter that yields false positives is like an acne cure that carries a risk of death to the patient."

Like many anti-spam crusaders, Graham himself started with an ordinary filter approach, looking for specific "bad words." This initially showed some promise. Simply filtering out all e-mails that contain the word "click," he says, correctly eliminates 79.7 percent of spam messages, while wrongly trashing only 1.2 percent of legitimate mail.

Those success rates, however, quickly degrade as more and more words are added to the "bad" list. This makes the crude filtering approach unusable.

The solution, Graham found, was to expand the technique in a statistically sophisticated way. Because all spam is trying to hype something, certain words have a high probability of indicating a spam message. Other words almost never appear in spam.

Words such as "though" and "apparently," for example, increase the probability that a message is legitimate, because spam isn't big on subtlety. At the same time, a genuine message isn't rejected simply because it uses a single instance of a term that might also appear in an adult-oriented spam message.

Instead of mere "dumb" filtering, Graham's elegant method analyzes the 15 "most interesting" words in each message. Through a technique known as Bayesian analysis, the weights of these 15 words are combined to compute the probability that the message is spam. This analysis is where his 99.5 percent catch rate comes from.
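The column does not give the arithmetic, so the following is only a minimal sketch of how such a combination could work, not Graham's actual code. It assumes each word already carries a weight between 0 and 1 (its estimated probability of indicating spam), that "most interesting" means "farthest from the neutral value 0.5," and that the weights are combined with the usual naive Bayes formula. The function name and parameter names are hypothetical.

```python
from math import prod


def spam_probability(word_probs, n_interesting=15):
    """Combine per-word spam weights into one message-level probability.

    word_probs maps each word in the message to a weight strictly between
    0 and 1 (assumed to be clamped already, e.g. to the range 0.01-0.99).
    """
    # Assumption: the "most interesting" words are those whose weight is
    # farthest from the neutral value 0.5, in either direction.
    interesting = sorted(
        word_probs.values(), key=lambda p: abs(p - 0.5), reverse=True
    )[:n_interesting]

    # Naive Bayes combination: evidence for spam versus evidence against.
    spam_side = prod(interesting)
    ham_side = prod(1.0 - p for p in interesting)
    return spam_side / (spam_side + ham_side)
```

Under these assumptions, two words weighted 0.9 each yield a combined probability of about 0.988, while one word at 0.9 and one at 0.1 cancel out to exactly 0.5. That is why a single suspicious term does not condemn an otherwise ordinary message.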

To get the weights, Graham ran the analysis on 4,000 spam messages and 4,000 legitimate ones. That may not seem like a large sample, but it has proved to be enough to produce statistically meaningful weights.
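As a rough illustration of how weights might be derived from two such collections of mail, here is a simplified sketch. It is an assumption-laden stand-in rather than Graham's exact procedure: it counts each word once per message, skips words seen fewer than `min_count` times, and clamps the result to the range 0.01 to 0.99 so that no single word is treated as absolute proof. The tokenizer and all names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase words in a message (simplified tokenizer)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def word_weights(spam_msgs, ham_msgs, min_count=5):
    """Estimate, for each word, the probability that it indicates spam."""
    if not spam_msgs or not ham_msgs:
        raise ValueError("need at least one spam and one legitimate message")

    # Number of messages in each collection that contain each word.
    spam_counts = Counter(w for m in spam_msgs for w in tokenize(m))
    ham_counts = Counter(w for m in ham_msgs for w in tokenize(m))

    weights = {}
    for word in spam_counts.keys() | ham_counts.keys():
        if spam_counts[word] + ham_counts[word] < min_count:
            continue  # too rare to say anything reliable about
        spam_rate = spam_counts[word] / len(spam_msgs)
        ham_rate = ham_counts[word] / len(ham_msgs)
        raw = spam_rate / (spam_rate + ham_rate)
        weights[word] = min(0.99, max(0.01, raw))  # never fully certain
    return weights
```

With a few thousand messages on each side, words such as "click" would be expected to land near the top of the range and words such as "though" near the bottom, which is the pattern the column describes.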

Graham proposes that his research be used to create a "seed filter" that would become part of users' e-mail programs. Users would also be equipped with two Delete commands. One would be the regular Delete key, for genuine messages, while the other would be a Delete-As-Spam key, to be used when deleting spam messages. After a short time, each user would have an even more accurate filter, and because every user's filter would differ, spammers wouldn't have a single seed file to work around.
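The two-key idea can be pictured as a filter that keeps separate word counts for each kind of deletion. The class below is a hypothetical sketch of that bookkeeping only; the class name, method names, whitespace tokenizer, and clamping range are all assumptions, and a real mail program would start from the seed filter's counts rather than from zero.

```python
from collections import Counter


class PersonalFilter:
    """Per-user word statistics updated by two different Delete commands."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.n_spam = 0
        self.n_ham = 0

    @staticmethod
    def _words(text):
        # Simplified tokenizer: lowercase, split on whitespace.
        return set(text.lower().split())

    def delete(self, text):
        """Regular Delete: the message was legitimate."""
        self.ham_counts.update(self._words(text))
        self.n_ham += 1

    def delete_as_spam(self, text):
        """Delete-As-Spam: the message was spam."""
        self.spam_counts.update(self._words(text))
        self.n_spam += 1

    def weight(self, word):
        """Spam weight for a word, or None if there is no evidence yet."""
        if self.n_spam == 0 or self.n_ham == 0:
            return None
        spam_rate = self.spam_counts[word] / self.n_spam
        ham_rate = self.ham_counts[word] / self.n_ham
        if spam_rate + ham_rate == 0:
            return None
        return min(0.99, max(0.01, spam_rate / (spam_rate + ham_rate)))
```

Because the counts grow out of each user's own mail, two users pressing the same keys on different messages end up with different weights, which is the property that would deny spammers a single target.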

I've long been an advocate of suing spammers out of existence, using state laws that prohibit false identities (employed by almost all spammers). Graham, too, supports the anti-spam laws, but mainly because they make spam easier to identify (by making certain terms predictably appear as spammers deny that their messages fall under the laws). Meanwhile, Graham has made a believer of me.

Probability theory finally makes a filter that works:

http://www.paulgraham.com http://bri.li/?4e68

- - - - - - - - - - - - - - - - - - - - - - - - - - - -

Livingston's Top 10 News Picks o' the Week

1. E-business is enjoying double-digit growth again

http://www.ecommercetimes.com http://bri.li/?430

2. New keyword tool calculates return on investment

http://www.acws.com http://bri.li/?818

3. Increase sales by handling your Web failures well

http://www.newarchitectmag.com http://bri.li/?c00

4. Lessig on tech: Ours is less and less a free society

http://www.oreillynet.com http://bri.li/?fe8

5. Hollywood's DVD "region coding" system is collapsing

http://news.bbc.co.uk http://bri.li/?13d0

6. Hackers find it easy to get into military PCs

http://www.washingtonpost.com http://bri.li/?17b8

7. Ten rules for writing for a dynamic World Wide Web

http://www.alistapart.com http://bri.li/?1ba0

8. Pros share secret tricks of the new DreamWeaver MX

http://www.intranetjournal.com http://bri.li/?1f88

9. HTML tips: Importance of font sizing for usability

http://www.useit.com http://bri.li/?2370

10. See great Flash animations: Solemates (plays music)

http://www.centuryinshoes.com http://bri.li/?2758

- - - - - - - - - - - - - - - - - - - - - - - - - - - -

Wacky Web Week: Apple's "Ellen Feiss" video rocks

The latest hilarious video making the rounds of the Net involves a spacey -- some say stoned -- teenage girl explaining that Windows ate her homework.

The work is a TV spot in Apple's "Switch" series, but it can easily be enjoyed by Windows users as well (most of whom can probably relate).

As explained by Ellen Feiss, the young woman in the ad, "It was, like, beep beep beep beep beep beep beep, and then, like, half of my paper was gone." She adds it was "kind of a bummer."

A small cult has sprung up around the subject, with numerous fan sites worshipping the non-actress, complete with video clips of her at MacWorld and altered versions of the Apple original. Wired News, below, has the best links to the classic video and its many imitators.

"Windows ate my homework, you know, like, buy an Apple":

http://www.wired.com http://bri.li/?c398

- - - - - - - - - - - - - - - - - - - - - - - - - - - -

E-Business Secrets: Our mission is to bring you such useful and thought-provoking information about the Web that you actually look forward to reading your e-mail.

About the Author: E-Business Secrets is written by InfoWorld contributing editor Brian Livingston (http://SecretsPro.com). Research director is Vickie Stevens. Brian has published 10 books, including:

Windows Me Secrets: http://www.amazon.com http://bri.li/?0764534939

Windows 2000 Secrets: http://www.amazon.com http://bri.li/?0764534130

Win a gift certificate good for a book, CD, or DVD of your choice if you're the first to send a tip Brian prints. mailto:Brian@SecretsPro.com

Brian Livingston is publisher of BriansBuzz.com. Send tips to him at brian@briansbuzz.com.


Copyright © 2004, Reprints, Permissions, Licensing