Scientists at Carnegie Mellon University, working with federal grant monies, have discovered that phishing e-mails are decidedly different from most other spam -- so much so that the fraudulent messages can almost entirely be detected and filtered out.
CMU researchers state that their analysis catches 92.65 percent of phishing attempts. Only 0.12 percent of legitimate messages are miscategorized as fraudulent. This "false positive" percentage is tiny enough that the phishing filter could be added to traditional spam filters even by corporations that can't allow any significant loss of important inbound mail.
These findings have tremendous potential to reduce identity thefts that are initiated by e-mail. But neither CMU nor its government sponsors have issued any press releases about the study. You're reading about it here first.
Summertime, and the Phishing is Easy
If you're a frequent reader of my columns, you've probably heard a lot about phishing -- bogus e-mails that appear to be from a bank or ISP. These messages lure users to a fake Web site that's designed to collect usernames, passwords, credit-card numbers or other valuable information.
But many computer users are still falling for these scams. It's difficult to get hard figures on how many billions of dollars are lost each year to phishing, but the number of attacks is soaring.
The latest Phishing Trends Report by the Anti-Phishing Working Group, a coalition of financial institutions and other businesses, says 11,976 new phishing Web sites were detected by the group in May 2006. That's up from 3,326 such sites in the same month of 2005. Despite misconceptions that hackers in Russia are behind most attacks, 34 percent of phishing Web sites are based in the United States, with 15 percent in China and smaller numbers in other countries, APWG says.
Corporate spam filters are adequate to suppress some phishing e-mails, but not all. Now, the new Carnegie Mellon report shows effective ways to discern phishing messages that might otherwise slip through the net.
The study was conducted at CMU by Ph.D. candidate Ian Fette, associate professor Norman Sadeh, and faculty member Anthony Tomasic. It was funded by the U.S. Army Research Office and the National Science Foundation's Cyber Trust Initiative, which is sponsoring a CMU research center called CyLab.
Tell-tale Warning Signs of Phishing Messages
Most spam messages don't need to pretend that the Web sites they link to are respected brand names. People who wish to buy prescription drugs on the sly, for example, may not mind being directed to a site with an obscure name like Pills-Without-Prescriptions.com.
The essence of phishing, however, is that the linked Web site appears to be the legitimate home of a well-known company. It's this central fact of deception, the CMU researchers say, that enables phishing e-mails to be detected. The study uses sophisticated statistical analysis to detect unusual e-mail traits, such as:
Links to "fresh" domains. More than 12 percent of phishing e-mails contain a link to a domain name that was registered fewer than 60 days ago. Because fraudulent Web sites quickly disappear or are kicked off the Internet when discovered, the average phishing site stays online only 5 days, according to APWG.
Links in dotted-decimal format. Many Web sites used for phishing are hosted on home PCs that have been infected by spyware and turned into "zombies." These sites don't have domain names assigned to them, so phishing e-mails must link to them using a raw IP address, such as 126.96.36.199. About 45 percent of phishing e-mails link to such a "dotted-decimal" address.
Clickable domain name doesn't match destination. It's simple for the creator of an e-mail message to make the visible text of a link say "Citibank.com" or whatever. In reality, an end user who clicks the link is sent to some other domain that merely looks like Citibank's. About 50 percent of phishing e-mails contain links in which the visible domain name and the destination don't match.
Atypical destination of "click here" links. To appear legitimate, several links in a phishing e-mail may point to actual privacy statements and customer-service forms at, for example, PayPal.com. The link that the phisher urges users to click, however, points to a different Web site entirely. About 18 percent of the time, phishing e-mails contain an atypical link such as this.
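For readers who want to see what these four signs look like in practice, here is a minimal sketch in Python. It is not the researchers' algorithm, which weighs 10 factors statistically; it only checks the four traits described above against the links in an HTML message body. The function names, the "click here" wording test and the registered_on lookup table are my own assumptions for illustration. In particular, registration dates must come from somewhere (the researchers looked them up by hand), so the sketch simply accepts them from the caller.

```python
import re
from collections import Counter
from datetime import date
from html.parser import HTMLParser
from urllib.parse import urlparse

DOTTED_DECIMAL = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
DOMAIN_IN_TEXT = re.compile(r"\b((?:[a-z0-9-]+\.)+[a-z]{2,})\b", re.I)


class LinkCollector(HTMLParser):
    """Collects (href, visible text) pairs for every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def strip_www(host):
    return host[4:] if host.startswith("www.") else host


def text_mismatch(host, text):
    """True if the visible text names a domain the link does not go to."""
    m = DOMAIN_IN_TEXT.search(text)
    if not m:
        return False
    shown = strip_www(m.group(1).lower())
    actual = strip_www(host)
    return not (actual == shown or actual.endswith("." + shown))


def tell_tale_signs(html_body, sent_on, registered_on=None):
    """Report which of the four signs appear in one message.

    registered_on is a hypothetical {host: registration date} table
    supplied by the caller; hosts missing from it are never "fresh".
    """
    registered_on = registered_on or {}
    parser = LinkCollector()
    parser.feed(html_body)
    links = []
    for href, text in parser.links:
        host = (urlparse(href).hostname or "").lower()
        if host:
            links.append((host, text))
    hosts = [h for h, _ in links]
    usual = Counter(hosts).most_common(1)[0][0] if hosts else None
    return {
        "fresh_domain": any(
            h in registered_on and (sent_on - registered_on[h]).days < 60
            for h in hosts
        ),
        "dotted_decimal": any(DOTTED_DECIMAL.match(h) for h in hosts),
        "text_mismatch": any(text_mismatch(h, t) for h, t in links),
        # Assumption: "atypical" means the click-here link leaves the
        # host that most of the message's other links point to.
        "atypical_click_here": any(
            "click here" in t.lower() and h != usual for h, t in links
        ),
    }
```

Each sign comes back as a simple true or false. As the researchers themselves caution, a single sign proves nothing; the value lies in how the factors are weighed together.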
In a telephone interview, researchers Fette and Tomasic acknowledged that their work was in its early stages. "We don't actually have a decision tree that weights each of the factors," said Fette. "We don't have some program yet that people can download."
The research also suffers from the fact that the dataset of tested messages is more than two years old. To determine whether destination domain names had been registered fewer than 60 days before the messages were sent, the researchers had to laboriously look up the registration dates. Running further experiments on live data would help to verify whether the algorithms that work on the tested dataset still work on today's mail, the study's authors say.
Don't Try This on Your Own Mail, Please
Because no packaged software that implements the study's findings is commercially available yet, you might be tempted to start simply deleting e-mails you receive, based solely on a few of the "tell-tale factors." I strongly advise you against trying to invent your own rules in this way.
Many legitimate e-mail messages bear features that the study found to be suspicious. If you delete all messages that exhibit any of the four factors described above, for example, you'll eliminate more than 2 percent of your legitimate inbound messages, according to figures in the study. No company can allow that much mail from customers and vendors to be lost.
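To put the two error rates side by side, here is a small back-of-the-envelope calculation in Python. The volume of 10,000 legitimate messages is a hypothetical figure chosen only for illustration, and because the study says home-made rules lose "more than 2 percent," the second number is a lower bound.

```python
def expected_lost(legit_messages, false_positive_percent):
    """Expected number of legitimate messages wrongly flagged."""
    return round(legit_messages * false_positive_percent / 100)


LEGIT = 10_000  # hypothetical volume of legitimate inbound mail

study_filter = expected_lost(LEGIT, 0.12)  # rate reported for the CMU analysis
naive_rules = expected_lost(LEGIT, 2.0)    # "more than 2 percent": a lower bound

print(study_filter, naive_rules)
```

On those assumptions, the study's approach would misfile about a dozen good messages, while deleting on any single tell-tale sign would cost at least 200.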
Instead, I urge you to wait for professional phishing-filter software to become available. The report's authors explained to me that their algorithm, using 10 complex factors, establishes an n-dimensional space and computes a nonplanar boundary between phishing messages and legitimate e-mails. That's not something you can reproduce with a few simple rules.
If you're really impatient to eliminate phishing messages, your first line of defense is a brand-name spam filter, which will stop most unsolicited bulk e-mails. Then you can consider adding rules to look for "tell-tale signs" of phishing messages that slipped through. If you find anything suspicious using your own unsophisticated rules, write "[CAUTION]" into the Subject line rather than deleting what may be legitimate messages.
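The tagging step might look like the following sketch, which uses Python's standard email package. The tag_subject function and its suspicious flag are hypothetical names of my own; how you decide that a message is suspicious is left to whatever rules you've written. The point is that the message is marked, never deleted.

```python
from email import message_from_string


def tag_subject(msg, suspicious, tag="[CAUTION]"):
    """Prefix the Subject line of a suspicious message; never delete it."""
    subject = msg.get("Subject", "")
    if suspicious and not subject.startswith(tag):
        # Remove the old header first so the message keeps one Subject.
        del msg["Subject"]
        msg["Subject"] = f"{tag} {subject}".strip()
    return msg


raw = "From: service@example.com\nSubject: Verify your account\n\nPlease log in.\n"
msg = tag_subject(message_from_string(raw), suspicious=True)
print(msg["Subject"])
```

Running the function twice on the same message leaves a single tag in place, so it is safe to apply at more than one point in a mail pipeline.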
I asked why the university and its sponsors hadn't publicized the report, which was completed in June. "This is still very early research," Fette replied. The academics would like to find an executive of a large corporation who would authorize them to rerun their experiment on a live datastream. The researchers assure me that they would protect the confidentiality of the messages that were scored in the test.
I hope one of my readers will take the researchers up on their challenge. The study's authors can be reached at CMU's Institute for Software Research International.
If you'd like more information, CMU has posted a short abstract of the researchers' study. The full 16-page report on the work is available as a PDF file.