Friday, November 14, 2008
Google Chrome includes features to help protect users against phishing and malware attacks. If you have ever hit a red page with the title "Warning: Visiting this site may harm your computer!" (such as our test page) or "Warning: Suspected phishing site!" then you have already seen these features in action. While we try to explain what's happening on that warning page, a number of people have asked for more information about how these features work: where the data behind those warnings comes from, how that data gets to the computer, and what privacy implications the features have.
Where does the phishing and malware data come from?
Google is constantly crawling and re-crawling the web, all the while finding new and changed websites. These websites are found by following links from other websites, crawling URLs submitted by webmasters and users, and so forth. Sometimes, during that process, we discover a website where something doesn't seem right. A website may look like a phishing website, designed to steal your personal information, or it may contain signs of potentially malicious activity that would install malware onto your computer without your consent. If we find a website that looks like it's a phishing page, it gets added to a list of suspected phishing websites. If we find a website that contains signs of potentially malicious activity, we start up a virtual machine, browse to that website, and watch what happens. If we see certain activities happen on that virtual machine (such as viruses being downloaded and installed), we add that website to a list of suspected malware-infected websites. The process for discovering suspected malware-infected websites is described in more detail in a paper written by Niels Provos and colleagues from Google's anti-malware team.
How does this data get to my computer?
If you have phishing and malware protection enabled, then Google Chrome will contact servers at Google within five minutes of startup, and approximately every half hour thereafter, to download updated lists of suspected phishing and malware websites. These lists are then stored on your computer, so that as you browse the web, each page can be checked against the list of suspected phishing and malware websites locally, without sending the address of each webpage you visit to Google. This is designed to offer both performance (by not having to wait on a round-trip request to Google's servers) and privacy (by not sending a record of your browsing session to Google).
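To make the flow above concrete, here is a minimal sketch in Python of a client that schedules its list updates and checks pages locally. It is an illustration only, not Google Chrome's implementation: the names `first_update_delay`, `LocalLists`, `apply_update`, and `check` are hypothetical, picking a random moment inside the five-minute startup window is an assumption, and the lists are stored as plain address strings purely for readability.

```python
import random

STARTUP_WINDOW_S = 5 * 60    # first update happens within five minutes of startup
UPDATE_INTERVAL_S = 30 * 60  # later updates happen roughly every half hour


def first_update_delay(rng=random):
    """Pick a delay in seconds for the first update (assumed: random within the window)."""
    return rng.uniform(0, STARTUP_WINDOW_S)


class LocalLists:
    """Locally stored copies of the suspected phishing and malware lists."""

    def __init__(self):
        self.phishing = set()
        self.malware = set()

    def apply_update(self, phishing, malware):
        # Simplification: replace the stored lists wholesale on every update.
        self.phishing = set(phishing)
        self.malware = set(malware)

    def check(self, url):
        """Look the page up locally; nothing is sent over the network here."""
        if url in self.phishing:
            return "phishing"
        if url in self.malware:
            return "malware"
        return None


lists = LocalLists()
lists.apply_update(["http://phish.example/"], ["http://malware.example/"])
```

With the lists held in memory, `lists.check("http://phish.example/")` is a local set lookup, which is where both the performance and the privacy benefit come from: no request leaves the computer for a page that is not on either list.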
As the lists are large (hundreds of thousands of entries), we looked for ways to reduce the amount of information that had to be sent to and stored on users' computers, to reduce the amount of bandwidth and storage space consumed. One way we achieve this is by using partial hashes of URLs in the lists downloaded by the computer. What this means is that rather than sending down the full URL of each website, we do the following. First, we hash the URL using SHA-256. Then, we add the first 32 bits of that 256-bit hash to the list of phishing or malware websites. Those lists of 32-bit hash prefixes are then downloaded by Google Chrome in the background as described earlier.
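The two hashing steps can be sketched in a few lines of Python. This is a simplified illustration rather than the exact procedure used to build the lists: the helper names `hash_prefix` and `build_prefix_list` are hypothetical, the example URLs are made up, and the URL string is hashed exactly as given, with no normalization applied first.

```python
import hashlib

PREFIX_BYTES = 4  # 32 bits of the 256-bit SHA-256 hash


def hash_prefix(url: str) -> bytes:
    """Return the first 32 bits of the SHA-256 hash of the URL."""
    full_hash = hashlib.sha256(url.encode("utf-8")).digest()  # 32 bytes
    return full_hash[:PREFIX_BYTES]


def build_prefix_list(urls):
    """Build the compact list of hash prefixes from full URLs."""
    return {hash_prefix(url) for url in urls}


prefixes = build_prefix_list([
    "http://phish.example/login",
    "http://malware.example/",
])
```

Each entry now occupies 4 bytes regardless of how long the original URL was. The trade-off is that unrelated URLs can share the same 32-bit prefix, so a prefix match on its own does not prove that a page is on the list.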
How is this data used, and what is sent back to Google?