Google Chrome includes features to help protect users against phishing and malware attacks. If you have ever hit a red page with the title "Warning: Visiting this site may harm your computer!" (such as our test page) or "Warning: Suspected phishing site!" then you have already seen these features in action. While we try to provide an explanation of what's happening on that warning page, a number of people have asked for more information about how this feature works: where the data behind those warnings comes from, how that data gets to the computer, and what privacy implications the feature has.

Where does the phishing and malware data come from?

Google is constantly crawling and re-crawling the web, all the while finding new and changed websites. These websites are found by following links from other websites, crawling URLs submitted by webmasters and users, and so forth. Sometimes, during that process, we discover a website where something doesn't seem right. A website may look like a phishing website, designed to steal your personal information, or it may contain signs of potentially malicious activity that would install malware onto your computer without your consent. If we find a website that looks like it's a phishing page, it gets added to a list of suspected phishing websites. If we find a website that contains signs of potentially malicious activity, we start up a virtual machine, browse to that website, and watch what happens. If we see certain activities happen on that virtual machine (such as viruses being downloaded and installed), we add that website to a list of suspected malware-infected websites. The process for discovering suspected malware-infected websites is described in more detail in a paper written by Niels Provos and colleagues from Google's anti-malware team.

How does this data get to my computer? 

If you have phishing and malware protection enabled, then Google Chrome will contact servers at Google within five minutes of startup, and approximately every half hour thereafter, to download updated lists of suspected phishing and malware websites. These lists are then stored on your computer, so that as you browse the web, each page can be checked against the list of suspected phishing and malware websites locally, without sending the address of each webpage you visit to Google. This is designed to offer both performance (by not having to wait on a round-trip request to Google's servers) and privacy (by not sending a record of your browsing session to Google).

As the lists are large (hundreds of thousands of entries), we looked for ways to reduce the amount of information that had to be sent to and stored on users' computers, to reduce the amount of bandwidth and storage space consumed. One way we achieve this is by using partial hashes of URLs in the lists downloaded by the computer. What this means is that rather than sending down the full URL of each website, we do the following. First, we hash the URL using SHA-256. Then, we add the first 32 bits of that 256-bit hash to the list of phishing or malware websites. Those lists of 32-bit hash prefixes are then downloaded by Google Chrome in the background as described earlier.
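
To make the hashing step concrete, here is a minimal sketch in Python of how a 32-bit prefix could be derived from a URL and collected into a list. The function names hash_prefix and build_prefix_list are made up for illustration, and the sketch hashes the URL string exactly as given; any normalization of URLs that happens before hashing is not described here and is left out.

```python
import hashlib

PREFIX_BYTES = 4  # 32 bits, as described above


def hash_prefix(url: str) -> bytes:
    """Return the first 32 bits of the SHA-256 hash of a URL.

    Assumption: the URL is hashed as a UTF-8 string with no normalization.
    """
    full_hash = hashlib.sha256(url.encode("utf-8")).digest()  # 32 bytes = 256 bits
    return full_hash[:PREFIX_BYTES]


def build_prefix_list(urls):
    """Build the compact set of prefixes that would be sent to the browser."""
    return {hash_prefix(url) for url in urls}


# Hypothetical example URLs, not real list entries.
suspected = ["http://phish.example/login", "http://malware.example/download"]
prefixes = build_prefix_list(suspected)
print(len(prefixes), "prefixes of", PREFIX_BYTES, "bytes each")
```

Each entry costs 4 bytes instead of the length of a full URL or the 32 bytes of a full hash, which is where the bandwidth and storage savings come from.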

How is this data used, and what is sent back to Google?

When you browse the web using Google Chrome, the hash of each URL is computed, and the first 32 bits of that URL's hash are compared against the list of suspected phishing and malware websites. This includes the URL of the website you are visiting, as well as the URL of any included resources (such as included JavaScript or Adobe Flash movies). If the first 32 bits of the hash match an entry in the list, it is likely that the URL is on the list of suspected phishing or malware websites. At this point, we can only say likely, because there is still a reasonable chance of hash collisions in the 32-bit space: two distinct URLs with distinct 256-bit hashes where the first 32 bits of those hashes are the same.

To confirm that the URL is suspected as a phishing or malware website, and not just a 32-bit hash collision, the 32-bit hash prefix is sent to Google. Google then returns the full 256-bit hashes that start with those 32 bits and are suspected of being phishing or malware. The full 256-bit hash of the URL in question can then be compared against the 256-bit hash(es) returned by Google, to determine whether the URL in question is in fact on the list of suspected phishing or malware websites.

Using this scheme, Google Chrome is able to quickly check the website and its resources against a local database, and only sends information back to Google when the site matches an entry on the locally stored lists. In the case where information is sent to Google to verify such a suspicion, that information consists only of a part of the hash of a URL, not the URL itself. As such, Google never gets information that would definitively indicate whether a user has visited a particular website or not. The end result is a low-overhead, efficient mechanism to help protect against phishing and malware, while also helping to protect users' privacy.
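
The two-step check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Chrome's actual code: HypotheticalServer stands in for Google's servers, is_suspected and sha256_digest are made-up names, and URLs are hashed as plain strings with no normalization.

```python
import hashlib

PREFIX_BYTES = 4  # 32 bits


def sha256_digest(url: str) -> bytes:
    """Full 256-bit hash of the URL string (assumed unnormalized)."""
    return hashlib.sha256(url.encode("utf-8")).digest()


class HypotheticalServer:
    """Stand-in for the remote service; it only ever receives hash prefixes."""

    def __init__(self, bad_urls):
        self._full_hashes = {sha256_digest(u) for u in bad_urls}
        self.requests = []  # record of every prefix a client sent

    def full_hashes_for(self, prefix: bytes):
        self.requests.append(prefix)
        return {h for h in self._full_hashes if h.startswith(prefix)}


def is_suspected(url: str, local_prefixes, server) -> bool:
    digest = sha256_digest(url)
    prefix = digest[:PREFIX_BYTES]
    # Step 1: local check. Most URLs stop here and nothing is sent anywhere.
    if prefix not in local_prefixes:
        return False
    # Step 2: a prefix match may be a collision, so fetch the full hashes
    # for that prefix and compare against the full local hash.
    return digest in server.full_hashes_for(prefix)


bad = ["http://phish.example/login"]  # hypothetical list entry
server = HypotheticalServer(bad)
local_prefixes = {sha256_digest(u)[:PREFIX_BYTES] for u in bad}

print(is_suspected("http://phish.example/login", local_prefixes, server))  # True
print(is_suspected("http://safe.example/", local_prefixes, server))  # False
```

In this sketch the server's request log only ever contains 4-byte prefixes, which mirrors the privacy property described above: many different URLs share any given prefix, so the prefix alone does not identify the page that was visited.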