31 March 2011

Estimating web malware infections

Many reports on web malware infections use Google queries to estimate the impact of an infection. The latest example comes from Websense.

I don't recommend using Google's "About ... results" count to estimate the number of infected URLs. The estimate changes dramatically as you click through the search results: what starts as "About 533,000 results" at some point drops to "Page 38 of 374 results".

These result pages change very fast, so you might hit the "end" of your results on a different page than I did. This number is not accurate either -- search result pages typically limit the number of URLs they return per site.

That's one reason I'd be cautious about this method. The other reason is perhaps more fundamental. A search engine does not index HTML tags, only the text between these tags. So a query for "<script src="http://lizamoon.com/ur.php" does not necessarily yield pages infected with that script. It mostly yields pages where the infection was unsuccessful and the tag appears as HTML-escaped text. The blog post from Websense actually illustrates this:

This is not a script include -- it's text, and in this case, harmless. Some of the infections might actually have worked, though: as the first image shows, two of the results are marked as "may harm your computer" by Google.
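To make the difference between a real script include and escaped text concrete, here is a minimal Python sketch. It is only an illustration, not how Google or Websense classify pages: the class name ScriptIncludeFinder, the helper classify, and the two sample pages are made up for this example, and only the script URL comes from the query above.

```python
from html.parser import HTMLParser

# URL taken from the search query discussed above.
TARGET_SRC = "http://lizamoon.com/ur.php"


class ScriptIncludeFinder(HTMLParser):
    """Hypothetical helper: counts real <script src=...> tags versus
    places where the same URL only shows up as visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.real_includes = 0   # actual script tags pointing at TARGET_SRC
        self.text_mentions = 0   # text chunks that merely mention TARGET_SRC

    def handle_starttag(self, tag, attrs):
        # Called only for genuine tags in the markup.
        if tag == "script" and dict(attrs).get("src") == TARGET_SRC:
            self.real_includes += 1

    def handle_data(self, data):
        # Called for text content; escaped tags (&lt;script ...&gt;) end up here.
        if TARGET_SRC in data:
            self.text_mentions += 1


def classify(html_source):
    """Return (real_includes, text_mentions) for one page's HTML source."""
    finder = ScriptIncludeFinder()
    finder.feed(html_source)
    finder.close()
    return finder.real_includes, finder.text_mentions


# Made-up sample pages: one where the injection worked, one where the
# site escaped the injected string so it renders as harmless text.
infected_page = '<p>News</p><script src="http://lizamoon.com/ur.php"></script>'
escaped_page = '<p>&lt;script src="http://lizamoon.com/ur.php"&gt;&lt;/script&gt;</p>'

print(classify(infected_page))
print(classify(escaped_page))
```

Only the second kind of page contains the tag as text, which is the kind a text-only index would match for that query.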

Google offers a different way to estimate infections, the Safe Browsing Diagnostic page. For this site, at the time I fetched the page, it reported that 5 sites were infected. That's not to claim that the diagnostic page is the most accurate estimate out there, but I work on the team and I trust it :).

anti spam service said...

I find that method quite insufficient as well. I question its accuracy overall.