HTTP Archive: nine months

August 17, 2011 8:02 pm | 6 Comments

Although the HTTP Archive was announced in March, I actually started gathering data back in November of 2010. This week’s run marks nine months from that initial crawl. The trends show that performance indicators are mixed, with some critical metrics like size and redirects on the rise.

[As a reminder, the HTTP Archive currently crawls approximately 17,000 of the world’s top websites. All of the comparisons shown here are based on choosing the “intersection” of sites across all of those runs. There are ~13K sites in the intersection.]

The transfer size of pages has increased 15% (95 kB) over nine months. The average size is now up to 735 kB. Note that this is the transfer size: many text resources (including HTML documents, scripts, and stylesheets) are compressed, so the actual (uncompressed) size is larger. The bulk of this growth has been in images – up 18% (66 kB). Scripts have had the greatest percentage increase, growing 23% (25 kB).
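
As a rough illustration of the difference between transfer size and actual size (this is just a sketch, not part of the HTTP Archive toolchain; the URL and the gzip-only handling are assumptions), the following Python snippet compares the bytes transferred for a single resource with its decompressed size:

    import gzip
    import urllib.request

    url = "https://www.example.com/"  # hypothetical resource to inspect
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})

    with urllib.request.urlopen(req) as resp:
        raw = resp.read()  # bytes as transferred over the wire
        encoding = resp.headers.get("Content-Encoding", "")

    # urllib does not decompress automatically, so undo the gzip encoding ourselves.
    body = gzip.decompress(raw) if encoding == "gzip" else raw

    print("transfer size: %.1f kB" % (len(raw) / 1024.0))
    print("actual size:   %.1f kB" % (len(body) / 1024.0))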

[Charts: page content breakdown for the Nov 15 2010 and Aug 15 2011 crawls]

Note that these sizes are the totals for all images in the page and all scripts in the page, respectively. The average size of individual resources has stayed about the same over this nine-month period. If individual resource size is the same, how is it that the total page size has increased? The increase in total transfer size is the result of a 10% increase in HTTP requests per page – that’s seven more resources per page.
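
A quick back-of-the-envelope check of that explanation, using the rounded averages quoted in this post (so the numbers only line up approximately):

    # Averages from the recap below: 69 -> 76 requests, 640 kB -> 735 kB per page.
    requests_nov, requests_aug = 69, 76
    size_nov_kb, size_aug_kb = 640, 735

    avg_nov = size_nov_kb / float(requests_nov)  # ~9.3 kB per resource
    avg_aug = size_aug_kb / float(requests_aug)  # ~9.7 kB per resource

    print("avg resource size: %.1f kB -> %.1f kB (roughly flat)" % (avg_nov, avg_aug))
    print("extra requests per page: %d" % (requests_aug - requests_nov))
    print("extra transfer size: %d kB" % (size_aug_kb - size_nov_kb))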

Redirects are known to cause page delays, and yet the percentage of sites containing at least one redirect increased from 58% to 64%. Failed requests are wasteful, tying up connections that could have been used more productively, and yet sites with errors grew from 14% to 25%.
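
A minimal sketch of how one might check a single site for a landing-page redirect (an assumed approach using the third-party requests library, not the HTTP Archive’s crawler):

    import requests

    def has_redirect(url):
        """Return True if fetching the URL involves at least one 3xx redirect."""
        resp = requests.get(url, allow_redirects=True, timeout=10)
        # resp.history holds the 3xx responses followed on the way to the final URL.
        return len(resp.history) > 0

    print(has_redirect("http://example.com/"))  # hypothetical site to test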

All the news isn’t gloomy. The use of the Google Libraries API has increased from 10% to 14%. This is good for performance because it increases the likelihood that, as a user navigates across sites, the most common resources will already be in their cache. In addition, for smaller sites, the Google Libraries servers may be faster and more geographically distributed than their own hosting.
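
As a rough illustration of how such usage can be spotted (an assumed heuristic, not the HTTP Archive’s actual classification), a page can be counted as using the Google Libraries API if any of its requests go to ajax.googleapis.com:

    from urllib.parse import urlparse

    def uses_google_libraries(request_urls):
        """Assumed heuristic: any request served from ajax.googleapis.com counts."""
        return any(urlparse(u).hostname == "ajax.googleapis.com" for u in request_urls)

    print(uses_google_libraries([
        "https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js",
        "https://example.com/js/app.js",
    ]))  # True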

The use of Flash has dropped two percentage points, from 47% to 45% of websites. Flash resources average 58 kB, which is much larger than other resources, and there are fewer tools and best practices for optimizing Flash performance.

There are still many resources that lack the HTTP response headers needed to make them cacheable. Luckily the trend is moving toward more caching: the share of resources without such headers has dropped from 61% to 58%. Stated the other way, the percentage of resources with caching headers grew from 39% to 42% (up three percentage points).
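
A simplified sketch of the kind of check involved (the HTTP Archive’s actual rules are more involved than this; only Cache-Control and the presence of Expires are considered here):

    def is_cacheable(headers):
        """Rough check: a positive max-age or an Expires header marks a response cacheable."""
        cache_control = headers.get("Cache-Control", "").lower()
        if "no-store" in cache_control or "no-cache" in cache_control:
            return False
        if "max-age=" in cache_control:
            try:
                max_age = int(cache_control.split("max-age=")[1].split(",")[0])
            except ValueError:
                return False
            return max_age > 0
        # Treat any Expires header as an intent to cache; a stricter check would
        # compare its date to the response's Date header.
        return "Expires" in headers

    print(is_cacheable({"Cache-Control": "public, max-age=31536000"}))  # True
    print(is_cacheable({}))                                             # False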

Here’s a recap of the performance indicators from Nov 15 2010 to Aug 15 2011 for the top ~13K websites:

  • total transfer size grew from 640 kB to 735 kB
  • requests per page increased from 69 to 76
  • sites with redirects went up from 58% to 64%
  • sites with errors went up from 14% to 25%
  • the use of Google Libraries API increased from 10% to 14%
  • Flash usage dropped from 47% to 45%
  • resources with caching headers grew from 39% to 42%

My kids started school this week. I’m hoping their first report card looks a lot better than this one.

6 Responses to HTTP Archive: nine months

  1. Regarding the 42% of resources with caching headers: do you have a breakdown by resource type? Specifically, I’d like to know what percentage of JS (out of only the JS) is cacheable. And CSS?

  2. @Kyle: Such a report isn’t currently available, but you can download a MySQL dump of the latest run and do that analysis yourself – a rough sketch of one such query appears after these comments.

  3. Steve, I am not that confident that using a CDN or even the Google CDN is such an undisputedly good route to take in general. I will tell you why:

    http://trends.builtwith.com/cdn tells us that 13% of the top 10k sites use a CDN, and 46% of the top 10k sites use jQuery (http://trends.builtwith.com/javascript). So we might say that roughly 6% of the top 10k sites use jQuery from the Google CDN, right? That means in the best case every 17th site fits that combination, which is already not that much of a recurrence. But that’s not all, since we haven’t yet looked at jQuery versions.

    Coming to that, http://w3techs.com/technologies/details/js-jquery/1/all tells us that most sites – 53% of all sites – still use jQuery 1.4. This further reduces the recurrence of the exact same combination to every 31st site. It’s probably even worse, since W3Techs doesn’t track more precise jQuery version numbers.

    Now remember: this is the best case! All of the other quadrillion combinations are worse.

    This has to be weighed against the alternative of hosting a library yourself, especially if you do not target a global audience. Reason: the Google CDN isn’t always faster at delivering the very same asset. For example, the CDN takes 125 ms to deliver jQuery to me, while a shared hosting server here in Germany needs only 85 ms for the same task. All times were measured without taking DNS lookup time into account.

    On top of that, we can concatenate all of our scripts into one file when we stay on our own server. That’s impossible with the Google CDN.

    So my opinion is that we should take a less blindly uber-enthusiastic and more realistic look at CDNs.

  4. @Schepp: As more sites use the Google Libraries API, the likelihood that users benefit from cross-site caching of common resources increases. There are certainly situations and geolocations where a local server may beat the Google Libraries response time for local users, but this is a general recommendation. Each site should definitely evaluate it for their own servers and audience.

  5. Thanks for the review, Steve. Do you also have numbers about the use of HTML5?

  6. @Frank: If you have specific suggestions for new statistics to track, feel free to suggest them. I don’t have the HTML to parse, but if there are things that can be seen in the headers, etc., that’s possible.
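
Following up on the reply to comment 1: a rough sketch of the kind of query that answers Kyle’s question, run against a local import of the HTTP Archive MySQL dump. The connection details and the table/column names used here (requests, mimeType, resp_cache_control, resp_expires) are assumptions – check them against the schema that ships with the dump you download.

    import mysql.connector  # third-party MySQL driver

    conn = mysql.connector.connect(user="root", password="", database="httparchive")
    cursor = conn.cursor()

    # Percentage of JavaScript responses that carry a caching header
    # (Cache-Control max-age or a non-empty Expires).
    cursor.execute("""
        SELECT 100 * SUM(IFNULL(resp_cache_control, '') LIKE '%max-age%'
                         OR IFNULL(resp_expires, '') <> '') / COUNT(*)
        FROM requests
        WHERE mimeType LIKE '%javascript%'
    """)
    print("cacheable JS: %.1f%%" % cursor.fetchone()[0])

    cursor.close()
    conn.close()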