HTTP caching with Qt

This is an in-depth article about how HTTP caching works in general and how it works with Qt.

What is HTTP caching?

When a browser loads a Web page, the different resources (HTML pages, images, CSS style sheets etc.) are stored locally, so that the next time a resource is requested, it can possibly be served from the local store instead of being loaded from the network again. This has several benefits:

  • speed up: Loading resources from the cache is a lot faster than loading them from the network.
  • offline usage: Pages can be displayed without being connected to the network.
  • reducing load: Loading resources from a cache or a proxy reduces the load on the originating server.

This article is mostly about finding out when a resource can be loaded from cache, and when it has to be loaded from the network.

How does caching work with the HTTP protocol?

The usual flow with HTTP caching goes like this: When the client (usually a browser) requests a resource via HTTP GET for the first time, it usually does not send any caching information along. The server responds with an HTTP 200 OK message and the data, and adds some headers to control caching on the client side, namely:

expiration information:

When the server responds to a client request, it also tells the client whether the resource can be cached to disk and, if that is the case, for how long the resource can be served from the cache when the client loads it again. In other words, it tells the client when the resource expires in the cache and has to be loaded from the network again. HTTP headers used by servers and proxies for sending expiration information are (list is not complete):

  • Expires: The server tells the client the date of when the resource expires. Example: "Expires: Fri, 29 Apr 2011 09:22:59 GMT"
  • Cache-Control: max-age: The server tells the client the maximum age of a resource, i.e. how old the resource can get while still being considered fresh. Example: The server tells the client that the resource can be cached for one hour ( = 3600 seconds): "Cache-Control: max-age=3600".
  • Cache-Control: s-maxage: Same as the max-age case, but it applies only to shared caches (e.g. caching proxies), where it takes precedence over max-age; private caches (e.g. browser caches) ignore it. Example: The server tells intermediate proxies that the resource can be cached for one hour (= 3600 seconds): "Cache-Control: s-maxage=3600"
  • Cache-Control: must-revalidate: The server tells the client that a stale resource must not be served from the cache without checking with the server first. This is needed in case other expiration information is not enough: a client is allowed to serve a stale (over-aged) resource from the cache (see QNetworkRequest::PreferCache below), so specifying "Cache-Control: max-age=0" alone would not be enough in that case. Adding "must-revalidate" makes sure the client always revalidates such a resource with the server. Facebook and Twitter, for example, use this for their front pages (but usually not for elements referenced from them).
  • Age: Denotes the age of a resource, i.e. the time in seconds since the resource was generated by the originating server. At first glance this seems redundant, because a reply from the originating server should always implicitly have an age of zero. However, the reply often does not come from the originating server directly, but from an intermediate proxy (check e.g. qt.nokia.com). In that case, the "Age" header denotes the number of seconds since the proxy fetched the resource from the originating server. The "Age" value needs to be taken into account when checking a resource against its "max-age" directive.

modification information:

When the client has a resource in its local cache, it can ask the server to send the resource only if it has changed. This always involves a round trip to the server, but might save data if the server tells the client that the resource has not changed since the client last fetched it. In that case, the server sends an HTTP message with an empty body, instead of sending the data body as well. HTTP headers used for sending modification information are (list is not complete):

From the server:

  • Last-Modified: The server tells the client the date of when the resource was last modified.
  • ETag: The server sends a version identifier of the transmitted resource. This can be thought of as a hash of the data body, which changes whenever the resource changes.

From the client:

  • If-Modified-Since: The client tells the server to only send the data if it has been modified since the given date, i.e. if the value of the Last-Modified header has changed. If it has not been modified, the server sends an HTTP 304 Not Modified message, containing only HTTP headers, but no body. If it has been modified, the server sends an HTTP 200 OK message containing the body.
  • If-None-Match: The client tells the server to only send the data if it has a new version identifier, i.e. if the value of the ETag header has changed. If it has not changed, the server sends an HTTP 304 Not Modified message, containing only HTTP headers, but no body. If it has changed, the server sends an HTTP 200 OK message containing the body.

It is interesting to note that the headers involving absolute dates ("Expires", "Last-Modified", "If-Modified-Since") were already present in HTTP 1.0; the newer HTTP 1.1 standard adds means that do not involve dates, but relative time spans ("max-age", "s-maxage") or versioning information ("ETag", "If-None-Match"). This is because date handling only works accurately if the server and client clocks are synchronized. ETags and relative time spans are more robust, as they do not assume synchronized clocks. That said, all of the headers presented above are still in widespread use.

How does caching work with Qt?

By default, no disk cache is used when retrieving resources over HTTP with the QNetworkAccessManager class. In order to enable a cache, you need to either instantiate the QNetworkDiskCache class or write your own class deriving from QAbstractNetworkCache and then set it on your QNetworkAccessManager instance by calling setCache().
In that case, Qt will load resources from the cache if the resource is still fresh, and load from the network if not; if possible it adds modification information, as described above.

In order to fine-tune how Qt loads resources from the network, you can call setAttribute() on your QNetworkRequest and set QNetworkRequest::CacheLoadControlAttribute to one of the following values:

  • AlwaysNetwork: Always load from the server and force intermediate caches to reload by setting "Cache-Control: no-cache" and "Pragma: no-cache".
  • PreferNetwork (default): If the resource can be found in the cache and the age of the cached resource is less than the maximum age (used headers: "Age", "Cache-Control: max-age") or the resource has not expired (used header: "Expires"), then it is loaded from the cache. If the resource has expired or has exceeded its maximum age, it is loaded from the server, if possible with modification information (used headers: "If-Modified-Since" if "Last-Modified" was given, and "If-None-Match" if "ETag" was given).
  • PreferCache: If the resource can be found in the cache and has not expired, then it is loaded from the cache. The contrast to PreferNetwork is that even stale resources, i.e. resources that have exceeded their maximum age, are loaded from the cache. If the resource has expired (determined via the "Expires" header) or cannot be found in the cache, this setting behaves like PreferNetwork.
  • AlwaysCache: Serve the data from the cache if available, never use the network; this can be seen as an offline mode. If the resource is not in the cache, an error is reported.

If you want to fine-tune the caching behaviour even more, you could add headers (e.g. "Cache-Control: max-age" or "Cache-Control: max-stale") yourself via QNetworkRequest::setRawHeader().
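
Putting the pieces of this section together, a request that goes through a disk cache might look roughly like the sketch below. It assumes Qt 5 or later (QStandardPaths and the lambda-based connect() are not available in Qt 4) and needs to be linked against the QtNetwork module; the URL and the "max-stale" value are placeholders chosen for the example. The raw header is simply sent along with the request; the sketch makes no claim about how the cache or the server reacts to it.

```cpp
#include <QCoreApplication>
#include <QDebug>
#include <QNetworkAccessManager>
#include <QNetworkDiskCache>
#include <QNetworkReply>
#include <QNetworkRequest>
#include <QStandardPaths>
#include <QUrl>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    QNetworkAccessManager manager;

    // Enable the disk cache; the manager takes ownership of the cache object.
    QNetworkDiskCache *cache = new QNetworkDiskCache(&manager);
    cache->setCacheDirectory(QStandardPaths::writableLocation(QStandardPaths::CacheLocation));
    manager.setCache(cache);

    // Placeholder URL for this example.
    QNetworkRequest request(QUrl(QStringLiteral("http://example.com/")));

    // Prefer the cached copy, even if it is stale.
    request.setAttribute(QNetworkRequest::CacheLoadControlAttribute,
                         QNetworkRequest::PreferCache);

    // Optionally add a caching header by hand (example value).
    request.setRawHeader("Cache-Control", "max-stale=60");

    QNetworkReply *reply = manager.get(request);
    QObject::connect(reply, &QNetworkReply::finished, [reply, &app]() {
        // Tells whether the reply was served from the cache or from the network.
        const bool fromCache =
            reply->attribute(QNetworkRequest::SourceIsFromCacheAttribute).toBool();
        qDebug() << "served from cache:" << fromCache;
        reply->deleteLater();
        app.quit();
    });

    return app.exec();
}
```

Running the program twice in a row is a simple way to see the difference: the first run has to go to the network, while the second run may be answered from the cache, depending on the caching headers the server sent.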

Areas for improvement in Qt

  • Implement freshness heuristics: If a resource does not have an expiration date (no "Expires" header) and its maximum age cannot be determined (no "max-age" directive), the client can use heuristics to determine whether the resource is still fresh. In particular, if a resource has a "Last-Modified" header, a fraction (the HTTP RFC mentions 10%) of the time elapsed since that date can be used as the period during which the resource is assumed to be fresh. For example, if a resource has a "Last-Modified" header set 10 days in the past, the resource can be assumed to be fresh for 1 day.
  • The age calculation needs to be reworked.
  • Fetching resources from the cache can be made faster.
  • The HTTP Vary header must be taken into account.

As always, feel free to vote and comment on the tasks above.

