East of the Sun, West of the Moon

2006/4/10

Caching

Filed under: Software — Erwin @ 6:18 pm

The situation is as follows:

I have several PCs (two of which are of the OS X persuasion) from which applications browse the interweb. While originally these were mostly browsers and some automated scripts, these days a significant chunk of the traffic comes from aggregators fetching RSS/Atom feeds. To slightly reduce unnecessary re-fetching of content, and to stop the most simpleminded phoning-home applications (the ones that try a direct connection instead of looking around for a proxy setting), I run a proxy application. Squid, to be precise.

Aside from allowing a few unusual https/SSL ports so that the IM protocols can be tunneled through the proxy, I (think I) run with a fairly standard configuration, which roughly means that:

  • If a URL contains /cgi-bin/ or ?, indicating that the target will return dynamically generated content, Squid won’t cache the response.
  • In all other cases, when no explicit expiry information is provided, Squid will cache the content for up to 3 days, or for 20% of the age the content had when it was fetched, whichever is less.

The side effect is that for a relatively quiet newsfeed (one that uses feed URLs like http://some.host/feed/), where new entries can be weeks if not months apart, 20% of the age can easily exceed 3 days, so the cached copy is treated as fresh for the full 3 days. When new content actually appears, it can take that long for the (polite) aggregator to see it!
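To make the arithmetic concrete, here is a rough Python sketch of the two rules as described above. It is only a model of the behaviour outlined in this post, not Squid's actual code; the function names and constants are made up for illustration.

```python
from datetime import timedelta

# Values taken from the description above; the names are hypothetical.
MAX_LIFETIME = timedelta(days=3)  # upper bound on heuristic freshness
AGE_FACTOR = 0.20                 # 20% of the content's age at fetch time


def looks_dynamic(url: str) -> bool:
    """True if the URL contains /cgi-bin/ or ?, i.e. is treated as dynamic."""
    return "/cgi-bin/" in url or "?" in url


def heuristic_lifetime(age_at_fetch: timedelta) -> timedelta:
    """How long content without explicit expiry information stays fresh."""
    return min(MAX_LIFETIME, age_at_fetch * AGE_FACTOR)


# A quiet feed whose content was 60 days old when fetched: 20% of 60 days
# is 12 days, so the 3-day cap is what applies.
quiet_feed_lifetime = heuristic_lifetime(timedelta(days=60))
print(looks_dynamic("http://some.host/feed/"), quiet_feed_lifetime)
```

Any content that was more than 15 days old when it was fetched hits the 3-day cap, which is exactly the situation a slow-moving feed ends up in.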

This was starting to annoy me quite a bit, so now I’ve added these lines to the Squid configuration file, which hopefully solve that issue:


# Don't cache RSS/Atom feeds.
acl XML_url urlpath_regex \.xml$
acl XML_type rep_mime_type -i ^(application|text)/((atom|rss)\+)?xml$
no_cache deny XML_url
no_cache deny XML_type
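As a quick sanity check of what these two patterns would and would not match, here is a small Python sketch. It assumes the MIME pattern is meant to cover application/atom+xml and application/rss+xml (hence the `\+`), and it uses Python's `re` module as a stand-in for Squid's own regex matching, which is close enough for patterns this simple but not guaranteed identical. The helper name and sample values are made up.

```python
import re

# Python equivalents of the two ACL patterns; "-i" becomes re.IGNORECASE.
XML_URL = re.compile(r"\.xml$")
XML_TYPE = re.compile(r"^(application|text)/((atom|rss)\+)?xml$", re.IGNORECASE)


def matches_feed_rules(url_path: str, mime_type: str) -> bool:
    """True if either pattern matches, i.e. the response would not be cached."""
    return bool(XML_URL.search(url_path) or XML_TYPE.search(mime_type))


# Hypothetical samples: (URL path, Content-Type value).
samples = [
    ("/index.xml", "text/html"),
    ("/feed/", "application/rss+xml"),
    ("/feed/", "text/html"),
]
for path, mime in samples:
    print(path, mime, matches_feed_rules(path, mime))
```

Note that a feed URL like /feed/ is only caught by the MIME type pattern, since its path does not end in .xml.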

Time will tell and if someone reading this has a better (Squid oriented) solution (or can tell me in advance I’m on the wrong track!), I’d appreciate hearing about it. 🙂
