Reading about LiveJournal's cache development and memcached has had me thinking about caching, a lot, for a couple months.
I recently had the pleasure of implementing a fun system designed around the idea that, in most cases, local caches will hold the data you will need. If you can provide server-affinity (not terribly difficult to do, though annoying, see pound or httpd) then you will, for the most part, have the set of data needed by a current user in the local cache on that node. Then all you need to do is replicate dirty messages. JMS rocks for this, and voila.
I have been trying to reconcile the (extremely powerful!) caching systems described in the memcached/livejournal presenation with resource-level (OJB, Hibernate, etc) level caching, and have been having a rough time. Yes, you can do interesting things with cache regions, and tweaking the mapping / lazy-loading / etc but in the end, you will probably hit the database enough to swamp it at a huge load.
So, take a cue from memcached, and put caching at the application level. You can still hide it (decorate your dao) from other services, and that is probably a good idea. The problem comes in that the state of the art in Java is fine-grained domain models, and I like this, a lot. It makes figuring out what to cache, and how/when to dirty cached things, much trickier though.
Simple example is a one-to-many collection on a Foo. You cache the Foo, and either strip the collection or risk having the things in it be dirty. Every o/r mapping tool out there handles this, but every one of them doesn't do it well in a really highly replicated environment. In short, you wind up relying on the database being horizontally scalable. This is way more expensive than the cache and web server being horizontally scalable.
The solution came to me twofold. One, Eric Evans struck again. You want to carefully control the entities that people are allowed to query for. Queries are the expensive part, not the retrieval. If you know the PK you can retrieve it from the cache. Finding the PK is what hurts, and this is the query. So we take Evans's aggregates, and their root element, and build around querying for aggregates. Now there are only a small(er) number of things for which you can query. More importantly, there are object graphs (aggregates) you can place into the cache as a single cache entry.
The second component is having a means of dirtying the cache which is aware of aggregate boundaries. If you can map an entity to a set of aggregates which it can be participating in (ie, a set of cache roots) you can find all instances of the entity in the cache. It seems like this might be hard (out of foo arbitrary graphs, how do you know which ones hold Foo[id=727]
?) It isn't that hard (I hope).
You have the o/r mapping metadata (hopefully soon to be standardized in JSR 220) which tells you what CAN hold a reference, and what the relationship between the instance and the aggregate root actually is. Knowing the relationship you can collect, in one query, the pk's of all the aggregate roots which hold a reference to Foo[id=727]
. This query may join seventeen, or even fifty, tables but it is deterministically constructable (yes, databases can handle fifty table joins, I met a Hibernate query doing just that the other day and am still shaking my head). This puts tons of load on the db though (50 table joins are expensive). Luckily, you are caching the heck out of the thing, so actual reads against the write database only happen as data is pushed out to the read databases, which soon gets pulled to cache.
Now, the the other approach is to not query for the aggregate roots, but to work from the O/R layer where the aggregate root is already in memory. This doesn't tell you the relation between this root and other roots which can contain a given entity, but I bet for most models this isn't terribly difficult to figure out. This, added to the o/r metadata, drastically reduces the search space for the aggregate root keys.
Now that we can determine the aggregate roots, which are the keys in the cache, we can dirty the correct ones =)
Because aggregate roots are really just row/pk information you can dirty them from anywhere, including your ruby batch processes.
The weak point in this chain, and I am not totally convinced yet (need to do the proofs, I guess), though it feels right, is that aggregate root keys can be determined from relationship metadata and dirty-instance identity with sufficiently less load on the (write-only) database that it is less than the network traffic of fine grained, flat-space, caching. I am really, really willing to bet that it is.