IT infrastructure research thoughts

Storage

IBM DS8000 write cache – all 6 GB of it


[Migrated post from my old site, comments unfortunately remain behind]

There you have it, no reason to continue this post.  The DS8000 has 6 GB of write cache.

Actually, each server has 6 GB, but you need to mirror writes, so that effectively halves the 12 GB total to 6.  This is not a readily known fact afaict (Google, Redbooks, …), and when I enquired about the truth of it, the admission was sort of reluctant.

Is it a problem? you may ask.  Well, probably not in most cases.

Let’s play with numbers a bit (and note that this is all much simplified, as it often is in user land):

We have 6 GB of write cache.  Let’s assume (for fun) an average block size of 64 KB.  This gives us approximately 100,000 blocks (98,304 to be exact).  At 100,000 write IOPS, and with nothing being destaged in the meantime, the write cache will last about 1 second, after which all writes will be delayed while the cache is being destaged.  A small write to RAID5 costs up to 4 backend IOs (read data, read parity, write data, write parity), so that’s (worst case) 400,000 IOs to RAID5 disk groups to clear 1 second’s worth of writes (let’s forget about full stripe writes for now, we are in user land after all).  That’s going to require a lot of spindles.
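To make the arithmetic above easy to replay with your own numbers, here is a small sketch.  The 6 GB, the 64 KB and the RAID5 penalty of 4 are the figures from this post; the function names are mine, and the model deliberately ignores destaging and full stripe writes, same as the text.

```python
# Back-of-the-envelope model of the write cache numbers in this post.
# Assumptions: binary units, no destaging while the cache fills,
# and every host write is a RAID5 small write (penalty of 4).

CACHE_BYTES = 6 * 1024**3    # 6 GB effective write cache
BLOCK_BYTES = 64 * 1024      # assumed average write size of 64 KB
RAID5_WRITE_PENALTY = 4      # read data, read parity, write data, write parity


def blocks_in_cache(cache_bytes=CACHE_BYTES, block_bytes=BLOCK_BYTES):
    """Number of whole blocks that fit in the write cache."""
    return cache_bytes // block_bytes


def seconds_until_full(write_iops, cache_bytes=CACHE_BYTES, block_bytes=BLOCK_BYTES):
    """Time to fill an empty cache if nothing is destaged meanwhile."""
    return blocks_in_cache(cache_bytes, block_bytes) / write_iops


def backend_ios(host_writes, penalty=RAID5_WRITE_PENALTY):
    """Worst-case backend IOs needed to destage the given host writes."""
    return host_writes * penalty


if __name__ == "__main__":
    print(blocks_in_cache())             # 98304
    print(seconds_until_full(100_000))   # 0.98304
    print(backend_ios(100_000))          # 400000
```

Change the block size and the picture changes quickly: at 8 KB blocks the same cache holds eight times as many writes.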

A boundary case, you may argue (everyone doing 100,000 write IOPS please raise your hand), but consider this: at a given number of spindles, the DS8000 can only perform backend IO at a given rate.  If the backend disk groups are RAID5 and that rate is lower than 4 x your sustained write IOPS, you will eventually run out of cache.
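The same reasoning can be sketched for the sustained case.  This is a simplified model of my own, not anything IBM publishes: the cache fills at the host write rate and drains at the backend rate divided by the RAID5 penalty.  The 40,000 backend IOPS and 12,000 host write IOPS in the example are made-up figures for illustration only.

```python
# Simplified fill/drain model: how long until the write cache is full?
# Assumptions: constant rates, RAID5 small writes only (penalty of 4),
# and 98,304 cache blocks (6 GB at 64 KB per block, as in this post).


def sustainable_write_iops(backend_iops, penalty=4):
    """Host write rate the backend can destage indefinitely."""
    return backend_iops / penalty


def seconds_until_cache_full(write_iops, backend_iops,
                             cache_blocks=98_304, penalty=4):
    """Seconds until the cache is full, or None if it never fills."""
    net_fill_rate = write_iops - sustainable_write_iops(backend_iops, penalty)
    if net_fill_rate <= 0:
        return None  # the backend keeps up, the cache never fills
    return cache_blocks / net_fill_rate


if __name__ == "__main__":
    # Hypothetical array: the backend can do 40,000 IOPS in total.
    print(sustainable_write_iops(40_000))            # 10000.0
    print(seconds_until_cache_full(12_000, 40_000))  # about 49 seconds
    print(seconds_until_cache_full(8_000, 40_000))   # None
```

In other words, the cache only buys time: anything above the sustainable rate is a burst, and 6 GB decides how long that burst may last.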

The concern for me is largely around scalability.  The DS8000, if you remove the nice, large wrapper, is merely 2 pSeries machines (powerful machines, yes, but still only 2), maxed out at 192 GB cache each, of which 6 GB is used to cache writes.  Inside, the DS8000 is still just a boring old dual-processor, monolithic storage array.  Compared to newer architectures, such as the XIV, it’s old hat.  Consider Hitachi’s VSP, which takes up to 1 TB of cache, of which up to half may be used for write cache: yes, that’s 512 GB compared to 6 GB.

At one stage during my investigations, I was told that the DS8000’s advanced caching algorithm would take care of it.

The algorithm in question is AMP (Adaptive Multistream Prefetching), which appears to be different to what others are doing.  AMP is an adaptive, asynchronous algorithm.  Adaptive in that it will vary the amount of prefetching, and asynchronous because it will not only prefetch when there is a cache miss, but also in the event of a cache hit.  Research has shown that AMP outperforms other classes of algorithms (static, synchronous) for most, if not all, workloads.  As an aside, the page replacement part of it sounds very much like ARC (Adaptive Replacement Cache), which was briefly used by Postgres and promptly removed again.
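To illustrate the synchronous versus asynchronous distinction only, here is a toy sketch.  It is emphatically not AMP: the prefetch degree is fixed rather than adaptive, prefetches complete instantly, the cache is unbounded, and the function and parameter names are my own.

```python
# Toy comparison of synchronous vs asynchronous prefetching on one
# sequential stream.  NOT the AMP algorithm: fixed prefetch degree,
# instant prefetches, unbounded cache.


def count_misses(num_pages, degree, asynchronous):
    """Read pages 0..num_pages-1 in order and count demand misses.

    degree: pages prefetched beyond the page being requested.
    asynchronous: if True, a hit on the last prefetched page also
    triggers the next prefetch; if False, only a miss does.
    """
    cached = set()
    last_prefetched = -1
    misses = 0
    for page in range(num_pages):
        if page not in cached:
            # Miss: fetch the page plus `degree` pages ahead.
            misses += 1
            cached.update(range(page, page + degree + 1))
            last_prefetched = page + degree
        elif asynchronous and page == last_prefetched:
            # Hit on the trigger page: prefetch the next batch early.
            cached.update(range(page + 1, page + degree + 1))
            last_prefetched = page + degree
    return misses


if __name__ == "__main__":
    print(count_misses(100, 4, asynchronous=False))  # 20
    print(count_misses(100, 4, asynchronous=True))   # 1
```

In this toy model the synchronous variant stalls on a miss every fifth page, while the asynchronous one only misses on the very first read.  A real array has to deal with prefetch latency and cache pollution, which is where the adaptive part comes in.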

I’m not sure if SVC uses AMP also, but look at this here interesting graph (backend track reads as per SVC’s own statistics): 

[Figure: vdsk-trackreads]

The difference between the green and the brown graph is how well SVC is getting on with the prefetching business.  The red line is the hard part no prefetching algorithm, by definition, will take care of.

Some workloads are friendlier:

[Figure: vdsk-trackreads-21]

Some are worse.

The key word when talking about prefetching is “sequential”.  In the face of random IO, a problem that today is largely still only solved with cache and parallelism, I’m afraid the DS8000’s advanced caching algorithm will not take care of it.
