IT infrastructure research thoughts

Deduplication: DataDomain vs ProtecTIER


[Migrated post from my old site, comments unfortunately remain behind]

I’ve been running some interesting tests on EMC’s DataDomain and IBM’s ProtecTIER.  Both are apparently the industry leaders in deduplication for secondary storage.  So our friendly resellers brought us one of each to test: an EMC DD890 and an IBM ProtecTIER of some designation (I forget exactly which one, but it’s the appliance variant where you add your own storage).

With everything connected, we now have a host (SLES 11, with 3 x 8 Gbps FC ports, of which 2 are for tape only) running IBM’s TSM, and both the ProtecTIER and the DataDomain attached with 4 x 8 Gbps fibres, all on the same 8 Gbps SAN switch.

Initially I battled for ages to get the stupid LAN-free agents working, to get copies of reasonably sized databases, and to get a reasonable rate of change on those databases.  It was a major schlep, and it is very difficult to get into a state where you can easily reproduce tests.

In addition, backups always come from somewhere.  Meaning: if you back up your local drive you won’t get more than a few MB/s.  Even if you back up some SAN-attached databases, your throughput will be limited by how fast that source can be read.  What I needed was the ability to back up from (and restore to) something resembling /dev/null.  This is not a real-life scenario (recovered any backups from /dev/null lately?), but in order to expose any latent weaknesses in the product you have to push the envelope a little.  My car can do 160 km/h (it’s old), but it’s not realistic or legal to drive at that speed.

Then I remembered some stuff I wrote ages ago when I was playing around with backing up Informix databases to TSM.  The Informix onbar utility is an XBSA client: it uses a shared library, shipped with TSM, that implements the XBSA standard.  I found some code and soon had a client with which I could send blocks of any size and number to TSM, and receive them back, all without first having to read the data to be backed up from some other device.

Old code, however, is very hard to read.  I have since found some samples under /opt/tivoli/tsm/client/api/bin/samples.  One of them was easily hacked into a client (email me for the source) with which I could do the same.
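To illustrate the idea (this is not the actual client, which is built on the TSM API samples), here is a minimal sketch of such a test harness.  It generates a block in memory once and pushes it repeatedly through a send function, so the measured rate never depends on reading from a disk.  The names measure_throughput and CountingSink are hypothetical, and the sink simply discards the data where the real client would hand it to the backup API.

```python
import time


def measure_throughput(send, block_size, block_count):
    """Push block_count in-memory blocks through send() and return MB/s."""
    # The block is created once in memory, so no source device is ever read.
    block = bytes(block_size)  # all zeros
    start = time.perf_counter()
    for _ in range(block_count):
        send(block)
    elapsed = time.perf_counter() - start
    total_mb = block_size * block_count / 1e6
    # Guard against a zero elapsed time on very coarse clocks.
    return total_mb / elapsed if elapsed > 0 else float("inf")


class CountingSink:
    """Hypothetical stand-in for the backup API: discards data, counts bytes."""

    def __init__(self):
        self.bytes_received = 0

    def __call__(self, block):
        self.bytes_received += len(block)


sink = CountingSink()
rate = measure_throughput(sink, block_size=256 * 1024, block_count=1000)
print(f"{sink.bytes_received} bytes sent at {rate:.0f} MB/s")
```

Block size and block count are the two knobs mentioned above; against a real target the rate you see would be that of the path to the backup device, not of the generator.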

I quickly found I could drive both the ProtecTIER and the DD890 close to fibre channel speed.  More exactly: 1.3 GB/s to the DD and 1.1 GB/s to the PT.  The numbers make sense considering the 2 x 8 Gbps fibres from the host to the SAN switch.
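As a rough sanity check on those numbers, here is a back-of-the-envelope calculation.  It assumes the commonly quoted nominal figure of about 800 MB/s per direction for one 8 Gbps FC link; that figure and the variable names are assumptions for illustration, not something measured in this test.

```python
# Assumed nominal payload rate of one 8 Gbps FC link, per direction.
NOMINAL_MB_PER_S_PER_LINK = 800
HOST_LINKS = 2  # the two host fibres used for this traffic

aggregate = NOMINAL_MB_PER_S_PER_LINK * HOST_LINKS  # MB/s available from the host

# Measured backup rates from the test, in MB/s.
measured = {"DD890": 1300, "ProtecTIER": 1100}

# Fraction of the assumed host-side ceiling that each device reached.
utilisation = {name: rate / aggregate for name, rate in measured.items()}

print(f"aggregate host bandwidth: {aggregate} MB/s")
for name, fraction in utilisation.items():
    print(f"{name}: {fraction:.0%} of the host links")
```

Under that assumption the two host links top out at roughly 1.6 GB/s, so 1.3 GB/s and 1.1 GB/s work out to roughly 81% and 69% of what the host could push at all.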

The next step was to introduce some randomness (all zeros dedups really well, in case you wondered).  So I added a flag that sets how random you want the blocks to be, as a percentage: 0% is all zeros, and 100% is all “random() % sizeof(long)”.
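A minimal sketch of such a block generator follows.  It is an illustration only and makes two assumptions that may differ from the real client: the random part is placed at the start of the block with zeros after it, and the random bytes come from os.urandom rather than from the C expression quoted above.  The function name make_block is hypothetical.

```python
import os


def make_block(size, percent_random):
    """Return a block of `size` bytes with `percent_random` percent random data.

    Assumed layout: the random bytes come first and the remainder is zeros.
    """
    if not 0 <= percent_random <= 100:
        raise ValueError("percent_random must be between 0 and 100")
    n_random = size * percent_random // 100
    # Random prefix, zero-filled remainder.
    return os.urandom(n_random) + bytes(size - n_random)


zero_block = make_block(1024, 0)      # dedups (and compresses) trivially
mixed_block = make_block(1024, 25)    # 256 random bytes, then 768 zeros
random_block = make_block(1024, 100)  # nothing for the dedup engine to match
print(len(zero_block), len(mixed_block), len(random_block))
```

Generating a fresh block for every send matters here: reusing one random block would itself be perfectly deduplicable after the first copy.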

Here was the first interesting thing.  The DD, at 100% random, was still doing > 1 GB/s.  The PT however nosedived to about 400 MB/s.

The second interesting thing was that the PT restores faster than it backs up, whereas the DD restores slightly slower than it backs up.  The PT also recovers faster than the DD (even though it backed up slower than the DD in my environment).  However, with 100% randomness the recovery speed dropped from 1.2 GB/s to about 800 MB/s.

In the next post I hope to tell you why this is impressive, and how difficult it apparently is to dedup.  I will also give some more detail about the way we set it all up.

Comments most welcome.
