Thursday, March 10, 2011

Exadata Backup and Recovery-1

In theory, there's no difference between theory and practice, but in practice, there is.  When theoretical maximums are used, life occurs, and reality sets in.  Seldom do companies even have an agreed-upon RTO, and even more seldom is it proven out until a disaster occurs.  An RTO that isn't proven is called a guess...and even an educated guess based on published theoretical maximums is a guess of what the best case scenario could be.  My current client has quarterly requirements to meet the RTO, and if the recovery plan can't hack it, adjustments (even if that means additional hardware purchases) must be made.  Designing a new backup infrastructure and recovery plan for this Exadata environment needs to not only be precise, it needs to be proven.

In my previous posts on virtualized storage (specifically the Hitachi USP-V), I talked about doing Oracle database backups via snaps on the virtualized storage.  They make life much easier for the DBA, with interfaces directly into RMAN.  Unfortunately, until Oracle buys or partners with a storage vendor (I'm hoping for NetApp) to bring that technology to Exadata, we're limited to traditional backup methods.

For the last several months, one of my focuses has been on meeting the recovery window set out by the business.  At first the requirement was...worst case scenario, we have to do a full recovery, everything needs to be running in 4 hours after a disaster.  This is an OLTP VLDB that generates about 1300GB of archivelogs/day.  Our RTO is a RECOVERY time objective, not a restore time objective, so the archivelog apply rate on Exadata is relevant to my issue.

Archivelog apply rates have primarily 2 big variables...the hardware speed and the type of DML.  A database with a DSS workload will apply much faster than an OLTP database on the same hardware.  Other factors affecting our apply rate are flashback database and GoldenGate, which requires supplemental logging.  The 1300GB archivelog generation rate is less than what it's going to be...how much more is anyone's guess.  Our on-site Oracle consultants are guessing 2X, but I think that's an over-estimate to play it safe.

To understand the size of the data that needs to be restored, we have more variables.  The uncompressed size of this database is expected to be many petabytes (projections are in flux...but more than one, less than 5).  We've been sold a single full rack Exadata machine (in an 80% data, 20% FRA configuration), with a contingency for more storage cells as needed for the next 3 years.  After 3 years the growth rate accelerates and we'll have more Exadata machines to add to the mix then.  Depending heavily on HCC and OLTP compression and based on growth projections, we're going to say in yr 1 we only have 20TB after index drops and compression.

This database is going to be a consolidation of 22 smaller databases.  The plan is to have a full backup once a week and cumulative incrementals the other 6 days.  Based on the current activity in those databases, we're going to say our incrementals will be at worst 6TB.

This brings up an Exadata rule-of-thumb...block change tracking (BCT) hugely improves incremental backup speeds.  Basically, as a change to a block is made, a set of blocks is marked "changed" in a file.  The next time you do an incremental, only the block sets in the BCT file are sent to backup.  Storage cells also have a similar feature.  On Exadata, if you aren't using BCT, the storage cells will find the blocks that have changed and return only changed sets of blocks.  Neither method actually marks the individual blocks as changed...the blocks are grouped together and if one block is changed, several blocks are marked changed.  Storage cells mark the changed blocks in smaller groups, so they're more efficient than BCT.  According to "Expert Oracle Exadata" by Kerry Osborne, Randy Johnson and Tanel Põder (I'm not sure which author said this, I have the alpha version of the eBook...which anybody with interest in Exadata should buy), using BCT is still faster if less than 20% of your blocks have changed...after that, it's faster to offload the functionality to the storage cell.  So, based on our 6TB/20TB per week, it would likely be best for us to do BCT early in the week and use the storage cells' incremental offloads later in the week.  We'll have to test it.
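To make the "BCT early in the week, cell offload later" idea concrete, here is a rough Python sketch.  It assumes (and this is only an assumption, not something measured) that the cumulative incremental grows linearly up to the worst-case 6TB by the sixth day, and it treats the ~20% figure from the book as a hard cutoff.  The function names are made up for illustration.

```python
# Sketch only: assumes the cumulative incremental grows linearly to the
# worst-case 6TB by day 6, and that the ~20% rule of thumb is a hard cutoff.
DB_SIZE_TB = 20.0          # year 1 size after compression (from the post)
WORST_CASE_INCR_TB = 6.0   # worst-case cumulative incremental (from the post)
INCR_DAYS = 6              # incremental days between weekly fulls
BCT_THRESHOLD = 0.20       # "less than 20% of your blocks have changed"


def changed_fraction(day):
    """Estimated fraction of the database changed since the weekly full."""
    return (WORST_CASE_INCR_TB * day / INCR_DAYS) / DB_SIZE_TB


def method_for_day(day):
    """BCT below the threshold, storage cell offload at or above it."""
    return "BCT" if changed_fraction(day) < BCT_THRESHOLD else "cell offload"


plan = {day: method_for_day(day) for day in range(1, INCR_DAYS + 1)}
for day, method in plan.items():
    print(f"day {day}: {changed_fraction(day):.0%} changed -> {method}")
```

Under those assumptions the crossover lands around the fourth incremental, but real change rates are rarely linear, so this only shows where to start testing.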

So, worst case scenario, we need to restore a full that's 20TB+6TB of incrementals and then apply ~1.3TB of archivelogs in 4 hours.  To complicate things, we can't do backups to the FRA because our database is too big.  Using the 80/20 configuration, we'll only have ~9TB (20% of 45.5TB in the High Performance machines).  We can always add more tape drives and media servers to make the process faster.  My biggest concern is the redo apply rate, since it's mostly a serialized process...it's not scalable.
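As a back-of-the-envelope check on those numbers, the sketch below works out the sustained restore rate needed for the 26TB, depending on how much of the 4 hour window is left for the restore once redo apply takes its share.  The split between restore time and apply time is hypothetical (the apply rate is exactly the unknown here), and 1TB=1024GB is assumed.

```python
# Sketch only: the restore/apply time split is hypothetical.
GB_PER_TB = 1024  # assuming binary units


def required_rate_gb_s(size_tb, hours):
    """Sustained throughput (GB/s) needed to move size_tb in the given hours."""
    return size_tb * GB_PER_TB / (hours * 3600)


full_tb, incr_tb = 20.0, 6.0
restore_tb = full_tb + incr_tb   # 26TB to restore in the worst case
fra_tb = 0.20 * 45.5             # FRA share of the 80/20 configuration

print(f"FRA size: {fra_tb:.1f}TB vs {restore_tb:.0f}TB of backups")
for restore_hours in (1, 2, 3, 4):
    rate = required_rate_gb_s(restore_tb, restore_hours)
    print(f"restore in {restore_hours}h -> {rate:.2f}GB/s, "
          f"leaving {4 - restore_hours}h for redo apply")
```

Even if the whole 4 hours went to the restore (leaving nothing for redo apply), it would take roughly 1.85GB/s sustained, which is why the apply rate matters so much.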

We're buying a ZFS storage server (Sun 7420) with connections via Infiniband for "to-disk" backups and a SL3000 tape library with 2 media servers and an admin server.  We'll be utilizing a new 11g HA OEM setup on its own hardware to monitor the backup solution (and all the other Exadata goodies.)  Although it's nice to have 96TB waiting around to be used, it's going to go fast as a backup destination.  Using this as a backup destination limits our options and takes away the best one...merged incremental backups.

Merged incremental backups allow you to take a backup once (and only once...ever), and then always apply incrementals to that backup to keep it up to date.  It's the fastest way to do backups, and it's Oracle's recommended backup methodology.  More importantly, when the time comes to do a restore, for example a restore of a datafile, instead of restoring the file from tape or disk, you go into RMAN and say "switch datafile 5 to copy;"  Whatever archivelogs are needed to recover that datafile are then applied, and without even physically moving the backed up file, your database is back to 100%.  The backed up file is now the active file that your database is using.  While it's up, you can go through a procedure to fix the file in the original destination...but your uptime is maximized.  Obviously, when using Exadata, the features you depend on that normally come from the storage cells only work on things that are stored in the storage cells, so although "switching to copy" works, it isn't a practical option for Exadata when you aren't using the FRA.

Originally, the recovery window was 4 hrs, so Oracle presented us a solution with 2 heads for a ZFS server with many spindles, 4 media servers and 28 LTO5's (sadly, the T10kc's won't be out in time for our purchase.)  After the business recovered from the sticker shock, the recovery window was increased to a more reasonable 8 hrs.  Oracle then presented us a scaled down (cheaper) solution with 8 LTO5's and a single-headed 7420 with 48 spindles.

The ZFS server will have 2 trays of 24 2TB disks...which I'm told will be the bottleneck.  The single head on the 7420 can do 1.12GB/s, so unless the disks are *really* slow (around 50MB/s), or the ZFS overhead is extremely high...I think the head will be the bottleneck...but we'll see in testing.
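Here's a quick sketch of the head-vs-disk reasoning: how slow would each disk have to be before the spindles, rather than the head, limit throughput?  The pool layout isn't stated above, so both cases below are assumptions: all 48 disks streaming in parallel, or a mirrored layout where only 24 contribute effective bandwidth.  ZFS overhead is ignored.

```python
# Sketch only: the pool layouts are assumptions, and ZFS overhead is ignored.
HEAD_GB_S = 1.12   # single-head throughput quoted for the 7420
MB_PER_GB = 1024   # assuming binary units


def break_even_mb_s(effective_disks):
    """Per-disk MB/s below which the disks, not the head, are the bottleneck."""
    return HEAD_GB_S * MB_PER_GB / effective_disks


striped = break_even_mb_s(48)   # all 48 spindles contributing
mirrored = break_even_mb_s(24)  # hypothetical mirrored layout

print(f"48 effective disks: break-even at {striped:.1f}MB/s per disk")
print(f"24 effective disks: break-even at {mirrored:.1f}MB/s per disk")
```

The mirrored case comes out just under 48MB/s per disk, which lines up with the "around 50MB/s" figure.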

In this design there were a lot of theoretical maximums used.  We've insisted the design be tested and benchmarked before we purchase it, so Oracle has been good enough to set up a similar set of hardware in their lab in Colorado to prove out the numbers we'll see for our recoveries.  If Exadata backups interest you...I highly recommend you read the new white paper that was published in the last few weeks on the topic at

Part 1:  

Part 2:  


  1. What's wrong with using an incremental merge strategy on ZFS and then just doing the restore back to the storage cells when you need it?

    The "switch" trick is only temporary anyway. I don't imagine people are going to run with their data files living in the FRA diskgroup for very long, so why not get the benefits of incremental merge and plan on the RTO including the restore from ZFS back to the cells.

  2. This is a great question.  Like I said, it was my preference to do it that way. The problem comes down to performance needs for the RTO and the fact that we're required to have 7 days on disk.

    In order to have 7 days on disk if we were doing merged incrementals we'd have to have 20TB, plus ~1-2TB/day of incrementals (non-cumulative), so say 30TB. We have 4 environments, so that would be 120TB...and we only have a little less than 96TB of usable ZFS storage. But, we could probably get past that issue by using ZFS compression.

    In my latest post, I showed some of the performance results from testing. If we used ZFS compression instead of rman compression, ignoring overhead from ZFS decompression, we'd be limited by the single head throughput of the 7420 of ~1.21GB/s. So, since we have the "7 days on ZFS" requirement, we'd have to restore the full and then merge (worst case scenario) 6 incrementals to it.

    With a merged incremental full (20TB) and then 6 non-cumulative incrementals (~10TB):
    30TB=30,720GB, at 1.21GB/s=7.05 hrs, plus 4 hrs for archivelog apply...we wouldn't be able to hit the RTO.

    With compressed 20TB Lev 0 and 6TB Lev 1 cumulative:
    26TB=26,624GB, at 3.11GB/s=2.38 hrs (due to decompression happening after it passes the 1.21GB/s 7420 head bottleneck. There was only ~754MB/s transferring at that point from ZFS over IB to the compute nodes. When it got to the compute nodes it decompressed at 3187MB/s. RMAN compression made it actually faster.)

    If we didn't have the "7 days on ZFS" requirement, it would take 4.7 hrs to move the 20TB uncompressed image backup to Exadata, which wouldn't give us enough time to apply the archivelogs to meet the 8 hr RTO.

    I really wanted merged incrementals to work...I just couldn't get the numbers to match our requirements. I played with the idea of doing multiple incrementals per day to reduce the number of archivelogs we'd need to apply, but then we'd be copying more from ZFS. We'd be applying fewer archivelogs, but I couldn't get the savings in redo apply time to make up for the cost in zfs transfer time.

    If we had 2 heads on the ZFS, I bet this would have worked in about 3.5 hrs which would have let us hit the RTO. But that's not going to happen for about 3 years, and by then the db will have grown.

    Sorry for the delayed response to your question, I've been very busy with testing the tape performance numbers.
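
The arithmetic in the reply above can be reproduced with a short Python sketch.  The sizes, rates, the 4 hr archivelog apply time and the 8 hr RTO are the figures quoted in the reply; the function and variable names are made up for illustration, and 1TB=1024GB as in the reply.

```python
# Sketch only: reproduces the figures quoted in the reply above.
GB_PER_TB = 1024
REDO_APPLY_HRS = 4.0   # archivelog apply time quoted in the reply
RTO_HRS = 8.0


def transfer_hours(size_tb, rate_gb_s):
    """Hours to move size_tb at a sustained rate in GB/s."""
    return size_tb * GB_PER_TB / rate_gb_s / 3600


scenarios = {
    "merged full + 6 non-cumulative incrementals": transfer_hours(30, 1.21),
    "RMAN-compressed Lev 0 + Lev 1 cumulative": transfer_hours(26, 3.11),
    "uncompressed image copy only": transfer_hours(20, 1.21),
}

for name, hours in scenarios.items():
    total = hours + REDO_APPLY_HRS
    verdict = "meets" if total <= RTO_HRS else "misses"
    print(f"{name}: {hours:.2f}h transfer, {total:.2f}h total, {verdict} RTO")

# Disk capacity check: 4 environments at ~30TB each vs ~96TB usable on ZFS.
needed_tb = 4 * 30
print(f"ZFS space needed: {needed_tb}TB of 96TB usable")
```

Only the RMAN-compressed option fits inside the 8 hr window once the redo apply time is added.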
