In theory, there's no difference between theory and practice, but in practice, there is. When theoretical maximums are used, life happens and reality sets in. Seldom do companies even have an agreed-upon RTO, and more seldom still is it proven out before a disaster occurs. An RTO that isn't proven is called a guess...and even an educated guess based on published theoretical maximums is only a guess at the best-case scenario. My current client has quarterly requirements to meet the RTO, and if the recovery plan can't hack it, adjustments (even if that means additional hardware purchases) must be made. Designing a new backup infrastructure and recovery plan for this Exadata environment needs to be not only precise, but proven.
In my previous posts on virtualized storage (specifically the Hitachi USP-V), I talked about doing Oracle database backups via snaps on the virtualized storage. They make life much easier for the DBA, with interfaces directly into RMAN. Unfortunately, until Oracle buys or partners with a storage vendor (I'm hoping for NetApp) to bring that technology to Exadata, we're limited to traditional backup methods.
For the last several months, one of my focuses has been on meeting the recovery window set out by the business. At first the requirement was...worst case scenario, we have to do a full recovery, and everything needs to be running within 4 hours of a disaster. This is an OLTP VLDB that generates about 1300GB of archivelogs per day. Our RTO is a RECOVERY time objective, not a restore time objective...so the archivelog apply rate on Exadata is relevant to my issue.
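As a rough sanity check on that generation rate, a quick query against v$archived_log (just a sketch, summing a single archive destination) shows GB of archivelogs per day:

select trunc(completion_time) day,
       round(sum(blocks*block_size)/1024/1024/1024) gb_generated
from   v$archived_log
where  dest_id = 1
group  by trunc(completion_time)
order  by 1;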
Archivelog apply rates have primarily two big variables...the hardware speed and the type of DML. A database with a DSS workload will apply much faster than an OLTP database on the same hardware. Other factors affecting our apply rate are flashback database and GoldenGate, which requires supplemental logging. The 1300GB/day archivelog generation rate is less than what it's going to become...how much more is anyone's guess. Our on-site Oracle consultants are guessing 2X, but I think that's an overestimate to play it safe.
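If you're wondering which of those factors apply to your own database, the flags are easy to check...a quick look at v$database shows whether flashback and (minimal) supplemental logging are turned on:

select flashback_on, supplemental_log_data_min
from   v$database;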
To understand the size of the data that needs to be restored, we have more variables. The uncompressed size of this database is expected to be many petabytes (projections are in flux...but more than one, less than five). We've been sold a single full-rack Exadata machine (in an 80% data, 20% FRA configuration), with a contingency for more storage cells as needed for the next 3 years. After 3 years the growth rate accelerates, and we'll have more Exadata machines to add to the mix then. Depending heavily on HCC and OLTP compression, and based on growth projections, we're going to say that in year 1 we only have 20TB after index drops and compression.
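For anyone who hasn't seen the syntax we're counting on, here's a rough illustration (the table names are made up)...HCC for the cold, mostly-read data and OLTP compression for the hot tables:

-- Hybrid Columnar Compression (Exadata storage) for historical data
create table sales_history compress for query high
as select * from sales where sale_date < date '2010-01-01';

-- OLTP compression (Advanced Compression option) for actively updated tables
alter table orders move compress for oltp;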
This database is going to be a consolidation of 22 smaller databases. The plan is to have a full backup once a week and cumulative incrementals the other 6 days. Based on the current activity in those databases, we're going to say our incrementals will be at worst 6TB.
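In RMAN terms the weekly cycle is simple...a level 0 on one day and cumulative level 1s the rest of the week. A sketch (channel configuration and scheduling left out):

# day 1: weekly full (level 0)
BACKUP INCREMENTAL LEVEL 0 DATABASE PLUS ARCHIVELOG;

# days 2-7: cumulative incrementals (everything changed since the level 0)
BACKUP INCREMENTAL LEVEL 1 CUMULATIVE DATABASE PLUS ARCHIVELOG;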
This brings up an Exadata rule of thumb...block change tracking (BCT) hugely improves incremental backup speeds. Basically, as a change to a block is made, a set of blocks is marked "changed" in a file. The next time you do an incremental, only the block sets flagged in the BCT file are sent to backup. Storage cells have similar functionality...in Exadata, if you aren't using BCT, the storage cells will find the blocks that have changed and return only changed sets of blocks. Neither method actually marks individual blocks as changed...the blocks are grouped together, and if one block is changed, several blocks are marked changed. Storage cells mark the changed blocks in smaller sets...so they're more efficient than BCT. According to "Expert Oracle Exadata" (link to the right) by Kerry Osborne, Randy Johnson and Tanel Põder (I'm not sure which author said this, I have the alpha version of the eBook...which anybody with an interest in Exadata should buy), using BCT is still faster if less than 20% of your blocks have changed...after that, it's faster to offload the functionality to the storage cells. So, based on our 6TB/20TB per week changes...it would likely be best for us to use BCT early in the week and the storage cells' incremental offload later in the week. We'll have to test it.
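For reference, turning BCT on and verifying it is trivial (this assumes db_create_file_dest is set, which it is with ASM on Exadata, so the tracking file is created automatically):

ALTER DATABASE ENABLE BLOCK CHANGE TRACKING;

SELECT status, filename, bytes FROM v$block_change_tracking;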
So, worst case scenario, we need to restore a full that's 20TB plus 6TB of incrementals, and then apply ~1.3TB of archivelogs, in 4 hours. To complicate things, we can't do backups to the FRA because our database is too big. Using the 80/20 configuration, we'll only have ~9TB (20% of 45.5TB on the High Performance machine). We can always add more tape drives and media servers to make the process faster; my biggest concern is the redo apply rate, since it's mostly a serialized process...it's not scalable.
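The tape side scales by adding channels along with the drives...roughly like the sketch below, with the media manager PARMS omitted since they depend on your MML setup:

RUN {
  ALLOCATE CHANNEL t1 DEVICE TYPE SBT_TAPE;
  ALLOCATE CHANNEL t2 DEVICE TYPE SBT_TAPE;
  ALLOCATE CHANNEL t3 DEVICE TYPE SBT_TAPE;
  ALLOCATE CHANNEL t4 DEVICE TYPE SBT_TAPE;
  BACKUP INCREMENTAL LEVEL 0 DATABASE;
}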
We're buying a ZFS storage server (Sun 7420) with connections via InfiniBand for "to-disk" backups, and an SL3000 tape library with 2 media servers and an admin server. We'll be utilizing a new 11g HA OEM setup on its own hardware to monitor the backup solution (and all the other Exadata goodies). Although it's nice to have 96TB waiting around to be used, it's going to go fast as a backup destination. Using this as a backup destination limits our options...it takes away the best option: merged incremental backups.
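To RMAN, the ZFS share over InfiniBand just looks like an NFS mount, so the to-disk backups are plain disk channels with a FORMAT pointing at the mount point (the /zfs_backup path below is hypothetical):

RUN {
  ALLOCATE CHANNEL d1 DEVICE TYPE DISK FORMAT '/zfs_backup/%d_%U';
  ALLOCATE CHANNEL d2 DEVICE TYPE DISK FORMAT '/zfs_backup/%d_%U';
  BACKUP INCREMENTAL LEVEL 1 CUMULATIVE DATABASE;
}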
Merged incremental backups allow you to take a full backup once (and only once...ever), and then always apply incrementals to that backup to keep it up to date. It's the fastest way to do backups, and it's Oracle's recommended backup methodology. More importantly, when the time comes to do a restore...for example, a restore of a datafile...instead of restoring the file from tape or disk, you go into RMAN and say "switch datafile 5 to copy;". Whatever archivelogs are needed to recover that datafile are applied, and without even physically moving the backed-up file, your database is back to 100%. The backed-up file is now the active file that your database is using. While the database is up, you can then go through a procedure to fix the file in the original destination...but your uptime is maximized. Obviously, when using Exadata, the features you depend on that normally come from the storage cells only work on data that's stored in the storage cells...so although "switching to copy" works, it isn't a practical option for Exadata when you aren't using the FRA.
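For reference, the merged incremental (incrementally updated) backup is the standard two-step RMAN pattern...run daily, the image copy only ever has to be taken once, and the "switch to copy" recovery is a couple of commands:

RUN {
  # roll the previous incremental into the image copy
  RECOVER COPY OF DATABASE WITH TAG 'incr_merge';
  # take today's incremental, flagged for merging into that copy
  BACKUP INCREMENTAL LEVEL 1 FOR RECOVER OF COPY WITH TAG 'incr_merge' DATABASE;
}

# if datafile 5 is damaged (datafile offline or database mounted):
SWITCH DATAFILE 5 TO COPY;
RECOVER DATAFILE 5;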
Originally, the recovery window was 4 hours, so Oracle presented us a solution with 2 heads for a ZFS server with many spindles, 4 media servers and 28 LTO5 drives (sadly, the T10KC's won't be out in time for our purchase). After the business recovered from the sticker shock, the recovery window was increased to a more reasonable 8 hours. Oracle then presented us a scaled-down (cheaper) solution with 8 LTO5's and a single-headed 7420 with 48 spindles.
The ZFS server will have 2 trays of 24 x 2TB disks...which I'm told will be the bottleneck. The single head on the 7420 can do 1.12GB/sec...so unless the disks are *really* slow (around 50MB/s), or the ZFS overhead is extremely high, I think the head will be the bottleneck...but we'll see in testing.
In this design there were a lot of theoretical maximums used. We've insisted the design be tested and benchmarked before we purchase it, so Oracle has been good enough to set up a similar set of hardware in their lab in Colorado to prove out the numbers we'll see for our recoveries. If Exadata backups interest you...I highly recommend you read the new white paper that was published in the last few weeks on the topic at
http://www.oracle.com/technetwork/database/features/availability/maa-tech-wp-sundbm-backup-11202-183503.pdf.
Part 1:
Part 2: