Tuesday, May 24, 2011

Exadata Backup and Recovery-2

In a previous post, I gave a lot of details about our SUN backup hardware for our Oracle Exadata environment.  Here's the breakdown:

Database            : 8 Exadata compute nodes
DB Storage          : 14 high performance storage cells
Backup Disk Storage : single-head SUN 7420 ZFS server
Media Servers       : 2 (X4270M2)
Admin Server        : 1 (X4170M2)
Tape Storage        : SL3000 Library (Base + Expansion)
Tape Slots          : 838 (max 5965)
Tape Drives         : 8 LTO-5 (via 8Gb FC)
Tape Drive Bays     : 24 (max 56 LTO-5's)
Backup Software     : Oracle Secure Backup


The plan is to back up the database to the 7420 ZFS server using RMAN, then copy that backup to tape for offsite storage.  The Recovery Time Objective (RTO) is 8 hrs.  We'll have archivelogs stored in the FRA, and level 0's and cumulative incremental level 1's on ZFS.  If there's a failure within 1 week, we can go from either ZFS or tape (I'm playing with the idea of both, but I'll save that for a separate blog.)  Here's the test plan:


There have been at least 3 white papers I'm aware of that discuss redo apply rates on Exadata, but really, this comes down to your activity.  We have very nearly pure OLTP activity, which is relatively slow for redo apply, so we need to test with our workload.  


Here are the test steps:

1. Create a restore point using flashback database
2. Take a level 0 backup of an existing ~8TB database to ZFS
3. Generate 1300GB of archivelogs
4. Flashback to the first restore point
5. Recover database


We were told, and the whitepapers from Oracle confirm, that the apply of 1300GB of archivelogs should take about an hr on our full Exadata machine...but with our OLTP workload, the recovery took 4 hours (after tweaking it many ways)!  That's half our RTO!  Future tests will try to break up the 1300GB with level 2 cumulative backups, so hopefully we'll be able to cut the time down.
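To put numbers on that gap, here's a quick Python sketch that turns the two durations into average apply rates.  It only uses the figures quoted above (1300GB, 1 hr expected, 4 hrs measured, 8 hr RTO) and assumes 1 GB = 1024 MB; the function and variable names are just for illustration.

```python
# Redo apply figures quoted in the post.
ARCHIVELOG_GB = 1300
EXPECTED_HOURS = 1.0   # what we were told / what the white papers suggest
MEASURED_HOURS = 4.0   # what our OLTP workload actually took
RTO_HOURS = 8.0


def apply_rate_mb_per_sec(size_gb, hours):
    """Average redo apply rate in MB/s, assuming 1 GB = 1024 MB."""
    return size_gb * 1024 / (hours * 3600)


expected_rate = apply_rate_mb_per_sec(ARCHIVELOG_GB, EXPECTED_HOURS)
measured_rate = apply_rate_mb_per_sec(ARCHIVELOG_GB, MEASURED_HOURS)
rto_fraction = MEASURED_HOURS / RTO_HOURS

print(f"expected apply rate: {expected_rate:.0f} MB/s")
print(f"measured apply rate: {measured_rate:.0f} MB/s")
print(f"share of RTO used by redo apply: {rto_fraction:.0%}")
```

These are averages over the whole recovery, so they say nothing about how the rate varied while the logs were being applied.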

Having the recovery take 4 hrs, not 1, means the restore needs to complete in 4 hrs, not 7.  To do these tests quickly, I'm breaking up the restore time, the incremental time and the recovery time into separate tests and tweaking each individually before combining them for the final result.  After doing many tests, I discovered something interesting.

I already knew the ZFS head is our bottleneck at 1.12GB/sec. I was able to achieve that speed after turning on jumbo frames and going active/active on the InfiniBand network.  I'm getting off track here, but this may interest some people:

1. Active/Passive, small MTU on 40Gb InfiniBand : 550MB/s
2. Active/Passive, Jumbo Frame MTU             : 660MB/s
3. Active/Active, Jumbo Frame MTU              : 1200+MB/s

(Note: Don't take this to mean IB is pegged.  One of the Oracle BUR whitepapers said they were able to achieve 2000MB/s via IB, so my bottleneck at this point is the single-head of our 7420.  When we need more speed in 3 years, we'll go to a dual head.)

People generally equate compressed backups to "slow" backups.  It's true that, channel for channel, compressed backups add CPU overhead that slows down the backups.  On Exadata, running a backup across all nodes, we have 96 cores.  We also have the ability to throttle the backups with ORM and Exadata's IORM, so the backup will have a lower priority for CPU and IO than other activity.  This is better than using the RMAN RATE parameter IMHO, because if the resources are available, the backup can use them.  We ran tests using RMAN "medium" (zlib) compression, opening a varying number of channels (32, 64, 128).  We found opening 64 channels (8 per node) was the sweet spot in backup performance.
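For anyone wondering what 64 channels across 8 nodes looks like in practice, here's a rough Python sketch that generates an RMAN run block with explicit channel allocation.  Treat all of it as an assumption rather than the script used in these tests: the service names, the ZFS mount paths and the password placeholder are hypothetical, and the same spread could just as well be set up with persistent CONFIGURE settings.  The ORM/IORM throttling isn't shown.

```python
def build_backup_script(services, channels_per_node, dest_dirs):
    """Return RMAN text that spreads disk channels evenly across RAC nodes."""
    lines = ["CONFIGURE COMPRESSION ALGORITHM 'MEDIUM';", "RUN {"]
    channel = 0
    for service in services:
        for _ in range(channels_per_node):
            # Rotate through the backup shares so no single mount takes every write.
            dest = dest_dirs[channel % len(dest_dirs)]
            channel += 1
            lines.append(
                f"  ALLOCATE CHANNEL ch{channel:03d} DEVICE TYPE DISK "
                f"CONNECT 'sys/<password>@{service}' FORMAT '{dest}/%U';"
            )
    lines.append("  BACKUP AS COMPRESSED BACKUPSET INCREMENTAL LEVEL 0 DATABASE;")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical names: one service per compute node, four shares on the 7420.
services = [f"exadb{n}" for n in range(1, 9)]
dest_dirs = [f"/zfs/backup{n}" for n in range(1, 5)]

script = build_backup_script(services, channels_per_node=8, dest_dirs=dest_dirs)
print(script)
```

Changing `channels_per_node` to 4 or 16 gives the 32 and 128 channel variants from the tests.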

When I did this with zlib (in 11.2, it's called medium) compression, I was able to get much faster backup results than I could without compression.  With no compression I was limited by the 1.12GB/sec 7420 ZFS head speed.  I was getting a ~4.5X compression ratio, so in theory, I could get 4.5 x 1.12GB/sec (5.04GB/sec).  In reality it was less than that, but it's still much better than 1.12GB/sec. (results below)
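The theoretical ceiling is just the head speed multiplied by the compression ratio.  A tiny Python sketch of that arithmetic, using only the two numbers above (the function name is made up for illustration):

```python
ZFS_HEAD_GB_PER_SEC = 1.12   # measured limit of the single 7420 head
COMPRESSION_RATIO = 4.5      # approximate RMAN "medium" ratio on our data


def effective_ceiling(head_rate, ratio):
    """Best-case logical throughput when only compressed bytes cross the head."""
    return head_rate * ratio


ceiling = effective_ceiling(ZFS_HEAD_GB_PER_SEC, COMPRESSION_RATIO)
print(f"theoretical ceiling: {ceiling:.2f} GB/sec")
```

This assumes the ZFS head is the only limit.  Once compression is on, CPU and channel count come into play too, which is why the real numbers land below the ceiling.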

My test used a representative 8TB db, but in a few months many more databases will be consolidated into this DB, bringing it to 20TB (compressed; in 5 years this will be upwards of a PB uncompressed), and I expect a 6TB cumulative incremental size.  As I mentioned, given the 4hr redo log apply time, the restore and incremental apply need to happen in 4 hours or less.  In the "Time(HR)@Rate" column, I extrapolated the 26TB restore time (20TB database plus 6TB incremental) from the rates measured on the 8TB db.  I think it's also interesting how much CPU was consumed on the compute nodes during these operations.  Here's a subset of the test results:


Test                   Iter  Duration(min)  Ch   Comp  NW Config  Size(GB)  Comp Ratio  Time(HR)@Rate  Comp MB/s  ZFS MB/s  CPU
Exadata2ZFS            1     42.5           64   yes   AA         8TB       4.23        2.38           3186.71    753.36    14.75%
ZFS2Exadata (RESTORE)  1     69             64   yes   AA         7871      4.43        3.89           1947       439.5     3.00%
ZFS2Exadata (RESTORE)  2     65             128  yes   AA         7871      4.43        3.66           2066.64    466.51    3.40%
ZFS2Exadata (RESTORE)  3     90             32   yes   AA         7871      4.43        5.07           1492.57    336.92    3.00%
Exadata2ZFS            2     10             16   no    AA         727       1           6.1            1240.75    1240.75   -

These results give you an idea of what I mean.  With a 1:1 compression ratio (bottom line), the uncompressed backup was limited by the ZFS head speed at 1240MB/s.  Considering compression, I was able to achieve 3187MB/s for a backup and 2067MB/sec for a restore (with a single head 7420!).  There's no way an uncompressed backup can compete with that speed.  In addition, the compressed backup on ZFS means I'll have even more space to store backups and incrementals there.  
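If you want to reproduce the derived columns, here's a short Python sketch.  The formulas are my reading of how the columns relate, not something spelled out with the results: it assumes the duration is in minutes, binary units (1 GB = 1024 MB, 1 TB = 1024 GB), and that the ZFS-side rate is the logical rate divided by the compression ratio.  The class and field names are made up for illustration.

```python
from dataclasses import dataclass

MB_PER_GB = 1024
GB_PER_TB = 1024
TARGET_TB = 26  # 20TB database + 6TB cumulative incremental


@dataclass
class Run:
    name: str
    size_gb: float      # logical (uncompressed) size moved
    minutes: float      # elapsed time of the run
    comp_ratio: float   # 1 means no compression

    @property
    def logical_mb_s(self):
        """Throughput as the database sees it (the compressed-run rate column)."""
        return self.size_gb * MB_PER_GB / (self.minutes * 60)

    @property
    def zfs_mb_s(self):
        """Bytes that actually cross the ZFS head each second."""
        return self.logical_mb_s / self.comp_ratio

    def hours_for(self, size_tb):
        """Extrapolated time to move size_tb at this run's logical rate."""
        return size_tb * GB_PER_TB * MB_PER_GB / self.logical_mb_s / 3600


runs = [
    Run("restore, 64 ch", 7871, 69, 4.43),
    Run("restore, 128 ch", 7871, 65, 4.43),
    Run("restore, 32 ch", 7871, 90, 4.43),
    Run("backup, 16 ch, no compression", 727, 10, 1),
]

for run in runs:
    print(f"{run.name:32s} {run.logical_mb_s:8.2f} MB/s logical  "
          f"{run.zfs_mb_s:8.2f} MB/s on ZFS  "
          f"{run.hours_for(TARGET_TB):5.2f} hr for {TARGET_TB}TB")
```

Under those assumptions the output matches the table to within rounding, which is a decent sanity check on the extrapolated 26TB times.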

It's almost dishonest to bring that up though, because if we weren't using compression in RMAN, we'd be using ZFS compression, achieving approx the same level of compression, so really, that advantage is a wash.  I've seen tests where people have used ZFS compression for their backups...the problem is, they're still hitting the bottleneck...so for compression's sake, that's great.  For performance, that's very limiting compared to these speeds.


To recap so far, the redo apply rate allowed 1300GB to be applied in 4 hours, and the extrapolated restore of 20TB of data and a 6TB cumulative incremental will happen in 3.66 hrs, leaving .34 hrs (20.4 minutes) of the 8 hr RTO for the cumulative incremental apply time.  When I tested that, it finished in exactly 20 minutes.  This means the full restore and recovery will happen with less than half a minute to spare.

That's cutting it close...with level 2 incrementals, I'll be able to cut the 4 hours into a much smaller time period.  If we only do 1 lev 2 incremental per day, that would bring us down to ~650GB of archivelogs we'd need to apply...and applying the level 2 is much faster than applying redo.
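Here's a rough Python projection of what halving the archivelogs buys us.  It assumes redo apply time scales linearly with archivelog volume at the rate I measured (1300GB in 4 hours), which is only an assumption, and it leaves out the time to apply the level 2 itself, which I haven't measured yet.

```python
MEASURED_GB = 1300
MEASURED_HOURS = 4


def projected_apply_hours(redo_gb):
    """Scale the measured redo apply time linearly with archivelog volume."""
    return redo_gb * MEASURED_HOURS / MEASURED_GB


for redo_gb in (1300, 650, 325):
    hours = projected_apply_hours(redo_gb)
    print(f"{redo_gb:5d} GB of archivelogs -> about {hours:.1f} hr of redo apply")
```

With one level 2 per day, the ~650GB case works out to roughly 2 hours of redo apply instead of 4.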

I broke up the restore, incremental apply and recovery pieces of the RTO so I could tweak each as I went along.  When I did the entire thing it completed in slightly less time than the individual pieces did due to RMAN "spin-up" time.  The RTO was met in the 8 hr time frame.  

The hardware sales engineers from Sun *really* know their stuff.  We asked them to use "actual" rather than theoretical speeds to design the HW for our system, and they hit the nail on the head.


...but what if we have to restore and recover from tape? :)  I'll blog about my tape performance tests soon.

2 comments:

  1. Would you mind sharing the document which explains how to configure InfiniBand ports on the Sun 7420?

  2. One of the best places to get this information is from the 7420's help link in its GUI. Click through help->configuration->Network.

    Remember to use jumbo frames, and...if you're coming from multiple IB sources, build datalinks on devices and interfaces on the datalinks. Make sure the 7420's connection is configured active/active...it'll make a huge difference in performance.

    In my case, my backup was running from 8 compute nodes on Exadata, each with an Active/Passive IB link, so potentially 160Gb real throughput (lots of overhead from TCP/IP over 40Gb IB).

    Oracle was good enough to fly engineers in from around the country, including the performance engineer from Colorado who wrote a lot of the white papers we read for my installation, so although I was able to spend some time with the masters and I learned a lot, my installation and configuration experience was limited. This was a blessing and a curse...but I know I couldn't have, and I doubt any single one of them could have, come up with such a great final result alone...and in the end, that's all that matters.
