Monday, July 1, 2013

Oracle RAC on vBlock

My recent project migrating many large, very active databases from single-instance AIX to RAC running Red Hat 6.2 had a lot of challenges that changed the design as time went on.  Originally the plan was to deploy (according to VMware's best practices) using VMDKs on datastores, but the overall storage requirement exceeded 60TB, so that was no longer an option and we were forced (to my delight) to use raw devices instead.  All of these databases were logically migrated to multiple VCE vBlocks (http://www.vce.com/products/vblock/overview).

Per SAP's ASM best practices (Variant 1), we placed the storage in 3 diskgroups: DATA, RECO and ARCH.


Oracle ASM Disk Group | Stores

+DATA  - All data files
       - All temp files
       - Control file (first copy)
       - Online redo logs (first copy)

+ARCH  - Control file (second copy)
       - Archived redo logs

+RECO  - Control file (third copy)
       - Online redo logs (second copy)

Per Oracle's best practices, all the storage in a diskgroup should be the same size and performance...and the SAP layout implies different IO requirements for these pools.  So we went with a combination of SSDs and fast 15k SAS spindles in the +DATA diskgroup (with EMC FAST enabled), many smaller 15k SAS spindles in +RECO, and slower 7200rpm 2TB NL-SAS spindles in +ARCH...after all, it's OK if the background processes take longer to archive your logs.  Redo will remain active a little longer, but as long as it's cleared well before we wrap around all the redo groups, it's sufficient, doesn't affect performance, and it's much less expensive per GB.  We also created VMware datastores for the OS out of the ARCH pool, since it, too, has low IOPS requirements.
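As a rough sketch, the layout above maps onto a handful of init parameters.  The parameter names are standard Oracle ones, but the values below are illustrative placeholders, not our production settings:

```shell
# Illustrative only: how the DATA/ARCH/RECO split maps to init parameters.
# Values are placeholders, not production settings.
cat <<'EOF' > layout_sketch.ora
db_create_file_dest='+DATA'            # data/temp files, first control file, first redo member
db_recovery_file_dest='+RECO'          # third control file copy, second redo member
db_recovery_file_dest_size=2T
log_archive_dest_1='LOCATION=+ARCH'    # archived redo logs
control_files='+DATA','+ARCH','+RECO'  # one copy per diskgroup
EOF
```

With OMF destinations set like this, new files land in the right diskgroup automatically, which is what makes the tiering decisions above stick without per-file placement work.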

There are some issues with this design, but overall it's performing extremely well.  The SAP database is serving about 4 million db calls per minute, generating 1TB of archivelogs per day.  For a mixed-load or DSS database, that archivelog generation wouldn't be a big deal...but for a pure OLTP db that's pretty respectable.  The DB cache, though still undersized at 512GB, is more than the old system had, which has really helped take the load off the storage and reduced our IOPS requirements.  The "DB Time" tracked by SAP is showing over a 2X performance boost.

For the larger non-SAP databases, the performance increase has been much more dramatic.  SAP ties your hands a bit: to keep things consistent between all their customers, their implementation is very specific...you have to be SAP Migration Certified to move a database to a new platform.  Michael Wang (from Oracle's SAP Migration group), who also teaches some Exadata administration classes, is an excellent resource for SAP migrations, and he's great to work with.  Many features that have been common in Oracle for years aren't supported.  For the non-SAP databases, we're free to take advantage of all the performance features Oracle has...and there are many.  We compressed tables with advanced compression, compressed indexes, tweaked stats and caches, moved to merged incremental backups on a different set of spindles than our data, created profiles suggested during RAT testing...basically everything we could think of.  For some databases we implemented the result cache; for others we found (in RAT testing) that it wasn't beneficial overall...it depends on your workload.

Some of our biggest performance gains (in some cases 1000X+) didn't come from the new hardware, new software or the new design...they came from the migration itself.  For years, database upgrades were done in place, and since performance was tracked relative to "what it usually is" rather than what it should be...lots of problems, such as chained rows, were hidden.  After we did a logical migration, these problems were fixed and performance reached its potential.  I got lots of emails that went something like, "Wow, this is fast!!"

It's extremely good, but not perfect.  There's still an issue left over from going with multiple VNXs instead of a single VMAX.  I'll talk about that one later.

Friday, June 28, 2013

Adding a disk to ASM and using UDEV?

The project I've been writing about, migrating many single-instance IBM P5 595 AIX databases to 11.2.0.3 RAC on EMC vBlocks, is coming to a close.  I thought there might be value in sharing some of the lessons learned from the experience.  There have been quite a few....

As I was sitting with the DBA team, discussing how well everything had gone and how stable the new environment was, alerts started going off that services, a VIP and a scan listener on one of the production RAC nodes had failed over.  Hmm...that's strange.  About 45 seconds later, more alerts came in that the same thing had happened on the next node...and that happened over and over.  We pored through the clusterware/listener/alert logs and found nothing helpful...only that there was a network issue and clusterware took measures after the fact...nothing to point at the root cause.

Eventually we looked in the OS message log, and found this incident:

May 30 18:05:14 SCOOBY kernel: udev: starting version 147
May 30 18:05:16 SCOOBY ntpd[4312]: Deleting interface #11 eth0:1, 115.18.28.17#123, interface stats: received=0, sent=0, dropped=0, active_time=896510 secs
May 30 18:05:16 SCOOBY ntpd[4312]: Deleting interface #12 eth0:2, 115.18.28.30#123, interface stats: received=0, sent=0, dropped=0, active_time=896510 secs
May 30 18:05:18 SCOOBY ntpd[4312]: Listening on interface #13 eth0:1, 115.18.28.30#123 Enabled
May 30 18:08:21 SCOOBY kernel: ata1: soft resetting link
May 30 18:08:22 SCOOBY kernel: ata1.00: configured for UDMA/33
May 30 18:08:22 SCOOBY kernel: ata1: EH complete
May 30 18:09:55 SCOOBY kernel: sdab: sdab1
May 30 18:10:13 SCOOBY kernel: sdac: sdac1
May 30 18:10:27 SCOOBY kernel: udev: starting version 147

Udev started, ntpd reported the network issue, then udev finished.  Hmm...why did udev start?  It turns out that the Unix team had added a disk (which has always been considered safe during business hours), and as part of Oracle's procedure to create the udev rule, they needed to run start_udev.  The first reaction was to declare adding storage an "after-hours practice only" from now on...and that would usually be OK...but there are times that emergencies come up and adding storage can't wait until after hours, and must be done online...so we needed a better answer.
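For reference, the kind of rule that procedure creates looks something like this.  This is a sketch in the spirit of the MOS notes: the WWID, device name, and ownership are placeholders for whatever your environment actually uses, and the rule normally lives in /etc/udev/rules.d/ (written to /tmp here for safety):

```shell
# Sketch of an ASM raw-device rule on RHEL 6 (all values are placeholders).
# The real file belongs in /etc/udev/rules.d/; /tmp is used here for safety.
cat <<'EOF' > /tmp/99-oracle-asmdevices.rules
KERNEL=="sd?1", SUBSYSTEM=="block", \
  PROGRAM=="/sbin/scsi_id -g -u -d /dev/$parent", \
  RESULT=="3600601604550250012345678900abcde", \
  NAME="asm-data01", OWNER="grid", GROUP="asmadmin", MODE="0660"
EOF
```

It was activating a new rule like this with start_udev that triggered the whole mess.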

The analysis of the issue showed that when the Unix team followed their procedure and ran start_udev, udev deleted the public network interface and re-created it within a few seconds, which caused the listener to crash...and of course, clusterware wasn't OK with this.  All the scan listeners and services fled from that node to the other nodes.  Without noticing an issue, the Unix team proceeded to add the storage to the other nodes, causing failovers over and over.

We opened tickets with Oracle (since we followed their documented process per multiple MOS notes) and Redhat (since they support udev).  The Oracle ticket didn't really go anywhere...the Redhat ticket came back saying this is normal, expected behavior, which I thought was strange...I've done this probably hundreds of times and never noticed a problem, and I found nothing on MOS that mentions one.  Redhat eventually suggested we add HOTPLUG="NO" to the network configuration files.  After that, when we run start_udev we don't have the problem, the message log doesn't show the network interface getting dropped and re-created...and everything is good.  We're able to add storage without an outage again.
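The change itself is tiny.  On each node it amounts to something like the following, where eth0 is a stand-in for the public interface and the real file lives under /etc/sysconfig/network-scripts (a local example file is used here):

```shell
# Stand-in for /etc/sysconfig/network-scripts/ifcfg-eth0 on each node.
# HOTPLUG="NO" tells the network scripts not to tear down / re-create
# the interface on udev hotplug events, so start_udev leaves it alone.
echo 'HOTPLUG="NO"' >> ifcfg-eth0.example
```

Repeat for each public-facing interface config on each RAC node, then verify the message log stays quiet the next time storage is added.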


I updated the MOS SR with Redhat's resolution.  Hopefully this will be mentioned in a future note, or added to RACCHECK, for those of us running Oracle on Redhat 6+, where ASMLib is unavailable.

-- UPDATE -- (Thanks for finding this, Dave Jones)

From Oracle, per notes 414897.1, 1528148.1, 371814.1, etc., we're told to use start_udev to activate a new rule and add storage.  From Redhat (https://access.redhat.com/site/solutions/154183) we're told to never manually run start_udev.

Redhat has a better suggestion: you can trigger the udev event without losing your network configuration, affecting only the specific device you're working with, via:

echo change > /sys/block/sdg/sdg1/uevent

I think this is a better option...so...do this instead of start_udev.  I would expect this to become a bigger issue as more people migrate to RH 6+, where asmlib isn't an option.
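A small wrapper makes this easier to hand to the storage/Unix team.  This is just a sketch: the device and partition names are examples, and on a real node it needs to run as root:

```shell
# Sketch: trigger a "change" uevent for one partition instead of
# restarting udev with start_udev. Device/partition names are examples.
trigger_uevent() {
    dev=$1; part=$2
    uevent="/sys/block/$dev/$part/uevent"
    if [ -w "$uevent" ]; then
        echo change > "$uevent"   # udev re-evaluates rules for just this device
    else
        echo "no writable $uevent (need root, or wrong device name)" >&2
        return 1
    fi
}
# usage on a real node (as root):  trigger_uevent sdg sdg1
```

On RHEL 6 you can get a similar scoped effect with udevadm trigger and its match options, but the per-device uevent write above is the approach from Redhat's solution, and it leaves the network interfaces completely alone.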