Friday, August 31, 2012

High-Performance Oracle RAC on VMware

It's been a while since I gave the blog some attention. Sorry to those who have commented and haven't gotten a reply yet. I'll try to catch up ASAP. I've been extremely busy with a mass migration of Oracle databases from some IBM P5 595s to VMware ESX 5, running on beautiful Cisco M3-series (Ivy Bridge) blades and Red Hat 6. My client had already moved a few hundred smaller databases to this platform...but moving the high-IO, multi-terabyte databases to RAC on VMware is a completely different challenge. This kind of migration to blades isn't easy even without VMware...you have to deal with different hardware, a different OS and different endianness (the etymology of that word is hilarious, by the way). Oracle's statement of direction (see 1089399.1) that ASMLib won't be supported after Red Hat 5 adds one more complication.

After the architecture has been defined and ESX 5 installed (see the vBlock and vFlex references...both are excellent selections, and I've read on The Register that they may be offering FusionIO as part of their packages), the next step is to create the VMs.

This is an unpleasant point-and-click process with any new VM...and with even a 3-node RAC, it's too easy to make a mistake. In general, the process is to create the VMs with the vCPU and RAM you need (calculating that deserves a post of its own). Then create the storage on one of the VMs (call it node 1), then...on each of the other nodes, create disks and point them back to node 1's vmdks. Rinse and repeat for each node and each shared disk. If you're following the SAP best practices for ASM, you'll need to do that for a minimum of 3 ASM diskgroups, each with its own set of vmdks and datastores. When that's complete, go back and update the other settings per VMware's best practices for running Oracle RAC, then make the settings needed per VMware's hardening guide. To make these changes, there's a button (under Properties-Options-Advanced-General, called "Configuration Parameters") where you can add rows with these parameters and their values in a free-form table. If you make a typo, there's no checking...it just won't work. Per SAP best practices, don't forget to reserve 100% of RAM.
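To give a feel for how quickly the clicking adds up, here is a minimal Python sketch of the create-then-attach pattern described above. It only lists the steps, it doesn't talk to vSphere, and the node and disk names are made up for illustration.

```python
# Sketch: enumerate the manual steps for shared disks in a RAC cluster.
# The first node creates each vmdk; every other node attaches to it.
# Node and disk names below are hypothetical.

def shared_disk_plan(nodes, disks):
    """Return a list of (node, action, disk) steps in the order you'd click."""
    if not nodes:
        raise ValueError("at least one node is required")
    first, others = nodes[0], nodes[1:]
    plan = []
    for disk in disks:
        plan.append((first, "create", disk))
        plan.extend((node, "attach", disk) for node in others)
    return plan


if __name__ == "__main__":
    steps = shared_disk_plan(["rac1", "rac2", "rac3"], ["DATA", "REDO", "ARCH"])
    for node, action, disk in steps:
        print(f"{node}: {action} {disk}")
    print(len(steps), "disk operations before any other setting is touched")
```

The step count is nodes times disks, and that's before the configuration parameters and hardening settings, which is why scripting this is so attractive.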

...all that to say, it is a time-consuming pain and it's difficult to do without making a mistake. Early on in the project, I decided I would try to find a script from somebody out there who had done this and use it to create the VMs...but I couldn't find any. In fact, although there were lots of people using PowerCLI and good community support, I don't think I found a single reference to PowerCLI being used to create RAC nodes. So...I came up with a script to do it. There are 2 sections in it I borrowed from the PowerCLI community...the section that updates the configuration parameters and the section that zeroes out the vmdks.

The obvious (but not performant) way to create a RAC database in VMware (as I said above) is to create the vmdks and then point all the other nodes' vmdks to their counterparts on the first node. After that you have to set the SCSI controller to allow "physical" SCSI bus sharing. This works...but it's the generic method of sharing storage across VMware nodes. VMware implements a locking mechanism to protect your data while the same storage is accessed from multiple machines. If you have an application that isn't cluster-aware, this is necessary. If you have an application that IS cluster-aware (like Oracle RAC), this is redundant. So...for better IO performance in RAC, set up each shared disk with the multi-writer parameter. For that to work, the disks must be eager zeroed. Zeroing out the vmdks is a best practice you'll see from people who have high-IO VMs (like databases). In vSphere 5, that's called "thick eager zeroed", and it's required for the multi-writer flag.
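As a rough illustration of what the multi-writer entries look like, the Python sketch below generates the rows you would type into the Configuration Parameters table for the shared disks on one virtual SCSI controller. The `scsiX:Y.sharing` key with the value `multi-writer` is the form VMware documents for this flag; the function name, the controller number and the disk count are assumptions picked for the example.

```python
# Sketch: build the "Configuration Parameters" rows that flag shared disks
# as multi-writer. Controller number and disk count are illustrative.

RESERVED_UNIT = 7  # SCSI unit 7 is taken by the virtual controller itself


def multi_writer_params(controller, disk_count):
    """Return {key: value} rows for the first `disk_count` usable units."""
    if not 0 <= controller <= 3:
        raise ValueError("a VM has at most 4 virtual SCSI controllers (0-3)")
    units = [u for u in range(16) if u != RESERVED_UNIT]
    if not 0 < disk_count <= len(units):
        raise ValueError("a controller holds between 1 and 15 disks")
    return {f"scsi{controller}:{u}.sharing": "multi-writer"
            for u in units[:disk_count]}


if __name__ == "__main__":
    for key, value in multi_writer_params(controller=1, disk_count=3).items():
        print(f'{key} = "{value}"')
```

Generating the rows instead of typing them removes the typo risk, and asking for all 15 units up front flags slots for disks you haven't added yet.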

There are a few key things to keep in mind when working on VMware with RAC:

1.  Eager zero data storage, as stated above.

2. Sometimes, more vCPUs are slower. Simplified, VMware's CPU scheduling works like this: in an effort to simulate a physical environment, all of a VM's virtual cores have to be free on physical cores in order for it to get a time slice. For example, let's say you have a dual-quad blade with hyperthreading turned on (per vSphere 5 best practice), which gives you 16 logical CPUs. You have other VMs that are busy, and they're using 10 of those at this moment. You created a VM that can use 8 vCPUs, and now it needs a single time slice. That single time slice has to wait until 8 logical CPUs are free before it can proceed. Also, even though you just need 1 time slice, it makes 8 physical cores busy. Not only does assigning too many cores slow your VM down, your VM slows down all the other VMs on the host. You can have an ESX host CPU bottleneck even though the sum total of the CPU used inside the VMs is barely anything. This means the DBA needs read access to the host performance statistics, a fact that VMware admins aren't happy about. Blue Medora is answering that call with an OEM 12c plugin for VMware.

3. In VMware, RAM is shared, but only when it's possible to do so. There are many great memory efficiency features in vSphere 5...but if you follow Oracle best practices, they won't help you very much. Huge pages are used internally in vSphere 5, but if you use huge pages in the VM (for the SGA), it will initially work very well...but as soon as your caches warm up, the SGA becomes distinct from all other memory used on your host. At that point, you get no advantages...so I've found it's better to reserve at least the size of the SGA you'll be using. SAP's best practice for RAC on VMware is to do 100% memory reservation for Prod...and there are other performance-enhancing reasons to do that. Besides removing the performance overhead of the memory-saving features, it allows some other vSphere 5 features that improve network latency (such as VMware's DirectPath I/O, not to be confused with Oracle's definition of direct path IO). This can have a huge impact for a busy interconnect in RAC.

4. Many of the VMware HA features are redundant when you're running RAC. In RAC, if a node fails, your processes should fail over to the surviving nodes, and apps keep running. If you're running Cisco UCS, your blade's profile will go to a spare blade, and soon (15 min or so) the failed node is back in action. VMware HA would restart that VM on a different machine that's still running, and soon (15 min or so) your failed node is back in action, assuming you left sufficient free RAM on the other blades to make that possible. Very smart people disagree about which HA method is best...and I think there are good arguments to be made on all sides. My opinion is that you should provide HA via RAC, because it's instant and it's more efficient. If you depend on VMware HA with RAC, you have to keep X% of resources free in reserve on all blades...just in case. For example, if you have 2 nodes, you'll need to limit your total VM RAM allocation to 50% (maybe a bit less due to memory tricks VMware employs...but it's still a large % of your total RAM).

If you depend on RAC for HA, you can use all resources, as long as you allocate enough RAM for the additional processes you'd need in case there's a node failure (the surviving nodes would have to absorb the connections that used to be on the failed node). This allows for much better efficiency, but it means the surviving nodes need to be capable of supporting the additional load.

5. One last thing to keep in mind...if you try to start a VM that puts you over 20TB of vmdk storage on the host, you get a non-descriptive "out of memory" error that references one of your vmdks, and your VM will fail to start. When I first saw this I thought...what does "out of memory" (an error associated with RAM) have to do with a vmdk? The answer lies in the VMware internals...for performance reasons, the storage is referenced in a reserved heap space in RAM...similar to the memory used to track AUs in Oracle's ASM. By default, there's 80MB set aside for that, which is sufficient for 20TB of vmdk storage. After that, you get the "out of memory" error, and the vmdk that pushed you over the limit is referenced. That's why a RAM issue references a vmdk. The solution is to increase the heap size to its max, which is 256MB and allows up to 60TB of vmdk storage per ESX host. Beyond that, you need to reconsider using vmdks. In my project, we were going to pass this limit...so for that reason (and others) we implemented Oracle Advanced Compression as part of the migration. Suddenly, the databases that were over 60TB total became much smaller. We're averaging a little over 3X compression with our data.
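The arithmetic behind points 2, 4 and 5 can be sketched in a few lines of Python. This is a back-of-the-envelope model under stated assumptions, not how vSphere actually works inside: the scheduling check is the simplified "all vCPUs must be free" picture from point 2, the HA limit assumes equal-sized hosts, the heap thresholds are just the 20TB and 60TB figures from point 5, and the 90TB total is a hypothetical number, not one from my project.

```python
# Back-of-the-envelope checks for points 2, 4 and 5. The models are
# simplifications of what this post describes, not VMware's actual algorithms.

def logical_cpus(sockets, cores_per_socket, hyperthreading=True):
    """Logical CPUs on the host: two threads per core with hyperthreading."""
    return sockets * cores_per_socket * (2 if hyperthreading else 1)


def can_schedule(vm_vcpus, host_cpus, busy_cpus):
    """Point 2: in this model a VM runs only when all its vCPUs are free."""
    return host_cpus - busy_cpus >= vm_vcpus


def ha_allocation_limit(hosts, failures_to_tolerate=1):
    """Point 4: fraction of total RAM usable if VMware HA must restart
    the VMs of failed hosts on the survivors (assumes equal-sized hosts)."""
    if hosts <= failures_to_tolerate:
        raise ValueError("need more hosts than failures to tolerate")
    return (hosts - failures_to_tolerate) / hosts


def heap_setting(total_vmdk_tb):
    """Point 5: which heap size covers this much vmdk storage on one host,
    using the 20TB (80MB heap) and 60TB (256MB heap) figures quoted above."""
    if total_vmdk_tb <= 20:
        return "default 80MB"
    if total_vmdk_tb <= 60:
        return "max 256MB"
    return "over the vmdk limit"


if __name__ == "__main__":
    host = logical_cpus(sockets=2, cores_per_socket=4)  # dual-quad with HT
    print("8 vCPU VM runs now:", can_schedule(8, host, busy_cpus=10))
    print("HA limit with 2 hosts:", ha_allocation_limit(2))
    raw_tb = 90.0  # hypothetical total, not a number from the project
    print("before compression:", heap_setting(raw_tb))
    print("after 3X compression:", heap_setting(raw_tb / 3))
```

The same three functions make it easy to try other shapes, for example a 4 vCPU VM on the same busy host, or a 4-host cluster where the HA limit rises to 75%.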

With all the changes I've mentioned...hardware platform, OS, endianness, ASMLib (or the lack of it), vmdks, Advanced Compression...plus moving some non-RAC databases to RAC, moving some non-ASM databases into ASM, and implementing index compression...how can we guarantee performance on the new platform with so many variables? We used Oracle's Real Application Testing and tested extensively. I'll talk more about that later.

In my next post, I'll show you the script I passed off to the VMware team to create the VMs. At first they were a bit hesitant that a database guy was giving them a script to make VMs, but after they realized how many hours of their lives they were going to get back (and that it removed human error), they were more than happy not only to use it for the migration, but to modify it and use it for all other VM builds....

7 comments:

  1. Wow, really interesting posts this month. I am located in St. Louis too, currently doing something very similar. Would you be interested in comparing notes?

  2. Sure...I'm sure that would be mutually beneficial. I'll send you a note on LinkedIn and we can set something up.

  3. I am wondering about the benefits of RAC on VMware, when vertical scalability can be achieved through VMware's allocation of vCPUs. As for availability, VMware provides its own HA solution. Did you get any push-back from VMware architects on making Oracle RAC part of the solution? Thanks

    1. Hi Phani,

      This is a GREAT question that I've struggled with myself. There's no question that small DBs are great candidates for virtualization. For the large IOPS/CPU/RAM-hungry DBs...I wanted RAC on steel. The trend around the field is to "Virtualize Everything", so I reluctantly took a look at RAC on vSphere 5. As you pointed out...I see redundancies all over the place with RAC+VMware...but they can still work well together and complement each other. Even where there's a technological overlap...one of them is better at it than the other.

      For HA, RAC is better. In the event of a failure, your query will be restarted on the surviving node. If you're in the middle of a transaction and you enable/trap FAN events, your DB client will fail over to a surviving node and your app will restart the transaction on the other node...all without an interruption in service. The user will never know there was a problem. The alternative options offered by vSphere are HA and FT. HA will restart the VM/DB on a different ESX host...so...an outage of 15 min or so to boot up. A different VMware option is FT (Fault Tolerance)...which is great...but today has a limitation of a single vCPU...HUGELY limiting. At SAP TechEd in Vegas, I had conversations with VMware reps who said the limitation would soon be raised to 4 vCPUs (and might be available in 5.2...soon). That would be great...but in the meantime, RAC is better at HA, IMHO.

      Another area that overlaps is "how well multiple DBs run on a single machine." In Oracle, you can do "Instance Caging", which limits a database to a set amount of CPU. This works really well (complications can happen when you combine this with Oracle's ORM, though), but separating out multiple instances on a single server is much more straightforward when they're each in their own VM. It's not uncommon, for security reasons, that somebody who has OS access to one DB shouldn't have access to a 2nd DB. With VMware, they can share the hardware (and more importantly, the licensing cost) of a single machine with better separation.

      VMware and SAP have provided RAC on VMware white papers, implying they see a place for the two to work together...and I think Larry Ellison would agree there's a place for RAC+Virtualization...which is why Oracle provides RAC templates for OVM. I really like OVM, but it doesn't have the market penetration that VMware has...so that's why I find myself working with RAC on VMware today....

      One last point...before vSphere 5, there was a limitation of 8 vCPUs per VM. Think about how restrictive that was...a vCPU with hyperthreading is really just a thread of a core. Although VMs were technically vertically scalable, even medium-sized DBs weren't an option until now. I would expect that trend to continue, and that in time these two technologies will work together better and better.

  4. Very good information. I was wondering, how do you go about adding more space to your ASM instances? For example, it is my understanding that to make a config change for the multi-writer flag, the VM has to be shut down. Therefore, if you have not added extra entries for future disks in your configuration parameters file, you will be forced to shut down your VM to add the multi-writer flag for the new disk. Has this been your observation, or did I miss something?

    1. Hi Scott,

      Something to think about when adding storage without a bounce...if you're not using ASMLib (i.e. you're on Red Hat 6.x), check out (http://otipstricks.blogspot.com/2013/06/adding-disk-to-asm-and-using-udev.html). There's an issue when you follow Oracle's "add-a-disk" procedure using udev rules, which always ends with "start_udev". This will (by default) disconnect your network interface briefly, causing lots of issues for your listener(s) and clusterware.

  5. Hi Scott,

    You're right...in the PowerCLI script I use to build the VMs (http://otipstricks.blogspot.com/2012/08/the-oracle-rac-vm-build-script-for.html), I separate out the purposes of the data (OS, DATA, REDO, ARCH) by the different SCSI devices. For the DB devices, I include the multi-writer flag for each possible future disk on the SCSI devices I use for storage...so I don't have to worry about ever adding an additional entry. As you said...having to bounce in order to add storage wouldn't be ideal...even if you rolled the change in with RAC, it's still a pain...I'd rather avoid it altogether.

    Hope this helps....
