Monday, July 1, 2013

Oracle RAC on vBlock

My recent project migrating many large, very active databases from single-instance AIX to RAC running Redhat 6.2 had a lot of challenges that changed the design as time went on.  Originally the plan was to deploy (according to VMWare's best practices) using vmdk's on datastores, but the overall storage requirements exceeded 60TB, so that was no longer an option and we were forced (to my delight) to use raw devices instead.  All of these databases were logically migrated to multiple VCE vBlocks (http://www.vce.com/products/vblock/overview).

Per SAP's ASM best practices (Variant 1), we placed the storage in 3 diskgroups: DATA, RECO and ARCH.


Oracle ASM Disk Group Name   Stores
+DATA    - All data files
         - All temp files
         - Control file (first copy)
         - Online redo logs (first copy)

+ARCH    - Control file (second copy)
         - Archived redo logs

+RECO    - Control file (third copy)
         - Online redo logs (second copy)

Per Oracle's best practices, all the storage in a diskgroup should have the same size and performance characteristics...and the SAP layout implies different IO requirements for these pools, so we went with a combination of SSD's and fast 15k SAS spindles in +DATA (FAST on), many smaller 15k SAS spindles in +RECO and slower 7200rpm 2TB NL-SAS spindles in +ARCH...after all, it's ok if the background processes take longer to archive your logs.  Redo will remain active a little longer, but as long as it's cleared long before we wrap around all the redo groups, it's sufficient, doesn't affect performance and is much less expensive per GB.  We also created VMWare datastores for the OS out of the arch pool, since it, too, has low iops requirements.
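For reference, here's a minimal sketch of what the diskgroup creation looks like from the ASM instance...illustrative only (external redundancy shown for brevity; the SAP systems actually used multiple failgroups), with disk aliases matching the udev naming scheme described in the posts below:

# illustrative only -- run on one node as the grid owner; the /dev/asm-*
# aliases come from the udev rules described later in this blog
sqlplus -s / as sysasm <<'EOF'
CREATE DISKGROUP DATA EXTERNAL REDUNDANCY
  DISK '/dev/asm-data-disk1', '/dev/asm-data-disk2'
  ATTRIBUTE 'compatible.asm'='11.2', 'compatible.rdbms'='11.2';
CREATE DISKGROUP RECO EXTERNAL REDUNDANCY
  DISK '/dev/asm-redo-disk1', '/dev/asm-redo-disk2';
CREATE DISKGROUP ARCH EXTERNAL REDUNDANCY
  DISK '/dev/asm-arch-disk1', '/dev/asm-arch-disk2';
EOF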

There are some issues with this design but overall it's performing extremely well.  The SAP database is serving about 4 million db calls per minute, generating 1TB of archivelogs/day.  For a mixed-load or DSS database, that archivelog generation wouldn't be a big deal...but for a pure OLTP db that's pretty respectable.  The DB cache is undersized at 512GB...but that's more than the old system had, which has really helped take the load off the storage and reduced our IOPS requirements.  The "DB Time" tracked by SAP is showing over a 2X performance boost.

For the larger non-SAP databases, the performance increase has been much more dramatic.  SAP ties your hands a bit; to keep things consistent between all their customers, their implementation is very specific...you have to be SAP Migration Certified to move a database to a new platform.  Michael Wang (from Oracle's SAP Migration group), who also teaches some Exadata administration classes, is an excellent resource for SAP migrations, and he's great to work with.   Many features that have been common in Oracle for years aren't supported.  For the non-SAP databases, we're free to take advantage of all the performance features Oracle has...and there are many.  We compressed tables with advanced compression, compressed indexes, tweaked stats and caches, moved to merged incremental backups on a different set of spindles than our data, created profiles suggested during RAT testing...basically everything we could think of.  For some databases, we implemented result cache...for others we found (in RAT testing) that it wasn't beneficial overall...it depends on your workload.  Some of our biggest performance gains (in some cases, 1000X+) didn't come from the new hardware, new software or the new design...they came from the migration itself.  For years, database upgrades were done in place, and since performance was tracked relative to "what it usually is" rather than what it should be...lots of problems, such as chained rows, were hidden.  After we did a logical migration, these problems were fixed and performance reached its potential.  I got lots of emails that went something like, "Wow, this is fast!!"
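For a flavor of the compression changes, a minimal sketch (object names are made up, and COMPRESS FOR OLTP requires the separately-licensed Advanced Compression option):

# illustrative only -- 11g syntax, object names are hypothetical
sqlplus -s / as sysdba <<'EOF'
-- OLTP table compression (Advanced Compression option); remember that a
-- MOVE leaves the table's indexes unusable, so rebuild them afterward
ALTER TABLE app.orders MOVE COMPRESS FOR OLTP;
-- index key compression (no extra license); prefix length depends on the index
ALTER INDEX app.orders_ix1 REBUILD COMPRESS 1;
EOF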

It's extremely good, but not perfect.  There's still an issue left over from going to multiple VNX's instead of a single VMAX.  I'll talk about that one later.

Friday, June 28, 2013

Adding a disk to ASM and using UDEV?

The project I've been writing about, migrating many single-instance IBM P5 595 AIX databases to 11.2.0.3 RAC on EMC vBlocks, is coming to a close.  I thought there might be value in sharing some of the lessons learned from the experience.  There have been quite a few....

As I was sitting with the DBA team, discussing how well everything had gone and how stable the new environment was, alerts started going off that services, a VIP and a scan listener on one of the production RAC nodes had failed over.  Hmm...that's strange.  About 45 seconds later...more alerts came in that the same thing had happened on the next node...and that happened over and over.  We pored through the clusterware/listener/alert logs and found nothing helpful...only that there was a network issue and clusterware took measures after the fact...nothing to point at the root cause.

Eventually we looked in the OS message log, and found this incident:

May 30 18:05:14 SCOOBY kernel: udev: starting version 147
May 30 18:05:16 SCOOBY ntpd[4312]: Deleting interface #11 eth0:1, 115.18.28.17#123, interface stats: received=0, sent=0, dropped=0, active_time=896510 secs
May 30 18:05:16 SCOOBY ntpd[4312]: Deleting interface #12 eth0:2, 115.18.28.30#123, interface stats: received=0, sent=0, dropped=0, active_time=896510 secs
May 30 18:05:18 SCOOBY ntpd[4312]: Listening on interface #13 eth0:1, 115.18.28.30#123 Enabled
May 30 18:08:21 SCOOBY kernel: ata1: soft resetting link
May 30 18:08:22 SCOOBY kernel: ata1.00: configured for UDMA/33
May 30 18:08:22 SCOOBY kernel: ata1: EH complete
May 30 18:09:55 SCOOBY kernel: sdab: sdab1
May 30 18:10:13 SCOOBY kernel: sdac: sdac1
May 30 18:10:27 SCOOBY kernel: udev: starting version 147

Udev started, ntpd reported the network issue, then udev finished.  Hmm...why did udev start?  It turns out that the unix team had added a disk (which has always been considered safe during business hours) and, as part of Oracle's procedure to create the udev rule, they needed to run start_udev.  The first reaction was to declare "adding storage" an "after-hours practice only" from now on...and that would usually be ok...but there are times when emergencies come up and adding storage can't wait until after hours, and must be done online...so we needed a better answer.

The analysis of the issue showed that when the Unix team followed their procedure and ran start_udev, udev deleted the public network interface and re-created it within a few seconds, which caused the listener to crash...and of course, clusterware wasn't ok with this.  All the scan listeners and services fled from that node to other nodes.  Without noticing an issue, the unix team proceeded to add the storage to the other nodes, causing failovers over and over.

We opened tickets with Oracle (since we had followed their documented process per multiple MOS notes) and Redhat (since they support udev).  The Oracle ticket didn't really go anywhere...the Redhat ticket said this is normal, expected behavior, which I thought was strange...I've done this probably hundreds of times and never noticed a problem, and I found nothing on MOS that mentions a problem.   RH eventually suggested we add HOTPLUG="NO" to the network configuration files.  After that, when we run start_udev, we don't have the problem, the message log doesn't show the network interface getting dropped and re-created...and everything is good.  We're able to add storage w/o an outage again.
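For reference, the fix is one line per interface config file...something like this (the interface name and addressing are illustrative):

# /etc/sysconfig/network-scripts/ifcfg-eth0 (values illustrative)
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
HOTPLUG="NO"    # keeps start_udev from tearing down and re-adding this NIC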


I updated the MOS SR w/Redhat's resolution.  Hopefully this will be mentioned in a future note, or added to RACCHECK, for those of us running Oracle on Redhat 6+, where asmlib is unavailable.

-- UPDATE --

From Oracle, per notes 414897.1, 1528148.1, 371814.1 etc., we're told to use start_udev to activate a new rule and add storage.  From Redhat (https://access.redhat.com/site/solutions/154183) we're told to never manually run start_udev.

Redhat has a better suggestion...you can trigger the udev event without losing your network configuration, affecting only the specific device you're working with, via:

echo change > /sys/block/sdg/sdg1/uevent

I think this is a better option...so...do this instead of start_udev.  I would expect this to become a bigger issue as more people migrate to RH 6+, where asmlib isn't an option.
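Putting it together, a minimal sketch of activating a new rule without start_udev (sdg1 is illustrative):

# reload the rules files, then replay a change event for just the new
# partition -- no network interfaces get touched
/sbin/udevadm control --reload-rules
echo change > /sys/block/sdg/sdg1/uevent
ls -l /dev/asm-*    # verify the alias showed up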




Thursday, December 6, 2012

Swingbench is great! (but not descriptive when there's an error.)

Just a quick note that may help some of you using Swingbench in distributed mode.  For those of you that haven't used it yet, Swingbench is a great way to compare performance of different Oracle databases running on different platforms or configurations.  The best part is...it's free:

http://www.dominicgiles.com/swingbench.html

There are two ways to do it...one is a simple test from your laptop...the other is distributed, for a RAC database, in order to push it and see where its bottlenecks are (and to make sure your laptop isn't introducing a bottleneck).  You can get the details and a walk-through from the author's site (link above), but essentially you have multiple load generators connecting directly to specific nodes of the database, then their results are aggregated by the "coordinator process"...and its results are displayed by the cluster overview process.  When I was doing this last night I got this error and was unable to find help on "the internets":

11:33:11 AM FINEST com.dom.benchmarking.swingbench.clusteroverview.datasource.ScalabilityDataSource () Connected
java.lang.NullPointerException
        at com.dom.benchmarking.swingbench.clusteroverview.datasource.TransactionDataSource.updateResultsArray(TransactionDataSource.java:148)
        at com.dom.benchmarking.swingbench.clusteroverview.datasource.TransactionDataSource.run(TransactionDataSource.java:177)
        at java.lang.Thread.run(Unknown Source)

I was doing a distributed swingbench test and the workload generators I was using (charbench) were all using the same swingconfig.xml over a shared NFS mount, which had a typo in the connect string...so I ended up having no connections.  I can only guess this might be what the java error was trying to say with the "null pointer exception on update of the results array of the transaction data source."  For my situation (and maybe yours) I consider this an "unable to connect to database" error...if you hit this issue, check the connect string in swingconfig.xml.
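A quick pre-flight check on each load generator host would have caught it...something like this (the connect string and credentials are illustrative):

# confirm what's actually in the shared config, then test the connection;
# -l makes sqlplus exit on a failed logon instead of reprompting
grep -i connectstring swingconfig.xml
echo "select 'connected ok' from dual;" | sqlplus -l soe/soe@//racnode1:1521/orcl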

I hope this helps!

Thursday, October 25, 2012

Update to VM script for PowerCLI is on the way....

I'm going to update the script in the previous post to not use vmdk's for the data luns of the database. 

Although this performs well and it's the SAP/VMWare best practice (and it's great for smaller db's), the idea that we may have to replicate our bugs on physical hardware for Oracle support means we'd have to live with our bug long enough to set up a new physical server, install the db software and restore the database to it.  For these multi-TB databases, that would take many hours, at least.  If we use raw devices in the VM's (with the vBlock's Cisco UCS), all we have to do is apply a profile to a spare blade, turn it on and add that node to the RAC cluster.  Within a few minutes we'll be able to replicate the bug for MOS...then we can shut down the blade and remove the profile.

I'll post it when it's finished.

Friday, August 31, 2012

The Oracle RAC VM Build Script for PowerCLI on vSphere 5

One of the benefits of computers is that they're supposed to make repetitive tasks easier, or automated.  Still, for many tasks in our industry, we're expected to click over and over in a GUI to do the same things, without error.  Besides causing "Post Office Syndrome" (which causes one to "go postal"), this is no way for intelligent human beings to spend their lives.  Any chance I get to automate something that needs to be done over and over, I take it.  With that in mind, this script improved my life, and I hope it'll improve your life too.

I mentioned this in a previous post, so here it is: the PowerCLI vSphere 5 multiple-VM build script.  It easily creates multiple VM's with shared storage, utilizing a combination of best practices from Oracle, SAP and VMware.  I am by no means a PowerCLI guru...if you have improvements for this...let me know.  Here are a few I would like to see in the long term:

1. I have a friend who's planning to add XLS functionality to this via PowerCLI...so the dba team can give him an Excel spreadsheet with a list of database parameters that the script reads in, replacing the parameters in a loop...creating many vm's, one after the other, automated.

2. The number of nodes should be a parameter that feeds the logic in a loop...so the same script can be used whether you have a 2-node or an 8-node RAC db.

3. The final section that eager-zeroes the storage works...but for large databases it takes an extremely long time.  An alternative method would be to create the vmdk's thin, and then storage-migrate them to the same datastore as eager zeroed, similar to what's discussed here.  My theory is that this might use VAAI, which could hugely improve the eager zeroing process by offloading it to the SAN (see the vmkfstools sketch below).

Also, be aware there is a vSphere client bug that incorrectly reports the backing of the vmdk's as thick lazy zeroed when they're actually thick eager zeroed.  If you run the "eager zero" part of the script and it seems to complete in a few seconds, it means you're trying to eager zero something that's already eager zeroed (regardless of what the client reports), which almost amounts to a no-op.
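If you'd rather do (or verify) the zeroing from the ESXi shell, vmkfstools can do it directly...a hedged sketch, since you should double-check the flags on your build (paths are illustrative):

# -j inflates a thin vmdk to eagerzeroedthick;
# -k eager-zeroes a lazy-zeroed thick vmdk in place (paths illustrative)
vmkfstools -j /vmfs/volumes/data1/racnode1/racnode1_1.vmdk
vmkfstools -k /vmfs/volumes/data1/racnode1/racnode1_2.vmdk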

Many sections of this have been pieced together from talented people in the PowerCLI community, but I can't tell who the original authors were...I think because different snippets have been added on to by different people. Still, I'd really like to give them credit and thank them for making this possible. Technical communities that work together restore my faith in the good of mankind. :)


When I run this, I connect via terminal services to vCenter on vSphere 5, then paste it into PowerCLI on that machine...but there are lots of ways to skin that cat... Hmm...I know my blog gets translated to different languages...I wonder how colloquialisms like that are interpreted in Hindi, Chinese, Peta etc? :)


# vCenter server name, used by Set-EagerZeroThick at the end (value illustrative)
$vCenter = "vcenter.company.com"
$VS5_Host1 = "node1.company.com"
$VS5_Host2 = "node2.company.com"
$VS5_Host3 = "node3.company.com"
$vmName1 = "racnode1"
$vmName2 = "racnode2"
$vmName3 = "racnode3"
$rac_vm_cpu = 6
$rac_vm_ram_mb = (110GB/1MB)
$rac_vm_ram_mb_rez = (90.6GB/1MB)
$public_network_name = "10.2.14-17"
$private_network_name = "192.168.20.0"
$backup_network_name = "Backup"
$osstore = "os_datastore"
$osstore_size_MB = (100GB/1MB)
$orastore = "ora_datastore"
$orastore_size_KB = (100GB/1KB)
$datastore1 = "data1"
$datastore2 = "data2"
$datastore3 = "data3"
$datastore4 = "data4"
$datastore5 = "data5"
$datastore6 = "data6"
$datastore7 = "data7"
$datastore8 = "data8"
$datastore_size_KB = (550GB/1KB)
$recostore1 = "loga"
$recostore2 = "logb"
$recostore_size_KB = (8GB/1KB)
$archstore1 = "arch01"
$archstore2 = "arch02"
$archstore3 = "arch03"
$archstore_size_KB = (200GB/1KB)

$VM1 = new-vm `
-Host "$VS5_Host1" `
-Name $vmName1 `
-Datastore (get-datastore "$osstore") `
-Location "Oracle" `
-GuestID rhel6_64Guest `
-MemoryMB 4096 `
-DiskMB $osstore_size_MB `
-NetworkName "$public_network_name" `
-DiskStorageFormat "Thin"

$vm2 = new-vm `
-Host "$VS5_Host2" `
-Name $vmName2 `
-Datastore (get-datastore "$osstore") `
-Location "Oracle" `
-GuestID rhel6_64Guest `
-MemoryMB 4096 `
-DiskMB $osstore_size_MB `
-NetworkName "$public_network_name" `
-DiskStorageFormat "Thin"

$VM3 = new-vm `
-Host "$VS5_Host3" `
-Name $vmName3 `
-Datastore (get-datastore "$osstore") `
-Location "Oracle" `
-GuestID rhel6_64Guest `
-MemoryMB 4096 `
-DiskMB $osstore_size_MB `
-NetworkName "$public_network_name" `
-DiskStorageFormat "Thin"

Function Change-Memory {
    Param (
        $VM,
        $MemoryMB
    )
    Process {
        $VMs = Get-VM $VM
        Foreach ($Machine in $VMs) {
            $VMId = $Machine.Id

            $VMSpec = New-Object VMware.Vim.VirtualMachineConfigSpec
            $VMSpec.memoryMB = $MemoryMB
            $RawVM = Get-View -Id $VMId
            $RawVM.ReconfigVM_Task($VMSpec)
        }
    }
}

Change-Memory -MemoryMB $rac_vm_ram_mb -VM $VM1
Change-Memory -MemoryMB $rac_vm_ram_mb -VM $VM2
Change-Memory -MemoryMB $rac_vm_ram_mb -VM $VM3

Set-VM -vm(get-vm $VM1) -NumCpu $rac_vm_cpu -RunAsync -Version v8 -Confirm:$false
Set-VM -vm(get-vm $vm2) -NumCpu $rac_vm_cpu -RunAsync -Version v8 -Confirm:$false
Set-VM -vm(get-vm $VM3) -NumCpu $rac_vm_cpu -RunAsync -Version v8 -Confirm:$false

Get-VM $VM1 | Get-VMResourceConfiguration | Set-VMResourceConfiguration -MemReservationMB $rac_vm_ram_mb_rez
Get-VM $vm2 | Get-VMResourceConfiguration | Set-VMResourceConfiguration -MemReservationMB $rac_vm_ram_mb_rez
Get-VM $VM3 | Get-VMResourceConfiguration | Set-VMResourceConfiguration -MemReservationMB $rac_vm_ram_mb_rez

New-NetworkAdapter -VM $vm1 -NetworkName "$private_network_name" -StartConnected -Type vmxnet3 -Confirm:$false
New-NetworkAdapter -VM $vm2 -NetworkName "$private_network_name" -StartConnected -Type vmxnet3 -Confirm:$false
New-NetworkAdapter -VM $vm3 -NetworkName "$private_network_name" -StartConnected -Type vmxnet3 -Confirm:$false

New-NetworkAdapter -VM $vm1 -NetworkName "$backup_network_name" -StartConnected -Type vmxnet3 -Confirm:$false
New-NetworkAdapter -VM $vm2 -NetworkName "$backup_network_name" -StartConnected -Type vmxnet3 -Confirm:$false
New-NetworkAdapter -VM $vm3 -NetworkName "$backup_network_name" -StartConnected -Type vmxnet3 -Confirm:$false

Function Enable-MemHotAdd($vm){
    $vmview = Get-VM $vm | Get-View
    $vmConfigSpec = New-Object VMware.Vim.VirtualMachineConfigSpec

    $extra = New-Object VMware.Vim.optionvalue
    $extra.Key = "mem.hotadd"
    $extra.Value = "true"
    $vmConfigSpec.extraconfig += $extra

    $vmview.ReconfigVM($vmConfigSpec)
}

enable-memhotadd $vm1
enable-memhotadd $vm2
enable-memhotadd $vm3

Function Enable-vCpuHotAdd($vm){
    $vmview = Get-VM $vm | Get-View
    $vmConfigSpec = New-Object VMware.Vim.VirtualMachineConfigSpec

    $extra = New-Object VMware.Vim.optionvalue
    $extra.Key = "vcpu.hotadd"
    $extra.Value = "true"
    $vmConfigSpec.extraconfig += $extra

    $vmview.ReconfigVM($vmConfigSpec)
}

enable-vCpuHotAdd $vm1
enable-vCpuHotAdd $vm2
enable-vCpuHotAdd $vm3

New-HardDisk -vm($VM1) -CapacityKB $orastore_size_KB -StorageFormat Thin -datastore "$orastore"
New-HardDisk -vm($vm2) -CapacityKB $orastore_size_KB -StorageFormat Thin -datastore "$orastore"
New-HardDisk -vm($VM3) -CapacityKB $orastore_size_KB -StorageFormat Thin -datastore "$orastore"

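# Shared-disk pattern: create each vmdk on node 1, then attach the same
# file to nodes 2 and 3; the first disk of each group also creates a new
# paravirtual SCSI controller on each VM.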
$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $datastore_size_KB -StorageFormat Thick -datastore "$datastore1"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})
$New_SCSI_1_1 = $New_Disk1 | New-ScsiController -Type ParaVirtual -Confirm:$false
$New_SCSI_2_1 = $New_Disk2 | New-ScsiController -Type ParaVirtual -Confirm:$false
$New_SCSI_3_1 = $New_Disk3 | New-ScsiController -Type ParaVirtual -Confirm:$false

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $datastore_size_KB -StorageFormat Thick -datastore "$datastore2"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_1
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_1
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_1

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $datastore_size_KB -StorageFormat Thick -datastore "$datastore3"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_1
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_1
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_1

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $datastore_size_KB -StorageFormat Thick -datastore "$datastore4"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_1
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_1
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_1

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $datastore_size_KB -StorageFormat Thick -datastore "$datastore5"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_1
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_1
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_1

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $datastore_size_KB -StorageFormat Thick -datastore "$datastore6"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_1
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_1
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_1

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $datastore_size_KB -StorageFormat Thick -datastore "$datastore7"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_1
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_1
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_1

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $datastore_size_KB -StorageFormat Thick -datastore "$datastore8"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_1
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_1
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_1

###################################

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $recostore_size_KB -StorageFormat Thick -datastore "$recostore1"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})
$New_SCSI_1_2 = $New_Disk1 | New-ScsiController -Type ParaVirtual -Confirm:$false
$New_SCSI_2_2 = $New_Disk2 | New-ScsiController -Type ParaVirtual -Confirm:$false
$New_SCSI_3_2 = $New_Disk3 | New-ScsiController -Type ParaVirtual -Confirm:$false

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $recostore_size_KB -StorageFormat Thick -datastore "$recostore2"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_2
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_2
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_2

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $recostore_size_KB -StorageFormat Thick -datastore "$recostore1"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_2
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_2
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_2

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $recostore_size_KB -StorageFormat Thick -datastore "$recostore2"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_2
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_2
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_2

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $recostore_size_KB -StorageFormat Thick -datastore "$recostore1"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_2
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_2
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_2

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $recostore_size_KB -StorageFormat Thick -datastore "$recostore2"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_2
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_2
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_2

#######################


$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $archstore_size_KB -StorageFormat Thick -datastore "$archstore1"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})
$New_SCSI_1_3 = $New_Disk1 | New-ScsiController -Type ParaVirtual -Confirm:$false
$New_SCSI_2_3 = $New_Disk2 | New-ScsiController -Type ParaVirtual -Confirm:$false
$New_SCSI_3_3 = $New_Disk3 | New-ScsiController -Type ParaVirtual -Confirm:$false

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $archstore_size_KB -StorageFormat Thick -datastore "$archstore2"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_3
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_3
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_3

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $archstore_size_KB -StorageFormat Thick -datastore "$archstore3"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_3
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_3
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_3

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $archstore_size_KB -StorageFormat Thick -datastore "$archstore1"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_3
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_3
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_3

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $archstore_size_KB -StorageFormat Thick -datastore "$archstore2"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_3
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_3
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_3

$New_Disk1 = New-HardDisk -vm($VM1) -CapacityKB $archstore_size_KB -StorageFormat Thick -datastore "$archstore3"
$New_Disk2 = new-harddisk -vm($vm2) -diskpath ($New_Disk1 | %{$_.Filename})
$New_Disk3 = new-harddisk -vm($vm3) -diskpath ($New_Disk1 | %{$_.Filename})

set-harddisk -Confirm:$false -harddisk $New_Disk1 -controller $New_SCSI_1_3
set-harddisk -Confirm:$false -harddisk $New_Disk2 -controller $New_SCSI_2_3
set-harddisk -Confirm:$false -harddisk $New_Disk3 -controller $New_SCSI_3_3

$ExtraOptions = @{
# per VMware, SAP and Oracle VMware Best Practices
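# note: SCSI ID 7 is reserved for the virtual controller itself, which is
# why n:7 is skipped on each controller below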
"scsi1:0.sharing"="multi-writer";
"scsi1:1.sharing"="multi-writer";
"scsi1:2.sharing"="multi-writer";
"scsi1:3.sharing"="multi-writer";
"scsi1:4.sharing"="multi-writer";
"scsi1:5.sharing"="multi-writer";
"scsi1:6.sharing"="multi-writer";
"scsi1:8.sharing"="multi-writer";
"scsi1:9.sharing"="multi-writer";
"scsi1:10.sharing"="multi-writer";
"scsi1:11.sharing"="multi-writer";
"scsi1:12.sharing"="multi-writer";
"scsi1:13.sharing"="multi-writer";
"scsi1:14.sharing"="multi-writer";
"scsi1:15.sharing"="multi-writer";
"scsi2:0.sharing"="multi-writer";
"scsi2:1.sharing"="multi-writer";
"scsi2:2.sharing"="multi-writer";
"scsi2:3.sharing"="multi-writer";
"scsi2:4.sharing"="multi-writer";
"scsi2:5.sharing"="multi-writer";
"scsi2:6.sharing"="multi-writer";
"scsi2:8.sharing"="multi-writer";
"scsi2:9.sharing"="multi-writer";
"scsi2:10.sharing"="multi-writer";
"scsi2:11.sharing"="multi-writer";
"scsi2:12.sharing"="multi-writer";
"scsi2:13.sharing"="multi-writer";
"scsi2:14.sharing"="multi-writer";
"scsi2:15.sharing"="multi-writer";
"scsi3:0.sharing"="multi-writer";
"scsi3:1.sharing"="multi-writer";
"scsi3:2.sharing"="multi-writer";
"scsi3:3.sharing"="multi-writer";
"scsi3:4.sharing"="multi-writer";
"scsi3:5.sharing"="multi-writer";
"scsi3:6.sharing"="multi-writer";
"scsi3:8.sharing"="multi-writer";
"scsi3:9.sharing"="multi-writer";
"scsi3:10.sharing"="multi-writer";
"scsi3:11.sharing"="multi-writer";
"scsi3:12.sharing"="multi-writer";
"scsi3:13.sharing"="multi-writer";
"scsi3:14.sharing"="multi-writer";
"scsi3:15.sharing"="multi-writer";
"disk.EnableUUID"="true";
"ethernet0.coalescingScheme"="disabled";
"ethernet1.coalescingScheme"="disabled";
"sched.mem.pshare.enable"="false";
"numa.vcpu.preferHT"="true";

# per VMware's Hardening Guide - Enterprise Level
"isolation.tools.diskShrink.disable"="true";
"isolation.tools.diskWiper.disable"="true";
"isolation.tools.copy.disable"="true";
"isolation.tools.paste.disable"="true";
"isolation.tools.setGUIOptions.enable"="false";
"isolation.device.connectable.disable"="true";
"isolation.device.edit.disable"="true";
"vmci0.unrestricted"="false";
"log.keepOld"="10";
"log.rotateSize"="1000000";
"tools.setInfo.sizeLimit"="1048576";
"guest.command.enabled"="false";
"tools.guestlib.enableHostInfo"="false"
}
$vmConfigSpec = New-Object VMware.Vim.VirtualMachineConfigSpec
Foreach ($Option in $ExtraOptions.GetEnumerator()) {
    $OptionValue = New-Object VMware.Vim.optionvalue
    $OptionValue.Key = $Option.Key
    $OptionValue.Value = $Option.Value
    $vmConfigSpec.extraconfig += $OptionValue
}

$vmview=get-vm $vmName1 | get-view
$vmview.ReconfigVM_Task($vmConfigSpec)
$vmview=get-vm $vmName2 | get-view
$vmview.ReconfigVM_Task($vmConfigSpec)
$vmview=get-vm $vmName3 | get-view
$vmview.ReconfigVM_Task($vmConfigSpec)

function Set-EagerZeroThick {
    param($vcName, $vmName, $hdName)
    # Find ESX host for VM
    # $vcHost = Connect-VIServer -Server $vcName -Credential (Get-Credential -Credential "vCenter account")
    $vmImpl = Get-VM $vmName
    if ($vmImpl.PowerState -ne "PoweredOff") {
        Write-Host "Guest must be powered off to use this script!" -ForegroundColor red
        return $false
    }

    $vm = $vmImpl | Get-View
    $esxName = (Get-View $vm.Runtime.Host).Name

    # Find datastore path; thin disks can't be eager zeroed in place
    $dev = $vm.Config.Hardware.Device | where {$_.DeviceInfo.Label -eq $hdName}
    if ($dev.Backing.thinProvisioned) {
        return $false
    }
    $hdPath = $dev.Backing.FileName

    # For Virtual Disk Manager we need to connect to the ESX server
    # $esxHost = Connect-VIServer -Server $esxName -User $esxAccount -Password $esxPasswd

    # Convert HD and wait for the task to finish
    $vDiskMgr = Get-View -Id (Get-View ServiceInstance -Server $esxHost).Content.VirtualDiskManager
    $dc = Get-Datacenter -Server $esxHost | Get-View
    $taskMoRef = $vDiskMgr.EagerZeroVirtualDisk_Task($hdPath, $dc.MoRef)
    $task = Get-View $taskMoRef
    while ("running", "queued" -contains $task.Info.State) {
        $task.UpdateViewData("Info")
    }

    # Disconnect-VIServer -Server $esxHost -Confirm:$false

    # Connect to the vCenter
    # Connect-VIServer -Server $vcName -Credential (Get-Credential -Credential "vCenter account")
    if ($task.Info.State -eq "success") {
        return $true
    }
    else {
        return $false
    }
}

Set-EagerZeroThick $vCenter $vmName1 "Hard disk 3"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 4"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 5"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 6"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 7"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 8"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 9"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 10"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 11"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 12"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 13"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 14"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 15"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 16"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 17"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 18"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 19"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 20"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 21"
Set-EagerZeroThick $vCenter $vmName1 "Hard disk 22"

Udev>ASMlib

I'm fighting the urge to rant about this.  Although I respect Oracle's right to make their products work better and have additional functionality with their other products, I really dislike the position Oracle has taken on ASMlib, db_flash_cache and HCC on Sun-only storage.  Db_flash_cache and HCC are *wonderful* enhancements to database functionality...and they were released with the caveat that they only work with their respective Oracle co-products...db_flash_cache needs OEL and HCC (non-Exadata) needs Sun storage.  You've probably read prior posts here on the Sun 7420 used for Exadata backups...it's wonderful...people should buy it on its own merits.  OEL offers a solid product at a great price point relative to other distros...it can stand on its own too.  Still...I get it...Oracle wants to sell more Oracle.

What bothers me more is the decision to no longer release ASMLib for Redhat 6+.  It's a difficult thing for customers who have an installed base of Oracle databases on Redhat 5 and procedures built around ASMlib to switch.  For a small shop, maybe it's not that big of a deal...for Oracle's big customers, it involves documentation, meetings, coordination...and a lot of ill will about being forced to change procedures and retrain resources that were trained to use ASMLib.

The written procedure to use udev with Redhat is a bit of a pain.  You identify the uuid of each disk device and create an entry in a udev rules file for it.  Again...for a small shop with a few databases...not a big deal.  I'm working on a project now to move over a hundred databases from AIX to Redhat 6 on vSphere 5 (VMWare), in RAC.  This is one of the most aggressive use cases for Oracle on VMware I've ever heard of.  To have big databases in VMWare is easy...to have busy databases in VMWare is a challenge.  Each node of each database has many disk devices.  Some of the databases are SAP, which requires multiple failgroups...which means multiple controllers with separated storage.  It's easy to make a mistake and add storage to the wrong asm diskgroup...destroying your failgroup separation.

Even with ASMLib it's difficult to make sure your ASM disk...ie: VOL45...maps to the correct SCSI controller, which maps to the proper storage that uses separate paths all the way back to separate physical storage.  Without ASMlib, there's much more room for human error.  Obviously...I had to automate.

So...Oracle hands you lemons...make exa-lemonade.  I created a udev rule creation script for Redhat that does a task similar to what ASMLib has always done.

Pros to Script
  • Easy to maintain (it's just bash)
  • Not kernel (or RH version) dependent, and not OEL dependent
  • Syncs rules across nodes (disk 14 on node 1 is the same as disk 14 on node 4, by UUID)
  • Based on the SCSI controller, the name of the alias is changed...so the dba won't accidentally add a disk from one failgroup to a diskgroup of a different failgroup (which would eliminate the data protection of using multiple failgroups, and fail your SAP ASM platform certification)
Pros to ASMLib

  • Oracle maintains it (...until you update your kernel...which you do regularly for security fixes, right?)
  • Stamps disks (in sector 2?) so VOL14 on node 1 is the same as VOL14 on node 4
  • There are some people who think ASMLib is more performant than udev...but I don't think those claims come from Oracle...and I haven't been able to quantify a difference.  If a performance advantage exists, it must be slight.  (You can inspect a disk's identity yourself even without ASMLib...see the kfed sketch after this list.)
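Incidentally, here's a minimal sketch of reading the ASM disk header with kfed, which ships with the grid install (the path is illustrative):

# dump the header; kfdhdb.dskname/grpname identify the ASM disk no matter
# which /dev/sd* name it lands on for a given node
$ORACLE_HOME/bin/kfed read /dev/asm-data-disk1 | grep -E "dskname|grpname"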
To distribute the rules file to all the other nodes, there are 2 dependencies on Oracle's OneCommand configuration (used in Exadata, OVM, ODM, etc): the params.ini and the doall.sh script.  In params.ini, I added a parameter called SHARED_DIR, which is a directory mounted by all nodes.  If you want...you can just ftp the file to the other nodes and comment those 2 lines out.

This is a work in progress which is expected to be modified and improved upon by the end user, and as always, use at your own risk...but I think it will likely save you some work creating your udev rules.  There is some detection of the formatting of partitions...and it works for me, but you should verify the devices it recognizes as unformatted really are unformatted.  Use this on a non-prod, unimportant, crash-and-burn system first.  Hmmm...I can't think of any other warning to give.  Don't run this on any computer, ever.  To be extra, extra safe, you could comment out the last few lines that deal with moving the file around and reloading udev rules...that way you can look at the new rules file before you actually use it.

Ok...that being said, I hope this enables you to get past the lack of ASMlib on Redhat 6+, as it has definitely helped me.

#!/bin/bash
###################################
# Name: udev_rules.sh
# Date: 5/9/2012
# Purpose: This script will create all the udev rules necessary to support
#          Oracle ASM for RH5 or RH6. It will name the aliased devices
#          appropriately for the different failgroups, based on the controller
#          they're assigned to.
# Revisions:
# 5/8/2012 - Created
# 5/10/2012 - Will now modify the existing rules to allow the addition of a
#             single new disk. It will also sync the udev rules on node 1
#             with all other nodes.
###################################
source /u01/racovm/params.ini
# clear the old copy from the shared directory (SHARED_DIR comes from params.ini)
rm -f ${SHARED_DIR}/udev/99-oracle-asmdevices.rules
data_disk=0
redo_disk=0
arch_disk=0
release_test=`lsb_release -r | awk 'BEGIN {FS=" "}{print $2}' | awk 'BEGIN {FS="."}{print $1}'`
echo "Detected RH release ${release_test}"

# Count the disks already aliased in an existing rules file, so the
# numbering continues where it left off.
if [ -f "/etc/udev/rules.d/99-oracle-asmdevices.rules" ]; then
  echo -e "Detected a pre-existing asm rules file. Analyzing...\c"
  for y in {1..50}
  do
    found_data_disk=`cat /etc/udev/rules.d/99-oracle-asmdevices.rules | grep "asm-data-disk${y}"`
    found_redo_disk=`cat /etc/udev/rules.d/99-oracle-asmdevices.rules | grep "asm-redo-disk${y}"`
    found_arch_disk=`cat /etc/udev/rules.d/99-oracle-asmdevices.rules | grep "asm-arch-disk${y}"`
    if [ -n "${found_data_disk}" ]; then
      let "data_disk++"
    fi
    if [ -n "${found_redo_disk}" ]; then
      let "redo_disk++"
    fi
    if [ -n "${found_arch_disk}" ]; then
      let "arch_disk++"
    fi
    echo -e ".\c"
  done
  echo "complete."
  echo "Existing rules file contains:"
  echo " ASM Data Disks: ${data_disk}"
  echo " ASM Redo Disks: ${redo_disk}"
  echo " ASM Arch Disks: ${arch_disk}"
  new_file="false"
else
  echo "Detected no pre-existing asm udev rules file. Building..."
  new_file="true"
fi

echo "Creating new partitions if needed."
sh install.sh &> install.log

for x in {a..z}
do
  if [ -n "`ls /dev/sd*1 | grep sd${x}1`" ]; then
    # A partition is an ASM candidate if file(1) reports raw "data"
    # (unformatted) or an existing "Oracle ASM" header.
    asm_test1=`file -s /dev/sd${x}1 | grep "/dev/sd${x}1: data"`
    asm_test2=`file -s /dev/sd${x}1 | grep "Oracle ASM"`
    if [[ -n "${asm_test1}" || -n "${asm_test2}" ]]; then
      controller=`ls /sys/block/sd${x}/device/scsi_device | awk 'BEGIN {FS=":"}{print $1}'`
      # ie: scsi_device:1:0:1:0
      if [ "${release_test}" = "5" ]; then
        result=`/sbin/scsi_id -g -u -s /dev/sd${x}`
      else
        result=`/sbin/scsi_id -g -u -d /dev/sd${x}`
      fi
      if [ "${result}" = "" ]; then
        echo "No scsi id found for /dev/sd${x}. If you're running on VMWare, verify disk.EnableUUID=true has been added under option->Advanced->General->Configuration Parameters."
        exit 1
      fi
      if [ "${controller}" = "3" ]; then
        # Controller 3 holds the data disks.
        if [ -f "/etc/udev/rules.d/99-oracle-asmdevices.rules" ]; then
          found_uuid=`cat /etc/udev/rules.d/99-oracle-asmdevices.rules | grep "${result}"`
        else
          found_uuid=
        fi
        #if [[ -z "${found_uuid}" || "${new_file}" = "true" ]]; then
        if [ -z "${found_uuid}" ]; then
          echo "Detected a new data disk. Adding rule to /etc/udev/rules.d/99-oracle-asmdevices.rules"
          let "data_disk++"
          if [ "${release_test}" = "5" ]; then
            echo "KERNEL==\"sd?1\", BUS==\"scsi\", PROGRAM==\"/sbin/scsi_id -g -u -s /dev/\$parent\", RESULT==\"${result}\", NAME=\"asm-data-disk${data_disk}\", OWNER=\"oracle\", GROUP=\"dba\", MODE=\"0660\"" >> /etc/udev/rules.d/99-oracle-asmdevices.rules
          else
            echo "KERNEL==\"sd?1\", BUS==\"scsi\", PROGRAM==\"/sbin/scsi_id -g -u -d /dev/\$parent\", RESULT==\"${result}\", NAME=\"asm-data-disk${data_disk}\", OWNER=\"oracle\", GROUP=\"dba\", MODE=\"0660\"" >> /etc/udev/rules.d/99-oracle-asmdevices.rules
          fi
        fi
      elif [ "${controller}" = "4" ]; then
        # Controller 4 holds the redo disks.
        if [ -f "/etc/udev/rules.d/99-oracle-asmdevices.rules" ]; then
          found_uuid=`cat /etc/udev/rules.d/99-oracle-asmdevices.rules | grep "${result}"`
        else
          found_uuid=
        fi
        if [[ -z "${found_uuid}" || "${new_file}" = "true" ]]; then
          echo "Detected a new Redo disk. Adding rule to /etc/udev/rules.d/99-oracle-asmdevices.rules"
          let "redo_disk++"
          if [ "${release_test}" = "5" ]; then
            echo "KERNEL==\"sd?1\", BUS==\"scsi\", PROGRAM==\"/sbin/scsi_id -g -u -s /dev/\$parent\", RESULT==\"${result}\", NAME=\"asm-redo-disk${redo_disk}\", OWNER=\"oracle\", GROUP=\"dba\", MODE=\"0660\"" >> /etc/udev/rules.d/99-oracle-asmdevices.rules
          elif [ "${release_test}" = "6" ]; then
            echo "KERNEL==\"sd?1\", BUS==\"scsi\", PROGRAM==\"/sbin/scsi_id -g -u -d /dev/\$parent\", RESULT==\"${result}\", NAME=\"asm-redo-disk${redo_disk}\", OWNER=\"oracle\", GROUP=\"dba\", MODE=\"0660\"" >> /etc/udev/rules.d/99-oracle-asmdevices.rules
          fi
        fi
      elif [ "${controller}" = "5" ]; then
        # Controller 5 holds the arch disks.
        if [ -f "/etc/udev/rules.d/99-oracle-asmdevices.rules" ]; then
          found_uuid=`cat /etc/udev/rules.d/99-oracle-asmdevices.rules | grep "${result}"`
        else
          found_uuid=
        fi
        if [[ -z "${found_uuid}" || "${new_file}" = "true" ]]; then
          echo "Detected a new Arch disk. Adding rule to /etc/udev/rules.d/99-oracle-asmdevices.rules"
          let "arch_disk++"
          if [ "${release_test}" = "5" ]; then
            echo "KERNEL==\"sd?1\", BUS==\"scsi\", PROGRAM==\"/sbin/scsi_id -g -u -s /dev/\$parent\", RESULT==\"${result}\", NAME=\"asm-arch-disk${arch_disk}\", OWNER=\"oracle\", GROUP=\"dba\", MODE=\"0660\"" >> /etc/udev/rules.d/99-oracle-asmdevices.rules
          elif [ "${release_test}" = "6" ]; then
            echo "KERNEL==\"sd?1\", BUS==\"scsi\", PROGRAM==\"/sbin/scsi_id -g -u -d /dev/\$parent\", RESULT==\"${result}\", NAME=\"asm-arch-disk${arch_disk}\", OWNER=\"oracle\", GROUP=\"dba\", MODE=\"0660\"" >> /etc/udev/rules.d/99-oracle-asmdevices.rules
          fi
        fi
      fi
    else
      echo "/dev/sd${x}1 is not an asm disk."
    fi
  fi
done
#cat /etc/udev/rules.d/99-oracle-asmdevices.rules
echo "Syncing rules file for all nodes of this cluster..."
cd ${SHARED_DIR}/udev
cp /etc/udev/rules.d/99-oracle-asmdevices.rules .
/u01/racovm/doall.sh -p cp ${SHARED_DIR}/udev/99-oracle-asmdevices.rules /etc/udev/rules.d/99-oracle-asmdevices.rules
echo "Reloading rules for all disks on all nodes in this cluster..."
if [ "${release_test}" = "5" ]; then
  /u01/racovm/doall.sh /sbin/udevcontrol reload_rules &> /dev/null
else
  /u01/racovm/doall.sh /sbin/udevadm control --reload-rules &> /dev/null
fi
# Note: per the update earlier in this post, on a live cluster consider
# replaying per-device change events instead of running start_udev here.
/u01/racovm/doall.sh /sbin/start_udev &> /dev/null
/u01/racovm/doall.sh /sbin/partprobe &> /dev/null
echo "Complete."
echo "To see the ASM UDEV rules: cat /etc/udev/rules.d/99-oracle-asmdevices.rules"


When complete, the rules file looks something like this (on RH5):

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c3800682301d41f40ce5129d796f", NAME="asm-data-disk1", OWNER="oracle", GROUP="dba", MODE="0660"


KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c38a77610df366b5ce4045e0f438", NAME="asm-data-disk2", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c38d94459b437a48b0d75784d0bf", NAME="asm-data-disk3", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c3803929d52a392a506b75b8fc2d", NAME="asm-data-disk4", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c383a1ab40918dbc2e7a5f8gfb9d", NAME="asm-data-disk5", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c38840c4740546cb2d9874152b98", NAME="asm-redo-disk1", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c38523a828e0bd08637f79862c5a", NAME="asm-redo-disk2", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c38904e50311ce41bd1f8db03ca1", NAME="asm-redo-disk3", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c38d4b7e30a0102afb9934bge9f2", NAME="asm-redo-disk4", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c38bd8e8a59464126630ff37b5da", NAME="asm-redo-disk5", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c38e50ded425980005bb5f685e14", NAME="asm-arch-disk1", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c380a88d61b860gbfab66e4ba2ec", NAME="asm-arch-disk2", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c38e2dceadba28f7bb8144egd67a", NAME="asm-arch-disk3", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c383ae7f7d05b2bc2ded9724c69e", NAME="asm-arch-disk4", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c3866e4127140d48467c09f1363a", NAME="asm-arch-disk5", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/$parent", RESULT=="42000c38bca0e4c305e9236c7d62c553e", NAME="asm-arch-disk6", OWNER="oracle", GROUP="dba", MODE="0660"

High-Performance Oracle RAC on VMWare

It's been a while since I gave the blog some attention. Sorry to those who have commented and that I haven't replied to. I'll try to catch up ASAP. I've been extremely busy with a mass migration of Oracle databases from some IBM P5 595's to VMWare ESX 5, running on beautiful Cisco M3-series (Ivy Bridge) blades and Redhat 6. My client had already moved a few hundred smaller databases to this platform...but moving the high-IO, multi-terabyte databases to RAC on VMWare is a completely different challenge. This kind of migration to blades without VMWare isn't easy...you have to deal with different hardware, a different OS and a different endianness (the etymology of that word is hilarious, by the way). Oracle's statement of direction announcement (see 1089399.1) of not supporting ASMLib after Redhat 5 adds one more complication.

After the architecture has been defined and ESX 5 installed (see vBlock and vFlex references...both are excellent selections, and I've read on The Register they may be offering FusionIO as part of their packages), the next step is to create the VM's.

This is an unpleasant point-and-click process with any new VM...and with even a 3-node RAC, it's too easy to make a mistake. In general, the process is to create the VM's with the appropriate vCPU and RAM you need (calculating that deserves a post in itself). Then create the storage on one of the vm's (call it node 1), then...on each of the other nodes, create disks that point back to node 1's vmdk's. Rinse and repeat for each node and each shared disk. If you're following the SAP best practices for ASM, you'll need to do that for a minimum of 3 ASM diskgroups, each with their own set of vmdk's and datastores. When that's complete, go back and update the other settings per VMWare's best practices for Oracle running on RAC, then make the settings needed per VMWare's hardening guide. To make these changes, there's a button (under Properties->Options->Advanced->General, called "Configuration Parameters") where you can add a line, and add these parameters and their values in a free-form table. If you make a typo, there's no checking...it just won't work. Per SAP best practices, don't forget to reserve 100% of RAM.

...all that to say, it's a time-consuming pain and it's difficult to do without making a mistake. Early on in the project, I decided I would try to find a script from somebody out there who had done this and use their script to create the VM's...but I couldn't find any. In fact, although there were lots of people using PowerCLI and good community support, I don't think I found a single reference where PowerCLI was being used to create RAC nodes. So...I came up with a script to do it. There are 2 sections in it I borrowed from the PowerCLI community...the section that updates the configuration parameters and the section that zeroes out the vmdk's.

The obvious (but not performant) way to create a RAC database in VMWare (as I said above) is to create the vmdk's and then point all the other nodes' vmdk's to their counterparts on the first node. After that you have to set the SCSI controller to allow for "physical SCSI bus sharing." This works...but this is the generic method of sharing storage across VMware nodes. VMware implements a locking mechanism to protect your data while accessing the same storage from multiple machines. If you have an application that isn't cluster-aware, this is necessary. If you have an application that IS cluster-aware (like Oracle RAC), this is redundant. So...for better IO performance in RAC, set up each shared disk with the multi-writer parameter (see below). For that to work, the disks must be eager zeroed. Zeroing out the vmdk's is a best practice you'll see for people who have high-IO VM's (like databases). In vSphere 5, that's called "thick eager zeroed", and it's necessary for multi-writer locking.

There are a couple of key things to keep in mind when working on VMWare with RAC:

1.  Eager zero data storage, as stated above.

2. Sometimes, more vCPU's is slower. The way VMWare's cpu scheduling works is that (in an effort to simulate a physical environment) all of a VM's virtual cores have to be free on physical cores in order to get a time slice. For example, let's say you have a dual-quad blade with hyperthreading turned on (per vSphere 5 best practice), which gives you 16 virtual CPU's. You have other VM's that are busy, and they're using up 10 cores at this moment. You created a VM that can use 8 vCPU's and now you need a single time slice. Your single time slice has to wait until all 8 vCPU's are free before it can proceed. Also, even though you just need 1 time slice, it makes 8 physical cores busy. Not only does assigning too many cores slow your VM down, your VM slows down all the other VM's on the host. You can have an ESX host CPU bottleneck even though the sum total of the cpu used inside the VM's is barely anything (see the esxtop sketch after this list). This means the DBA needs read access to the host performance statistics, a fact that VMWare admins aren't happy about. Blue Medora is answering that call with an OEM 12c plugin for VMWare.

3. In VMWare, ram is shared, but only when it's possible to do so. There are many great memory efficiency features in vSphere 5...but if you follow Oracle best practices, they won't help you very much. Huge pages are used internally in vSphere 5, but if you use huge pages in the VM (for the SGA), it will initially work very well...but as soon as your caches warm up, the SGA becomes distinct from all other memory used on your host. At that point, you get no advantages...so I've found it's better to reserve at least the size of the SGA you'll be using (see the huge pages sketch after this list). SAP's best practice for RAC on VMWare is to do 100% memory reservation for Prod...and there are other performance-enhancing reasons to do that. Besides removing the overhead of the memory saving features, it enables some other vSphere 5 features that improve network latency (such as VMWare's Direct Path I/O, not to be confused with Oracle's definition of Direct Path IO). This can have a huge impact for a busy interconnect in RAC.

4. Many of the VMWare HA features are redundant when you're running RAC. In RAC, if you have a node fail, your processes should fail over to the surviving nodes, and apps keep running. If you're running Cisco UCS, your blade's profile will go to a spare blade, and soon (15 min or so) the failed blade is back in action. VMware HA would restart that VM on a different machine that's still running, and soon (15 min or so) your failed node is back in action, assuming you left sufficient free RAM on the other blades to make that possible. Very smart people disagree about which HA method is best...and I think there are good arguments to be made on all sides. My opinion is that you should provide HA via RAC, because it's instant and it's more efficient. If you depend on VMWare HA with RAC, you have to keep X% of resources free in reserve on all blades...just in case. For example, if you have 2 nodes, you'll need to limit your total vm ram allocation to 50% (maybe a bit less due to the memory tricks VMWare employs...but it's still a large % of your total ram).

If you depend on RAC for HA, you can use all your resources, as long as you allocate enough RAM for the additional processes you'd need in case there's a node failure (surviving nodes would have to absorb the additional connections that used to be on the failed node). This allows for much better efficiency, but it means the surviving nodes need to be capable of supporting the additional load from the failed node.

5. One last thing to keep in mind - if you try to start a VM that puts you over 20TB of open vmdk storage on the host, you get a non-descriptive "out of memory" error, which then references one of your vmdk's, and your VM will fail to start.  When I first saw this I thought...what does "out of memory" - an error associated with ram - have to do with a vmdk?  The answer lies in the VMWare internals...for performance reasons, the storage is referenced in a reserved heap space in ram...similar to the memory used to track AU's in Oracle's ASM.  By default, there's 80MB set aside for that, which is sufficient for 20TB of vmdk storage.  After that, you get the "out of memory" error, and the vmdk that pushed you over the limit is referenced.  That's why a RAM issue references a vmdk.  The solution is to increase the heap size to its max, which is 256MB and allows up to 60TB of vmdk storage per ESX host (see the esxcli sketch after this list).  After that, you need to reconsider using vmdk's.  In my project, we were going to pass this limit...so for that reason (and others) we implemented Oracle Advanced Compression as part of the migration.  Suddenly, the databases that were over 60TB total became much less.  We're averaging a little over 3X compression with our data.
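A few quick sketches to go with points 2, 3 and 5 above...all hedged, with illustrative names and numbers:

# point 2: sample the host in esxtop batch mode and look for the
# "% Costop" (%CSTP) columns; sustained values over a few percent suggest
# a VM has more vCPU's than the host can co-schedule
esxtop -b -d 5 -n 10 > esxtop_sample.csv
grep -i "costop" esxtop_sample.csv | head -1

# point 3: inside the guest, size huge pages to cover the SGA
# (example: 90GB SGA / 2MB pages = 46080 pages -- values illustrative)
echo "vm.nr_hugepages = 46080" >> /etc/sysctl.conf
sysctl -p
grep Huge /proc/meminfo    # confirm HugePages_Total

# point 5: raise the VMFS heap on each ESXi 5 host to cover up to ~60TB of
# open vmdk's (a reboot is required for it to take effect)
esxcli system settings advanced set -o /VMFS3/MaxHeapSizeMB -i 256
esxcli system settings advanced list -o /VMFS3/MaxHeapSizeMB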

With all the changes I've mentioned...hardware platform, os, endianness, asmlib (or the lack thereof), vmdk's, advanced compression...and we're moving some non-RAC databases to RAC, some non-ASM databases into ASM and implementing index compression...how can we guarantee performance on the new platform with so many variables?  We used Oracle's Real Application Testing and tested extensively.  I'll talk more about that later.

In my next post, I'll show you the script I passed off to the VMWare team to create the VM's.  At first they were a bit hesitant that a database guy was giving them a script to make VM's, but after they realized how many hours of their lives they were going to get back (and that it removed human error), they were more than happy to not only use it for the migration, but to modify it and use it for all other VM builds....