Friday, December 10, 2010

GoldenGate Tips and Tricks

GoldenGate was bought out by Oracle last year. Around that time, I was asked to take a look to see how it could be used as a Materialized View (aka Snapshot) replacement. As we took a closer look, we found more and more uses for this great product and the project sprawled. The more its around, the more uses we find for it. Its worth every penny...but like everything else...don't just install it...research, test and find its limitations before you depend on it.

Some of the details are possibly out of date on this post...I've put off writing about GoldenGate for reasons stated below. Forgive me if this seems harsh...I hold the highest respect for the people who work at Oracle and the work they do to bring excellent software to the marketplace. I've made my career off of them, I know many of them and...besides the occasional bug, I've never had a genuine negative experience with them. My complaints below relate to the now merged company Golden Gate, and really, only with Director (licensed by the Golden Gate mgmt pack)...Golden Gate (the product) itself is excellent. I've spoken with employees at Oracle who agree with my opinions, I even had to cover for an Oracle consultant who was threatened with being fired for telling the truth about Director.
Before I get into the details of why GG is great, let me tell you why it isn't (and then I'll tell you how to get past the problems.) The difficult part of it is monitoring and administration. When Oracle purchased GoldenGate, along with it came a product called Director. Director is a Web Logic based application that promises to allow people to make changes to GoldenGate configuration files and monitor their application. This is necessary unless you're ok with your app dba's and developers getting Unix-level access to your database servers. GoldenGate is wonderful, but Director is probably the biggest software disappointment I've ever seen. On one day I launched 13 SR's with bugs to Metalink...not little things...each of them deal breakers. Some of the issues I found:
1. Every few days Director crashes...you have to restart the WL instance to bring it back up.
2. The security layers (different users can do different things) they have no longer worked in the version we were testing...and I was told they hadn't worked in several previous releases. This means every Director user had full authority to do anything to any configuration on any database, and start and stop replication.
3. If more than 2 people try to use the web app at a time, their GUI configurations become in conflict (due to the shared db connection Director uses to its repository) and the first person in has all their icons moved on top of each other to the top left corner of the screen. We have some complex configurations with hundreds of icons...so this means you have to drag the icons back out to where they were one at a time...taking sometimes 20 minutes...and this happens often.
4. There's no logging of who does what...so I can go into your configuration and make changes and nobody will know. Its also very difficult to trace back causality of bugs and issues.
5. [UPDATE 10/6/2011-There is a huge security hole that has been around for many versions.  I brought it up with an SR and I know it was brought to the attention of the GG product manager, but they chose to not correct the issue.  At Openworld 2011 I talked to Mr Screvens, who seemed genuinely concerned about the issue.  I think its finally going to be addressed, but in the meantime, there's an undocumented work around you need to use to protect yourself from the issue.  I'll blog about that asap.]
6. Of the *many* SR's I created in total, very few of them were corrected because Oracle was coming out with a product to replace it called Monitor. I was told at the time it was due out in the fall, now I'm hearing early spring, although the beta is available now. My impression was that Oracle was picking and choosing what they were willing to fix, since they knew it was going away soon. I haven't had a chance to test out Monitor yet, so I can't say if these issues have been corrected...I hope they have been. GoldenGate is a truly great and useful product...it just needs some attention to details that it wasn’t getting under the watch of GoldenGate, the company. Oracle, in my opinion, is usually very good at meeting user needs and correcting security issues, so I hope they'll pay attention to these issues.
The reason I waited almost a year after notifying Oracle of these problems is because I consider publicizing security issues to be irresponsible. I notified Oracle and I put off writing about these issues to give them time to correct them...after a major release (11) and several minor releases, they've hopefully corrected them. If they haven't, people need to know how to protect themselves from vulnerabilities while setting up GoldenGate.
...enough about the bad. How can you get past these issues, what are the tips and tricks? I'll cover them in my next post.


Thursday, December 9, 2010

Virtualized Storage (USP-V) for Oracle Databases-4. thin provisioning

Let's say you need to build a database that's expected to be 7 TB within 2 years. When you create the database server, you tell the storage team or administrator you need 500GB times 20 luns, for example (10TB). You know you need less than that, but its better to be safe than come back for more space later. In a nutshell, thin provisioning is when you're presented with the luns you asked for, 10TB in our example...but physically, you only allocate storage as its needed from the storage. If you only use 2TB the first year...although your server thinks it has 10TB...you're physically only allocating 2TB leaving 8TB for allocation by other servers.

The concept of thin provisioning is motivated by the fact that every server in an enterprise has unused, but allocated space every time a lun is carved out. On local storage, this is just something you live with, but on expensive san storage, this adds up to a huge amount of space...an independent company recently did an audit and claimed upwards of 40% of our storage is allocated, but unused. This is millions of dollars of underutilized storage.

As a database guy, my instinct is to say there's no such thing as wasted storage...the empty space means I'm occupying more spindles giving me the potential for more database speed...but the fact is that empty space is unmanaged. We always need free space for growth, but by the time you sum up empty space in blocks, tablespaces and storage...you're talking about a huge amount of space. It would be better to occupy that space with a managed ILM approach...sharing rarely accessed data with highly accessed data, or packing it in on storage with enough speed to handle it. I don't love the concept of sharing storage with other systems, but if there's enough cache that its transparent to me...no harm, no foul.


There are a lot of details (as it relates to Oracle databases) around storage virtualization that haven't been documented, so I wanted to dive a little deeper into it...specifically around thin-provisioned ASM. In the previous post we looked at thin-provisioned overhead on performance...so that aside, how, and when, is space allocated at the storage layer? How do you become un-thin? What can we do to prevent this and when it happens, what do you do to correct this situation?

We know storage is allocated on the USP-V as its requested in 42MB pages. Once its allocated, it stays allocated...so if you create a 10GB file and then remove it...you're still allocating 10GB. From my tests on NTFS...an OS block is created, then deleted, then created in the next free space for it. This is great for a hard drive because it spreads out the wear. Most SSD's do this internally anyway...but for thin provisioning this is terrible. NTFS with any activity won't remain thin. Oracle ASM is the opposite...it uses a space allocation algorithm that dictates that as storage is written (and allocated), then deleted (but still allocated on storage), then written again, you'll be writing to the exact space that was already allocated in whatever increments you set for your AU size. The problem can still exist that you allocate a lot of space, then remove it (like for archivelogs)...in which case you're no longer thin. The cure for this is the ASRU utility developed by Oracle and 3PAR. Its a pearl script that will shrink all the ASM disks to their minimum size, write (extremely efficiently) 0's to the remaining free space in the diskgroup, then return the ASM disks to their previous size. When this is complete, the storage administrator can click the Zero Page Reclaim button (why isn't this scriptable, HDS?!) and reclaim as free all the space that's free to the storage pool. I had a meeting at Oracle Openworld with reps from Netapp about why this process is necessary...from the ASM instance we can recognize all the AU's that are occupied and their physical addresses...so we should then know all the space that's not being used. In addition, there's an algorithm for AU allocation in Oracle...not only do we know from the ASM instance what AU's are currently being used, from the algorith we can deturmine were the next 100 AU's will be placed. The ASRU utility and the administration associated with it shouldn't be necessary. The first storage vendor to recognize that and implement it wins the database storage market...it'll save companies millions of dollars on new storage.

That being said, its all we have today, so we tested it. My test diskgroup had 350GB...I had 100% allocated on storage but only 20% utilized by the database. The ASRU utility finished in 24 minutes, ZPR took 67 minutes and reclaimed 80% of the storage. To locate the ASRU utility was painful...all the links to it were dead that I found on the internet. I opened an SR and was given a link to an internal Oracle website (which obviously didn't work for me.) Eventually Oracle Support was able to supply it to me. If you need it, let me know.

We created a 10GB datafile to see how much space was allocated. Physically on storage, there are occasional very small writes, even in empty non-sparse tablespaces. The USP-V allocates at a minimum 42MB...so in the end our 10GB empty tablespace physically allocated 7770MB. Lesson learned here...create smaller datafiles with autoextend on.

Most people recognize that a rebalance happens when you add a lun to an ASM diskgroup...to me the exact process of how it moves the AU's to the new lun is unclear...it isn't a direct read and copy...there's a lot more activity than that. My test diskgroup was about 50% full of data and I added an equal amount of space to it. What you end up with is a diskgroup that's now 25% full (according to ASM), but 100% allocated at the storage level. You can correct this with ASRU, but that's a manual process (due to the gui-only mgmt interface HDS provides.) The work around to this is to grow the luns in the USP-V and then resize the ASM disks to their new larger size. This process is officially supported by Hitachi on NTFS only. We tried it with AIX, but after the luns grew (from the USP-V's perspective), the lun sizes never changed at the OS level, so there was nothing ASM could do. Our P5 is using VIO servers which act as a middle man between the luns presented by the USP-V and our LPAR...we suspect that might have been our issue, but due to time constraints we were unable to verify this. This is a problem for thin provisioning on AIX, it makes thin provisioning very difficult for databases using ASM...at least, keeping them thin after a rebalance. We wanted to test this on Suse Linux 11 running on a Cisco UCS, but we discovered SUSE isn't bootable from the USP-V...although a patch is expected within the month.

I remember the first time I explained the concept of thin provisioning to the database tech lead at a client site. Her reaction was a mix of confusion and horror...followed by, "What happens if we try to allocate space and its not really there?" ...that was our first test.

What happens when you call for space that the OS thinks is available, but it isn't due to thin-provisioning? I set up an 11.2.0.2 GI ASM diskgroup using a 10GB lun in a 5GB storage pool. I started adding data until eventually I was trying to allocate storage that didn't exist.

1. I was told by the storage team that at least 4 emails were sent to them to warn them of the situation.
2. The database froze, and I called the storage team and had them add some space to my storage pool...which they did, and the USP-V seemed content without an issue...unfortunately, the db didn't get the memo that the crisis had passed...it was still frozen.
3. I logged in to unix and I could do things like cd and ls...but anything else would give me "IO ERROR". I was working in an OS mount that wasn't in the same storage pool I had tested on...so this surprised me. Eventually the db gave the error "ORA-01114: IO error writing block to file 4..."
5. I decided to bounce the database...but I couldn't log in and create a new session to issue a shutdown command.
6. I killed PMON at the OS level...and it died...but all the other processes were still up!
7. We then bounced the LPAR...first a "soft" boot (which didn't work), then a "hard" boot. When the server came back...several luns were missing...and we later determined, completely corrupted.

Luckily I had scheduled this test right after the "merged incremental backup" test in the previous post...because I was forced to do a complete restore/recovery of the database. After more analysis, I was told the scci driver on the frame had locked up, which affected all the storage on our testing frame. Lesson learned...never, ever let your storage pool run out of storage.

Conclusion: Thin provisioning is in its infancy...only recently has the linux foundation implemented SCCI trim extentions to allow the OS to notify the storage server of the deletes...to my knowledge no distro's have implemented this yet, although I discussed it in a meeting with Oracle VP Wim Coekaerts to be added to their new kernel. Potentially ASM and databases could work very well with it but there's a trade-off between additional maintenance to keep things thin vs storage savings. If your database isn't growing, this might be an option for you...keeping in mind the overhead seen from the performance tests in the previous posts. Its possible that a different OS or even AIX not using a VIO server would have been more successful in these tests. This isn't the USP-V's fault since it works with NTFS...but our big databases are in *nix...so for practical purposes, it didn't work. The Storage Guy has some interesting points on this topic...read about them here.

In this series:
HDS USP-V Overview for Oracle DB
HDS USP-V Performance for Oracle DB
HDS USP-V Features for Oracle DB
HDS USP-V Thin Provisioning for Oracle DB

Wednesday, December 8, 2010

Virtualized Storage (USP-V) for Oracle Databases-3. backup/clone

Now that the performance tests are out of the way (see previous post), we can take a look at the virtualization features found in the USP-V as they relate to databases. Incidentally, I'm told the USP-V and the new VSP are virtually (pun) identical in their usage. There are a few changes and differences, such as page level auto tiering is only found in the upcoming micro-code update of the VSP, and the VSP uses more energy efficient, more dense 2.5" drives.

Two features we need from the USP-V are faster backups and faster database refreshes.

Today, clones are accomplished through an RMAN duplicate procedure I put together. Its still limited by the 10GB network (which I used to think was a lot). The requirement is to refresh Prod to Test-take 22 databases, (about 30TB) and refresh them all, keeping them in sync with each other because they have interdependencies. Although you can do an RMAN duplicate to a point in time, there's no RMAN ACTIVE duplicate to a point in time...and restoring (or RMAN duplicating) 22 databases at once would be a huge strain on backup resources. What I came up with is an RMAN active duplicate to 22 standby databases that are each using flashback database. After the RMAN "active duplicate for standby", eventually all the 22 standby databases are current and shipping logs. You stop them, flash them back to the same moment in time, and start them up. Viola...RMAN active duplicate of an entire prod environment to a point in time. :) I should blog about the details around that someday....

These databases are growing quickly (growing by factors over the next few years)...we won't be able to get this done inside our maintenance window next year. Storage virtualization to the rescue....

I ran ~30 tests, I'll just give you the highlights and lessons learned here. There's a feature in NetBackup that will interface between the storage array and RMAN. When RMAN runs a "backup proxy" command, NetBackup passes the details to the storage array, and the storage array does the work. The net effect of this feature is that multi-petabyte databases can have a backup taken in a few seconds by creating a set of pointers to the original storage and tracking deltas after that. An Oracle DBA, who doesn't have access to the Hitachi command interface, can initiate that backup from the familiar settings of Oracle's RMAN. When it comes to restores, the process is basically reversed and your MTTR is reduced to the time it takes to apply archivelogs since your last backup. There's also a feature in the USP-V that allows the storage-level equivalent to the block change tracking feature dba's commonly use in Oracle...only in the USP-V, the smallest denominator is a 42MB page instead of an 8k db block. Since the deltas are tracked after the first backup, only a small percentage of your data (the data that's changed since the last backup) needs to be backed up. The 3rd option is to keep 2 lun sets in sync.

As multiple databases at this client's site are growing to multiple petabytes, this feature holds great promise. I wanted to compare the differences between the current 2tape backups to two alternatives...Oracle's recommended backup strategy which they call merged incremental backups and Hitachi's methods described above.

The database I tested with is around 12TB. Its backed up to 4 LTO-4 drives. I established a baseline by reviewing backup times and performing a baseline restore/recovery. This is a very busy OLTP database, and the backup times for it vary widely around 14hrs...I suspect that at times the drives are attempting to write faster than the database can feed them. Eventually the cache on the tape drives runs dry and they have to stop, rewind and begin writing again where they left off. Its ironic that backing up a busy database that's barely able to keep its head above water is sometimes faster with slower tape drives than it is with fast ones. Anyway, the restore baseline finished in 8.5 hrs, the recovery took 33.25 hrs. The full backup was taken about a week prior to my restore point, so many, many GB of archivelogs needed to be applied. After speaking with the application manager, the cost of 41.75 hrs of downtime would cost the business more than the purchase price of the Hitachi...so this feature alone could justify its purchase, all other things being equal.

The first lesson learned came when I tried to add 16, 1TB luns to my ASM diskgroup. Although the published limitation is 2GB/lun for ASM, the error, ORA-15099, reported that I was adding a lun larger than ASM could handle. Doing a:

./kfod di=all

...from the grid home, I was able to see that Oracle was reporting the luns to be 4TB in size, not 1TB. I verified the correct size with AIX's bootinfo, then I created an SR. Oracle identified it as bug 10072750 and they're creating a patch for it. Hopefully it'll be ready before you encounter this issue.

The work-around is to specify the disk sizes (less than 1TB) when you create the diskgroup and add the disks...so now I have 16, 1023MB disks in 2 diskgroups, data and fra.

There were complications that prevented the Backup team from applying the NetBackup feature that allows proxy copies to the media server due to a prior issue. It would have been easier for me, but for my purposes, I just need to be able to backup a database using virtualization features. So with a little coordination with the storage team, we were able to manually interface with the storage array and get this to work. Essentially the procedure is:

1. Establish a consistency group (this will mirror 2 sets of luns)
2. Place the database in backup mode
3. Take a snap (this takes a few seconds)
4. Take the database out of backup mode

Putting the db in backup mode would increase this database's archivelog generation...currently reaching 60GB/hr. Since we manually did this we were admittedly a bit clumsier and less efficient than we could have been. I'm told by HDS experts this process, when done in RMAN, places the database in backup mode for no more than a few seconds no matter how big your database is.

Since we were manually doing this we had some options. The USP-V can do "Shadow Images" (which is a storage-layer mirror), or snaps (which are copy on write, tracking deltas). We need to also do a cloning test so we went the Shadowcopy route. To have a point-in-time capability, we would have to do snaps instead.

For the restore, we had the mirror lun set presented to our 2nd lpar on the frame. The storage team split the consistency set in a few seconds and they were done. Once I set up the init.ora/oratab settings we had restored the database to a new LPAR for the restore and since its on a different LPAR, I'm calling it a refresh/clone too. The MTTR is now zero, because there's no recovery to speak of (the database was in a current, crash-consistent state.)

Lastly, I did a test of the RMAN merged incremental backup. The basic procedure of a MIB is:
1. You turn on block change tracking
2. Do an image backup of your database (probably to cheap sata storage)
3. Schedule a "merge"...a recovery of the copies of the datafiles you just made.

With block change tracking...this is a very inexpensive process...normally taking a few minutes on TB-sized, busy databases (depending on activity patterns, of course). In the case of a corrupted datafile, you just have to "switch" (an RMAN command) to the copy, and RMAN will recover that file to be in sync with the rest of the database and start using it. While that's happening, you can restore and recover the original, then switch back. In the event that the entire database is lost, you can "switch database to copy" and the same process takes place.

I've always been a fan of this backup method, and if it weren't for the Hitachi's features, this would still be the way to go. The restore time was eliminated...but they don't say Mean Time To Restore...its Mean Time To Recovery. With block change tracking turned on, the merge took only a few minutes. When it was time for the database switch, since I recommended the merges be scheduled hourly, only 1/2 hr (avg) of archivelogs would have to be applied...compared to the 7 days of archivelogs from the baseline, that's fast to apply. From simulated crash to open, the restore/recovery of the 12TB database took a little under 5 minutes. Like the USP-V features, this option for backup scales...but it also has the advantage of being free. :)

Conclusion:
The USP-V was able to complete a restore and a clone instantly...this solution is scalable up to many, many PB...so its up to the task of keeping us in the maintenance window indefinitely, and it can do it with the simple RMAN interface a DBA likes to use without giving them storage array access. A close 2nd option is a merged incremental backup strategy...but this wouldn't help us with clones/refreshes.

During step 1, the mirror set is in read-only mode. This gave me an idea on how to cost avoid some licensing for the Golden Gate implementation...I'll have to test that out a different day.

In this series:
HDS USP-V Overview for Oracle DB
HDS USP-V Performance for Oracle DB
HDS USP-V Features for Oracle DB
HDS USP-V Thin Provisioning for Oracle DB

Tuesday, December 7, 2010

Virtualized Storage (USP-V) for Oracle Databases-2. Performance

Based on white papers re:Hitachi USP-V best practices from Oracle, I expect to see some response time increases but the performance increases due to cache and the way the USP-V stripes the data across the storage should make the response time issue nearly undetectable. I ran 14 performance tests on multiple tiers of storage - SATA, Fibre Channel (in Raid 5 and 10 configurations) and SSD, each tier was tested using Orion (simple, matrix and dss modes) and Swingbench running on 11Gr2. From the database perspective, the most important item in the results is the Max Transactions per Minute (Max TPM) reported from Swingbench running on an actual 11.2.0.2 Oracle database using ASM. Why ASM? Raw storage means no file system caching, so the results I get are truely from the storage.

I started the first test and right away I noticed the storage wasn’t my bottleneck so I increased the virtual CPU’s in my LPAR to 8 and after test 2, I lowered the db_cache to a tiny 256MB. At this point, when I ran test 3, nearly all the IO was physical IO.




Test Number



Storage




Spindle Type



Raid Level




Virtualized



Cache(GB)



IOPS




MBPS



Max TPM




Max TPS



Avg TPS



Avg Resp




Disk Busy %



CPU%



1



Clariion FC




FC



5



No




4



1403




117.46



69413



1294




1083



65




30



95




2



Clariion FC



FC




5



No




4



1403



117.46




102911



2182




1450



51



24




87



3




Clariion FC



FC




5



No



4




1403



117.46




73011



1368



1087




66



100




84











The results of test 3 will be my baseline for future tests.

To see the virtualization overhead and how effective the USP-V cache is, I ran Swingbench on a database with Clariion storage in a USP-V cache “pass-through” configuration:



Test Number



Storage




Spindle Type



Raid Level




Virtualized



Cache(GB)



IOPS




MBPS



Max TPM




Max TPS



Avg TPS



Avg Resp




Disk Busy %



CPU%



4



Clariion




FC



5



Yes




pass-thru



N/A




N/A



62895



1190




984



78




42



71.4






The next test results show the Clariion behind the USP-V’s massive 512GB cache, all other things equal to the test above. The Clariion has 4GB of cache itself-this is Hitachi cache feeding the Clariion cache, feeding 10K FC spindles.



Test Number



Storage




Spindle Type



Raid Level




Virtualized



Cache(GB)



IOPS




MBPS



Max TPM




Max TPS



Avg TPS



Avg Resp




Disk Busy %



CPU%



5



Clariion




FC



5



Yes




512



1207




120.28



72666



1319




1087



65




94



85






As you can see above, the Max TPM increased and the avg response time improved. The next test is similar to the previous, but this time using a Hitachi AMS unit loaded with 7.5k sata disks. This was more of a functional test of the AMS unit...obviously you can't compare its performance to the EMC Clariion since the Clariion test above had 10k FC spindles. Ignoring that, I wanted to see a "worst-case" scenario of using sata disks for the databases.



Test Number



Storage




Spindle Type



Raid Level




Virtualized



Cache(GB)



IOPS




MBPS



Max TPM




Max TPS



Avg TPS



Avg Resp




Disk Busy %



CPU%



6



AMS




SATA



5



Yes




32



428




53.96



31853



963




476



149




100



85






The next set of tests were done on storage internal to the USP-V, virtualized and thin-provisioned with a 4GB clipper (cache partition) for the first 3 and the full 512GB clipper for the last 2:




Test Number



Storage




Spindle Type



Raid Level




Virtualized



Cache(GB)



IOPS




MBPS



Max TPM




Max TPS



Avg TPS



Avg Resp




Disk Busy %



CPU%



7



Internal




FC



5



Yes




4



1396




133.12



64168



1155




954



75




35



69.1



8



Internal




FC



10



Yes




4



2077




179.77



66451



1185




979



73




45



70.6



9



Internal




SSD



10



Yes




4



5122




214.70



64044



1171




966



75




43



69.9



10



Internal




FC



10



Yes




512



9486




308.92



66277



1166




956



74




45



71.3



11



Internal




SSD



5



Yes




512



8060




252.61



58844



110




935



81




43



72.8








In this series:


HDS USP-V Overview for Oracle DB

HDS USP-V Performance for Oracle DB
HDS USP-V Features for Oracle DB
HDS USP-V Thin Provisioning for Oracle DB