Monday, March 21, 2011

Why isn't the optimizer doing the right thing!?!?

The purpose of the Oracle Cost Based Optimizer is to quantify the different possible ways to execute a query and compare those options in order to figure out the best way to run your query. Even for a small query, the options are vast...it's a very difficult job and IMHO, the CBO does a great job when it's supplied with enough accurate information about the problem.  That being said, there are times when its decisions are perplexing, even to people who feel they understand the CBO.
When you go to the optometrist and take a vision test, you run through multiple questions..."Which is clearer, left or right?"  After all these tests, your vision's clarity can be quantified to represent how clearly you see.  Even if your vision is a perfect "20/20", there are people who can see even more clearly than that, so vision is always relative.  I think most technical knowledge is the same way.  If somebody asks you, “Do you understand the CBO?”, you may think the answer is yes, and you may grasp the concepts very well...but *very* few people have 20/20 clarity on the topic. In vision, as in complex topics like the CBO, understanding is relative.
To gain better vision into the CBO, the definitive reference is Jonathan Lewis. If you get a chance to hear him speak or read his books, do it. Wolfgang Breitling is another of my heroes.  I had the opportunity to meet him at Hotsos a few years back...he's brilliant, and like Jonathan Lewis, he approaches being a DBA from the perspective of a mathematician.
There have been many blogs and books by brilliant authors (like Jonathan Lewis, see link to the right for his CBO book) to answer the question, “Why isn't Oracle choosing the right plan?”.  I just want to point you to another source.  After spending years reading several books and tons of blogs...I gained new levels of clarity on the topic when I read Wolfgang Breitling's white paper:

A Look Under the Hood of CBO: The 10053 Event

The copy I got a few years ago was very old even then...but the vast majority of it-if not the formulas, then the concepts-still applies today.  In modern Oracle RDBMS versions with system stats, the CBO's cost is no longer a simple count of I/Os; it's expressed in units of the time needed to read a single block, so it's a little more tightly coupled with the concept of time.

The idea behind the CBO is to quantify and compare multiple potential plans...so whether it's quantified in time or I/Os doesn't really matter...the result should be the same. Keep in mind, though, that it does make it difficult to compare plan costs between older DBMS versions.
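If you want to watch the CBO do this arithmetic yourself, the 10053 trace that Wolfgang's paper dissects can be enabled per session, and the system stats that feed the time-based cost model (sreadtim, mreadtim, cpuspeed, mbrc) are visible in sys.aux_stats$.  A minimal sketch (assumes you have the privileges to set events and query SYS objects; substitute your own problem query):

```sql
-- Dump the CBO's decision process for the next statement that gets hard parsed
alter session set events '10053 trace name context forever, level 1';

select count(*) from dual;  -- your problem query here

alter session set events '10053 trace name context off';

-- The system statistics the time-based cost model uses
select pname, pval1
  from sys.aux_stats$
 where sname = 'SYSSTATS_MAIN';
```

The trace file lands in user_dump_dest; reading it alongside Wolfgang's paper is where the clarity comes from.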

When I get a chance, I'm planning to revisit the concepts of his 10053 analysis a bit on the Exadata platform to see what's changed...I'll post anything interesting I see.

Thursday, March 17, 2011

Who has dba privs?

A database auditor asked me for some quick help answering the question, "Who has DBA privs?"  Somebody from the client's dba team did a simple:
select * from dba_role_privs where granted_role = 'DBA';
That's ok...but some people who didn't show up on that list were still demonstrating DBA abilities.  If somebody is granted a role that's been granted DBA...or a role that's been granted a role that's been granted DBA (etc.), this statement would miss them.  The nesting can go arbitrarily deep, and it can be a way to hide DBA privs from auditors if they're buried deeper than the person searching for the priv is willing to check.
This is a hierarchical query problem.  Traditionally, hierarchical queries are used to answer questions like, "Who is under the CIO?"  Direct reports are easy...but a person who reports to a person who reports to the CIO is more difficult...the more layers, the more difficult it is.
The first time I had to do this was on a SQL Server ERP I was supporting...I had to write a query that went 6 levels deep...it was complex and ended up being about 2 pages long.  To do this on SQL Server, you have to find the root node and then all the leaf nodes for all the levels of the tree. Quite the pain. About a year later we migrated that ERP to Oracle...the logically equivalent statement in Oracle was just a few lines.  This is because Oracle's SQL extensions, "CONNECT BY PRIOR" and "START WITH", make this relatively easy.
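To sketch how those extensions work, here's the org-chart version of the problem (the employees table and its columns are made up for illustration):

```sql
-- Hypothetical table: employees(employee_id, manager_id, name)
select lpad(' ', 2 * (level - 1)) || name as org_chart,
       level
  from employees
 start with manager_id is null              -- the CIO sits at the root
connect by prior employee_id = manager_id;  -- walk down to direct and indirect reports
```

LEVEL is a pseudocolumn Oracle populates as it walks the tree: 1 for the root, 2 for direct reports, and so on...which is exactly what we need to measure how deeply a privilege is buried.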
The request from the auditor was to find all the people who have sysdba, sysoper or dba privs, no matter how they got them.  (note: The statement below will display people twice if they're granted sysdba/sysoper AND a deep dba role...the auditor was ok with that)
select username, 1 level_deep
  from v$pwfile_users
union
select grantee, max(level_deep)
  from (select distinct level level_deep, grantee, granted_role
          from dba_role_privs
         start with granted_role = 'DBA'
       connect by prior grantee = granted_role)
 where grantee in (select username from dba_users)
 group by grantee
 order by 1;
With variations of this statement, he was able to find how some people who weren't directly granted DBA were able to use DBA privs.  He even found one guy who had a dba role that was granted 8 levels deep...(i.e., he was granted a role that was granted a role (repeat 7 times) that was granted DBA!)  That would never have shown up on the older audit reports.
Hopefully this will help some of you tighten up your db security and aid in your hierarchical problems.

Thursday, March 10, 2011

Exadata Backup and Recovery-1

In theory, there's no difference between theory and practice, but in practice, there is.  When theoretical maximums are used, life occurs, and reality sets in.  Seldom do companies even have an agreed-upon RTO, and even more seldom are RTOs proven out before a disaster occurs.  An RTO that isn't proven is a guess...and even an educated guess based on published theoretical maximums is only a guess at the best-case scenario.  My current client has quarterly requirements to meet the RTO, and if the recovery plan can't hack it, adjustments (even if that means additional hardware purchases) must be made.  Designing a new backup infrastructure and recovery plan for this Exadata environment needs not only to be precise, it needs to be proven.

In my previous posts on virtualized storage (specifically the Hitachi USP-V), I talked about doing Oracle database backups via snaps on the virtualized storage.  They make life much easier for the DBA, with interfaces directly into RMAN.  Unfortunately, until Oracle buys or partners with a storage vendor (I'm hoping for NetApp) to bring that technology to Exadata, we're limited to traditional backup methods.

For the last several months, one of my focuses has been on meeting the recovery window set out by the business.  At first the requirement was: worst case scenario, if we have to do a full recovery, everything needs to be running within 4 hours of a disaster.  This is an OLTP VLDB that generates about 1300GB of archivelogs/day.  Our RTO is a RECOVERY time objective, not a restore time objective...so the archivelog apply rate on Exadata is relevant to my issue.

Archivelog apply rates depend primarily on 2 big variables...the hardware speed and the type of DML.  A database with a DSS workload will apply much faster than an OLTP database on the same hardware.  Other factors affecting our apply rate are flashback database and GoldenGate, which requires supplemental logging.  The 1300GB/day archivelog generation rate is less than what it's going to become...how much more is anyone's guess.  Our on-site Oracle consultants are guessing 2X, but I think that's an over-estimate to play it safe.

To understand the size of the data that needs to be restored, we have more variables.  The uncompressed size of this database is expected to be many petabytes (projections are in flux...but more than one, less than 5).  We've been sold a single full-rack Exadata machine (in an 80% data, 20% FRA configuration), with a contingency for more storage cells as needed over the next 3 years.  After 3 years the growth rate accelerates, and we'll add more Exadata machines to the mix then.  Depending heavily on HCC and OLTP compression, and based on growth projections, we're going to say that in year 1 we have only 20TB after index drops and compression.

This database is going to be a consolidation of 22 smaller databases.  The plan is to have a full backup once a week and cumulative incrementals the other 6 days.  Based on the current activity in those databases, we're going to say our incrementals will be at worst 6TB.

This brings up an Exadata rule of thumb...block change tracking (BCT) hugely improves incremental backup speeds.  Basically, as a change to a block is made, a set of blocks is marked "changed" in a tracking file.  The next time you do an incremental, only the block sets in the BCT file are read for backup.  Storage cells have similar functionality...in Exadata, if you aren't using BCT, the storage cells will find the blocks that have changed and return only the changed sets of blocks.  Neither method marks individual blocks as changed...blocks are grouped together, and if one block changes, the whole set is marked changed.  Storage cells track changes in smaller sets...so they're more granular than BCT.  According to "Expert Oracle Exadata" (link to the right) by Kerry Osborne, Randy Johnson and Tanel Põder (I'm not sure which author said this, I have the alpha version of the eBook...which anybody with interest in Exadata should buy), using BCT is still faster if less than 20% of your blocks have changed...after that, it's faster to offload the functionality to the storage cells.  So, based on our 6TB/20TB per week of changes...it would likely be best for us to use BCT early in the week and the storage cells' incremental offloads later in the week.  We'll have to test it.
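For reference, enabling BCT is a one-liner (the ASM disk group name below is just an assumption for illustration; drop the clause entirely to disable it):

```sql
-- Turn on block change tracking; Oracle manages the tracking file in the
-- destination you name (here, a hypothetical +DATA disk group)
alter database enable block change tracking using file '+DATA';

-- Confirm it's active and see where the tracking file lives
select status, filename from v$block_change_tracking;
```

Once it's enabled, RMAN uses the tracking file automatically for incremental backups...no change to your backup scripts is needed.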

So, worst case scenario, we need to restore a full that's 20TB plus 6TB of incrementals, and then apply ~1.3TB of archivelogs, all in 4 hours.  Even ignoring the redo apply time, restoring 26TB in 4 hours demands roughly 1.8GB/s of sustained restore throughput.  To complicate things, we can't back up to the FRA because our database is too big.  Using the 80/20 configuration, we'll only have ~9TB (20% of 45.5TB in the High Performance machine).  We can always add more tape drives and media servers to make the restore faster; my biggest concern is the redo apply rate, since it's mostly a serialized process...it's not scalable.

We're buying a ZFS storage server (Sun 7420) with connections via InfiniBand for "to-disk" backups, and an SL3000 tape library with 2 media servers and an admin server.  We'll be utilizing a new 11g HA OEM setup on its own hardware to monitor the backup solution (and all the other Exadata goodies).  Although it's nice to have 96TB waiting around to be used, it's going to go fast as a backup destination.  Using this as the backup destination limits our options...it takes away the best option: merged incremental backups.

Merged incremental backups allow you to take a full backup once (and only once...ever), and then always apply incrementals to that image copy to keep it up to date.  It's the fastest way to do backups, and it's Oracle's recommended backup methodology.  More importantly, when the time comes to do a restore...say, of a single datafile...instead of restoring the file from tape or disk, you go into RMAN and say "switch datafile 5 to copy;"  Whatever archivelogs are needed to recover that datafile are applied, and without even physically moving the backed-up file, your database is back to 100%.  The backed-up copy is now the active file that your database is using.  While the database is up, you can then go through a procedure to move the file back to its original destination...but your uptime is maximized.  Obviously, when using Exadata, the features you depend on from the storage cells only work on things that are stored in the storage cells...so although "switch to copy" works, it isn't a practical option for Exadata when you aren't using the FRA.
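For anyone who hasn't seen it, here's a minimal RMAN sketch of the merged (incrementally updated) backup strategy and the switch-to-copy recovery; the tag name and datafile number are illustrative:

```
run {
  -- Roll the previous level 1 into the image copy, then take a fresh level 1.
  -- The first run creates the level 0 image copy automatically.
  recover copy of database with tag 'incr_merge';
  backup incremental level 1 for recover of copy with tag 'incr_merge' database;
}

-- If, say, datafile 5 is lost, point the database at the image copy
-- instead of physically restoring anything:
--   switch datafile 5 to copy;
--   recover datafile 5;
```

Run nightly, the image copy is never more than one incremental apply behind the database, which is what makes the switch-to-copy recovery so fast.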

Originally, the recovery window was 4 hrs, so Oracle presented us a solution with 2 heads for a ZFS server with many spindles, 4 media servers and 28 LTO5 drives (sadly, the T10KCs won't be out in time for our purchase).  After the business recovered from the sticker shock, the recovery window was increased to a more reasonable 8 hrs.  Oracle then presented a scaled-down (cheaper) solution with 8 LTO5s and a single-headed 7420 with 48 spindles.

The ZFS server will have 2 trays of 24 2TB disks...which I'm told will be the bottleneck.  The single head on the 7420 can do 1.12GB/sec...so unless the disks are *really* slow (even at 50MB/s each, 48 spindles aggregate to ~2.4GB/s), or the ZFS overhead is extremely high...I think the head will be the bottleneck...but we'll see in testing.

In this design there were a lot of theoretical maximums used.  We've insisted the design be tested and benchmarked before we purchase it, so Oracle has been good enough to set up a similar set of hardware in their lab in Colorado to prove out the numbers we'll see for our recoveries.  If Exadata backups interest you...I highly recommend you read the new white paper that was published in the last few weeks on the topic at http://www.oracle.com/technetwork/database/features/availability/maa-tech-wp-sundbm-backup-11202-183503.pdf.


Wednesday, March 9, 2011

Quick plug for a great book

It's not out yet, but if you're interested in Exadata, check out Expert Oracle Exadata.  You can download the alpha book NOW, and eventually when it's finished you'll get access to the full ebook.

http://apress.com/book/view/1430233923

I just signed up for the alpha program and d/l'd it...I can't wait to read it...yet I'm ashamed that I would be this excited over a book that's so geeky.  I wonder if it comes in the original Klingon print?