Monday, February 21, 2011

Exadata-index dropping and compression

As I said in the previous posts...there are a lot of variables in play as we move to Exadata...the two I'm most concerned about is the effect of overhead due to OLTP compression and the effect of dropping many of our indexes.

I've never read anything from Oracle that says, "If you have Exadata you don't need indexes." I've heard it implied though, in the context of storage reduction. Data Cells are licensed by the spindle, so if you need more storage, it gets expensive. When you bring this up, the sales guys will talk about how you'll get great compression rates with HCC (which is true) and you'll reduce space because you can drop indexes. To no fault of Oracle's, some people extrapolate that idea and think, "I can drop all indexes...or most indexes."

Before we look at dropping indexes, let's start out with a baseline. I've installed the SOE schema for Swingbench...I'm hitting the scan listener of a High Capacity half rack of Exadata using 500 sessions. Some swingbech settings I've modified:

Number of Users: 500
Min Delay between transactions: 0
Max Delay between transactions: 1
Logon Delay: 10ms

Almost all defaults are being used for the Exadata machine. There would be huge gains in performance if this was tweaked...but for our purposes, this setup is fine.

When I first ran this test, the best I could do was about 310K, which is blistering fast...but then I learned we had a hardware failure that was preventing the use of one of our storage cells. The failed equipment was replaced and I re-ran the test.

I was able to get just over 400K transactions per minute (in red above)...that's very fast. For a frame of reference, our legacy system test was on an LPAR (IBM's VM) with 16 CPU's on fast 15k disks using an EMC Clarion CX4...definetaly no slouch, but it was 5 years old, and it was only able to produce ~63K TPM.

OCS' plan is to compress all tables (that aren't being replicated by Golden Gate) with either OLTP or some form of HCC compression and to drop all indexes that aren't either PK/UK indexes or indexes supporting FK's. The patch for Golden Gate that allows it to mine compressed tables still hasn't come out, so that's why those tables are being skipped. We may use Stream on some OLTP-compressed tables, since the version does have the ability to mine archivelogs of compressed tables. Let's try some OLTP compression on Exadata. I altered the tables to be "compress for OLTP."

select 'alter table SOE.'||table_name||' compress for OLTP;'||chr(13)||'alter table SOE.'||table_name||' move;' from dba_tables where owner='SOE' and partitioned='NO' union all select 'alter table SOE.'||table_name||' modify partition '||partition_name||' compress for OLTP;'||chr(13)||'alter table SOE.'||table_name||' move partition '||partition_name||';' from dba_tab_partitions where table_owner='SOE';

...and then I rebuilt all the indexes.

The new baseline test results are nearly 450K trans/min! This is the fastest Swingbench score I've ever seen, and this is only on a half rack of relatively slow 7,200 RPM high capacity SAS storage! The other 3 high performance machines just arrived today...I can't wait to see what they can do! Granted, I expect there was serious caching in the db_cache...but this still says a lot about the compute nodes, since decompression can be done on either the compute or storage nodes and all compression is done on the compute nodes.

Next, (as is planned by our Oracle architects), I mark "unusable" all the indexes that aren't PK/UK's or supporting FK's. I'm leaving the function-based index usable though.

I restart Swingbench and begin the "index reduced" run...this is what I see:

All 96 threads on all servers are pegged at 100% user (in green above)...almost no CPU is used by system or IO waits. My response time is through the roof reaching 1266ms (red line above), and Transactions/Min has plummeted to less than 20K.

I ran ADDM/AWR reports and checked out the high waits during the "reduced index" test. What's happening is that, without indexes on the columns used in where clauses, we're forcing full table scans, but only 2% of the time is being spent actually doing the FTS.

This is exactly why I wanted to run these test. All over the internet you can find results of tests where somebody executes a single query to encounter a storage index with timing...things act differently when there's contention in an OLTP environment.

The high wait was "enq: KO - fast object checkpoint". One of the features of 11g is that the optimizer can choose to use direct path reads even if you don't specify the append hint or make the transaction parallel...this is called a serial direct read. Direct path reads are necessary to use storage indexes and HCC compression...and basicly to take advantage of all the special sauce Exadata offers. To begin the direct path read, a segment checkpoint is needed. The idea is that in a direct path read, we're going to pull the data straight from disk (or flash card) into memory in the PGA, which is very avoid all the SGA overhead. If there are dirty blocks in the SGA, we'd get incorrect the dirty blocks are flushed to disk before we start our direct read.

As Tanel Poder put it, "You see this event because of object level checkpoint which happens when you run direct path reads on a segment (like with serial direct read full table scan or parallel execution full segment scans).

Now the question is-how much of your session's response time is spent waiting for that event - and whether the winning in query speed due to direct read/parallel execution outweighs the time spent waiting for this checkpoint."

In a DW/DSS environment, this feature will accelerate performance. With fewer process to contend with, the chance of creating a long "fast object checkpoint" wait is reduced, and your IO transfer is faster, but we're implementing Exadata in an OLTP environment.

This isn't an Exadata-specific issue...its just the way Oracle works on a hot segment with direct path reads in 11g+. By dropping the indexes we're forcing the optimizer to do DW-style IO in an OLTP environment. We're handcuffing the Oracle optimizer and taking away its options. The answer for this as we migrate will be to replace the indexes that were removed...and basicly only drop indexes that aren't used very often.

The take-aways from this are:

1. Don't believe you can just "drop all indexes" or that Exadata doesn't need indexes...your workload, not the hardware dictates this. Exadata will work better than anything else out there if you don't have indexes and you're forced to do a FTS, but for an OLTP environment, indexes aren't optional.

2. OLTP compression on Exadata is extremely fast. OLTP compression performance has always been a tradeoff between the time saved in IO vs the extra time used consuming CPU. With the fast Nahalem EP's in the X2-2, compression actually improves performance.

Exadata feature performance overhead

We're getting ready to migrate 88+ large non-RAC OLTP databases from an IBM P5 595 into a single RAC database running on 3 High Performance Exadata machines and 1 High Capacity machine. We're having Oracle Consulting Services do the actual architectural design and migration work and I'm here to keep them on their toes and make sure the recommendations work from a holistic perspective on the overall system. There are a lot of changes that will be made during this consolidating migration...even one variable can have a negative effect that could impact the entire system. Before I get into the variables I see as having the largest impact (and before I test them) let's talk about the hardware a bit.

Exadata is much more than a "database-in-a-box." Its a set of compute nodes (think RAC node servers) combined with ultra-fast infiniband (...and 10GB ethernet, and multiple 1GB ethernet) and storage nodes. Storage nodes are basicly "helper" servers with local storage. The database can send requests to the storage node (that it sees like an ASM disk), the node will be able to handle the request and just send the results (offloading the workload) or it'll handle it like a traditional ASM disk would handle it...returning the blocks back to the compute node. New concepts in storage nodes include storage indexes (dynamicly created indexes stored on flash storage that may/may not be used to retrieve your data), compression/encryption offloading (cpu in data cells decrypt and uncompress the data, if that's in your best interest. Sometimes Oracle will choose to send the compressed blocks to the compute node instead.) When doing incremental backups, the storage cells will only send the changed blocks to the compute nodes to accelerate performance.  There's a misconception out there that RMAN backups can be completely offloaded to your data cells...this (mostly) isn't true...what Oracle means by that is that the CPU interrupts when accessing the spindles are handled by the data cell...all the rest of the work is handled by your compute node.  From a db that's on a server with local storage, this is a big improvement...for a db that uses a SAN...its not really an improvement.

Storage indexes are a unique feature that allows us to drop indexes and perform full table scans very fast. Exadata will create storage indexes dynamicly, based on the first query to hit the tables in question. The first time you access the table, you perform a full table scan. The storage cells will then build a storage index on your table. The second time you query that table, the storage cell will have the option to be able to access the table via the storage index. There are a lot of great blogs re:storage indexes out there, so I won't rehash the topic again, except to say, in theory, you can do a full table scan and have your results returned almost as fast as if you had done an index range scan...and depending on the amount of data you're returning and how its layed out (clustering factor), maybe even faster.

Other great Exadata features include multiple types and levels of compression (HCC/OLTP) that are handled in some situations on the data cells themselves. The marketing guys at Oracle will say you can take your database "as-is" and drop it into Exadata. Some people will take it to the extreme and claim you don't need indexes at all...which is a misunderstanding of the technology. Oracle's Advanced Consulting Services is planning to drop all indexes except those supporting FK's and PK's and compress almost everything else via OLTP or HCC compression. They'll then compare performance with the legacy system (via RAT) and put indexes back where they're needed. What will be the effect on CPU/IO...and ultimately performance in an OLTP environment?

Again, lots of people on lots of blogs have discussed the features. What I haven't seen blogged about is the overhead of these activities. Yes, dropping indexes and having quick restores may be possible, what what happens on a macro level to the system? Poder and Claussen have talked about storage indexes and their effect on a query...and for a DSS system that's all that matters...but in an OTLP environment you'll have many statement happening simultaneously by hundreds of users. What's the overhead of storage indexes then? What's the overhead of OLTP compression on Exadata in an OLTP scenario?

Let's see if we can get a baseline of what the planned implementation of these features will do. I'll install Swingbench with a 10GB database and run a series of tests on a half rack (4 compute node, 7 data cell) high capacity Exadata machine. Swingbench is an excellent graphical tool that provides performance information and simulates OLTP-type load. Please don't think these tests are something you can deturmine Exadata performance against...there are a lot of limiting factors outside of Exadata in play here (like the network speed between swingbench and Exadata, etc...) measure these tests against each other and the variables I introduce.