Friday, August 19, 2011

Page Missing?

A close friend of mine recently blogged about an upcoming feature of an Oracle product...let's call it OEM.  So, in his zeal to discuss the latest, greatest and the new technology in OEM, when a product manager sent him an email and asked about how he knew about the new features, my friend removed his post.  Was that a wise move or did it display a lack of courage?  Who's to say...maybe both?   The real downside is that there was an interesting conversation between him and an Oak Table member I was enjoying. The lessons we should learn from this are, "Secret information isn't always declared so."  The other lesson is..."He who has the most lawyers wins arguments before they begin."

Monday, August 15, 2011

Equi-sized Partition Cuts

There are a lot of resources to show you how to make a partitioned table...syntax search results are plentiful...but its unusual to find a resource that tells you about the process to partition a table.

Some rows are wider than others, and a lot of times your partition key doesn't have gaps...so this may not apply to you.  When partitioning my approach has been to:

1. Identify the access patterns
    You can do this by talking to the application development teams and selecting from hist_sql_plan (see below).  That will give you a good idea about some of the sql that's hitting the table in question.  Is it direct-io reads or conventional dml?  Are they simple queries, where response time is likely important, or are they more complex, taking several minutes to complete?  Look at the operation column.

select * from dba_hist_sql_plan where sql_id in (select sql_id from DBA_HIST_SQL_PLAN where object_name='TABLE_NAME') order by sql_id, id;

2. From #1, what columns do you often see in the where clauses?  This/these columns may be good partition key column candidates.

3. From #1, what's the best compression type you can use without negatively affecting application performance?  (long runtime queries could likely have more aggressive compression on their tables, short queries will have to be C4OLTP or no compression)

4.  If possible, use interval partitioning instead of range partitioning.  There are a lot of things that will keep you from using interval partitioning  (like domain indexes), but it'll save you time in management if you can use it.


5.  There's a balance between adding lots of partitioning for performance reasons and having too many partitions to be easily managed.  As a rule of thumb, I shoot for ~20, depending on table size, but never let a partition be smaller than 2% of the db_cache size (see earlier post re:smallness logic).  This is hard with ASMM/AMM, since the db_cache size is dynamic...but even if you use ASMM and AMM you should set minimum sizes in the memory pools.  Its a good practice to make the partitions as equally sized as possible...if that means equal number of rows or equally size segments, that's for you to decide...it depends on your situation.  Also consider the partitioned tables commonly joined with this table...if your queries pull back many rows, partition-wise joins will make you want to use the same partitioning column if possible.


We're partitioning many, many tables and we need a process to do this as quickly as possible.  Some of the tables are multi-terrabye, and it was taking me a really long time to come up with the values for the ranges.  So, I changed my method and just created a query to do it.  There might be a better way...but this is performing very well, even on multi-billion row tables this returns in around a minute.  If you have a better method, I'd love to hear about it.

The idea here is that there are repeated values and/or gaps in the sequence column you want to make your partitioning key...so you can't just take the sequence number.  The first partition has to be the row that's 5% in...regardless of what the value of the partitioned key is.  This query will give you the value of the partitoning column key that's about 1/20th of the rows in the table...the 2nd will give you the one that's around 2/20th's...etc.  Change the table_name to be your table, and seq_id to be your partition key.


select part, max(seq_id), max(rownumber) from (
select trunc(rownumber,2) part, tm2.* from (
select /*+ PARALLEL 96 */ seq_id, CUME_DIST() OVER (PARTITION BY 'X' ORDER BY seq_id) AS RowNumber
from table_name tn
) tm2
where trunc(rownumber,2) in (0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,
0.55,0.6,0.65,0.7,0.75,0.8,0.85,0.9,0.95,1))
group by part
order by 1;

...when its done, it'll give you 3 columns.  The first one is your "goal" (.6 is the one ~60% into the row count), the second is the partition key value at that point, the 3rd is how close to your goal you came (which depends on the data cardinality...how unique the column values are.)  Use the 2nd column as the value for your partitioned table range. IE:

0.05    20      0.06
0.1      36      0.11
0.15    52      0.16
0.2      67      0.21
0.25    82      0.26
0.3      95      0.31
0.35    124    0.36
0.4      151    0.41
0.45    169    0.46
0.5      186    0.51
0.55    200    0.56
0.6      217    0.61
0.65    235    0.66
0.7      253    0.71
0.75    296    0.76
0.8      311    0.81
0.85    347    0.86
0.9      381    0.91
0.95    415    0.96
1         445    1

PARTITION BY RANGE (SEQ_ID)

  PARTITION P_1 VALUES LESS THAN (20),
  PARTITION P_2 VALUES LESS THAN (36),
  PARTITION P_3 VALUES LESS THAN (52),
  PARTITION P_4 VALUES LESS THAN (67),
  PARTITION P_5 VALUES LESS THAN (82),
  PARTITION P_6 VALUES LESS THAN (95),
  PARTITION P_7 VALUES LESS THAN (124),
  PARTITION P_8 VALUES LESS THAN (151),
  PARTITION P_9 VALUES LESS THAN (169),
  PARTITION P_10 VALUES LESS THAN (186),
  PARTITION P_11 VALUES LESS THAN (200),
  PARTITION P_12 VALUES LESS THAN (217),
  PARTITION P_13 VALUES LESS THAN (235),
  PARTITION P_14 VALUES LESS THAN (253),
  PARTITION P_15 VALUES LESS THAN (296),
  PARTITION P_16 VALUES LESS THAN (311),
  PARTITION P_17 VALUES LESS THAN (347),
  PARTITION P_18 VALUES LESS THAN (381),
  PARTITION P_19 VALUES LESS THAN (415),
  PARTITION P_20 VALUES LESS THAN (445),
  PARTITION P_MAX VALUES LESS THAN (MAXVALUE)
)

I hope this helps you, or at least saves you some time determining your table partition key and the best range of values for that key.

Partitoning while migrating to Exadata-P2 (Exadata is inexpensive)

Preliminary testing results are very promising.  After a few days of activity the accelerated performance due to C4OLTP is working, and the majority of the data is seeing excellent compression (up to 55X+) for HCC c4archive-high. 

One of the tables is ~2TB in size and contains a clob column.  If you read the documentation, it tells you that HCC doesn't work on cLOBS.  What they SHOULD say is it doesn't work on out-of-line clobs.  If your clobs are small, they're stored in the table's segment and HCC does work on them.  I was able to take this 2048GB table and compress it down to around 76GB.  That's 3.7% of its original size, for a compression ratio of about 27X.

This is one of the concepts that's difficult for people considering buying Exadata to grasp...it costs a lot, but due to its compression abilities, it may be the least expensive thing out there for very large databases.  Its hard to compare Exadata to a standard 11.2 database system because Hybrid Columnar Compression makes it an apples-oranges comparison.

Consider the cost of high-performance storage on an EMC or Hitachi array.  Everybody has their own TCO for high performance storage...but lets say EMC gives you a good deal and it costs $15/GB.  The storage savings on this table alone saved almost $30k.  Multiply that out by the 28TB in a high performance machine and it saves $11.3 million.  A different way to look at it is...in a world where you can compress everything with HCC-QH and get these compression results...the actual capacity of a 28TB high capacity Exadata machine is 756TB!

Now that Oracle is selling storage cells individually, Exadata is truly expandable...there really isn't a capacity limit until you can saturate IB and create a bottleneck...but since that's been optimized (only sending required blocks, columns and sometimes result sets to the RAC nodes), its difficult to image how much storage you can have before that's an issue.

Partitoning while migrating to Exadata (ILM and Compression)-P1

In our frenzy to move data from around 22 databases into a single database in Exadata, at the same time we're doing some modifications to the larger tables- hundreds of them are 2GB+ in size.  As you've read in the previous posts, there's concerns that the growth rate of these tables may exceed the storage capacity before the original 3 yr projections.  To postpone this, we're trying to partition these tables for ILM (information lifecycle management) and performance during the migration.

The idea is that usually, the newest data is the most active and the older data is kept for archival purposes.  The new data is usually manipulated with DML and the old data is usually used only for queries.

With Exadata's 8 table compression options, we have a lot of ways to shrink the data and make it fit in our storage capacity as long as possible.  These all have trade-offs between compression density and CPU overhead.  HCC compress for archive high is extremely good at compressing data...but your access is very slow and it uses lots of CPU...compress for OLTP is relatively bad at compression, but its extremely fast.  Since the CPU for compression/decompression is often offloaded to the cores on the storage cells...the only real concern is how well does it compress, and how will it affect query and DML performance?

Experimentation and testing is in progress now, but our initial results show compress for OLTP actually improves performance on Exadata with our workload!  C4OLTP usually gives between 2X and 4X compression.

My thought is, since DML on an HCC compressed table without direct IO moves that row into a C4OLTP block, and since there's no way to delineate the "old, Read-Only" data from the "new, often-modified" data in the tables, we compress some of the largest multi-TB tables with "HCC compress for query-high".  The idea is that we take an initial performance hit, but after a while, internally all the old stable data will be stored with HCC QH, and the new volatile data will be C4OLTP.  Its the best of both worlds...the volatile data will be faster than no compression due to C4OLTP and the older data will be extremely densely compressed.  This should work because our access patterns show we rarely do bulk insert/update/deletes.

Eventually, (I'm thinking annually) we'll have to move the partitions to re-compress the data to HCC QH again and let the activity sort out the data between the 2 compression methods.  As testing continues, I'll let you know how its going.


Friday, August 12, 2011

An Overlooked Database Performance Pitfall - Part 2 (smallness logic)


To answer the question, "How can I make my large transaction faster in Oracle?"  We first have to know when direct path reads are prevented and triggered...when will the help, and when will they hamper performance?  Here are some situations that prevent them:

    * If the table uses advanced security features, like fine-grained access control

    * Queue tables don't use direct path reads

    * Tables that have BFILE or opaque or have an object type containing opaque

    * Tables that have encrypted columns

    * A table with LONG or LONG RAW column, where that column isn't the last one in the table

    * I've heard direct path reads are avoided when X% of the table's block are already in the db_cache.  This may be due to smallness logic...if the table is 5% the size of your db cache, and 4% of it is already in the db cache, the read would only be 1% of the db cache.  I haven't verified that, but that's my theory.  This means its possible that one day you get direct path reads, the next day you don't.

    * When "smallness logic" (_small_table_threshold) comes into play (Metalink-ID 787373.1):

"If the number of blocks to be read is lower than or equal to the setting of the parameter Oracle will load the object via the buffer cache as this is more efficient than doing a direct read operation. However, the parameter does not need to be set explicitly as there is also a dependency function calculating the cut over threshold if the parameter is unset. This calculation is roughly 2% of the buffer cache size to be used as the cut over threshold. This means any object (table) smaller than 2% of the buffer cache will be read via the buffer cache and not using direct load."

Table smallness logic has a lot of ramifications.  It doesn't just apply to tables, it applies to table partitions too, measuring segment size vs cache size. 

Here are 2 situations I've seen where table smallness logic had an impact:


In an effort to simplify the decision path of the query, a DBA I know went on a mission to partition everything he could.  He should have listened to Einstein who said, "Make everything as simple as possible, but not simpler." The DBA almost created a new type of index.  Partitioning a table with hundreds of partitions on a medium-sized table (to get partition pruning) to the point where some partitions only hold a few rows...and this has been done on an Exadata system!  He concluded that it was faster after he was able to do a sample workload much faster than before.  Besides the obvious management pain, this makes the partitioned segments so small that you can't trigger direct-path reads, which then prevents the use of storage indexes, and all the other Exadata goodies.  For performance reasons, if a query hits that table and filters on the partition key, that's likely not  a big deal...its small so a conventional, non-direct path read will be faster anyway.  For large queries, over partitioning removes the performance advantage of direct path reads.  I think Einstein would say, "He made it too simple."  He prevented the use of direct path reads, which meant his reads were no longer blocking his writes...so the end result of his tests were faster because there weren't waits...but he had no idea why it was faster.

I had the opportunity to work with Scott Gossett briefly for a consulting stint in Saint Louis.  He's an author, creator of ADDM...also teaches the Exadata classes to Oracle consultants and one the most talented "explainers" I've ever met.  There's a disconnect often between "techys" and "management" that's difficult to overcome.  Scott has the ability to almost put the management in a trance with his style of explaining things.  (Visualize a cobra in front of a guru playing a flute.)  Anyway, he sized the nodes of a full OLTP Exadata machine with 16GB of SGA and 32GB of PGA (out of 96GB of ram.)  The reason is, even if you have spare RAM, if you create the sga too big, you increase the db_cache size.  This means fewer tables will be less than 2% of the cache size, and that means you reduce the percent of activity that'll be direct path.  Its counter-intuitive, but less cache means it will run faster (depending on your workload and table layout, of course.)

Should you design your database to use direct path more often than conventional path?  It depends on your data and your workload.  For a data warehouse, its likely that direct path will almost always be the correct answer...but even in a data warehouse system, there's usually some non-DSS activity.  Be aware of direct path IO...it can make things much faster (single, large queries) or it can be a pitfall and make things much slower (multiple transactions attempting to read and modify the same segment).

An Overlooked Database Performance Pitfall (Direct IO) - Part 1

At this point, if you've heard of Oracle (other than from watching The Matrix or reading "The History of Ancient Greece") then you've likely heard of Exadata.  If you've heard of Exadata, whatever your source...reading about it, hearing about it, whatever, there was a mention of how fast it is.  If this was a power point slide instead of a post, I'd show a race car, because designing your database is a little like speeding in your car...people just don't always know when they're hitting the accelerator and when they're breaking.

In my previous post, Click Here,  I talked about how Exadata's accelerator-direct path reads, allows for the use of many Exadata performance features.  Normally, the IO path for Oracle will move the data blocks from storage to shared pool, and then work with it from there.  There's a lot of overhead related to shared memory latches.  For small amounts of data movement, there's no better way to do it.  For large data transfers (large is relative to your segment) there's a better way.  Direct-IO allows you to circumvent the overhead of shared buffer caching, and move the data directly from the storage (array/spindle, etc) to the your process memory (PGA).

"But what about dirty blocks?  They're in memory, but they haven't been written to disk.  If I'm reading directly from disk to my process, wouldn't I miss them?"  Yes, you would...that's why at the start of any direct-io read there's a segment checkpoint that flushes dirty blocks to disk, followed by a shared lock to prevent the segment contents from changing during a direct-io read.  So, writing (especially direct IO writes...they can't be dirty) will have to wait for the read to complete.  Readers blocking writers...where have I heard that before? 

This is why there isn't one correct method of disk access in Oracle...it depends on what you're doing.  If you're moving large amounts of data direct-io moves it faster but you incur segment checkpoint waits before you can begin.  If you're only moving a few blocks they copy slower with conventional IO because of the latch waits of shared memory, but they don't incur the overhead of the segment checkpoint at the beginning.  The optimizer looks at how much you're copying and attempts to do the best path, based on the system settings and table statistics.

In previous posts I gave examples on how under certain OLTP conditions, direct path reads can also work like a break in your db-racecar, due to waits on segment checkpoints.  As a DBA/Data Architect, how do you design your database to take advantage of the benefits?  The optimizer will choose direct path reads when it thinks they're optimal, so you really need to know, what are the pitfalls that prevent them?

I'll list a few in my next post.