Monday, June 13, 2011

Extreme Goldengate performance on Exadata

I had a birthday recently, and now I'm going to show my age.  I remember once thinking, "What application could possibly demand the performance of an AT class computer?"  I was in awe of how fast it was...how programs I had on my 12 Mhz XT ran so fast on my 20Mhz AT (with a turbo button) some of them were unusable.  It wasn't long before that awe came and went.  I think I may still have that machine buried somewhere in my basement.  New technology and performance is always relative to the future, and its a fight today's very best technology can never win because no matter how unlikely it may seem today, there's always room for improvement.

As you may have read in my previous posts, I'm working on a project to move 22 databases (~30TB) times 4 environments from 3 IBM P5 595 frames to 4 Exadata frames.  On the P5's, we used materialized views extensively to keep the inter-related data in sync within the multiple databases.  As the databases became more and more busy, the mv logs grew and with their increase in size, the refresh performance dropped.  We made multiple changes to have them refresh more frequently, but eventually we could see we were reaching the edge of the abilities of the technology on our hardware.  At some point we began to have issues with performance where MV's weren't going to cut it any longer.

Something often forgotten or overlooked in the use of MV's is that there's huge overhead caused by MV logs.  You don't just increase the overhead by refreshing the MV's...you increase the overhead with every DML statement on a table that has a MV log on it.  Before I looked at it and quantified it, I assumed this was trivial, but its not.  Jonathan Lewis did extensive research on that, showing that a single dml statement on a table with a MV log causes multiple DML statements in the background..and the overhead from "on commit" MV's makes them almost unusable in a high DML environment.

We felt like we reached the edge of the MV replication technology on our hardware...the next step forward in db replication was change data capture (CDC) technology.  The concept with CDC is that every change made in a database (at least on the tables you care to run CDC on) is logged in redo and archived redo logs anyway for backup purposes.  So, instead of adding overhead with triggers or MV's, you can just get the inserts, updates and deletes that happened from the logs you're already creating.  All the changes are there...you just need to parse them and apply them to the remote database.  The first step is to instantiate the remote table (copy as of an SCN via db link) then start CDC after the SCN you copied the data from.  The next time you do an insert on the table on the source side, the change is put into a redo log.  The CDC capture (or extract) process will see that and move that same insert to the table in the remote db.  They're kept in sync in near real-time.  (There's some lag in parse overhead, network speed, etc...)

I wasn't involved in the first pass at CDC here.  About a year previously Oracle had sent some people to implement Streams...one of them a top notch Oak Table member, but they couldn't get it stable enough to depend on.  This was an early implementation of "down" Streams in 10g (Asynchronous Autolog mode).  We wanted to have the load of mining be on the target, not the source, which is a little atypical...maybe that's why we had so many issues?  With the talent they sent us I can safely say it was the product, not the implementation that failed.  Bug after bug was filed, after months and after huge expense the project was abandoned.  Incidentally, I'm told 11g Streams (which is at end of life with no future features after 11.2.0.2) is much more stable today.


So...a year later the issue resurfaced and became a new "hot item."  This time, I got a shot at it.  I performed some proof of concept performance tests of multiple products working with vendors such as Informatica and Oracle Streams and at the time, an independent company called Golden Gate (which has since be bought out by Oracle).  There was a very bad taste still in the mouth from the last time Streams was attempted...so there was no way that was going to be the chosen direction.  Still, it was useful to include it in testing for comparison purposes.

When I perform these tests I usually have a list of qualities for the product prepared before testing begins (we call it a "Score Card").  After all, if a product is lightning fast, but isn't stable enough to rely on, it can't be a consideration.  If its perfect in every way, but priced prohibitively, its unattainable.  There's a balance of qualities that makes a "best" choice.  We found Informatica was very robust and relatively easy to use.  The bottleneck we found with Informatica was on the capture side.  Informatica and Streams (and Shareplex) used Oracle Logminer for their CDC which was too slow to meet our requirements.  We had planned to take a look at Shareplex, but decided not to when we saw it mined logs with the same method as Informatica and Streams.

 After spending a few weeks with experts from Informatica, we finally had to say that it wasn't possible to meet the performance criteria we had set out, although they were close.

To be fair, our requirements are very difficult (and for a while, I thought unattainable), and I'm sure all the CDC technologies that use logminer would be more than sufficient for 99.99% of the requirements out there.

Before I go on...let me refer you to my previous post/rant of Golden Gate Director.  Its the worst software product I've ever implemented, and my original motivation for blogging.  It was so bad, I felt like I had to warn the world to avoid it.  We just installed the latest, greatest version of Director...I really hope its been improved.  The rest of this post I'm talking about GG CDC only....

In the CDC world, Golden Gate is different.  I was told that the guy that originally designed Oracle Logminer did so more for auditing purposes and it wasn't originally designed for performance.  Later, he was hired by Golden Gate and designed a new, proprietary miner...designed from the ground up for performance.  Take that with a grain of salt...I heard this from GG sales guys.

My first concern was - using a 3rd party miner to mine Oracle's proprietary redo/archive logs would create a dependency in our system on a relationship between 2 competing companies.  What happens to us if that relationship breaks down?  Right around the time we went live with Goldengate, Oracle bought them- so that concern was removed.  Thanks Larry. :)

As testing began on the P5's, GG was immediately much faster with our workload than logminer, and after a few days it was easy enough to use.  In our non-RAC system we were able to mine around 55GB/hr of archivelogs, at which point the single CPU core it used was pegged.  There was really no way to parallelize it to use more CPU.  A GG consultant that moved to Oracle University taught me a trick to improve non-RAC GG performance when the CPU is the bottleneck (thanks Hitomi)...set it up like its a 1 node RAC.  (ie, when setting up the extract, say "threads 1").  This will spawn 2 processes off, one to purely parse the log, one to hand off the work.  This improved performance around 10% in our environment, to around 60GB/hr.

With our Exadata system, we'll be making around 1,300GB/day of archivelogs...so 60GB/hr isn't going to cut it.  When I say 1,300GB/day, there are peeks and vallys of generation, and the app needs to stay as near real-time as possible.  So...although 60GB/hr is fast enough on average...through out the day, we'd have times of lag that would be unacceptable to the performance of the applications.

As I mentioned, the CPU usage on the capture side was the bottleneck on the P5.  Exadata's Nahalem EP not only has faster cores than the 5 yr old P595 cores, we're now in RAC, so we get to use more cores.  When you set up the extract in GG for RAC, you have to specify the number of threads...each thread equating to the archivelog sequence from a specific node.  In the full HP X2-2 Exadata machine I was testing on, that meant we had 8 threads and 8 nodes...so the bottleneck we had seen in the past not only had faster cores, we now could use 8 of them.  This made me think...unless something else becomes the bottleneck first, we could see over 800% performance increase due to this parallelism.  Ok, not really parallelism...but you know what I mean.  

Before any tweaking on our first try, we were able to get about 400GB/hr.  I've worked with a few people from Oracle/Goldengate, and Mike Bowen is one of the best.  He has decades of experience in CDC and finds creative ways to overcome obstacles.  Oracle sent him to us and together we were able to tweak the performance of Golden Gate on Exadata.  We were only focused on the capture side.  The 4 notable things he changed were:

1. Increase the size of the trail files.  There's considerable overhead as trail files are switched (similar to redo log switching in the database), so increasing the size of these can improve performance by reducing the switch frequency..

2. Trail file storage destination was changed from local storage on the compute node to dbfs.  This is something that should have been done per best practices anyway.  To allow for failover in the event of a node failure, the trail files must be located on some form of shared storage.  Having them on dbfs not only meets that requirement, but now they're on very fast spindles in the high performance storage nodes.

3. Switching from the Golden Gate v10 "TRANSLOGOPTIONS ASMUSER" method of ASM access to the new v11 DBLOGREADER method.  As with new features, there were bugs found right away.  This isn't what Oracle recommends, but I pull the latest/greatest GG version from Metalink patches, rather than eDelivery to avoid the bugs.  GG patches aren't one-offs...they're the entire build...so the newest patch on Metalink includes every patch ever made and in theory, is the most stable.  This, and everything else I'll ever say or type, is just my humble opinion...do what you think is best.

4. Increasing the read buffer size of the dblogreader to 4M.

TRANLOGOPTIONS BUFSIZE 4096000
TRANLOGOPTIONS DBLOGREADER
TRANLOGOPTIONS DBLOGREADERBUFSIZE 4096000

Just a note if you're using Goldengate on Exadata.  We encountered a bug that happens when your extract comes across HCC compressed objects in the log...even objects that are excluded, it causes the extract to abend.  If you have any compression in Exadata at all (and of course you will), GG will abend.   Oracle created a patch where that's corrected...if a table you aren't mining is compressed, that's no longer a problem.  Hopefully that patch will be generally available on Metalink soon.  For that matter, hopefully mining compressed tables will be possible soon.

After adding "TRANLOGOPTIONS DBLOGREADER" it created new entries in the dba_capture view (the view used by Streams), which is strange.  The capture processes were disabled, but "start_time" was more recent than the start time of the GG capture process.  Since there's a streams capture process created for GG now, this causes the RMAN-08137 errors when you try to delete archivelogs, even archivelogs that aren't needed for Goldengate (1079953.1).  This is due to a delay in updating this view...so even though the log isn't needed, for a period of time, its still reported as needed.  My opinion is, its great they brought this functionality to Golden Gate, to prevent you from removing archivelogs during a backup that GG hasn't read yet...but this needs to be up-to-date information!  From this feature there's a new wait in the database:
  • Wait event "Streams miscellaneous event" in wait class "Other" was consuming significant database time.
Don't worry if you see this...it SHOULD BE an idle event, and its another bug.  From Metalink:

  • The Streams miscellaneous event will be renamed to  "Waiting for additional work from the logfile" to better describe the activity from Oracle release 11.2.0.2.x See detail in BugDB 12341046 for more information

After dealing with the new bugs we were able to read 700GB of mixed OLTP/DSS archivelogs in 46 minutes, to achieve just over 1000GB/hr (over 125GB/hr/node).  After spending several weeks 2 yrs ago trying to scrounge for every last byte of CDC capture speed (I think I peeked below 40GB/hr on the old hardware with logminer), I find this near TB/hr speed ridiculous.  Not only are we good to go for this year's peek performance requirements, we'll do it easily, in near real-time, with so much bandwidth to spare, its hard to imagine a day when this speed will ever be insufficient.  Just like the "AT Class Computer", the awe of 1000GB/hr CDC will...I'm sure, be temporary.  Golden Gate performance, although...not cheap, is pretty awesome...for now. :)  

Thursday, June 9, 2011

How much faster is Exadata High Capacity vs High Performance Storage? (short stroking)

Compression is a powerful feature in Exadata, especially HCC compression...but in the real world you have time constraints on your migration project and it isn't necessarily possible to compress everything you'd like to compress immediately...you have to test and compare the performance impact on queries vs storage gains.There are so many options for compression in Exadata...basic compression (formerly bulk compression), compress for OLTP (formerly advanced compression), HCC compress for query (low, high) and HCC compress for archive (low, high)...not to mention index key compression, which I'll post about later.  Which one you choose is dictated by your data access patterns.   It takes time to figure out which is right for each table/partition.


This came up because my client has too much data to fit into a full high performance storage cell machine (over 28TB).  We've dropped indexes where possible, which has had a hugely positive impact...but now we're finding many of them need to be added again for performance reasons.  This was expected and planned...and we should be fine on capacity for the migration, but when this database hits the expected growth curve in a few years, we'll be in trouble.


One of the on-site Oracle consultants (who was very stressed about getting things to fit) suggested we move from our 80/20 data/fra High Performance storage machine to a High Capacity storage machine.  He said, "Since most of your IO will be hitting the flash cache anyway, you should only see it be a few percentage slower."  He said this w/o ever looking at our access patterns, so I dismissed it out of hand.  Some other people on the project pointed out...companies don't buy Exadata because they want "good enough"...you buy Exadata because you want ultra performance.  There were other "good enough" platforms that were much cheaper than Exadata.  Knowing they have ~1:1 read/write ratio, I wanted to try to quantify the performance difference between the options.  Everything is identical between an Exadata X2-2 high performance storage machine and an Exadata X2-2 high capacity machine...except for the spindles, so that's what I'll be focusing on.


Of course...all things being equal, High Capacity SAS storage cells are slower than High Performance SAS storage cells.  The HP cells have 15000rpm 600GB SAS disks.  HC cells have 7200rpm 2TB SAS disks (which are mechanically similar to the SATA disks from Exadata V1 machines...although the sales guys won't say it, Kevin Closson did:  "...think in the same way you do about technology like FC-SATA (a SATA disk with FC attach and FC-SATA head electronics."




Things are not necessarily equal though because Exadata short strokes its storage.  The idea of short stroking disks is that...the outside circumference of a spinning disk is moving faster than the inside...not in RPM's of course, but in the distance travelled by the head of the disk...so the throughput of the outside of the spindle is higher than the throughput of the inside, and if the head doesn't have as far to move, it will take less time for it to be where you want it to be.


Exadata takes advantage of short stroking and puts the most performant storage on the outside of the spindles for the data diskgroup, followed by the fra, and the most inside, slowest part of the spindle is used for dbfs.  There's a new 11.2 feature that does something similar for local storage...but that's a topic for a different post.  Once the diskgroup is created in ASM, there's a hash algorithm that distributes that data evenly around on the storage w/in that diskgroup...so short stroking in ASM can only happen when the diskgroup is created (barring a resize).


There's a standard data/fra ratio that Exadata uses normally...but the percentage of the storage vs FRA dedicated to data is variable in the Exadata setup script...we chose an 80/20 path.  Obviously flash cache plays a huge role in IO for reads...but we're a little write heavy.  Its possible that at some point as we short stroke the HC disks more and more, we'll approach or possibly excede the performance of the HP disks.


Here's the math for the SAS-2(6GB/s) drives:
2TB SAS Spec Sheet


2TB (High Capacity)
seek:8.9
rotational speed=7200
latency=7200r/m=120r/s=8ms/r
avg latency=8ms/2=4 (4.16ms from data sheet)
avg access time=13.06
Sustained Sequencial Read is 90MB/s(ID) and 144MB/s(OD)
Avg throughput:117MB/s


600GB SAS Spec Sheet
600GB (High Performance)
seek:3.65
rotational speed=15000
latency=15000r/m=250r/s=4ms/r
avg latency=4ms/2=2ms
avg access time=5.65
Sustained Sequential Read is 122MB/s(ID) and 204MB/s(OD)
Avg throughput:163MB/s


Latency is the time it takes for the spindle to spin around...sometimes you're closer and sometimes you're farther from the data you're going after...so worst case scenario, you're a full spin away...best case you're next to the data...so on average take latency/2 and that's what you can typically expect for avg latency.  Avg access time is avg latency+seek time.  This is how long you can expect the head to move per IO.


Again...we're doing an 80/20 configuration for data/fra...with data shortstroked on the outside edge we lower the capacity to 480GB (80% of 600GB), but our throughput on HP should be a little better than the spec at (204-122)*(1-.8)+122=138.4MBs (minimum inside).  204MBs (outside) to 138.4MBs(inside) so the avg throughput after short stroking is 171.2MB/s, increased from 163MB/s.


For the same storage (480GB) of data on the 2TB high capacity disks, we'll be using only the outer 23.4% (480GB/2048 GB) of the HC spindles. The spec throughput range is 90MB/s(inside) to144MB/s(outside), so (144-90)*(1-.234)+90=131MB/s (minimum inside).  (144+131)/2 gives us an average of 137.5MB/s, increased from 117MB/s.


The same logic works for average seek times.  If you only use the outer 50% of the spindle, you cut your seek time in half.  In our case, we're using the outer 80% for HP and outer 23.4% for HC.




2TB (High Capacity Short Stroked)
seek:8.9(8.9ms*23.4%)=2.08ms
rotational speed=7200
latency=7200r/m=120r/s=8ms/r
avg latency=8ms/2=4 (4.16ms from data sheet)
avg access time=13.06ms6.24ms
Sustained Sequencial Read is 90MB/s(ID) 131MB/s(ID)and 144MB/s(OD)
Avg throughput:117MB/s137.5MB/s
0-byte IOPS=77160
32k IOPS=~154

32k*160=5MB/s@137.5MB/s=1/27.5 seconds for transferring, rather than accessing the data, so the 32k IOPS would happen about 154 times/sec. 

600GB (High Performance Short Stroked)
seek:3.65(3.65ms*80%)=2.92
rotational speed=15000
latency=15000r/m=250r/s=4ms/r
avg latency=4ms/2=2ms
avg access time=5.654.92ms
Sustained Sequential Read is 122MB/s(ID) 138.4MB/s(ID) and 204MB/s(OD)
Avg throughput:163MB/s171.2MB/s
0-byte IOPS=177203
32k IOPS=~195




32k*203=6.34MB/s@171.2 would take 1/27th of a second for transferring, rather than accessing the data, so that would lower the 32k IOPS to about 195.


Soo...for throughput 137.5/171.2 tells us the HC disks would be 80.3% as fast as the HP disks, not counting caching.  For the all-important IOPS measurement, HC would be about 79% as fast as the HP disks.  With the 1:1 read/write workload in this database and around a 90% hit ratio (a lot of activity goes to dbfs, otherwise the flash hit rate would be higher), we'd be looking at around 18.7% more performance from the HP disks.


18.7% might not seem like very much, but when you're paying for the state of the art, why would you take an 18.7% performance hit?  At that point, alternative platforms become more attractive.  Instead of going down to 4 Exadata machines filled with HC storage, we opted for 3 HP machines and 1 HC machine, with the option to add a 5th chassis filled with more HP storage cells, at an additional expense, of course.


Still...this was pretty close.  Eventually as technology improves X3-2 (or 4 or 5)'s I bet we'll have a 3TB 10K HC spindle option.  If so, they might come very close to outperforming today's HP spindles, after they're short stroked.  Just for the heck of it...if Sun/Oracle does offer 3TB 10K's in the X3-2, this is a possibility of what we would see (still keeping with the 480GB/spindle for data use):


3TB (Future High Capacity Short Stroked)
seek:6.4(6.4ms*15.6%)=1ms
rotational speed=10000
latency=10000r/m=166.6r/s=6ms/r
avg latency=6ms/2=3
avg access time=9.4ms 4ms
Sustained Sequencial Read is 125MB/s(ID) and 200MB/s(OD)
Avg throughput:162.5MB/s 
0-byte IOPS=250
32k IOPS=~238