Monday, June 13, 2011

Extreme Goldengate performance on Exadata

I had a birthday recently, and now I'm going to show my age.  I remember once thinking, "What application could possibly demand the performance of an AT class computer?"  I was in awe of how fast it programs I had on my 12 Mhz XT ran so fast on my 20Mhz AT (with a turbo button) some of them were unusable.  It wasn't long before that awe came and went.  I think I may still have that machine buried somewhere in my basement.  New technology and performance is always relative to the future, and its a fight today's very best technology can never win because no matter how unlikely it may seem today, there's always room for improvement.

As you may have read in my previous posts, I'm working on a project to move 22 databases (~30TB) times 4 environments from 3 IBM P5 595 frames to 4 Exadata frames.  On the P5's, we used materialized views extensively to keep the inter-related data in sync within the multiple databases.  As the databases became more and more busy, the mv logs grew and with their increase in size, the refresh performance dropped.  We made multiple changes to have them refresh more frequently, but eventually we could see we were reaching the edge of the abilities of the technology on our hardware.  At some point we began to have issues with performance where MV's weren't going to cut it any longer.

Something often forgotten or overlooked in the use of MV's is that there's huge overhead caused by MV logs.  You don't just increase the overhead by refreshing the MV' increase the overhead with every DML statement on a table that has a MV log on it.  Before I looked at it and quantified it, I assumed this was trivial, but its not.  Jonathan Lewis did extensive research on that, showing that a single dml statement on a table with a MV log causes multiple DML statements in the background..and the overhead from "on commit" MV's makes them almost unusable in a high DML environment.

We felt like we reached the edge of the MV replication technology on our hardware...the next step forward in db replication was change data capture (CDC) technology.  The concept with CDC is that every change made in a database (at least on the tables you care to run CDC on) is logged in redo and archived redo logs anyway for backup purposes.  So, instead of adding overhead with triggers or MV's, you can just get the inserts, updates and deletes that happened from the logs you're already creating.  All the changes are just need to parse them and apply them to the remote database.  The first step is to instantiate the remote table (copy as of an SCN via db link) then start CDC after the SCN you copied the data from.  The next time you do an insert on the table on the source side, the change is put into a redo log.  The CDC capture (or extract) process will see that and move that same insert to the table in the remote db.  They're kept in sync in near real-time.  (There's some lag in parse overhead, network speed, etc...)

I wasn't involved in the first pass at CDC here.  About a year previously Oracle had sent some people to implement of them a top notch Oak Table member, but they couldn't get it stable enough to depend on.  This was an early implementation of "down" Streams in 10g (Asynchronous Autolog mode).  We wanted to have the load of mining be on the target, not the source, which is a little atypical...maybe that's why we had so many issues?  With the talent they sent us I can safely say it was the product, not the implementation that failed.  Bug after bug was filed, after months and after huge expense the project was abandoned.  Incidentally, I'm told 11g Streams (which is at end of life with no future features after is much more stable today.

So...a year later the issue resurfaced and became a new "hot item."  This time, I got a shot at it.  I performed some proof of concept performance tests of multiple products working with vendors such as Informatica and Oracle Streams and at the time, an independent company called Golden Gate (which has since be bought out by Oracle).  There was a very bad taste still in the mouth from the last time Streams was there was no way that was going to be the chosen direction.  Still, it was useful to include it in testing for comparison purposes.

When I perform these tests I usually have a list of qualities for the product prepared before testing begins (we call it a "Score Card").  After all, if a product is lightning fast, but isn't stable enough to rely on, it can't be a consideration.  If its perfect in every way, but priced prohibitively, its unattainable.  There's a balance of qualities that makes a "best" choice.  We found Informatica was very robust and relatively easy to use.  The bottleneck we found with Informatica was on the capture side.  Informatica and Streams (and Shareplex) used Oracle Logminer for their CDC which was too slow to meet our requirements.  We had planned to take a look at Shareplex, but decided not to when we saw it mined logs with the same method as Informatica and Streams.

 After spending a few weeks with experts from Informatica, we finally had to say that it wasn't possible to meet the performance criteria we had set out, although they were close.

To be fair, our requirements are very difficult (and for a while, I thought unattainable), and I'm sure all the CDC technologies that use logminer would be more than sufficient for 99.99% of the requirements out there.

Before I go on...let me refer you to my previous post/rant of Golden Gate Director.  Its the worst software product I've ever implemented, and my original motivation for blogging.  It was so bad, I felt like I had to warn the world to avoid it.  We just installed the latest, greatest version of Director...I really hope its been improved.  The rest of this post I'm talking about GG CDC only....

In the CDC world, Golden Gate is different.  I was told that the guy that originally designed Oracle Logminer did so more for auditing purposes and it wasn't originally designed for performance.  Later, he was hired by Golden Gate and designed a new, proprietary miner...designed from the ground up for performance.  Take that with a grain of salt...I heard this from GG sales guys.

My first concern was - using a 3rd party miner to mine Oracle's proprietary redo/archive logs would create a dependency in our system on a relationship between 2 competing companies.  What happens to us if that relationship breaks down?  Right around the time we went live with Goldengate, Oracle bought them- so that concern was removed.  Thanks Larry. :)

As testing began on the P5's, GG was immediately much faster with our workload than logminer, and after a few days it was easy enough to use.  In our non-RAC system we were able to mine around 55GB/hr of archivelogs, at which point the single CPU core it used was pegged.  There was really no way to parallelize it to use more CPU.  A GG consultant that moved to Oracle University taught me a trick to improve non-RAC GG performance when the CPU is the bottleneck (thanks Hitomi)...set it up like its a 1 node RAC.  (ie, when setting up the extract, say "threads 1").  This will spawn 2 processes off, one to purely parse the log, one to hand off the work.  This improved performance around 10% in our environment, to around 60GB/hr.

With our Exadata system, we'll be making around 1,300GB/day of 60GB/hr isn't going to cut it.  When I say 1,300GB/day, there are peeks and vallys of generation, and the app needs to stay as near real-time as possible.  So...although 60GB/hr is fast enough on average...through out the day, we'd have times of lag that would be unacceptable to the performance of the applications.

As I mentioned, the CPU usage on the capture side was the bottleneck on the P5.  Exadata's Nahalem EP not only has faster cores than the 5 yr old P595 cores, we're now in RAC, so we get to use more cores.  When you set up the extract in GG for RAC, you have to specify the number of threads...each thread equating to the archivelog sequence from a specific node.  In the full HP X2-2 Exadata machine I was testing on, that meant we had 8 threads and 8 the bottleneck we had seen in the past not only had faster cores, we now could use 8 of them.  This made me think...unless something else becomes the bottleneck first, we could see over 800% performance increase due to this parallelism.  Ok, not really parallelism...but you know what I mean.  

Before any tweaking on our first try, we were able to get about 400GB/hr.  I've worked with a few people from Oracle/Goldengate, and Mike Bowen is one of the best.  He has decades of experience in CDC and finds creative ways to overcome obstacles.  Oracle sent him to us and together we were able to tweak the performance of Golden Gate on Exadata.  We were only focused on the capture side.  The 4 notable things he changed were:

1. Increase the size of the trail files.  There's considerable overhead as trail files are switched (similar to redo log switching in the database), so increasing the size of these can improve performance by reducing the switch frequency..

2. Trail file storage destination was changed from local storage on the compute node to dbfs.  This is something that should have been done per best practices anyway.  To allow for failover in the event of a node failure, the trail files must be located on some form of shared storage.  Having them on dbfs not only meets that requirement, but now they're on very fast spindles in the high performance storage nodes.

3. Switching from the Golden Gate v10 "TRANSLOGOPTIONS ASMUSER" method of ASM access to the new v11 DBLOGREADER method.  As with new features, there were bugs found right away.  This isn't what Oracle recommends, but I pull the latest/greatest GG version from Metalink patches, rather than eDelivery to avoid the bugs.  GG patches aren't one-offs...they're the entire the newest patch on Metalink includes every patch ever made and in theory, is the most stable.  This, and everything else I'll ever say or type, is just my humble what you think is best.

4. Increasing the read buffer size of the dblogreader to 4M.


Just a note if you're using Goldengate on Exadata.  We encountered a bug that happens when your extract comes across HCC compressed objects in the log...even objects that are excluded, it causes the extract to abend.  If you have any compression in Exadata at all (and of course you will), GG will abend.   Oracle created a patch where that's corrected...if a table you aren't mining is compressed, that's no longer a problem.  Hopefully that patch will be generally available on Metalink soon.  For that matter, hopefully mining compressed tables will be possible soon.

After adding "TRANLOGOPTIONS DBLOGREADER" it created new entries in the dba_capture view (the view used by Streams), which is strange.  The capture processes were disabled, but "start_time" was more recent than the start time of the GG capture process.  Since there's a streams capture process created for GG now, this causes the RMAN-08137 errors when you try to delete archivelogs, even archivelogs that aren't needed for Goldengate (1079953.1).  This is due to a delay in updating this even though the log isn't needed, for a period of time, its still reported as needed.  My opinion is, its great they brought this functionality to Golden Gate, to prevent you from removing archivelogs during a backup that GG hasn't read yet...but this needs to be up-to-date information!  From this feature there's a new wait in the database:
  • Wait event "Streams miscellaneous event" in wait class "Other" was consuming significant database time.
Don't worry if you see SHOULD BE an idle event, and its another bug.  From Metalink:

  • The Streams miscellaneous event will be renamed to  "Waiting for additional work from the logfile" to better describe the activity from Oracle release See detail in BugDB 12341046 for more information

After dealing with the new bugs we were able to read 700GB of mixed OLTP/DSS archivelogs in 46 minutes, to achieve just over 1000GB/hr (over 125GB/hr/node).  After spending several weeks 2 yrs ago trying to scrounge for every last byte of CDC capture speed (I think I peeked below 40GB/hr on the old hardware with logminer), I find this near TB/hr speed ridiculous.  Not only are we good to go for this year's peek performance requirements, we'll do it easily, in near real-time, with so much bandwidth to spare, its hard to imagine a day when this speed will ever be insufficient.  Just like the "AT Class Computer", the awe of 1000GB/hr CDC will...I'm sure, be temporary.  Golden Gate performance, although...not cheap, is pretty awesome...for now. :)  


  1. I have a question about Flashback Data Archive tables in Goldengate replication. Are these tables usually excluded in an extract like "TABLEEXCLUDE .SYS_FBA_*"?
    What is the approach for replicating and initial load of Flash Back Data Archive tables?
    Here is my problem. I exported a user schema using Oracle DataPump, imported into destination database. GoldenGate abended saying some tables dont exist on dest. I checked tables and there were about 200 tables that were not exported because they are FBDA tables and DataPump just ignores them. So, I recreated them on source with scripts.
    So, how is this done? Source and destination have their own FBDA and these tables should not be neither recreated with scripts on dest or replicated? Or they have to be replicated with contents?

  2. I haven't had the opportunity to be in an environment licensed for Total Recall/Flashback Data Archive that also used Goldengate, so I can't answer you definitively.

    It seems to me like you can deal with it in one of two ways.
    1. Either you exclude the FBA objects and enable FBA on source and target (which is what it sounds like you're doing)
    2. You can only enable FBA on the source and allow the objects to propogate via GG. If you choose to do this, you have to instantiate the objects somehow first...if you haven't already bought it...I *highly* recommend Expert Oracle Exadata. Tanel Poder has a great chapter on migrations where he talks about moving data from a source database to Exadata (but it works even if you aren't in Exadata.) He creates the tables and then copies them over multiple database links, calculating how many connections he needs by looking at the round trip timing, and then parallelizing it. The end result (on Exadata) is the source spindles are pegged, or the network is saturated, depending on your resources. Either way, even huge tables are moved as efficiently as possible. You can use that same method to instantiate goldengate...but I would also add flashback query to the mix, so you know what SCN you left off at. So...something like this:

    insert /*+ APPEND PARALLEL 8 */ into table@remotedb select /*+ PARALLEL 8 */ * from table@localdb as of scn 1231312321;

    Then you can set up your extract process for that table starting at the SCN you stated above...that way you shouldn't lose any transactions.

    Make sense? Its not as bad as it sounds...but after reading what I just said, I would go with option #1. There are a lot of things like this you have to exclude with Goldengate...compressed tables and queue tables to name a few....

  3. Just another thought...if you don't choose option #1, would FBA work on the target? Assuming your app has the FBA option built-in to use on the target, that might be the deciding factor.

    If you go with #2 and if you use the scn in your query for a point in time earlier than what's being captured in redo...will it use the flashback archive table you copied over from the other database, or will it give you a "snapshot too old" error? I'd bet you'd get a snapshot too old error. The more I think about it, the more I think its best to set up FBA on both db's and exclude them from GG's extract.

    Now I'm curious...let me know what you decide to do.

  4. I am curios , What kind of network do you used between source & target to achive 1000GB/sec. Was WAN or LAN ?

    1. Umair,

      You can configure GG to compress TCP package at cost of CPU.

  5. This was all between Exadata 40Gb infiniband. In the details, they measure performance alittle differently than you typically see with network speeds. 10GB ethernet is capable of ~1GB up and 1GB down. You would think then that 40Gb IB is capable of 4GB up and 4GB down...but that's not the case...its actually 20Gb up and 20Gb you ~2GB up and 2GB down. IB is less efficient than ethernet too. Still...its the fastest thing we've got. Soon, Cisco will start selling 40Gb and 100Gb ethernet...I can't wait. :)

    You bring up an interesting question though...I wonder if it would be faster to compress archivelogs and mine them remotely across a wan or mine them on the host and send the trail files across the wan. It probably would depend a lot on what % of the dml you're actually incuding in the mining. If only 5% of the dml activity is on tables you've included for goldengate, it would be much faster to mine them on the source and send the trail files on to the remote site.

  6. Good Informative Post,

    I m also in process of Implementing GG on Exadata X2.

    4M is maximum BUFFSIZE GG can offer ??

    Regarding RMAN-08137, Did you try LOGRETENTION parameter to Integrate GG with RMAN for archive deletion.


  7. For the record, Shareplex does not use Logminer to read the redo log. Never has. Also, the story about the guy who wrote the original Logminer being hired and he wrote a much faster one can't possible be true. Since I know the original developers that brought GG for Oracle product to market. They were from Quest's Shareplex team.

  8. Hi Andy
    Curious to see if any of this has changed for the better here in 2016 (soon to be 2017...) or if any changes in throughput have been noted