I had a birthday recently, and now I'm going to show my age. I remember once thinking, "What application could possibly demand the performance of an AT class computer?" I was in awe of how fast it was...how programs I had on my 12MHz XT ran so fast on my 20MHz AT (with a turbo button) that some of them were unusable. It wasn't long before that awe came and went. I think I may still have that machine buried somewhere in my basement. New technology and performance are always judged relative to the future, and it's a fight today's very best technology can never win, because no matter how unlikely it may seem today, there's always room for improvement.
As you may have read in my previous posts, I'm working on a project to move 22 databases (~30TB) times 4 environments from 3 IBM P5 595 frames to 4 Exadata frames. On the P5's, we used materialized views extensively to keep the inter-related data in sync across the multiple databases. As the databases became busier, the MV logs grew, and with their increase in size, refresh performance dropped. We made multiple changes to have them refresh more frequently, but eventually we could see we were reaching the edge of what the technology could do on our hardware, and the performance problems made it clear MV's weren't going to cut it any longer.
Something often forgotten or overlooked in the use of MV's is that there's huge overhead caused by MV logs. You don't just increase the overhead by refreshing the MV's...you increase the overhead with every DML statement on a table that has an MV log on it. Before I looked at it and quantified it, I assumed this was trivial, but it's not. Jonathan Lewis did extensive research on that, showing that a single DML statement on a table with an MV log causes multiple DML statements in the background...and the overhead from "on commit" MV's makes them almost unusable in a high-DML environment.
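For anyone who hasn't used this style of replication, here's a minimal sketch of the pattern we were relying on; the db link (src_db) and the schema/table names are illustrative, not from our system:

-- On the source database: an MV log to track changes to the replicated table
CREATE MATERIALIZED VIEW LOG ON app.orders WITH PRIMARY KEY;

-- On the target database: a fast-refresh MV pulling over a db link
CREATE MATERIALIZED VIEW orders_mv
REFRESH FAST ON DEMAND
AS SELECT * FROM app.orders@src_db;

-- Refreshed on a schedule ('F' = fast refresh)...the step whose runtime kept growing for us
EXEC DBMS_MVIEW.REFRESH('ORDERS_MV', 'F');

Every insert, update and delete on app.orders also writes to the MV log, which is the per-DML overhead Jonathan Lewis measured.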
We felt like we'd reached the edge of the MV replication technology on our hardware...the next step forward in db replication was change data capture (CDC) technology. The concept with CDC is that every change made in the database (at least on the tables you care to run CDC on) is logged in the redo and archived redo logs anyway for backup purposes. So, instead of adding overhead with triggers or MV's, you can just get the inserts, updates and deletes from the logs you're already creating. All the changes are there...you just need to parse them and apply them to the remote database. The first step is to instantiate the remote table (copy it as of an SCN via db link), then start CDC from the SCN you copied the data at. The next time you do an insert on the source table, the change is written to a redo log. The CDC capture (or extract) process sees that and applies the same insert to the table in the remote db. They're kept in sync in near real-time. (There's some lag from parse overhead, network speed, etc...)
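To make that instantiation step concrete, here's a rough sketch of what it looks like in SQL; the db link (src_db), schema, table and SCN are just placeholders:

-- Note the SCN you'll instantiate from (queried on the source via the db link)
SELECT current_scn FROM v$database@src_db;

-- Copy the table as of that SCN over the db link
CREATE TABLE app.orders AS
SELECT * FROM app.orders@src_db AS OF SCN 987654321;

-- Then start the capture/extract so it picks up changes made after that SCN

Anything committed on the source after that SCN lands in the redo and gets mined and applied by the CDC process.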
I wasn't involved in the first pass at CDC here. About a year earlier, Oracle had sent some people to implement Streams...one of them a top-notch Oak Table member, but they couldn't get it stable enough to depend on. This was an early implementation of downstream Streams capture in 10g (Asynchronous Autolog mode). We wanted the load of mining to be on the target, not the source, which is a little atypical...maybe that's why we had so many issues? With the talent they sent us, I can safely say it was the product, not the implementation, that failed. Bug after bug was filed, and after months and huge expense, the project was abandoned. Incidentally, I'm told 11g Streams (which is at end of life, with no new features after 11.2.0.2) is much more stable today.
So...a year later the issue resurfaced and became a new "hot item." This time, I got a shot at it. I ran proof-of-concept performance tests of multiple products, working with vendors including Informatica, Oracle (Streams) and, at the time, an independent company called GoldenGate (which has since been bought by Oracle). There was still a very bad taste in our mouths from the last attempt at Streams...so there was no way that was going to be the chosen direction. Still, it was useful to include it in testing for comparison purposes.
When I perform these tests I usually have a list of qualities for the product prepared before testing begins (we call it a "Score Card"). After all, if a product is lightning fast but isn't stable enough to rely on, it can't be a consideration. If it's perfect in every way but priced prohibitively, it's unattainable. There's a balance of qualities that makes a "best" choice. We found Informatica was very robust and relatively easy to use. The bottleneck we found with Informatica was on the capture side. Informatica and Streams (and SharePlex) used Oracle LogMiner for their CDC, which was too slow to meet our requirements. We had planned to take a look at SharePlex, but decided not to when we saw it mined logs with the same method as Informatica and Streams.
After spending a few weeks with experts from Informatica, we finally had to say that it wasn't possible to meet the performance criteria we had set out, although they were close.
To be fair, our requirements are very difficult (and for a while, I thought unattainable), and I'm sure all the CDC technologies that use LogMiner would be more than sufficient for 99.99% of the requirements out there.
Before I go on...let me refer you to my previous post/rant about GoldenGate Director. It's the worst software product I've ever implemented, and my original motivation for blogging. It was so bad, I felt like I had to warn the world to avoid it. We just installed the latest, greatest version of Director...I really hope it's been improved. For the rest of this post, I'm talking about GG CDC only.
In the CDC world, GoldenGate is different. I was told that the guy who originally designed Oracle LogMiner did so more for auditing purposes, and it wasn't built for performance. Later, he was hired by GoldenGate and designed a new, proprietary miner...built from the ground up for performance. Take that with a grain of salt...I heard this from GG sales guys.
My first concern was that using a 3rd-party miner to read Oracle's proprietary redo/archive logs would create a dependency in our system on the relationship between 2 competing companies. What happens to us if that relationship breaks down? Right around the time we went live with GoldenGate, Oracle bought them, so that concern was removed. Thanks, Larry. :)
As testing began on the P5's, GG was immediately much faster with our workload than LogMiner, and after a few days it was easy enough to use. In our non-RAC system we were able to mine around 55GB/hr of archivelogs, at which point the single CPU core it used was pegged. There was really no way to parallelize it to use more CPU. A GG consultant who moved to Oracle University taught me a trick to improve non-RAC GG performance when CPU is the bottleneck (thanks Hitomi)...set it up as if it were a 1-node RAC (i.e., when setting up the extract, specify "THREADS 1"). This spawns 2 processes: one that purely parses the log and one that hands off the work. This improved performance around 10% in our environment, to around 60GB/hr.
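If my memory of the syntax is right, the trick amounts to registering the extract the way you would for a 1-node RAC; the extract name and trail location here are illustrative:

ADD EXTRACT ext1, TRANLOG, THREADS 1, BEGIN NOW
ADD EXTTRAIL ./dirdat/aa, EXTRACT ext1

Everything else in the extract parameter file stays the same...the THREADS clause is what triggers the split into the 2 processes described above.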
With our Exadata system, we'll be generating around 1,300GB/day of archivelogs...so 60GB/hr isn't going to cut it. 1,300GB/day works out to about 54GB/hr on average, but there are peaks and valleys of generation, and the app needs to stay as near real-time as possible. So...although 60GB/hr is barely fast enough on average, throughout the day we'd have periods of lag that would be unacceptable to the performance of the applications.
As I mentioned, CPU usage on the capture side was the bottleneck on the P5. Exadata's Nehalem EP not only has faster cores than the 5-year-old P595 cores, but we're now on RAC, so we get to use more of them. When you set up the extract in GG for RAC, you have to specify the number of threads...each thread corresponding to the archivelog sequence from a specific node. On the full-rack X2-2 Exadata machine I was testing on, that meant 8 threads and 8 nodes...so the resource that had been our bottleneck was now not only made of faster cores, we could use 8 of them. This made me think...unless something else becomes the bottleneck first, we could see over an 800% performance increase from this parallelism. Ok, not really parallelism...but you know what I mean.
On our first try, before any tweaking, we were able to get about 400GB/hr. I've worked with a few people from Oracle/GoldenGate, and Mike Bowen is one of the best. He has decades of experience in CDC and finds creative ways to overcome obstacles. Oracle sent him to us, and together we were able to tune the performance of GoldenGate on Exadata. We were only focused on the capture side. The 4 notable things he changed were:
1. Increase the size of the trail files. There's considerable overhead as trail files are switched (similar to redo log switching in the database), so increasing their size can improve performance by reducing the switch frequency (see the sketch after the parameter lines below).
2. Move the trail file destination from local storage on the compute node to DBFS. This is something that should have been done per best practices anyway: to allow for failover in the event of a node failure, the trail files must be on some form of shared storage. Having them on DBFS not only meets that requirement, it also puts them on very fast spindles in the high-performance storage nodes.
3. Switch from the GoldenGate v10 "TRANLOGOPTIONS ASMUSER" method of ASM access to the new v11 DBLOGREADER method. As with most new features, bugs were found right away. This isn't what Oracle recommends, but to avoid the bugs I pull the latest, greatest GG build from the patches on Metalink rather than from eDelivery. GG patches aren't one-offs...each one is the entire build...so the newest patch on Metalink includes every fix ever made and, in theory, is the most stable. This, and everything else I'll ever say or type, is just my humble opinion...do what you think is best.
4. Increase the dblogreader read buffer size to 4M:
TRANLOGOPTIONS BUFSIZE 4096000
TRANLOGOPTIONS DBLOGREADER
TRANLOGOPTIONS DBLOGREADERBUFSIZE 4096000
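Putting the 4 changes together, the capture setup ends up looking roughly like the sketch below. The extract name, DBFS mount point, trail size and schema are placeholders, not our actual values:

-- GGSCI: register the extract with one thread per RAC node,
-- and larger trail files on a DBFS mount
ADD EXTRACT ext1, TRANLOG, THREADS 8, BEGIN NOW
ADD EXTTRAIL /dbfs/gg/dirdat/ea, EXTRACT ext1, MEGABYTES 1000

-- Extract parameter file: v11 DBLOGREADER access with the 4M buffers
EXTRACT ext1
USERID ggadmin, PASSWORD ********
EXTTRAIL /dbfs/gg/dirdat/ea
TRANLOGOPTIONS BUFSIZE 4096000
TRANLOGOPTIONS DBLOGREADER
TRANLOGOPTIONS DBLOGREADERBUFSIZE 4096000
TABLE app.*;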
Just a note if you're using GoldenGate on Exadata: we encountered a bug where the extract abends when it comes across HCC-compressed objects in the log...even objects that are excluded from the extract. If you have any HCC compression in Exadata at all (and of course you will), GG will abend. Oracle created a patch that corrects this...if a table you aren't mining is compressed, it's no longer a problem. Hopefully that patch will be generally available on Metalink soon. For that matter, hopefully mining compressed tables will be possible soon.
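For reference, excluding a table from the extract looks like the line below (the table name is hypothetical); the point of the bug was that before the patch, even an excluded HCC table would still abend the extract:

TABLEEXCLUDE app.hcc_history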
After adding "TRANLOGOPTIONS DBLOGREADER", new entries appeared in the DBA_CAPTURE view (the view used by Streams), which is strange. The capture processes were disabled, but their "start_time" was more recent than the start time of the GG capture process. Since there's now a Streams capture process registered for GG, you get RMAN-08137 errors when you try to delete archivelogs, even archivelogs that aren't needed by GoldenGate (MOS note 1079953.1). This is due to a delay in updating the view...so even though a log isn't needed, for a period of time it's still reported as needed. My opinion: it's great they brought this functionality to GoldenGate, to prevent you from deleting archivelogs during a backup that GG hasn't read yet...but this needs to be up-to-date information! From this feature there's a new wait in the database:
- Wait event "Streams miscellaneous event" in wait class "Other" was consuming significant database time.
- The Streams miscellaneous event will be renamed to "Waiting for additional work from the logfile" to better describe the activity, as of Oracle release 11.2.0.2.x. See BugDB 12341046 for more information.
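Going back to those DBA_CAPTURE entries: if you hit the RMAN-08137 behavior, a quick (and purely illustrative) way to see what GoldenGate registered, and the checkpoint SCN the database is honoring, is something like:

SELECT capture_name, status, start_time, required_checkpoint_scn
FROM dba_capture;

Roughly speaking, archivelogs containing changes at or after required_checkpoint_scn are the ones still reported as needed.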
After dealing with the new bugs, we were able to read 700GB of mixed OLTP/DSS archivelogs in 46 minutes, to achieve just over 1000GB/hr (over 125GB/hr/node). After spending several weeks 2 years ago trying to scrounge for every last bit of CDC capture speed (I think I peaked below 40GB/hr on the old hardware with LogMiner), I find this near-TB/hr speed ridiculous. Not only are we good to go for this year's peak performance requirements, we'll do it easily, in near real-time, with so much bandwidth to spare it's hard to imagine a day when this speed will be insufficient. Just like the "AT class computer," the awe of 1000GB/hr CDC will, I'm sure, be temporary. GoldenGate performance, although...not cheap, is pretty awesome...for now. :)