
Monday, March 5, 2012

IOPS requirements

So...let's say you're moving your database from the antique hardware it's running on (the old hardware is always a relative antique, right?) to the newest, latest, greatest hardware.  You're also upgrading your storage to a new SAN or NAS.  The storage architect says, "Yep...according to this shiny marketing page on the storage vendor's website, it's the fastest storage in its class, so it should be good enough...and we can thin provision and deduplicate and do backups with snaps and pool all the storage into one monster ring of storage to rule them all!  And since disk drives are cheaper when you buy them in 3TB versions, we'll save lots of money by getting rid of our smaller spindles and using these big SATA spindles!"

Ok...maybe I'm pessimistic, but whenever I hear an extreme like "fastest", a red flag goes up in my head...and although the great features of modern storage might be worth it, there's usually some overhead involved.  Also, the term "larger spindles" usually translates in my head to "fewer IOPS", unless you short stroke them (check out my "How much faster is Exadata High Performance than High Capacity" post for details)...but if you're short stroking, they really don't have that much more capacity, and you have to use the capacity you do have more intelligently.

...and in general, marketing pages on websites are usually only correct under extreme examples that make their hardware look better.  Testing and quantifying performance are always necessary.

So you've run your ORION tests and you now know your new storage is capable of 500,000 IOPS.  Is that enough for your databases?  Here's a query you can use to find out.  It'll go through the AWR data in your database to find the busiest hours and the number of IOPS you used during them.  It's possible this is lower than your true peak IOPS, since each value is an average over an hour, so treat these results as minimums.

I looked around on the internet and couldn't find anything that met my requirements...the solution that came closest ignored the fact that a physical IO can be a read or a write...which is a pretty big miss.  In my situation I actually had multiple databases using the same SAN, so I took these results and put them into a table where I could sum them up by hour, see when the busiest IOPS requirements overlapped, and how high they were.
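If you want to do the same multi-database math, the staging table can be as simple as something like this (the table and column names here are just placeholders I'm using for illustration, not anything standard):

-- Hypothetical staging table: load the output of the IOPS query below
-- from each database into it, then sum across databases per hour.
create table iops_by_hour
( db_name      varchar2(30),
  sample_hour  date,
  iops         number );

-- Combined requirement per hour across every database sharing the SAN
select sample_hour, sum(iops) total_iops
from   iops_by_hour
group  by sample_hour
order  by total_iops desc;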

Looking back, I should have written this using SQL's LAG function, but...this works.  Keep in mind, this is what Oracle believes to be a physical IO...if you're not in ASM (you should be), you may be using the OS filesystem cache...and even if you're in ASM, you may be caching on your NAS or SAN...both situations are hidden from Oracle.  When I put this together, there were a couple of issues I had to overcome.

1. The stat_id for physical reads/writes was different from db to db...even between databases on the same version.  I'm not sure why...but I suspect they were originally different versions that were upgraded to the same version.  (You can check yours with the quick query after this list.)

2. This data accumulates from snap to snap since the last bounce, so I had to take a snap, compare it to the previous snap to find the delta, then find the number of seconds between those snaps and divide to get IO/sec.  If there's a database bounce, the accumulator resets to zero and would report a negative value, so I filtered out the negative results.

3. Snaps aren't exactly as long as you think they are...they're configurable, so you might have them set for 15 minutes or an hour, and they may actually run for 15 minutes and 1 second.  The query takes the variable length of a snap into account.
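As a quick sanity check on issue #1, you can see which stat_id values your database uses for the two statistics involved...this is just a lookup against v$statname:

-- The stat_id for these two statistics can differ between databases,
-- which is why the main query joins through v$statname by name
select stat_id, name
from   v$statname
where  name in ('physical read total IO requests',
                'physical write total IO requests');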

This works for RAC and non-RAC databases.  I hope this helps you in your struggle against the vendor-marketing-website-page-believing storage administrator in your life.


select (select name from v$database) db_name, sample_hour, (rps+wps) iops
from (
  with snaps as (
    -- cumulative physical read/write IO request counters per instance, per snap
    select hiof1.instance_number, hiof1.snap_id,
           sum(hiof1.value) reads, sum(hiof2.value) writes
    from sys.wrh$_sysstat hiof1, sys.wrh$_sysstat hiof2
    where exists (select 'X' from v$statname sn
                  where sn.name like '%physical read total IO%'
                    and sn.stat_id = hiof1.stat_id)
      and exists (select 'X' from v$statname sn
                  where sn.name like '%physical write total IO%'
                    and sn.stat_id = hiof2.stat_id)
      and hiof1.snap_id = hiof2.snap_id
      and hiof1.instance_number = hiof2.instance_number
    group by hiof1.instance_number, hiof1.snap_id
  ),
  my_snaps as (
    -- actual length of each snap interval, in seconds
    select snap_id, instance_number, begin_interval_time, end_interval_time,
           extract(second from (end_interval_time-begin_interval_time)) +
           (extract(minute from (end_interval_time-begin_interval_time))*60) +
           (extract(hour from (end_interval_time-begin_interval_time))*60*60) seconds
    from dba_hist_snapshot
  )
  -- delta between consecutive snaps, divided by the snap length, gives IO/sec;
  -- negative deltas (instance bounces) are filtered out by the >1 predicates
  select s1.snap_id snap_1, s2.snap_id snap_2,
         to_date(to_char(begin_interval_time,'MM/DD/YYYY HH24'),'MM/DD/YYYY HH24') sample_hour,
         sum(s2.reads-s1.reads) reads, sum(s2.writes-s1.writes) writes,
         trunc(sum(s2.reads-s1.reads)/sum(seconds)) rps,
         trunc(sum(s2.writes-s1.writes)/sum(seconds)) wps
  from snaps s1, snaps s2, my_snaps ms
  where s1.snap_id = ms.snap_id
    and s1.snap_id = (s2.snap_id-1)
    and (s2.reads-s1.reads) > 1
    and (s2.writes-s1.writes) > 1
    and s1.instance_number = s2.instance_number
    and s1.instance_number = ms.instance_number
  group by s2.snap_id,
           to_date(to_char(begin_interval_time,'MM/DD/YYYY HH24'),'MM/DD/YYYY HH24'),
           s1.snap_id
) where sample_hour > sysdate-90
order by 3 desc;
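For what it's worth, here's a rough sketch of how the LAG approach I mentioned above could look...this is only an illustration of the idea, not something I've validated against the output below:

-- LAG-based variant of the same idea: build the per-snap cumulative totals
-- once, then diff each row against the previous snap for the same instance
-- and divide by that snap's length.  Negative deltas (after an instance
-- bounce) should still be thrown away.
with snaps as (
  select ss.instance_number, ss.snap_id, sn.begin_interval_time,
         extract(second from (sn.end_interval_time-sn.begin_interval_time)) +
         extract(minute from (sn.end_interval_time-sn.begin_interval_time))*60 +
         extract(hour   from (sn.end_interval_time-sn.begin_interval_time))*3600 seconds,
         sum(case when st.name like '%physical read total IO%'  then ss.value end) reads,
         sum(case when st.name like '%physical write total IO%' then ss.value end) writes
  from sys.wrh$_sysstat ss, v$statname st, dba_hist_snapshot sn
  where st.stat_id = ss.stat_id
    and (st.name like '%physical read total IO%' or st.name like '%physical write total IO%')
    and sn.snap_id = ss.snap_id
    and sn.instance_number = ss.instance_number
  group by ss.instance_number, ss.snap_id, sn.begin_interval_time, sn.end_interval_time
)
select instance_number, snap_id, begin_interval_time,
       trunc((reads  - lag(reads)  over (partition by instance_number order by snap_id)) / seconds) rps,
       trunc((writes - lag(writes) over (partition by instance_number order by snap_id)) / seconds) wps
from snaps
order by instance_number, snap_id;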

Keep in mind, Oracle considers the AWR tables part of the separately licensed Diagnostics Pack...so if you run this query on a database that isn't licensed for it...shame on you.  Jonathan Lewis has a nice post to show you how to clean the audit trail.

For example output, here are the results from a busy database I've recently worked on.  I changed the db name and times, but not the IOPS (yes, that's 210 million physical IOPS...from Oracle's perspective):

db_name SAMPLE_HOUR IOPS
mydb01 01/05/11 08:00 AM 210487926
mydb01 01/25/11 11:00 PM 33824606
mydb01 01/26/11 01:00 AM 24807386
mydb01 09/04/12 03:00 PM 16523389
mydb01 07/26/11 02:00 AM 14372534
mydb01 06/04/12 02:00 PM 9307085
mydb01 05/19/12 02:00 PM 8735280
mydb01 04/20/12 01:00 PM 2439049
mydb01 03/06/11 11:00 AM 1460471
mydb01 09/17/12 07:00 PM 665055

Monday, January 30, 2012

GoldenGate capture with compressed tables will be available soon!

I've been told "GoldenGate will support compressed tables...soon" for literally years now by multiple people at Oracle.  My first SR on this was opened on March 29th, 2010, nearly 2 years ago!  That SR's resolution was to add me to these bugs:

bugdb - 9416239, which tracks support of OLTP table compression
bugdb - 9428399, which tracks Exadata V2 HCC compression
bugdb - 9426065, which tracks support of Oracle compressed tables

Mining compressed tables is a requirement for many large databases that use CDC...especially databases on Exadata, where you're all but expected to make use of Hybrid Columnar Compression (HCC).  Think of all the benefits of compression in your database (there's a DDL sketch of both compression flavors after this list):

1. Compressed data means smaller datafiles, which means faster restore times, improving your RTO.

2. Your 8k block actually stores more than 8k of data, which means you need to read/write fewer blocks.  Since blocks are stored compressed in memory, it also means more data fits in the same size db cache.  As blocks move between memory and disk, it improves your potential IOPS capacity.  Ok, not really increasing IOPS capacity, but increasing the amount of data you move per IO, which has a similar effect.

3. Compression means lower storage requirements, which in turn means lower storage costs.
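For reference, here's roughly what the two compression flavors discussed here look like as DDL (the table names are made up for illustration, and HCC requires Exadata or other supported Oracle storage):

-- Advanced/OLTP compression (11g syntax), suitable for tables that still
-- take conventional DML
alter table my_oltp_table compress for oltp;

-- Hybrid Columnar Compression, best for data loaded in bulk
alter table my_warehouse_table compress for query high;
-- or, for colder data:
alter table my_archive_table compress for archive high;

-- Note: these set the attribute for new data; existing rows aren't
-- compressed until you ALTER TABLE ... MOVE or reload the data.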

In a previous post, I mentioned how, when I first started looking at GoldenGate with Exadata, I found that not only were compressed tables not captured; if GG came across a logged change to a compressed table in the archivelogs or redo, it would abend...even if that table was excluded from capture.  We submitted a priority 1 SR and Oracle created a patch for this, so we were able to begin using GG with Exadata...but all tables that used compression had to be excluded from GG capture because of this limitation.  Put another way, we were forced to not compress the tables that needed to be captured, and those were the biggest, most compressible tables.  This forced my client to use much more storage in their Exadata cells than they anticipated, and today they're preparing to buy additional storage cells, partially because of this limitation.

Imagine the fun they had explaining to their management that they needed hundreds of thousands of dollars to purchase additional storage from Oracle, ultimately due to an Oracle bug. :)  Come to think of it...no wonder it took Oracle 2 years to fix this! :)

Not to get too far off track, but for a truly high-performance database, you should really test how advanced compression affects your system's performance.  Your mileage will vary, but based on my testing, I would expect it to improve your performance.  See http://otipstricks.blogspot.com/2011/02/exadata-index-dropping-and-compression.html.  There are other huge performance improvements 11.2.0.3 offers for Exadata, especially for OLTP environments.

Performance aside, IMHO you should begin to prepare for the upgrade to 11.2.0.3 PB 3 to take advantage of this new GG feature/bug fix, so you're ready to go when the new version of GG is released. 

I have no special knowledge from a friend at Oracle this time...I'm gleaning this from a statement in a PDF from Douglas Reid, Oracle GoldenGate product management.  GG 11.2.1 will have tighter integration with the XStream Out API (capture), which means GG will be using a call to a procedure already in the Oracle kernel.  That internal call is what will handle OLTP and HCC compression, which to this point hasn't been possible.  Soooo...based on the schedule in that PDF of the approximate March/April release of GG 11.2.1, there must be a database change prior to that time to allow it.  Since we can't do it in 11.2.0.3 PB2 and it's going to happen in the next several weeks...it must be coming in PB3.  The release schedule "remains at the discretion of Oracle"...but short of mind reading, this is the best we've got.

So, 11.2.0.3 PB3 will be out within the next few weeks.  Based on the previous release cycles of patch bundles, it will probably be sooner than that...I would guess sometime this week Exadata 11.2.0.3 PB3 will be released (it requires Exadata Storage Server 11.2.2.4).  Check Metalink note 888828.1 for updates...by the time you read this, you'll likely see 11.2.0.3 PB3 listed in that note...if not, check back in a few days.  (Add that note to your MOS favorites by clicking the little star on it.  It's the best source for finding what's currently GA.)

To sum it up, when you use GG 11.2.1 with Exadata 11.2.0.3 PB3, you'll FINALLY be able to mine compressed tables with GoldenGate.
I could be wrong, but my impression is that a lot of people at different companies are using GoldenGate with Exadata.  "Extreme Goldengate Performance on Exadata" is one of my most popular posts.  Given the cost per GB of storage in Exadata, compression could save you a *huge* amount of money.  Once GG 11.2.1 is released, the only reason I can think of NOT to compress everything you can is that your access patterns don't work well with it...assuming the budget that was big enough to buy Exadata is also big enough to license compression. ;)

After ~2 years of waiting for these 3 bugs to be fixed, the wait is finally over.

Wednesday, January 4, 2012

Yet another plug for a great book

If you know anything about Oracle databases, you've likely heard of Jonathan Lewis and the great work he's done over the years on Oracle internals (why things do what they do...what's going on under the hood of the database)...especially his work on the cost-based optimizer.  He's inspired me to be much better than I would have been, because after reading his books, I was humbled by my relative ignorance and became desperate to improve.  He's literally an alien of extraordinary ability....


We exist in a field filled with many geniuses.  What sets him apart from all but a very few is his ability to bring people to his depth of understanding...his ability to explain complicated things is unmatched...but come to think of it, many of the "greats" have this ability: Cary Millsap, Tanel Poder, Rich Niemiec, Kerry Osborne...not to mention many others.  Hmm...maybe that's the difference between obscurity and notoriety...


I bring this up because I've been reading one of his newest books, Oracle Core: Essential Internals for DBAs and Developers.  His approach to explaining things in the book is interesting...he's recognized the circular nature of Oracle internals: you have to have a clue about everything before you can get a little depth about anything in particular, and you have to have a little depth about everything before you can get deep into anything.  That's how he explains things.  By the time he gets into real depth, he's already brought you to a place where it's just a small step, rather than a leap.  This makes the book interesting both to people who have used Oracle for years and to newbies.


I'll add it to my book list on the right....



Thursday, December 22, 2011

Databases on Flash done Inexpensively


If you're looking at putting your database (SAP, HANA, SQL Server, MySQL, Oracle...whatever) on flash, you should *really* take a look at FusionIO.  FusionIO is how Facebook is able to drive its MySQL databases so fast.  The ioDrive card is a single point of failure for your storage, though, so you need some way to protect it.  Oracle ASM's normal redundancy is effectively software mirroring...and it's very simple to set up and use.  If you have the budget, you can get truly extreme performance by using ASM's normal redundancy to mirror two ioDrives, but that method of protection effectively doubles the cost of your FusionIO purchase.

If you need more performance than you currently have, and you can't afford the cost of the storage it would take to put your entire database on flash, there's a different way to get it done, while still protecting the storage from a single point of hardware failure.

In my previous post about databases on flash, I talked about "option #4", which is: set up ASM as if it's an extended cluster (when it's not) and make the fast, relatively expensive (per GB) FusionIO ioDrive storage the preferred read failgroup.  The other failgroup is your existing SAN storage.  There are advantages to doing this instead of just getting faster storage on your SAN.  You're more HA than before, because you're protected from a storage failure on either the SAN or the ioDrive.  If the SAN storage is unavailable, the database stays up and running, tracking all the changes made since the SAN failed.  Depending on how long the outage lasts and how you configured the diskgroup, when the SAN comes back, ASM applies those changes and after a while the SAN storage is back in sync with the ioDrive.  If the ioDrive goes down, the database continues to run completely on SAN storage until the ioDrive is back online...at which point things can be synced up again.

I wanted to quantify how this performs in a little more detail with a few tests.  Using the UC460 from the previous DB-on-flash tests, I added an HBA, set up some mirrored SSDs on a SAN with 2 partitions, configured an ioDrive with 2 partitions, and then created 3 databases...one purely on the SSDs, one purely on FusionIO, and one that uses normal redundancy with the preferred reads coming from FusionIO.  All the databases are set up exactly the same, with only 256M of db_cache each.

Here is atop reporting IO while Swingbench is running on a database that's completely on the SSDs (sdb is the SSD presented from the SAN).

Test 1:


This is the same idea, but this time no SSD...with purely the ioDrive (fioa).

Test 2:


To set the preferred read fail group, you simply need to do this:

alter system set asm_preferred_read_failure_groups = 'diskgroup_name.failgroup_name';
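For context, the diskgroup underneath a setup like this would be created with two named failgroups, something like the sketch below.  The disk paths and names are made up for illustration, and the preferred-read parameter is set on the ASM instance:

-- Normal redundancy diskgroup: one failgroup on the ioDrive,
-- one on the legacy SAN storage
create diskgroup data normal redundancy
  failgroup fio disk '/dev/fioa1', '/dev/fioa2'
  failgroup san disk '/dev/sdb1',  '/dev/sdb2'
  attribute 'compatible.asm'='11.2', 'compatible.rdbms'='11.2';

-- Let ASM track changes for a while instead of dropping a failed disk
-- immediately (fast mirror resync)
alter diskgroup data set attribute 'disk_repair_time'='8h';

-- Read from the ioDrive failgroup whenever possible (run in the ASM instance)
alter system set asm_preferred_read_failure_groups='DATA.FIO';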

This is the combination of the two, letting ASM distribute the write load in normal redundancy, with reads coming from the FusionIO card.


Test 3:



...as you can see:

1. All reads (yes, except for those 4 stray IOs) are coming from fioa, and the writes are distributed pretty much equally (both in IOs and MB/s) between fioa and sdb.

2. Atop shows that even though fioa is serving all the reads and doing the same amount of writes as the SSDs on the SAN, it's still easily able to keep up with the workload...it's being throttled by the slower sdb storage.  One more time I have to point out...the ioDrive is sooo fast.  Incidentally, this speed is from the slowest, cheapest ioDrive made...the other models are much faster.

3. The Swingbench workload forces a fixed ratio of reads to writes, so the potential for more than the 200% read improvement shown above exists.  If you logged in to the database while the test is running, you'd see that reads are lightning fast, and writes are 30% faster than they've ever been on the legacy storage.  In this configuration the legacy storage only has to do writes and no reads, so it's able to do the writes faster than before (60,596 vs 46,466 IOs and 57.19 vs 43.92 MB/s).  All this performance boost (200%+ reads, 130% writes), and we now have an HA storage solution...at half the cost of moving the database completely to flash.

In the real world, your legacy storage wouldn't be a mirrored SSD on a SAN; it would likely be much faster FC storage in a RAID array.  This ioDrive could be a failgroup to SAN storage 3 times faster than the SSD before you'd hit diminishing returns...at which point, you could just add a 2nd ioDrive.  Still, I think the approach and the results would be the same.  Unless you have a legacy SAN solution that's faster than what a few of these can do together, there are definite performance gains to be had.


Monday, December 19, 2011

Flash with Databases-Part 3

In my previous posts on this topic, I talked about the flash technology of SSDs and compared it to the performance of a FusionIO ioDrive card.  I also talked about how, in order to justify the cost of high-performance storage, especially FusionIO technology, you have to transform the storage discussion from "How many GB of storage do you need?" to "Besides how much storage, what are your IOPS requirements?"  I'm telling you now...you will get resistance from storage administrators.  Try to point out the extreme case: you wouldn't run the enterprise database on slow SATA disks...so by bringing up FusionIO, you're just talking about a different, faster storage tier...specialized for databases.  They'll point out that the storage isn't 100% busy now...so getting faster storage won't help.  This is a fallacy...well, I should say, this might be a fallacy.  Usually "% busy" means either average throughput/max throughput or the percent of time spent waiting on storage.  Response time (which is key to designing a fast database) is often hidden from those reports.  This means your storage is often your bottleneck...you just aren't measuring the right things, so you can't see it in your reports.  When the storage admin shows you that your slow database is only 50% busy on storage while your AWR reports are complaining about storage performance, what is the bottleneck then?  If your CPU utilization is low, you don't have excessive waits on locks and your AWR report is complaining about storage...I'll bet it's your storage.

So...how do you design for the necessary IOPS?  This is actually a more difficult question than you might think.  What I did a few months ago for a client consolidating 22 databases into one was add up the requirements of each of the 22 databases (found via dba_hist_iostat_filetype), joining by the time of their snaps, so I could find the requirements of the peak times at the storage level.  This was interesting because the peaks and lulls in the performance requirements didn't coincide when we would have thought.

For a single database, it's much easier.  I'm talking about OLTP databases here...for DSS/warehouse databases, look at throughput, not IOPS.  As it relates to configuring for FusionIO, the query below will get you started.  DBA_HIST_IOSTAT_FILETYPE has the number of physical IOs by file type, accumulated since the last bounce.  It resets when you bounce your database, so you have to filter out negative results.  What I did below is take the number of IOs between two snaps in sequence, find the number of seconds in the duration of that snap, and divide to get the IOPS.  If you're looking at a DSS database, you can look at the other columns in that view to find throughput per second.


select * from (
  with snaps as (
    -- total read/write requests across all file types, per snap
    select snap_id,
           sum(small_read_reqs+large_read_reqs)   reads,
           sum(small_write_reqs+large_write_reqs) writes
    from dba_hist_iostat_filetype
    group by snap_id
  ),
  my_snaps as (
    -- actual length of each snap interval, in seconds
    select snap_id, instance_number, begin_interval_time, end_interval_time,
           extract(second from (end_interval_time-begin_interval_time)) +
           (extract(minute from (end_interval_time-begin_interval_time))*60) +
           (extract(hour from (end_interval_time-begin_interval_time))*60*60) seconds
    from dba_hist_snapshot
  )
  -- delta between consecutive snaps / snap length = IOs per second;
  -- the >1 predicates throw out negative deltas caused by instance bounces
  select s1.snap_id snap_1, s2.snap_id snap_2,
         to_char(ms.begin_interval_time,'MM/DD/YYYY') sample_day,
         sum(s2.reads-s1.reads) reads, sum(s2.writes-s1.writes) writes,
         trunc(sum(s2.reads-s1.reads)/sum(seconds)) rps,
         trunc(sum(s2.writes-s1.writes)/sum(seconds)) wps
  from snaps s1, snaps s2, my_snaps ms
  where s1.snap_id = ms.snap_id
    and s1.snap_id = (s2.snap_id-1)
    and (s2.reads-s1.reads) > 1
    and (s2.writes-s1.writes) > 1
  group by s2.snap_id, to_char(ms.begin_interval_time,'MM/DD/YYYY'), s1.snap_id
  order by 7 desc
) where rownum < 11;

The output will look something like this:

READS     WRITES    RPS     WPS
664318    6492660   184     1801
521170    6356511   144     1763
2242398   1930744   1984    1708
1836943   5757317   509     1597
68419     9689      11201   1586
7373341   2418357   4794    1572
631855    5669473   175     1571
5218554   5272317   1446    1461
1059836   4884479   293     1354
5422603   4726101   1504    1310
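If you're sizing a DSS/warehouse database for throughput instead of IOPS, the same view has megabyte columns you can swap in for the request counts...something like this fragment in place of the first CTE above (the column names are from DBA_HIST_IOSTAT_FILETYPE; the rest of the query stays the same):

-- Megabytes read/written per snap, for throughput (MB/s) instead of IOPS
with snaps as (
  select snap_id,
         sum(small_read_megabytes+large_read_megabytes)   read_mb,
         sum(small_write_megabytes+large_write_megabytes) write_mb
  from dba_hist_iostat_filetype
  group by snap_id
)
select * from snaps;  -- then divide the snap-to-snap deltas by seconds, as above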


For configuring your database to use flash from SSD/FusionIO storage, there are 4 ways to do it that come to mind (there might be more...if you know of a better way, let me know), each with its own pros and cons.  The problem we're trying to deal with here is that an SSD or FusionIO card is a single point of failure:

1. Use Oracle's DB Flash Cache.  This can be thought of as a level 2 extension of the db_cache.  As a block ages out of the db_cache in RAM, it goes into the db flash cache.

Pros: Easy to use and designed for flash.  Since it moves "hot" data at the (usually) 8k block level, it's *much* more efficient than most SANs, which move hot data at a page level...depending on the SAN, that can be anything from a few megs to 256MB+.  Usually the SAN moves the data on some sort of schedule...i.e. daily, it will look to see which 42MB sections were hot and move them to flash.  If your activity is consistent, that's fine; if not, you probably won't be helped by it.  Obviously, moving hot blocks in real time is much more efficient.

Cons: There's a "warm up" time for this strategy...the db flash cache is cleared on a db bounce, and blocks need to come back into the cache over time.  Also, you have to use OEL for this feature to work.  From what I've been told by smart people at Oracle, this isn't a technical requirement; it's built into the database logic as a marketing ploy to encourage people to migrate to OEL.  I'm the biggest cheerleader I know for companies to at least take a look at Oracle Enterprise Linux...it has a lot of merits that mostly come down to support cost savings.  But if companies choose to go with OEL, it should be because it's the best choice for them...not because Oracle forced them to by disabling database features if they don't.  This is the same argument I made for enabling HCC on non-Sun storage.  And what happens if the FusionIO card or SSD you're using for the db flash cache crashes?  I know the data isn't primarily stored there...it's also on your normal storage, so your data isn't in jeopardy...but will it crash your database?  If so, you need 2 FusionIO cards to mirror the db flash cache, which doubles your FusionIO storage cost.  (There's a quick init-parameter sketch for this feature after the option comparison below.)

2. Use Data Guard.  You can set up a 2nd server and have the redo applied to it in real time, so if there's a failure of the local FusionIO storage, you just fail over to the other server.  Both servers would have a single FusionIO card.

Pros: Provides redundancy for all non-logical problems, including server hardware.

Cons: This doubles your server hardware costs and FusionIO storage cost, and might more than double your Oracle licensing fees.

3. Use Oracle's ASM to mirror the storage across two FusionIO cards with normal redundancy.

Pros: Easy to do
Cons: Doubles $ per GB costs of storage

4. Use Oracle ASM's extended clustering option and set up FusionIO as the preferred read failgroup, with cheaper SAN storage as the 2nd failgroup in normal redundancy.  This is a topic for a whole new post.

Pros: Provides redundancy between FusionIO and existing legacy storage, but sends no read IOs to that legacy storage, making it able to keep up with write IOs much better.  Usually, reads are over 50% of an OLTP workload...sometimes 3-4X the writes.  You can use the query above to find out if that's true for your workload.  This means that, when paired with FusionIO, your existing storage will perform many times faster than it does today, because it's only getting a fraction of the workload.  It's also HA: if there's a failure in your legacy storage or your FusionIO storage, the database keeps running while you correct the issue.  11gR2 has a feature to speed up the resync of failgroups after the problem is corrected...it tracks the changes made while one of the failgroups is down, and when it comes back, it only has to apply those changes rather than copying all the data.  Your reads (and therefore the majority of your database workload) are as fast as FusionIO can provide (very, very fast).

Cons: Writes don't perform as well as a pure FusionIO solution, but they will perform faster than on your existing storage today, because the legacy storage won't get the read workload.

When you consider the costs and benefits, #4 is an excellent solution.  #3 is better if you have the budget.  If you're doing RAC, I think #1 is your only option if you want to use FusionIO.  Remember, if you're doing RAC, that the db flash cache in #1 is local to the node.  Normal L1 db_cache blocks will go across the interconnect to other nodes, but the L2 db flash cache does not.
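As promised above, here's roughly what enabling option #1 looks like...just two init parameters pointing at the flash device (the path and size here are made up for illustration):

-- Point the database at the flash device and tell it how much to use;
-- subject to the platform restriction discussed above, and needs a bounce
alter system set db_flash_cache_file='/dev/fioa1' scope=spfile;
alter system set db_flash_cache_size=300G scope=spfile;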

For comparison's sake, the output above from the query is from a fairly fast database with 40+ cores on a fast EMC Clariion CX-4 with 15k FC storage.  This is NOT to say that's the performance limit of the CX-4, only that this particular database could perform as well on FusionIO.  Here are the numbers from Swingbench on a single FusionIO card with the UC460 server I used in my previous post:

SNAP_ID SAMPLE_DAY SNAP_ID2 READS WRITES RPS WPS
21 12/15/2011 20 8297280 10464273 3843 4846


...this means FusionIO is a viable alternative to a large, expensive SAN for this particular OLTP database, if you can deal with its "local storage" limitations.  This db could perform 2.5X faster or more on FusionIO...its size would make me want to get 2 cards...which would give us many times the potential IOPS it gets today from the SAN, for just a few thousand dollars.

All this so far was about OLTP...let's talk for a second about DSS databases.  I've seen 1900MB/sec throughput on a Clariion CX-4 that was short stroked with many 15k FC drives on an IBM P5 595...I'm sure there are better results out there, but from a database, that's the best I've seen on a CX-4.  It would likely take 2 FusionIO cards like the one I tested to match that throughput (based on specs), but there are other, bigger, badder FusionIO cards that could blow that performance away, with speeds up to 6.7GB/sec per card, while still getting microsecond response times.  Assuming you have several PCI-E slots in your large server, you could use many of these things together.  Unless you have a monstrous server, with this storage CPU becomes the new bottleneck, not storage.

Summary:  Ignoring products like Kaminario K2, FusionIO has a niche: it can't work for all workloads and environments because it has to be locally installed in a server's PCI-E slot.  There are a lot of ways to use it with databases...just make sure you recognize that it's a single point of failure and protect your data.  For OLTP databases that are a few TB, I can't imagine anything faster.  The FusionIO ioDrive has become the de facto standard for running SAP's HANA database, and more and more SQL Server databases are using it to break performance thresholds too.  List price for the entry-level card I tested (384GB/MLC) is around $7k, but the cost of the larger cards isn't linear...the 2.4TB cards are cheaper per GB than the small card I tested.  The FusionIO Octal cards are 10.24TB in size.  If you have a few thousand to spend, you should check it out.  For that matter, if you have a few thousand laying around...take a look at their stock, FIO.  It's a small company with a great product that's taking huge market share.  I've been told they're currently the flash market leader, after only 1 year.  I wouldn't be surprised if they get bought out by one of the storage giants (Dell/EMC) soon.


<*UPDATE*>
I've just heard that FusionIO has added a former HP hotshot to its board...I guess we now know who's going to buy them out. :)


Also, I came across this interesting blog talking about FusionIO with MySQL.
<*/UPDATE*>

I realize this is starting to sound like a sales pitch for FusionIO...I have no ties to them at all...but as a geek, I get excited when I see technology that can boost database performance to this degree.  People are afraid of flash technology because very early products didn't have the longevity or stability offered by enterprise-class FC disks.  That has hugely improved over the years.  The FusionIO cards are expected to last ~7 years, and have a 5 year warranty.

Flash with Databases-Part 2

In my previous post on Flash with databases, I talked about the upcoming FusionIO tests.


Here's the hardware configuration overview:
Server1   : Dell 610, 2X4 (8 cores)
Server2   : Cisco UC460, 4X8 (32 threads, 16 cores)
SSD       : 2 mirrored SSD disks
HBAs      : 2 x 4GB/s
DB_CACHE  : 256MB
FusionIO  : ioDrive


First, let's look at the SSD to establish a baseline.  The configuration is 2 SSD disks, mirrored, over 2 4GB/s HBAs.  The tests were done multiple times, so these results are averages.  For all the Swingbench tests after the first, I reduced the db_cache to just 256MB to make the IO nearly all physical...I want to stress the storage, not the cache, and a good storage test must make the storage the bottleneck.  It might be interesting to compare these results to the Hitachi USP-V tests from last year.  The testing was very similar, but the results were opposites of each other, due to the extreme response time differences.
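If you want to pin the cache down the same way for your own storage tests, it's just a matter of disabling automatic memory management and shrinking the buffer cache...something like this sketch (the sizes match what I used here, and you'd bounce the instance afterward):

-- Force (nearly) all IO to be physical by making the buffer cache tiny:
-- turn off automatic memory/SGA management, then fix db_cache_size
alter system set memory_target=0    scope=spfile;
alter system set sga_target=0       scope=spfile;
alter system set db_cache_size=256M scope=spfile;
-- restart the instance for the spfile changes to take effect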







ORION                      Swingbench on 11gR2                             topas/atop
IOPS    MBPS     Latency   Max TPM   Max TPS   Avg TPS   Avg Resp          Disk Busy %   CPU%
8303    100.98   0.52      164393    3489      2197      62                100           1.61
                           54674     1100      872       156               100           1.1


IMHO, these are very nice results...this storage would support a very fast database.  Obviously, more SSDs would mean more throughput (if we have enough HBAs).


The SSD tests above were done on a Dell 610, dual-socket, 8-core server.  As you can see from the Disk Busy and CPU columns, atop was reporting 100% disk utilization and 1.1 cores utilized (110% CPU used).


When I first started the tests using FusionIO, I could see a huge speed difference right off.  Here are the Orion results:





ORION
IOPS MBPS Latency
33439 750.5 0.08


Compared to the SSD, that's 4X the IOPS, 7.5X the throughput and 6.5X faster response time!  The response time is measured in milliseconds...so what Orion is saying is that the response time is 0.08 milliseconds...which is 80 MICROseconds.  My baseline is very fast SSD disks...to give you perspective, it wouldn't be unusual to see normal FC storage on a SAN show 10-15 ms latency.


Above, I mentioned that in order to test storage, the storage has to be the bottleneck in the test...but I had a problem when I did the Swingbench test.  No matter what I did, I couldn't make the storage the bottleneck after I moved the indexes, data and redo to flash.  At first I thought something was wrong and the FusionIO driver was eating up the CPU...but I didn't see that problem in the Orion tests on FusionIO, so that didn't make sense.



 
Swingbench on 11gR2 / topas-atop
Max TPM   Max TPS   Avg TPS   Avg Resp   Disk Busy %   CPU%    Notes
20726     478       261       597        100/6         6.64    Indexes on flash; first time using the FusionIO-CPU is very high
189132    3522      2897      47         100/63        4.92    Indexes and data on flash
162165    2868      2638      42         97/100        8       Indexes, data and redo on flash; all CPU cores are completely pegged


The problem wasn't the driver...the problem was that I was able to generate more load than ever before, because the bottleneck had moved from storage to CPU...even with my 256MB db_cache!

Notice too that atop was reporting FusionIO as 100% utilized...strange things happen to monitoring when the CPU is pegged.  For the Disk Busy % field, the first number is the SAN storage, the second is the % busy reported by atop for the FusionIO card.  That 100% busy is accurate from a certain perspective.  The FusionIO driver uses the server's CPU to access the storage, and it couldn't get any CPU cycles, so to atop it appeared that we were waiting on the storage, and the storage was reported as 100% busy.  That isn't to say the FusionIO card was maxed out...just that you couldn't access the data any faster...which isn't exactly the same thing in this case.  The CPU bottleneck created what appeared to be a storage bottleneck, because FusionIO uses server CPU cycles and there weren't any available.  I didn't investigate further, but I suspect tweaking the nice/CPU priority settings for the driver's process would allow the FusionIO to perform better and report more accurately.  At any rate, the Dell 610 with 2X4 (2 sockets, 4 cores each) couldn't push the FusionIO card to its limits with Swingbench, because it didn't have the CPU cycles needed to make storage the bottleneck.


To deal with this CPU limitation, Cisco was kind enough to let us borrow a UC460, which has the awesome UCS technology and 4 sockets of 8-core processors.  I only had 2 sockets populated, which still more than doubled my compute power, giving me 16 Nehalem-EX cores and 32 threads (a huge upgrade from 8 Nehalem-EP cores).  I installed the FusionIO card and retested.  With everything in the database on FusionIO and my extremely low db_cache forcing physical IO instead of logical IO, it took 12.37 Nehalem-EX threads to push the FusionIO card to 100% utilization.  When I first did the test with SSD on the Dell, using a normal db_cache, I was only able to get 167k TPM.  Here, I got 170k.  This means I was able to get more speed with purely physical IOs than I got from physical+logical IOs on the SSDs.





Swingbench on 11gR2 / topas-atop
Max TPM   Max TPS   Avg TPS   Avg Resp   Disk Busy %   CPU%     Notes
170036    123438    2749      48         0/100         12.37    Everything (including undo) on flash




This made me wonder...if I didn't hold back the db cache, what could the Cisco UC460 do with FusionIO?  In a normal db configuration, how would it perform?  The answer:



Swingbench on 11gR2 / topas-atop
Max TPM   Max TPS   Avg TPS   Avg Resp   Disk Busy %   CPU%     Notes
440489    7599      6888      8          0/60          31.41    Everything on flash, with normal memory settings




It took almost 32 threads, and now I'm once again almost 100% CPU utilized.  At this point, with 32 threads on a high-performance Cisco UC460, the FusionIO is only 60% utilized.  This is the fastest server I have to test with...there's never an 8X10 laying around when you need one.... :)  That's ok, I have enough information to extrapolate.  If 32 threads can drive this thing to 60% utilization, I can calculate this:



Swingbench on 11gR2 / topas-atop
Max TPM   Max TPS   Avg TPS   Avg Resp   Disk Busy %   CPU     Notes
734148    12665     11480     -          0/100         53      Extrapolating (if we had the CPU to make FIO the bottleneck)


Unless some new bottleneck is introduced, a database on FusionIO (with the 53 Nehalem-EX threads to drive it) would be the fastest database I've ever seen, according to Swingbench testing.  734k TPM...I'm not talking IOPS...I'm saying achieved transactions, with each one having many IOs.  To put things in perspective, that's 13.6X faster than the SSD.  There are 6 PCI-E slots on this server...so we could easily see the bottleneck move back to CPU, even with 4 sockets and 64 threads.


To be fair though, the SSD has a lot of things outside the control of the manufacturer...it has to go through SAN hardware and HBAs.  The FusionIO card has direct access to the PCI-E bus...which isn't really a bus, if you get into the details of PCI-E.  That's one of the reasons why, as the FusionIO card gets busier and busier, the response time continues to stay low.  In my testing, the worst response time I saw was .13 milliseconds (there's a . before the 13...that's 130 microseconds...not quite RAM speed, but closer to RAM response time than spindle response time), and that was when the UC460 was completely pegged (which, alone, is a difficult thing to accomplish).


Summary:
FusionIO is ridiculously fast and, from an IOPS/$ perspective, extremely cheap.  They captured a huge percentage of the flash market in the first year of their public offering, and it's easy to see why.  In the techy world we scratch and claw for any technology that can bring a few percentage points of performance improvement to our databases.  It's rare that we see something that so completely transforms the environment...where we can shift bottlenecks and get 10X+ performance.  Think about what this means...today you likely have CPU monitoring on your servers to make sure a process doesn't spin and take up CPU needlessly...that's going to have to go out the window or become more intelligent, because the CPU will be expected to be the bottleneck.  We'll need to use Oracle Resource Manager more than ever.


Downside: FusionIO is local storage (for now).  Although the cost per IOP is low, the cost relative to cheaper SAN storage is very high from a $/GB standpoint, which is the traditional way storage administrators approach storage.  It takes a strong database engineer to convince management to approach the issue from an IOPS requirement rather than a GB requirement.  It's a paradigm shift that needs to be made to reach ultimate performance goals.  Kaminario has addressed the local storage issue well, from what I've heard.  Hopefully I'll take a look at that within a few months.


So...flash technology and especially FusionIO can be a game changer...but how can you configure it to be useful while most efficiently using your storage budget for databases?  I'll look at that next...


Flash with Databases - Part 1
Flash with Databases - Part 2
Flash with Databases - Part 3
Flash with Databases Inexpensively