Monday, December 19, 2011

Flash with Databases-Part 3

In my previous posts on this topic, I talked about flash technology of SSD's and compared it to the performance of a FusionIO ioDrive card.  I also talked about how, in order to justify the cost of high performance storage, especially FusionIO technology, you have to transform the storage discussion from "How many GB of storage do you need?" to "Besides how much storage, what are your IOPS requirements?"  I'm telling you will get resistance from storage administrators.  Try to point out the extreme case...that you wouldn't run the enterprise database on slow sata by bringing up FusionIO, you're just talking about a different, faster storage tier...specialized for databases.  They'll point out that the storage isn't 100% busy getting faster storage won't help.  This is a fallacy...well, I should say...this might be a fallacy.  Usually % busy is talking about either average throughput/max throughput or percent of time waiting on storage.  Response time (which is key to designing a fast database) is often hidden from those reports.  This means your storage is often your just aren't measuring the right you can't see it in your reports.  When he shows you your slow database is only 50% busy on storage and your AWR reports are complaining about your storage performance...What is the bottleneck then?  If your CPU utilization is low, you don't have excessive waits on locks and your AWR report is complaining about storage...I'll bet its your storage. do you design for necessary IOPS?  This is actually a more difficult question that you might think.  What I did a few months ago for a client to consolidate 22 databases into one was...add up each of the requirements of the 22 databases (found by dba_hist_iostat_filetype), joining by the time of their snaps, and then I could find the requirements of the peak times at the storage level.  This was interesting because the peaks and lulls for the performance requirements didn't coincide when we would have thought.

For a single database, its much easier.  I'm talking about OLTP databases here...for DSS/warehouse databases...look at throughput, not IOPS.  As it relates to configuring for FusionIO, this query below will get you started.  DBA_IOSTAT has the number of physical IO's by file type and increments since the last bounce.  It will also reset when you bounce your database, so you have to filter out negative results.  What I did below is take the number of IO's between two snaps in sequence, then found the number of seconds in the duration of that snap, and divided to get the IOPS.  If you'll looking at a DSS database, you can look at the other columns in that view to find throughput per second.

select * from (
with snaps as (
select snap_id, sum(small_read_reqs+large_read_reqs) reads, sum(small_write_reqs+large_write_reqs) writes
group by snap_id
my_snaps as
(select snap_id, instance_number, begin_interval_time, end_interval_time,
 extract(second from (end_interval_time-begin_interval_time))+
 (extract(minute from (end_interval_time-begin_interval_time))*60)+
 (extract(hour from (end_interval_time-begin_interval_time))*60*60) seconds
 from dba_hist_snapshot)
select s1.snap_id snap_1, s2.snap_id snap_2, to_char(ms.begin_interval_time,'MM/DD/YYYY') sample_day, sum(s2.reads-s1.reads) reads, sum(s2.writes-s1.writes) writes, 
  trunc(sum(s2.reads-s1.reads)/sum(seconds)) rps, trunc(sum(s2.writes-s1.writes)/sum(seconds)) wps
from snaps s1, snaps s2, my_snaps ms
where s1.snap_id=ms.snap_id
  and s1.snap_id=(s2.snap_id-1)
  and (s2.reads-s1.reads)>1
group by s2.snap_id, to_char(ms.begin_interval_time,'MM/DD/YYYY'), s1.snap_id 
order by 7 desc 
) where rownum<11;

The output will look something like this:


664318 6492660 184 1801

521170 6356511 144 1763

2242398 1930744 1984 1708

1836943 5757317 509 1597

68419 9689 11201 1586

7373341 2418357 4794 1572

631855 5669473 175 1571

5218554 5272317 1446 1461

1059836 4884479 293 1354

5422603 4726101 1504 1310

631855 5669473 175 1571

5218554 5272317 1446 1461

1059836 4884479 293 1354

5422603 4726101 1504 1310

For configuring your database to use flash from SSD/FusionIO storage, there are 4 ways to do it that come to mind (there might be more...if you know of a better way than these, let me know)...each with their own pros and cons.  The problem we're trying to deal with here is that an SSD or FusionIO is a single point of failure:

1. Use Oracle's DB Cache Flash.  This can be thought of as a level 2 extension of the db_cache.  As a block ages out of the db_cache in RAM, it goes into the db flash cache. 

Pros: Easy to use, designed for flash.  Since it moves "hot" data at the (usually) 8k block level, its *much* more efficient than most SANs that move hot data at a page level.  Depending on the SAN...this can be anything from a few megs to 256MB+.  Usually the SAN technology moves the data on some sort of daily, it will look to see what 42MB sections were hot, and move them to flash.  If your activity is consistent, that's fine, if not, you probably won't be helped by this.  So, obviously moving hot blocks in real time is much more efficient.

Cons: There's a "warm up" time for this strategy...the db_flash_cache is cleared when there's a db bounce, and blocks need to come back to the cache over time.  Also, you have to use OEL for this feature to work.  From what I've been told by smart people at Oracle, this isn't a technical requirement, its built in to the database logic.  Its a marketing ploy to encourage people to migrate to OEL.  I'm the biggest cheerleader I know for companies to at least take a look at Oracle Enterprise has a lot of merits that mostly come down to support cost savings.  If companies choose to go with OEL, it should be because its the best choice for them...not because Oracle forced them to do it by disabling features in their database if they don't.  This is the same argument I made for enabling HCC on non-Sun storage.  What happens if the FusionIO card or SSD you're using for db_flash_cache crashes?  I know the data isn't primarily stored there...its also on your normal storage, so your data isn't in jeopardy...but will it crash your database?  If so, you need 2 FusionIO cards to mirror the db flash cache, which doubles your FusionIO storage cost.

2. Use Dataguard.  You can set up a 2nd server and have the redo applied in real-time to it, so if there's a failure on the local FusionIO storage, you just fail over to the new server.  Both servers would have a single FusionIO card.

Pros: Provides redundancy for all non-logical problems, including server hardware.

Cons: This doubles your server hardware costs, FusionIO storage cost and might more than double your Oracle licensing fees.

3. Use Oracle's ASM to mirror the storage across two FusionIO cards with normal redundancy.

Pros: Easy to do
Cons: Doubles $ per GB costs of storage

4. Use Oracle ASM's extended clustering option and set up FusionIO as the preferred read failgroup, and use cheaper SAN storage for the 2nd failgroup in normal redundacy.  This is a topic for a whole new post.

Pros:  Provided redundancy between FusionIO and existing legacy storage, but send no read IO's to that legacy storage, making it able to keep up with write IO's much faster.  Usually, reads are over 50% of OLTP workload...sometimes 3-4X the writes.  You can use the query above to find out if that's true with your workload.  This means, when paired with FusionIO, your existing storage will perform many X faster than it is today because its only getting a fraction of the workload.  Its also HA: if there's a failure in your legacy storage or your FusionIO storage, the database keeps running while you correct the issue.  11gR2 has a new feature to increase the sync speed of fail groups after the problem is tracks changes being made while one of the failgroups is down.  When it comes back, it only has to apply those changes, rather than copying all the data.  Your reads (and therefore, the majority of your database workload) are as fast as can be provided by FusionIO (very, very fast).

Cons: Writes don't perform as well as a pure FusionIO solution, but writes will perform faster than your existing storage, because they won't get the read workload.

When you consider the cost/benefits...#4 is an excellent solution.  #3 is better, if you have the budget.  If you're doing RAC, I think #1 is your only option, if you want to use FusionIO.  Remember if you doing RAC, that the db flash cache in #1 is local to the node.  Normal L1 db_cache will go across the interconnect to other nodes, but L2 db_flash_cache does not.

For comparison's sake, the output above from the query is from a fairly fast database with 40+ cores on a fast EMC Clariion CX-4 with 15k FC storage.  This is NOT to say that's the performance limit of the CX-4, its only to say that this particular database could perform as well on FusionIO. Here's the numbers from swingbench on a single Fusion IO with the UC460 server I used in my previous post:

21 12/15/2011 20 8297280 10464273 3843 4846

...this means FusionIO is a viable alternative to a large, expensive SAN storage for this particular OLTP database, if you can deal with its "local storage" limitations.  This db could perform 2.5X or faster on FusionIO...the size of it would make me want to get 2 cards...which would give us many times the potential IOPS it can get today for the SAN, for just a few thousand dollars.

All this so far was about OLTP...let's talk for a second about DSS databases.  I've seen 1900MB/sec throughput on a Clariion CX-4 that was short stroked with many 15k FC drives on an IBM P5 595...I'm sure there are better results out there, but from a database, that's the best I've seen on a CX-4.  It would likely take 2 FusionIO cards like I tested to match that throughput (based on specs), but there are other, bigger, badder FusionIO cards that could blow that performance away with speeds up to 6.7GB/sec/card, while still getting the microsecond response times.  Assuming you have several PCI-E slots in your large server, you could use many of these things together.  Unless you have a monstrous server, with this storage CPU is the new bottleneck, not storage.

Summary:  Ignoring products like Kaminario K2, FusionIO has a niche-it can't work for all workloads and environments because it has to be locally installed in a server's PCI-E slot.  There are a lot of ways to use it with databases...make sure you recognize the fact that its a single point of failure and protect your data.  For OLTP databases that are a few TB, I can't imagine anything faster.  FusionIO ioDrive has become the defacto standard to run SAP's HANA database, and more and more Sql Server databases are using it to break performance thresholds too.  List price for the entry level card I tested (384GB/MLC) is around $7k, but the cost for the larger cards isn't linear...the 2.4TB cards are cheaper per GB than the small card I tested.  The FusionIO Octal cards are 10.24TB in size.  If you have a few thousand to spend, you should check it out.  For that matter, if you have a few thousand laying around...take a look at their stock, FIO.  Its a small company with a great product that's taking huge market share.  I've been told they're currently the flash market leader, after only 1 year.  I wouldn't be surprised if they get bought out by one of the storage giants (Dell/EMC) soon.

I've just heard that FusionIO has added a former HP hotshot to its board...I guess we now know who's going to buy them out. :)

Also, I came across this interesting blog talking about FusionIO with MySql.

I realize this is starting to sound like a sales pitch for FusionIO...I have no ties to them at all...but as a geek, I get excited when I see technology that can boost database performance to this degree.  People are afraid of flash technology because very early products didn't have the longevity or stability that was offered by enterprise class FC disks.  This has hugely improved over the years.  The FusionIO cards are expected to last ~7 years, and have a 5 year warranty.  

No comments:

Post a Comment