Thursday, December 22, 2011

Databases on Flash done Inexpensively

If you're looking at putting your database (SAP HANA, SQL Server, MySQL, Oracle...whatever) on flash, you should *really* take a look at FusionIO.  FusionIO is how Facebook is able to drive its MySQL databases so fast.  The ioDrive card is a single point of failure for your storage, so you need some way to protect it.  Oracle ASM's normal redundancy is effectively software mirroring...and it's very simple to use and set up.  If you have the budget, you can get truly extreme performance by using ASM's normal redundancy to mirror two ioDrives, but this method of protection would effectively double the cost of your FusionIO purchase.

If you need more performance than you currently have, and you can't afford the cost of the storage it would take to put your entire database on flash, there's a different way to get it done, while still protecting the storage from a single point of hardware failure.

In my previous post about databases on flash, I talked about "option #4", which is: set up ASM as if it's an extended cluster (when it's not) and set it to use the fast, relatively expensive (per GB) FusionIO ioDrive storage as the preferred read failgroup.  The other failgroup is your existing SAN storage.  There are advantages to doing this instead of just getting faster storage on your SAN.  You're more HA than before, because you're protected from a storage failure on either the SAN or the ioDrive.  If the SAN storage is unavailable, the database will stay up and running, tracking all the changes that are made since the SAN failed.  Depending on how long the outage lasts, and how you configured the diskgroup, when the SAN comes back, it'll apply those changes and after a while the SAN storage will again be in sync with the ioDrive.  If the ioDrive goes down, the database will continue to run completely on SAN storage until the ioDrive is back, at which point things can be synced up again.

I wanted to quantify how this performs in a little more detail with a few tests.  Using the UC460 from the previous DB on Flash tests, I added an HBA, set up some mirrored SSD's on a SAN with 2 partitions, configured an ioDrive with 2 partitions and then I created 3 databases: one purely on the SSD's, one purely on FusionIO, and one that uses normal redundancy with the preferred reads coming from FusionIO.  All the databases are set up exactly the same, with only 256M of db_cache each.

Here is atop reporting IO while Swingbench is running on a database that's completely on the SSDs (sdb is the SSD presented from the SAN).

Test 1:

This is the same idea, but this time no SSD...with purely the ioDrive (fioa).

Test 2:

To set the preferred read fail group, you simply need to do this:

alter system set asm_preferred_read_failure_groups = 'diskgroup_name.failgroup_name';

This is the combination of the two, letting ASM distribute the write load in normal redundancy, with reads coming from the FusionIO card.

Test 3:

You can see:

1. All reads (yes, except for those 4 stray IO's) are coming from fioa, and the writes are distributed pretty much equally (both in IO's and MB/s) between fioa and sdb.

2. Atop is showing that even though all reads are coming from fioa and it's doing the same amount of writes as the SSDs on the SAN, it's still easily able to keep up with the workload...it's being throttled by the slower sdb storage.  One more time I have to point out...the ioDrive is sooo fast.  Incidentally, this speed is from the slowest, cheapest ioDrive made...the other models are much faster.

3. The Swingbench workload forces a fixed ratio of reads to writes, so the potential exists for more than the 200% read performance shown above.  What you would see if you logged in to the database while the test is running is that reads are lightning fast, and writes are 30% faster than they've ever been before on the legacy storage.  In this configuration the legacy storage is only required to do writes and no reads, so it's able to do the writes faster than before (60,596 vs 46,466 IO's and 57.19 vs 43.92 MB/s).  All this performance boost (200%+ reads, 130% writes), and we now have HA storage at half the cost of moving the database completely to flash.
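To make the "30% faster writes" claim concrete, here is a small Python sketch of the arithmetic.  The four numbers are the sdb write figures quoted above; the variable and function names are mine, just for illustration.

```python
# Write figures for the SAN-side device (sdb), as quoted in the text above.
ssd_only_write_ios = 46_466      # sdb serving both reads and writes
mirrored_write_ios = 60_596      # sdb serving writes only (reads go to fioa)
ssd_only_write_mbps = 43.92
mirrored_write_mbps = 57.19


def percent_gain(before, after):
    """Return how much larger `after` is than `before`, as a percentage."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    io_gain = percent_gain(ssd_only_write_ios, mirrored_write_ios)
    mb_gain = percent_gain(ssd_only_write_mbps, mirrored_write_mbps)
    print(f"write IO gain:   {io_gain:.1f}%")
    print(f"write MB/s gain: {mb_gain:.1f}%")
```

Both measures come out at roughly 30%, which is where the "130% writes" figure comes from.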

In the real world, your legacy storage wouldn't be a mirrored SSD on a SAN, it would likely be much faster FC storage in a RAID array.  This ioDrive could be a failgroup to SAN storage 3 times faster than the SSD before you'd get diminishing returns, at which point you could just add a 2nd ioDrive.  Still, I think the approach and the results will be the same.  Unless you have a legacy SAN solution that's faster than a few of these can do together, there are definite performance gains to be had.

Monday, December 19, 2011

Flash with Databases-Part 3

In my previous posts on this topic, I talked about flash technology of SSD's and compared it to the performance of a FusionIO ioDrive card.  I also talked about how, in order to justify the cost of high performance storage, especially FusionIO technology, you have to transform the storage discussion from "How many GB of storage do you need?" to "Besides how much storage, what are your IOPS requirements?"  I'm telling you, you will get resistance from storage administrators.  Try to point out the extreme case...that you wouldn't run the enterprise database on slow SATA...so by bringing up FusionIO, you're just talking about a different, faster storage tier...specialized for databases.  They'll point out that the storage isn't 100% busy, so getting faster storage won't help.  This is a fallacy...well, I should say...this might be a fallacy.  Usually % busy is talking about either average throughput/max throughput or percent of time waiting on storage.  Response time (which is key to designing a fast database) is often hidden from those reports.  This means your storage is often your bottleneck...you just aren't measuring the right thing, so you can't see it in your reports.  When the storage admin shows you your slow database is only 50% busy on storage and your AWR reports are complaining about your storage performance...what is the bottleneck then?  If your CPU utilization is low, you don't have excessive waits on locks and your AWR report is complaining about storage...I'll bet it's your storage.

So how do you design for necessary IOPS?  This is actually a more difficult question than you might think.  What I did a few months ago for a client to consolidate 22 databases into one was...add up each of the requirements of the 22 databases (found by dba_hist_iostat_filetype), joining by the time of their snaps, and then I could find the requirements of the peak times at the storage level.  This was interesting because the peaks and lulls for the performance requirements didn't coincide when we would have thought.
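Here's a rough Python sketch of that consolidation idea: sum the IOPS of every database per snap time, then look for the combined peak.  The database names, times and numbers are all made up for illustration...the real inputs would come from each database's AWR history.

```python
from collections import defaultdict

# Hypothetical per-database IOPS keyed by snapshot time (illustrative only).
sample_requirements = {
    "db_a": {"02:00": 100, "09:00": 1200, "13:00": 300},
    "db_b": {"02:00": 1500, "09:00": 200, "13:00": 900},
    "db_c": {"02:00": 600, "09:00": 100, "13:00": 400},
}


def combined_peak(per_db_iops):
    """Sum IOPS across databases per snap time; return (time, total) of the busiest."""
    totals = defaultdict(int)
    for samples in per_db_iops.values():
        for snap_time, iops in samples.items():
            totals[snap_time] += iops
    return max(totals.items(), key=lambda item: item[1])


def sum_of_individual_peaks(per_db_iops):
    """Worst case if every database hit its own peak at the same moment."""
    return sum(max(samples.values()) for samples in per_db_iops.values())


if __name__ == "__main__":
    print("combined peak:", combined_peak(sample_requirements))
    print("sum of individual peaks:", sum_of_individual_peaks(sample_requirements))
```

In this made-up data the combined peak lands at a time when only one of the three databases is at its own peak, and it's well below the sum of the individual peaks.  That's the reason to line up the snaps by time instead of just adding up each database's worst case.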

For a single database, it's much easier.  I'm talking about OLTP databases here...for DSS/warehouse databases...look at throughput, not IOPS.  As it relates to configuring for FusionIO, this query below will get you started.  DBA_HIST_IOSTAT_FILETYPE has the number of physical IO's by file type, and the counts increment from the last bounce.  They reset when you bounce your database, so you have to filter out negative results.  What I did below is take the number of IO's between two snaps in sequence, then found the number of seconds in the duration of that snap, and divided to get the IOPS.  If you're looking at a DSS database, you can look at the other columns in that view to find throughput per second.

select * from (
with snaps as (
select snap_id, sum(small_read_reqs+large_read_reqs) reads, sum(small_write_reqs+large_write_reqs) writes
from dba_hist_iostat_filetype
group by snap_id),
my_snaps as
(select snap_id, instance_number, begin_interval_time, end_interval_time,
 extract(second from (end_interval_time-begin_interval_time))+
 (extract(minute from (end_interval_time-begin_interval_time))*60)+
 (extract(hour from (end_interval_time-begin_interval_time))*60*60) seconds
 from dba_hist_snapshot)
select s1.snap_id snap_1, s2.snap_id snap_2, to_char(ms.begin_interval_time,'MM/DD/YYYY') sample_day, sum(s2.reads-s1.reads) reads, sum(s2.writes-s1.writes) writes, 
  trunc(sum(s2.reads-s1.reads)/sum(seconds)) rps, trunc(sum(s2.writes-s1.writes)/sum(seconds)) wps
from snaps s1, snaps s2, my_snaps ms
where s1.snap_id=ms.snap_id
  and s1.snap_id=(s2.snap_id-1)
  and (s2.reads-s1.reads)>1
group by s2.snap_id, to_char(ms.begin_interval_time,'MM/DD/YYYY'), s1.snap_id 
order by 7 desc 
) where rownum<11;

The output will look something like this:


READS     WRITES    RPS     WPS
664318    6492660   184     1801
521170    6356511   144     1763
2242398   1930744   1984    1708
1836943   5757317   509     1597
68419     9689      11201   1586
7373341   2418357   4794    1572
631855    5669473   175     1571
5218554   5272317   1446    1461
1059836   4884479   293     1354
5422603   4726101   1504    1310
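If the SQL is hard to follow, here is the same logic as a Python sketch: take cumulative counters from consecutive snaps, throw away intervals where the counters went backwards (a bounce), divide by the interval length, and keep the top 10 by writes per second.  The sample snaps are invented for illustration and the function name is mine.

```python
# Hypothetical snaps: (snap_id, cumulative_reads, cumulative_writes, interval_seconds).
# Snap 12 simulates a database bounce, where the counters start over.
sample_snaps = [
    (10, 0, 0, 3600),
    (11, 3_600_000, 7_200_000, 3600),
    (12, 100, 50, 3600),
    (13, 1_800_100, 360_050, 3600),
]


def iops_between_snaps(snaps, top_n=10):
    """Return (snap_1, snap_2, reads_per_sec, writes_per_sec), busiest writes first."""
    rows = []
    for (id1, r1, w1, secs), (id2, r2, w2, _) in zip(snaps, snaps[1:]):
        delta_reads = r2 - r1
        delta_writes = w2 - w1
        # A bounce resets the counters and gives a negative delta; skip it,
        # mirroring the "(s2.reads-s1.reads)>1" filter in the query.
        if delta_reads <= 1 or secs <= 0:
            continue
        # int() truncates toward zero, like trunc() in the query.
        rows.append((id1, id2, int(delta_reads / secs), int(delta_writes / secs)))
    # Same idea as "order by 7 desc ... where rownum<11".
    return sorted(rows, key=lambda row: row[3], reverse=True)[:top_n]


if __name__ == "__main__":
    for row in iops_between_snaps(sample_snaps):
        print(row)
```

The interval from snap 11 to snap 12 drops out because of the simulated bounce, which is exactly what the negative-result filter is for.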

For configuring your database to use flash from SSD/FusionIO storage, there are 4 ways to do it that come to mind (there might be more...if you know of a better way than these, let me know)...each with their own pros and cons.  The problem we're trying to deal with here is that an SSD or FusionIO is a single point of failure:

1. Use Oracle's DB Cache Flash.  This can be thought of as a level 2 extension of the db_cache.  As a block ages out of the db_cache in RAM, it goes into the db flash cache. 

Pros: Easy to use, designed for flash.  Since it moves "hot" data at the (usually) 8k block level, it's *much* more efficient than most SANs that move hot data at a page level.  Depending on the SAN...this can be anything from a few megs to 256MB+.  Usually the SAN technology moves the data on some sort of schedule...daily, for example, it will look to see what 42MB sections were hot, and move them to flash.  If your activity is consistent, that's fine, if not, you probably won't be helped by this.  So, obviously moving hot blocks in real time is much more efficient.

Cons: There's a "warm up" time for this strategy...the db_flash_cache is cleared when there's a db bounce, and blocks need to come back to the cache over time.  Also, you have to use OEL for this feature to work.  From what I've been told by smart people at Oracle, this isn't a technical requirement, it's built in to the database logic.  It's a marketing ploy to encourage people to migrate to OEL.  I'm the biggest cheerleader I know for companies to at least take a look at Oracle Enterprise Linux...it has a lot of merits that mostly come down to support cost savings.  If companies choose to go with OEL, it should be because it's the best choice for them...not because Oracle forced them to do it by disabling features in their database if they don't.  This is the same argument I made for enabling HCC on non-Sun storage.  What happens if the FusionIO card or SSD you're using for db_flash_cache crashes?  I know the data isn't primarily stored there...it's also on your normal storage, so your data isn't in jeopardy...but will it crash your database?  If so, you need 2 FusionIO cards to mirror the db flash cache, which doubles your FusionIO storage cost.

2. Use Dataguard.  You can set up a 2nd server and have the redo applied in real-time to it, so if there's a failure on the local FusionIO storage, you just fail over to the new server.  Both servers would have a single FusionIO card.

Pros: Provides redundancy for all non-logical problems, including server hardware.

Cons: This doubles your server hardware costs, FusionIO storage cost and might more than double your Oracle licensing fees.

3. Use Oracle's ASM to mirror the storage across two FusionIO cards with normal redundancy.

Pros: Easy to do
Cons: Doubles $ per GB costs of storage

4. Use Oracle ASM's extended clustering option and set up FusionIO as the preferred read failgroup, and use cheaper SAN storage for the 2nd failgroup in normal redundancy.  This is a topic for a whole new post.

Pros:  Provides redundancy between FusionIO and existing legacy storage, but sends no read IO's to that legacy storage, making it able to keep up with write IO's much faster.  Usually, reads are over 50% of OLTP workload...sometimes 3-4X the writes.  You can use the query above to find out if that's true with your workload.  This means, when paired with FusionIO, your existing storage will perform many X faster than it does today because it's only getting a fraction of the workload.  It's also HA: if there's a failure in your legacy storage or your FusionIO storage, the database keeps running while you correct the issue.  11gR2 has a new feature to increase the sync speed of fail groups after the problem is corrected...it tracks changes being made while one of the failgroups is down.  When it comes back, it only has to apply those changes, rather than copying all the data.  Your reads (and therefore, the majority of your database workload) are as fast as can be provided by FusionIO (very, very fast).

Cons: Writes don't perform as well as a pure FusionIO solution, but writes will perform faster than your existing storage, because they won't get the read workload.
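To put a rough number on "only getting a fraction of the workload" in option #4, here's a minimal Python sketch.  It assumes every read is served by the preferred (flash) failgroup and every write still goes to both failgroups; the read/write rates are hypothetical, using the 3X and 4X ratios mentioned above.

```python
def legacy_load_fraction(reads_per_sec, writes_per_sec):
    """Fraction of the original IO the legacy failgroup still services.

    Assumes all reads go to the preferred read failgroup and all writes
    are mirrored to both failgroups (normal redundancy).
    """
    total = reads_per_sec + writes_per_sec
    if total == 0:
        return 0.0
    return writes_per_sec / total


if __name__ == "__main__":
    # Hypothetical OLTP workloads with reads at 3X and 4X the writes.
    for reads, writes in [(3000, 1000), (4000, 1000)]:
        share = legacy_load_fraction(reads, writes)
        print(f"reads={reads} writes={writes}: legacy storage sees {share:.0%} of the IO")
```

Plug in the rps/wps numbers from the query above for your own database...if it turns out to be write-heavy, the legacy storage keeps most of its load and this option buys you less.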

When you consider the cost/benefits...#4 is an excellent solution.  #3 is better, if you have the budget.  If you're doing RAC, I think #1 is your only option, if you want to use FusionIO.  Remember if you're doing RAC, that the db flash cache in #1 is local to the node.  Normal L1 db_cache will go across the interconnect to other nodes, but L2 db_flash_cache does not.

For comparison's sake, the output above from the query is from a fairly fast database with 40+ cores on a fast EMC Clariion CX-4 with 15k FC storage.  This is NOT to say that's the performance limit of the CX-4, it's only to say that this particular database could perform as well on FusionIO. Here are the numbers from Swingbench on a single FusionIO card with the UC460 server I used in my previous post:

21 12/15/2011 20 8297280 10464273 3843 4846

...this means FusionIO is a viable alternative to a large, expensive SAN storage for this particular OLTP database, if you can deal with its "local storage" limitations.  This db could perform 2.5X or faster on FusionIO...the size of it would make me want to get 2 cards...which would give us many times the potential IOPS it can get today for the SAN, for just a few thousand dollars.

All this so far was about OLTP...let's talk for a second about DSS databases.  I've seen 1900MB/sec throughput on a Clariion CX-4 that was short stroked with many 15k FC drives on an IBM P5 595...I'm sure there are better results out there, but from a database, that's the best I've seen on a CX-4.  It would likely take 2 FusionIO cards like I tested to match that throughput (based on specs), but there are other, bigger, badder FusionIO cards that could blow that performance away with speeds up to 6.7GB/sec/card, while still getting the microsecond response times.  Assuming you have several PCI-E slots in your large server, you could use many of these things together.  Unless you have a monstrous server, with this storage CPU is the new bottleneck, not storage.

Summary:  Ignoring products like Kaminario K2, FusionIO has a niche...it can't work for all workloads and environments because it has to be locally installed in a server's PCI-E slot.  There are a lot of ways to use it with databases...make sure you recognize the fact that it's a single point of failure and protect your data.  For OLTP databases that are a few TB, I can't imagine anything faster.  FusionIO ioDrive has become the de facto standard to run SAP's HANA database, and more and more SQL Server databases are using it to break performance thresholds too.  List price for the entry level card I tested (384GB/MLC) is around $7k, but the cost for the larger cards isn't linear...the 2.4TB cards are cheaper per GB than the small card I tested.  The FusionIO Octal cards are 10.24TB in size.  If you have a few thousand to spend, you should check it out.  For that matter, if you have a few thousand laying around...take a look at their stock, FIO.  It's a small company with a great product that's taking huge market share.  I've been told they're currently the flash market leader, after only 1 year.  I wouldn't be surprised if they get bought out by one of the storage giants (Dell/EMC) soon.

I've just heard that FusionIO has added a former HP hotshot to its board...I guess we now know who's going to buy them out. :)

Also, I came across this interesting blog talking about FusionIO with MySQL.

I realize this is starting to sound like a sales pitch for FusionIO...I have no ties to them at all...but as a geek, I get excited when I see technology that can boost database performance to this degree.  People are afraid of flash technology because very early products didn't have the longevity or stability that was offered by enterprise class FC disks.  This has hugely improved over the years.  The FusionIO cards are expected to last ~7 years, and have a 5 year warranty.  

Flash with Databases-Part 2

In my previous post on Flash with databases, I talked about the upcoming FusionIO tests.

Here's the hardware configuration overview:
Server1          : Dell 610 2X4 (8 cores) 
Server2          : Cisco UC460 4X8 (32 threads, 16 cores)
SSD               : 2 mirrored SSD disks
HBAs            : 2-4GB/s
FusionIO        : IODrive 

First, let's look at the SSD to establish a baseline.  The configuration is 2 SSD disks, mirrored over 2, 4GB HBA's.  The tests were done multiple times, so these results are averages.  For all the Swingbench tests after the first, I reduced the db_cache to just 256MB to reduce the IO to be nearly all physical.  I want to stress the storage, not the cache.  A good storage test must have the storage be the bottleneck.  It might be interesting to compare these results to the Hitachi USP-V tests from last year.  The testing was very similar, but the results were opposites of each other, due to the extreme response time differences.

ORION                    Swingbench on 11gR2                    topas/atop
IOPS   MBPS    Latency   Max TPM  Max TPS  Avg TPS  Avg Resp    Disk Busy %  CPU%
8303   100.98  0.52      164393   3489     2197     62          100          1.61
                         54674    1100     872      156         100          1.1

IMHO, these are very nice results...this storage would support a very fast database.  Obviously, more SSD's would mean more throughput (if we have enough HBA's.)

The SSD tests above were done on a Dell 610, dual-socket, 8 core server.  As you can see from the green Disk Busy and CPU columns, atop was reporting 100% disk utilization and 1.1 core utilized (110% cpu used.)

When I first started the tests using FusionIO, I could see a huge speed difference right off.  Here are the Orion results:

IOPS   MBPS    Latency
33439  750.5   0.08

Compared to the SSD, that's 4X the IOPS, 7.5X the throughput and 6.5X faster response time!  The response time is measured in milliseconds...so what Orion is saying is that the response time is .08 milliseconds...which is 80 MICRO seconds. My baseline is very fast SSD...to give you some perspective, it wouldn't be unusual to see normal FC storage on a SAN show 10-15 ms latency.
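For anyone who wants to check the comparison, here's a small Python sketch of the ratios, using only the two Orion result rows shown in this post.  The dictionary keys and function name are mine.

```python
# Orion results from the two tests above.
ssd = {"iops": 8303, "mbps": 100.98, "latency_ms": 0.52}
fusionio = {"iops": 33439, "mbps": 750.5, "latency_ms": 0.08}


def orion_ratios(baseline, candidate):
    """Return how many times better `candidate` is than `baseline` on each metric."""
    return {
        "iops": candidate["iops"] / baseline["iops"],
        "mbps": candidate["mbps"] / baseline["mbps"],
        # Lower latency is better, so invert the ratio.
        "latency": baseline["latency_ms"] / candidate["latency_ms"],
    }


if __name__ == "__main__":
    for metric, ratio in orion_ratios(ssd, fusionio).items():
        print(f"{metric}: {ratio:.1f}X")
    print(f"FusionIO latency: {fusionio['latency_ms'] * 1000:.0f} microseconds")
```

The ratios land at roughly 4X, a bit under 7.5X, and 6.5X, in line with the numbers quoted above.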

Above, I mention that in order to test storage, the storage has to be the bottleneck in the test...but I had a problem when I did the Swingbench test.  No matter what I did, I couldn't cause the storage to be the bottleneck after I moved the indexes, data and redo to flash.  At first I thought there was something wrong and the FusionIO driver was eating up the CPU...but I didn't see the problem in the Orion tests on FusionIO, so that didn't make sense. 

Swingbench on 11gR2                     topas/atop
Max TPM  Max TPS  Avg TPS  Avg Resp     Disk Busy %  CPU%    Notes
20726    478      261      597          100/6        6.64    Indexes on Flash; first time using the FusionIO, CPU is very high
189132   3522     2897     47           100/63       4.92    Indexes and Data on Flash
162165   2868     2638     42           97/100       8       Indexes, Data and Redo on Flash; all CPU cores are completely pegged

The problem wasn't the driver...the problem was I was able to generate more load than I had ever done before, because the bottleneck had moved from storage to CPU...even with my 256MB db_cache! 

Notice too that atop was reporting FusionIO was 100% utilized...strange things happen to monitoring when CPU is pegged.  For the Disk Busy% field, the first number is the SAN storage, the second number is % busy reported by atop for the FusionIO card.  This 100% busy is accurate from a certain perspective.  The FusionIO card driver uses the CPU from the server to access the storage, and it couldn't get any CPU cycles to the driver to access the storage.  It appears to atop that it's waiting for the storage, and so the storage was reported as 100% busy.  That isn't to say the FusionIO card was maxed out...just that you couldn't access the data any faster...which isn't exactly the same thing in this case.  The CPU bottleneck created what appeared to be a storage bottleneck because FusionIO uses the server CPU cycles, and there weren't any available.  I didn't investigate further, but I would suspect a tweaking of the nice CPU priority settings for the driver's process would allow the FusionIO to perform better and report more accurately.  At any rate, the Dell 610 with 2X4 (2 sockets, 4 cores each) couldn't push the FusionIO card to its limits with Swingbench because it didn't have the CPU cycles needed to make storage the bottleneck.

To deal with this CPU limitation, Cisco was kind enough to let us borrow a UC460, which has the awesome UCS technology and 4 sockets with 8 core processors each.  I only had 2 sockets populated, which more than doubled my compute power, giving me 16 Nehalem-EX cores and 32 threads of power (a huge upgrade from 8 Nehalem-EP cores).  I installed the FusionIO card and retested.  With everything in the database on FusionIO with my extremely low db_cache to force physical IO instead of logical IO, it took 12.37 Nehalem-EX threads to push the FusionIO card to 100% utilization.  When I first did the test with SSD on the Dell, using a normal db_cache, I was only able to get 167k TPM.  Here, I did 170k.  This means I was able to get more speed with purely physical IO's than I was able to get from physical+logical IO's on the SSD's.

Swingbench on 11gR2                     topas/atop
Max TPM  Max TPS  Avg TPS  Avg Resp     Disk Busy %  CPU%    Notes
170036   123438   2749     48           0/100        12.37   Everything (including undo) on flash

This made me wonder...if I didn't hold back the db cache, what could the Cisco UC460 do with FusionIO?  In a normal db configuration, how would it perform?  The answer:

Swingbench on 11gR2                     topas/atop
Max TPM  Max TPS  Avg TPS  Avg Resp     Disk Busy %  CPU%    Notes
440489   7599     6888     8            0/60         31.41   Everything on flash, with normal memory settings

It took almost 32 threads, and now I'm once again almost 100% CPU utilized.  At this point with 32 threads on a high-performance Cisco UC460, the FusionIO is only 60% utilized. This is the fastest server I have to test with...there's never an 8X10 laying around when you need one.... :)  That's ok, I have enough information to extrapolate.  If 32 threads can drive this thing to 60% utilization...I can calculate this:

Swingbench on 11gR2                     topas/atop
Max TPM  Max TPS  Avg TPS  Avg Resp     Disk Busy %  CPU%    Notes
734148   12665    11480                 0/100        53      Extrapolating (if we had the CPU to make FIO the bottleneck)

Unless some new bottleneck is introduced, a database on FusionIO (with the 53 Nehalem-EX threads to drive it) would be the fastest database I've ever seen, according to Swingbench testing.  734k TPM....I'm not talking IOPS...I'm saying achieved transactions, with each one having many IO's.  To put things in perspective, that's 13.6X faster than the SSD. There are 6 PCI-E slots on this server...so we could easily still see the bottleneck move to CPU, even with 4 sockets and 64 threads.
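The extrapolation is just a linear scale-up from 60% to 100% utilization.  Here it is as a Python sketch, using the measured UC460 row above...the big assumption (mine, and it's a loud one) is that throughput and CPU both scale linearly and nothing else becomes the bottleneck first.

```python
import math

# Measured on the UC460 with normal memory settings, FusionIO 60% busy.
measured = {"max_tpm": 440489, "max_tps": 7599, "avg_tps": 6888, "cpu_threads": 31.41}


def extrapolate_to_full_utilization(results, utilization):
    """Scale each measurement linearly from `utilization` (0-1] up to 100%."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return {name: value / utilization for name, value in results.items()}


if __name__ == "__main__":
    full = extrapolate_to_full_utilization(measured, 0.60)
    print("Max TPM:", round(full["max_tpm"]))
    print("Max TPS:", round(full["max_tps"]))
    print("Avg TPS:", round(full["avg_tps"]))
    # Round threads up, since you can't use a fraction of a thread.
    print("CPU threads needed:", math.ceil(full["cpu_threads"]))
```

Real systems rarely scale perfectly linearly, so treat the result as an upper bound, not a prediction.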

To be fair though, the SSD disk has a lot of things outside of its control...it has to go through SAN hardware and HBA's.  The FusionIO has direct access with the PCI-E bus...which isn't really a bus, if you get into the details of PCI-E.  That's one of the reasons why, as the FusionIO card gets more and more busy, the response time continues to stay low.  In my testing, the worst response time I saw was .13 milliseconds (there's a . before the 13...that's 130 microseconds...not quite RAM speed, but closer to RAM response time than spindle response time.)...when the UC460 was completely pegged (which alone, is a difficult thing to accomplish.)

FusionIO is ridiculously fast and from an IOPS/$ perspective, extremely cheap.  They captured a huge percentage of the flash market in the first year of their public offering, and it's easy to see why.  In the techy world we scratch and claw for any technology that can bring a few percent of performance improvement to our databases.  It's rare we see something that so completely transforms the environment...where we can shift bottlenecks and get 10X+ performance.  Think about what this means...you likely have CPU monitoring on your servers to make sure a process doesn't spin and take up CPU needlessly...that's going to have to go out the window or become more intelligent...because the CPU will be expected to be the bottleneck.  We'll need to use Oracle Resource Manager more than ever.

Downside: FusionIO is local storage (for now.)  Although their cost per IOP is low, their cost relative to cheaper storage on SAN is very high from a GB/$ standpoint, which is the traditional way storage is approached by storage administrators.  It takes a strong database engineer to convince management to approach the issue from an IOPS requirement rather than a GB requirement.  It's a paradigm shift that needs to be made to reach ultimate performance goals.  Kaminario has addressed the local storage requirement issue well, from what I've heard.  Hopefully I'll take a look at that within a few months.

So...flash technology and especially FusionIO can be a game changer...but how can you configure it to be useful while most efficiently using your storage budget for databases?  I'll look at that next...

Flash with Databases - Part 1
Flash with Databases - Part 2
Flash with Databases - Part 3
Flash with Databases Inexpensively 

Friday, December 16, 2011

Flash with Databases-Part 1

For those (like me) that just want to see the results, go <* HERE *>

I've been asked to review some flash storage from different vendors from a database perspective.  Storage Administrators have run their battery of tests and at this point narrowed down the field to a famous enterprise storage manufacturer's SSD's on SAN storage and FusionIO (320GB first generation).  The difference between these technologies is pretty vast...starting with their interface.  Placing SSD drives in a SAN solution is transparent to the server...it could be any other storage (FC, SATA) connecting to the server from the SAN...it just gets presented as a LUN.  There are a lot of advantages to that approach with ease of administration and scalability.  FusionIO uses PCI-E cards connected locally to the server.  There are obvious pros and cons to this approach too. If a card dies, you have to go to that server and pull the card...also, there are a limited number of PCI-E slots in servers...some of the most popular, dense servers, like Cisco's B230M2 have no PCI-E slots...so they aren't compatible with FusionIO.  The sales guys from FusionIO tell me they're working closely with Cisco to begin selling storage with a mezzanine interface, to eliminate the PCI-E slot requirement.

For FusionIO, the speed of a direct PCI-E interface is tremendous, and it really shows up in the response time...which, to a database is hugely important...but being limited to a single server for this storage, IMHO is a limitation that's difficult for a storage admin to accept.  The SAN-based approach taken by the SSD manufacturer allows much more simple, traditional central management of the storage through an HBA to your server.  Likely, if you went with this solution it would be the same as your current FC storage, only faster.  No major changes needed for monitoring, administration, etc.  These advantages of the SSD solution can't be quantified, but they should at least be mentioned.  FusionIO's response is that their cards are expected to last through ~7 years of typical use (depending on how much storage you reserve, which is configurable) so although administration and monitoring can be handled with SNMP traps and hot-swappable interfaces, it's likely unnecessary because your server will be life cycled out long before you need to replace your FusionIO card.  That being said, even the most reliable hardware has failures, so your configuration needs to compensate with redundancy.

The plan is to test the storage with Orion and Swingbench, moving different items in the database (data, indexes, redo logs, temp, undo) onto flash to find the best way to use flash and to see the performance benefit.  Much of this testing has already been done by others, but there were some things I saw missing in the flash tests on databases I've seen:

  1. The effect of undo on flash

  2. The resource consumption of FusionIO, which uses a driver to simulate disk storage.  This must introduce overhead, but I've never seen it measured. 

  3. As a flash device fills, it slows down.  To be precise, its write performance slows down.  Performance can also degrade after extended periods of high IO.  How will this affect a database?

The tests will be similar to the tests done in previous posts to compare storage on/behind the Hitachi USP-V. Part 1, Part 2, Part 3, Part 4.

FusionIO uses a driver on the server to access their storage to emulate a disk drive...similar to a RAM disk.  This strategy must introduce some overhead.  Texas Memory Systems considers that a selling point of their flash technology...it offloads the CPU overhead to a chip directly on their flash cards.  Since Oracle licenses by core on the system (yes, that's overly simplified), that means you have to pay Oracle for the expensive CPU cycles used by FusionIO.  This may not turn out to be significant...we'll see in testing.  I may get a look at one of the TMS systems before long...it'll be an interesting comparison to the FusionIO card, since they're both PCI-E based.  I'm also thinking about throwing Kaminario in the mix if that's possible.  Kaminario doesn't have the name recognition of EMC, but they have an interesting product, designed for the ultra low latency of flash.  They use DRAM for their tier 1 storage and FusionIO for their Tier 2 storage.  A rep from FusionIO told me their cards in Kaminario are nearly as fast as their cards plugged directly into a local PCI-E slot...pretty interesting.

Which is cheaper?

FusionIO and all single SSD drives I've seen are a single point of failure.  For a database that's important enough to your company to justify the cost of ultra-fast storage, it's likely important enough to require elimination of a hardware single point of failure.  For an SSD on a SAN or NAS you already own, that might mean you just increased the cost per GB by 25% (because you'll do 4+1 RAID5.)  For FusionIO, since there's a limited number of PCI-E slots, your best option for a database is to do what Oracle did on Exadata...use ASM's redundancy...in this case, that means you probably mirror it with normal redundancy...which means you doubled the cost per GB.  When you consider the cost of SSD vs FusionIO, you can't just look at your quoted price per GB, you have to consider configuration, which makes FusionIO more expensive at first glance.  Another option is to use 2 servers and license a Data Guard standby...but that increases your Oracle licensing and again, doubles the cost per GB of FusionIO storage.  If you don't already have the infrastructure in place with extra capacity, you also have to consider the "other" costs on the SSD side...HBA's and SAN costs.
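The redundancy overhead is easy to work out.  Here's a minimal Python sketch of the two multipliers mentioned above (4+1 RAID5 and a 2-way ASM mirror).  The quoted price per GB in the example is a made-up placeholder, not a real quote.

```python
def raid5_cost_multiplier(data_disks, parity_disks=1):
    """Raw capacity you must buy per usable GB in an N+parity RAID5 group."""
    return (data_disks + parity_disks) / data_disks


def mirror_cost_multiplier(copies=2):
    """Raw capacity you must buy per usable GB when every block is stored `copies` times."""
    return float(copies)


def effective_cost_per_gb(quoted_cost_per_gb, multiplier):
    """Quoted $/GB adjusted for the redundancy overhead."""
    return quoted_cost_per_gb * multiplier


if __name__ == "__main__":
    quoted = 10.0  # hypothetical quoted $/GB, for illustration only
    print("4+1 RAID5 :", effective_cost_per_gb(quoted, raid5_cost_multiplier(4)))
    print("ASM mirror:", effective_cost_per_gb(quoted, mirror_cost_multiplier()))
```

With the same quoted price, 4+1 RAID5 costs 1.25X per usable GB and a mirror costs 2X, which is the 25% vs double comparison in the paragraph above.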

Usually...that's the traditional kind of price comparison that's done with storage.  For database storage, what I highly recommend is that you base your requirements on IOPS per $ if it's an OLTP database and throughput per $ if it's a DSS database, not GB per $.  When you're talking about a network share...$/GB might be the only consideration...but a database needs IOPS to perform well.  You COULD run your enterprise database on cheap, large SATA disks you bought from Fry's or Microcenter, but when people see how it performs you'll be fired and replaced by somebody who isn't afraid to have the difficult conversations that justify SSD's for databases. :)  You could buy a room full of these drives and maybe achieve the IOPS or throughput required for your database, but at that point, FusionIO or a SAN with SSD's would be cheaper...especially when you consider administration costs of maintaining and cooling that room.

I'm testing with the 320GB FusionIO Version 1 card. Here are the specs:

ioDrive Capacity  320GB 
Nand Type  Multi Level Cell (MLC)
Read Bandwidth (64kB)  770 MB/s
Write Bandwidth (64kB)  790 MB/s
Read IOPS (512 Byte)  140,000
Write IOPS (512 Byte)  135,000
Mixed IOPS (75/25/r/w)  119,000
Access Latency (512 Byte)  26 μs

In contrast, here's the specs for the Version 2 of this card:

ioDrive2 Capacity        400GB     365GB
Nand Type                SLC       MLC
Read Bandwidth (1 MB)    1.4 GB/s  710 MB/s
Write Bandwidth (1 MB)   1.3 GB/s  560 MB/s
Read IOPS (512 B)        351,000   84,000
Write IOPS (512 B)       511,000   502,000
Read Access Latency      47µs      68µs
Write Access Latency     15µs      15µs

You might notice the difference in read throughput and IOPS on the V2 card dropped from the V1.  This is because they changed the design of the card to be more dense moving from 30nm to ~22nm.  MLC is noisy, so to ensure error-free data, they've implemented ECC, which serializes reads.  Writes are still parallel.  Latency has been lowered with the new design from 29 μs on the V1 to 15µs on v2 for writes...but read latency is much slower (but still insanely fast) at 68µs.

FIO expects the read speeds to become closer to their write speeds through driver updates in the near future.  This is really cutting edge tech at this chip nm density...it's only now starting to be available for non-Facebook customers of FusionIO.  Facebook is a huge customer, and they get the good stuff first...

Anyway, since the V2 card isn't performing as well as it eventually will, and since the V1 card is faster, I opted for testing the V1 card.

I look forward to these tests...they should be interesting.  I'll let you know what I find out.
