Thursday, June 9, 2011

How much faster is Exadata High Capacity vs High Performance Storage? (short stroking)

Compression is a powerful feature in Exadata, especially HCC compression...but in the real world you have time constraints on your migration project and it isn't necessarily possible to compress everything you'd like to compress; you have to test and compare the performance impact on queries vs storage gains.  There are so many options for compression in Exadata...basic compression (formerly bulk compression), compress for OLTP (formerly advanced compression), HCC compress for query (low, high) and HCC compress for archive (low, high)...not to mention index key compression, which I'll post about later.  Which one you choose is dictated by your data access patterns.  It takes time to figure out which is right for each table/partition.

This came up because my client has too much data to fit into a full high performance storage cell machine (over 28TB).  We've dropped indexes where possible, which has had a hugely positive impact...but now we're finding many of them need to be added again for performance reasons.  This was expected and planned...and we should be fine on capacity for the migration, but when this database hits the expected growth curve in a few years, we'll be in trouble.

One of the on-site Oracle consultants (who was very stressed about getting things to fit) suggested we move from our 80/20 data/fra High Performance storage machine to a High Capacity storage machine.  He said, "Since most of your IO will be hitting the flash cache anyway, you should only see it be a few percentage slower."  He said this w/o ever looking at our access patterns, so I dismissed it out of hand.  Some other people on the project pointed out...companies don't buy Exadata because they want "good enough"; you buy Exadata because you want ultra performance.  There were other "good enough" platforms that were much cheaper than Exadata.  Knowing they have a ~1:1 read/write ratio, I wanted to try to quantify the performance difference between the options.  Everything is identical between an Exadata X2-2 high performance storage machine and an Exadata X2-2 high capacity machine...except for the spindles, so that's what I'll be focusing on.

Of course...all things being equal, High Capacity SAS storage cells are slower than High Performance SAS storage cells.  The HP cells have 15000rpm 600GB SAS disks.  HC cells have 7200rpm 2TB SAS disks, which are mechanically similar to the SATA disks from Exadata V1 machines.  Although the sales guys won't say it, Kevin Closson did:  "...think in the same way you do about technology like FC-SATA (a SATA disk with FC attach and FC-SATA head electronics)."

Things are not necessarily equal though, because Exadata short strokes its storage.  The idea of short stroking disks is that...the outside circumference of a spinning disk is moving faster than the inside...not in RPMs of course, but in the distance travelled past the head per revolution.  So the throughput of the outside of the spindle is higher than the throughput of the inside, and if the head doesn't have as far to move, it will take less time for it to be where you want it to be.

Exadata takes advantage of short stroking and puts the most performant storage on the outside of the spindles for the data diskgroup, followed by the fra, and the most inside, slowest part of the spindle is used for dbfs.  There's a new 11.2 feature that does something similar for local storage...but that's a topic for a different post.  Once the diskgroup is created in ASM, there's a hash algorithm that distributes the data evenly around the storage w/in that diskgroup.  Short stroking in ASM can only happen when the diskgroup is created (barring a resize).

There's a standard data/fra ratio that Exadata uses normally...but the split between data and FRA is variable in the Exadata setup script...we chose an 80/20 path.  Obviously flash cache plays a huge role in IO for reads...but we're a little write heavy.  It's possible that at some point, as we short stroke the HC disks more and more, we'll approach or possibly exceed the performance of the HP disks.

Here's the math for the SAS-2 (6Gb/s) drives:
2TB SAS Spec Sheet

2TB (High Capacity)
rotational speed=7200
avg latency=8ms/2=4ms (4.16ms from data sheet)
avg access time=13.06ms
Sustained Sequential Read is 90MB/s(ID) and 144MB/s(OD)
Avg throughput:117MB/s

600GB SAS Spec Sheet
600GB (High Performance)
rotational speed=15000
avg latency=4ms/2=2ms
avg access time=5.65ms
Sustained Sequential Read is 122MB/s(ID) and 204MB/s(OD)
Avg throughput:163MB/s

Latency is the time it takes for the spindle to spin around...sometimes you're closer and sometimes you're farther from the data you're going for.  Worst case scenario, you're a full spin away; best case, you're right next to the data.  So on average, take latency/2 and that's what you can typically expect for avg latency.  Avg access time is avg latency+seek time.  This is how long you can expect to wait per IO for the platter to rotate and the head to move.
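To make the latency arithmetic concrete, here's a minimal Python sketch.  The function names are mine (hypothetical, not from any spec sheet or Oracle tool); the only inputs are the spindle speeds quoted above.

```python
def full_rotation_ms(rpm: float) -> float:
    """Time for one full revolution of the platter, in milliseconds."""
    return 60_000.0 / rpm


def avg_rotational_latency_ms(rpm: float) -> float:
    """On average the data is half a revolution away from the head."""
    return full_rotation_ms(rpm) / 2.0


def avg_access_time_ms(rpm: float, avg_seek_ms: float) -> float:
    """Avg access time = avg rotational latency + avg seek time."""
    return avg_rotational_latency_ms(rpm) + avg_seek_ms


if __name__ == "__main__":
    for rpm in (7200, 15000):
        print(rpm, "rpm:", round(avg_rotational_latency_ms(rpm), 2), "ms avg latency")
```

For 7200rpm this gives about 4.17ms (the data sheet says 4.16ms), and for 15000rpm it gives 2ms, which lines up with the rounded figures in the spec lists.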

Again...we're doing an 80/20 configuration for data/fra...with data short stroked on the outside edge, we lower the capacity to 480GB (80% of 600GB), but our throughput on HP should be a little better than the spec.  The slowest (innermost) part of the data region would run at (204-122)*(1-.8)+122=138.4MB/s.  That gives a range of 204MB/s (outside) to 138.4MB/s (inside), so the avg throughput after short stroking is 171.2MB/s, increased from 163MB/s.

For the same storage (480GB) of data on the 2TB high capacity disks, we'll be using only the outer 23.4% (480GB/2048GB) of the HC spindles.  The spec throughput range is 90MB/s (inside) to 144MB/s (outside), so (144-90)*(1-.234)+90=131MB/s (minimum inside).  (144+131)/2 gives us an average of 137.5MB/s, increased from 117MB/s.
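The two throughput calculations above follow the same formula, so here's a small Python sketch of it.  It assumes, as the math above does, that throughput falls off linearly from the outer edge to the inner edge as capacity is used; the function name is hypothetical.

```python
def short_stroked_throughput(inner_mbs: float, outer_mbs: float, fraction_used: float):
    """Return (slowest, average) MB/s when only the outer `fraction_used`
    of the disk's capacity is used.

    Assumes a linear drop in throughput from the outer diameter (OD)
    to the inner diameter (ID), which is a simplification.
    """
    slowest = (outer_mbs - inner_mbs) * (1.0 - fraction_used) + inner_mbs
    average = (outer_mbs + slowest) / 2.0
    return slowest, average


if __name__ == "__main__":
    # 600GB HP disk, outer 80% used for data
    print("HP:", short_stroked_throughput(122.0, 204.0, 0.80))
    # 2TB HC disk, 480GB of 2048GB used for data
    print("HC:", short_stroked_throughput(90.0, 144.0, 480.0 / 2048.0))
```

For HP this comes out to about 138.4MB/s and 171.2MB/s.  For HC the unrounded values are roughly 131.3MB/s and 137.7MB/s; rounding the inner figure to 131 before averaging is what gives the 137.5MB/s used in this post.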

The same logic works for average seek times.  If you only use the outer 50% of the spindle, you cut your seek time in half.  In our case, we're using the outer 80% for HP and outer 23.4% for HC.
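Here's a rough Python sketch of that seek-time scaling.  It assumes avg seek time shrinks in direct proportion to the fraction of the spindle in use (a simplification; real seek curves aren't linear), and that rotational latency is unaffected.  The function name is hypothetical.  With the spec sheet numbers it appears to reproduce the 6.24ms and 4.92ms access times in the short stroked lists in this post.

```python
def short_stroked_access_ms(spec_access_ms: float, avg_latency_ms: float,
                            fraction_used: float) -> float:
    """Estimate avg access time when only the outer `fraction_used` is used.

    Seek time is taken as (spec access time - avg rotational latency) and
    scaled linearly by the fraction of the disk in use; latency is unchanged.
    """
    spec_seek_ms = spec_access_ms - avg_latency_ms
    return avg_latency_ms + spec_seek_ms * fraction_used


if __name__ == "__main__":
    # 2TB HC: 13.06ms access, 4.16ms latency, outer 23.4% used
    print("HC:", round(short_stroked_access_ms(13.06, 4.16, 0.234), 2), "ms")
    # 600GB HP: 5.65ms access, 2ms latency, outer 80% used
    print("HP:", round(short_stroked_access_ms(5.65, 2.0, 0.80), 2), "ms")
```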

2TB (High Capacity Short Stroked)
rotational speed=7200
avg latency=8ms/2=4ms (4.16ms from data sheet)
avg access time=6.24ms (was 13.06ms)
Sustained Sequential Read is 131MB/s(ID) (was 90MB/s) and 144MB/s(OD)
Avg throughput:137.5MB/s (was 117MB/s)
0-byte IOPS=160 (was 77)
32k IOPS=~154

32k*160=5MB/s, and at 137.5MB/s that means 1/27.5 of every second is spent transferring, rather than accessing, the data, so the 32k IOs would happen about 154 times/sec.

600GB (High Performance Short Stroked)
rotational speed=15000
avg latency=4ms/2=2ms
avg access time=4.92ms (was 5.65ms)
Sustained Sequential Read is 138.4MB/s(ID) (was 122MB/s) and 204MB/s(OD)
Avg throughput:171.2MB/s (was 163MB/s)
0-byte IOPS=203 (was 177)
32k IOPS=~195

32k*203=6.34MB/s, and at 171.2MB/s that means about 1/27th of every second is spent transferring, rather than accessing, the data, so that would lower the 32k IOPS to about 195.
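The IOPS estimates can be sketched in Python like this.  It follows the approximation used in this post: start from the 0-byte IOPS (1 second / avg access time), work out what share of each second would go to moving the data at the avg throughput, and knock the IOPS down by that share.  The function names are hypothetical.

```python
def zero_byte_iops(access_ms: float) -> float:
    """IOs per second if each IO costs only the avg access time."""
    return 1000.0 / access_ms


def iops_with_transfer(access_ms: float, io_kb: float, throughput_mbs: float) -> float:
    """Approximate IOPS for IOs of `io_kb` KB at `throughput_mbs` MB/s.

    The share of each second spent transferring data is subtracted
    from the 0-byte IOPS (the same shortcut the post uses).
    """
    base = zero_byte_iops(access_ms)
    mb_per_sec = base * io_kb / 1024.0           # data moved per second
    transfer_share = mb_per_sec / throughput_mbs  # fraction of a second spent transferring
    return base * (1.0 - transfer_share)


if __name__ == "__main__":
    print("HC 32k IOPS:", round(iops_with_transfer(6.24, 32, 137.5), 1))
    print("HP 32k IOPS:", round(iops_with_transfer(4.92, 32, 171.2), 1))
```

Without rounding the intermediate steps, this lands at roughly 154 for HC and just under 196 for HP; the ~195 above comes from rounding the 0-byte IOPS down to 203 first.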

Soo...for throughput 137.5/171.2 tells us the HC disks would be 80.3% as fast as the HP disks, not counting caching.  For the all-important IOPS measurement, HC would be about 79% as fast as the HP disks.  With the 1:1 read/write workload in this database and around a 90% hit ratio (a lot of activity goes to dbfs, otherwise the flash hit rate would be higher), we'd be looking at around 18.7% more performance from the HP disks.

18.7% might not seem like very much, but when you're paying for the state of the art, why would you take an 18.7% performance hit?  At that point, alternative platforms become more attractive.  Instead of going down to 4 Exadata machines filled with HC storage, we opted for 3 HP machines and 1 HC machine, with the option to add a 5th chassis filled with more HP storage cells, at an additional expense, of course.

Still...this was pretty close.  Eventually, as technology improves, I bet the X3-2's (or 4's or 5's) will have a 3TB 10K HC spindle option.  If so, they might come very close to outperforming today's HP spindles, after they're short stroked.  Just for the heck of it...if Sun/Oracle does offer 3TB 10K's in the X3-2, this is a possibility of what we would see (still keeping with the 480GB/spindle for data use):

3TB (Future High Capacity Short Stroked)
rotational speed=10000
avg latency=6ms/2=3ms
avg access time=4ms (was 9.4ms)
Sustained Sequential Read is 125MB/s(ID) and 200MB/s(OD)
Avg throughput:162.5MB/s 
0-byte IOPS=250
32k IOPS=~238
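For anyone who wants to play with other what-if drives, here's a self-contained Python sketch that chains the steps together for the hypothetical 3TB 10K disk.  Everything about that disk is a guess (it doesn't exist as an Exadata option), the class and function names are mine, and the linear seek scaling is a simplification.  Like the numbers above, it uses the full-disk avg throughput rather than a short stroked one.

```python
from dataclasses import dataclass


@dataclass
class DiskSpec:
    capacity_gb: float     # full capacity of the spindle
    rpm: float             # rotational speed
    spec_access_ms: float  # avg access time from the (guessed) spec
    inner_mbs: float       # sustained read at inner diameter
    outer_mbs: float       # sustained read at outer diameter


def short_stroke_summary(disk: DiskSpec, used_gb: float, io_kb: float = 32.0) -> dict:
    """Estimate access time and IOPS when only the outer `used_gb` is used."""
    fraction = used_gb / disk.capacity_gb
    latency_ms = 60_000.0 / disk.rpm / 2.0             # half a revolution
    seek_ms = (disk.spec_access_ms - latency_ms) * fraction  # linear scaling assumed
    access_ms = latency_ms + seek_ms
    avg_mbs = (disk.inner_mbs + disk.outer_mbs) / 2.0  # full-disk average
    iops0 = 1000.0 / access_ms
    transfer_share = (iops0 * io_kb / 1024.0) / avg_mbs
    return {
        "access_ms": access_ms,
        "avg_mbs": avg_mbs,
        "zero_byte_iops": iops0,
        "iops": iops0 * (1.0 - transfer_share),
    }


if __name__ == "__main__":
    future_hc = DiskSpec(capacity_gb=3072, rpm=10000, spec_access_ms=9.4,
                         inner_mbs=125, outer_mbs=200)
    for name, value in short_stroke_summary(future_hc, used_gb=480).items():
        print(name, round(value, 1))
```

With 480GB used out of 3072GB, this comes out at about 4ms access time, 250 0-byte IOPS and roughly 238 32k IOPS, in line with the list above.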
