Thursday, August 24, 2017

Effect of MBPS vs Latency

I do *a lot* of performance testing on high performance storage arrays for multiple vendors.  Usually if I'm involved, the client is expecting to put mission critical databases on their new expensive storage, and they need to know it performs well enough to meet their needs. 


So...parsing that out..."meet their needs" means different things to different people.  Most businesses are cyclical, so the performance they need today is likely not the performance they'll need at their peak.  For example...Amazon does much more business the day after Thanksgiving than on a random day in May.  If you gather usage stats in May and size the array to match, you're going to get a call a few months later when the peak exposes the shortfall.


Before I talk about latency, let me just say AWR does a great job of keeping performance data, if you keep that data long enough...preferably at least 2 business cycles, so you can do comparisons and projections.


This statement will keep AWR data for 3 years, capturing it at an aggressive 15-minute interval:


execute dbms_workload_repository.modify_snapshot_settings (interval => 15,retention => 1576800);
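Both parameters to modify_snapshot_settings are in minutes, which is where the 1576800 comes from. A quick sanity check (a sketch in Python, nothing Oracle-specific):

```python
# AWR retention and interval are both specified in minutes.
minutes_per_year = 365 * 24 * 60          # 525,600 minutes in a (non-leap) year
retention_minutes = 3 * minutes_per_year  # 3 years of AWR history
print(retention_minutes)                  # 1576800 -- the value in the statement above

# How many 15-minute snapshots accumulate over that window:
snapshots = retention_minutes // 15
print(snapshots)                          # 105120 snapshots
```

That's a lot of snapshots, but AWR snapshot data is compact relative to the sizing mistakes it prevents.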


...at that point, see my other post re: gathering IOPS and throughput requirements.


Anyway, I often have discussions with people who don't understand the effect of latency on OLTP databases.  This is an overly simplified, serial example, but it's enough to make the point.  Think about this...let's say you have a normal 8K-block Oracle database using Netapp or EMC NFS on an active-active 10Gb network, and your amazing all-SSD storage array is capable of filling 10Gb across multiple paths.   So...the time to move one 8K block over a 10Gb pipe is...


Throughput...
10Gb/s=10485760Kb/s
An 8KB block is 64Kb (8 x 1024 x 8 bits), so (64Kb)/(10485760Kb/s)=0.000006103515625 seconds to copy one 8KB block over the 10Gb pipe.


Latency...by the time a request passes through your FC network, gets processed by the storage array, gets retrieved from disk, and makes it back to your server, it can easily take a few ms...but for fun let's say we're getting an 8ms response time.  That's .008 seconds.


.008/0.000006103515625=1310.72...


...so the effect of latency on your block is over 1,300X greater than the effect of throughput.  If your throughput magically got faster but your latency stayed the same...performance wouldn't really improve much.  If you went from 8ms to 5ms, on the other hand, that would have a huge effect on your database performance.
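The arithmetic above is easy to check with a few lines of Python (using binary units throughout, to match the figures in this post):

```python
# Time on the wire for one 8KB Oracle block over a 10Gb/s pipe.
block_bits = 8 * 1024 * 8          # an 8KB block = 65,536 bits = 64Kb
pipe_bits_per_sec = 10 * 1024**3   # 10Gb/s, binary units
transfer_time = block_bits / pipe_bits_per_sec
print(transfer_time)               # ~6.1e-06 seconds of pure throughput cost

# Compare against an 8ms round-trip response time.
latency = 0.008
print(latency / transfer_time)     # ~1310x: latency dwarfs wire time
```

The exact ratio shifts with block size and link speed, but the conclusion doesn't: for single-block OLTP reads, the time spent waiting on the array swamps the time spent moving bits.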


There's a lot that can affect latency...usually the features in use on the storage array play a big part.  CPU utilization on the storage array can climb too high, and that's surprisingly complicated for the storage admins to diagnose.  On EMC VMAX3's, for example, CPU cores are allocated to "pools" for different features.  So...even if you never use eNAS, by default a lot of your VMAX cores are allocated to it.  When your FC traffic pegs its cores and latency tanks, the administrator may look at overall CPU utilization and not see an issue...there's free CPU available, just not in the pool used for the FC front-end cores, so it becomes a bottleneck.  Big performance improvements are possible by working closely with your storage vendor to reduce latency during testing...about 6 months ago I worked with a team that cut latency by over 50% from the standard VMAX3 as delivered, just by adjusting those core allocations.


All this to say...Latency is very important for common OLTP databases.  Don't ignore throughput, but don't focus on it.
