Monday, July 6, 2015

Splitting out individual db CPU from a consolidated server

I remember an English class in high school where we were assigned a term paper.  When somebody asked our teacher (Mrs Theile) how long it should be, she replied, "A term paper should be like a dress.  Long enough to cover the topic but short enough to keep you interested."  It occurred to me today that term papers are also like virtual CPU's for databases.  Recently a DBA friend told me about a scenario where he's migrating several databases from a physical server to multiple virtual servers.  The motivations for doing this are that:

  • Today, multiple business units have to agree on when to take an outage whenever there are OS patches that need to be applied.  Coordination is a nightmare.  The new environment will have each database running in its own OS environment, so this will no longer be an issue.
  • The existing hardware is very old and ready to end its life cycle.

He asked for advice to size the VM's.  He wanted to know how much storage capacity, RAM and CPU to tell the virtual team each VM needs for each database.

For storage capacity, he looked at dba_segments, v$logfile and v$tempfile.  He took several days of redo generation to size the FRA capacity.  Using the advice views (ie: v$db_cache_advice, etc) he was able to gather how much ram each db needed to optimize logical caching in the VM's for each memory pool.  That was all straight forward.  The more difficult task is to figure out how much CPU each db within the server typically needs.

When he looks at system-related monitoring tools, they show the total amount of cpu used on the system, not broken out by database...so, that's not very helpful.  In his situation he has 10 databases per physical server.  Like the term paper, there's a delicate balance between having too much cpu and not having enough.  There's really no way to quantify what "too much" is because its expected that under extremely rare situations, you're going to peg your CPU.  How often that's acceptable depends on your SLA's.  Also, its important to keep in mind CPU scheduling in VMware will slow down the db if you overallocate CPU. (I mentioned this in point #2 in a previous post)

If you think of a busy core as CPU time divided by 1000 ms, then the WRH$ tables have the solution for us.  There are two potential (but probably very uncommon) pitfalls. 

1. The cpu time is recorded in increments of 10ms.  If your core is busy for less than 10ms, it gets counted as 0. 

2. Strange things happen to system stats when 100% of the CPU is used.  At that point, you have to consider the stats unreliable.  For our purposes, this should rarely happen since we're going to look at an average over a snap period (by default, 1 hr)  so I filtered out that data.  If its common for you to run at 100% cpu utilization, this query won't help you determine how much cpu is sufficient.

So...granted that sometimes we're going to peg the CPU, how many vCPU would fulfill my SLA? If you have a very cyclical usage model for your database...ie: you're a retail shop and you're only busy 2 days after Thanksgiving, and the day before and after Christmas.  On those days you could increase the CPU as needed (because you're in a virtual machine or a cloud), but you want it to be sufficient the rest of the time.  Let's say those peak times total 4 days.  Take the total seconds in the 4 days divided by the total seconds in a year:

60 sec/min*60min/hr*24hr/day=86400 seconds per day
86400 sec/day*4 peak days=345,600 sec of peak time
86400 sec/day*365day/yr=31,536,000 sec/yr
1-(345,600/31,536,000)=.989

...so you would need your SLA to be at least .989.  Using the query below you could see that the SLA historically could be met with 17 cores.  I like even numbers to help VMWare cpu scheduling (ideally it would be a power of 2...but that's a different post), so I'd bump it up to 18.  So...with 18 cores, we'd have enough CPU 99.06684% of the time.  The rest of the time would fall under those peak days where we'd temporarily increase the vCPU as needed.


CORES_USEDSECONDS_AT_CORE_USEDSECONDS_AT_OR_BELOW_CORE_USEDTOTAL_SECONDS_IN_SNAPSSLA_ACHIEVED
1800914944369819719619276822.27309
24246642812378969119619276863.09595
31609381516625612019619276884.74121
4553107318234993519619276892.94427
5217286218788100819619276895.76347
676994619005387119619276896.87099
750137319082381719619276897.26343
827848819132519119619276897.51898
923237419160367919619276897.66093
1033689819183605419619276897.77937
1135028719217295319619276897.95109
1228078419252324119619276898.12963
1327068319280402519619276898.27275
1438887619307470819619276898.41072
1533458819346358519619276898.60893
1627368919379817319619276898.77947
1729010919407186219619276898.91897
1829021819436197219619276899.06684
1933504219465219019619276899.21476
2021596519498723319619276899.38554
2126093119520319819619276899.49561
2224626719546413019619276899.62861
2328795519571039719619276899.75413
2419441519599835219619276899.90091


with cpu_info as
(
select /*+ PARALLEL 8 */ sum(seconds) weight, sum(seconds), cpups cores_used from (
select s2.snap_id snap_1, s2.snap_id snap_2, to_date(to_char(begin_interval_time,'MM/DD/YYYY HH24'),'MM/DD/YYYY HH24') sample_snap,
  s2.cpu-s1.cpu cpu, round(trunc((s2.cpu-s1.cpu)/seconds)/100) cpups, seconds
from (
select hiof1.snap_id, hiof1.value cpu
from sys.WRH$_SYSSTAT HIOF1
where HIOF1.stat_id = (select stat_id from v$statname where name = 'CPU used by this session')
) s1, (
select hiof1.snap_id, hiof1.value cpu
from sys.WRH$_SYSSTAT HIOF1
where HIOF1.stat_id = (select stat_id from v$statname where name = 'CPU used by this session')
) s2, (
select snap_id, instance_number, begin_interval_time, end_interval_time,
 extract(second from (end_interval_time-begin_interval_time))+
 (extract(minute from (end_interval_time-begin_interval_time))*60)+
 (extract(hour from (end_interval_time-begin_interval_time))*60*60) seconds
 from dba_hist_snapshot
)
 ms
where s1.snap_id=ms.snap_id
  and s1.snap_id=(s2.snap_id-1)
  and (s2.cpu-s1.cpu)>1
  and (round(trunc((s2.cpu-s1.cpu)/seconds)/100))<=(select sum(p1.value)*p2.value
                                                      from gv$parameter p1, v$parameter p2
                                                      where p1.name='cpu_count'
                                                        and p2.name='parallel_threads_per_cpu'
                                                      group by p2.value)
) group by cpups
)
select ci1.cores_used, trunc(ci1.weight) seconds_at_core_used, trunc(sum(ci2.weight)) seconds_at_or_below_core_used, trunc(ci3.weight) total_seconds_in_snaps, round(100*(sum(ci2.weight)/ci3.weight),5) SLA_Achieved from cpu_info ci1, cpu_info ci2, (select sum(weight) weight from cpu_info) ci3
where ci2.cores_used+1<=ci1.cores_used
group by ci1.weight, ci1.cores_used, ci3.weight
order by 1; 



No comments:

Post a Comment