Oracle Tips and Tricks: July 2015

While you read this next paragraph, use the voice of an old west cowboy.

Its high noon in the OCC coral. The boys are ridin' in, lookin' for a fight, and fate ain't gonna let them down. Who'll walk away when the dust settles and who will be eatin' the dust...Oracle Cloud Control, or the 12c Database?

In a recent SR, MOS referenced a note talking about an incompatibility between the 12c Oracle agent and 12c's threaded architecture in Linux/Unix. The MOS note I'm talking about is:
"Databases show down in Cloud Control when using Network Connection Pool, feature with 12.1.0.1 databases and 11.2 JDBC threaded_execution=true (Doc ID 1960485.1)"

In past posts I've talked about the performance and scalability virtues of 12c in a multi-threaded architecture...depending on your workload, you'll get increased caching, less CPU utilization and less memory utilization.  New connections are threads, not processes in Unix/Linux. Other blogs have posts on this topic that stated a 30% performance improvement. Its a beautiful thing. In a recent 12c RAC upgrade I worked on, the feedback from the user testing called this feature, "The Turbo Button."

The incompatibility is because...for some reason, Oracle elected to go with the ancient 11.1.0.7 JDBC drivers in the 12c OEM agent, and they have no plans to improve that until OEM 13 is released. The multi-threaded architecture is only functional for JDBC thin clients 11.2.0.1 and up. So...when you try to use the latest, greatest OEM agent, besides things appearing falsely down in OEM Cloud Control, many very small trace files are created in the db's diag trace directory every second or so. Very quickly your log destination will fill up. The traces contain an error like:
...
Network protocol error on first data after new connect
Probable error (ORA-28546) in network administration.

To work around this problem, we implemented the same solution I showed in my previous post for Netbackup. We have threaded_execution=true in the init.ora parameters and we have 2 listeners, one with DEDICATED_THROUGH_BROKER_LISTENER=ON and the other with it off. This makes the database a hybrid between threaded and traditional process architecture. For things like Netbackup and OEM we send them to the listener that creates a traditional process. For everything else, we use the threaded architecture. Its the best of both worlds...performance of multithreaded architecture and backwards compatibility with ancient 11.1 jdbc drivers.

Almost everybody uses OEM for monitoring/alerting and soon every supported database will be on 12c. I hope this helps you as you roll out your 12c implementation so you can complete it before January 2016 (when 12c becomes the only supported option.) I mentioned this work around in the SR with the hope that MOS would share this with other people having this issue. I never saw a western movie where the characters went to the corral for a gunfight, and went away shaking hands and working together, but in this "show down", with this work around, I think that's exactly what happened between 12c's multi-threaded architecture and the Cloud Control's OEM agent. Yippy-ki-yeah.

LISTENER_SCAN2=(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN2))))                # line added by Agent
LISTENER_SCAN3=(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN3))))                # line added by Agent
LISTENER=(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER))))            # line added by Agent
LISTENER_SCAN1=(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN1))))                # line added by Agent
ENABLE_GLOBAL_DYNAMIC_ENDPOINT_LISTENER_SCAN1=ON                # line added by Agent
VALID_NODE_CHECKING_REGISTRATION_LISTENER_SCAN1=OFF             # line added by Agent
ENABLE_GLOBAL_DYNAMIC_ENDPOINT_LISTENER=ON              # line added by Agent
VALID_NODE_CHECKING_REGISTRATION_LISTENER=SUBNET                # line added by Agent

lsnr-no-thread =
(DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS_LIST =
       (ADDRESS = (PROTOCOL = TCP)(HOST = hostbackupnetwork)(PORT = 1521))
      )
    )
)
SID_LIST_no-thread=
(SID_LIST =
    (SID_DESC =
      (SID_NAME = remedyp2)
      (ORACLE_HOME = /u01/app/oracle/product/12.1.0/dbhome_1)
    )
)

DEDICATED_THROUGH_BROKER_lsnr-no-thread=OFF
ENABLE_GLOBAL_DYNAMIC_ENDPOINT_LISTENER_SCAN3=ON                # line added by Agent
VALID_NODE_CHECKING_REGISTRATION_LISTENER_SCAN3=OFF             # line added by Agent
ENABLE_GLOBAL_DYNAMIC_ENDPOINT_LISTENER_SCAN2=ON                # line added by Agent
VALID_NODE_CHECKING_REGISTRATION_LISTENER_SCAN2=OFF             # line added by Agent

I remember an English class in high school where we were assigned a term paper. When somebody asked our teacher (Mrs Theile) how long it should be, she replied, "A term paper should be like a dress. Long enough to cover the topic but short enough to keep you interested." It occurred to me today that term papers are also like virtual CPU's for databases. Recently a DBA friend told me about a scenario where he's migrating several databases from a physical server to multiple virtual servers. The motivations for doing this are that:

Today, multiple business units have to agree on when to take an outage whenever there are OS patches that need to be applied. Coordination is a nightmare. The new environment will have each database running in its own OS environment, so this will no longer be an issue.

The existing hardware is very old and ready to end its life cycle.

He asked for advice to size the VM's. He wanted to know how much storage capacity, RAM and CPU to tell the virtual team each VM needs for each database.

For storage capacity, he looked at dba_segments, v$logfile and v$tempfile. He took several days of redo generation to size the FRA capacity. Using the advice views (ie: v$db_cache_advice, etc) he was able to gather how much ram each db needed to optimize logical caching in the VM's for each memory pool. That was all straight forward. The more difficult task is to figure out how much CPU each db within the server typically needs.

When he looks at system-related monitoring tools, they show the total amount of cpu used on the system, not broken out by database...so, that's not very helpful. In his situation he has 10 databases per physical server. Like the term paper, there's a delicate balance between having too much cpu and not having enough. There's really no way to quantify what "too much" is because its expected that under extremely rare situations, you're going to peg your CPU. How often that's acceptable depends on your SLA's. Also, its important to keep in mind CPU scheduling in VMware will slow down the db if you overallocate CPU. (I mentioned this in point #2 in a previous post)

If you think of a busy core as CPU time divided by 1000 ms, then the WRH$ tables have the solution for us. There are two potential (but probably very uncommon) pitfalls.

1. The cpu time is recorded in increments of 10ms. If your core is busy for less than 10ms, it gets counted as 0.

2. Strange things happen to system stats when 100% of the CPU is used. At that point, you have to consider the stats unreliable. For our purposes, this should rarely happen since we're going to look at an average over a snap period (by default, 1 hr) so I filtered out that data. If its common for you to run at 100% cpu utilization, this query won't help you determine how much cpu is sufficient.

So...granted that sometimes we're going to peg the CPU, how many vCPU would fulfill my SLA? If you have a very cyclical usage model for your database...ie: you're a retail shop and you're only busy 2 days after Thanksgiving, and the day before and after Christmas. On those days you could increase the CPU as needed (because you're in a virtual machine or a cloud), but you want it to be sufficient the rest of the time. Let's say those peak times total 4 days. Take the total seconds in the 4 days divided by the total seconds in a year:

60 sec/min*60min/hr*24hr/day=86400 seconds per day
86400 sec/day*4 peak days=345,600 sec of peak time
86400 sec/day*365day/yr=31,536,000 sec/yr
1-(345,600/31,536,000)=.989

...so you would need your SLA to be at least .989. Using the query below you could see that the SLA historically could be met with 17 cores. I like even numbers to help VMWare cpu scheduling (ideally it would be a power of 2...but that's a different post), so I'd bump it up to 18. So...with 18 cores, we'd have enough CPU 99.06684% of the time. The rest of the time would fall under those peak days where we'd temporarily increase the vCPU as needed.

CORES_USED	SECONDS_AT_CORE_USED	SECONDS_AT_OR_BELOW_CORE_USED	TOTAL_SECONDS_IN_SNAPS	SLA_ACHIEVED
1	80091494	43698197	196192768	22.27309
2	42466428	123789691	196192768	63.09595
3	16093815	166256120	196192768	84.74121
4	5531073	182349935	196192768	92.94427
5	2172862	187881008	196192768	95.76347
6	769946	190053871	196192768	96.87099
7	501373	190823817	196192768	97.26343
8	278488	191325191	196192768	97.51898
9	232374	191603679	196192768	97.66093
10	336898	191836054	196192768	97.77937
11	350287	192172953	196192768	97.95109
12	280784	192523241	196192768	98.12963
13	270683	192804025	196192768	98.27275
14	388876	193074708	196192768	98.41072
15	334588	193463585	196192768	98.60893
16	273689	193798173	196192768	98.77947
17	290109	194071862	196192768	98.91897
18	290218	194361972	196192768	99.06684
19	335042	194652190	196192768	99.21476
20	215965	194987233	196192768	99.38554
21	260931	195203198	196192768	99.49561
22	246267	195464130	196192768	99.62861
23	287955	195710397	196192768	99.75413
24	194415	195998352	196192768	99.90091

with cpu_info as

(

select /*+ PARALLEL 8 */ sum(seconds) weight, sum(seconds), cpups cores_used from (

select s2.snap_id snap_1, s2.snap_id snap_2, to_date(to_char(begin_interval_time,'MM/DD/YYYY HH24'),'MM/DD/YYYY HH24') sample_snap,

s2.cpu-s1.cpu cpu, round(trunc((s2.cpu-s1.cpu)/seconds)/100) cpups, seconds

from (

select hiof1.snap_id, hiof1.value cpu

from sys.WRH$_SYSSTAT HIOF1

where HIOF1.stat_id = (select stat_id from v$statname where name = 'CPU used by this session')

) s1, (

select hiof1.snap_id, hiof1.value cpu

from sys.WRH$_SYSSTAT HIOF1

where HIOF1.stat_id = (select stat_id from v$statname where name = 'CPU used by this session')

) s2, (

select snap_id, instance_number, begin_interval_time, end_interval_time,

extract(second from (end_interval_time-begin_interval_time))+

(extract(minute from (end_interval_time-begin_interval_time))*60)+

(extract(hour from (end_interval_time-begin_interval_time))*60*60) seconds

from dba_hist_snapshot

)

where s1.snap_id=ms.snap_id

and s1.snap_id=(s2.snap_id-1)

and (s2.cpu-s1.cpu)>1

and (round(trunc((s2.cpu-s1.cpu)/seconds)/100))<=(select sum(p1.value)*p2.value

from gv$parameter p1, v$parameter p2

where p1.name='cpu_count'

and p2.name='parallel_threads_per_cpu'

group by p2.value)

) group by cpups

)

select ci1.cores_used, trunc(ci1.weight) seconds_at_core_used, trunc(sum(ci2.weight)) seconds_at_or_below_core_used, trunc(ci3.weight) total_seconds_in_snaps, round(100*(sum(ci2.weight)/ci3.weight),5) SLA_Achieved from cpu_info ci1, cpu_info ci2, (select sum(weight) weight from cpu_info) ci3

where ci2.cores_used+1<=ci1.cores_used

group by ci1.weight, ci1.cores_used, ci3.weight

order by 1;

Oracle Tips and Tricks

Search This Blog

Tuesday, July 21, 2015

The funniest title for a MOS note ever? Using OEM and Cloud Control with 12c in multi-threaded mode

Monday, July 6, 2015

Splitting out individual db CPU from a consolidated server