Search This Blog

Thursday, July 23, 2020

The complete post-12c online procedure to move a file from file system to ASM (ORA-01110,ORA-01157,ORA-01187,ORA-12850,RMAN-03009)

In the Coronovirus-19 world, most of us would rather spend time learning about the Seattle Kraken, Max Ehrich, Roy Jones Jr and the Washington Football Team.  They'll probably eventually do what Prince did and just call themselves, "Symbol." Roy Jones and Max Ehrich will both get knocked out and the Blues will have one more team to beat before they get their next cup.  If you're reading this...you probably also care about Oracle databases...so let me tell you about a recent experience I had.

A client I work with had an issue the other day (actually 2 issues) in a container database.  This database has very strict uptime SLA's...so downtime for a bounce of an instance isn't an option.  The online process to move a datafile from file system to ASM is a 3 step process in 12c+...until now I've always thought it was a single "move datafile" command.

1. They hit a bug similar to the issue discussed in note 2558640.1, while doing a select from cdb_data_files, they got "ORA-12850: Could not allocate slaves on all specified instances: 3 needed, 0 allocated".  

2. This happened in a script that's used to add datafiles as needed.  Unfortunately, that script created a datafile with some of the error text instead of "+DATA"...so the file was created in the dbs directory on node 1 instead of putting it into ASM.  This is RAC, so the datafile was only able to be accessed on the first node.  RAC requires shared storage.


First order of business, get it working again.  This is 18.7, so it was easy to move the file into ASM without downtime.  The filename was ridiculous with special characters, and the file was on the file system on node 1, so from node 1 I did:


alter database move datafile 1654 to '+DATA' ;


...and that worked, now the datafile is in ASM's shared storage.  I kicked off a backup (which failed):



                RMAN-03009: failure of backup command on ch4 channel at 07/23/2020 11:23:24 
                ORA-01157: cannot identify/lock data file 1654 - see DBWR trace file
                ORA-01110: data file 1654: '+DATA/DB/DATAFILE/sysaux.22113.1246514439'

On node 1, this file is 100% working in ASM.  On other nodes, it was unavailable, giving ORA-01157's.  In gv$datafile I could see it was online and available on all nodes.  This made no sense to me.  I even double checked by going into asmcmd and doing an "ls" from one of the nodes that was failing...yep...it was there.  It was acting like it was offline, but reporting it was online.

I googled/checked metalink and didn't really find a good hit.  Finally I thought...there's no harm in onlining a file that says its online...so I did an:

ALTER DATABASE DATAFILE 1654 ONLINE;

The error changed from:
ORA-01157: cannot identify/lock data file 1654 
to:
ORA-01187 cannot read from file 1654 because it failed verification tests

It was still working on node 1, but getting ORA-01187 on the other nodes.  This time I found a solution:

ALTER SYSTEM CHECK DATAFILES;

...which I did on the nodes that were having issues.  After that, everything began to work.

With the solution and the original error, I went back on MOS and this time found the ultimate solution: (DOC 2499264.1) which states:

ORA-01157 error is expected to occur when running a query on dba_datafiles on a particular instance (of RAC database) for a file moved (just after creation) from Filesystem to ASM, without instance wide identification.
 
This problem was previously addressed in Bug 27268242 which was also closed as "Not a Bug" because this is an expected behavior.

(I think they meant cluster or database-wide identification.) 
<VENT> Come on, Oracle!  You shouldn't expect to get an error when running a correct query on a valid datafile.  This just causes more trouble and steps and after a datafile move, there's no reason to leave it in a partially working state.  If you're in RAC and somebody moves a file to ASM...let the other nodes know!  There's no conceivable situation where somebody would want to move a file in RAC and have it only available to a single instance...make the alter database datafile move check datafiles when its coming from the filesystem for us!  
<?VENT>

Until Larry starts listening to me...this means we have multiple steps.

That note also included a resolution:
alter system check datafiles;
alter system flush shared_pool; 

...so at this point, the datafile is working correctly on all nodes.

This means we have 3 statements to do when we move a file from the file system to ASM:
alter database move datafile 1654 to '+DATA';
alter system check datafiles;

alter system flush shared_pool; 

Now that its working, I can re-address problem #1 above...the root cause.

Note 2558640.1 discusses a similar problem when OEM is hitting the views that have the datafiles.  My issues isn't in OEM, but in every other way this matches up.  The work around:


alter system set "_px_cdb_view_enabled"=false comment='Work-around for internal view error and OEM-see bug 29595328';

This prevents the issue from happening again, but it also slows down the query (and others) hitting the cdb_data_files from 45 seconds (there's a lot of datafiles) to 90 seconds.  I suppose that's better than getting this error...I opened an SR to ask this be fixed in future releases.

I love the new feature in 12c+ to do online datafile moves...but it would be great if the process was documented when moving a datafile from the file system...all I've seen in the doc's is the basic move datafile command, not all 3 steps.

With the system working and in a state where this won't happen again...I can now return to C19's bad hair and toilet paper and coin shortages.