Search This Blog

Wednesday, August 7, 2019

Data Cleanup (how to exercise the ghost from the machine)

The scariest movie I ever saw was the Exorcist...to tell the truth, I don't think I actually saw it...while I was watching it I had my eyes closed half the time and I left early.  This is relevant to Oracle Databases...stay with me here.  Database upgrades are a *HOT* topic right now...with the extinction of 12.1.0.2 free support, everybody is moving to 18c or 19c.  I recently helped a customer with a problem that was preventing their database upgrades from being successful.  They were moving a legacy database into a multi-tenant pluggable database using datapump and rows were being dropped.  We initially assumed it was a character set issue, but we were going from UTF8 to UTF8.  We eventually learned that this database had long ago migrated to UTF8 from a different character set, and the conversion was done incorrectly, so we were left with bad data in the legacy tables.

They said they've had issues with "ghosts" in the data...two columns that appeared to equal each other not equaling each other in plsql, that kind of thing...but they always attributed it to bugs.  Now we knew the column's data had hidden, invisible characters in them.  They didn't care about the hidden characters, but the rest of the data in the column needed to be saved.  They only needed the basic, common ascii character set, so I wrote this quick little stored procedure to dynamically take the data and scan it, replacing any unusual characters with a space.  After this the import worked without an issue, and the database ghosts were exercised (no young and old priest required).  There were many 100 mil+ row tables, but the number of rows with bad data were only in the thousands...so I didn't do a forall update...if your data has a larger number of rows to change, you might want to modify this.

When its complete, you can see if everything worked ok with this:
select /*+ PARALLEL 8 */ * from ascii_check where CPOS=-1 order by 1,2,3;

[insert usual disclaimer here] I used this on many very large tables and it appears to do what I intended (which was to use the character's ascii code range to decide if it was a character we wanted to not), but review it closely to make sure it does what you want it to do in your environment before you use it. No warranties/promises...it'll probably not work...yada yada yada....

First, create this table that will serve as a log to keep all the data that was changed, in case something unexpected happens or you need to revert back:

create table ascii_check (Table_Owner varchar2(30),Table_Name varchar2(255), Col_Name varchar2(255), old_rowid varchar2(255), cpos number, code number, old_value varchar2(255 char));

...once you feel confident its doing what you want...you can submit it as a job, and have many tables being scanned and corrected at the same time.

CREATE OR REPLACE procedure data_cleanup
  (
  Table_Owner_in in varchar2,
  Table_Name_in in varchar2,
  Column_Name_in in varchar2,
  Testing_in in varchar2
  )
is
  v_Char number;
  v_String2 varchar2(32000);
  v_Flag boolean;
  v_Flag2 boolean := true;
  v_Counter number :=0;
  v_Errm varchar2(255);
  v_ErrTrack number :=0;
  v_Count number := 0;
  already_completed exception;
  pragma exception_init(already_completed,-20001);
  type rct_RowIDs is ref cursor;
  rc_RowIDs rct_RowIDs;
  v_Err_Rowid varchar2(18);
 
  type t_Row is record
  (
    Old_RowID varchar2(18),
    Old_Data varchar2(32000)
  );
  type t_Rows is table of t_Row index by pls_integer;
  c_Rows t_Rows;
begin
  execute immediate 'ALTER SESSION SET ddl_lock_timeout=900';
  select /*+ PARALLEL 16 */ count(0) into v_Count from ascii_check where table_owner=Table_Owner_in and Table_Name=Table_Name_in and col_name=Column_Name_in and old_value='Completed Successfully.';
  if v_Count!=0 then
    raise already_completed;
  end if;
  delete /*+ PARALLEL 16 */ from ascii_check where table_owner=Table_Owner_in and Table_Name=Table_Name_in and col_name=Column_Name_in and cpos=-1;
  v_ErrTrack := 1;

  OPEN rc_RowIDs FOR  'select /*+ PARALLEL 16 */ rowid,'||Column_Name_in||' from '||Table_Owner_in||'.'||Table_Name_in||' order by rowid';
  loop
    FETCH rc_RowIDs BULK COLLECT INTO c_Rows LIMIT 50000;
    v_ErrTrack := 2;
    -- for each row
    for i in 1..c_Rows.count loop
       
        v_ErrTrack := 3;
        v_String2 := '';
        v_Flag := false;
        -- for each character in each row
        for j in 1..nvl(length(c_Rows(i).Old_Data),0) loop
          v_ErrTrack := 4;
          v_Char := ascii(substr(c_Rows(i).Old_Data,j,1));
          v_ErrTrack := 5;
          if (v_Char<32 or v_Char>126) then
            v_String2 := v_String2||' ';
            v_Flag := true;
            v_ErrTrack := 6;
            insert into ascii_check values (Table_Owner_in,Table_Name_in,Column_Name_in,c_Rows(i).Old_RowID,j,v_Char,substr(c_Rows(i).Old_Data,1,240));
          elsif v_Char=39 then
            v_String2 := v_String2||'''''';
            v_Flag := true;
            v_ErrTrack := 6;
            insert into ascii_check values (Table_Owner_in,Table_Name_in,Column_Name_in,c_Rows(i).Old_RowID,j,v_Char,substr(c_Rows(i).Old_Data,1,240));
          else
            v_ErrTrack := 7;
            v_String2 := v_String2||substr(c_Rows(i).Old_Data,j,1);
          end if;
        end loop;
        v_ErrTrack := 8;
        if v_Flag then
          v_Counter := v_Counter +1;
          if Testing_in ='FALSE' then
            v_ErrTrack := 9;
            v_Err_Rowid := c_Rows(i).Old_RowID;
            begin
              execute immediate 'update '||Table_Owner_in||'.'||Table_Name_in||' set '||Column_Name_in||'='''||v_String2||''' where rowid='''||c_Rows(i).Old_RowID||'''';
            exception
              when DUP_VAL_ON_INDEX then
                execute immediate 'update '||Table_Owner_in||'.'||Table_Name_in||' set '||Column_Name_in||'=''EDITED-'||v_String2||''' where rowid='''||c_Rows(i).Old_RowID||'''';
                --insert into ascii_check values (Table_Owner_in,Table_Name_in, Column_Name_in, v_Err_Rowid, -1,-1,v_ErrTrack||'-'||v_String2);
                v_Flag2 := False;
            end;
          end if;
        end if;
        v_ErrTrack := 10;
    end loop;
    v_ErrTrack := 11;
    exit when c_Rows.count < 50000;
  end loop;
  close rc_RowIDs;
  dbms_output.put_line('There were '||v_Counter||' issues detected.');
  if v_Flag2 then
    insert into ascii_check values (Table_Owner_in,Table_Name_in,Column_Name_in,'',-1,-1,'Completed Successfully.');
  else
    insert into ascii_check values (Table_Owner_in,Table_Name_in,Column_Name_in,'',-1,-1,'Completed Unsuccessfully.');
  end if; 
  commit;
exception
  when already_completed then
    dbms_output.put_line('Already completed.  Exiting...');
  when others then   
    v_Errm := sqlerrm;
    dbms_output.put_line(v_Errm||'@'||v_ErrTrack);
    insert into ascii_check values (Table_Owner_in,Table_Name_in, Column_Name_in, substr(v_Errm,1,240), -1,-1,v_ErrTrack||'-'||v_String2||'-'||v_Err_Rowid);
    commit;
    close rc_RowIDs; 
end;
/




No comments:

Post a Comment