Monday, September 2, 2013

Oracle masala at work - 5:

Oracle masala at work - 5:


1. Exadata Instance crash with - ORA-00600: internal error code, arguments: [kjblpkeydrmqscchk:pkey] on 11.2.0.3 BP7
2. Bug 15936951: AUTO DOP IN 11.2.0.3 NOT OBEYING PARALLEL DEGREE LIMIT AS IT DID IN 11.2.0.2
3. Logical block corruption in Oracle:
4. Modifying Scan name:
5. Online rebuild index fails with ORA-08104:
6. How to change the session level parameters for a specific OS user:
7. NLSSORT filter shown in the SQL plan, causing the FULL table scans:
8. Modify the Timezone on Grid Infrastructure from EST to GMT:
9. How to switch the SCAN IP address:


Details:
1. Exadata Instance crash with - ORA-00600: internal error code, arguments: [kjblpkeydrmqscchk:pkey] on 11.2.0.3 BP7

Fix:
Apply Exadata Patch 14409183 on 11.2.0.3 BP7.

No-Exadata Instance crash with - ORA-00600: internal error code, arguments: [kjblpkeydrmqscchk:pkey] in 11.2.0.3
Fix: Hitting the bug-13397104 please apply the latest 11.2.0.3.4 GI PSU which will fix this issue. or apply the Patch 13397104 on 11.2.0.3.

Other workaround:
Setting the _gc_read_mostly_locking=false (11.2.0.3)


2. Bug 15936951: AUTO DOP IN 11.2.0.3 NOT OBEYING PARALLEL DEGREE LIMIT AS IT DID IN 11.2.0.2

Issue:
There is a hint that has been identified to cause this issue : /*+ parallel(table_alias) */
When we use the hint just with table_name/table_alias, it's calculating the AUTO DOP wrongly, causing other statements to queue.

As of this note no patch available,

WORKAROUND:
Do not specify parallel(table_alias) hints.
-- OR --

If we change this to any of the below, it should calculate DOP correctly and not queue:
/*+ parallel(table_alias, number) */
/*+ parallel */
-- OR --

Implement DBRM to cap the parallel threads.


3. Logical block corruption in Oracle:

Key Points:
RMAN Vs dbv Vs ANALYZE VALIDATE STRUCTURE CASCADE ?
Data block Vs Index block error (ORA-01498/ORA-01499) ?
Get the block dump for further analysis, what caused the corruption ?

RMAN Level 0 (Full backup) is required to fix the block corruption.
11g Active Data Guard Automatic Block Repair works well too.

Error/Issue:
SELECT Query failed with "ORA-12805: parallel query server died unexpectedly"

Alert Log Error:
ORA 600 [kjblpkeydrmqscchk:pkey]
ORA 603
ORA 7445 [_wordcopy_bwd_dest_aligned()+185]
ORA 7445 [_wordcopy_bwd_dest_aligned()+209]

a.
-- Export(expdp) error:
export - "ORA-00600: internal error code, arguments: [kluuscn_50]"


Tryed export with below init parameters, but no luck:
DB_BLOCK_CHECKSUM = OFF
DB_BLOCK_CHECKING = OFF


-- Tryed "DBMS_REPAIR.skip_corrupt_blocks" before taking export, "but no luck".
SQL> BEGIN
DBMS_REPAIR.skip_corrupt_blocks (
schema_name => 'UT_REP',
object_name => 'T1',
object_type => DBMS_REPAIR.table_object,
flags => DBMS_REPAIR.skip_flag);
END;
/
2 3 4 5 6 7 8
PL/SQL procedure successfully completed.

SQL>
SQL> commit;

Commit complete.

SQL> exit

b.
-- DBMS_STATS error:
dbms_stats - "ORA-20011: Approximate NDV failed: ORA-12805: parallel query server died unexpectedly"

c.
-- Analyze table error:
SQL> analyze table ut_rep.t1 VALIDATE STRUCTURE CASCADE;

analyze table ut_rep.t1 VALIDATE STRUCTURE CASCADE
*
ERROR at line 1:
ORA-01498: block check failure - see trace file


-- How many files and blocks where effected/Identifying corrupt blocks?:
BACKUP VALIDATE TABLESPACE UT_COMP; -- Checks Physical corruption only on given tablespace level.
BACKUP VALIDATE CHECK LOGICAL TABLESPACE UT_COMP; -- Checks Logical and Physical corruption on given tablespace level.

BACKUP VALIDATE CHECK LOGICAL DATABASE; -- Checks Logical and Physical corruption on the database level.


RMAN e.g..:
rman target /
spool log to 'x.log'

BACKUP VALIDATE CHECK LOGICAL TABLESPACE UT_COMP;

spool log off;
exit;

-- Displays information about database blocks that were corrupted after running the above RMAN command.
-- When the RMAN recover the corrupt block then automatically updates this view.
select * from V$DATABASE_BLOCK_CORRUPTION;

-- Identifying the object name
select owner, segment_name, segment_type, partition_name
from dba_extents
where file_id = 486
and 2048140 between block_id and block_id + blocks - 1;

Fix:
a.
$ rman target / catalog_user/catalog_user_password@catalogdb

BLOCKRECOVER CORRUPTION LIST;
BLOCKRECOVER CORRUPTION LIST RESTORE UNTIL TIME 'SYSDATE-10';
-- (OR) --

-- This checks FRA and backupset.
$ rman target / catalog_user/catalog_user_password@catalogdb
BLOCKRECOVER DATAFILE BLOCK ;
-- (OR) --

-- Recovery from backupset
BLOCKRECOVER DATAFILE BLOCK DATAFILE BLOCK FROM BACKUPSET;
-- (OR) --

-- Recovery from image copy
BLOCKRECOVER DATAFILE BLOCK DATAFILE BLOCK FROM DATAFILECOPY;
-- (OR) --

-- Recovery from backupset which have "FULL" tag
BLOCKRECOVER DATAFILE BLOCK DATAFILE BLOCK FROM TAG = FULL;

b.
-- If you don't have FULL RMAN backups!!! If it's OK to miss some data !!!
-- In this example, I am creating a GOOD_T1 table from T1 tables, without 3 corrupted blocks (2048140,2048862,2039785).
create table ut_rep.good_t1
nologging
parallel 8
tablespace ut_comp
as
select * from ut_rep.t1
where rowid not in (select rowid from ut_rep.t1
where to_number(utl_encode.base64_decode(utl_raw.cast_to_raw(lpad(substr(rowid, 10, 6), 8, 'A'))), 'XXXXXXXXXXXX') in (2048140,2048862,2039785));

-- *****
-- I used below logic to pull the effected blocks to analyze the data.
-- *****
rowid,
to_number(utl_encode.base64_decode(utl_raw.cast_to_raw(lpad(substr(rowid, 1, 6), 8, 'A'))), 'XXXXXXXXXXXX') as obj_id,
to_number(utl_encode.base64_decode(utl_raw.cast_to_raw(lpad(substr(rowid, 7, 3), 4, 'A'))), 'XXXXXX') as file_num,
to_number(utl_encode.base64_decode(utl_raw.cast_to_raw(lpad(substr(rowid, 10, 6), 8, 'A'))), 'XXXXXXXXXXXX') as block_num,
to_number(utl_encode.base64_decode(utl_raw.cast_to_raw(lpad(substr(rowid, 16, 3), 4, 'A'))), 'XXXXXX') as row_slot

c.
If above method did not work, we need to go by ROWID's method:

1a.
Find the object_id:
select data_object_id from dba_objects where owner = 'UT_REP' and OBJECT_NAME = 'T398'
-- 6540044

1b.
Problem file# - blocks#'s:
486 - 3619385
489 - 3006218

select dbms_rowid.rowid_create(rowid_type => 1,
object_number => 6540044,
relative_fno => 486,
block_number => 3619385,
row_number => 0) begin_rowid,
dbms_rowid.rowid_create(rowid_type => 1,
object_number => 6540044,
relative_fno => 486,
block_number => 3619385 + 1,
row_number => 0) end_rowid
from dual;

begin_rowid - AAY8sMAHmAANzo5AAA
end_rowid - AAY8sMAHmAANzo6AAA

select dbms_rowid.rowid_create(rowid_type => 1,
object_number => 6540044,
relative_fno => 489,
block_number => 3006218,
row_number => 0) begin_rowid,
dbms_rowid.rowid_create(rowid_type => 1,
object_number => 6540044,
relative_fno => 489,
block_number => 3006218 + 1,
row_number => 0) end_rowid
from dual;

begin_rowid - AAY8sMAHpAALd8KAAA
end_rowid - AAY8sMAHpAALd8LAAA

1c.
create table UT_REP.T398_good
as
SELECT /*+ ROWID(A) */ * FROM ut_rep.t398 A where rowid < 'AAY8sMAHmAANzo5AAA'; insert into UT_REP.T398_good SELECT /*+ ROWID(A) */ * FROM ut_rep.t398 A where rowid > 'AAY8sMAHmAANzo6AAA' and rowid < 'AAY8sMAHpAALd8KAAA'; insert into UT_REP.T398_good SELECT /*+ ROWID(A) */ * FROM ut_rep.t398 A * where rowid > 'AAY8sMAHpAALd8LAAA';

1d.
truncate table ut_rep.t398;
alter table ut_rep.t398 move;

select 'alter index ut_rep.'||index_name||' rebuild;' from dba_indexes where owner = 'UT_REP' and table_name = 'T398';

1e.
insert /*+ append */ into ut_rep.t398 select * from UT_REP.T398_good;
commit;

-- Missing records count:
select count(*) from ut_rep.t398;
select count(*) from UT_REP.T398_good;

d.
dbed utility !! -- Not tested.


-- Another very importenet thing is, What caused the logical block corruption?
-- Oracle might ask for block dumps for analysis.
sqlplus / as sysdba
oradebug setospid
oradebug unlimit
alter system dump datafile '' block ;
oradebug tracefile_name
exit


4. Modifying Scan name:

a. -- If existing databases are running on that cluster backup the spfile of all databases into a different location.
Example : create pfile='location' from spfile;

b. -- Update the pfile in the location from Step1 with the new scan name for remote_listener parameter.

c. -- Check if the new scan name has been setup on the cluster.

-- nslookup

Example:
nslookup ut-atl1.dwire.com

Server: atl1.dwire.com
Address: 10.xx.xxx.xx

Name: ut-atl1.dwire.com
Addresses: 10.xx.xxx.xx, 10.xx.xxx.xx, 10.xx.xxx.xx

d. -- Shutdown all instances on that cluster, stop the existing scan listener on the cluster.
-- Verify if everything is stopped.


Set the ORACLE_HOME to GRID_HOME.
. oraenv
+ASM1

$GRID_HOME/bin/srvctl stop scan_listener
$GRID_HOME/bin/srvctl stop scan

e. -- Configure new Scan Name.

-- $GRID_HOME/bin/srvctl modify scan -n

Example:
$GRID_HOME/bin/srvctl modify scan -n ut-atl1.dwire.com

f. -- Modify Scan Name in CRS. Execute this command as root

-- sudo crsctl modify type ora.scan_vip.type -attr "ATTRIBUTE=SCAN_NAME,DEFAULT_VALUE="

Example:
cd $GRID_HOME/bin/
sudo crsctl modify type ora.scan_vip.type -attr "ATTRIBUTE=SCAN_NAME,DEFAULT_VALUE=ut-atl1.dwire.com"

g. $GRID_HOME/bin/srvctl modify scan_listener -u

h. $GRID_HOME/bin/srvctl start scan_listener

i. -- Confirm if the new scan is working properly and the correct name.
$GRID_HOME/bin/srvctl status scan_listener
$GRID_HOME/bin/srvctl config scan

j. -- Startup the instance through sqlplus in nomount state.

k. -- Create the spfile inside the ASM with the pfile that was changed as part of Step b.

l. -- shutdown the instance.

m. -- Start the database through srvctl.

n. -- Verify if everything is working properly. Check the remote_listener parameter value in the database.
-- It should have the new scan name.

Example :
show parameter remote_listener


5. Online rebuild index fails with ORA-08104:

Error:
SQL> alter index ut_rep.ih132 rebuild online;
alter index ut_rep.ih132 rebuild online
*
ERROR at line 1:
ORA-08104: this index object 2249271 is being online built or rebuilt

SQL> exit

Fix:
During the kill immediate(ALTER SYSTEM KILL SESSION ',,@2' IMMEDIATE;) some extents still exists on the tablespace and the index has a marker, when a online operation is initiated.
To cleanup use the DBMS_REPAIR.ONLINE_INDEX_CLEAN package as below.

a.
SELECT object_id FROM dba_objects WHERE object_name = 'IH132';
-- 123456789

b.
DECLARE
RetVal BOOLEAN;
OBJECT_ID BINARY_INTEGER;
WAIT_FOR_LOCK BINARY_INTEGER;
BEGIN
OBJECT_ID := 123456789;
WAIT_FOR_LOCK := NULL;
RetVal := SYS.DBMS_REPAIR.ONLINE_INDEX_CLEAN (OBJECT_ID);
COMMIT;
END;
/


6. How to change the session level parameters for a specific OS user:

a.
connect sys@DB1

create or replace trigger logon_test
AFTER LOGON ON test.SCHEMA

BEGIN

IF sys_context('USERENV','OS_USER') in ('vdumpa')
THEN
execute immediate ('ALTER SESSION SET optimizer_features_enable=''11.2.0.1''');
END IF;

END;
/

-- To check the execute immediate:
SQL> select 'ALTER SESSION SET optimizer_features_enable=''11.2.0.1''' from dual;

'ALTERSESSIONSETOPTIMIZER_FEATURES_ENABLE=''11.2.0.1''
------------------------------------------------------
ALTER SESSION SET optimizer_features_enable='11.2.0.1'

SQL> select 'ALTER SESSION SET "_push_join_union_view"=false' from dual;

'ALTERSESSIONSET"_PUSH_JOIN_UNION_VIEW"=FALSE'
------------------------------------------------
ALTER SESSION SET "_push_join_union_view"=false

-- Few important sys_context parameters:
sys_context('USERENV','HOST') -- Host name
sys_context('USERENV','OS_USER') -- OS user name
sys_context('USERENV','SERVICE_NAME') -- Service name
sys_context('USERENV','SESSION_USER') -- DB user name
sys_context('USERENV','TERMINAL') -- Terminal name
sys_context('USERENV','MODULE') -- Module name
sys_context('USERENV','IP_ADDRESS') -- IP address of the machine from which the client is connected.

b.
-- vdumpa is logging in as test user.
connect test@DB1

c.
-- This is the test to check the session parameter value.

connect sys@DB1
SQL> show parameter optimizer_features_enable

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
optimizer_features_enable string 11.2.0.2

SQL> select sid, username, osuser from gv$session where username = 'TEST';

SID USERNAME OSUSER
---------- ------------------------------ ------------------------------
486 TEST vdumpa

SQL> select name,value from v$ses_optimizer_env where sid = 486 and name like '%optimizer_features_enable%';

NAME VALUE
---------------------------------------- -------------------------
optimizer_features_enable 11.2.0.1


7. NLSSORT filter shown in the SQL plan, causing the FULL table scans:

a.
-- Good plan
SQL> SELECT * FROM table(DBMS_XPLAN.DISPLAY_CURSOR('ap8rhqm23xdbz',0));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------
SQL_ID ap8rhqm23xdbz, child number 0
-------------------------------------
SELECT * FROM XYZ WHERE BATCHMODULEID=:B1

Plan hash value: 848178827

------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 1 (100)| |
| 1 | TABLE ACCESS BY INDEX ROWID| XYZ | 1 | 189 | 1 (0)| 00:00:01 |
|* 2 | INDEX UNIQUE SCAN | SYS_C0021226 | 1 | | 1 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("BATCHMODULEID"=:B1)


19 rows selected.

-- Bad plan
SQL> SELECT * FROM table(DBMS_XPLAN.DISPLAY_CURSOR('ap8rhqm23xdbz',1));

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------
SQL_ID ap8rhqm23xdbz, child number 1
-------------------------------------
SELECT * FROM XYZ WHERE BATCHMODULEID=:B1

Plan hash value: 2660499978

--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 4378 (100)| |
|* 1 | TABLE ACCESS FULL| XYZ | 7400 | 1365K| 4378 (2)| 00:00:53 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(NLSSORT(INTERNAL_FUNCTION("BATCHMODULEID"),'nls_sort=''GENERIC_B
ASELETTER''')=NLSSORT(:B1,'nls_sort=''GENERIC_BASELETTER'''))


19 rows selected.

Note: Notice the NLSSORT filter in the bad plan.. Causing FULL table scans!!!
b.
SQL> set linesize 120;

SELECT inst_id, hash_value, child_number, PLAN_HASH_VALUE,OLD_HASH_VALUE, executions, (elapsed_time/1000000)/decode(nvl(executions,0),0,1,executions) avg_etime
from gv$sql
where sql_id='ap8rhqm23xdbz';
SQL> SQL> SQL> 2 3
INST_ID HASH_VALUE CHILD_NUMBER PLAN_HASH_VALUE OLD_HASH_VALUE EXECUTIONS AVG_ETIME
---------- ---------- ------------ --------------- -------------- ---------- ----------
4 3292444031 0 2660499978 3621482215 21 3.72152276
2 3292444031 0 2660499978 3621482215 5 41.48197
2 3292444031 1 2660499978 3621482215 5 43.6717298

c.
-- Created function Index to fix the issue:
SQL> create index KFAXPROD.mh_idx1_XYZ on KFAXPROD.XYZ (NLSSORT(BATCHMODULEID,'NLS_SORT=GENERIC_BASELETTER'))
tablespace KFAX_DATA;

d.
-- After index creation both hash values average execution time dropped dramatically.
SQL> set linesize 120
SQL> SELECT inst_id, hash_value, child_number, PLAN_HASH_VALUE,OLD_HASH_VALUE, executions, (elapsed_time/1000000)/decode(nvl(executions,0),0,1,executions) avg_etime
from gv$sql
where sql_id='ap8rhqm23xdbz';
2 3
INST_ID HASH_VALUE CHILD_NUMBER PLAN_HASH_VALUE OLD_HASH_VALUE EXECUTIONS AVG_ETIME
---------- ---------- ------------ --------------- -------------- ---------- ----------
1 3292444031 0 848178827 3621482215 11 .000612636
2 3292444031 0 3848578838 3621482215 5 .0176316


8. Modify the Timezone on Grid Infrastructure from EST to GMT:

a. Please make sure to stop the CRS on one node

b. Change the Timezone on the OS side.

c. To display current setting:
$ cat /etc/sysconfig/clock
ZONE="America/New_York"
UTC=false
ARC=false

d. To find out all valid setting: ls -l /usr/share/zoneinfo and Anything that appears in this directory is valid to use.

e. Once OS default timezone is changed, make sure:
1. For 11.2.0.1, shell environment variable TZ is set correctly for grid/oracle user and root.
2. For 11.2.0.2 and above, TZ entry in $GRID_HOME/crs/install/s_crsconfig__env.txt sets to correct time zone.
$cd /u01/app/grid/11.2.0.3/crs/install
$ grep TZ s_crsconfig__env.txt
#TZ=America/New_York
TZ=GMT

f. start the CRS and all DB's.

g. Go to the next node and repeat the same.

Reference:
Note: 1209444.1 - How To Change Timezone for 11gR2 Grid Infrastructure


9. How to switch the SCAN IP address:

Sometimes we need to use the same SCAN name, when we are building the new servers to support no application connection change during the cut over.
In normal procedure, SCAN IP's should be in DNS (Not in /etc/host file) and build the GRID InfoStructure. Here we will have IP's in /etc/host file to build the servers; during the cut over we move it to DNS.

Here is process that we useed to switch the scan IP address:

a. Have network team change the DNS ip address's for the SCAN name.

b. Remove scan name entry in /etc/hosts file on all nodes.

c. Run nslookup on SCAN name and check returns new IP addresses.

d. Shutdown scan listener:
srvctl stop scan_listener
srvctl stop scan

e. Modify SCAN:
srvctl modify scan –n

f. Restart scan listener:
srvctl modify scan_listener -u
srvctl start scan_listener

g. To confirm the change:
srvctl config scan
srvctl config scan_listener

h. Add new scan listener to the system(listener.ora) on all nodes.

Note: This procedure is somewhat similar to above "Modifying Scan name" procedure.

Reference:
Note: 972500.1 - How to Modify SCAN Setting or SCAN Listener Port after Installation