Saturday, May 31, 2008

ASM could not mount diskgroup or see the disks:

What might have caused it? What are my options to fix or recover?

What might have caused it?:
a.
asm_diskstring and/or asm_diskgroups init parameters are not set correctly

b.
OS Disks/Devices ownership

c.
OS Disks/Devices permissions

d.
Check if Physical Disks/Devices exist or visible at OS level

e.
Make sure the Oracle ASMLib RPMs are installed (Linux).

f.
Make sure all the disks can be listed

g.
Make sure all the disks can be queried

h.
Check the permissions and ownership of the /var/opt/oracle directory.

i.
Make sure that there are no duplicate paths or names that point to the same physical disk

Specific to RAC:
j.
Make sure disk name and underlying logical device id or LUN id is consistent across all the nodes

k.
Make sure /proc/partitions file is consistent (no missing entries) across the nodes


What are my options to fix or recover?: Let's talk about all the options mentioned above.
a.
asm_diskstring and/or asm_diskgroups init parameters are not set correctly

Raw device:
ALTER SYSTEM SET asm_diskstring = '/dev/raw/raw*', '/dev/raw/raw1[1-5]' scope=BOTH sid='*';

ASMLib:
ALTER SYSTEM SET asm_diskstring = 'ORCL:*' scope=BOTH sid='*';


ALTER SYSTEM SET asm_diskgroups = 'ORADB_DATA', 'ORADB_IDX', 'ORADB_BACKUP', 'ORADB_ARC' scope=spfile sid='*';
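
Before changing asm_diskstring, it can help to see what each pattern matches at the OS level. The sketch below assumes the discovery patterns behave like ordinary shell globs, which is only an approximation of ASM discovery, and it does not apply to ASMLib strings such as 'ORCL:*'. expand_diskstring is a hypothetical helper name, not an Oracle tool.

```shell
# expand_diskstring "PATTERN[, PATTERN...]"
# Prints every existing path matched by the comma-separated patterns,
# once each, using plain shell globbing (an approximation of ASM discovery).
expand_diskstring() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r pattern; do
        # Trim the blanks left over from "a, b" style lists.
        pattern=$(printf '%s\n' "$pattern" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
        [ -n "$pattern" ] || continue
        # Unquoted on purpose so that the shell expands the glob.
        for path in $pattern; do
            if [ -e "$path" ]; then
                printf '%s\n' "$path"
            fi
        done
    done | sort -u
}
```

For example, expand_diskstring '/dev/raw/raw*, /dev/raw/raw1[1-5]' lists the raw devices that the two patterns cover between them. An empty result suggests the pattern, not ASM, is the problem.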


b.
OS Disks/Devices ownership

oracle:dba for disks that are neither OCR nor Voting Disk.

root:dba for OCR.
oracle:dba for Voting Disk.


c.
OS Disks/Devices permissions

660 - for disks that are neither OCR nor Voting Disk.

640 - for OCR.
644 - for Voting Disk.
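
The ownership and permission values from sections b and c can be checked in one pass. This is a minimal sketch that assumes GNU stat (the -c option) as found on Linux. check_device is a hypothetical helper, and the expected values are passed in rather than hard-coded.

```shell
# check_device PATH OWNER:GROUP MODE
# Compares the actual owner, group and octal mode of PATH with the
# expected values and reports any mismatch. Assumes GNU stat.
check_device() {
    actual=$(stat -c '%U:%G %a' "$1") || return 2
    if [ "$actual" = "$2 $3" ]; then
        printf 'OK   %s (%s)\n' "$1" "$actual"
    else
        printf 'BAD  %s is %s, expected %s %s\n' "$1" "$actual" "$2" "$3"
        return 1
    fi
}
```

Usage for a regular ASM disk would look like: check_device /dev/oracleasm/disks/AO_6011_0972 oracle:dba 660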


d.
Check if Physical Disks/Devices exist or visible at OS level

Raw device:
/dev/raw

ASMLib:
/dev/oracleasm/disks

Side Note: For more information, please refer to my other note on "ASM SAN migration Case Study: - How to find mapping of ASM disks to Physical Devices?"


e.
Make sure the Oracle ASMLib RPMs are installed (Linux).

ASMLib only!!

Check:
rpm -qa|grep asm
oracleasmlib-2.0.2-1
oracleasm-support-2.0.3-1
oracleasm-2.6.9-42.ELsmp-2.0.3-1
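
The three packages above follow a pattern: the library, the support tools, and a kernel driver package whose name embeds the kernel release. The sketch below checks a package list for all three. It assumes the driver package is named oracleasm-<kernel release>, as in the listing above. missing_asmlib_rpms is a hypothetical helper.

```shell
# missing_asmlib_rpms KERNEL_RELEASE
# Reads an "rpm -qa" style package list on stdin and prints the expected
# package-name prefixes that are absent. Returns 1 if any is missing.
missing_asmlib_rpms() {
    kernel=$1
    pkgs=$(cat)
    status=0
    for prefix in oracleasmlib- oracleasm-support- "oracleasm-$kernel-"; do
        # index() does a literal prefix match, so dots in the kernel
        # release are not treated as regex wildcards.
        if ! printf '%s\n' "$pkgs" |
            awk -v p="$prefix" 'index($0, p) == 1 { found = 1 } END { exit !found }'
        then
            printf 'missing: %s*\n' "$prefix"
            status=1
        fi
    done
    return "$status"
}
```

Typical usage: rpm -qa | missing_asmlib_rpms "$(uname -r)". A missing kernel driver line after a kernel upgrade is a common sign that the driver RPM no longer matches the running kernel.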


f.
Make sure all the disks can be listed

/etc/init.d/oracleasm listdisks


g.
Make sure all the disks can be queried

List all the disks:
for i in `cd /dev/oracleasm/disks;ls *`;
do
/etc/init.d/oracleasm querydisk $i 2>/dev/null
done

Look for any missing labels:
for i in `ls /dev/emcpower*1`;
do
/etc/init.d/oracleasm querydisk $i 2>/dev/null | grep "\"\"$"
done


$/dev/oracleasm/disks> ls -lt AO_6011_0972
brw-rw---- 1 oracle dba 120, 641 Mar 21 17:03 AO_6011_0972

$/dev/oracleasm/disks> /etc/init.d/oracleasm querydisk AO_6011_0972
Disk "AO_6011_0972" is a valid ASM disk on device [120, 641]
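
In the output above, the device numbers reported by querydisk (120, 641) match the major and minor numbers shown by ls. A rough way to automate that comparison is sketched below. It assumes the querydisk message format shown above and GNU stat. Both function names are hypothetical.

```shell
# querydisk_devnum: reads querydisk output on stdin and prints
# "major minor" taken from the "on device [major, minor]" part.
querydisk_devnum() {
    sed -n 's/.*on device \[\([0-9][0-9]*\), *\([0-9][0-9]*\)\].*/\1 \2/p'
}

# node_devnum PATH: prints "major minor" (decimal) of a device node.
# GNU stat reports %t and %T in hex, so they are converted here.
node_devnum() {
    hex=$(stat -c '%t %T' "$1") || return 1
    set -- $hex
    printf '%d %d\n' "0x$1" "0x$2"
}
```

If the two outputs differ for the same disk, for example querydisk AO_6011_0972 piped through querydisk_devnum versus node_devnum /dev/oracleasm/disks/AO_6011_0972, the ASMLib label and the device node are out of step.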


Side Note 1: When you create ASM disks, it's best practice to follow a naming convention.
Why?

For example, it makes it easy to recreate missing labels and to identify the physical storage behind a disk.

CLARiiON:
$ ls -lt /dev/emcpowerem
brwxrwx--- 1 oracle dba 120, 2272 Mar 18 10:04 /dev/emcpowerem
$ ls -lt /dev/emcpowerem1
brwxrwx--- 1 oracle dba 120, 2273 Apr 24 17:32 /dev/emcpowerem1
$ ls -lt /dev/oracleasm/disks/EM*
brw-rw---- 1 oracle dba 120, 2273 Mar 18 10:46 /dev/oracleasm/disks/EM_5438_1227

# /sbin/powermt display dev=emcpowerem
Pseudo name=emcpowerem
CLARiiON ID=APM00043005438 [d03_ORADB_Dev]
Logical device ID=60060160272013007325DC0E5FF4DC11 [LUN 1227]
state=alive; policy=CLAROpt; priority=0; queued-IOs=0
Owner: default=SP B, current=SP B
==============================================================================
---------------- Host --------------- - Stor -  -- I/O Path -  -- Stats ---
 ### HW Path                I/O Paths Interf.   Mode    State  Q-IOs Errors
==============================================================================
   1 lpfc                   sdabf     SP B1     active  alive      0      0
   1 lpfc                   sdacp     SP A1     active  alive      0      0
   1 lpfc                   sdadz     SP B0     active  alive      0      0
   1 lpfc                   sdafj     SP A0     active  alive      0      0
   2 lpfc                   sdagt     SP B2     active  alive      0      0
   2 lpfc                   sdaid     SP A2     active  alive      0      0
   2 lpfc                   sdajn     SP B3     active  alive      0      0
   2 lpfc                   sdakx     SP A3     active  alive      0      0

Naming Standards:
For CLARiiON: [emcpower Id]_[CLARiiON Id(last 4 dig)]_[LUN Id]
For EMC DMX : [emcpower Id]_[Symmetrix Id(last 4 dig)]_[Logical device Id]



EMC DMX:
$ /dev> ls -lt emcpoweray
brwxrwx--- 1 oracle dba 120, 800 Apr 22 05:46 emcpoweray
$ /dev> /etc/init.d/oracleasm listdisks | grep AY
AY_0469_0BB9
$ /dev> /sbin/powermt display dev=emcpoweray
Pseudo name=emcpoweray
Symmetrix ID=000187880469
Logical device ID=0BB9
state=alive; policy=SymmOpt; priority=0; queued-IOs=0
==============================================================================
---------------- Host --------------- - Stor -  -- I/O Path -  -- Stats ---
 ### HW Path                I/O Paths Interf.   Mode    State  Q-IOs Errors
==============================================================================
   1 lpfc                   sdbg      FA 3cA    active  alive      0      0
   2 lpfc                   sddv      FA 14cA   active  alive      0      0
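
The naming standards above can be derived mechanically from the powermt output. The following is a sketch only: it assumes the "Pseudo name=", "CLARiiON ID=" / "Symmetrix ID=" and "Logical device ID=" lines look exactly like the two samples above. asm_label_from_powermt is a hypothetical helper.

```shell
# asm_label_from_powermt: reads "powermt display dev=<device>" output on
# stdin and prints a label of the form [emcpower Id]_[array Id, last 4]_[LUN].
asm_label_from_powermt() {
    awk '
        /^Pseudo name=emcpower/ {
            id = toupper(substr($0, length("Pseudo name=emcpower") + 1))
        }
        /^(CLARiiON|Symmetrix) ID=/ {
            split($0, a, "="); split(a[2], b, " ")
            array = substr(b[1], length(b[1]) - 3)      # last 4 characters
        }
        /^Logical device ID=/ {
            split($0, a, "="); split(a[2], b, " ")
            dev = b[1]                                  # DMX: the ID itself
            # CLARiiON output carries the LUN number in brackets: [LUN 1227]
            if (match($0, /\[LUN [0-9]+\]/))
                dev = substr($0, RSTART + 5, RLENGTH - 6)
        }
        END {
            if (id == "" || array == "" || dev == "") exit 1
            print id "_" array "_" dev
        }
    '
}
```

Fed the two powermt listings above, this yields EM_5438_1227 and AY_0469_0BB9, the same labels that appear under /dev/oracleasm/disks.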


Side Note 2: Always take a backup of all ASMLib disk headers.

for i in `cd /dev/oracleasm/disks/;ls *`;
do
dd if=/dev/oracleasm/disks/$i of=/tmp/$i.dump bs=4096 count=1
done
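
A slightly more defensive variant of the same loop is sketched below. It takes the source and destination directories as arguments and verifies each dump against the first 4096 bytes of the disk. backup_asm_headers is a hypothetical helper, and choosing a destination outside /tmp is my assumption, since /tmp is often cleaned on reboot.

```shell
# backup_asm_headers SRC_DIR DEST_DIR
# Copies the first 4096-byte block of every entry in SRC_DIR to
# DEST_DIR/<name>.dump and verifies each copy.
backup_asm_headers() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    for disk in "$src_dir"/*; do
        [ -e "$disk" ] || continue
        dump="$dest_dir/$(basename "$disk").dump"
        dd if="$disk" of="$dump" bs=4096 count=1 2>/dev/null || return 1
        # Re-read the header and compare it with the dump just written.
        if head -c 4096 "$disk" | cmp -s - "$dump"; then
            printf 'saved %s\n' "$dump"
        else
            printf 'verify failed for %s\n' "$disk" >&2
            return 1
        fi
    done
}
```

Usage: backup_asm_headers /dev/oracleasm/disks /some/backup/dir, run as a user that can read the ASMLib disks.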


h.
Check the permissions and ownership of the /var/opt/oracle directory.

Set the ownership to oracle:dba and relax the permissions so that the oracle user can read and write the directory.


i.
Make sure that there are no duplicate paths or names that point to the same physical disk
Such duplication results in the following error in the alert log file.

ORA-15020: discovered duplicate ASM disk ""

We can detect this using this command: "/sbin/powermt display dev=emcpowerem"


j.
Make sure disk name and underlying logical device id or LUN id is consistent across the nodes

Sometimes a disk name points to a different logical device ID or LUN ID on one node than on the other nodes.

We can detect this using this command: "/sbin/powermt display dev=emcpowerem"
We can correct this using this command: "emcpadm rename -s emcpoweram -t emcpowerah"
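
To compare many devices at once, the powermt output of each node can be saved to a file and reduced to a "logical device ID, pseudo name" map. This is a sketch under the assumption that the saved output contains "Pseudo name=" and "Logical device ID=" lines as in the samples earlier. Both function names are hypothetical.

```shell
# powermt_map: reads powermt display output on stdin and prints one
# "logical_device_id pseudo_name" line per device, sorted.
powermt_map() {
    awk '
        /^Pseudo name=/       { pseudo = substr($0, 13) }
        /^Logical device ID=/ { split($0, a, "[= ]"); print a[4], pseudo }
    ' | sort
}

# compare_node_maps FILE_A FILE_B
# FILE_A and FILE_B hold saved powermt output from two nodes.
# Prints a diff of the two maps and returns non-zero if they differ.
compare_node_maps() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
    powermt_map < "$1" > "$tmp_a"
    powermt_map < "$2" > "$tmp_b"
    status=0
    diff "$tmp_a" "$tmp_b" || status=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}
```

Each line of diff output is a logical device whose pseudo name differs between the two nodes, which is a candidate for the emcpadm rename shown above.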


k.
Make sure /proc/partitions file is consistent(no missing entries) across the nodes
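
One way to spot missing entries is to copy /proc/partitions from each node to a file and compare the device names. The sketch below assumes the usual four-column layout (major, minor, #blocks, name) and compares by name only. missing_partitions is a hypothetical helper.

```shell
# missing_partitions REFERENCE OTHER
# Both arguments are saved copies of /proc/partitions.
# Prints device names present in REFERENCE but absent from OTHER.
missing_partitions() {
    awk '
        $1 !~ /^[0-9]+$/     { next }                  # header and blank line
        FILENAME == ARGV[1]  { other[$4] = 1; next }   # first file: OTHER
        !($4 in other)       { print $4 }              # second file: REFERENCE
    ' "$2" "$1"
}
```

Run it in both directions (node 1 against node 2, then node 2 against node 1) so that entries missing on either side show up.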


Case 1:
ASMLib cannot see SAN storage after a reboot, or ASMLib labels are missing?:


How to create the kfed utility?:
from DB home:
$ cd $ORACLE_HOME/rdbms/lib
$ make -f ins_rdbms.mk ikfed


We can recreate missing labels with the command below:
# /etc/init.d/oracleasm force-renamedisk /dev/emcpoweray1 AY_0469_0BB9

Side Note: We can also look at ASM labels using the kfed tool. The same tool can also be used to tweak the headers.


kfed help in 10.2.0.3:
$ kfed help=y
as/mlib ASM Library [asmlib='lib']
aun/um AU number to examine or update [AUNUM=number]
aus/z Allocation Unit size in bytes [AUSZ=number]
blkn/um Block number to examine or update [BLKNUM=number]
blks/z Metadata block size in bytes [BLKSZ=number]
ch/ksum Update checksum before each write [CHKSUM=YES/NO]
cn/t Count of AUs to process [CNT=number]
d/ev ASM device to examine or update [DEV=string]
o/p KFED operation type [OP=READ/WRITE/MERGE/NEW/FORM/FIND/STRUCT]
p/rovnm Name for provisioning purposes [PROVNM=string]
te/xt File name for translated block text [TEXT=string]
ty/pe ASM metadata block type number [TYPE=number]
KFED-01000: USAGE: kfed [] [] []


e.g.:
$ $ORACLE_ASM_HOME/bin/kfed read /dev/oracleasm/disks/AY_0469_0BB9
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj: 2147483650 ; 0x008: TYPE=0x8 NUMB=0x2
kfbh.check: 2426726764 ; 0x00c: 0x90a4e96c
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr:ORCLDISKAY_0469_0BB9 ; 0x000: length=20
kfdhdb.driver.reserved[0]: 811555137 ; 0x008: 0x305f5941
kfdhdb.driver.reserved[1]: 1597584948 ; 0x00c: 0x5f393634
kfdhdb.driver.reserved[2]: 960643632 ; 0x010: 0x39424230
kfdhdb.driver.reserved[3]: 0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]: 0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]: 0 ; 0x01c: 0x00000000
kfdhdb.compat: 168820736 ; 0x020: 0x0a100000
kfdhdb.dsknum: 2 ; 0x024: 0x0002
kfdhdb.grptyp: 1 ; 0x026: KFDGTP_EXTERNAL
kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname: AY_0469_0BB9 ; 0x028: length=12
kfdhdb.grpname: ORADB_T1_BACKUP_01 ; 0x048: length=18

kfdhdb.fgname: AY_0469_0BB9 ; 0x068: length=12
kfdhdb.capname: ; 0x088: length=0
kfdhdb.crestmp.hi: 32902473 ; 0x0a8: HOUR=0x9 DAYS=0xa MNTH=0x3 YEAR=0x7d8
kfdhdb.crestmp.lo: 1396390912 ; 0x0ac: USEC=0x0 MSEC=0x2cf SECS=0x33 MINS=0x14
kfdhdb.mntstmp.hi: 32904078 ; 0x0b0: HOUR=0xe DAYS=0x1c MNTH=0x4 YEAR=0x7d8
kfdhdb.mntstmp.lo: 1783004160 ; 0x0b4: USEC=0x0 MSEC=0x19f SECS=0x24 MINS=0x1a
kfdhdb.secsize: 512 ; 0x0b8: 0x0200
kfdhdb.blksize: 4096 ; 0x0ba: 0x1000
kfdhdb.ausize: 1048576 ; 0x0bc: 0x00100000
kfdhdb.mfact: 113792 ; 0x0c0: 0x0001bc80
kfdhdb.dsksize: 17263 ; 0x0c4: 0x0000436f
kfdhdb.pmcnt: 2 ; 0x0c8: 0x00000002
kfdhdb.fstlocn: 1 ; 0x0cc: 0x00000001
kfdhdb.altlocn: 2 ; 0x0d0: 0x00000002
kfdhdb.f1b1locn: 0 ; 0x0d4: 0x00000000
kfdhdb.redomirrors[0]: 0 ; 0x0d8: 0x0000
kfdhdb.redomirrors[1]: 0 ; 0x0da: 0x0000
kfdhdb.redomirrors[2]: 0 ; 0x0dc: 0x0000
kfdhdb.redomirrors[3]: 0 ; 0x0de: 0x0000
kfdhdb.dbcompat: 168820736 ; 0x0e0: 0x0a100000
kfdhdb.grpstmp.hi: 32902473 ; 0x0e4: HOUR=0x9 DAYS=0xa MNTH=0x3 YEAR=0x7d8
kfdhdb.grpstmp.lo: 1396293632 ; 0x0e8: USEC=0x0 MSEC=0x270 SECS=0x33 MINS=0x14
kfdhdb.ub4spare[0]: 0 ; 0x0ec: 0x00000000
kfdhdb.ub4spare[1]: 0 ; 0x0f0: 0x00000000
kfdhdb.ub4spare[2]: 0 ; 0x0f4: 0x00000000
kfdhdb.ub4spare[3]: 0 ; 0x0f8: 0x00000000
kfdhdb.ub4spare[4]: 0 ; 0x0fc: 0x00000000
kfdhdb.ub4spare[5]: 0 ; 0x100: 0x00000000
kfdhdb.ub4spare[6]: 0 ; 0x104: 0x00000000
kfdhdb.ub4spare[7]: 0 ; 0x108: 0x00000000
kfdhdb.ub4spare[8]: 0 ; 0x10c: 0x00000000
kfdhdb.ub4spare[9]: 0 ; 0x110: 0x00000000
kfdhdb.ub4spare[10]: 0 ; 0x114: 0x00000000
kfdhdb.ub4spare[11]: 0 ; 0x118: 0x00000000
kfdhdb.ub4spare[12]: 0 ; 0x11c: 0x00000000
kfdhdb.ub4spare[13]: 0 ; 0x120: 0x00000000
kfdhdb.ub4spare[14]: 0 ; 0x124: 0x00000000
kfdhdb.ub4spare[15]: 0 ; 0x128: 0x00000000
kfdhdb.ub4spare[16]: 0 ; 0x12c: 0x00000000
kfdhdb.ub4spare[17]: 0 ; 0x130: 0x00000000
kfdhdb.ub4spare[18]: 0 ; 0x134: 0x00000000
kfdhdb.ub4spare[19]: 0 ; 0x138: 0x00000000
kfdhdb.ub4spare[20]: 0 ; 0x13c: 0x00000000
kfdhdb.ub4spare[21]: 0 ; 0x140: 0x00000000
kfdhdb.ub4spare[22]: 0 ; 0x144: 0x00000000
kfdhdb.ub4spare[23]: 0 ; 0x148: 0x00000000
kfdhdb.ub4spare[24]: 0 ; 0x14c: 0x00000000
kfdhdb.ub4spare[25]: 0 ; 0x150: 0x00000000
kfdhdb.ub4spare[26]: 0 ; 0x154: 0x00000000
kfdhdb.ub4spare[27]: 0 ; 0x158: 0x00000000
kfdhdb.ub4spare[28]: 0 ; 0x15c: 0x00000000
kfdhdb.ub4spare[29]: 0 ; 0x160: 0x00000000
kfdhdb.ub4spare[30]: 0 ; 0x164: 0x00000000
kfdhdb.ub4spare[31]: 0 ; 0x168: 0x00000000
kfdhdb.ub4spare[32]: 0 ; 0x16c: 0x00000000
kfdhdb.ub4spare[33]: 0 ; 0x170: 0x00000000
kfdhdb.ub4spare[34]: 0 ; 0x174: 0x00000000
kfdhdb.ub4spare[35]: 0 ; 0x178: 0x00000000
kfdhdb.ub4spare[36]: 0 ; 0x17c: 0x00000000
kfdhdb.ub4spare[37]: 0 ; 0x180: 0x00000000
kfdhdb.ub4spare[38]: 0 ; 0x184: 0x00000000
kfdhdb.ub4spare[39]: 0 ; 0x188: 0x00000000
kfdhdb.ub4spare[40]: 0 ; 0x18c: 0x00000000
kfdhdb.ub4spare[41]: 0 ; 0x190: 0x00000000
kfdhdb.ub4spare[42]: 0 ; 0x194: 0x00000000
kfdhdb.ub4spare[43]: 0 ; 0x198: 0x00000000
kfdhdb.ub4spare[44]: 0 ; 0x19c: 0x00000000
kfdhdb.ub4spare[45]: 0 ; 0x1a0: 0x00000000
kfdhdb.ub4spare[46]: 0 ; 0x1a4: 0x00000000
kfdhdb.ub4spare[47]: 0 ; 0x1a8: 0x00000000
kfdhdb.ub4spare[48]: 0 ; 0x1ac: 0x00000000
kfdhdb.ub4spare[49]: 0 ; 0x1b0: 0x00000000
kfdhdb.ub4spare[50]: 0 ; 0x1b4: 0x00000000
kfdhdb.ub4spare[51]: 0 ; 0x1b8: 0x00000000
kfdhdb.ub4spare[52]: 0 ; 0x1bc: 0x00000000
kfdhdb.ub4spare[53]: 0 ; 0x1c0: 0x00000000
kfdhdb.ub4spare[54]: 0 ; 0x1c4: 0x00000000
kfdhdb.ub4spare[55]: 0 ; 0x1c8: 0x00000000
kfdhdb.ub4spare[56]: 0 ; 0x1cc: 0x00000000
kfdhdb.ub4spare[57]: 0 ; 0x1d0: 0x00000000
kfdhdb.acdb.aba.seq: 0 ; 0x1d4: 0x00000000
kfdhdb.acdb.aba.blk: 0 ; 0x1d8: 0x00000000
kfdhdb.acdb.ents: 0 ; 0x1dc: 0x0000
kfdhdb.acdb.ub2spare: 0 ; 0x1de: 0x0000
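
Most of the time only a handful of these fields matter, such as kfdhdb.hdrsts, kfdhdb.dskname and kfdhdb.grpname. The sketch below pulls a single field value out of kfed read output, assuming the "name: value ; offset: detail" layout shown above. kfed_field is a hypothetical helper, not part of kfed.

```shell
# kfed_field FIELD_NAME
# Reads "kfed read" output on stdin and prints the value of FIELD_NAME,
# i.e. the text between the first ":" and the following ";".
kfed_field() {
    awk -v f="$1" '
        index($0, f ":") == 1 {
            rest = substr($0, length(f) + 2)
            sub(/;.*/, "", rest)                 # drop offset and detail
            gsub(/^[ \t]+|[ \t]+$/, "", rest)    # trim blanks
            print rest
        }'
}
```

For the dump above, piping the kfed output through kfed_field kfdhdb.grpname prints ORADB_T1_BACKUP_01, and kfed_field kfdhdb.dskname prints AY_0469_0BB9.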


Case 2:
Logical device id or LUN id is different across the nodes?:

Please refer to section j above for the fix.


Case 3:
Missing disk(can't mount diskgroup/drop diskgroup/delete disk)?:

Problem:
Why do I get the following errors when I run the "alter diskgroup ORADB_DATA_DG mount;" command?


ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "7" is missing


Work through all the missing-disk possibilities listed above to fix the problem. If it is a permanent disk failure, please refer to the section "Recovering the disks from permanent disk failures?" below.


I
Recovering the disks from permanent disk failures?


10g:
Method 1.

Use the dd command to clear the disk header, then delete and recreate the ASMLib disk. This turns the disks into candidates that can be added back to the ASM diskgroup.

dd if=/dev/zero of=[raw device] bs=1024 count=4    # for raw device
or
dd if=/dev/zero of=[emcpower device] bs=1024 count=4    # for emcpower device

as root only!!!
/etc/init.d/oracleasm deletedisk [ASM disk name]
/etc/init.d/oracleasm createdisk [ASM disk name] [emcpower device]

/etc/init.d/oracleasm scandisks

Method 2.
create diskgroup <diskgroup name> EXTERNAL REDUNDANCY
disk '<1st missing disk name>' force, '<2nd missing disk name>' force;

Now the question is: how do I drop the original diskgroup?
Once you dd or force all the disks into a new diskgroup, the old ASM diskgroup no longer exists. Bye bye!


11g:
drop diskgroup <diskgroup name> force including contents; -- Drops the diskgroup completely so that it can be recreated.

alter diskgroup <diskgroup name> mount force; -- This is useful when one or more disks are missing.


Side Note: You cannot use the FORCE flag when dropping a disk from an external redundancy diskgroup.
