Friday, May 25, 2012

Exadata Flash Cache Maintenance:

This is one of the common maintenance tasks on Exadata cell nodes; I just want to publish the steps to correct a Flash Cache issue.

-- Pre re-create flashcache steps
1.
-- Run the following command to check if there are other offline disks
CellCLI> LIST GRIDDISK ATTRIBUTES name WHERE asmdeactivationoutcome != 'Yes'

If any grid disks are returned, then it is not safe to take the storage server offline because proper Oracle ASM disk group redundancy will not be intact.

Taking the storage server offline when one or more grid disks are in this state will cause Oracle ASM to dismount the affected disk group, causing the databases to shut down abruptly.

Output:
CellCLI> LIST GRIDDISK ATTRIBUTES name WHERE asmdeactivationoutcome != 'Yes' -- No records selected. -- Safe
CellCLI>

CellCLI> LIST GRIDDISK ATTRIBUTES name WHERE asmdeactivationoutcome = 'Yes'
DATA_CD_00_atl02cel06
DATA_CD_01_atl02cel06
DATA_CD_02_atl02cel06
DATA_CD_03_atl02cel06
DATA_CD_04_atl02cel06
DATA_CD_05_atl02cel06
DATA_CD_06_atl02cel06
DATA_CD_07_atl02cel06
DATA_CD_08_atl02cel06
DATA_CD_09_atl02cel06
DATA_CD_10_atl02cel06
DATA_CD_11_atl02cel06
RECO_CD_00_atl02cel06
RECO_CD_01_atl02cel06
RECO_CD_02_atl02cel06
RECO_CD_03_atl02cel06
RECO_CD_04_atl02cel06
RECO_CD_05_atl02cel06
RECO_CD_06_atl02cel06
RECO_CD_07_atl02cel06
RECO_CD_08_atl02cel06
RECO_CD_09_atl02cel06
RECO_CD_10_atl02cel06
RECO_CD_11_atl02cel06
SYSTEMDG_CD_02_atl02cel06
SYSTEMDG_CD_03_atl02cel06
SYSTEMDG_CD_04_atl02cel06
SYSTEMDG_CD_05_atl02cel06
SYSTEMDG_CD_06_atl02cel06
SYSTEMDG_CD_07_atl02cel06
SYSTEMDG_CD_08_atl02cel06
SYSTEMDG_CD_09_atl02cel06
SYSTEMDG_CD_10_atl02cel06
SYSTEMDG_CD_11_atl02cel06
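
If you want to script the Step 1 check, a minimal sketch could look like the following. It assumes the CellCLI output has already been captured as text (for example via `cellcli -e`), and the function name `unsafe_griddisks` is hypothetical, not part of any Oracle tool.

```python
from typing import List


def unsafe_griddisks(cellcli_output: str) -> List[str]:
    """Return the grid disk names printed by the
    "asmdeactivationoutcome != 'Yes'" query.

    An empty list means the query returned no rows, i.e. the cell
    looks safe to take offline. (Hypothetical helper for illustration.)
    """
    names = []
    for line in cellcli_output.splitlines():
        line = line.strip()
        # Skip blank lines and echoed prompts such as "CellCLI> LIST ...".
        if not line or line.startswith("CellCLI>"):
            continue
        names.append(line.split()[0])
    return names
```

An empty result corresponds to the "Safe" case shown above; any returned name means the cell should stay online for now.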


2.
-- Get a listing of the physical disks and cell disks. You can use this later to match up the WWNs to ensure you are removing the correct module.
CellCLI> list physicaldisk
CellCLI> list celldisk

Output:
-- Output with missing disks: the flash modules on cards 3 and 4 are absent from this listing.
CellCLI> list physicaldisk
16:0 JK1130YAHARN4T normal
16:1 JK1130YAHBWWZT normal
16:2 JK1130YAHBV76T normal
16:3 JK1130YAHBV7TT normal
16:4 JK1130YAH49UJT normal
16:5 JK1130YAHBJ5HT normal
16:6 JK1130YAHBJNHT normal
16:7 JK1130YAHATA3T normal
16:8 JK1130YAHAJX7T normal
16:9 JK1130YAHBJNJT normal
16:10 JK1130YAHBSBZT normal
16:11 JK1130YAHBV7PT normal
[1:0:0:0] 5080020000c47c0FMOD0 normal
[1:0:1:0] 5080020000c47c0FMOD1 normal
[1:0:2:0] 5080020000c47c0FMOD2 normal
[1:0:3:0] 5080020000c47c0FMOD3 normal
[2:0:0:0] 5080020000c47eaFMOD0 normal
[2:0:1:0] 5080020000c47eaFMOD1 normal
[2:0:2:0] 5080020000c47eaFMOD2 normal
[2:0:3:0] 5080020000c47eaFMOD3 normal

-- Sometimes the output may look like this (this output is from a different server):
CellCLI> list physicaldisk
24:0 JK1130YAG52NAT normal
24:1 JK1130YAG4EPZT normal
24:2 JK1130YAG0VABT normal
24:3 JK1130YAG536XT normal
24:4 JK1130YAG51TVT normal
24:5 JK1130YAG0VA9T normal
24:6 JK1130YAG51TUT normal
24:7 JK1130YAG0VRKT normal
24:8 JK1130YAG51ZYT normal
24:9 JK1130YAG0VD4T normal
24:10 JK1130YAG0VN3T normal
24:11 JK1130YAG52NKT normal
[1:0:0:0] 5080020000c4262FMOD0 normal
[1:0:1:0] 5080020000c4262FMOD1 normal
[1:0:2:0] 5080020000c4262FMOD2 normal
[1:0:3:0] 5080020000c4262FMOD3 normal
[2:0:0:0] 5080020000c421cFMOD0 normal
[2:0:1:0] 5080020000c421cFMOD1 normal
[2:0:2:0] 5080020000c421cFMOD2 normal
[3:0:0:0] 5080020000c422eFMOD0 normal
[3:0:1:0] 5080020000c422eFMOD1 poor performance
[3:0:2:0] 5080020000c422eFMOD2 poor performance
[3:0:3:0] 5080020000c422eFMOD3 poor performance
[4:0:0:0] 5080020000c427cFMOD0 normal
[4:0:1:0] 5080020000c427cFMOD1 normal
[4:0:2:0] 5080020000c427cFMOD2 normal
[4:0:3:0] 5080020000c427cFMOD3 normal

CellCLI> list celldisk
CD_00_atl02cel06 normal
CD_01_atl02cel06 normal
CD_02_atl02cel06 normal
CD_03_atl02cel06 normal
CD_04_atl02cel06 normal
CD_05_atl02cel06 normal
CD_06_atl02cel06 normal
CD_07_atl02cel06 normal
CD_08_atl02cel06 normal
CD_09_atl02cel06 normal
CD_10_atl02cel06 normal
CD_11_atl02cel06 normal
FD_00_atl02cel06 normal
FD_01_atl02cel06 normal
FD_02_atl02cel06 normal
FD_03_atl02cel06 normal
FD_04_atl02cel06 not present
FD_05_atl02cel06 not present
FD_06_atl02cel06 not present
FD_07_atl02cel06 not present
FD_08_atl02cel06 normal
FD_09_atl02cel06 normal
FD_10_atl02cel06 normal
FD_11_atl02cel06 normal
FD_12_atl02cel06 not present
FD_13_atl02cel06 not present
FD_14_atl02cel06 not present
FD_15_atl02cel06 not present
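
Spotting the bad or missing flash modules by eye is easy to get wrong, so here is a rough sketch that does it from the captured "list physicaldisk" text. It assumes the layout seen in the samples above (4 flash cards with 4 FMODs each, names ending in FMOD0 to FMOD3); the function name and the defaults are my own assumptions, so adjust them for your hardware.

```python
import re
from collections import defaultdict

# Flash lines look like: "[3:0:1:0] 5080020000c422eFMOD1 poor performance"
FLASH_LINE = re.compile(r"^\[(\d+):\d+:\d+:\d+\]\s+(\S+)FMOD(\d+)\s+(.+)$")


def flash_problems(physicaldisk_output, expected_cards=(1, 2, 3, 4),
                   modules_per_card=4):
    """Return (card, module, status) for every flash module that is
    missing from the listing or not reported as "normal".

    expected_cards / modules_per_card are assumptions taken from the
    sample listings in this post, not values read from the cell.
    """
    seen = defaultdict(dict)
    for line in physicaldisk_output.splitlines():
        match = FLASH_LINE.match(line.strip())
        if match:
            card, _wwn, module, status = match.groups()
            seen[int(card)][int(module)] = status.strip()

    problems = []
    for card in expected_cards:
        for module in range(modules_per_card):
            status = seen[card].get(module, "missing")
            if status != "normal":
                problems.append((card, module, status))
    return problems
```

Hard disk lines such as "16:0 JK1130YAHARN4T normal" do not match the pattern and are simply ignored.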


3.
-- Once the storage server is safe to take offline, inactivate all the grid disks using the following command:
CellCLI> ALTER GRIDDISK ALL INACTIVE

The preceding command will complete once all disks are inactive and offline.
Depending on the storage server activity, it may take several minutes for this command to complete.


4.
-- Verify all grid disks are INACTIVE to allow safe storage server shut down by running the following command:
CellCLI> LIST GRIDDISK ATTRIBUTES name, asmmodestatus
CellCLI> LIST GRIDDISK

If all grid disks are INACTIVE, then the storage server can be shut down without affecting database availability.

Output:
CellCLI> LIST GRIDDISK ATTRIBUTES name, asmmodestatus
DATA_CD_00_atl02cel06 UNKNOWN
DATA_CD_01_atl02cel06 UNKNOWN
DATA_CD_02_atl02cel06 UNKNOWN
DATA_CD_03_atl02cel06 UNKNOWN
DATA_CD_04_atl02cel06 UNKNOWN
DATA_CD_05_atl02cel06 UNKNOWN
DATA_CD_06_atl02cel06 UNKNOWN
DATA_CD_07_atl02cel06 UNKNOWN
DATA_CD_08_atl02cel06 UNKNOWN
DATA_CD_09_atl02cel06 UNKNOWN
DATA_CD_10_atl02cel06 UNKNOWN
DATA_CD_11_atl02cel06 UNKNOWN
RECO_CD_00_atl02cel06 UNKNOWN
RECO_CD_01_atl02cel06 UNKNOWN
RECO_CD_02_atl02cel06 UNKNOWN
RECO_CD_03_atl02cel06 UNKNOWN
RECO_CD_04_atl02cel06 UNKNOWN
RECO_CD_05_atl02cel06 UNKNOWN
RECO_CD_06_atl02cel06 UNKNOWN
RECO_CD_07_atl02cel06 UNKNOWN
RECO_CD_08_atl02cel06 UNKNOWN
RECO_CD_09_atl02cel06 UNKNOWN
RECO_CD_10_atl02cel06 UNKNOWN
RECO_CD_11_atl02cel06 UNKNOWN
SYSTEMDG_CD_02_atl02cel06 UNKNOWN
SYSTEMDG_CD_03_atl02cel06 UNKNOWN
SYSTEMDG_CD_04_atl02cel06 UNKNOWN
SYSTEMDG_CD_05_atl02cel06 UNKNOWN
SYSTEMDG_CD_06_atl02cel06 UNKNOWN
SYSTEMDG_CD_07_atl02cel06 UNKNOWN
SYSTEMDG_CD_08_atl02cel06 UNKNOWN
SYSTEMDG_CD_09_atl02cel06 UNKNOWN
SYSTEMDG_CD_10_atl02cel06 UNKNOWN
SYSTEMDG_CD_11_atl02cel06 UNKNOWN

-- The status should read inactive after you run "ALTER GRIDDISK ALL INACTIVE" (the sample output below still shows active).
CellCLI> LIST GRIDDISK
DATA_CD_00_atl02cel06 active
DATA_CD_01_atl02cel06 active
DATA_CD_02_atl02cel06 active
DATA_CD_03_atl02cel06 active
DATA_CD_04_atl02cel06 active
DATA_CD_05_atl02cel06 active
DATA_CD_06_atl02cel06 active
DATA_CD_07_atl02cel06 active
DATA_CD_08_atl02cel06 active
DATA_CD_09_atl02cel06 active
DATA_CD_10_atl02cel06 active
DATA_CD_11_atl02cel06 active
RECO_CD_00_atl02cel06 active
RECO_CD_01_atl02cel06 active
RECO_CD_02_atl02cel06 active
RECO_CD_03_atl02cel06 active
RECO_CD_04_atl02cel06 active
RECO_CD_05_atl02cel06 active
RECO_CD_06_atl02cel06 active
RECO_CD_07_atl02cel06 active
RECO_CD_08_atl02cel06 active
RECO_CD_09_atl02cel06 active
RECO_CD_10_atl02cel06 active
RECO_CD_11_atl02cel06 active
SYSTEMDG_CD_02_atl02cel06 active
SYSTEMDG_CD_03_atl02cel06 active
SYSTEMDG_CD_04_atl02cel06 active
SYSTEMDG_CD_05_atl02cel06 active
SYSTEMDG_CD_06_atl02cel06 active
SYSTEMDG_CD_07_atl02cel06 active
SYSTEMDG_CD_08_atl02cel06 active
SYSTEMDG_CD_09_atl02cel06 active
SYSTEMDG_CD_10_atl02cel06 active
SYSTEMDG_CD_11_atl02cel06 active
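
To avoid scanning 30+ lines by eye before shutting down, the Step 4 check can be sketched as below. It works on the captured "LIST GRIDDISK" text; the helper name `griddisks_not_in_state` is hypothetical.

```python
def griddisks_not_in_state(griddisk_output, wanted="inactive"):
    """Map grid disk name -> status for every disk whose status in the
    "LIST GRIDDISK" output differs from `wanted` (case-insensitive).

    An empty dict means every listed disk is in the wanted state.
    (Hypothetical helper for illustration.)
    """
    wrong = {}
    for line in griddisk_output.splitlines():
        fields = line.split()
        # Need "name status"; skip blanks and echoed "CellCLI> ..." prompts.
        if len(fields) < 2 or fields[0] == "CellCLI>":
            continue
        name, status = fields[0], fields[1]
        if status.lower() != wanted.lower():
            wrong[name] = status
    return wrong
```

With the sample output above, every disk would be reported, because they are all still active.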


5.
-- Check the status of FLASHCACHE
CellCLI> LIST FLASHCACHE detail

Output: -- When there is an issue.
CellCLI> LIST FLASHCACHE detail
name: atl02cel06_FLASHCACHE
cellDisk: FD_10_atl02cel06,FD_09_atl02cel06,FD_01_atl02cel06,FD_11_atl02cel06,FD_03_atl02cel06,FD_08_atl02cel06,FD_00_atl02cel06,FD_02_atl02cel06
creationTime: 2010-07-14T19:51:22-04:00
degradedCelldisks: FD_13_atl02cel06,FD_14_atl02cel06,FD_07_atl02cel06,FD_15_atl02cel06,FD_05_atl02cel06,FD_06_atl02cel06,FD_04_atl02cel06,FD_12_atl02cel06
effectiveCacheSize: 182.625G
id: 60a179f7-2fdf-44f4-a63d-9cfdd03d42cc
size: 365.25G
status: warning

-- Status should be normal, but here it's warning...
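
The same "detail" output can be summarized with a few lines of Python. This is only a sketch: it assumes the "key: value" layout shown above and that both sizes carry a "G" suffix, and the function names are made up for this example.

```python
def parse_detail(detail_output):
    """Turn "key: value" lines from a CellCLI "detail" listing into a dict."""
    attrs = {}
    for line in detail_output.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep and not key.strip().startswith("CellCLI>"):
            attrs[key.strip()] = value.strip()
    return attrs


def flashcache_report(detail_output):
    """Summarize flash cache health (assumes sizes end in "G")."""
    attrs = parse_detail(detail_output)
    degraded = [d for d in attrs.get("degradedCelldisks", "").split(",") if d]
    size = float(attrs["size"].rstrip("G"))
    effective = float(attrs["effectiveCacheSize"].rstrip("G"))
    return {
        "status": attrs["status"],
        "degraded": sorted(degraded),
        "usable_fraction": effective / size,
        "healthy": attrs["status"] == "normal" and not degraded,
    }
```

For the output above, this reports 8 degraded cell disks and a usable fraction of 0.5 (182.625G out of 365.25G), which matches half of the flash cards being unavailable.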


6.
-- We can now drop the flashcache here using the command

CellCLI> drop flashcache all

This ensures that the flashcache does not start on startup.


7.
-- Stop the cell services using the following command:
CellCLI> ALTER CELL SHUTDOWN SERVICES ALL


8.
Unix SA -- Shut down atl02cel06 (as root) so that the flash cache cards can be replaced.

#shutdown -h -y now

The cell services will be started automatically.

---------------------------------------------------

a. Replace the failed flash disk based on the PCI number and FDOM number.
NOTE: You can also use the WWN from the "list physicaldisk" command earlier.
Example: [2:0:0:0] 5080020000fcf12FMOD0 normal
               [2:0:0:0] 5080020000f "cf12" FMOD0 normal
The number "cf12" above should be on an orange sticker on the back of each flash PCI card.

b. Power up the cell. The cell services will be started automatically.

c. After boot, verify that the replaced flash disks have the same firmware as the others:

/opt/oracle.SupportTools/CheckHWnFWProfile -c strict
-OR-
dmesg | grep -i marvell # Verify all revisions are the same

-- If firmware is not the same across all disks, then do the following...
rm /opt/oracle.cellos/TRIED_FW_UPDATE_ONCE
reboot
---------------------------------------------------
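
For the WWN note in step a, pulling the sticker number out of a "list physicaldisk" name can be sketched like this. It assumes, based only on the "cf12" example above, that the sticker shows the last four hex characters before "FMOD"; the function name `sticker_id` is hypothetical.

```python
import re


def sticker_id(physicaldisk_name):
    """Split a flash disk name such as "5080020000fcf12FMOD0" into
    (sticker, module): the last four hex characters before "FMOD"
    and the FMOD number.

    Assumption: the sticker carries those four characters, as in the
    "cf12" example in this post.
    """
    match = re.fullmatch(r"([0-9A-Fa-f]+)FMOD(\d+)", physicaldisk_name)
    if match is None:
        raise ValueError("not a flash module name: %r" % physicaldisk_name)
    wwn, module = match.groups()
    return wwn[-4:], int(module)
```

Always confirm against the PCI slot number as well before pulling a card.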



-- Re-create flashcache Steps
9.
-- Configure Flash Cache

CellCLI> drop celldisk all flashdisk force
CellCLI> create celldisk all flashdisk
CellCLI> create flashcache all
CellCLI> LIST FLASHCACHE detail


10.
-- Bring all grid disks online using the following command:

CellCLI> ALTER GRIDDISK ALL ACTIVE

Note:
-- When the grid disks become active, Oracle ASM will automatically synchronize the grid disks to bring them back into the disk group.
-- If ASM is down, you need to bring up ASM on at least one node to see the grid disks as "ONLINE"; otherwise they will NOT show ONLINE until ASM is up.




--Validation steps
11.
-- Verify all grid disks have been successfully put online using the following command:
CellCLI> LIST GRIDDISK ATTRIBUTES name, asmmodestatus

Wait until asmmodestatus is ONLINE for all grid disks.

Output:
CellCLI> LIST GRIDDISK ATTRIBUTES name, asmmodestatus
DATA_CD_00_atl02cel06 ONLINE
DATA_CD_01_atl02cel06 ONLINE
DATA_CD_02_atl02cel06 ONLINE
DATA_CD_03_atl02cel06 ONLINE
DATA_CD_04_atl02cel06 ONLINE
DATA_CD_05_atl02cel06 ONLINE
DATA_CD_06_atl02cel06 ONLINE
DATA_CD_07_atl02cel06 ONLINE
DATA_CD_08_atl02cel06 ONLINE
DATA_CD_09_atl02cel06 ONLINE
DATA_CD_10_atl02cel06 ONLINE
DATA_CD_11_atl02cel06 ONLINE
RECO_CD_00_atl02cel06 ONLINE
RECO_CD_01_atl02cel06 ONLINE
RECO_CD_02_atl02cel06 ONLINE
RECO_CD_03_atl02cel06 ONLINE
RECO_CD_04_atl02cel06 ONLINE
RECO_CD_05_atl02cel06 ONLINE
RECO_CD_06_atl02cel06 ONLINE
RECO_CD_07_atl02cel06 ONLINE
RECO_CD_08_atl02cel06 ONLINE
RECO_CD_09_atl02cel06 ONLINE
RECO_CD_10_atl02cel06 ONLINE
RECO_CD_11_atl02cel06 ONLINE
SYSTEMDG_CD_02_atl02cel06 ONLINE
SYSTEMDG_CD_03_atl02cel06 ONLINE
SYSTEMDG_CD_04_atl02cel06 ONLINE
SYSTEMDG_CD_05_atl02cel06 ONLINE
SYSTEMDG_CD_06_atl02cel06 ONLINE
SYSTEMDG_CD_07_atl02cel06 ONLINE
SYSTEMDG_CD_08_atl02cel06 ONLINE
SYSTEMDG_CD_09_atl02cel06 ONLINE
SYSTEMDG_CD_10_atl02cel06 ONLINE
SYSTEMDG_CD_11_atl02cel06 ONLINE
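
The "wait until ONLINE" part of Step 11 is a simple polling loop. A sketch is below; how you fetch the statuses is left to you, so `fetch_status` is a caller-supplied function (hypothetical) that returns a dict of grid disk name to asmmodestatus, and the default intervals are arbitrary.

```python
import time


def wait_for_online(fetch_status, poll_seconds=30, timeout_seconds=3600,
                    sleep=time.sleep):
    """Poll until every grid disk reports asmmodestatus ONLINE.

    fetch_status: caller-supplied callable returning {name: asmmodestatus}
                  (hypothetical; e.g. built on captured CellCLI output).
    Returns the number of seconds spent waiting; raises TimeoutError if
    some disks are still not ONLINE once timeout_seconds have elapsed.
    """
    waited = 0
    while True:
        pending = {n: s for n, s in fetch_status().items() if s != "ONLINE"}
        if not pending:
            return waited
        if waited >= timeout_seconds:
            raise TimeoutError("still not ONLINE: %s" % sorted(pending))
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter exists only so the loop can be exercised without really waiting.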


12.
-- Oracle ASM synchronization is only complete when all grid disks show asmmodestatus=ONLINE.
-- Before taking another storage server offline, Oracle ASM synchronization must complete on the restarted Oracle Exadata Storage Server.
-- If synchronization is not complete, then the check performed on another storage server will fail.

CellCLI> list griddisk attributes name where asmdeactivationoutcome != 'Yes'


13.
-- Wait until asmmodestatus shows ONLINE for all grid disks.
-- List the disks and confirm they are all normal.

CellCLI> LIST GRIDDISK ATTRIBUTES asmmodestatus
CellCLI> list physicaldisk
CellCLI> list celldisk
CellCLI> list griddisk

-- All disks are visible after re-creating the flashcache; compare with the output from Step 2.
CellCLI> list physicaldisk
16:0 JK1130YAHARN4T normal
16:1 JK1130YAHBWWZT normal
16:2 JK1130YAHBV76T normal
16:3 JK1130YAHBV7TT normal
16:4 JK1130YAH49UJT normal
16:5 JK1130YAHBJ5HT normal
16:6 JK1130YAHBJNHT normal
16:7 JK1130YAHATA3T normal
16:8 JK1130YAHAJX7T normal
16:9 JK1130YAHBJNJT normal
16:10 JK1130YAHBSBZT normal
16:11 JK1130YAHBV7PT normal
[1:0:0:0] 5080020000c47c0FMOD0 normal
[1:0:1:0] 5080020000c47c0FMOD1 normal
[1:0:2:0] 5080020000c47c0FMOD2 normal
[1:0:3:0] 5080020000c47c0FMOD3 normal
[2:0:0:0] 5080020000c47eaFMOD0 normal
[2:0:1:0] 5080020000c47eaFMOD1 normal
[2:0:2:0] 5080020000c47eaFMOD2 normal
[2:0:3:0] 5080020000c47eaFMOD3 normal
[3:0:0:0] 5080020000c47feFMOD0 normal
[3:0:1:0] 5080020000c47feFMOD1 normal
[3:0:2:0] 5080020000c47feFMOD2 normal
[3:0:3:0] 5080020000c47feFMOD3 normal
[4:0:0:0] 5080020000c47b2FMOD0 normal
[4:0:1:0] 5080020000c47b2FMOD1 normal
[4:0:2:0] 5080020000c47b2FMOD2 normal
[4:0:3:0] 5080020000c47b2FMOD3 normal

--
CellCLI> list celldisk
CD_00_atl02cel06 normal
CD_01_atl02cel06 normal
CD_02_atl02cel06 normal
CD_03_atl02cel06 normal
CD_04_atl02cel06 normal
CD_05_atl02cel06 normal
CD_06_atl02cel06 normal
CD_07_atl02cel06 normal
CD_08_atl02cel06 normal
CD_09_atl02cel06 normal
CD_10_atl02cel06 normal
CD_11_atl02cel06 normal
FD_00_atl02cel06 normal
FD_01_atl02cel06 normal
FD_02_atl02cel06 normal
FD_03_atl02cel06 normal
FD_04_atl02cel06 normal
FD_05_atl02cel06 normal
FD_06_atl02cel06 normal
FD_07_atl02cel06 normal
FD_08_atl02cel06 normal
FD_09_atl02cel06 normal
FD_10_atl02cel06 normal
FD_11_atl02cel06 normal
FD_12_atl02cel06 normal
FD_13_atl02cel06 normal
FD_14_atl02cel06 normal
FD_15_atl02cel06 normal

--
CellCLI> list griddisk
DATA_CD_00_atl02cel06 active
DATA_CD_01_atl02cel06 active
DATA_CD_02_atl02cel06 active
DATA_CD_03_atl02cel06 active
DATA_CD_04_atl02cel06 active
DATA_CD_05_atl02cel06 active
DATA_CD_06_atl02cel06 active
DATA_CD_07_atl02cel06 active
DATA_CD_08_atl02cel06 active
DATA_CD_09_atl02cel06 active
DATA_CD_10_atl02cel06 active
DATA_CD_11_atl02cel06 active
RECO_CD_00_atl02cel06 active
RECO_CD_01_atl02cel06 active
RECO_CD_02_atl02cel06 active
RECO_CD_03_atl02cel06 active
RECO_CD_04_atl02cel06 active
RECO_CD_05_atl02cel06 active
RECO_CD_06_atl02cel06 active
RECO_CD_07_atl02cel06 active
RECO_CD_08_atl02cel06 active
RECO_CD_09_atl02cel06 active
RECO_CD_10_atl02cel06 active
RECO_CD_11_atl02cel06 active
SYSTEMDG_CD_02_atl02cel06 active
SYSTEMDG_CD_03_atl02cel06 active
SYSTEMDG_CD_04_atl02cel06 active
SYSTEMDG_CD_05_atl02cel06 active
SYSTEMDG_CD_06_atl02cel06 active
SYSTEMDG_CD_07_atl02cel06 active
SYSTEMDG_CD_08_atl02cel06 active
SYSTEMDG_CD_09_atl02cel06 active
SYSTEMDG_CD_10_atl02cel06 active
SYSTEMDG_CD_11_atl02cel06 active
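
Comparing the Step 2 and Step 13 "list celldisk" listings can also be done mechanically. The sketch below assumes each line is "name status" as in the samples; both function names are hypothetical.

```python
def parse_name_status(listing):
    """Parse "name status" lines (e.g. from "list celldisk") into a dict."""
    result = {}
    for line in listing.splitlines():
        line = line.strip()
        # Skip blanks, echoed prompts and "--" comment lines.
        if not line or line.startswith("CellCLI>") or line.startswith("--"):
            continue
        parts = line.split(None, 1)
        result[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return result


def listing_changes(before, after):
    """Return {name: (old_status, new_status)} for every entry whose
    status differs between the two listings; None marks an entry that
    is absent from one side."""
    old, new = parse_name_status(before), parse_name_status(after)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }
```

Fed the two celldisk listings from this post, it should report FD_04 to FD_07 and FD_12 to FD_15 going from "not present" to "normal".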


14.
-- Go to all the cells and run this command:
CellCLI> LIST GRIDDISK ATTRIBUTES name, asmmodestatus

--
CellCLI> LIST GRIDDISK ATTRIBUTES name, asmmodestatus
DATA_CD_00_atl02cel06 ONLINE
DATA_CD_01_atl02cel06 ONLINE
DATA_CD_02_atl02cel06 ONLINE
DATA_CD_03_atl02cel06 ONLINE
DATA_CD_04_atl02cel06 ONLINE
DATA_CD_05_atl02cel06 ONLINE
DATA_CD_06_atl02cel06 ONLINE
DATA_CD_07_atl02cel06 ONLINE
DATA_CD_08_atl02cel06 ONLINE
DATA_CD_09_atl02cel06 ONLINE
DATA_CD_10_atl02cel06 ONLINE
DATA_CD_11_atl02cel06 ONLINE
RECO_CD_00_atl02cel06 ONLINE
RECO_CD_01_atl02cel06 ONLINE
RECO_CD_02_atl02cel06 ONLINE
RECO_CD_03_atl02cel06 ONLINE
RECO_CD_04_atl02cel06 ONLINE
RECO_CD_05_atl02cel06 ONLINE
RECO_CD_06_atl02cel06 ONLINE
RECO_CD_07_atl02cel06 ONLINE
RECO_CD_08_atl02cel06 ONLINE
RECO_CD_09_atl02cel06 ONLINE
RECO_CD_10_atl02cel06 ONLINE
RECO_CD_11_atl02cel06 ONLINE
SYSTEMDG_CD_02_atl02cel06 ONLINE
SYSTEMDG_CD_03_atl02cel06 ONLINE
SYSTEMDG_CD_04_atl02cel06 ONLINE
SYSTEMDG_CD_05_atl02cel06 ONLINE
SYSTEMDG_CD_06_atl02cel06 ONLINE
SYSTEMDG_CD_07_atl02cel06 ONLINE
SYSTEMDG_CD_08_atl02cel06 ONLINE
SYSTEMDG_CD_09_atl02cel06 ONLINE
SYSTEMDG_CD_10_atl02cel06 ONLINE
SYSTEMDG_CD_11_atl02cel06 ONLINE

2 comments:

PC said...

Hello Vijay,

Thanks for the nice post.
When I tried this on my server, it gave me the error that "Command not found".
# cellcli
-bash: cellcli: command not found

# which cellcli
which: no cellcli in (/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin)

By any chance do you have any idea why this is happening?

Vijay R. Dumpa said...

Try adding the cellsrv bin directory to your PATH:

$ which cellcli
/opt/oracle/cell11.2.3.1.1_LINUX.X64_120607/cellsrv/bin/cellcli