I follow the steps below when replacing a faulty disk on Exadata. We are using Exadata storage software version 11.2.2.4.0.
1. You have identified the following disk to be replaced:
$ cellcli -e list physicaldisk detail
name: 18:4
deviceId: 10
diskType: HardDisk
enclosureDeviceId: 18
errMediaCount: 471
errOtherCount: 17
foreignState: false
luns: 0_4
makeModel: "SEAGATE ST360057SSUN600G"
physicalFirmware: 0805
physicalInsertTime: 2010-03-26T05:05:46+06:00
physicalInterface: sas
physicalSerial: E08WD8
physicalSize: 558.9109999993816G
slotNumber: 4
status: critical
2. Make sure that no other disks are offline or missing in the other cells (failgroups):
select count(*),failgroup from gv$asm_disk group by failgroup;
3. Check ASM_POWER_LIMIT; if it is 0, set it to a non-zero value so that a rebalance can run after the disk replacement and data gets written to the newly added disks.
This parameter can be set to any value between 0 and 11.
--on asm instance
alter system set asm_power_limit=10;
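To confirm the current value before and after changing it, a quick check on the ASM instance can look like this (the exact output format varies by environment):
-- on the ASM instance, from SQL*Plus
show parameter asm_power_limit
-- or query the parameter view directly
select name, value from v$parameter where name = 'asm_power_limit';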
4. Make sure there is no rebalance operation running on any instance on any node:
select * from gv$asm_operation;
5. Identify the ASM disks that reside on the physical disk to be replaced:
a. On the cell node, execute the following:
- Find the lun number "0_4" in the list physicaldisk detail output
# cellcli -e list physicaldisk detail
.....
luns: 0_4
.....
- Find the celldisk created for this lun (physicaldisk)
# cellcli -e list celldisk where lun=0_4
CD_04_axdwcel06 not present
- Find the griddisks created for this celldisk
# cellcli -e list griddisk where celldisk=CD_04_axdwcel06
DATA1_CD_04_axdwcel06 not present
DATA2_CD_04_axdwcel06 not present
RECO_CD_04_axdwcel06 not present
SYSTEMDG_CD_04_axdwcel06 not present
b. Execute the following on any DB node, connecting to the ASM instance:
SQL> select name, substr(path,1,50) , mount_status, header_status, state from gv$asm_disk where name in ('DATA1_CD_04_AXDWCEL06','DATA2_CD_04_AXDWCEL06','SYSTEMDG_CD_04_AXDWCEL06','RECO_CD_04_AXDWCEL06');
NAME                           SUBSTR(PATH,1,50)                        MOUNT_S HEADER_STA STATE
------------------------------ ---------------------------------------- ------- ---------- --------
DATA1_CD_04_AXDWCEL06          o/10.10.0.10/DATA1_CD_04_axdwcel06       CACHED  MEMBER     NORMAL
DATA2_CD_04_AXDWCEL06          o/10.10.0.10/DATA2_CD_04_axdwcel06       CACHED  MEMBER     NORMAL
RECO_CD_04_AXDWCEL06           o/10.10.0.10/RECO_CD_04_axdwcel06        CACHED  MEMBER     NORMAL
SYSTEMDG_CD_04_AXDWCEL06       o/10.10.0.10/SYSTEMDG_CD_04_axdwcel06    CACHED  MEMBER     NORMAL
6. Now drop the disks from their respective diskgroups:
alter diskgroup DATA1 drop disk 'DATA1_CD_04_AXDWCEL06';
alter diskgroup DATA2 drop disk 'DATA2_CD_04_AXDWCEL06';
alter diskgroup SYSTEMDG drop disk 'SYSTEMDG_CD_04_AXDWCEL06';
alter diskgroup RECO drop disk 'RECO_CD_04_AXDWCEL06';
7. Wait until the Oracle ASM disks associated with the grid disks on the bad disk have been successfully dropped by querying the V$ASM_DISK_STAT view.
V$ASM_DISK_STAT displays disk statistics in the same way that V$ASM_DISK does, but without performing discovery of new disks. This results in a less expensive operation. However, since discovery is not performed, the output of this view does not include any data about disks that are new to the system.
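A simple way to monitor this is to re-run a query like the one below (the disk names are the ones from this example; adjust them to your own environment):
-- while the drop/rebalance is in progress the disks typically show STATE = DROPPING;
-- once it completes they are no longer listed as members of their diskgroups
select name, header_status, mode_status, state
from v$asm_disk_stat
where name in ('DATA1_CD_04_AXDWCEL06','DATA2_CD_04_AXDWCEL06','SYSTEMDG_CD_04_AXDWCEL06','RECO_CD_04_AXDWCEL06');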
8. Mark this disk at the hardware level to be serviced:
cellcli -e 'alter physicaldisk 18:4 serviceled on'
Here '18:4' should be replaced by the name of your physical disk.
This step helps in identifying the disk to be replaced at the hardware level. Note that this command may not work in versions earlier than 11.2.2.2, but it is not mandatory; it only helps in locating the disk to be replaced.
9. Now replace the disk at the hardware level. The service LED turned on in the previous step identifies the disk to be pulled.
10. After the disk is replaced, the Exadata software automatically recreates the celldisk and griddisks. You can tail the cell alert.log and the ASM alert.log to see this activity (see the tail example after the log locations below). However, the disks will not be added back to ASM automatically because we did not use the FORCE option while dropping them, so we need to add them manually:
alter diskgroup DATA1 add disk 'o/10.10.0.10/DATA1_CD_04_AXDWCEL06';
alter diskgroup DATA2 add disk 'o/10.10.0.10/DATA2_CD_04_AXDWCEL06';
alter diskgroup RECO add disk 'o/10.10.0.10/RECO_CD_04_AXDWCEL06';
alter diskgroup SYSTEMDG add disk 'o/10.10.0.10/SYSTEMDG_CD_04_AXDWCEL06';
-- Location of the cell alert.log file:
/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/log/diag/asm/cell/axdwcel06/trace/alert.log
-- Location of the ASM alert.log
/u01/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log
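To watch the new celldisk and griddisks being created, you can simply tail both logs (the paths are the ones from this environment, shown above; adjust them to yours):
-- on the cell node
# tail -f /opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/log/diag/asm/cell/axdwcel06/trace/alert.log
-- on the database node running the +ASM1 instance
$ tail -f /u01/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log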
11. Check until the rebalance completes using the following SQL:
select * from gv$asm_operation;
12. Once the rebalance is over, check the status of the ASM disks again and verify that everything is fine:
SQL> select name, substr(path,1,50) , mount_status, header_status, state from gv$asm_disk where name in ('DATA1_CD_04_AXDWCEL06','DATA2_CD_04_AXDWCEL06','SYSTEMDG_CD_04_AXDWCEL06','RECO_CD_04_AXDWCEL06');
select * from gv$asm_disk;
13. Verify that all grid disks have been successfully put online using the following command on the cell where the disk was changed.
(Wait until asmmodestatus is ONLINE for all grid disks; Oracle ASM synchronization is complete only when every grid disk shows asmmodestatus=ONLINE.)
cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
14. Also check the alert history on the cell where the disk was changed:
cellcli -e list alerthistory