Saturday, January 12, 2013

Exadata Storage Cell Disk Change


I follow below steps while changing and faulty disk on Exadata. We are using Exadata storage SW version 11.2.2.4.0.

1. You have identified following disk to be replaced:
       
$ cellcli -e list physicaldisk detail
         name:                   18:4
         deviceId:               10
         diskType:               HardDisk
         enclosureDeviceId:      18
         errMediaCount:          471
         errOtherCount:          17
         foreignState:           false
         luns:                   0_4
         makeModel:              "SEAGATE ST360057SSUN600G"
         physicalFirmware:       0805
         physicalInsertTime:     2010-03-26T05:05:46+06:00
         physicalInterface:      sas
         physicalSerial:         E08WD8
         physicalSize:           558.9109999993816G
         slotNumber:             4
         status:                 critical

2. You need to make sure that there is no other disks are offline/missing in other cells(failgroup) :

select count(*),failgroup from gv$asm_disk group by failgroup;


3. Check ASM_POWER_LIMIT and if it is 0 then set it to non zero value so that after the disk replacement rebalance can happen and the newly added disks will have data written on them.
This parameter can be set to any value between 0 to 11.
--on asm instance
alter system set asm_power_limit=10;

4. Make sure there are no rebalance operation running on any instance on any node:

select * from gv$asm_operation;

5. Identify the ASM disks which are existing on the physical disk to be replaced:

a. On the cell node execute following:

- Find the lun number "0_11" in the list physicaldisk detail output
# cellcli -e list physicaldisk detail
.....
luns: 0_4
.....

- Find the celldisk created for this lun (physicaldisk)
# cellcli -e list celldisk where lun=0_4
         CD_04_axdwcel06         not present

- Find the griddisks created for this celldisk
# cellcli -e list griddisk where celldisk=CD_04_axdwcel06
         DATA1_CD_04_axdwcel06           not present
         DATA2_CD_04_axdwcel06           not present
         RECO_CD_04_axdwcel06            not present
         SYSTEMDG_CD_04_axdwcel06        not present



b. Execute following on any db node connecting to the ASM instance:

SQL> select name, substr(path,1,50) , mount_status, header_status, state from gv$asm_disk where name in ('DATA1_CD_04_AXDWCEL06','DATA2_CD_04_AXDWCEL06','SYSTEMDG_CD_04_AXDWCEL06','RECO_CD_04_AXDWCEL06');

NAME SUBSTR           (PATH,1,50)                        MOUNT_S HEADER_STA STATE
------------------------------ ----------------------------------------   ------
DATA_CD_11_DMORLCEL02 o/192.168.10.6/DATA_CD_11_dmorlcel02 CACHED  MEMBER  NORMAL
RECO_CD_11_DMORLCEL02 o/192.168.10.6/RECO_CD_11_dmorlcel02 CACHED  MEMBER  NORMAL



6. Now drop the disk from the diskgroup(s):

alter diskgroup DATA1 drop disk 'DATA1_CD_04_AXDWCEL06';
alter diskgroup DATA2 drop disk 'DATA2_CD_04_AXDWCEL06';
alter diskgroup SYSTEMDG drop disk 'SYSTEMDG_CD_04_AXDWCEL06';
alter diskgroup RECO drop disk 'RECO_CD_04_AXDWCEL06';

7. Wait until the Oracle ASM disks associated with the grid disks on the bad disk have been successfully dropped by querying the V$ASM_DISK_STAT view.

V$ASM_DISK_STAT displays disk statistics in the same way that V$ASM_DISK does, but without performing discovery of new disks. This results in a less expensive operation. However, since discovery is not performed, the output of this view does not include any data about disks that are new to the system.


8. Mark this disk at the hardware level to be serviced:

cellcli -e 'alter physicaldisk 18:4 serviceled on'

Here disk_name should be replaced by the name of the physical disk '18:4'.

This step will help in identifying the disk to be replaced at the h/w level. Please note that this command may  not work in version less than 11.2.2.2. but it is not mandatory. it just helps in identifying the disk to be replaced.

9. Now replace the disk at the h/w level. The service LED will glow for this disk as done in the previous step.

10. After replacing the disk Exadata s/w will automatically create celldisk, griddisks. You can tail the cell alert.log and ASM alert.log to see this activity. But the disks will not be added back to the ASM as we did not use FORCE option while dropping them. So we need to add them manually:

alter diskgroup DATA1 add  disk 'o/10.10.0.10/DATA1_CD_04_AXDWCEL06';
alter diskgroup DATA2 add  disk 'o/10.10.0.10/DATA2_CD_04_AXDWCEL06';
alter diskgroup RECO add  disk 'o/10.10.0.10/RECO_CD_04_AXDWCEL06';
alter diskgroup SYSTEMDG add  disk 'o/10.10.0.10/SYSTEMDG_CD_04_AXDWCEL06';

-- Location of the cell alert.log file:
/opt/oracle/cell11.2.2.4.0_LINUX.X64_110929/log/diag/asm/cell/axdwcel06/trace/alert.log

-- Location of the ASM alert.log
/u01/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log


11. Check until the rebalance completes using following sql:
select * from gv$asm_operation;

12. Once the rebalance is over check the status of the ASM disks again and verify that everything is fine:

SQL> select name, substr(path,1,50) , mount_status, header_status, state from gv$asm_disk where name in ('DATA1_CD_04_AXDWCEL06','DATA2_CD_04_AXDWCEL06','SYSTEMDG_CD_04_AXDWCEL06','RECO_CD_04_AXDWCEL06');

select * from gv$asm_disk

13. Verify all grid disks have been successfully put online using the following command from the cell where disk change is done:
    (Wait until asmmodestatus is ONLINE for all grid disks
      and Oracle ASM synchronization is only complete when all grid disks show asmmodestatus=ONLINE.)
     cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome

14. also check the alert history from the cell where disk change is done
cellcli -e list alerthistory

No comments:

Post a Comment