Tuesday, March 29, 2011

Exadata Storage Software Version upgrade from 11.2.1.3.1 to 11.2.2.2.0

We have just upgraded our Exadata Storage Software from version 11.2.1.3.1 to 11.2.2.2.0 (patch 10356485).
Before that, we had done the following:
1. Applied Exadata Database Machine Bundle Patch 8 (BP 8)
2. Applied patch 10110978 on top of BP 8
3. Applied patch 11661824 on top of BP 8

Normally, Exadata software patches need to be applied on both the Cell Servers and the DB Machines.
For the list of supported patches for Exadata machines, please monitor Oracle MOS Doc ID 888828.1.

Based on the readme and the related companion documents, we prepared an action plan which looks like the following:

---------------
PATCHING CELLS:
---------------

- Find the model of the cell or database host using the command:

dmidecode -s system-product-name

DB Machines : SUN FIRE X4170 SERVER
Cell Servers: SUN FIRE X4275 SERVER
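
To collect the model string from every node in one pass, dcli can be run from the launch host (a quick sketch; it assumes the cell_group file that is prepared later in this plan):

dcli -g cell_group -l root 'dmidecode -s system-product-name'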


PREPARING FOR PATCHING:
----------------------

- Check for clean file systems on cells at a version lower than 11.2.2.1.0.

Check only one cell at a time (for all 7 cells) when using rolling updates.

- Monitor the cells on their LO web console during this process.

- If you are doing this check in rolling fashion with no deployment downtime, then
ensure the grid disks on the cell are all offlined.

- From ASM, execute:

ALTER DISKGROUP DATA1 OFFLINE DISKS IN FAILGROUP [FAIL_GROUP/CELL NAME];
ALTER DISKGROUP DATA2 OFFLINE DISKS IN FAILGROUP [FAIL_GROUP/CELL NAME];
ALTER DISKGROUP SYSTEMDG OFFLINE DISKS IN FAILGROUP [FAIL_GROUP/CELL NAME];
ALTER DISKGROUP RECO OFFLINE DISKS IN FAILGROUP [FAIL_GROUP/CELL NAME];
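
Before touching the cell it is worth confirming from ASM that the disks of that failgroup really show OFFLINE. A minimal check sketch, run from a DB host and assuming OS authentication to the local ASM instance (column names are from V$ASM_DISK):

# each offlined failgroup should report mode_status OFFLINE for its disks
sqlplus -s / as sysasm <<'EOF'
select failgroup, mode_status, count(*)
  from v$asm_disk
 group by failgroup, mode_status
 order by failgroup;
EOF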

- Shut down cell services on the cell to avoid automatic online of grid disks:

cellcli -e 'alter cell shutdown services all'

- Log in as the root user and reboot the cell using the following commands.
You can do this on all cells together if you are using a
non-rolling update. This forces a file system check at boot.

sync
sync
shutdown -F -r now

- If any cell encounters a corrupt file system journal, it will
not complete the reboot. If you encounter such a cell, contact Oracle
Support for guidance or refer to bug 9752748 to recover the cell.

- If you are doing this check in rolling fashion with no deployment downtime, then
ensure the grid disks are all online before starting this procedure on another cell.

cellcli -e 'list griddisk attributes name,asmmodestatus'

This should show that all grid disks that were online before the check are back online.
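
A rough polling sketch for one cell (cel01 is a placeholder host name; note that grid disks that are intentionally UNUSED would keep this loop waiting and should be excluded):

# wait until every grid disk on the cell reports asmmodestatus ONLINE
while ssh root@cel01 "cellcli -e 'list griddisk attributes name,asmmodestatus'" | grep -v ONLINE
do
    sleep 60
done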

- Ensure the various cell network configurations are consistent.


- One way to help verify the values are consistent is to run ipconf with the -dry option.

- CAUTION: ipconf -dry does not flag all inconsistencies. Manual inspection against what the system is actually using is still needed.

- Any values marked as (custom) during such a run indicate that the cell configuration was manually modified instead of using ipconf.

- All other values should be carefully inspected and compared to the actual values in use to ensure they do not differ.
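
One way to do that comparison in bulk is to capture the dry-run output from every cell and grep it, roughly like this (the /opt/oracle.cellos/ipconf path is an assumption based on our image; verify it on your cells):

# collect the dry-run configuration from all cells, then look for manually modified values
dcli -g cell_group -l root '/opt/oracle.cellos/ipconf -dry' > ipconf_dry_all.out
grep -i 'custom' ipconf_dry_all.out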

- Setting up for using patchmgr

- patchmgr must be launched as the root user from a system that has root ssh equivalence set up to the root user on all the cells to be patched.
*** We are going to use a DB machine for this.

- To set up root user ssh equivalence from an Exadata DB Machine, follow the steps below:

- Prepare a file listing the cells, say called cell_group, that has one cell host name or IP address per line for each cell to be patched.

- Check for existing root ssh equivalence. The following command should
require no password prompts and no interaction, and it should return
one line of output for each host listed in the cell_group file.

dcli -g cell_group -l root 'hostname -i'

- Set up root ssh equivalence from the launch host if it is not already in place.
NOTE: Skip this step if you already have root ssh equivalence.

- Generate root ssh keys.
a. ssh-keygen -t dsa
b. ssh-keygen -t rsa

Accept defaults so the ssh keys are created for root user

- Push the ssh keys to set up ssh equivalence. Enter the root password
when prompted.

dcli -g cell_group -l root -k

- Check prerequisites


* If you are using rolling patch, then

./patchmgr -cells cell_group -patch_check_prereq -rolling

* If you are using rolling rollback, then

./patchmgr -cells cell_group -rollback_check_prereq -rolling


- For rolling patching

- Ensure the database is up and running

- Check the disk_repair_time attribute for each ASM disk group (the value should be 3.6h):

sqlplus> select dg.name, a.value
         from gv$asm_diskgroup dg, gv$asm_attribute a
         where dg.group_number = a.group_number
         and a.name = 'disk_repair_time';
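
If a disk group reports something other than 3.6h, it can be adjusted with an ALTER DISKGROUP statement, roughly as sketched below (DATA1 is one of our disk group names; repeat for each disk group):

# raise disk_repair_time so rolling cell patching does not trigger disk drops
sqlplus -s / as sysasm <<'EOF'
alter diskgroup DATA1 set attribute 'disk_repair_time' = '3.6h';
EOF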


PATCHING CELLS:
---------------

- Use the LO web based console to monitor the cell during patch

- Use a fresh login session for each patch or rollback execution

- Do not reboot or power cycle cells in the middle of applying the patch.
You may be left with an unbootable cell.

- Do not edit any log file or open them in writable mode.

- Monitoring patch activity

- Use 'less -rf patchmgr.stdout' from another terminal session or window
to see raw log details from patchmgr.

- In addition on each cell being patched, you can tail the file,

/root/_patch_hctap_/_p_/wait_out

- Use tail -200f to monitor the ASM alert logs on
all ASM instances to keep a watch on ASM activity as the patch or rollback
progresses.
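
For example, on a node with a default 11.2 diag destination the ASM alert log sits under the path below (the ORACLE_BASE location and the +ASM1 instance name are assumptions; adjust per node):

tail -200f /u01/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log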

- Patching steps using patchmgr

- Do not run more than one instance of patchmgr in the deployment. It
can cause a serious irrecoverable breakdown.

- Login to a system that has ssh equivalence set up for root to all cells that are to be patched.

- Unzip the patch; it will extract into the patch_11.2.2.2.0.101206.2 directory.

- Change to the patch_11.2.2.2.0.101206.2 directory.

- [According to MOS Doc ID 1270634.1] Download any workaround helpers attached to that note, and replace dostep.sh in the patch_11.2.2.2.0.101206.2 directory.

- Prepare a file listing the cells with one cell host name or ip address per
line. Say the file name is cell_group.

- Verify that the cells meet prerequisite checks. Use -rolling option if you plan to use rolling updates.

./patchmgr -cells cell_group -patch_check_prereq -rolling

- If the prerequisite checks pass, then start the patch application. Use -rolling option if you plan to use rolling updates.

./patchmgr -cells cell_group -patch -rolling

- Monitor the log files and cells being patched

See the "Monitoring patch activity" section earlier in this file.

- Verify the patch status

A. Check image status and history

Assuming the patching is successful, check the imageinfo output and
imagehistory output on each cell. A successful patch application will
show output similar to the following.

Kernel version: 2.6.18-194.3.1.0.3.el5 #1 SMP Tue Aug 31 22:41:13 EDT 2010 x86_64
Cell version: OSS_MAIN_LINUX.X64_101206.2
Cell rpm version: cell-11.2.2.2.0_LINUX.X64_101206.2-1

Active image version: 11.2.2.2.0.101206.2
Active image activated:
Active image status: success
Active system partition on device: < / file system device after successful patch e.g. /dev/md5>
Active software partition on device:

In partition rollback: Impossible

Cell boot usb partition:
Cell boot usb version: 11.2.2.2.0.101206.2

Inactive image version:
Inactive image activated:
Inactive image status: success
Inactive system partition on device:
Inactive software partition on device:

Boot area has rollback archive for the version:
Rollback to the inactive partitions: Possible

B. Only if you used the -rolling option to patchmgr:

Ensure all grid disks that were active before patching started are
active and their asmmodestatus is ONLINE. You will find the list of
grid disks inactivated by the patch or rollback in the following file
on the respective storage cell

/root/attempted_deactivated_by_patch_griddisks.txt

cellcli -e 'list griddisk attributes name,status,asmmodestatus'
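
A small sketch to cross-check the two on a patched cell, assuming the file contains one grid disk name per line:

# confirm every grid disk the patch deactivated is back to active/ONLINE
for gd in $(cat /root/attempted_deactivated_by_patch_griddisks.txt)
do
    cellcli -e "list griddisk ${gd} attributes name,status,asmmodestatus"
done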

- If imagehistory shows failure and misceachboot shows FAILED in the file /var/log/cellos/validations.log
[for the most recent run of validations] on the target cell, then do the following:

01. Log in to the cell as the root user.
02. Stop cell services: cellcli -e 'alter cell shutdown services all'
03. Set the LSI card to factory defaults: /opt/MegaRAID/MegaCli/MegaCli64 -AdpFacDefSet -a0
04. Reboot the cell: shutdown -F -r now
05. Ensure that there are no failed validations in /var/log/cellos/validations.log for
the most recent run of validations; specifically, that there is no misceachboot validation failure (a quick grep sketch follows this list).
06. Mark the image status as successful: /opt/oracle.cellos/imagestatus -set success
07. Verify that imageinfo now shows success.
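
A quick way to scan the log for failures, as referenced in step 05 (just a sketch; the exact log format can differ between image versions):

grep -i fail /var/log/cellos/validations.log | tail -20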


- Clean up

Use -cleanup to clean up all the temporary patch or rollback files on the
cells. It will free up close to 900 MB of disk space.

Also use this before retrying an aborted or failed run of patchmgr.

./patchmgr -cells cell_group -cleanup

- Optionally, remove the root ssh equivalence (this step is not required):

dcli -g cell_group -l root --unkey


- Rolling back successfully patched cells using patchmgr

- Do not run more than one instance of patchmgr at a time in the deployment. It
can cause a serious irrecoverable breakdown.

- Check the prerequisites:

./patchmgr -cells cell_group -rollback_check_prereq -rolling

- Execute the rollback:

./patchmgr -cells cell_group -rollback -rolling


---------------------
PATCHING DB MACHINES:
---------------------

- The database host convenience pack is distributed as db_patch_11.2.2.2.0.101206.2.zip inside the cell patch.

- [According to MOS Doc ID 1270634.1] Before copying the db_patch_11.2.2.2.0.101206.2.zip file from the cell patch, do the following:

- Completely shut down all Oracle components on the database host and disable the clusterware (CRS). For example:
/u01/app/11.2.0/grid/bin/crsctl stop crs -f
/u01/app/11.2.0/grid/bin/crsctl disable crs
- Do a clean reboot of the database host. As the root user, execute:
# reboot

- Shut down oswatcher:
As root user:
a. cd /opt/oracle.oswatcher/osw
b. ./stopOSW.sh
- Turn on the controller cache (a verification sketch follows the firmware steps below):
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WB -Lall -a0
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp NoCachedBadBBU -Lall -a0
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp NORA -Lall -a0
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp Direct -Lall -a0

- Add the following lines before the last exit 0 line in the file /opt/oracle.cellos/validations/init.d/misceachboot:
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WB -Lall -a0
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp NoCachedBadBBU -Lall -a0
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp NORA -Lall -a0
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp Direct -Lall -a0

- Apply the LSI disk controller firmware manually if it is not already at 12.9.0 [ours was currently at 12.0.1-0081].
As root user:
a. Unzip the database convenience pack and extract the disk controller firmware:
unzip db_patch_11.2.2.2.0.101206.2.zip
mkdir /tmp/tmpfw
tar -pjxvf db_patch_11.2.2.2.0.101206.2/11.2.2.2.0.101206.2.tbz -C /tmp/tmpfw \
opt/oracle.cellos/iso/cellbits/dbfw.tbz
tar -pjxvf /tmp/tmpfw/opt/oracle.cellos/iso/cellbits/dbfw.tbz -C /tmp/tmpfw \
ActualFirmwareFiles/12.9.0.0049_26Oct2010_2108_FW_Image.rom
b. Flush file system data to disks:
sync
/opt/MegaRAID/MegaCli/MegaCli64 -AdpCacheFlush -a0
c. Execute firmware update:
/opt/MegaRAID/MegaCli/MegaCli64 -AdpFwFlash -f \
/tmp/tmpfw/ActualFirmwareFiles/12.9.0.0049_26Oct2010_2108_FW_Image.rom -a0
e. Flush file system data to disks:
sync
/opt/MegaRAID/MegaCli/MegaCli64 -AdpCacheFlush -a0
f. Reboot the system: reboot
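
Before moving on, it is worth confirming that the controller cache settings enabled earlier in this list actually took effect (the same check is repeated in the post-installation section):

# each logical drive should report WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -a0 | grep -i 'Cache Policy'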

- Now proceed to applying the rest of the database host convenience pack.


PREPARING FOR DATABASE HOST PATCHING:
-------------------------------------

* Run all steps as root user.

* Obtain Lights Out (LO) and serial console access for the database hosts to
be patched. It is useful if an issue needs fixing during the patch or if
some firmware update does not go through properly.

* For Exadata V1 (HP hardware), the serial console can be accessed by telnet
or ssh to the LO host name or IP address as the administrative user,
such as admin. To start the serial console:

VSP

To stop it, press the escape key (ESC) followed by ( and then issue stop:

ESC (
stop

* For Exadata Oracle-Sun hardware, ssh to the ILOM host name or IP
address as the root user. To start the serial console:

start /SP/console

To stop it, press the escape key (ESC) followed by ( and then issue stop /SP/console:

ESC (
stop /SP/console

* Check output of /usr/local/bin/imagehistory.

DB host must be at release 11.2.1.2.x or 11.2.1.3.1 or 11.2.2.1.x for Exadata V2.
DB host must be at release 11.2.1.2.x for Exadata machine V1.

* Unmount all NFS and DBFS file systems (a quick check sketch appears at the end of this section).

* All Oracle software must be shut down before patching.

/u01/app/11.2.0/grid/bin/crsctl stop crs -f

Verify that no clusterware, ASM, or database processes are running:

ps -ef | grep -i 'grid\|ASM\|ora' | grep -v grep

Shut down oswatcher:

cd /opt/oracle.oswatcher/osw
./stopOSW.sh
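
As a last sanity check before kicking off the patch, make sure the NFS/DBFS unmounts really happened (a rough sketch; no output should come back):

mount | grep -Ei 'nfs|dbfs'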

Oracle recommends that you do not interrupt the procedure with control-C.
-------------------------------------------------------------------

PATCHING DATABASE HOSTS:
-----------------------

00. Repeat steps 01 onwards for EACH database host. If you are taking
deployment-wide downtime for the patching, then these steps
may be performed in parallel on all database hosts.

01. Update the resource limits for the database and the grid users:

NOTE: This step does NOT apply if you have already customized these values
for your specific deployment and database requirements; in that case, do NOT
execute it.

01.a Calculate 75% of the physical memory on the machine.

x=$(( $(awk '/MemTotal:/ {print $2}' /proc/meminfo) * 3 / 4 ))
echo $x

01.b Edit the /etc/security/limits.conf file to update or add the following
limits for the database owner (orauser) and the grid infrastructure
user (griduser). Your deployment may use the same operating system
user for both, and it may be named oracle. Adjust the following
appropriately.

########### BEGIN DO NOT REMOVE Added by Oracle ###########

orauser soft core unlimited
orauser hard core unlimited
orauser soft nproc 131072
orauser hard nproc 131072
orauser soft nofile 131072
orauser hard nofile 131072
orauser soft memlock <value from step 01.a>
orauser hard memlock <value from step 01.a>

griduser soft core unlimited
griduser hard core unlimited
griduser soft nproc 131072
griduser hard nproc 131072
griduser soft nofile 131072
griduser hard nofile 131072
griduser soft memlock <value from step 01.a>
griduser hard memlock <value from step 01.a>

########### END DO NOT REMOVE Added by Oracle ###########

02. Log in as the root user to the database host and copy over the
db_patch_11.2.2.2.0.101206.2.zip file from the unzipped cell patch.

03. Unzip the db_patch_11.2.2.2.0.101206.2.zip file. It will create the directory:

db_patch_11.2.2.2.0.101206.2

04. Change to the db_patch_11.2.2.2.0.101206.2 directory.

05. Run ./install.sh [this should be run locally on the host that is being patched].

It will return to the prompt immediately after submitting the patch
in the background. Returning to the prompt does NOT mean the patch is
complete.

NOTE: The install.sh will submit the patch process in the background
to prevent interruption of the patch in case the login session
gets terminated due to network connection break. The database host
will reboot as part of the patch process after a while.

NOTE: The database host will reboot as part of this update.
You will lose your connections, including the ssh connection,
and the database host may appear to hang until the ssh connection
eventually times out. If the LO gets updated, then the same
connection loss or freeze will be experienced.
Retry the connection after about 10 minutes.

06. Verify the patch status:

After the system is rebooted and up,

06.1. /usr/local/bin/imageinfo should show Image version as 11.2.2.2.0.
For example,

Image version: 11.2.2.2.0.101206.2
Image activated: 2010-12-06 11:35:59 -0700
Image status: success

06.2 Verify the ofa rpm version

Find the kernel value: uname -r
Find the ofa rpm version: rpm -qa | grep ofa

06.2.a The ofa rpm version should match the following table:

kernel                   ofa

2.6.18-194.3.1.0.3       1.5.1-4.0.28
2.6.18-194.3.1.0.2       None, the ofa is part of the kernel
2.6.18-128.1.16.0.1      1.4.2-14
2.6.18-53.1.21.2.1       1.4.2-14

06.3 Only for database host models X4170, X4170M2:

Verify the LSI disk controller firmware is now 12.9.0-0049

/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aAll | grep 'FW Package Build'

06.4. Only for the model X4170 database hosts:

06.4.a Verify the version of ILOM firmware: ipmitool sunoem version

Version: 3.0.9.19.a r55943

06.4.b Verify the thermal profile to ensure appropriate fan speeds.

ipmitool sunoem cli "show /SP/policy"

Should display: FLASH_ACCELERATOR_CARD_INSTALLED=enabled

Post Installation
-----------------
- Now proceed to applying the rest of the database host convenience pack.

- After the database host reboots at the end of the convenience pack, enable the
clusterware and bring up the rest of the Oracle stack:
/u01/app/11.2.0/grid/bin/crsctl enable crs
- Verify that the controller cache is on by running:

/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -a0

The output for each Logical Drive should show the following lines:

Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
- Reboot, and then mount any unmounted ASM disk groups [alter diskgroup ... mount;], roughly as sketched below.
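
For our deployment the disk groups are the ones that were offlined at the start, so the mount step looks roughly like this (run against the local ASM instance; adjust the names to your deployment):

# mount the disk groups after the post-patch reboot (names from the earlier offline step)
sqlplus -s / as sysasm <<'EOF'
alter diskgroup DATA1 mount;
alter diskgroup DATA2 mount;
alter diskgroup SYSTEMDG mount;
alter diskgroup RECO mount;
EOF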


