There is no definitive error message for these issues. Most problems occur when upgrading existing systems to Oracle 11gR2 if the recommendation to create a new operating system user (usually "grid") for the Grid Infrastructure is followed.
In itself this is a good idea as it provides separation between Grid Infrastructure (Clusterware + ASM) and RAC (Database software). A really useful benefit is that the $ORACLE_HOME environment variable for the Grid Infrastructure can be hard-coded into the profile of the grid user. For example, on Linux:
export ORACLE_HOME=/u01/app/11.2.0/grid
export PATH=$ORACLE_HOME/bin:$PATH
You can also hard-code the ORACLE_SID environment variable in the profile, as the ASM instance name is fixed on each node.
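As a small illustration of the point above, the profile entry for the first node might look like the fragment below. The value +ASM1 is an assumption based on the default ASM instance naming; check the actual instance name on each node before hard-coding it.

```shell
# Fragment for the grid user's profile on the first node (sketch only).
# +ASM1 assumes default ASM instance naming; the second node would normally use +ASM2.
# The running instance name can be confirmed with: ps -ef | grep asm_pmon
export ORACLE_SID=+ASM1
```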
There are some security benefits in configuring separate grid and oracle users, particularly where different teams administer Grid Infrastructure and the database.
The grid user can easily be added to an existing system. Initially I generally specify the grid user with the same groups as the oracle user i.e. an initial operating system group of oinstall and an additional group of dba.
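A minimal sketch of the user creation follows. The function name add_grid_user and the DRY_RUN variable are illustrative only; the user and group names are the ones mentioned above. By default it only prints the command so that it can be reviewed before being run as root.

```shell
#!/bin/sh
# Sketch only: prints the useradd command by default.
# Set DRY_RUN=0 and run as root to actually create the user.
# $1 = user name, $2 = initial group, $3 = additional group
add_grid_user() {
    user=${1:-grid}
    primary=${2:-oinstall}
    extra=${3:-dba}
    cmd="useradd -g $primary -G $extra $user"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

add_grid_user
```

Once the user exists, "id grid" and "id oracle" should report the same groups.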
A second $ORACLE_BASE directory will be created for the grid user (e.g. /u01/app/grid) which will contain the diagnostic areas for the SCAN listeners and the local listener. Setting permissions prior to installation can be a bit messy, particularly for the inventory, which must be updatable by both the grid and oracle users. Experience has shown it is generally best to pre-create the $ORACLE_BASE directories (e.g. /u01/app/oracle and /u01/app/grid) to ensure that they have the correct permissions. Remember to repeat these steps on every node in the cluster (shared Oracle homes are not supported for the Grid Infrastructure).
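The pre-creation step could be scripted along the following lines. This is a sketch, not a definitive procedure: the function name precreate_base is hypothetical, and the 775 mode is an assumption chosen so that members of the owning group can write to the directory.

```shell
#!/bin/sh
# Sketch: create a base directory with group-writable permissions.
# $1 = directory, $2 = owner as user:group
precreate_base() {
    dir=$1
    owner=$2
    mkdir -p "$dir" || return 1
    chmod 775 "$dir" || return 1
    # chown needs root and an existing account; otherwise it is skipped
    if [ "$(id -u)" -eq 0 ] && id "${owner%%:*}" >/dev/null 2>&1; then
        chown "$owner" "$dir" || return 1
    fi
    return 0
}

# Example calls, to be run as root on every node:
# precreate_base /u01/app/grid   grid:oinstall
# precreate_base /u01/app/oracle oracle:oinstall
```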
If you configure a grid user in Linux, remember to update the security limits (/etc/security/limits.conf). In 18.104.22.168 and above the CVU fixup scripts will generate any additional settings required. For example, the following settings may be required in /etc/security/limits.conf:
grid soft nproc 2047
grid hard nproc 16384
grid soft nofile 1024
grid hard nofile 65536
oracle soft nproc 2047
oracle hard nproc 16384
oracle soft nofile 1024
oracle hard nofile 65536

In addition, /etc/profile should be updated to include the grid user. For example:
if [ $USER = "grid" ] || [ $USER = "oracle" ]; then
    if [ $SHELL = "/bin/ksh" ]; then
        ulimit -p 16384
        ulimit -n 65536
    else
        ulimit -u 16384 -n 65536
    fi
    umask 022
fi

So we get to the one nasty issue waiting to bite you. I think this problem will only occur for upgrades where a previous version of Clusterware has been installed under the oracle user. When you run the "root.sh" script for the first time, it may hang just after starting the ohasd daemon. If you leave the script for long enough, it may continue for a few more steps before failing completely, usually because it has been unable to configure ASM. Digging around in the error output, this may be because the ocssd daemon has failed to start. If this is the case, take a look in the /var/tmp/.oracle directory (or equivalent on non-Linux platforms).

The /var/tmp/.oracle directory contains the socket files for various Oracle daemons and background processes. If Clusterware has already run on the node under the oracle user, it will have left a few socket files in this directory. For reasons that are not immediately apparent, these socket files cannot be used by other processes in the same operating system group (even though the permissions suggest otherwise).
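One way to spot leftover files is to list anything in the socket directory that is not owned by the grid user. The sketch below assumes GNU find; the function name stale_sockets is hypothetical, and the default owner of grid is an assumption taken from the user name used in this article.

```shell
#!/bin/sh
# Sketch: list entries in the socket directory that are NOT owned by the given user.
# $1 = directory (default /var/tmp/.oracle), $2 = expected owner (default grid)
stale_sockets() {
    dir=${1:-/var/tmp/.oracle}
    owner=${2:-grid}
    # Nothing to report if the directory does not exist
    [ -d "$dir" ] || return 0
    find "$dir" -mindepth 1 ! -user "$owner" -print
}

# Example: stale_sockets /var/tmp/.oracle grid
```

Any file reported here was most likely left behind by a Clusterware stack that ran under a different user.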
In Oracle 11.2 and above, root.sh has an internal checkpoint mechanism and is therefore restartable. I generally use this mechanism to troubleshoot initial builds. However, I like to revert to a known state and to perform a clean installation in order to produce full step-by-step installation documentation for my customers.
If I need to remove a failed Grid Infrastructure installation from a test system I perform the following actions. This is not the recommended support procedure, but is generally sufficient for test systems.
This list is for Enterprise Linux 5 - the location of some of the files varies, particularly on Solaris.
Delete the following files and directories:
rm -rf /etc/oracle
rm -f /etc/oratab
rm -f /etc/oraInst.loc
Delete the contents of the Grid Infrastructure base e.g.
rm -rf /u01/app/grid/*
Check that the ownership of the above directory is still grid:oinstall.
Delete the contents of the Grid Infrastructure home e.g.
rm -rf /u01/app/11.2.0
Delete the inventory e.g.
rm -rf /u01/app/oraInventory
Initialize any ASM diskgroups assigned to CRS:
dd if=/dev/zero of=/dev/mapper/crs1p1 bs=1M count=1000
dd if=/dev/zero of=/dev/mapper/crs2p1 bs=1M count=1000
Take care to initialize partitions (if created) rather than the devices.
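Because dd will happily create a regular file if the target path is mistyped, it can be worth wrapping the command in a small guard. The sketch below is one possible approach; the function name zero_crs_device is hypothetical and the default of 1000 MB simply mirrors the count used above.

```shell
#!/bin/sh
# Sketch: zero the start of a device only if the path is a block device.
# $1 = device path, $2 = size in MB (default 1000, as in the text)
zero_crs_device() {
    dev=$1
    mb=${2:-1000}
    if [ ! -b "$dev" ]; then
        echo "refusing: $dev is not a block device" >&2
        return 1
    fi
    dd if=/dev/zero of="$dev" bs=1M count="$mb"
}

# Example (destructive, run as root): zero_crs_device /dev/mapper/crs1p1
```

The guard does not check whether the path is a partition or a whole device, so that still needs to be verified by eye.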
If you are using Enterprise Linux 5 then remove the ohasd entry from /etc/inittab. I always use vi to do this; if you are not confident with this tool, the OUI makes a copy of /etc/inittab prior to adding the ohasd entry; you can use the copy to revert /etc/inittab to its original state.
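For those who prefer not to edit the file by hand, the entry can also be filtered out with sed. This is a sketch only: the function name strip_ohasd is hypothetical, and it assumes that the entry added by the installer is the only line referencing init.ohasd. It writes to standard output so that the result can be reviewed before it replaces the live file.

```shell
#!/bin/sh
# Sketch: print an inittab-style file with any init.ohasd line removed.
# $1 = path to the file to read (the file itself is not modified)
strip_ohasd() {
    sed '/init\.ohasd/d' "$1"
}

# Example, run as root:
# strip_ohasd /etc/inittab > /tmp/inittab.new
# diff /etc/inittab /tmp/inittab.new     # review before copying back
```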
The init file structure has been redesigned for Enterprise Linux 6. It is no longer necessary to update /etc/inittab.
Remove the following files:
At this stage, reboot the server. Although a number of files remain, such as the /etc/init.d/* files and the /etc/rc*.d symbolic links, the files these reference have been deleted and the reboot will clear any remaining files.
Next, delete all of the files in /var/tmp/.oracle.
rm -f /var/tmp/.oracle/*
Finally I always reboot all the nodes to ensure that any failed daemons or locks have been removed and that I have a known starting point for the next attempt at the installation.