Investigating CVU Failures

This page describes an investigation into an error reported by the Cluster Verification Utility (CVU). Oracle 11.2 includes an embedded version of the CVU, which is executed at the end of the OUI interview process. Since this embedded CVU can fix many of the problems it encounters without restarting the installer, I have got out of the habit of running the stand-alone CVU prior to an installation. On this occasion, however, the OUI check failed, reporting that the shared disks were not sharable; the rest of this page describes the investigative process we followed to resolve the issue.

In this example the unsharable multipath device was /dev/mapper/data (partition /dev/mapper/datap1); the servers were server14 and server15.

The first step was to reproduce the problem using the stand-alone version of the CVU. In Oracle 11.2 the runcluvfy.sh script is in the same directory as the runInstaller script. We ran it as follows:

./runcluvfy.sh comp ssa -n server14,server15 -s /dev/mapper/datap1

This utility produced the following output:

Verifying shared storage accessibility
Checking shared storage accessibility...
"/dev/mapper/datap1" is not shared
Shared storage check failed on nodes "server15,server14"
Verification of shared storage accessibility was unsuccessful on all the specified nodes.

We tried appending the -verbose parameter, but this did not produce any more informative output.
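
For reference, this was simply the previous invocation with the flag appended:

./runcluvfy.sh comp ssa -n server14,server15 -s /dev/mapper/datap1 -verbose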

The next step was to enable CVU tracing. On Linux this can be done by setting the CV_TRACELOC environment variable. For example:

mkdir /tmp/cvutrace
export CV_TRACELOC=/tmp/cvutrace

The environment variable specifies that trace should be written to a file called cvutrace.log.0 in the /tmp/cvutrace directory.

The trace output allowed us to identify exactly where the failure was occurring, but not its actual cause. By searching the output around the point of failure, we discovered that the error occurred immediately after a call to exectask.
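
The trace file is fairly verbose, so a quick way to locate the relevant section is simply to search it for exectask, for example:

grep -n exectask /tmp/cvutrace/cvutrace.log.0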

Exectask is an executable that is copied to the /tmp/CVU_11.2.0.2.0_grid directory. It appears to contain all of the port-specific system calls required by the CVU. It takes around 20 options, and each option returns an XML document containing output and status/error messages. From the CVU trace we determined that exectask was being called with the following parameters:

/tmp/CVU_11.2.0.2.0_grid/exectask -getstinfo \
  -getdiskinfo /dev/mapper/data%/dev/mapper/datap1

The CVU is calling exectask with the -getstinfo (get storage information) parameter and the -getdiskinfo (get disk information) subparameter. Note that the format of the disk name is device%partition.
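
To compare the results from the two servers, the same command can be run manually on each node and the output captured to a file; a minimal sketch (the output file names are illustrative) is:

# on server14
/tmp/CVU_11.2.0.2.0_grid/exectask -getstinfo \
  -getdiskinfo /dev/mapper/data%/dev/mapper/datap1 > /tmp/getstinfo_server14.xml

# on server15
/tmp/CVU_11.2.0.2.0_grid/exectask -getstinfo \
  -getdiskinfo /dev/mapper/data%/dev/mapper/datap1 > /tmp/getstinfo_server15.xml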

When we executed exectask stand-alone, it returned a different XML document on each node.

On the first node:

<CV_VAL>
  <disk>
    <disk_name>/dev/mapper/data</disk_name>
    <disk_signature>PAFUCC29SZ4028</disk_signature>
    <NUMPARTS>0</NUMPARTS>
    <disk_state>0</disk_state>
    <disk_size>5362849792</disk_size>
    <disk_owner>grid</disk_owner>
    <disk_group>asmadmin</disk_group>
    <disk_permissions>0660</disk_permissions>
  </disk>
</CV_VAL>
<CV_VRES>0</CV_VRES>
<CV_LOG>Exectask:getDiskInfo success</CV_LOG>
<CV_ERES>0</CV_ERES>

On the second node:

<CV_VAL>
  <disk>
    <disk_name>/dev/mapper/data</disk_name>
    <disk_signature>PAFUCC29SZ4064</disk_signature>
    <NUMPARTS>0</NUMPARTS>
    <disk_state>0</disk_state>
    <disk_size>5362849792</disk_size>
    <disk_owner>grid</disk_owner>
    <disk_group>asmadmin</disk_group>
    <disk_permissions>0660</disk_permissions>
  </disk>
</CV_VAL>
<CV_VRES>0</CV_VRES>
<CV_LOG>Exectask:getDiskInfo success</CV_LOG>
<CV_ERES>0</CV_ERES>

The output has been reformatted to improve readability.

On closer inspection, the only difference is the disk signature field.
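
With the two documents saved to files as in the sketch above (again, the file names are illustrative), a simple diff confirms that only the signature line differs:

diff /tmp/getstinfo_server14.xml /tmp/getstinfo_server15.xml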

We used the multipath utility to determine the device names of the underlying SCSI disks (there are probably several other ways to achieve this):

[root@server14]# multipath -ll /dev/mapper/data
data (36001438005dedd1b0000300002260000) dm-4 HP,HSV400
[size=5.0G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 1:0:0:5 sda 8:64  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:1:5 sdf 8:144 [active][ready]

In the above example the devices are /dev/sda and /dev/sdf. We will use /dev/sda.

We then used the scsi_id utility, first to verify the WWID. The default output is:

[root@server14]# /sbin/scsi_id -g -u -s /block/sda
36001438005dedd1b0000300002260000

The -p option of scsi_id specifies which SCSI inquiry (VPD) page is used to generate the output; the page codes are not very user-friendly. The default is page 0x83, so the following command is identical to the one above:

[root@server14]# /sbin/scsi_id -p 0x83 -g -u -s /block/sda
36001438005dedd1b0000300002260000

In this case we are not interested in the WWID, but in the disk identifier. For an HP EVA SAN this turns out to be derived from the controller ID and is reported in SCSI page 0x80. We can use the following command to obtain this value:

[root@server14]# /sbin/scsi_id -p 0x80 -g -u -s /block/sda
SHV_HSV400_PAFUCC29SZ4028

If we execute the same command on the second node we get the following output:

[root@server15]# /sbin/scsi_id -p 0x80 -g -u -s /block/sda
SHV_HSV400_PAFUCC29SZ4064

So the disk identifiers reported by /sbin/scsi_id are different. If the storage is shared they should be identical.

At this point we had proved that the problem lay at the storage/operating system level and that it could be reproduced without any Oracle software.
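
Because the check reduces to a single scsi_id call per node, a short loop gives a quick, repeatable test while investigating (this sketch assumes root ssh equivalence between the servers, which may not be configured in every environment):

for node in server14 server15; do
  echo -n "$node: "
  ssh $node /sbin/scsi_id -p 0x80 -g -u -s /block/sda
done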

After drawing a diagram of the storage we concluded that the problem was with one of the fabric switches, so we disabled the offender. The scsi_id command now returned the same output for both nodes:

[root@server14]# /sbin/scsi_id -p 0x80 -g -u -s /block/sda
SHV_HSV400_PAFUCC29SZ4028
[root@server15]# /sbin/scsi_id -p 0x80 -g -u -s /block/sda
SHV_HSV400_PAFUCC29SZ4028

We could now retry the CVU with more confidence:

./runcluvfy.sh comp ssa -n server14,server15 -s /dev/mapper/datap1
Verifying shared storage accessibility
Checking shared storage accessibility...
"/dev/mapper/datap1" is shared
Shared storage check was successful on nodes "server15,server14"
Verification of shared storage accessibility was successful.

Following the successful execution of the CVU, the Grid Infrastructure installation proceeded without any further problems.