Troubleshooting a lost diskset on Sun Cluster

In the middle of the night I checked my cluster lab, and it showed that my apache resource group was not running. I re-checked and found that the cluster node had not mounted the webds file system on /global/web. My webds diskset was gone, and I don't know the root cause of the problem… 😀 Maybe I was doing another lab on the same node and changed its configuration without realizing it.

If I check the status of the webds diskset with metaset, it shows that no node owns the diskset:

bash-3.00# metaset -s webds

Set name = webds, Set number = 2

Host                Owner

Mediator Host(s)    Aliases

Driv Dbase

d6   Yes

And if I run metastat for webds, it fails with an error:

bash-3.00# metastat -s webds
metastat: clnode-01: webds: must be owner of the set for this command
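
The "no owner" check above can also be scripted. This is only a minimal sketch: diskset_owner is a helper name I made up, and it assumes the usual metaset layout where each host row sits under the "Host / Owner" header and the owning host has "Yes" in the second column. It also assumes a POSIX awk (nawk on Solaris 10).

```shell
# diskset_owner: read `metaset -s <set>` output on stdin and print every
# host whose Owner column says "Yes". Prints nothing when no host owns the set.
diskset_owner() {
    awk '
        # start of the host table
        $1 == "Host" && $2 == "Owner" { in_hosts = 1; next }
        # later tables (mediators, drives) end it
        $1 == "Mediator" || $1 == "Driv" || $1 == "Drive" { in_hosts = 0 }
        in_hosts && NF >= 2 && $2 == "Yes" { print $1 }
    '
}

# Hypothetical usage on a cluster node:
#   owner=$(metaset -s webds | diskset_owner)
#   [ -n "$owner" ] || echo "webds: no node owns this diskset" >&2
```

With the output shown above, the host table is empty, so the function prints nothing.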

The resolution is simple. Here are my troubleshooting steps:
1. Boot node-1 and node-2 in non-cluster mode.
2. Comment out the shared device entry in /etc/vfstab.
3. Boot node-1 and node-2 in cluster mode.
4. On node-2, force purge the lost diskset:

metaset -s <setname> -P -f

5. On node-1, force purge the lost diskset:

metaset -s <setname> -P -f

Then re-create your diskset:

metaset -s <setname> -a -h NodeA NodeB
metaset -s <setname> -a <diskpath0> <diskpath1> ... <diskpathN>
metaset -s <setname> -a -m NodeA NodeB

(metaset -s <setname> should now show the new set and its ownership)
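
Two parts of the steps above can be sketched in shell. Both function names are my own invention, not cluster tools. comment_out_mount covers step 2 and assumes the standard vfstab layout, where the mount point is the third field. print_recreate_plan only echoes the purge and re-create commands with your values filled in, so you can review them before typing anything on a real node. A POSIX awk (nawk on Solaris 10) is assumed.

```shell
# comment_out_mount MOUNTPOINT FILE
# Print FILE (vfstab format) with every active entry whose third field
# (the mount point) equals MOUNTPOINT commented out. FILE is not modified.
comment_out_mount() {
    awk '$0 !~ /^#/ && $3 == mp { print "#" $0; next } { print }' mp="$1" "$2"
}

# print_recreate_plan SETNAME "HOST1 HOST2" DISK...
# Echo the commands from steps 4 and 5 without running them.
print_recreate_plan() {
    set_name=$1
    hosts=$2
    shift 2
    echo "metaset -s $set_name -P -f"          # run on each node
    echo "metaset -s $set_name -a -h $hosts"   # add the hosts
    echo "metaset -s $set_name -a $*"          # add the disks
    echo "metaset -s $set_name -a -m $hosts"   # add the mediator hosts
}

# Hypothetical usage with the names from this lab:
#   comment_out_mount /global/web /etc/vfstab > /tmp/vfstab.new
#   print_recreate_plan webds "clnode-01 clnode-02" /dev/did/rdsk/d6
```

Writing the edited vfstab to a separate file first leaves the original untouched until you have compared the two.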

Note: because my webds diskset is an SVM diskset, I also re-created the concat volume and the soft partition on it:

bash-3.00# metainit -s webds d1 1 1 /dev/did/rdsk/d6s0
webds/d1: Concat/Stripe is setup
bash-3.00# metainit -s webds d200 -p d1 3g
d200: Soft Partition is setup
bash-3.00# metastat -s webds
webds/d200: Soft Partition
    Device: webds/d1
    State: Okay
    Size: 6291456 blocks (3.0 GB)
        Extent              Start Block              Block count
             0                       32                  6291456

webds/d1: Concat/Stripe
    Size: 10457088 blocks (5.0 GB)
    Stripe 0:
        Device   Start Block  Dbase        State Reloc Hot Spare
        d6s0            0     No            Okay   No

Device Relocation Information:
Device   Reloc  Device ID
d6   No         -
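
As a quick sanity check on the sizes, the 3g I asked for should match the block count that metastat reports. A small sketch, assuming metastat counts 512-byte blocks and that "g" means 2^30 bytes (gb_to_blocks is just a helper name I made up):

```shell
# gb_to_blocks N: number of 512-byte blocks in N gigabytes (1 GB = 2^30 bytes).
gb_to_blocks() {
    echo $(( $1 * 1024 * 1024 * 1024 / 512 ))
}

gb_to_blocks 3    # prints 6291456
```

That agrees with the "Size: 6291456 blocks (3.0 GB)" line for webds/d200 above.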

Test the mount and list the directory:

bash-3.00# mount /dev/md/webds/dsk/d200 /global/web
bash-3.00# ls -l /global/web
total 24
drwxr-xr-x   2 root     root         512 May 31 21:55 bin
drwxr-xr-x   2 root     bin          512 May 25 06:22 cgi-bin
drwxr-xr-x   2 root     root         512 May 31 21:59 conf
drwxr-xr-x   2 root     bin         1024 May 25 06:22 htdocs
drwx------   2 root     root        8192 May 31 20:56 lost+found

In theory you should now be happy: your cluster resource group is running again.

bash-3.00# clrg status

=== Cluster Resource Groups ===

Group Name    Node Name             Suspended   Status
----------    ---------             ---------   ------
nfs-rg        clnode-01             No          Online
              clnode-02             No          Offline

apache-rg     clnode-01:webapp-01   No          Online
              clnode-01:webapp-02   No          Offline

bash-3.00# clrs status

=== Cluster Resources ===

Resource Name      Node Name             State     Status Message
-------------      ---------             -----     --------------
nfs-res            clnode-01             Online    Online - Service is online.
                   clnode-02             Offline   Offline

nfs-stor           clnode-01             Online    Online
                   clnode-02             Offline   Offline

mycluster-nfs      clnode-01             Online    Online - LogicalHostname online.
                   clnode-02             Offline   Offline

apache-res         clnode-01:webapp-01   Online    Online - Service is online.
                   clnode-01:webapp-02   Offline   Offline

apache-stor        clnode-01:webapp-01   Online    Online
                   clnode-01:webapp-02   Offline   Offline

mycluster-webapp   clnode-01:webapp-01   Online    Online - LogicalHostname online.
                   clnode-01:webapp-02   Offline   Offline
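
If you would rather not read the table by eye, the clrg status output above can be checked with a small filter. This is a rough sketch under assumptions: offline_groups is a made-up name, it expects the table layout exactly as printed above (group name in the first column, continuation rows indented), and it treats only a plain "Online" status as up. A POSIX awk (nawk on Solaris 10) is assumed.

```shell
# offline_groups: read `clrg status` output on stdin and print each resource
# group that is not Online on any node.
offline_groups() {
    awk '
        # skip blank lines, the banner, the header and the dashed rule
        NF == 0 || /^===/ || /^Group Name/ || /^-/ { next }
        # a line starting in column one opens a new group
        /^[^ \t]/ {
            if (grp != "" && !up) print grp
            grp = $1; up = 0
            if (NF == 4 && $4 == "Online") up = 1
            next
        }
        # indented continuation row: node, suspended, status
        NF == 3 && $3 == "Online" { up = 1 }
        END { if (grp != "" && !up) print grp }
    '
}

# Hypothetical usage on a cluster node:
#   clrg status | offline_groups
```

For the output shown above it prints nothing, because both nfs-rg and apache-rg are Online on one node.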

Reboot your nodes if needed. 🙂


