
Chapter 3 Key Concepts - Administration and Application Development

This chapter describes the key concepts related to the software components of a SunPlex system.

This information is primarily intended for system administrators and application developers who work with the SunPlex API and SDK. Cluster system administrators can use this information in preparation for installing, configuring, and administering cluster software. Application developers can use the information to understand the cluster environment in which they operate.

You can choose how to install, configure, and administer the SunPlex system from several user interfaces. You can perform system administration tasks with either the SunPlex Manager graphical user interface (GUI) or the documented command-line interface. On top of the command-line interface are several utilities that simplify certain installation and configuration tasks. The SunPlex system also includes a module that runs as part of Sun Management Center and provides a GUI for certain cluster tasks. This module is available only for SPARC-based clusters. For full descriptions of the administrative interfaces, see “Management Tools” in the Sun Cluster Administration Guide for Solaris OS.

The time must be synchronized between all nodes of a cluster. Whether the time synchronization of the cluster nodes takes place with an external time source is irrelevant for the cluster operation. The SunPlex system uses the Network Time Protocol (NTP) to synchronize clocks between nodes.

In general, changing the system clock by a fraction of a second does not cause any problems. However, if you run commands that set the system time (interactively or within scripts) on an active cluster, you can force a time change much larger than a fraction of a second in order to synchronize the system clock with the time source. This forced change can cause problems with file modification timestamps or confuse the NTP service.

When you install the Solaris Operating Environment on each cluster node, you can change the default time and date for the node. In general, you can accept the factory default settings.

When you install the Sun Cluster software, NTP is also configured for the cluster as one step of the process. The Sun Cluster software provides a template file on each installed cluster node that establishes a peer relationship between all cluster nodes, with one node designated as the “preferred” node. Nodes are identified by their private host names, and time synchronization takes place over the cluster interconnect. For instructions on configuring the cluster for NTP, see “Installing and Configuring Sun Cluster Software” in the Sun Cluster Software Installation Guide for Solaris OS.

Alternatively, you can set up one or more NTP servers outside of the cluster and modify the NTP configuration file to reflect this configuration.

In normal operation, you should never need to adjust the cluster time. However, if the time was set incorrectly when you installed the operating environment and you want to change it, see “Managing the Cluster” in the Sun Cluster System Administration Guide for Solaris OS for the procedure.

The SunPlex system makes all the components on the "path" between users and data highly available, including network interfaces, the applications themselves, the file system, and multihost devices. In general, a cluster component is highly available if it remains in operation despite any single failure (software or hardware) in the system.

The following table shows the types of SunPlex component failure (hardware and software) and the corresponding form of recovery that is built into the framework.

Table 3–1 SunPlex Fault Detection and Recovery Levels

| Failed cluster component | Software recovery | Hardware recovery |
|---|---|---|
| Data service | HA API, HA framework | N/A |
| Public network adapter | IP network multipathing | Multiple public network adapter cards |
| Cluster file system | Primary and secondary node replicas | Multihost devices |
| Mirrored multihost device | Volume management (Solaris Volume Manager and VERITAS Volume Manager, which is available only in SPARC-based clusters) | Hardware RAID-5 (for example, Sun StorEdge A3x00) |
| Global device | Primary and secondary node replicas | Multiple paths to the device, cluster transport connection points |
| Private network | HA transport software | Multiple private, hardware-independent networks |
| Node | CMM, failfast driver | Multiple nodes |

The high availability framework in Sun Cluster software quickly detects a node failure and creates a new, equivalent server for the framework resources on a different node in the cluster. At no time are all framework resources unavailable. Framework resources that are not affected by a crashed node are fully available during the recovery. In addition, the framework resources of the failed node are available again as soon as they are restored. A restored framework resource does not have to wait for all other framework resources to be restored.

Most of the highly available framework resources are restored transparently to the applications (data services) that use them. The semantics of the framework resource access are fully preserved in the event of a node failure. The applications simply do not notice that the framework resource server has been moved to another node. The failure of a single node is completely transparent to the programs on the remaining nodes as long as there is an alternate hardware path to the disks of another node that is used by the files, devices, and volumes connected to that node. An example is using multihost devices with ports for multiple nodes.

Cluster Membership Monitor (CMM)

All nodes must reach a consistent agreement on cluster membership to protect data from corruption. When necessary, the CMM coordinates a reconfiguration of the cluster services (applications) in response to a failure.

The CMM receives information about connectivity with other nodes from the cluster transport layer. During a reconfiguration, the CMM uses the cluster interconnect to exchange status information.

After detecting a change in cluster membership, the CMM performs a synchronized configuration of the cluster, which may redistribute cluster resources based on the new membership in the cluster.

Unlike previous Sun Cluster software versions, the CMM runs entirely in the kernel.

For more information about how the cluster protects itself from partitioning into separate clusters, see About Error Protection.

Failfast mechanism

When the CMM detects a critical problem with a node, it asks the cluster framework to forcibly shut down (panic) the node and remove it from cluster membership. This mechanism is called failfast. Failfast causes a node to shut down in two ways.

  • If a node leaves the cluster and then tries to start a new cluster without quorum, it is fenced off from access to the shared disks. For more details on this use of failfast, see About Failure Protection.

  • If one or more cluster-specific daemons fail, the CMM detects the failure and the node panics.

If the failure of a cluster daemon panics a node, a message similar to this is displayed in the console for that node.

panic[cpu0]/thread=40e60: Failfast: Aborting because "pmfd" died 35 seconds ago.
409b8 cl_runtime:__0FZsc_syslog_msg_log_no_argsPviTCPCcTB+48 (70f900, 30, 70df54, 407acc, 0)
%l0-7: 1006c80 000000a 000000a 10093bc 406d3c80 7110340 0000000 4001 fbf0

After the panic, the node can reboot and try to rejoin the cluster or, in clusters of SPARC-based systems, remain at the OpenBoot PROM (OBP) prompt. The setting of a boot parameter determines which action is taken. You can set this parameter with eeprom(1M) at the OpenBoot PROM prompt.

Cluster Configuration Repository (CCR)

The CCR uses a two-phase commit algorithm for updates: an update must complete successfully on all cluster members, or the update is rolled back. The CCR uses the cluster interconnect to apply the distributed updates.

Caution -

The CCR is made up of text files, but under no circumstances should you edit the CCR files manually. Each file contains a checksum entry to ensure consistency between the nodes. Manually updating the CCR files can cause a node or the entire cluster to stop working.

The CCR uses the CMM to ensure that a cluster only runs with a specified quorum. The CCR is responsible for checking data consistency across the cluster, performing any necessary recovery and providing updates for the data.

The SunPlex system uses global devices to provide cluster-wide, highly available access to any device in a cluster from any node, regardless of where the device is physically attached. In general, if a node fails while providing access to a global device, Sun Cluster software automatically discovers another path to the device and redirects access to that path. SunPlex global devices include disks, CD-ROMs, and tapes. However, disks are the only multiported global devices that are supported. This means that CD-ROM and tape devices are not currently highly available devices. The local disks on each server are also not multiported and are therefore not highly available.

The cluster automatically assigns a unique ID to each disk, CD-ROM drive, and tape device in the cluster. This assignment enables consistent access to each device from any cluster node. The global device namespace is held in its own directory. For more information, see Global Namespace.

Global multiport devices provide several paths to a device. Because multihost disks are part of a disk device group hosted by multiple nodes, they are made highly available.

Device ID (DID)

Sun Cluster software manages global devices through a construct called the device ID (DID) pseudo driver. This driver automatically assigns a unique ID to every device in the cluster, including multihost disks, tape drives, and CD-ROM drives.

The DID pseudo driver is an integral part of the global device access feature of the cluster. The DID driver probes all nodes of the cluster, builds a list of unique disk devices, and assigns each disk a unique major and minor number that are consistent on all nodes of the cluster. Global devices are accessed through the unique device ID assigned by the DID driver instead of through the traditional Solaris device IDs.

This approach ensures that any application that accesses disks (such as a volume manager or an application that uses raw devices) uses a consistent path across the cluster. This consistency is especially important for multihost disks, because the local major and minor numbers for each device can differ from node to node, which also changes the Solaris device names. For example, one node might see a multihost disk under one Solaris device name, while node2 sees the same disk under a completely different name. The DID driver assigns a global name, such as d10, that the nodes use instead, giving each node a consistent mapping to the multihost disk.
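The mapping idea can be illustrated with a minimal Python sketch. It is not the DID driver's implementation: the table contents, the node-local device names, and both function names are hypothetical, and only the global name d10 comes from the text above.

```python
from typing import Dict, Tuple

# Hypothetical mapping table: (node, node-local device name) -> DID name.
# The local names are placeholders; only "d10" appears in the text.
LOCAL_TO_DID: Dict[Tuple[str, str], str] = {
    ("node1", "local-disk-x"): "d10",
    ("node2", "local-disk-y"): "d10",
}


def did_name(node: str, local_name: str) -> str:
    """Return the cluster-wide DID name for a node-local device name."""
    return LOCAL_TO_DID[(node, local_name)]


def local_names(did: str) -> Dict[str, str]:
    """Return {node: local device name} for every node that sees the device."""
    return {node: local for (node, local), d in LOCAL_TO_DID.items() if d == did}
```

Both nodes resolve their own local name to the same global name, which is what lets a volume manager or raw-device application use one path everywhere.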

You update and manage the device IDs with the cluster's device administration commands. For more information, see the corresponding man pages, such as scdidadm(1M).

In the SunPlex system, all multihost devices must be under the control of the Sun Cluster software. First, you create the volume manager disk groups, either Solaris Volume Manager disk sets or VERITAS Volume Manager disk groups (available only in SPARC-based clusters), on the multihost disks. Then, you register the volume manager disk groups as disk device groups. A disk device group is a type of global device. In addition, Sun Cluster software automatically creates a raw disk device group for each disk and tape device in the cluster. However, these cluster device groups remain offline until you access them as global devices.

Registration gives the SunPlex system information about which nodes have a path to which volume manager disk groups. At this point, the volume manager disk groups become globally accessible within the cluster. If more than one node can write to (master) a disk device group, the data stored in that disk device group becomes highly available. The highly available disk device group can be used to host cluster file systems.

Note -

Disk device groups are independent of resource groups. One node can master a resource group (representing a group of data service processes) while another node masters the disk group(s) that the data services access. However, the best practice is to keep the disk device group that stores an application's data and the resource group that contains the application's resources (the application daemon) on the same node. For more information about the association between resource groups and disk device groups, see “Relationship Between Resource Groups and Disk Device Groups” in the Sun Cluster Data Services Planning and Administration Guide for Solaris OS.

With a disk device group, the volume manager disk group becomes “global” because it supports multipathing for the associated disks. Each cluster node actually connected to the multihost disks provides a path to the disk device group.

Disk device group failover

Because a disk enclosure is connected to more than one node, all disk device groups in that enclosure remain accessible through an alternate path if the node that currently masters the device group fails. The failure of the node that masters the device group affects access to the device group only for the time required to perform the recovery and consistency checks. During this time, all requests are blocked (transparently to the application) until the system makes the device group available.

Figure 3–1 Disk Device Group Failover

Multiport disk device groups

This section describes disk device group properties that you can use to balance performance and availability in a multiported disk configuration. The Sun Cluster software provides two properties for this purpose: preferenced and a property that sets the desired number of secondary nodes. The preferenced property controls the order in which nodes attempt to take control when a failover occurs. The second property sets the desired number of secondary nodes for a device group.

A highly available service is considered down when the primary node fails and no secondary node can be promoted to primary. If a service fails over and the preferenced property is set, the nodes follow the order in the node list when selecting a secondary node. The node list that is set with the preferenced property defines the order in which nodes attempt to take control as primary or to transition from spare to secondary. You can change the preference of a device service dynamically with the scsetup(1M) utility. The preference of dependent services, for example a global file system, matches that of the device service.

During normal operation, the primary node checkpoints its state to the secondary nodes. In a multiported disk configuration, checkpointing to every secondary node degrades cluster performance and adds memory overhead. Spare node support was implemented to minimize this checkpoint-related performance degradation and memory overhead. By default, a disk device group has one primary and one secondary node. The remaining available provider nodes come online in the spare state. When a failover occurs, the secondary node becomes the primary node and the node with the highest priority in the node list becomes the secondary node.

The desired number of secondary nodes can be set to any integer between one and the number of operational provider nodes in the device group that are not primary nodes.
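The primary, secondary, and spare roles can be modeled with a small Python sketch. It only illustrates the promotion order described above and is not Sun Cluster code: the class, its method names, and the node names are hypothetical, and the node list is assumed to be ordered with the highest priority first.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DeviceGroup:
    """Toy model of primary/secondary/spare roles in a device group."""
    node_list: List[str]            # preference order, highest priority first
    num_secondaries: int = 1        # desired number of secondary nodes
    primary: str = ""
    secondaries: List[str] = field(default_factory=list)
    spares: List[str] = field(default_factory=list)

    def bring_online(self) -> None:
        """Assign roles in node-list order; leftover nodes become spares."""
        nodes = list(self.node_list)
        self.primary = nodes[0]
        self.secondaries = nodes[1:1 + self.num_secondaries]
        self.spares = nodes[1 + self.num_secondaries:]

    def fail_primary(self) -> None:
        """Promote a secondary to primary, then refill from the spares."""
        if not self.secondaries:
            raise RuntimeError("service is down: no secondary can take over")
        self.primary = self.secondaries.pop(0)
        # Spares stay in node-list order, so pop(0) is the highest priority.
        while len(self.secondaries) < self.num_secondaries and self.spares:
            self.secondaries.append(self.spares.pop(0))
```

With four hypothetical nodes a, b, c, and d and one desired secondary, the first failover leaves b as primary, c as secondary, and d as the only spare.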

Note -

If you are using Solaris Volume Manager, you must create the disk device group before you can set the desired number of secondary nodes to anything other than the default.

The default desired number of secondary nodes for device services is one. The actual number of secondary providers that the replica framework maintains is the desired number, unless the number of operational non-primary providers is less than the desired number. When you add nodes to or remove nodes from the configuration, you may want to change this property and double-check the node list. Maintaining the node list and the desired number of secondary nodes prevents conflicts between the configured number of secondary nodes and the actual number that the framework allows. To add or remove nodes, use the metaset(1M) command for Solaris Volume Manager device groups or, if you are using Veritas Volume Manager, the scconf(1M) command for VxVM disk device groups, together with the preferenced property and the desired number of secondary nodes. For procedural information about changing disk device group properties, see “Administering Cluster File Systems Overview” in the Sun Cluster System Administration Guide for Solaris OS.

The Sun Cluster software mechanism that enables global devices is the global namespace. The global namespace includes the global device hierarchy as well as the volume manager namespaces. The global namespace reflects both multihost disks and local disks (and any other cluster device, such as CD-ROM drives and tapes) and provides multiple failover paths to the multihost disks. Each node that is physically connected to the multihost disks provides a path to the storage for all nodes in the cluster.

Typically, the Solaris Volume Manager and Veritas VxVM volume manager namespaces each reside in their own device directories. These namespaces consist of directories for each Solaris Volume Manager disk set and each VxVM disk group across the cluster. Each of these directories contains a device node for each metadevice or volume in that disk set or disk group.

In the SunPlex system, each device node in the local volume manager namespace is a symbolic link to a device node in a global devices file system, where an integer in the path represents the node in the cluster. Sun Cluster software continues to present the volume manager devices as symbolic links in their standard locations. Both the global namespace and the volume manager namespace are available on each cluster node.

The advantages of the global namespace include:

  • Each node remains fairly independent and there are only minor changes in the device management model.

  • Devices can be selectively made global.

  • Third party link generators continue to work.

  • Given a local device name, a simple mapping yields the corresponding global name.

Example of local and global namespaces

The table below shows the mappings between the local and global namespaces for a multihost disk.

Table 3–2 Assignments of local and global namespaces

| Component / path | Local node namespace | Global namespace |
|---|---|---|
| Solaris logical name | | |
| DID name | | |
| Solaris Volume Manager | | |
| SPARC: VERITAS Volume Manager | | |

The global namespace is generated automatically at installation and updated with every reconfiguration reboot. You can also generate the global namespace from the command line.

The cluster file system has the following characteristics:

  • The storage locations for file access are transparent. A process can open a file anywhere on the system, and the processes can locate the file on all nodes using the same path name.

    Note -

    When the cluster file system reads files, the access time for those files is not updated.

  • With coherency protocols, UNIX file access semantics are preserved even when multiple nodes are accessing the file at the same time.

  • Extensive caching is used along with zero-copy bulk I/O movement to move file data efficiently.

  • The cluster file system provides highly available advisory file locking through the standard file-locking interfaces. Applications that run on multiple nodes can synchronize access to data by using advisory file locking on a cluster file system file. File locks are recovered immediately from nodes that leave the cluster and from applications that fail while holding locks.

  • Continuous data access is guaranteed even in the event of a failure. Applications are not affected by the failures as long as a path to the disks is still functional. This guarantee applies to disk access in raw mode and to all file system operations.

  • Cluster file systems are independent of the underlying file system and disk management. Cluster file systems make each supported file system on the disks global.
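The advisory locking described in the list above can be sketched in Python. This is an illustration under an assumption: the text here does not name the locking interface, so the sketch uses POSIX advisory locks through Python's fcntl.lockf. The helper names and the file path in the usage note are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block.

    Advisory locks only coordinate processes that also request the lock;
    they do not stop an uncooperative process from writing to the file.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)   # blocks until the lock is granted
        yield fd
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)


def append_record(path, record: bytes) -> None:
    """Append one record while holding the lock, so writers do not interleave."""
    with exclusive_lock(path) as fd:
        os.lseek(fd, 0, os.SEEK_END)
        os.write(fd, record)
```

An application instance on each node could call append_record("/global/app/journal", b"...") (a hypothetical path) and rely on the file system to serialize the writers.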

You can mount a file system on a global device either globally or locally.

Programs can access a file in a cluster file system from any node in the cluster through the same file name.

A cluster file system is mounted on all cluster members. You cannot mount a cluster file system on a subset of cluster members.

A cluster file system is not a distinct type of file system. That is, clients see the underlying file system (for example, UFS).

Using Cluster File Systems

In the SunPlex system, all multihost disks are integrated in disk device groups. These can be Solaris Volume Manager disk sets, VxVM disk groups, or individual disks that are not controlled by a software-based volume manager.

For a cluster file system to be highly available, the underlying disk storage must be connected to multiple nodes. Because of this, a local file system (a file system stored on the local disk of a node) that is made into a cluster file system is not highly available.

As with normal file systems, you can mount cluster file systems in two ways:

  • Manually - Use the mount command with a global mount option, such as -g, to mount the cluster file system from the command line. Example:

    SPARC: # mount -g /dev/global/dsk/d0s0 /global/oracle/data

  • Automatically - Create an entry with the global mount option in the file system table so that the cluster file system is mounted at boot. Then create a mount point under a common directory on all nodes. This directory is a recommended location, not a requirement. Here is an example line for a cluster file system from such a file:

    SPARC: /dev/md/oracle/dsk/d1 /dev/md/oracle/rdsk/d1 /global/oracle/data ufs 2 yes global,logging
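To make the fields of the example entry explicit, here is a minimal Python sketch that parses such a line and checks for the global mount option. It assumes the seven-field Solaris file system table layout (device to mount, device to fsck, mount point, file system type, fsck pass, mount at boot, mount options). The class and function names are hypothetical.

```python
from typing import List, NamedTuple


class MountEntry(NamedTuple):
    device_to_mount: str
    device_to_fsck: str
    mount_point: str
    fs_type: str
    fsck_pass: str
    mount_at_boot: str
    options: List[str]


def parse_mount_entry(line: str) -> MountEntry:
    """Split one whitespace-separated, seven-field entry into named fields."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    *head, opts = fields
    return MountEntry(*head, options=opts.split(","))


def is_cluster_file_system(entry: MountEntry) -> bool:
    """An entry describes a cluster file system if it carries 'global'."""
    return "global" in entry.options
```

Applied to the example line above (without the "SPARC:" label), the sketch yields the mount point /global/oracle/data, the type ufs, and the options global and logging.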

Note -

Although Sun Cluster software does not impose a naming scheme for cluster file systems, you can simplify administration by creating the mount points for all cluster file systems under the same directory. For more information, see the Sun Cluster 3.1 9/04 Software Collection for Solaris OS (SPARC Platform Edition) and the Sun Cluster System Administration Guide for Solaris OS.

Resource type

This resource type is designed to make non-global file system configurations such as UFS and VxFS highly available. You use it to integrate a local file system into the Sun Cluster environment and make that file system highly available. It provides additional file system capabilities, such as checks, mounts, and forced unmounts, that enable Sun Cluster to fail over local file systems. To fail over, the local file system must reside on global disk groups with affinity switchover enabled.

For information about configuring this resource type, see “Enabling Highly Available Local File Systems” in the Sun Cluster Data Services Planning and Administration Guide for Solaris OS.

This resource type can also be used to synchronize the startup of resources and the disk device groups on which the resources depend. For more information, see Resources, Resource Groups, and Resource Types.

Syncdir mount option

The syncdir mount option can be used for cluster file systems that use UFS as the underlying file system. However, performance is significantly better if you do not specify syncdir. If you specify syncdir, writes are guaranteed to be POSIX compliant. If you do not specify it, the system behaves like an NFS file system. For example, in some cases without syncdir, you would not discover an out-of-space condition until you close the file. With syncdir (and POSIX behavior), the out-of-space condition would be discovered during the write operation. Problems rarely occur when syncdir is not specified, so we recommend that you omit it and benefit from the performance gain.

If you are using a SPARC-based cluster, Veritas VxFS has no mount option equivalent to the syncdir mount option for UFS. VxFS behavior is the same as UFS behavior when syncdir is not specified.

For frequently asked questions about global devices and cluster file systems, see Common questions about file systems.

The current version of Sun Cluster software supports Disk Path Monitoring (DPM). This section provides conceptual information about DPM, the DPM daemon, and the management tools for disk path monitoring. For procedural information about monitoring, unmonitoring, and checking disk path status, see the Sun Cluster System Administration Guide for Solaris OS.

Note -

DPM is not supported on nodes with versions prior to Sun Cluster 3.1 4/04 software. Do not use DPM commands during an upgrade. After the upgrade, all nodes must be online to use the DPM commands.


DPM improves the overall reliability of failover and switchover by monitoring the availability of the disk path to the secondary node. Use the scdpm command to verify the availability of the disk path used by a resource before the resource is switched. The options of the scdpm command let you monitor disk paths to a single node or to all nodes in the cluster. For more information about command-line options, see scdpm(1M).

The DPM components are installed from a package that is installed as part of the standard Sun Cluster installation. For details about the installation interface, see the scinstall(1M) man page. The following table describes the default installation location of the DPM components.




Command line interface

Shared libraries

Daemon status file (created at runtime)

A multithreaded DPM daemon runs on each node. The DPM daemon is started by a script when a node boots. If a problem occurs, the daemon is restarted automatically. The following list describes how the daemon operates at initial startup.

Note -

When the daemon starts, it initializes the status of each disk path.

  1. The DPM daemon collects disk path and node name information from the previous state file or from the CCR database. For more information about the CCR, see Cluster Configuration Repository (CCR). After starting a DPM daemon, you can force the daemon to read the list of monitored disks from a specified file.

  2. The DPM daemon initializes the communication interface to answer requests from components outside the daemon, such as the command line interface.

  3. The DPM daemon pings each disk path in the monitored list every 10 minutes. Each entry is locked while it is checked so that the communication interface cannot access an entry that is being modified.

  4. The DPM daemon notifies the Sun Cluster event framework and logs the new path status through the UNIX syslogd (1M) mechanism.
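The polling behavior in the steps above can be sketched in Python. This is a simplified model, not the DPM daemon: the class, the probe callback, the status strings ("unknown", "ok", "fail"), and the path names in the usage note are all hypothetical. Only the 10-minute interval, the per-entry locking, and the log-and-notify step come from the text.

```python
import logging
import threading


class DiskPathMonitor:
    """Toy monitor: probe each path, record its status, report changes."""

    def __init__(self, paths, probe, on_change=None):
        self._lock = threading.Lock()                    # guards _status
        self._status = {p: "unknown" for p in paths}     # placeholder initial state
        self._probe = probe                              # callable(path) -> bool
        self._on_change = on_change or (lambda path, status: None)
        self._log = logging.getLogger("dpm-sketch")

    def status(self, path):
        """Read a status entry without racing against an update."""
        with self._lock:
            return self._status[path]

    def poll_once(self):
        """Probe every monitored path once and return the status changes."""
        changes = []
        for path in list(self._status):
            new = "ok" if self._probe(path) else "fail"
            with self._lock:                             # lock while modifying
                old = self._status[path]
                self._status[path] = new
            if new != old:
                self._log.warning("disk path %s: %s -> %s", path, old, new)
                self._on_change(path, new)               # stand-in for event framework
                changes.append((path, new))
        return changes

    def run(self, stop_event, interval_seconds=600):
        """Poll every 10 minutes until stop_event is set."""
        while not stop_event.is_set():
            self.poll_once()
            stop_event.wait(interval_seconds)
```

A caller would construct the monitor with a list of path names and a probe function, then run it in a background thread with a threading.Event to stop it.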

Note -

All errors related to the daemon are logged. All API functions return one value for success and another for failure.

The DPM daemon monitors the availability of the logical path that is visible via multipath drivers such as MPxIO, HDLM and PowerPath. The respective real paths that are managed by these drivers are not monitored because the multipath driver hides individual failures from the DPM daemon.

Monitoring disk paths

This section describes two methods of monitoring disk paths in your cluster. The first method is the scdpm command. Use this command to monitor, unmonitor, or display the status of disk paths in the cluster. You can also use this command to print the list of failed disks and to monitor disk paths listed in a file.

The second method of monitoring disk paths in the cluster is the graphical user interface (GUI) of SunPlex Manager. SunPlex Manager gives you a topological view of the monitored disk paths in the cluster. The view is updated every 10 minutes to provide information about the number of failed pings. Use the information from the SunPlex Manager GUI together with the scdpm(1M) command to manage the disk paths. For information about SunPlex Manager, see “Administering Sun Cluster With the Graphical User Interfaces” in the Sun Cluster System Administration Guide for Solaris OS.

Using the scdpm Command to Monitor Disk Paths

The scdpm(1M) command provides DPM administration commands that you can use to do the following:

  • Monitoring a new disk path,

  • Unmonitoring a disk path,

  • New reading of the configuration data from the CCR database,

  • Reading the disks to be monitored or unmonitored from a specified file,

  • Reporting the status of a disk path or all disk paths in the cluster,

  • Print all disk paths that can be accessed from a node.

Issue the scdpm(1M) command with the disk path argument from any active node to perform DPM administration tasks on the cluster. The disk path argument always consists of a node name and a disk name. The node name is optional; if you do not specify one, a default is used. The following table describes the naming conventions for the disk path.

Note -

Use of the global disk path name is strongly recommended because the global disk path name is consistent across the cluster. The UNIX disk path name is not consistent across the cluster. The UNIX disk path for a disk can differ from one cluster node to another: the same disk can have one UNIX disk path on one node and a different one on another node. If you use UNIX disk path names, use the scdidadm command to map them to global disk path names before you issue DPM commands. For more information, see the scdidadm(1M) man page.

Table 3–3 Examples of disk path names

| Name type | Example disk path name | Description |
|---|---|---|
| Global disk path | | Disk path on the node |
| | | Disk path on all cluster nodes |
| UNIX disk path | | Disk path on the node |
| | | All disk paths on the node |
| All disk paths | | All disk paths on all cluster nodes |

Using SunPlex Manager to Monitor Disk Paths

You can use SunPlex Manager to perform the following basic DPM administration tasks:

  • Monitoring a disk path,

  • Unmonitoring a disk path,

  • View the status of all disk paths in the cluster.

For procedural information about disk path management with SunPlex Manager, see the SunPlex Manager online help.

This chapter covers the following topics:

Note -

Contact your Sun service provider for a list of devices that Sun Cluster software supports as quorum devices.

Because cluster nodes share data and resources, a cluster must never be divided into separate partitions that are active at the same time. Multiple active partitions can cause data corruption. The Cluster Membership Monitor (CMM) and the quorum algorithm ensure that no more than one instance of the same cluster is in operation, even if the cluster interconnect is partitioned.

For more information on the CMM, see “Cluster Membership” in the Sun Cluster Overview for Solaris OS.

Two types of problems can arise due to the partitioning of clusters:

Split brain occurs when the cluster interconnect between nodes is lost and the cluster becomes partitioned. Each partition behaves as if it were the only existing partition because the nodes in one partition cannot communicate with the nodes in other partitions.

Amnesia occurs when the cluster restarts after shutdown with cluster configuration data that is older than the data at the time of shutdown. This problem can occur if you start the cluster on a node that is not in the last functioning cluster partition.

Sun Cluster software avoids split brain and amnesia by doing the following:

  • Assigning each node one vote

  • Requiring a majority of votes for an operational cluster

A partition with a majority of the votes has quorum and is allowed to operate. This majority vote mechanism prevents split brain and amnesia when more than two nodes are configured in a cluster. However, counting node votes alone is not sufficient in a two-node cluster, where a majority is two. If such a two-node cluster is partitioned, each partition needs an external vote to gain quorum. That external vote is contributed by a quorum device.
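The majority rule can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only, and both function names are hypothetical.

```python
def votes_required(total_configured_votes: int) -> int:
    """Smallest vote count that is a strict majority of the configured votes."""
    return total_configured_votes // 2 + 1


def has_quorum(partition_votes: int, total_configured_votes: int) -> bool:
    """True if a partition holds a majority and may keep operating."""
    return partition_votes >= votes_required(total_configured_votes)
```

In a three-node cluster, a partition of two nodes holds two of three votes and has quorum, while the isolated node does not. In a two-node cluster without a quorum device, two votes are required, so neither single-node partition has quorum. Adding a quorum device with one vote raises the total to three, and the partition that holds the device has two votes and can continue.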

About quorum vote counts

Use the scstat command to determine the following information:

  • Total configured votes

  • Currently available votes

  • Votes required for quorum

For more information about this command, see scstat(1M).

Both nodes and quorum devices contribute votes for the cluster to form a quorum.

A node contributes votes based on its status:

  • A node has a vote count of one when it boots and becomes a cluster member.

  • A node has a vote count of zero while it is being installed.

  • When a system administrator puts a node into maintenance state, the node has a vote count of zero.

Quorum devices contribute votes based on the number of votes connected to the device. When you configure a quorum device, the Sun Cluster software assigns the quorum device a vote count of N-1, where N is the number of votes connected to the quorum device. For example, a quorum device connected to two nodes with non-zero vote counts has a quorum count of one (two minus one).

A quorum device contributes votes when one of the following two conditions is met:

  • One or more of the nodes to which the quorum device is currently attached is a member of the cluster.

  • At least one of the nodes to which the quorum device is currently attached is booting, and that node was a member of the last cluster partition to own the quorum device.
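The vote-count rules for nodes and quorum devices can be written down as a small model. The sketch below is a simplified illustration; the `NodeState` enum and both function names are hypothetical and do not correspond to any Sun Cluster interface.

```python
from enum import Enum
from typing import Iterable


class NodeState(Enum):
    MEMBER = "member"            # booted and joined the cluster
    INSTALLING = "installing"    # still being installed
    MAINTENANCE = "maintenance"  # placed in maintenance state by an administrator


def node_votes(state: NodeState) -> int:
    """A node contributes one vote only while it is a cluster member."""
    return 1 if state is NodeState.MEMBER else 0


def quorum_device_votes(attached_node_states: Iterable[NodeState]) -> int:
    """Vote count N-1, where N is the number of votes connected to the device."""
    n = sum(node_votes(state) for state in attached_node_states)
    return max(n - 1, 0)


# A quorum device attached to two nodes with non-zero vote counts gets one vote.
print(quorum_device_votes([NodeState.MEMBER, NodeState.MEMBER]))  # 1
```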

You configure quorum devices during cluster installation, or later by following the procedures described under “Managing Quorum Devices” in the Sun Cluster System Administration Guide for Solaris OS.

About failure fencing

A major issue for clusters is a failure that causes the cluster to become partitioned (called split brain). When this happens, not all nodes can communicate with one another, so individual nodes or subsets of nodes might try to form individual or subset clusters. Each subset or partition might believe that it has sole access to, and ownership of, the multihost devices. When multiple nodes attempt to write to the disks, data corruption can occur.

Failure fencing limits node access to multihost devices by physically preventing access to the disks. When a node leaves the cluster (because it fails or becomes partitioned), failure fencing ensures that the node can no longer access the disks. Only current member nodes have access to the disks, which preserves data integrity.

Disk device services provide failover capability for services that use multihost devices. When a cluster member that currently serves as the primary (owner) of the disk device group fails or becomes unreachable, a new primary is chosen, and access to the disk device group continues after only a minor interruption. During this process, the old primary must give up access to the devices before the new primary can be started. However, when a member drops out of the cluster and becomes unreachable, the cluster cannot ask that node to release the devices for which it was the primary. The surviving members therefore need a means to take control of, and access to, the global devices of the failed members.

The SunPlex system uses SCSI disk reservations to implement failure fencing. With SCSI reservations, failed nodes are “fenced” away from the multihost devices and prevented from accessing those disks.

SCSI-2 disk reservations support a form of reservation that either grants access to all nodes connected to the disk (if there is no reservation) or restricts access to a single node (the node to which the reservation applies).

When a cluster member detects that another node is no longer communicating over the cluster interconnect, it initiates a failure fencing procedure to prevent the other node from accessing the shared disks. When this failure fencing occurs, it is normal for the fenced node to panic with a “reservation conflict” message on its console.

The reservation conflict occurs because, after a node has been detected to no longer be a cluster member, a SCSI reservation is placed on all of the disks that are shared between this node and the other nodes. The fenced node might not be aware that it is being fenced. If it tries to access one of the shared disks, it detects the reservation and panics.

Failfast mechanism for failure fencing

The mechanism by which the cluster framework ensures that a failed node cannot reboot and begin writing to shared storage is called failfast.

Nodes that are cluster members continuously enable a specific ioctl, MHIOCENFAILFAST, for the disks to which they have access, including quorum disks. This ioctl is a directive to the disk driver. It gives a node the capability to panic itself if it cannot access the disk because the disk is reserved by another node.

The MHIOCENFAILFAST ioctl causes the driver to check the error return of every read and write that a node issues to the disk for the reservation conflict error code. In the background, the ioctl also periodically issues a test operation to the disk to check for the same error. Both the foreground and background control flow paths panic if a reservation conflict is returned.
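The interaction between a disk reservation and failfast can be illustrated with a toy model. The sketch below is not the disk driver and does not call any real ioctl; the classes `SharedDisk`, `ReservationConflict`, and `NodePanic` and the function `failfast_io` are hypothetical, and the periodic background probe is omitted.

```python
class ReservationConflict(Exception):
    """Raised when a disk is reserved by a different node."""


class NodePanic(Exception):
    """Stands in for the node panicking itself."""


class SharedDisk:
    def __init__(self):
        self.reserved_by = None  # None means no reservation: all nodes may access

    def reserve(self, node: str) -> None:
        self.reserved_by = node

    def io(self, node: str) -> str:
        # SCSI-2 style: with a reservation in place, only the holder may access.
        if self.reserved_by is not None and self.reserved_by != node:
            raise ReservationConflict(node)
        return "ok"


def failfast_io(disk: SharedDisk, node: str) -> str:
    """Check every I/O result and panic on a reservation conflict."""
    try:
        return disk.io(node)
    except ReservationConflict:
        raise NodePanic(f"{node}: reservation conflict") from None


disk = SharedDisk()
print(failfast_io(disk, "node-b"))   # ok
disk.reserve("node-a")               # node-a fences node-b off
try:
    failfast_io(disk, "node-b")
except NodePanic as panic:
    print("panic:", panic)           # panic: node-b: reservation conflict
```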

With SCSI-2 disks, reservations are not persistent: they do not survive node reboots. With SCSI-3 disks that support Persistent Group Reservation (PGR), the reservation information is stored on the disk and persists across node reboots. The failfast mechanism works the same way regardless of whether you use SCSI-2 or SCSI-3 disks.

If a node loses connectivity to the other nodes in the cluster and is not part of a partition that can achieve quorum, another node forcibly removes it from the cluster. A node that is part of the partition that can achieve quorum places reservations on the shared disks. When the node without quorum tries to access the shared disks, it receives a reservation conflict and panics as a result of the failfast mechanism.

After the panic, the node might reboot and attempt to rejoin the cluster or, in clusters of SPARC-based systems, stay at the OpenBoot™ PROM (OBP) prompt. A parameter setting determines which action is taken. You can set this parameter with eeprom(1M), at the OpenBoot PROM prompt in a SPARC-based cluster, or with the SCSI utility that you optionally run after the BIOS boots in an x86-based cluster.

About quorum configurations

The following list provides information about quorum configurations:

  • Quorum devices can contain user data.

  • In an N+1 configuration where N quorum devices are each connected to one of the nodes 1 through N and to node N+1, the cluster survives the failure of either all of nodes 1 through N or any N/2 of the nodes. This availability assumes that the quorum device is working properly.

  • In an N-node configuration where a single quorum device is connected to all nodes, the cluster can survive the failure of any N-1 of the nodes. This availability assumes that the quorum device is working properly.

  • In an N-node configuration where a single quorum device is connected to all nodes, the cluster can survive the failure of the quorum device as long as all cluster nodes are available.
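As a worked example of the single-quorum-device case, the following sketch applies the N-1 vote rule and the majority rule to an N-node cluster. It is a back-of-the-envelope model only; the function name `survives` and its parameters are hypothetical.

```python
def survives(n_nodes: int, surviving_nodes: int, device_ok: bool) -> bool:
    """Single quorum device attached to all n_nodes nodes (device votes = n_nodes - 1)."""
    device_votes = n_nodes - 1
    total = n_nodes + device_votes   # 2N - 1 configured votes
    required = total // 2 + 1        # majority works out to N
    # The device vote counts only if the device works and a node is left to own it.
    held = surviving_nodes
    if device_ok and surviving_nodes > 0:
        held += device_votes
    return held >= required


n = 4
print(survives(n, 1, device_ok=True))       # True: 1 + 3 = 4 of 7 votes, majority is 4
print(survives(n, n, device_ok=False))      # True: 4 node votes alone reach the majority
print(survives(n, n - 1, device_ok=False))  # False: 3 votes fall short of 4
```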

For examples of quorum configurations to avoid, see Invalid quorum configurations. For examples of recommended quorum configurations, see Recommended quorum configurations.

Requirements for quorum devices

You must meet the following requirements. Otherwise, cluster availability might be compromised.

  • Make sure that the Sun Cluster software supports your device as a quorum device.

    Note -

    Contact your Sun service provider for a list of devices that Sun Cluster software supports as quorum devices.

    The Sun Cluster software supports two types of quorum devices:

    • Shared multihost disks that support SCSI-3 PGR reservations

    • Shared dual host disks that support SCSI-2 reservations

  • In a two-node configuration, you must configure at least one quorum device to ensure that a single node will survive if the other node fails. See Figure 3–2.

For examples of quorum configurations to avoid, see Invalid quorum configurations. For examples of recommended quorum configurations, see Recommended quorum configurations.

Apply the recommendations for dealing with quorum devices

Use the following information to determine which quorum configuration is best for your topology:

  • Do you have a device that can be attached to all nodes in the cluster?

    • If so, configure this device as your quorum device. You do not need to configure another quorum device because your configuration is already optimal.

      Caution -

      If you ignore this requirement and add another quorum device, the additional quorum device reduces the availability of your cluster.

    • If not, configure your dual-ported device or devices.

  • In general, make sure that the total number of votes contributed by quorum devices is strictly less than the total number of votes contributed by nodes. Otherwise, the nodes cannot form a cluster if all the disks are unavailable, even if all the nodes are functioning.

    Note -

    In certain environments, you might want to reduce overall cluster availability to meet your needs. In these cases, you can ignore this recommendation. However, not following this recommendation reduces overall availability. For example, in the configuration described under Atypical quorum configurations, the cluster is less available: the quorum votes exceed the node votes. The cluster has the property that if access to the shared storage between Node A and Node B is lost, the entire cluster fails.

    The exception to this recommendation is described under Atypical quorum configurations.

  • Specify a quorum device between every pair of nodes that shares access to a storage device. This quorum configuration speeds up the failure fencing process. See Quorum in configurations with more than two nodes.

  • In general, cluster availability decreases if adding a quorum device makes the total number of cluster votes even.

  • Quorum devices slightly slow down reconfiguration after a node joins or a node fails. For this reason, do not add more quorum devices than required.
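The recommendation about quorum device votes versus node votes comes down to simple arithmetic, which the following sketch checks. The function name `nodes_alone_have_quorum` is hypothetical, and the model assumes that every node is up and that no quorum device vote can be counted.

```python
def nodes_alone_have_quorum(node_votes: int, device_votes: int) -> bool:
    """Can all nodes together form a cluster when no quorum device vote is counted?"""
    total = node_votes + device_votes
    return node_votes >= total // 2 + 1


# Device votes below node votes: the nodes alone still hold a majority.
print(nodes_alone_have_quorum(node_votes=3, device_votes=2))  # True: 3 of 5, need 3

# Device votes equal to node votes: the nodes alone fall short.
print(nodes_alone_have_quorum(node_votes=3, device_votes=3))  # False: 3 of 6, need 4
```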

For examples of quorum configurations to avoid, see Invalid quorum configurations. For examples of recommended quorum configurations, see Recommended quorum configurations.

Recommended quorum configurations

For examples of quorum configurations to avoid, see Invalid quorum configurations.

Quorum in two-node configurations

Two quorum votes are required for a two-node cluster to form. These two votes can come from the two cluster nodes, or from just one node and a quorum device.

Figure 3–2 Two-Node Configuration

Quorum in configurations with more than two nodes

You can configure a cluster with more than two nodes without a quorum device. However, if you do so, you cannot start the cluster without a majority of the nodes in the cluster.

Atypical quorum configurations

Figure 3–3 assumes that you are running mission-critical applications (an Oracle database, for example) on Node A and Node B. If Node A and Node B are unavailable and cannot access shared data, you might want the entire cluster to be down. Otherwise, this configuration is not optimal because it does not provide high availability.

For information about the recommendation to which this exception relates, see Apply the recommendations for dealing with quorum devices.

Figure 3–3 Atypical configuration

Invalid quorum configurations

For examples of recommended quorum configurations, see Recommended quorum configurations.

The term data service describes a third-party application, such as Sun Java System Web Server (formerly Sun ONE Web Server) or, for SPARC-based clusters, Oracle, that has been configured to run on a cluster rather than on a single server. A data service consists of an application, specialized Sun Cluster configuration files, and Sun Cluster management methods that control how the application is started, stopped, and monitored.

Figure 3–4 compares an application that runs on a single application server (the single-server model) with the same application running on a cluster (the clustered-server model). From the user's perspective, there is no difference between the two configurations, except that the clustered application might run faster and be more highly available.

Figure 3–4 Standard versus clustered client/server configuration

In the single server model, you configure the application to access the server through a specific public network interface (a host name). The host name is assigned to this real server.

In the clustered-server model, the public network interface is a logical host name or a shared address. The term network resources refers to both logical host names and shared addresses.

Certain data services require you to specify either logical host names or shared addresses as network interfaces - they are not interchangeable. For other data services, you can specify either logical host names or shared addresses. You can find details on the required interface type in the information on installation and configuration for the respective data service.

A network resource is not assigned to a specific real server - it can migrate between the real servers.

A network resource is initially associated with one node, the primary. If the primary fails, the network resource and the application resource fail over to a different cluster node (a secondary). When the network resource fails over, the application resource continues to run on the secondary after a short delay.

Figure 3–5 compares the single server model with the clustered server model. Note that a network resource (in this example a logical host name) can switch between two or more cluster nodes in a clustered server model. The application is configured to use this logical host name instead of a host name associated with a specific server.

Figure 3–5 Fixed host name versus logical host name

A shared address is also initially associated with one node. This node is called the global interface node (GIF node). A shared address is used as the single network interface to the cluster. It is known as the global interface.

The difference between the logical host name model and the scalable service model is that in the latter, each node also has the shared address actively configured on its loopback interface. This configuration makes it possible for multiple instances of a data service to be active on several nodes simultaneously. The term “scalable service” means that you can add more CPU power to the application by adding cluster nodes, and performance scales accordingly.

If the global interface node fails, the shared address can be brought up on another node that is also running an instance of the application (making that node the new global interface node). Alternatively, the shared address can fail over to another cluster node that was not previously running the application.

Figure 3–6 compares the single server configuration with the cluster configuration with scalable services. Note that the shared address is present in the scalable service configuration on all nodes. Similar to using a logical host name for failover data services, the application is configured to use this shared address instead of a host name associated with a specific server.

Figure 3–6 Fixed host name versus shared address

Data service methods

Sun Cluster software supplies a set of service management methods. These methods run under the control of the Resource Group Manager (RGM), which uses them to start, stop, and monitor the application on the cluster nodes. Together with the cluster framework software and multihost devices, these methods enable applications to become failover or scalable data services.

The RGM also manages resources in the cluster including the application instances and network resources (logical host names and shared addresses).

In addition to the methods that Sun Cluster software supplies, the SunPlex system provides an API and several data service development tools. These tools enable application programmers to develop the data service methods needed to make other applications run as highly available data services with Sun Cluster software.

Failover data services

If the node on which the data service is running (the primary node) fails, the service migrates to another working node without user intervention. Failover services use a failover resource group, which contains resources for application instances and network resources (logical host names). Logical host names are IP addresses that can be configured as active on one node and later automatically configured as inactive on the original node and as active on another node.

For failover data services, application instances run on only a single node. If the fault monitor detects an error, it either attempts to restart the instance on the same node or starts the instance on another node (failover), depending on how the data service has been configured.

Scalable data services

A scalable data service can run active instances on multiple nodes. Scalable services use two resource groups: a scalable resource group that contains the application resources, and a failover resource group that contains the network resources (shared addresses) on which the scalable service depends. The scalable resource group can be online on multiple nodes, so multiple instances of the service can run at once. The failover resource group that hosts the shared address is online on only one node at a time. All nodes that host a scalable service use the same shared address to host the service.

Service requests enter the cluster through a single network interface (the global interface) and are distributed to the nodes according to one of several predefined algorithms that is set by the load-balancing policy. The cluster can use the load-balancing policy to balance the service load between several nodes. Note that multiple global interfaces can exist on different nodes, hosting other shared addresses.

With scalable services, the application instances run on multiple nodes at the same time. If the node with the global interface fails, the global interface is moved to another node. When an application instance fails, the instance tries to restart on the same node.

If an application instance cannot be restarted on the same node, and another unused node is configured to run the service, the service fails over to the unused node. Otherwise, the service continues to run on the remaining nodes, possibly with reduced throughput.

Note -

The TCP state for each application instance remains on the node with the instance, not on the node with the global interface. Therefore a failure of the node with the global interface has no effect on the connection.

Figure 3–7 shows an example of failover and scalable resource groups and the dependencies that exist between them for scalable services. This example contains three resource groups. The failover resource group contains application resources for highly available DNS, and network resources that are used by both highly available DNS and the highly available Apache Web Server (available in SPARC-based clusters only). The scalable resource groups contain only application instances of the Apache Web Server. Note that resource group dependencies exist between the scalable and failover resource groups (solid lines), and that all of the Apache application resources depend on the network resource, which is a shared address (dashed lines).

Figure 3–7 SPARC: Example of Failover and Scalable Resource Groups

Load-balancing policies

Load balancing improves the performance of the scalable services in terms of both response time and throughput.

There are two classes of scalable data services: pure and sticky. A pure service is one where any instance can respond to client requests. A sticky service is one where a client sends requests to the same instance. Those requests are not redirected to other instances.

A pure service uses weighted load balancing. With this load balancing approach, client requests are, by default, evenly distributed among the server instances in the cluster. Assume that each node in a three-node cluster has a weight of 1. Each node serves 1/3 of the requests from any client on behalf of the service. Weights can be changed at any time by the administrator through the command interface or the SunPlex Manager GUI.
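The effect of node weights can be illustrated with a short simulation. The sketch below assumes that each new connection is assigned at random in proportion to the node weights; that selection method, the node names, and the function `pick_node` are assumptions made for illustration and are not taken from the Sun Cluster implementation.

```python
import random
from collections import Counter


def pick_node(weights: dict, rng: random.Random) -> str:
    """Choose a node with probability weight / total weight of active nodes."""
    nodes = list(weights)
    return rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
weights = {"node1": 1, "node2": 1, "node3": 1}
counts = Counter(pick_node(weights, rng) for _ in range(30000))
for node in sorted(counts):
    # With equal weights, each node's share comes out close to one third.
    print(node, round(counts[node] / 30000, 2))
```

Changing a weight, for example `{"node1": 2, "node2": 1, "node3": 1}`, shifts the expected share of `node1` to about one half.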

A sticky service has two flavors, ordinary sticky and wildcard sticky. Sticky services allow concurrent application-level sessions over multiple TCP connections to share in-state memory (application session state).

Ordinary sticky services permit a client to share state between multiple concurrent TCP connections. The client is said to be “sticky” with respect to the server instance that listens on a single port. The client is guaranteed that all of its requests go to the same server instance, provided that the instance remains up and accessible and that the load-balancing policy is not changed while the service is online.

For example, a web browser on the client connects to a shared IP address on port 80 using three different TCP connections, but the connections exchange cached session information between them at the service.

A generalization of the sticky policy extends to multiple scalable services that exchange session information behind the scenes at the same instance. When these services exchange session information behind the scenes at the same instance, the client is said to be “sticky” with respect to multiple server instances on the same node that listen on different ports.

For example, an e-commerce customer fills their shopping cart with items and uses normal HTTP on port 80, but then switches to port 443 to send secure data with SSL in order to pay for the items in the shopping cart with a credit card.

Wildcard sticky services use dynamically assigned port numbers but still expect client requests to go to the same node. The client is “sticky wildcard” over ports with respect to the same IP address.

A good example of this policy is passive mode FTP. A client connects to an FTP server on port 21 and is then told by the server to connect back to a listener port in the dynamic port range. All requests for this IP address are forwarded to the same node that the server indicated to the client through the control information.

Note that for each of these sticky policies, the weighted load-balancing policy is in effect by default, so a client's initial request is directed to the instance dictated by the load balancer. After the client has established an affinity for the node where the instance is running, future requests are directed to that instance as long as the node is accessible and the load-balancing policy is not changed.
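Client affinity layered on top of weighted selection can be sketched as follows. This is a simplified model under stated assumptions: affinity is keyed by client IP address only, the first placement is a weighted random choice, and the class `StickyBalancer` and its method names are hypothetical.

```python
import random


class StickyBalancer:
    """First request is placed by weight; later ones follow the client's affinity."""

    def __init__(self, weights: dict, seed: int = 0):
        self.weights = dict(weights)
        self.affinity = {}  # client IP -> node
        self.rng = random.Random(seed)

    def route(self, client_ip: str, active_nodes: set) -> str:
        node = self.affinity.get(client_ip)
        if node in active_nodes:
            return node  # affinity holds while the node is reachable
        candidates = [n for n in self.weights if n in active_nodes]
        if not candidates:
            raise RuntimeError("no active node can serve the request")
        chosen = self.rng.choices(
            candidates, weights=[self.weights[n] for n in candidates], k=1
        )[0]
        self.affinity[client_ip] = chosen
        return chosen


lb = StickyBalancer({"node1": 1, "node2": 1})
nodes = {"node1", "node2"}
first = lb.route("192.0.2.10", nodes)
print(first == lb.route("192.0.2.10", nodes))  # True: same client, same node
```

If the chosen node becomes unreachable, the next call to `route` picks a new node from the remaining active nodes and records the new affinity.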

Additional details of the specific load-balancing policies are as follows.

  • Weighted. The load is distributed among several nodes according to specified weight values. This policy is selected through a resource property value. If a weight for a node is not explicitly set, the weight for that node defaults to one.

    The weighted policy directs a certain percentage of the traffic from clients to a particular node. Given X = weight and A = the total weight of all active nodes, an active node can expect approximately X/A of the total new connections to be directed to it, provided that the total number of connections is large enough. This policy does not address individual requests.

    Note that this policy is not round robin. A round-robin policy would always direct each request from a client to a different node: the first request to node 1, the second request to node 2, and so on.

  • Sticky. With this policy, the set of ports is known at the time the application resources are configured. This policy is selected through a resource property value.

  • Wildcard sticky. This policy is a superset of the ordinary sticky policy. For a scalable service that is identified by its IP address, the ports are assigned by the server (and are not known in advance). The ports might change. This policy is selected through a resource property value.

Failback settings

Resource groups fail over from one node to another. When this happens, the original secondary becomes the new primary. The failback settings specify the actions that take place when the original primary comes back online. The options are to have the original primary become the primary again (failback) or to allow the current primary to remain. You specify the option you want by using a resource group property setting.

In certain instances, if the original node that hosts the resource group fails and reboots repeatedly, setting failback might reduce the availability of the resource group.

Data service fault monitors

Each SunPlex data service supplies a fault monitor that periodically probes the data service to determine its health. A fault monitor verifies that the application daemon or daemons are running and that clients are being served. Based on the information that the probes return, predefined actions such as restarting daemons or initiating a failover can be triggered.
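The probe-then-act behavior of a monitor can be sketched as a small decision function. The restart threshold `max_restarts`, the action names, and the function `monitor_step` are assumptions made for illustration; the actual policy depends on how each data service is configured.

```python
from typing import Callable, Tuple


def monitor_step(probe: Callable[[], bool], failures: int, max_restarts: int) -> Tuple[str, int]:
    """Return (action, new_failure_count) for one probe of the application."""
    if probe():
        return "none", 0              # healthy: nothing to do, reset the counter
    if failures < max_restarts:
        return "restart", failures + 1  # try a local restart first
    return "failover", 0              # local restarts exhausted: move the service


# Simulate an application whose probe keeps failing.
failures = 0
actions = []
for _ in range(3):
    action, failures = monitor_step(lambda: False, failures, max_restarts=2)
    actions.append(action)
print(actions)  # ['restart', 'restart', 'failover']
```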