The scds_fm_action() function uses the probe_status of the data service in conjunction with the past history
of failures to take one of the following actions:
- Restart the application.
- Fail over the resource group.
- Do nothing.
Use the value of the input probe_status argument
to indicate the severity of the failure. For example, you might consider a
failure to connect to an application as a complete failure, but a failure
to disconnect as a partial failure.
The DSDL defines SCDS_PROBE_COMPLETE_FAILURE as 100. For a partial probe success or failure, such as the failed disconnect, specify
a value for probe_status between 0 and SCDS_PROBE_COMPLETE_FAILURE.
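As an illustration of the severity mapping described above, the following is a minimal sketch rather than DSDL code. The `probe_outcome_t` enumeration and the `probe_status_for()` helper are hypothetical names introduced here. The constant is redefined locally as 100, the value stated above, so that the block compiles without the DSDL headers. The choice of half the complete-failure value for a failed disconnect is an assumption.

```c
#include <assert.h>

/* Redefined locally for this sketch; the DSDL defines it as 100. */
#define SCDS_PROBE_COMPLETE_FAILURE 100

/* Hypothetical probe outcomes, for illustration only. */
typedef enum {
    PROBE_OK,
    PROBE_DISCONNECT_FAILED,
    PROBE_CONNECT_FAILED
} probe_outcome_t;

/*
 * Map a probe outcome to a probe_status severity.
 * Assumption: a failed disconnect counts as half of a complete
 * failure. Any value strictly between 0 and
 * SCDS_PROBE_COMPLETE_FAILURE would also mark a partial failure.
 */
int probe_status_for(probe_outcome_t outcome)
{
    int status = 0;

    switch (outcome) {
    case PROBE_OK:
        status = 0;
        break;
    case PROBE_DISCONNECT_FAILED:
        status = SCDS_PROBE_COMPLETE_FAILURE / 2;
        break;
    case PROBE_CONNECT_FAILED:
        status = SCDS_PROBE_COMPLETE_FAILURE;
        break;
    }

    /* The result must stay within the documented range. */
    assert(status >= 0 && status <= SCDS_PROBE_COMPLETE_FAILURE);
    return status;
}
```

A fault monitor would pass the returned value as the probe_status argument of scds_fm_action().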
Successive calls to scds_fm_action() compute a failure
history by summing the value of the probe_status input
parameter over the time interval defined by the Retry_interval
property of the resource. Any failure history older than Retry_interval is purged from memory and is not used in the restart
or failover decision.
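The windowed summation described above can be modeled as follows. This is a sketch of the idea, not the DSDL's internal implementation. The `failure_history_t` type, the `history_init()` and `history_record()` functions, the fixed capacity, and the use of seconds as the time unit are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 64  /* arbitrary capacity chosen for this sketch */

typedef struct {
    long when;    /* time of the probe, in seconds (assumed unit) */
    int  status;  /* probe_status reported at that time */
} failure_event_t;

typedef struct {
    failure_event_t events[MAX_EVENTS];
    size_t count;
    long retry_interval;  /* stands in for the Retry_interval property */
} failure_history_t;

void history_init(failure_history_t *h, long retry_interval)
{
    h->count = 0;
    h->retry_interval = retry_interval;
}

/*
 * Purge entries older than retry_interval, record the new
 * probe_status, and return the sum over the remaining window.
 */
int history_record(failure_history_t *h, long now, int status)
{
    size_t kept = 0;
    int sum = 0;

    assert(status >= 0);

    /* Keep only the events that are still inside the window. */
    for (size_t i = 0; i < h->count; i++) {
        if (now - h->events[i].when <= h->retry_interval) {
            h->events[kept++] = h->events[i];
        }
    }
    h->count = kept;

    /* If the buffer is full, discard the oldest entry. */
    if (h->count == MAX_EVENTS) {
        for (size_t i = 1; i < h->count; i++) {
            h->events[i - 1] = h->events[i];
        }
        h->count--;
    }

    h->events[h->count].when = now;
    h->events[h->count].status = status;
    h->count++;

    for (size_t i = 0; i < h->count; i++) {
        sum += h->events[i].status;
    }
    return sum;
}
```

With a window of 300 seconds, two partial failures of 40 recorded at times 0 and 100 sum to 80. A further failure of 10 at time 350 yields 50, because the entry from time 0 has aged out of the window.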
The scds_fm_action() function uses the following
algorithm to choose the action to take:
-
Restart
- If the accumulated history of failures reaches SCDS_PROBE_COMPLETE_FAILURE, scds_fm_action() restarts the resource by calling
the STOP method of the resource followed by the START method. It ignores any PRENET_START or POSTNET_STOP methods defined for the resource type.
The status of the resource is set to SCHA_RSSTATUS_DEGRADED by making a scha_resource_setstatus() call,
unless it is already set.
If the restart attempt fails because the START or STOP method of the resource fails, scha_control()
is called with the GIVEOVER option to fail the resource
group over to another node. If the scha_control() call
succeeds, the resource group is failed over to another cluster node and the
call to scds_fm_action() never returns.
Upon a successful restart, failure history is purged. Another restart
is attempted if and only if the failure history again accumulates to SCDS_PROBE_COMPLETE_FAILURE.
-
Failover
- If the number of restarts attempted by successive calls to scds_fm_action() reaches the Retry_count value
defined on the resource, a failover is attempted by making a call to scha_control() with the GIVEOVER option.
The status of the resource is set to SCHA_RSSTATUS_FAULTED by making a scha_resource_setstatus() call,
unless it is already set.
If the scha_control() call fails, the entire failure
history maintained by scds_fm_action() is purged.
If the scha_control() call succeeds, the resource
group is failed over to another cluster node and the call to scds_fm_action() never returns.
-
No action
- If the accumulated history of failures remains below SCDS_PROBE_COMPLETE_FAILURE, no action is taken. In addition, if the probe_status value is 0, which indicates a successful health
check of the service, no action is taken, irrespective of the failure history.
The status of the resource is set to SCHA_RSSTATUS_OK
by making a scha_resource_setstatus() call, unless it is
already set.
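The three outcomes above can be summarized as a small decision function. This is a simplified model, not the DSDL source. The `fm_action_t` type and the `choose_action()` function are hypothetical. The model assumes that SCDS_PROBE_COMPLETE_FAILURE (100) is the single threshold for both the restart and the no-action cases, and that the Retry_count check applies only once that threshold is reached. It leaves out status updates, history purging, and the fallback to a giveover when a restart fails.

```c
#include <assert.h>

/* Redefined locally for this sketch; the DSDL defines it as 100. */
#define SCDS_PROBE_COMPLETE_FAILURE 100

/* Hypothetical result type, for illustration only. */
typedef enum {
    ACTION_NONE,
    ACTION_RESTART,
    ACTION_FAILOVER
} fm_action_t;

/*
 * Choose an action from the current probe_status, the failure
 * history accumulated within Retry_interval (including the current
 * probe), the number of restarts already attempted, and the
 * Retry_count property of the resource.
 */
fm_action_t choose_action(int probe_status, int accumulated,
                          int restarts, int retry_count)
{
    assert(probe_status >= 0 &&
           probe_status <= SCDS_PROBE_COMPLETE_FAILURE);

    /* A successful probe never triggers an action. */
    if (probe_status == 0) {
        return ACTION_NONE;
    }

    /* Below the threshold: keep accumulating, take no action. */
    if (accumulated < SCDS_PROBE_COMPLETE_FAILURE) {
        return ACTION_NONE;
    }

    /* Threshold reached and restarts exhausted: fail over. */
    if (restarts >= retry_count) {
        return ACTION_FAILOVER;
    }

    /* Threshold reached with restarts remaining: restart. */
    return ACTION_RESTART;
}
```

For example, with Retry_count set to 2, an accumulated history of 100 leads to a restart while fewer than two restarts have been attempted, and to a failover once two restarts have been made.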