QDisk(5)                   Cluster Quorum Disk                   QDisk(5)

NAME
qdisk - a disk-based quorum daemon for CMAN / Linux-Cluster
1. Overview

1.1. Problem
In some situations, it may be necessary or desirable to sustain a
majority node failure of a cluster without introducing the need for
asymmetric cluster configurations (e.g. client-server, or heavily-
weighted voting nodes).
1.2. Design Requirements
* Ability to sustain 1..(n-1) of n simultaneous node failures, without the
danger of a simple network partition causing a split brain. That is,
we need to be able to ensure that the majority failure case is not
merely the result of a network partition.
* Ability to use external reasons for deciding which partition is
the quorate partition in a partitioned cluster. For example, a user
may have a service running on one node, and that node must always be
the master in the event of a network partition. Or, a node might lose
all network connectivity except the cluster communication path - in
which case, a user may wish that node to be evicted from the cluster.
* Integration with CMAN. We must not require CMAN to run with us (or
without us). Linux-Cluster does not require a quorum disk normally -
introducing new requirements on the base of how Linux-Cluster operates
is not allowed.
* Data integrity. In order to recover from a majority failure, fencing
is required. The fencing subsystem is already provided by Linux-Clus‐
ter.
* Non-reliance on hardware or protocol specific methods (i.e. SCSI
reservations). This ensures the quorum disk algorithm can be used on
the widest range of hardware configurations possible.
* Little or no memory allocation after initialization. In critical
paths during failover, we do not want to have to worry about being
killed during a memory pressure situation because an allocation
triggers a page fault and the Linux OOM killer responds...
1.3. Hardware Considerations and Requirements

1.3.1. Concurrent, Synchronous, Read/Write Access
This quorum daemon requires a shared block device with concurrent
read/write access from all nodes in the cluster. The shared block
device can be a multi-port SCSI RAID array, a Fiber-Channel RAID SAN, a
RAIDed iSCSI target, or even GNBD. The quorum daemon uses O_DIRECT to
write to the device.
1.3.2. Bargain-basement JBODs need not apply
There is a minimum performance requirement inherent when using disk-
based cluster quorum algorithms, so design your cluster accordingly.
Using a cheap JBOD with old SCSI2 disks on a multi-initiator bus will
cause problems at the first load spike. Plan your loads accordingly; a
node's inability to write to the quorum disk in a timely manner will
cause the cluster to evict the node. Using host-RAID or multi-initia‐
tor parallel SCSI configurations with the qdisk daemon is unlikely to
work, and will probably cause administrators a lot of frustration.
That having been said, because the timeouts are configurable, most
hardware should work if the timeouts are set high enough.
1.3.3. Fencing is Required
In order to maintain data integrity under all failure scenarios, use of
this quorum daemon requires adequate fencing, preferably power-based
fencing. Watchdog timers and software-based solutions to reboot the
node internally, while possibly sufficient, are not considered 'fenc‐
ing' for the purposes of using the quorum disk.
1.4. Limitations
* At this time, this daemon supports a maximum of 16 nodes. This is
primarily a scalability issue: As we increase the node count, we
increase the amount of synchronous I/O contention on the shared quorum
disk.
* Cluster node IDs must be statically configured in cluster.conf and
must be numbered from 1..16 (there can be gaps, of course).
* Cluster node votes must all be 1.
* CMAN must be running before the qdisk program can operate at full
capacity. If CMAN is not running, qdisk will wait for it.
* CMAN's eviction timeout should be at least 2x the quorum daemon's to
give the quorum daemon adequate time to converge on a master during a
failure + load spike situation. See section 3.3.1 for specific
details.
* For 'all-but-one' failure operation, the total number of votes
assigned to the quorum device should be equal to or greater than the
total number of node-votes in the cluster. While it is possible to
assign only one (or a few) votes to the quorum device, the effects of
doing so have not been explored.
* For 'tiebreaker' operation in a two-node cluster, unset CMAN's
two_node flag (or set it to 0), set CMAN's expected votes to '3', set
each node's vote to '1', and leave qdisk's vote count unset. This will
allow the cluster to operate if either both nodes are online, or a
single node is online and its heuristics pass.
* Currently, the quorum disk daemon is difficult to use with CLVM if
the quorum disk resides on a CLVM logical volume. CLVM requires a quo‐
rate cluster to correctly operate, which introduces a chicken-and-egg
problem for starting the cluster: CLVM needs quorum, but the quorum
daemon needs CLVM (if and only if the quorum device lies on CLVM-man‐
aged storage). One way to work around this is to *not* set the clus‐
ter's expected votes to include the quorum daemon's votes. Bring all
nodes online, and start the quorum daemon *after* the whole cluster is
running. This will allow the expected votes to increase naturally.
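The workaround above can be illustrated with a configuration sketch (the node count, vote values, and label are hypothetical): with three nodes and a 2-vote quorum device, expected_votes counts only the node votes, so the cluster can become quorate, and CLVM can start, before qdiskd is launched.

```xml
<!-- Hypothetical 3-node cluster with the quorum disk on CLVM storage.  -->
<!-- expected_votes counts nodes only (3), not the quorum device's 2    -->
<!-- votes; start qdiskd after the whole cluster is up, and the         -->
<!-- expected votes will increase naturally.                            -->
<cman expected_votes="3"/>
<quorumd interval="1" tko="10" votes="2" label="clvm-qdisk"/>
```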
2. Algorithms

2.1. Heartbeating & Liveliness Determination
Nodes update individual status blocks on the quorum disk at a user-
defined rate. Each write of a status block alters the timestamp, which
is what other nodes use to decide whether a node has hung or not.
After a user-defined number of 'misses' (that is, failures to update
the timestamp), a node is declared offline. After a certain number of
'hits' (changed timestamp + "i am alive" state), the node is declared
online.
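The miss/hit hysteresis described above can be sketched as a small state machine (an illustrative model, not qdiskd's actual implementation; the tko and tko_up names mirror the configuration attributes described in section 3.1):

```python
# Simplified sketch of qdiskd-style liveness hysteresis (not the real code).
# A node is declared offline after `tko` consecutive missed timestamp
# updates, and declared online again only after `tko_up` consecutive hits.

class NodeLiveness:
    def __init__(self, tko=10, tko_up=3):
        self.tko = tko          # misses before the node is declared offline
        self.tko_up = tko_up    # hits before the node is declared online
        self.misses = 0
        self.hits = 0
        self.online = False
        self.last_timestamp = None

    def observe(self, timestamp):
        """Examine the node's status block once per cycle."""
        if timestamp != self.last_timestamp:    # timestamp changed: a 'hit'
            self.last_timestamp = timestamp
            self.misses = 0
            self.hits += 1
            if self.hits >= self.tko_up:
                self.online = True
        else:                                   # unchanged: a 'miss'
            self.hits = 0
            self.misses += 1
            if self.misses >= self.tko:
                self.online = False
        return self.online
```

With the defaults above, a node that stops writing is declared offline on the tenth consecutive unchanged read; a short I/O stall of a few cycles does not flip its state.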
The status block contains additional information, such as a bitmask of
the nodes that the node believes are online. Some of this information
is used by the master, while some is just for performance recording and
may be used at a later time. The most important pieces of information
a node writes to its status block are:
- Timestamp
- Internal state (available / not available)
- Score
- Known max score (may be used in the future to detect invalid
configurations)
- Vote/bid messages
- Other nodes it thinks are online
2.2. Scoring & Heuristics
The administrator can configure up to 10 purely arbitrary heuristics,
and must exercise caution in doing so. At least one administrator-
defined heuristic is required for operation, but it is generally a good
idea to have more than one heuristic. By default, only nodes scoring
over 1/2 of the total maximum score will claim they are available via
the quorum disk, and a node (master or otherwise) whose score drops too
low will remove itself (usually, by rebooting).
The heuristics themselves can be any command executable by 'sh -c'.
For example, in early testing the following was used:
<heuristic program="[ -f /quorum ]" score="10" interval="2"/>
This is a literal sh-ism which tests for the existence of a file called
"/quorum". Without that file, the node would claim it was unavailable.
This is an awful example, and should never, ever be used in production,
but is provided as an example as to what one could do...
Typically, the heuristics should be snippets of shell code or commands
which help determine a node's usefulness to the cluster or clients.
Ideally, you want to add traces for all of your network paths (e.g.
check links, or ping routers), and methods to detect availability of
shared storage.
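For example, heuristics along those lines might look like the following configuration sketch (the router address and mount point are placeholders, not recommendations):

```xml
<!-- Hypothetical heuristics: a default-router ping and a check that a  -->
<!-- shared-storage mount point is reachable. Addresses and paths are   -->
<!-- placeholders.                                                      -->
<heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2" tko="3"/>
<heuristic program="[ -d /mnt/shared/. ]" score="2" interval="2" tko="3"/>
```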
2.3. Master Election
Only one master is present at any one time in the cluster, regardless
of how many partitions exist within the cluster itself. The master is
elected by a simple voting scheme in which the lowest-numbered node
which believes it is capable of running (i.e. scores high enough) bids
for master status. If the other nodes agree, it becomes the master. This
algorithm is run whenever no master is present.
If another node comes online with a lower node ID while a node is still
bidding for master status, it will rescind its bid and vote for the
lower node ID. If a master dies or a bidding node dies, the voting
algorithm is started over. The voting algorithm typically takes two
passes to complete.
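The bid/rescind rule reduces to a simple function over the set of live, eligible nodes (an illustrative model only; in qdiskd the agreement is reached over several disk-write passes, not in one call):

```python
# Illustrative sketch of the master-election outcome (not qdiskd's code):
# among nodes that consider themselves eligible (score high enough), the
# lowest node ID wins the bid; any higher-numbered bidder rescinds its
# bid and votes for the lower ID.

def elect_master(nodes):
    """nodes: dict mapping node_id -> eligible (bool).
    Returns the winning node ID, or None if no node is eligible."""
    eligible = [nid for nid, ok in sorted(nodes.items()) if ok]
    return eligible[0] if eligible else None
```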
Master deaths take marginally longer to recover from than non-master
deaths, because a new master must be elected before the old master can
be evicted & fenced.
2.4. Master Duties
The master node decides who is or is not in the master partition, as
well as handles eviction of dead nodes (both via the quorum disk and
via the linux-cluster fencing system by using the cman_kill_node()
API).
2.5. How it All Ties Together
When a master is present, and if the master believes a node to be
online, that node will advertise to CMAN that the quorum disk is avail‐
able. The master will only grant a node membership if:
(a) CMAN believes the node to be online, and
(b) that node has made enough consecutive, timely writes to the
quorum disk, and
(c) the node has a high enough score to consider itself online.
3. Configuration

3.1. The <quorumd> tag
This tag is a child of the top-level <cluster> tag.
<quorumd
interval="1"
This is the frequency of read/write cycles, in seconds.
tko="10"
This is the number of cycles a node must miss in order to be
declared dead. The default for this number is dependent on the
configured token timeout.
tko_up="X"
This is the number of cycles a node must be seen in order to be
declared online. Default is floor(tko/3).
upgrade_wait="2"
This is the number of cycles a node must wait before initiating a
bid for master status after heuristic scoring becomes sufficient.
The default is 2. This can not be set to 0, and should not exceed
tko.
master_wait="X"
This is the number of cycles a node must wait for votes before
declaring itself master after making a bid. Default is
floor(tko/2). This can not be less than 2, must be greater than
tko_up, and should not exceed tko.
votes="3"
This is the number of votes the quorum daemon advertises to CMAN
when it has a high enough score. The default is the number of
nodes in the cluster minus 1. For example, in a 4 node cluster,
the default is 3. This value may change during normal operation,
for example when adding or removing a node from the cluster.
log_level="4"
This controls the verbosity of the quorum daemon in the system
logs. 0 = emergencies; 7 = debug. This option is deprecated.
log_facility="daemon"
This controls the syslog facility used by the quorum daemon when
logging. For a complete list of available facilities, see sys‐
log.conf(5). The default value for this is 'daemon'. This option
is deprecated.
status_file="/foo"
Write internal states out to this file periodically ("-" = use
stdout). This is primarily used for debugging. The default value
for this attribute is undefined. This option can be changed while
qdiskd is running.
min_score="3"
Absolute minimum score for a node to consider itself "alive". If
omitted, or set to 0, the default function "floor((n+1)/2)" is
used, where n is the total of all defined heuristics' score
attributes. This must never exceed the sum of the heuristic
scores, or else the quorum disk will never be available.
reboot="1"
If set to 0 (off), qdiskd will *not* reboot after a negative
transition as a result of a change in score (see section 2.2). The
default for this value is 1 (on). This option can be changed
while qdiskd is running.
master_wins="0"
If set to 1 (on), only the qdiskd master will advertise its votes
to CMAN. In a network partition, only the qdisk master will pro‐
vide votes to CMAN. Consequently, that node will automatically
"win" in a fence race.
This option requires careful tuning of the CMAN timeout, the
qdiskd timeout, and CMAN's quorum_dev_poll value. As a rule of
thumb, CMAN's quorum_dev_poll value should be equal to Totem's
token timeout and qdiskd's timeout (interval*tko) should be less
than half of Totem's token timeout. See section 3.3.1 for more
information.
This option only takes effect if there are no heuristics
configured, and it is valid only for two-node clusters. It is
automatically disabled if heuristics are defined or the cluster
has more than two nodes configured.
In a two-node cluster with no heuristics and no defined vote count
(see above), this mode is turned on by default. If enabled in this
way at startup and a node is later added to the cluster configura‐
tion or the vote count is set to a value other than 1, this mode
will be disabled.
allow_kill="1"
If set to 0 (off), qdiskd will *not* instruct CMAN to kill nodes
it thinks are dead (as a result of not writing to the quorum disk).
The default for this value is 1 (on). This option can be changed
while qdiskd is running.
paranoid="0"
If set to 1 (on), qdiskd will watch internal timers and reboot the
node if it takes more than (interval * tko) seconds to complete a
quorum disk pass. The default for this value is 0 (off). This
option can be changed while qdiskd is running.
io_timeout="0"
If set to 1 (on), qdiskd will watch internal timers and reboot the
node if qdisk is not able to write to disk after (interval * tko)
seconds. The default for this value is 0 (off). If io_timeout is
active, max_error_cycles is overridden and set to off.
scheduler="rr"
Valid values are 'rr', 'fifo', and 'other'. Selects the schedul‐
ing queue in the Linux kernel for operation of the main & score
threads (does not affect the heuristics; they are always run in
the 'other' queue). Default is 'rr'. See sched_setscheduler(2)
for more details.
priority="1"
Valid values for 'rr' and 'fifo' are 1..100 inclusive. Valid val‐
ues for 'other' are -20..20 inclusive. Sets the priority of the
main & score threads. The default value is 1 (in the RR and FIFO
queues, higher numbers denote higher priority; in OTHER, lower
values denote higher priority). This option can be changed while
qdiskd is running.
stop_cman="0"
Ordinarily, cluster membership is left up to CMAN, not qdisk. If
this parameter is set to 1 (on), qdiskd will tell CMAN to leave
the cluster if it is unable to initialize the quorum disk during
startup. This can be used to prevent cluster participation by a
node which has been disconnected from the SAN. The default for
this value is 0 (off). This option can be changed while qdiskd is
running.
use_uptime="1"
If this parameter is set to 1 (on), qdiskd will use values from
/proc/uptime for internal timings. This is a bit less precise
than gettimeofday(2), but the benefit is that changing the system
clock will not affect qdiskd's behavior - even if paranoid is
enabled. If set to 0, qdiskd will use gettimeofday(2), which is
more precise. The default for this value is 1 (on / use uptime).
device="/dev/sda1"
This is the device the quorum daemon will use. This device must
be the same on all nodes.
label="mylabel"
This overrides the device field if present. If specified, the
quorum daemon will read /proc/partitions and check for qdisk sig‐
natures on every block device found, comparing the label against
the specified label. This is useful in configurations where the
block device name differs on a per-node basis.
cman_label="mylabel"
This overrides the label advertised to CMAN if present. If speci‐
fied, the quorum daemon will register with this name instead of
the actual device name.
max_error_cycles="0"
If we receive an I/O error during a cycle, we do not poll CMAN and
tell it we are alive. If specified, this value will cause qdiskd
to exit after the specified number of consecutive cycles during
which I/O errors occur. The default is 0 (no maximum). This
option can be changed while qdiskd is running. This option is
ignored if io_timeout is set to 1.
/>
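The default min_score function described above ("floor((n+1)/2)", where n is the sum of all heuristics' score attributes) can be checked with a quick sketch (the function name is illustrative):

```python
import math

def default_min_score(heuristic_scores):
    """Default min_score when the attribute is omitted or 0:
    floor((n+1)/2), where n is the sum of all configured heuristics'
    score attributes."""
    n = sum(heuristic_scores)
    return math.floor((n + 1) / 2)
```

For example, three heuristics scored 1 each give n = 3 and a default min_score of floor(4/2) = 2, so a node must pass at least two of the three heuristics to consider itself alive.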
3.3.1. Quorum Disk Timings
Qdiskd should not be used in environments requiring failure detection
times of less than approximately 10 seconds.
Qdiskd will attempt to automatically configure timings based on the
totem timeout and the TKO. If configuring manually, Totem's token
timeout must be set to a value at least 1 interval greater than the
following function:
interval * (tko + master_wait + upgrade_wait)
So, if you have an interval of 2, a tko of 7, master_wait of 2 and
upgrade_wait of 2, the token timeout should be at least 24 seconds
(24000 msec).
It is recommended to have at least 3 intervals to reduce the risk of
quorum loss during heavy I/O load. As a rule of thumb, using a totem
timeout more than 2x qdiskd's timeout will result in good behavior.
An improper timing configuration will cause CMAN to give up on qdiskd,
causing a temporary loss of quorum during master transition.
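The constraint above can be expressed directly (a sketch using the attribute names from section 3.1; when unset, the documented defaults master_wait = floor(tko/2) and upgrade_wait = 2 are assumed):

```python
def min_token_timeout_ms(interval, tko, master_wait=None, upgrade_wait=2):
    """Minimum Totem token timeout, in milliseconds: at least one
    interval more than interval * (tko + master_wait + upgrade_wait)."""
    if master_wait is None:
        master_wait = tko // 2          # documented default: floor(tko/2)
    seconds = interval * (tko + master_wait + upgrade_wait) + interval
    return seconds * 1000
```

This reproduces the worked example in the text: interval 2, tko 7, master_wait 2 and upgrade_wait 2 give 2 * 11 + 2 = 24 seconds (24000 msec).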
3.2. The <heuristic> tag
This tag is a child of the <quorumd> tag. Heuristics may not be
changed while qdiskd is running.
<heuristic
program="/test.sh"
This is the program used to determine if this heuristic is alive.
This can be anything which may be executed by /bin/sh -c. A
return value of zero indicates success; anything else indicates
failure. This is required.
score="1"
This is the weight of this heuristic. Be careful when determining
scores for heuristics. The default score for each heuristic is 1.
interval="2"
This is the frequency (in seconds) at which we poll the heuristic.
The default interval is determined by the qdiskd timeout.
tko="1"
After this many failed attempts to run the heuristic, it is con‐
sidered DOWN, and its score is removed. The default tko for each
heuristic is determined by the qdiskd timeout.
/>
3.3. Examples

3.3.1. 3 cluster nodes & 3 routers
<cman expected_votes="6" .../>
<clusternodes>
<clusternode name="node1" votes="1" ... />
<clusternode name="node2" votes="1" ... />
<clusternode name="node3" votes="1" ... />
</clusternodes>
<quorumd interval="1" tko="10" votes="3" label="testing">
<heuristic program="ping A -c1 -w1" score="1" interval="2"
tko="3"/>
<heuristic program="ping B -c1 -w1" score="1" interval="2"
tko="3"/>
<heuristic program="ping C -c1 -w1" score="1" interval="2"
tko="3"/>
</quorumd>
3.3.2. 2 cluster nodes & 1 IP tiebreaker
<cman two_node="0" expected_votes="3" .../>
<clusternodes>
<clusternode name="node1" votes="1" ... />
<clusternode name="node2" votes="1" ... />
</clusternodes>
<quorumd interval="1" tko="10" votes="1" label="testing">
<heuristic program="ping A -c1 -w1" score="1" interval="2"
tko="3"/>
</quorumd>
3.4. Heuristic score considerations
* Heuristic timeouts should be set high enough to allow the previous
run of a given heuristic to complete.
* Heuristic scripts returning anything except 0 as their return code
are considered failed.
* The worst-case for improperly configured quorum heuristics is a race
to fence where two partitions simultaneously try to kill each other.
3.5. Creating a quorum disk partition
The mkqdisk utility can create and list currently configured quorum
disks visible to the local node; see mkqdisk(8) for more details.
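A typical invocation might look like the following (the device path and label are placeholders; see mkqdisk(8) for the authoritative option list):

```shell
# Initialize a quorum disk on a shared partition with a cluster-wide label:
mkqdisk -c /dev/sdb1 -l testing

# List quorum disks visible to this node:
mkqdisk -L
```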
SEE ALSO
mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5), gettimeofday(2)
12 Oct 2011 QDisk(5)