TORUS-2QOS(8)                  OpenIB Management                 TORUS-2QOS(8)

NAME
       torus-2QoS - Routing engine for OpenSM subnet manager

DESCRIPTION
       Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus
       fabrics.  The torus-2QoS routing engine can provide the following
       functionality on a 2D/3D torus:

	 – Routing that is free of credit loops.
	 – Two levels of Quality of Service (QoS), assuming switches and chan‐
	   nel adapters support eight data VLs.
	 – The ability to route around a single failed switch, and/or multiple
	   failed links, without
	   – introducing credit loops, or
	   – changing path SL values.
	 – Very short run times, with good scaling properties as fabric size
	   increases.

UNICAST ROUTING
       Unicast routing in torus-2QoS  is  based	 on  Dimension	Order  Routing
       (DOR).	It  avoids  the deadlocks that would otherwise occur in a DOR-
       routed torus using the concept of a dateline for each torus  dimension.
       It encodes into a path SL which datelines the path crosses, as follows:

           sl = 0;
           for (d = 0; d < torus_dimensions; d++) {
                   /* path_crosses_dateline(d) returns 0 or 1 */
                   sl |= path_crosses_dateline(d) << d;
           }

       On  a  3D torus this consumes three SL bits, leaving one SL bit unused.
       Torus-2QoS uses this SL bit to implement two QoS levels.
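
       Given this encoding, the otherwise-unused bit (SL bit 3, which the
       QUALITY OF SERVICE CONFIGURATION section below identifies as the QoS
       selector) can carry the QoS level.  A minimal sketch, in which the
       qos_level variable is purely illustrative:

           /* qos_level is 0 or 1; SL values 0-7 then carry one QoS
            * level, and SL values 8-15 carry the other
            */
           sl |= qos_level << 3;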

       Torus-2QoS also makes use of the output port dependence of switch SL2VL
       maps  to	 encode	 into  one  VL bit the information encoded in three SL
       bits.  It computes in which  torus  coordinate  direction  each	inter-
       switch link "points", and writes SL2VL maps for such ports as follows:

           for (sl = 0; sl < 16; sl++) {
                   /* cdir(port) computes which torus coordinate direction
                    * a switch port "points" in; returns 0, 1, or 2
                    */
                   sl2vl(iport, oport, sl) = 0x1 & (sl >> cdir(oport));
           }

       Thus,  on  a  pristine  3D torus, i.e., in the absence of failed fabric
       switches, torus-2QoS consumes eight SL values (SL bits 0-2) and two  VL
       values (VL bit 0) per QoS level to provide deadlock-free routing.
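
       As a concrete illustration of the two fragments above (the values are
       chosen only for this example), consider a path that crosses the x and
       z datelines but not the y dateline:

           /* dateline crossings: x (bit 0) and z (bit 2) */
           sl = (1 << 0) | (0 << 1) | (1 << 2);     /* sl = 5 */

           /* on an output port pointing in the z coordinate
            * direction, i.e., cdir(oport) == 2, the SL2VL map yields
            */
           vl = 0x1 & (sl >> 2);                    /* vl = 1 */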

       Torus-2QoS  routes  around link failure by "taking the long way around"
       any 1D ring interrupted by link failure.	 For example, consider the  2D
       6x5 torus below, where switches are denoted by [+a-zA-Z]:
                     |    |    |    |    |    |
                4  --+----+----+----+----+----+--
                     |    |    |    |    |    |
                3  --+----+----+----D----+----+--
                     |    |    |    |    |    |
                2  --+----+----I----r----+----+--
                     |    |    |    |    |    |
                1  --m----S----n----T----o----p--
                     |    |    |    |    |    |
              y=0  --+----+----+----+----+----+--
                     |    |    |    |    |    |

                   x=0    1    2    3    4    5

       For  a pristine fabric the path from S to D would be S-n-T-r-D.	In the
       event that either link S-n or n-T has failed, torus-2QoS would use  the
       path S-m-p-o-T-r-D.  Note that it can do this without changing the path
       SL value; once the 1D ring m-S-n-T-o-p-m has been  broken  by  failure,
       path  segments using it cannot contribute to deadlock, and the x-direc‐
       tion dateline (between, say, x=5 and x=0) can be ignored for path  seg‐
       ments on that ring.

       One  result of this is that torus-2QoS can route around many simultane‐
       ous link failures, as long as no 1D ring is broken into	disjoint  seg‐
       ments.	For  example, if links n-T and T-o have both failed, that ring
       has  been  broken  into	two  disjoint  segments,  T   and   o-p-m-S-n.
       Torus-2QoS  checks  for	such  issues,  reports	if they are found, and
       refuses to route such fabrics.

       Note that in the case where there are multiple parallel links between a
       pair  of switches, torus-2QoS will allocate routes across such links in
       a round-robin fashion, based on ports at the  path  destination	switch
       that  are  active  and  not used for inter-switch links.	 Should a link
       that is one of several such parallel  links  fail,  routes  are	redis‐
       tributed	 across	 the  remaining links.	When the last of such a set of
       parallel links fails, traffic is rerouted as described above.

       Handling a failed switch under DOR requires introducing into a path at
       least one turn that would otherwise be "illegal", i.e., not allowed by
       DOR rules.  Torus-2QoS will introduce such a turn as close as possible
       to the failed switch in order to route around it.

       In  the	above  example,	 suppose switch T has failed, and consider the
       path from S to D.  Torus-2QoS will produce the path  S-n-I-r-D,	rather
       than  the  S-n-T-r-D path for a pristine torus, by introducing an early
       turn at n.  Normal DOR rules will cause traffic arriving at switch I to
       be  forwarded  to  switch  r;  for  traffic  arriving from I due to the
       "early" turn at n, this will generate an "illegal" turn at I.

       Torus-2QoS will also use the input port dependence of SL2VL maps to
       set VL bit 1 (which would otherwise be unused) for y-x, z-x, and z-y
       turns, i.e., those turns that are illegal under DOR.  This causes the
       first hop after any such turn to use a separate set of VL values, and
       prevents deadlock in the presence of a single failed switch.
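
       A sketch of what such an SL2VL map might look like, written in the
       style of the map above; the helper illegal_turn() is hypothetical,
       named here only for illustration:

           for (sl = 0; sl < 16; sl++) {
                   /* illegal_turn(iport, oport) returns 1 if the turn
                    * from iport to oport is a y-x, z-x, or z-y turn,
                    * i.e., illegal under DOR; otherwise 0
                    */
                   vl = 0x1 & (sl >> cdir(oport));
                   if (illegal_turn(iport, oport))
                           vl |= 0x2;  /* first hop after an illegal turn */
                   sl2vl(iport, oport, sl) = vl;
           }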

       For any given path, only the hops after a turn that  is	illegal	 under
       DOR  can contribute to a credit loop that leads to deadlock.  So in the
       example above with failed switch T, the location of the illegal turn at
       I  in the path from S to D requires that any credit loop caused by that
       turn must encircle the failed switch at T.  Thus the second  and	 later
       hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a
       credit loop because they cannot be used to construct a loop  encircling
       T.  The hop I-r uses a separate VL, so it cannot contribute to a credit
       loop encircling T.

       Extending this argument shows that in addition to being capable of
       routing around a single switch failure without introducing deadlock,
       torus-2QoS can also route around multiple failed switches, provided
       they are adjacent in the last dimension routed by DOR.  For example,
       consider the following case on a 6x6 2D torus:
                     |    |    |    |    |    |
                5  --+----+----+----+----+----+--
                     |    |    |    |    |    |
                4  --+----+----+----D----+----+--
                     |    |    |    |    |    |
                3  --+----+----I----u----+----+--
                     |    |    |    |    |    |
                2  --+----+----q----R----+----+--
                     |    |    |    |    |    |
                1  --m----S----n----T----o----p--
                     |    |    |    |    |    |
              y=0  --+----+----+----+----+----+--
                     |    |    |    |    |    |

                   x=0    1    2    3    4    5

       Suppose switches T and R have failed, and consider the path from	 S  to
       D.  Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn
       at switch I, and with hop I-u using a VL with bit 1 set.

       As a further example, consider a	 case  that  torus-2QoS	 cannot	 route
       without	deadlock:  two failed switches adjacent in a dimension that is
       not the last dimension routed by DOR; here the failed  switches	are  O
       and T:
                     |    |    |    |    |    |
                5  --+----+----+----+----+----+--
                     |    |    |    |    |    |
                4  --+----+----+----+----+----+--
                     |    |    |    |    |    |
                3  --+----+----+----+----D----+--
                     |    |    |    |    |    |
                2  --+----+----I----q----r----+--
                     |    |    |    |    |    |
                1  --m----S----n----O----T----p--
                     |    |    |    |    |    |
              y=0  --+----+----+----+----+----+--
                     |    |    |    |    |    |

                   x=0    1    2    3    4    5

       In a pristine fabric, torus-2QoS would generate the path from S to D
       as S-n-O-T-r-D.  With failed switches O and T, torus-2QoS will
       generate the path S-n-I-q-r-D, with an illegal turn at switch I, and
       with hop I-q using a VL with bit 1 set.  In contrast to the earlier
       examples, the second hop after the illegal turn, q-r, can be used to
       construct a credit loop encircling the failed switches.

MULTICAST ROUTING
       Since torus-2QoS uses all four available SL bits, and the three data VL
       bits  that are typically available in current switches, there is no way
       to use SL/VL values to separate multicast traffic from unicast traffic.
       Thus, torus-2QoS must generate multicast routing such that credit loops
       cannot arise from a combination of multicast and unicast path segments.

       It turns out that it is possible to construct spanning trees for multi‐
       cast  routing  that  have  that property.  For the 2D 6x5 torus example
       above, here is the full-fabric spanning tree that torus-2QoS will  con‐
       struct, where "x" is the root switch and each "+" is a non-root switch:
                4    +    +    +    +    +    +
                     |    |    |    |    |    |
                3    +    +    +    +    +    +
                     |    |    |    |    |    |
                2    +----+----+----x----+----+
                     |    |    |    |    |    |
                1    +    +    +    +    +    +
                     |    |    |    |    |    |
              y=0    +    +    +    +    +    +

                   x=0    1    2    3    4    5

       For  multicast traffic routed from root to tip, every turn in the above
       spanning tree is a legal DOR turn.

       For traffic routed from tip to root, and for some traffic routed
       through the root, the turns taken are not legal DOR turns.  However,
       the union of multicast routing on this spanning tree with DOR unicast
       routing can provide only three of the four turns needed to construct
       a credit loop.

       In addition, if none of the above  spanning  tree  branches  crosses  a
       dateline used for unicast credit loop avoidance on a torus, and if mul‐
       ticast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS uses
       SL  bit 3 to differentiate QoS level), then multicast traffic also can‐
       not contribute to the "ring" credit loops that are  otherwise  possible
       in a torus.

       Torus-2QoS  uses	 these	ideas to create a master spanning tree.	 Every
       multicast group spanning tree will be constructed as a  subset  of  the
       master tree, with the same root as the master tree.

       Such  multicast group spanning trees will in general not be optimal for
       groups which are a subset of the full fabric. However, this  compromise
       must be made to enable support for two QoS levels on a torus while pre‐
       venting credit loops.

       In the presence of link or switch failures that result in a fabric  for
       which  torus-2QoS  can  generate credit-loop-free unicast routes, it is
       also possible to generate a master spanning  tree  for  multicast  that
       retains	the  required  properties.  For example, consider that same 2D
       6x5 torus, with the link from (2,2) to (3,2) failed.   Torus-2QoS  will
       generate the following master spanning tree:
                4    +    +    +    +    +    +
                     |    |    |    |    |    |
                3    +    +    +    +    +    +
                     |    |    |    |    |    |
                2  --+----+----+    x----+----+--
                     |    |    |    |    |    |
                1    +    +    +    +    +    +
                     |    |    |    |    |    |
              y=0    +    +    +    +    +    +

                   x=0    1    2    3    4    5

       Two  things are notable about this master spanning tree.	 First, assum‐
       ing the x dateline was between x=5 and x=0, this spanning  tree	has  a
       branch that crosses the dateline.  However, just as for unicast, cross‐
       ing a dateline on a 1D ring (here, the ring for y=2) that is broken  by
       a failure cannot contribute to a torus credit loop.

       Second,	this  spanning	tree  is  no longer optimal even for multicast
       groups that encompass the entire fabric.	  That,	 unfortunately,	 is  a
       compromise  that	 must be made to retain the other desirable properties
       of torus-2QoS routing.

       In the event that a single switch fails,	 torus-2QoS  will  generate  a
       master spanning tree that has no "extra" turns by appropriately select‐
       ing a root switch.  In the 2D 6x5 torus example, assume	now  that  the
       switch  at  (3,2),  i.e.,  the  root  for  a  pristine  fabric,	fails.
       Torus-2QoS will generate the following master spanning  tree  for  that
       case:
                                    |
                4    +    +    +    +    +    +
                     |    |    |    |    |    |
                3    +    +    +    +    +    +
                     |    |    |         |    |
                2    +    +    +         +    +
                     |    |    |         |    |
                1    +----+----x----+----+----+
                     |    |    |    |    |    |
              y=0    +    +    +    +    +    +
                                    |

                   x=0    1    2    3    4    5

       Assuming the y dateline was between y=4 and y=0, this spanning tree has
       a branch that crosses a dateline.  However, again this cannot  contrib‐
       ute  to	credit loops as it occurs on a 1D ring (the ring for x=3) that
       is broken by a failure, as in the above example.

TORUS TOPOLOGY DISCOVERY
       The algorithm used by torus-2QoS to construct the torus topology from
       the undirected graph representing the fabric requires that the radix
       of each dimension be configured via torus-2QoS.conf.  It also
       requires that the torus topology be "seeded"; for a 3D torus this
       requires configuring four switches that define the three coordinate
       directions of the torus.
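
       For illustration only, a seed for a 6x6x6 torus might look like the
       following, where the switch GUIDs are hypothetical; see
       torus-2QoS.conf(5) for the authoritative syntax:

           torus 6 6 6
           xp_link 0x200000 0x200001
           yp_link 0x200000 0x200006
           zp_link 0x200000 0x200024

       Each of the xp_link, yp_link, and zp_link lines names the seed switch
       and its neighbor in the positive x, y, or z coordinate direction, so
       the four switches that appear define the three coordinate directions.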

       Given this starting information, the algorithm examines the cube
       formed by the eight switch locations bounded by the corners (x,y,z)
       and (x+1,y+1,z+1).  Based on switches already placed into the torus
       topology at some of these locations, the algorithm examines 4-loops
       of inter-switch links to find the one that is consistent with a face
       of the cube of switch locations, and adds its switches to the
       discovered topology in the correct locations.

       Because the algorithm is based on examining the topology of 4-loops
       of links, a torus with one or more radix-4 dimensions requires extra
       initial seed configuration.  See torus-2QoS.conf(5) for details.
       Torus-2QoS will detect and report when it has insufficient
       configuration for a torus with radix-4 dimensions.

       In the event the torus is significantly degraded, i.e., there are
       many missing switches or links, it may happen that torus-2QoS is
       unable to place into the torus some switches and/or links that were
       discovered in the fabric, and it will generate a warning in that
       case.  A similar condition occurs if torus-2QoS is misconfigured,
       i.e., the radix of a torus dimension as configured does not match the
       radix of that torus dimension as wired; in that case many
       switches/links in the fabric will not be placed into the torus.

QUALITY OF SERVICE CONFIGURATION
       OpenSM will not program switches and channel adapters with SL2VL maps
       or VL arbitration configuration unless it is invoked with -Q.  Since
       torus-2QoS depends on such functionality for correct operation,
       always invoke OpenSM with -Q when torus-2QoS is in the list of
       routing engines.
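
       For example, to enable QoS support and select torus-2QoS as the
       routing engine:

           opensm -Q -R torus-2QoS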

       Any quality of service configuration method supported  by  OpenSM  will
       work  with torus-2QoS, subject to the following limitations and consid‐
       erations.

       For all routing engines supported by OpenSM except torus-2QoS, there is
       a  one-to-one  correspondence between QoS level and SL.	Torus-2QoS can
       only support two quality of service levels, so only the high-order  bit
       of  any	SL value used for unicast QoS configuration will be honored by
       torus-2QoS.

       For multicast QoS configuration, only SL values 0 and 8 should be  used
       with torus-2QoS.

       Since SL to VL map configuration must be under the complete control
       of torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl, etc.,
       will be ignored, and a warning will be generated.

       Torus-2QoS  uses	 VL  values  0-3 to implement one of its supported QoS
       levels, and VL values 4-7 to  implement	the  other.   Hard-to-diagnose
       application  issues may arise if traffic is not delivered fairly across
       each of these two VL ranges.  Torus-2QoS will detect  and  warn	if  VL
       arbitration  is	configured  unfairly  across VLs in the range 0-3, and
       also in the range 4-7.  Note that the  default  OpenSM  VL  arbitration
       configuration  does  not	 meet this constraint, so all torus-2QoS users
       should configure VL arbitration via qos_vlarb_high, qos_vlarb_low, etc.
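
       For example (the weights below are illustrative, not a
       recommendation), a configuration that arbitrates fairly within VLs
       0-3 and within VLs 4-7 might assign every VL the same weight:

           qos_vlarb_high 0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:0
           qos_vlarb_low  0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64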

OPERATIONAL CONSIDERATIONS
       Any routing algorithm for a torus IB fabric must employ path SL	values
       to  avoid  credit  loops.   As a result, all applications run over such
       fabrics must perform a path record query to obtain the correct path  SL
       for  connection	setup.	 Applications  that use rdma_cm for connection
       setup will automatically meet this requirement.
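
       A minimal sketch of that pattern (error handling and connection
       establishment omitted; see rdma_cm(7)), in which route resolution is
       the step that performs the path record query:

           #include <rdma/rdma_cma.h>

           int resolve_path(struct sockaddr *dst)
           {
                   struct rdma_cm_id *id;

                   /* no event channel: the id operates synchronously */
                   if (rdma_create_id(NULL, &id, NULL, RDMA_PS_TCP))
                           return -1;
                   if (rdma_resolve_addr(id, NULL, dst, 2000))
                           return -1;
                   /* rdma_resolve_route() issues the path record query;
                    * the resulting path, including its SL, is then used
                    * by rdma_connect() for connection setup
                    */
                   if (rdma_resolve_route(id, 2000))
                           return -1;
                   return 0;
           }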

       If a change in  fabric  topology	 causes	 changes  in  path  SL	values
       required	 to  route  without  credit loops, in general all applications
       would need to repath to avoid message deadlock.	Since  torus-2QoS  has
       the  ability  to reroute after a single switch failure without changing
       path SL values, repathing by running applications is not required  when
       the fabric is routed with torus-2QoS.

       Torus-2QoS  can	provide	 unchanging  path SL values in the presence of
       subnet manager failover provided that all  OpenSM  instances  have  the
       same idea of dateline location.	See torus-2QoS.conf(5) for details.

       Torus-2QoS will detect configurations of failed switches and links that
       prevent routing that is free of credit loops, and will log warnings and
       refuse to route.	 If "no_fallback" was configured in the list of OpenSM
       routing engines, then no other routing engine will attempt to route the
       fabric.	 In  that case all paths that do not transit the failed compo‐
       nents will continue to work, and the subset of  paths  that  are	 still
       operational  will continue to remain free of credit loops.  OpenSM will
       continue to attempt to route the fabric after every sweep interval, and
       after  any change (such as a link up) in the fabric topology.  When the
       fabric components are repaired, full functionality will be restored.

       In the event OpenSM was configured to allow some other engine to	 route
       the  fabric if torus-2QoS fails, then credit loops and message deadlock
       are likely if torus-2QoS had previously routed the fabric successfully.
       Even  if	 the other engine is capable of routing a torus without credit
       loops, applications that built connections with path SL values  granted
       under  torus-2QoS will likely experience message deadlock under routing
       generated by a different engine, unless they repath.

       To verify that a torus fabric is routed free of credit loops, use ibdm‐
       chk to analyze data collected via ibdiagnet -vlr.

FILES
       /etc/opensm/opensm.conf
	      default OpenSM config file.

       /etc/opensm/qos-policy.conf
	      default QoS policy config file.

       /etc/opensm/torus-2QoS.conf
	      default torus-2QoS config file.

SEE ALSO
       opensm(8), torus-2QoS.conf(5), ibdiagnet(1), ibdmchk(1), rdma_cm(7).

OpenIB                      November 10, 2010                    TORUS-2QOS(8)