Pod Topology Spread Constraints

You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.

You can set cluster-level constraints as a default, or configure topology spread constraints for individual workloads.

Motivation

Imagine that you have a cluster of up to twenty nodes, and you want to run a workload that automatically scales how many replicas it uses. There could be as few as two Pods or as many as fifteen. When there are only two Pods, you'd prefer not to have both of those Pods run on the same node: you would run the risk that a single node failure takes your workload offline.

In addition to this basic usage, there are some advanced usage examples that enable your workloads to benefit from high availability and efficient cluster utilization.

As you scale up and run more Pods, a different concern becomes important. Imagine that you have three nodes running five Pods each. The nodes have enough capacity to run that many replicas; however, the clients that interact with this workload are split across three different datacenters (or infrastructure zones). Now you have less concern about a single node failure, but you notice that latency is higher than you'd like, and you are paying for network costs associated with sending network traffic between the different zones.

You decide that under normal operation you'd prefer to have a similar number of replicas scheduled into each infrastructure zone, and you'd like the cluster to self-heal in the case that there is a problem.

Pod topology spread constraints offer you a declarative way to configure that.

topologySpreadConstraints field

The Pod API includes a field, spec.topologySpreadConstraints. The usage of this field looks like the following:

---
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  # Configure a topology spread constraint
  topologySpreadConstraints:
    - maxSkew: <integer>
      minDomains: <integer> # optional; beta since v1.25
      topologyKey: <string>
      whenUnsatisfiable: <string>
      labelSelector: <object>
      matchLabelKeys: <list> # optional; alpha since v1.25
      nodeAffinityPolicy: [Honor|Ignore] # optional; beta since v1.26
      nodeTaintsPolicy: [Honor|Ignore] # optional; beta since v1.26
  ### other Pod fields go here

You can read more about this field by running kubectl explain Pod.spec.topologySpreadConstraints, or refer to the scheduling section of the API reference for Pod.

Spread constraint definition

You can define one or multiple topologySpreadConstraints entries to instruct the kube-scheduler how to place each incoming Pod in relation to the existing Pods across your cluster. Those fields are:

  • maxSkew describes the degree to which Pods may be unevenly distributed. You must specify this field and the number must be greater than zero. Its semantics differ according to the value of whenUnsatisfiable:

    • if you select whenUnsatisfiable: DoNotSchedule, then maxSkew defines the maximum permitted difference between the number of matching pods in the target topology and the global minimum (the minimum number of matching pods in an eligible domain, or zero if the number of eligible domains is less than minDomains). For example, if you have 3 zones with 2, 2 and 1 matching pods respectively, and maxSkew is set to 1, then the global minimum is 1.
    • if you select whenUnsatisfiable: ScheduleAnyway, the scheduler gives higher precedence to topologies that would help reduce the skew.
  • minDomains indicates a minimum number of eligible domains. This field is optional. A domain is a particular instance of a topology. An eligible domain is a domain whose nodes match the node selector.

    Note: The minDomains field is a beta field and disabled by default in 1.25. You can enable it by enabling the MinDomainsInPodTopologySpread feature gate.

    • The value of minDomains must be greater than 0, when specified. You can only specify minDomains in conjunction with whenUnsatisfiable: DoNotSchedule.
    • When the number of eligible domains with matching topology keys is less than minDomains, Pod topology spread treats the global minimum as 0, and then the calculation of skew is performed. The global minimum is the minimum number of matching Pods in an eligible domain, or zero if the number of eligible domains is less than minDomains.
    • When the number of eligible domains with matching topology keys equals or is greater than minDomains, this value has no effect on scheduling.
    • If you do not specify minDomains, the constraint behaves as if minDomains is 1.
  • topologyKey is the key of node labels. Nodes that have a label with this key and identical values are considered to be in the same topology. We call each instance of a topology (in other words, a <key, value> pair) a domain. The scheduler will try to put a balanced number of pods into each domain. Also, we define an eligible domain as a domain whose nodes meet the requirements of nodeAffinityPolicy and nodeTaintsPolicy.

  • whenUnsatisfiable indicates how to deal with a Pod if it doesn't satisfy the spread constraint:

    • DoNotSchedule (default) tells the scheduler not to schedule it.
    • ScheduleAnyway tells the scheduler to still schedule it while prioritizing nodes that minimize the skew.
  • labelSelector is used to find matching Pods. Pods that match this label selector are counted to determine the number of Pods in their corresponding topology domain. See Label Selectors for more details.

  • matchLabelKeys is a list of pod label keys to select the pods over which spreading will be calculated. The keys are used to look up values from the pod labels; those key-value labels are ANDed with labelSelector to select the group of existing pods over which spreading will be calculated for the incoming pod. Keys that don't exist in the pod labels will be ignored. A null or empty list means only match against the labelSelector.

    With matchLabelKeys, users don't need to update the pod.spec between different revisions. The controller/operator just needs to set different values to the same label key for different revisions. The scheduler will assume the values automatically based on matchLabelKeys. For example, if users use Deployment, they can use the label keyed with pod-template-hash, which is added automatically by the Deployment controller, to distinguish between different revisions in a single Deployment.

     topologySpreadConstraints:
       - maxSkew: 1
         topologyKey: kubernetes.io/hostname
         whenUnsatisfiable: DoNotSchedule
         matchLabelKeys:
           - app
           - pod-template-hash

    Note: The matchLabelKeys field is an alpha field added in 1.25. You have to enable the MatchLabelKeysInPodTopologySpread feature gate in order to use it.

  • nodeAffinityPolicy indicates how we will treat a Pod's nodeAffinity/nodeSelector when calculating pod topology spread skew. Options are:

    • Honor: only nodes matching nodeAffinity/nodeSelector are included in the calculations.
    • Ignore: nodeAffinity/nodeSelector are ignored. All nodes are included in the calculations.

    If this value is null, the behavior is equivalent to the Honor policy.

    Note: The nodeAffinityPolicy is a beta-level field and enabled by default in 1.26. You can disable it by disabling the NodeInclusionPolicyInPodTopologySpread feature gate.

  • nodeTaintsPolicy indicates how we will treat node taints when calculating pod topology spread skew. Options are:

    • Honor: nodes without taints, along with tainted nodes for which the incoming pod has a toleration, are included.
    • Ignore: node taints are ignored. All nodes are included.

    If this value is null, the behavior is equivalent to the Ignore policy.

    Note: The nodeTaintsPolicy is a beta-level field and enabled by default in 1.26. You can disable it by disabling the NodeInclusionPolicyInPodTopologySpread feature gate.

When a Pod defines more than one topologySpreadConstraint, those constraints are combined using a logical AND operation: the kube-scheduler looks for a node for the incoming Pod that satisfies all the configured constraints.
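To make the field descriptions above concrete, here is a minimal sketch (not taken from the official examples; the Pod name and the app: example label are placeholders) of a Pod that combines a hard zone constraint using minDomains with a soft per-node constraint that sets nodeAffinityPolicy and nodeTaintsPolicy. It assumes nodes carry the well-known topology labels and that the MinDomainsInPodTopologySpread feature gate is enabled:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    app: example
spec:
  topologySpreadConstraints:
    # Hard constraint: spread evenly across zones, expecting at least 3 eligible zones.
    # If fewer than 3 zones are eligible, the global minimum is treated as 0.
    - maxSkew: 1
      minDomains: 3
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: example
    # Soft constraint: prefer spreading across nodes, counting only nodes that
    # match the Pod's node affinity and whose taints the Pod tolerates.
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      nodeAffinityPolicy: Honor
      nodeTaintsPolicy: Honor
      labelSelector:
        matchLabels:
          app: example
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.1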

Node labels

Topology spread constraints rely on node labels to identify the topology domain(s) that each node is in. For example, a node might have labels:

region: us-east-1
zone: us-east-1a

Note:

For brevity, this example doesn't use the well-known label keys topology.kubernetes.io/zone and topology.kubernetes.io/region. However, those registered label keys are nonetheless recommended rather than the private (unqualified) label keys region and zone that are used here.

You can't make a reliable assumption about the meaning of a private label key between different contexts.
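As a hedged sketch of what the recommended labelling looks like, the metadata of a node using the registered keys might read as follows (the node name and values are illustrative; in practice the kubelet and cloud provider usually populate these labels automatically rather than you writing a Node manifest by hand):

apiVersion: v1
kind: Node
metadata:
  name: node1
  labels:
    # Registered, well-known topology keys; values are examples only.
    topology.kubernetes.io/region: us-east-1
    topology.kubernetes.io/zone: us-east-1a
    kubernetes.io/hostname: node1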

Suppose you have a 4-node cluster with the following labels:

NAME    STATUS   ROLES    AGE     VERSION   LABELS
node1   Ready    <none>   4m26s   v1.16.0   node=node1,zone=zoneA
node2   Ready    <none>   3m58s   v1.16.0   node=node2,zone=zoneA
node3   Ready    <none>   3m17s   v1.16.0   node=node3,zone=zoneB
node4   Ready    <none>   2m43s   v1.16.0   node=node4,zone=zoneB

Then the cluster is logically viewed as below:

[Diagram: the cluster viewed as two zones - zoneA containing Node1 and Node2, and zoneB containing Node3 and Node4.]

Consistency

You should set the same Pod topology spread constraints on all pods in a group.

Usually, if you are using a workload controller such as a Deployment, the pod template takes care of this for you. If you mix different spread constraints then Kubernetes follows the API definition of the field; however, the behavior is more likely to become confusing and troubleshooting is less straightforward.

You need a mechanism to ensure that all the nodes in a topology domain (such as a cloud provider region) are labelled consistently. To avoid you needing to manually label nodes, most clusters automatically populate well-known labels such as kubernetes.io/hostname. Check whether your cluster supports this.

Topology spread constraint examples

Example: one topology spread constraint

Suppose you have a 4-node cluster where 3 Pods labelled foo: bar are located in node1, node2 and node3 respectively:

[Diagram: zoneA contains Node1 and Node2, each running one Pod; zoneB contains Node3 (running one Pod) and Node4 (empty).]

If you want an incoming Pod to be evenly spread with existing Pods across zones, you can use a manifest similar to:

kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          foo: bar
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.1

From that manifest, topologyKey: zone implies the even distribution will only be applied to nodes that are labelled zone: <any value> (nodes that don't have a zone label are skipped). The field whenUnsatisfiable: DoNotSchedule tells the scheduler to let the incoming Pod stay pending if the scheduler can't find a way to satisfy the constraint.

If the scheduler placed this incoming Pod into zone A, the distribution of Pods would become [3, 1]. That means the actual skew is then 2 (calculated as 3 - 1), which violates maxSkew: 1. To satisfy the constraints and context for this example, the incoming Pod can only be placed onto a node in zone B:

[Diagram: mypod is placed onto Node4 in zoneB, alongside the existing Pod on Node3; zoneA keeps one Pod each on Node1 and Node2.]

OR

[Diagram: mypod is placed onto Node3 in zoneB, together with the existing Pod there; Node4 remains empty, and zoneA keeps one Pod each on Node1 and Node2.]

You can tweak the Pod spec to meet various kinds of requirements:

  • Change maxSkew to a bigger value - such as 2 - so that the incoming Pod can be placed into zone A as well.
  • Change topologyKey to node so as to distribute the Pods evenly across nodes instead of zones. In the above example, if maxSkew remains 1, the incoming Pod can only be placed onto the node node4.
  • Change whenUnsatisfiable: DoNotSchedule to whenUnsatisfiable: ScheduleAnyway to ensure the incoming Pod is always schedulable (assuming other scheduling APIs are satisfied). However, it is still preferred to place the Pod into the topology domain that has fewer matching Pods. (Be aware that this preference is jointly normalized with other internal scheduling priorities such as resource usage ratio.) See the sketch after this list.
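As a sketch of that last option, here is the earlier one-constraint manifest with only whenUnsatisfiable changed to ScheduleAnyway; everything else is unchanged:

kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: zone
      # Soft constraint: the Pod always remains schedulable, but nodes that
      # reduce the skew are preferred.
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          foo: bar
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.1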

Example: multiple topology spread constraints

This builds upon the previous example. Suppose you have a 4-node cluster where 3 existing Pods labeled foo: bar are located on node1, node2 and node3 respectively:

[Diagram: zoneA contains Node1 and Node2, each running one Pod; zoneB contains Node3 (running one Pod) and Node4 (empty).]

You can combine two topology spread constraints to control the spread of Pods both by node and by zone:

kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          foo: bar
    - maxSkew: 1
      topologyKey: node
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          foo: bar
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.1

In this case, to match the first constraint, the incoming Pod can only be placed onto nodes in zone B; while in terms of the second constraint, the incoming Pod can only be scheduled to the node node4. The scheduler only considers options that satisfy all defined constraints, so the only valid placement is onto node node4.

Example: conflicting topology spread constraints

Multiple constraints can lead to conflicts. Suppose you have a 3-node cluster across 2 zones:

[Diagram: zoneA contains Node1 (running two Pods) and Node2 (running one Pod); zoneB contains Node3, which runs two Pods.]

If you were to apply two-constraints.yaml (the manifest from the previous example) to this cluster, you would see that the Pod mypod stays in the Pending state. This happens because: to satisfy the first constraint, the Pod mypod can only be placed into zone B; while in terms of the second constraint, the Pod mypod can only schedule to node node2. The intersection of the two constraints returns an empty set, and the scheduler cannot place the Pod.

To overcome this situation, you can either increase the value of maxSkew or modify one of the constraints to use whenUnsatisfiable: ScheduleAnyway. Depending on circumstances, you might also decide to delete an existing Pod manually - for example, if you are troubleshooting why a bug-fix rollout is not making progress.
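For example, one way to resolve the conflict (a sketch, not the only possible fix) is to keep the zone constraint hard and soften the per-node constraint; this fragment replaces the topologySpreadConstraints section of the two-constraints manifest:

  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: zone
      whenUnsatisfiable: DoNotSchedule   # keep the zone spread as a hard requirement
      labelSelector:
        matchLabels:
          foo: bar
    - maxSkew: 1
      topologyKey: node
      whenUnsatisfiable: ScheduleAnyway  # soften the per-node spread so the Pod can schedule
      labelSelector:
        matchLabels:
          foo: bar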

Interaction with node affinity and node selectors

The scheduler will skip the non-matching nodes from the skew calculations if the incoming Pod has spec.nodeSelector or spec.affinity.nodeAffinity defined.

Example: topology spread constraints with node affinity

Suppose you have a 5-node cluster ranging across zones A to C:

[Diagram: zoneA contains Node1 and Node2, each running one Pod; zoneB contains Node3 (running one Pod) and Node4 (empty); zoneC contains Node5 (empty).]

and you know that zone C must be excluded. In this case, you can compose a manifest as below, so that Pod mypod will be placed into zone B instead of zone C. Similarly, Kubernetes also respects spec.nodeSelector.

kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          foo: bar
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: zone
                operator: NotIn
                values:
                  - zoneC
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.1

Implicit conventions

There are some implicit conventions worth noting here:

  • Only the Pods holding the same namespace as the incoming Pod can be matching candidates.

  • The scheduler bypasses any nodes that don't have any topologySpreadConstraints[*].topologyKey present. This implies that:

    1. any Pods located on those bypassed nodes do not impact maxSkew calculation - in the above example, suppose the node node1 does not have a label "zone", then the 2 Pods will be disregarded, hence the incoming Pod will be scheduled into zone A.
    2. the incoming Pod has no chance of being scheduled onto such nodes - in the above example, suppose a node node5 has the mistyped label zone-typo: zoneC (and no zone label set). After node node5 joins the cluster, it will be bypassed and Pods for this workload aren't scheduled there.
  • Be aware of what will happen if the incoming Pod's topologySpreadConstraints[*].labelSelector doesn't match its own labels. In the above example, if you remove the incoming Pod's labels, it can still be placed onto nodes in zone B, since the constraints are still satisfied. However, after that placement, the degree of imbalance of the cluster remains unchanged - it's still zone A having 2 Pods labelled as foo: bar, and zone B having 1 Pod labelled as foo: bar. If this is not what you expect, update the workload's topologySpreadConstraints[*].labelSelector to match the labels in the pod template, as shown in the sketch after this list.
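For illustration, here is a minimal sketch of a hypothetical Deployment (the name, labels, and image are placeholders) where the constraint's labelSelector matches the labels set in the pod template, so every replica counts toward the spread calculation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar                     # labels that the constraint counts
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              foo: bar               # matches the pod template labels above
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.1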

Cluster-level default constraints

It is possible to set default topology spread constraints for a cluster. Default topology spread constraints are applied to a Pod if, and only if:

  • It doesn't define any constraints in its .spec.topologySpreadConstraints.
  • It belongs to a Service, ReplicaSet, StatefulSet or ReplicationController.

Default constraints can be set as part of the PodTopologySpread plugin arguments in a scheduling profile. The constraints are specified with the same API above, except that labelSelector must be empty. The selectors are calculated from the Services, ReplicaSets, StatefulSets or ReplicationControllers that the Pod belongs to.

An example configuration might look like the following:

apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration

profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
          defaultingType: List

Note: The SelectorSpread plugin is disabled by default. The Kubernetes project recommends using PodTopologySpread to achieve similar behavior.

Built-in default constraints

FEATURE STATE: Kubernetes v1.24 [stable]

If you don't configure any cluster-level default constraints for pod topology spreading, then kube-scheduler acts as if you specified the following default topology constraints:

defaultConstraints:
  - maxSkew: 3
    topologyKey: "kubernetes.io/hostname"
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 5
    topologyKey: "topology.kubernetes.io/zone"
    whenUnsatisfiable: ScheduleAnyway

Also, the legacy SelectorSpread plugin, which provides an equivalent behavior,is disabled by default.

Note:

The PodTopologySpread plugin does not score the nodes that don't have the topology keys specified in the spreading constraints. This might result in a different default behavior compared to the legacy SelectorSpread plugin when using the default topology constraints.

If your nodes are not expected to have both kubernetes.io/hostname and topology.kubernetes.io/zone labels set, define your own constraints instead of using the Kubernetes defaults.

If you don't want to use the default Pod spreading constraints for your cluster, you can disable those defaults by setting defaultingType to List and leaving empty defaultConstraints in the PodTopologySpread plugin configuration:

apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration

profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints: []
          defaultingType: List

Comparison with podAffinity and podAntiAffinity

In Kubernetes, inter-Pod affinity and anti-affinity control how Pods are scheduled in relation to one another - either more packed or more scattered.

podAffinity
attracts Pods; you can try to pack any number of Pods into qualifying topology domain(s).
podAntiAffinity
repels Pods. If you set this to requiredDuringSchedulingIgnoredDuringExecution mode then only a single Pod can be scheduled into a single topology domain; if you choose preferredDuringSchedulingIgnoredDuringExecution then you lose the ability to enforce the constraint.
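For comparison, a minimal sketch (the Pod name and app: example label are placeholders) of required podAntiAffinity, which allows at most one matching Pod per zone rather than keeping the counts balanced:

apiVersion: v1
kind: Pod
metadata:
  name: example-anti-affinity-pod
  labels:
    app: example
spec:
  affinity:
    podAntiAffinity:
      # At most one Pod with the app: example label can run in each zone.
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: example
          topologyKey: topology.kubernetes.io/zone
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.1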

For finer control, you can specify topology spread constraints to distribute Pods across different topology domains - to achieve either high availability or cost-saving. This can also help with rolling out updates to workloads and scaling out replicas smoothly.

For more context, see the Motivation section of the enhancement proposal about Pod topology spread constraints.

Known limitations

  • There's no guarantee that the constraints remain satisfied when Pods are removed. For example, scaling down a Deployment may result in an imbalanced Pod distribution.

    You can use a tool such as the Descheduler to rebalance the Pod distribution.

  • Pods matched on tainted nodes are respected. See Issue 80921.

  • The scheduler doesn't have prior knowledge of all the zones or other topology domains that a cluster has. They are determined from the existing nodes in the cluster. This could lead to a problem in autoscaled clusters, when a node pool (or node group) is scaled to zero nodes, and you're expecting the cluster to scale up, because, in this case, those topology domains won't be considered until there is at least one node in them.

    You can work around this by using a cluster autoscaling tool that is aware of Pod topology spread constraints and is also aware of the overall set of topology domains.

What's next

  • The blog article Introducing PodTopologySpread explains maxSkew in some detail, as well as covering some advanced usage examples.
  • Read the scheduling section of the API reference for Pod.