
exhausted IP addresses from unbalanced zone distribution #7311

Open
universam1 opened this issue Nov 1, 2024 · 2 comments
Labels
bug Something isn't working triage/needs-investigation Issues that need to be investigated before triaging

Comments


universam1 commented Nov 1, 2024

Description

Observed Behavior:
The IP range of one subnet is exhausted, causing "dead" nodes, while subnets in the other zones are left nearly empty.

This is a follow-up of #1810 and #1292, as that topology-spread solution does not scale on large clusters spanning independent teams, namespaces, and so on.

Across dozens of deployments we cannot instruct every developer to follow https://karpenter.sh/v0.10.0/tasks/scheduling/#topology-spread and keep the constraints exactly matched across all teams.
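For reference, the per-workload approach from the linked Karpenter docs looks roughly like this (a sketch; the `example-app` Deployment, its labels, and the image are hypothetical). Every team would have to carry an identical constraint on every workload for the spread to hold cluster-wide:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # hypothetical workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                                 # allow at most 1 pod of zone imbalance
        topologyKey: topology.kubernetes.io/zone   # spread across availability zones
        whenUnsatisfiable: DoNotSchedule           # hard requirement, not best-effort
        labelSelector:
          matchLabels:
            app: example-app
      containers:
      - name: app
        image: public.ecr.aws/nginx/nginx:stable   # placeholder image
```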

ClusterAutoscaler has this option for a reason: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#im-running-cluster-with-nodes-in-multiple-zones-for-ha-purposes-is-that-supported-by-cluster-autoscaler
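The Cluster Autoscaler option referenced there is the `--balance-similar-node-groups` flag, which is enabled once on the autoscaler itself rather than per workload. An illustrative excerpt of its container spec (the version tag is assumed):

```yaml
# Excerpt of a cluster-autoscaler container spec (illustrative):
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0  # assumed tag
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --balance-similar-node-groups=true   # keep similar (per-AZ) node groups at similar sizes
```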

Expected Behavior:

Karpenter should treat zone balancing as a requirement for node scheduling.

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@universam1 universam1 added bug Something isn't working needs-triage Issues that need to be triaged labels Nov 1, 2024
engedaam (Contributor) commented Nov 4, 2024

Can you provide your Karpenter configuration? Karpenter should launch nodes into the subnet with the most available IPs, except when affinity or topology-spread constraints apply. Do you have any topology spread or affinity on your workloads currently?

@engedaam engedaam added triage/needs-investigation Issues that need to be investigated before triaging and removed needs-triage Issues that need to be triaged labels Nov 4, 2024
universam1 (Author) commented

Can you provide your Karpenter configuration?

Sure @engedaam , please find the config below

Configuration:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "5078040335181941408"
    karpenter.sh/nodepool-hash-version: v2
  name: al2023
spec:
  disruption:
    budgets:
    - nodes: 20%
    - duration: 55m
      nodes: "0"
      schedule: '@hourly'
    consolidationPolicy: WhenUnderutilized
    expireAfter: 168h
  limits:
    cpu: "125"
    memory: 1000Gi
  template:
    spec:
      nodeClassRef:
        name: al2023
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
        - on-demand
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - c
        - m
        - r
        - t
      - key: karpenter.k8s.aws/instance-cpu
        operator: Gt
        values:
        - "3"
      - key: karpenter.k8s.aws/instance-cpu
        operator: Lt
        values:
        - "33"
      - key: karpenter.k8s.aws/instance-memory
        operator: Gt
        values:
        - "4000"
      - key: karpenter.k8s.aws/instance-memory
        operator: Lt
        values:
        - "66000"
      - key: karpenter.k8s.aws/instance-ebs-bandwidth
        operator: Gt
        values:
        - "2000"
      - key: karpenter.k8s.aws/instance-hypervisor
        operator: In
        values:
        - nitro
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "3"
      startupTaints:
      - effect: NoExecute
        key: node.cilium.io/agent-not-ready
        value: "true"
  weight: 90
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  annotations:
    karpenter.k8s.aws/ec2nodeclass-hash: "11350300940085964065"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v2
  finalizers:
  - karpenter.k8s.aws/termination
  name: al2023
spec:
  amiFamily: AL2023
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      encrypted: true
      throughput: 125
      volumeSize: 200Gi
      volumeType: gp3
  instanceProfile: o11n-eks-xxx
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  securityGroupSelectorTerms:
  - id: sg-022e0610xxx
  subnetSelectorTerms:
  - id: subnet-0e0d9c1xx
  - id: subnet-0fff56bxx
  - id: subnet-0884d45xx
  tags:
    Name: kubernetes.io/cluster/o11n-eks-o11n-union
    System: o11n-eks-o11n-union
    jw:owner: eks
    jw:project: o11n/eks
    jw:stage: union
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="//"

    --//
    Content-Type: application/node.eks.aws

    apiVersion: node.eks.aws/v1alpha1
    kind: NodeConfig
    spec:
      featureGates:
        InstanceIdNodeName: false # https://github.com/awslabs/amazon-eks-ami/issues/1821
      kubelet:
        config:
          featureGates:
            DisableKubeletCloudCredentialProviders: true
          registryPullQPS: 100
          serializeImagePulls: false
          shutdownGracePeriod: 30s
    --//

Do you have any spread or affinity on your workloads currently?

We did not. However, as a dirty workaround we do now, to force Karpenter into the other zones; that is not a solution.
I believe Karpenter becomes unbalanced due to instance availability and cost differences between zones; it is actually flapping. Since this causes downtime for us, it is a severe issue.
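The workaround amounts to forcing a hard zone spread onto each workload, roughly like this fragment in every pod spec (a sketch; the `app: example-app` selector is hypothetical):

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule   # forces scheduling (and node provisioning) into under-filled zones
  labelSelector:
    matchLabels:
      app: example-app               # hypothetical
```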
