workflow container pod fails immediately upon unscheduled status waiting for node to provision #173

Open
jonathan-fileread opened this issue Jul 5, 2024 · 0 comments


jonathan-fileread commented Jul 5, 2024

GHA jobs fail instantly if the workflow pod is unschedulable because it is waiting for another node to become available (for example, when the CPU/memory resource request is high and the cluster autoscaler has not yet provisioned a node).

(Screenshot of the failed workflow run, 2024-07-05 5:13 PM)

Warning OutOfcpu pod/arc-runner-set-productdev-b8rwm-runner-b5m9z-workflow Node didn't have enough resource: cpu, requested: 3000, used: 6560, capacity: 7820

What should happen instead:

There should be a timeout field, either on the runner set or in the container hooks pod template, that allows the workflow pod to wait for x minutes until it can be scheduled once another node comes up.
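Neither ARC nor the container hooks expose such a timeout today. As a sketch of what the request could look like, a hypothetical field on the hook pod template (`workflowPodScheduleTimeoutMinutes` is an invented name for illustration, not a real option) might read:

```yaml
# Hypothetical sketch only -- neither ARC nor the container hooks
# currently support this field; the name is invented for illustration.
apiVersion: v1
kind: PodTemplate
metadata:
  name: runner-pod-template
spec:
  # Proposed: wait up to this long for the workflow pod to be
  # scheduled (e.g. while the cluster autoscaler adds a node)
  # before failing the job.
  workflowPodScheduleTimeoutMinutes: 10
  containers:
  - name: $job
    resources:
      requests:
        cpu: "3000m"
```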

To Reproduce

Install the ARC controller + runner scale set 0.9.2.
Define ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE pointing at a pod template, and set containerMode: "kubernetes".
Define a pod template like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: runner-pod-template  # referenced by the pod-templates volume below
data:
  default.yml: |
    apiVersion: v1
    kind: PodTemplate
    metadata:
      name: runner-pod-template
    spec:
      containers:
      - name: $job
        resources:
          limits:
            cpu: "3000m"
          requests:
            cpu: "3000m"

Additional Context

template:
  spec:
    initContainers:
      - name: kube-init
        image: ghcr.io/actions/actions-runner:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            sudo chown -R 1001:123 /home/runner/_work
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
    securityContext:
      fsGroup: 123 ## needed to resolve permission issues with mounted volume. https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors#error-access-to-the-path-homerunner_work_tool-is-denied
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
        - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
          value: /home/runner/pod-templates/default.yml
        - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          value: "false"  ## To allow jobs without a job container to run, set ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER to false on your runner container. This instructs the runner to disable this check.
        volumeMounts:
        - name: pod-templates
          mountPath: /home/runner/pod-templates
          readOnly: true
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "managed-csi"
              resources:
                requests:
                  storage: ${local.volume_claim_size}
      - name: pod-templates
        configMap:
          name: "runner-pod-template"

containerMode:
  type: "kubernetes"  ## type can be set to dind or kubernetes
  ## the following is required when containerMode.type=kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    # For local testing, use https://github.com/openebs/dynamic-localpv-provisioner/blob/develop/docs/quickstart.md to provide dynamic provision volume with storageClassName: openebs-hostpath
    storageClassName: "managed-csi"
    resources:
      requests:
        storage: 50Gi

