Runner Registration Being Deleted In the Middle of Running Jobs #3748

Open
wagenet opened this issue Sep 16, 2024 · 8 comments
Labels
bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)

Comments

wagenet commented Sep 16, 2024

Controller Version

0.9.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

This happens consistently on our setup but I have no idea how to reproduce elsewhere.

Describe the bug

Jobs often get unexpectedly canceled in the middle of a run. The job logs will show a line like Error: Process completed with exit code 1., often preceded by context canceled. The Workflow Summary Annotations will also contain:

The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

In the runner logs we see:

Failed to create a session. The runner registration has been deleted from the server, please re-configure. Runner registrations are automatically deleted for runners that have not connected to the service recently.

It is very surprising that this happens while jobs are actively running.
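
For reference, that message comes from the runner container's logs. A rough sketch of how we pull them (the pod name is a placeholder; it assumes the runner pods live in the gh-arc-runner namespace used elsewhere in the config below, and the container is named "runner" as in our template):

  # List the runner pods, then fetch logs from the "runner" container of the affected pod.
  kubectl get pods -n gh-arc-runner
  kubectl logs -n gh-arc-runner <runner-pod-name> -c runner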

Describe the expected behavior

I would expect that runners not lose their registration while actively running jobs.

Additional Context

ghArcRunners:
  enabled: true
  appKeyExternalSecret:
    enabled: true
    secretStoreRef:
      kind: ClusterSecretStore
      name: aws-secrets-manager
    remoteRef:
      - key: eks/build/gh_runner
        property: github_app_id
      - key: eks/build/gh_runner
        property: github_app_installation_id
      - key: eks/build/gh_runner
        property: github_app_private_key
        base64decode: true
  dockerhubSecret:
    secretStoreRef:
      kind: ClusterSecretStore
      name: aws-secrets-manager
    dockerhubSecret:
      remoteRef:
        key: eks/build/dockerhub
        property: DOCKER_CONFIG_SECRET

api-2xlarge-runner-scale-set:
  enabled: true
  githubConfigUrl: "https://github.com/soxhub"
  githubConfigSecret: "gh-arc-runner-appkey"
  maxRunners: 200
  minRunners: 1
  runnerGroup: "api-2xlarge-runner-scale-set-group"
  runnerScaleSetName: "api-2xlarge-runner-scale-set-group"
  listenerTemplate:
    metadata:
      annotations:
        k8s.grafana.com/scrape: "true"
        k8s.grafana.com/job: "api-2xlarge-runner-scale-set-group"
    spec:
      containers:
      - name: listener
  template:
    metadata:
      labels:
        runner-scale-set-group: "api-2xlarge-runner-scale-set-group"
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      securityContext:
        fsGroup: 123
      initContainers:
      - name: init-dind-externals
        image: [SNIP].ecr.us-west-2.amazonaws.com/gh-runner-api:sha-e007c87
        command:
          ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
      containers:
      - name: runner
        image: [SNIP].ecr.us-west-2.amazonaws.com/gh-runner-api:sha-e007c87
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///run/docker/docker.sock
        resources:
          limits:
            memory: "64Gi"
          requests:
            memory: "16Gi"
            cpu: "8.0"
        volumeMounts:
          - mountPath: /home/runner/_work
            name: work
          - mountPath: /var/lib/docker
            name: var-lib-docker
          - name: dind-sock
            mountPath: /run/docker
            readOnly: true
      - name: docker
        image: [SNIP].ecr.us-west-2.amazonaws.com/dkr-hub/library/docker:dind
        args:
          - dockerd
          - --host=unix:///run/docker/docker.sock
          - --group=$(DOCKER_GROUP_GID)
        env:
          - name: DOCKER_GROUP_GID
            value: "123"
        securityContext:
            privileged: true
        volumeMounts:
          - mountPath: /home/runner/_work
            name: work
          - mountPath: /var/lib/docker
            name: var-lib-docker
          - name: dind-sock
            mountPath: /run/docker
          - name: dind-externals
            mountPath: /home/runner/externals
      volumes:
        - name: work
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 30Gi
        - name: var-lib-docker
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 30Gi
        - name: dind-sock
          emptyDir: {}
        - name: dind-externals
          emptyDir: {}
      nodeSelector:
        auditboard.com/nodegroup: 2xlarge-general
  controllerServiceAccount:
    namespace: gh-arc-runner
    name: gha-runner-scale-set-controller

gha-runner-scale-set-controller:
  enabled: true
  labels: {}

  metrics:
    serviceMonitor:
      enable: true

  replicaCount: 2

  image:
    repository: "ghcr.io/actions/gha-runner-scale-set-controller"
    pullPolicy: IfNotPresent
    tag: ""

  imagePullSecrets: []
  nameOverride: ""
  fullnameOverride: ""

  env:

  serviceAccount:
    create: true
    annotations: {}
    name: "gha-runner-scale-set-controller"

  podAnnotations:
    k8s.grafana.com/scrape: "true"

  podLabels: {}

  podSecurityContext: {}

  securityContext: {}

  resources: {}

  nodeSelector: {}

  tolerations: []

  affinity: {}

  volumes: []
  volumeMounts: []

  priorityClassName: ""

  metrics:
    controllerManagerAddr: ":8080"
    listenerAddr: ":8080"
    listenerEndpoint: "/metrics"

  flags:
    logLevel: "debug"
    logFormat: "text"

    updateStrategy: "immediate"

Controller Logs

https://gist.github.com/wagenet/ccae8e8a164e53587f978ccc53477772

Runner Pod Logs

https://gist.github.com/wagenet/65160702c38aada91cead50ece02c01a
wagenet added the bug, gha-runner-scale-set, and needs triage labels on Sep 16, 2024
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.


aiell0 commented Sep 17, 2024

@wagenet I was also running into this problem, and going back to version 0.9.0 fixed it for me.

Author

wagenet commented Sep 17, 2024

@aiell0 are you suggesting downgrading?


aiell0 commented Sep 17, 2024

@aiell0 are you suggesting downgrading?

Yes, that's what I had to do unfortunately.
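
For what it's worth, the downgrade is just a matter of pinning the chart version in Helm. A rough sketch, assuming the OCI chart paths from the ARC install docs; the release names, namespace, and values file are placeholders:

  # Pin the controller chart to 0.9.0.
  helm upgrade --install arc \
    --namespace gh-arc-runner \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
    --version 0.9.0

  # Pin the runner scale set chart to the same version, with your own values file.
  helm upgrade --install arc-runner-set \
    --namespace gh-arc-runner \
    -f values.yaml \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
    --version 0.9.0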

@Karandash8

I have the same problem running in kubernetes mode. It happens on both 0.9.3 and 0.9.0. Container jobs randomly terminate without any obvious reason.

@thomaschaplin

Any update on this? I'm seeing the same issue but cannot replicate on demand.
Running v0.9.3

@ali-kafel

Running into this issue as well on v0.9.3

@ewilkins-csi

I was running into this as well and it ended up being a node pressure issue. We were consuming all of the available node disk space and so k8s was evicting the runner.
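
In case it helps others rule this out, a quick sketch of the checks we used (the node name is a placeholder):

  # Look for DiskPressure in the conditions of the node hosting the runner pod.
  kubectl describe node <node-name> | grep -A 8 "Conditions:"

  # Look for recent evictions across namespaces.
  kubectl get events --all-namespaces --field-selector reason=Evicted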
