Runner Registration Being Deleted In the Middle of Running Jobs #3748

Open
wagenet opened this issue Sep 16, 2024 · 8 comments
Labels
bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)

Comments

wagenet commented Sep 16, 2024

Controller Version

0.9.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

This happens consistently on our setup but I have no idea how to reproduce elsewhere.

Describe the bug

Jobs often get unexpectedly canceled in the middle of a run. The job logs will show a line like Error: Process completed with exit code 1., often preceded by context canceled. The Workflow Summary Annotations will also contain:

The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

In the runner logs we see:

Failed to create a session. The runner registration has been deleted from the server, please re-configure. Runner registrations are automatically deleted for runners that have not connected to the service recently.

It is very surprising that this happens while jobs are actively running.
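
For reference, that message comes from the runner container's logs. A rough sketch of how we pull them (the pod name is a placeholder; it assumes the runner pods live in the gh-arc-runner namespace used elsewhere in the config below, and the container is named "runner" as in our template):

  # List the runner pods, then fetch logs from the "runner" container of the affected pod.
  kubectl get pods -n gh-arc-runner
  kubectl logs -n gh-arc-runner <runner-pod-name> -c runner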

Describe the expected behavior

I would expect that runners not lose their registration while actively running jobs.

Additional Context

ghArcRunners:
  enabled: true
  appKeyExternalSecret:
    enabled: true
    secretStoreRef:
      kind: ClusterSecretStore
      name: aws-secrets-manager
    remoteRef:
      - key: eks/build/gh_runner
        property: github_app_id
      - key: eks/build/gh_runner
        property: github_app_installation_id
      - key: eks/build/gh_runner
        property: github_app_private_key
        base64decode: true
  dockerhubSecret:
    secretStoreRef:
      kind: ClusterSecretStore
      name: aws-secrets-manager
    dockerhubSecret:
      remoteRef:
        key: eks/build/dockerhub
        property: DOCKER_CONFIG_SECRET

api-2xlarge-runner-scale-set:
  enabled: true
  githubConfigUrl: "https://github.com/soxhub"
  githubConfigSecret: "gh-arc-runner-appkey"
  maxRunners: 200
  minRunners: 1
  runnerGroup: "api-2xlarge-runner-scale-set-group"
  runnerScaleSetName: "api-2xlarge-runner-scale-set-group"
  listenerTemplate:
    metadata:
      annotations:
        k8s.grafana.com/scrape: "true"
        k8s.grafana.com/job: "api-2xlarge-runner-scale-set-group"
    spec:
      containers:
      - name: listener
  template:
    metadata:
      labels:
        runner-scale-set-group: "api-2xlarge-runner-scale-set-group"
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      securityContext:
        fsGroup: 123
      initContainers:
      - name: init-dind-externals
        image: [SNIP].ecr.us-west-2.amazonaws.com/gh-runner-api:sha-e007c87
        command:
          ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
      containers:
      - name: runner
        image: [SNIP].ecr.us-west-2.amazonaws.com/gh-runner-api:sha-e007c87
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///run/docker/docker.sock
        resources:
          limits:
            memory: "64Gi"
          requests:
            memory: "16Gi"
            cpu: "8.0"
        volumeMounts:
          - mountPath: /home/runner/_work
            name: work
          - mountPath: /var/lib/docker
            name: var-lib-docker
          - name: dind-sock
            mountPath: /run/docker
            readOnly: true
      - name: docker
        image: [SNIP].ecr.us-west-2.amazonaws.com/dkr-hub/library/docker:dind
        args:
          - dockerd
          - --host=unix:///run/docker/docker.sock
          - --group=$(DOCKER_GROUP_GID)
        env:
          - name: DOCKER_GROUP_GID
            value: "123"
        securityContext:
            privileged: true
        volumeMounts:
          - mountPath: /home/runner/_work
            name: work
          - mountPath: /var/lib/docker
            name: var-lib-docker
          - name: dind-sock
            mountPath: /run/docker
          - name: dind-externals
            mountPath: /home/runner/externals
      volumes:
        - name: work
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 30Gi
        - name: var-lib-docker
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 30Gi
        - name: dind-sock
          emptyDir: {}
        - name: dind-externals
          emptyDir: {}
      nodeSelector:
        auditboard.com/nodegroup: 2xlarge-general
  controllerServiceAccount:
    namespace: gh-arc-runner
    name: gha-runner-scale-set-controller

gha-runner-scale-set-controller:
  enabled: true
  labels: {}

  metrics:
    serviceMonitor:
      enable: true

  replicaCount: 2

  image:
    repository: "ghcr.io/actions/gha-runner-scale-set-controller"
    pullPolicy: IfNotPresent
    tag: ""

  imagePullSecrets: []
  nameOverride: ""
  fullnameOverride: ""

  env:

  serviceAccount:
    create: true
    annotations: {}
    name: "gha-runner-scale-set-controller"

  podAnnotations:
    k8s.grafana.com/scrape: "true"

  podLabels: {}

  podSecurityContext: {}

  securityContext: {}

  resources: {}

  nodeSelector: {}

  tolerations: []

  affinity: {}

  volumes: []
  volumeMounts: []

  priorityClassName: ""

  metrics:
    controllerManagerAddr: ":8080"
    listenerAddr: ":8080"
    listenerEndpoint: "/metrics"

  flags:
    logLevel: "debug"
    logFormat: "text"

    updateStrategy: "immediate"

Controller Logs

https://gist.github.com/wagenet/ccae8e8a164e53587f978ccc53477772

Runner Pod Logs

https://gist.github.com/wagenet/65160702c38aada91cead50ece02c01a
wagenet added the bug, gha-runner-scale-set, and needs triage labels on Sep 16, 2024
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.


aiell0 commented Sep 17, 2024

@wagenet I was also running into this problem, and going back to version 0.9.0 fixed it for me.

Author

wagenet commented Sep 17, 2024

@aiell0 are you suggesting downgrading?


aiell0 commented Sep 17, 2024

@aiell0 are you suggesting downgrading?

Yes, that's what I had to do unfortunately.
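
For what it's worth, the downgrade is just a matter of pinning the chart version in Helm. A rough sketch, assuming the OCI chart paths from the ARC install docs; the release names, namespace, and values file are placeholders:

  # Pin the controller chart to 0.9.0.
  helm upgrade --install arc \
    --namespace gh-arc-runner \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
    --version 0.9.0

  # Pin the runner scale set chart to the same version, with your own values file.
  helm upgrade --install arc-runner-set \
    --namespace gh-arc-runner \
    -f values.yaml \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
    --version 0.9.0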

@Karandash8

I have the same problem running in kubernetes mode. It happens on both 0.9.3 and 0.9.0. Container jobs randomly terminate without any obvious reason.

@thomaschaplin

Any update on this? I'm seeing the same issue but cannot replicate on demand.
Running v0.9.3

@ali-kafel

Running into this issue as well on v0.9.3

@ewilkins-csi

I was running into this as well and it ended up being a node pressure issue. We were consuming all of the available node disk space and so k8s was evicting the runner.
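
In case it helps others rule this out, a quick sketch of the checks we used (the node name is a placeholder):

  # Look for DiskPressure in the conditions of the node hosting the runner pod.
  kubectl describe node <node-name> | grep -A 8 "Conditions:"

  # Look for recent evictions across namespaces.
  kubectl get events --all-namespaces --field-selector reason=Evicted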
