Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise an event when a WorkloadPriorityClass.kueue.x-k8s.io is not found #3439

Open
3 tasks
akram opened this issue Nov 4, 2024 · 8 comments
Open
3 tasks
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@akram
Copy link

akram commented Nov 4, 2024

What would you like to be added:
When a job is created with a specified kueue.x-k8s.io/priority-class that does not exist ; then an event should be raised with a Warning level in the namespace of the job and with the job as the involvedObject.name

Why is this needed:
When such a job is created nothing happens and it is difficult for the user to understand what is wrong except by reading the kueue controller manager logs. As this may require admin privileges, user may just remain blocked while searching for information.

Completion requirements:

The keueue logs show the following error message:

{"level":"error","ts":"2024-11-04T18:56:42.686021656Z","caller":"controller/controller.go:329","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"low-priority-job-tg6tr","namespace":"keue-viz"},"namespace":"keue-viz","name":"low-priority-job-tg6tr","reconcileID":"1e033921-c262-401c-a462-28b74583ee51","error":"WorkloadPriorityClass.kueue.x-k8s.io \"low\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

The relevant part being:

error":"WorkloadPriorityClass.kueue.x-k8s.io \"low\" not found",

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@akram akram added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 4, 2024
@mimowo
Copy link
Contributor

mimowo commented Nov 5, 2024

+1,
I'm also wondering if we should consider deactivating such a workload as it will not get scheduled until some manual intervention. WDYT @tenzen-y ?

@mimowo
Copy link
Contributor

mimowo commented Nov 5, 2024

OTOH, maybe deactivating can be considered as a breaking change for some flows when the priority class is created after the workload or in parallel. So, I'm fine for just an event now.

@tenzen-y
Copy link
Member

tenzen-y commented Nov 5, 2024

OTOH, maybe deactivating can be considered as a breaking change for some flows when the priority class is created after the workload or in parallel. So, I'm fine for just an event now.

I think that we should implement the validation webhook to verify if the specified PriorityClass exists in the cluster.
That is the same specifications as the core API PriorityClass.

@mimowo
Copy link
Contributor

mimowo commented Nov 5, 2024

I think that we should implement the validation webhook to verify if the specified PriorityClass exists in the cluster.

Technically we could but IIUC this means validation per Job CRD which is a lot of code.

That is the same specifications as the core API PriorityClass.

I just tested it using batch/Job and referenced a non-existing priorityClassName, but the Pod creation failed (Kueue was not installed). Then an event is sent for the owner Job:

Warning  FailedCreate  4s (x5 over 19s)  job-controller  Error creating: pods "sample-job-a-qsvvf-" is forbidden: no PriorityClass with name non-existing was found

So, on vanilla k8s the Job creation isn't blocked, and we get an event on the Job object. I think reproducing this is reasonable.

@tenzen-y
Copy link
Member

tenzen-y commented Nov 5, 2024

So, on vanilla k8s the Job creation isn't blocked, and we get an event on the Job object. I think reproducing this is reasonable.

Yeah, I indicated the Pod creation:

cat <<EOF | k apply -f -
> apiVersion: v1
kind: Pod
metadata:
  name: high-priority-pod
  namespace: default
spec:
  priorityClassName: high-priority
  containers:
    - name: nginx
      image: nginx:latest
      ports:
        - containerPort: 80
> EOF
Error from server (Forbidden): error when creating "STDIN": pods "high-priority-pod" is forbidden: no PriorityClass with name high-priority was found

@mimowo
Copy link
Contributor

mimowo commented Nov 5, 2024

I see, but IIUC WorkloadPriorityClass does not translate to Pods, so we wouldn't be able to capture that in pod webhook.

@akram
Copy link
Author

akram commented Nov 5, 2024

An additional 2cts: The admission controller should not protect from a later deletion of WorkloadPriorityClass .
One can create the workload with the existing WorkloadPriorityClass and then delete the WorkloadPriorityClass impacting already existing Workloads.

In any case, I think that transparency improves user experience. Even if the Workload is disabled, that would be great to log the event; or to have information in the status.

@mimowo
Copy link
Contributor

mimowo commented Nov 5, 2024

Yes, we definitely want the event, the only thing is that if we deactivate then we already have a mechanism to propagate the event to the Job object, so the solution will be shorter by reusing this path, and the information will be reflected both in event and status. wdyt @tenzen-y ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

3 participants