Allow force recreate of controller-manager pods through CSV upgrade options and be smarter about upgrade monitoring #3176
Labels
kind/feature
Categorizes issue or PR as related to a new feature.
Feature Request
Is your feature request related to a problem? Please describe.
When an updated CSV/operator manifest is created for a new operator version, we are sometimes faced with immutable fields blocking operator upgrades for users.
A recent example from the grafana-operator: we updated the operator manifests to match upstream standards for annotations and tagging. This started causing upgrade issues: OLM tried to update an existing resource with immutable fields, so the rollout failed and blocked the upgrade for all users whose install plan was set to upgrade automatically to new versions.
As a result, the upgrade from 5.6.0 -> 5.6.1 (new annotations/tags) failed and got stuck in a crash loop. We tried to remediate this by pushing a hotfix, but due to some more weirdness with OLM experienced by @NissesSenap we weren't able to do so.
The hotfix would not have fixed the core issue anyway, which was that our operator was stuck mid-upgrade, trying to update to a version with changed immutable fields. Even our next versions (5.6.2, 5.6.3) didn't fix the issue once users were on the update path.
This then meant we had to ask users to force delete the controller-manager pod, in order to delete the version with the previous fields, and then allow OLM to create a new deployment with the new fields.
I guess there are two issues here that I'd like to get the maintainers' take on:
Should OLM be a bit smarter about monitoring operator upgrades, i.e. backing off and setting an appropriate status after noticing a crash loop during the upgrade?
Should OLM allow the user to set an option to force-delete the previous controller-manager deployment to enable immutable fields to change between operator releases?
Describe the solution you'd like
An optional upgrade setting that force deletes a previous deployment prior to applying a new one, thus allowing a seamless upgrade for operators that might want to change immutable fields between minor versions without creating a new release channel.
A drawback here is that this only works for stateless or mostly stateless operators (grafana-operator is stateless), so it might not be desirable if the operator requires stateful upgrade logic. I still think the option should be there, but disabled by default.
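To make the proposed option a bit more concrete, here is a rough Python sketch of the decision it implies. Everything in it is hypothetical: `DeploymentSpec`, `Action`, `plan_deployment_upgrade` and the `force_recreate` flag are illustrative names, not OLM APIs, and it assumes the immutable field in question is the Deployment's label selector.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    UPDATE_IN_PLACE = "update-in-place"
    DELETE_AND_RECREATE = "delete-and-recreate"
    FAIL = "fail"


@dataclass
class DeploymentSpec:
    """Hypothetical, heavily simplified stand-in for a Deployment spec."""
    selector: dict  # assumed to be the immutable field that changed
    image: str      # mutable, can be updated in place


def plan_deployment_upgrade(old, new, force_recreate=False):
    """Decide how to move from the old deployment spec to the new one.

    force_recreate models the proposed opt-in setting and defaults to
    False, so the behaviour is unchanged unless it is enabled.
    """
    if old.selector == new.selector:
        # No immutable field changed, a normal update is enough.
        return Action.UPDATE_IN_PLACE
    if force_recreate:
        # Immutable field changed and the operator author opted in:
        # delete the old deployment first, then create the new one.
        return Action.DELETE_AND_RECREATE
    # Immutable field changed without opt-in: report a failure rather
    # than retrying an update that can never succeed.
    return Action.FAIL
```

With the flag disabled, a selector change surfaces as a clear failure instead of a stuck rollout; with it enabled, the old deployment is removed first, which is what we currently ask users to do by hand.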
Additionally, smarter monitoring of upgrades that can back off to a working version, so that operators don't get stuck in a crash loop while trying to upgrade to a new version.
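As a sketch of what "smarter monitoring" could mean, assuming OLM can observe the readiness and restart count of the new controller-manager pod over time: the names `UpgradeObservation` and `evaluate_upgrade` and the restart threshold of 3 are made up for illustration and are not part of OLM.

```python
from dataclasses import dataclass


@dataclass
class UpgradeObservation:
    """Hypothetical snapshot of the new operator pod during an upgrade."""
    ready: bool
    restart_count: int


def evaluate_upgrade(observations, max_restarts=3):
    """Return 'succeeded', 'rollback' or 'in-progress'.

    observations are in chronological order. The first one that is
    ready, or that reaches the restart threshold, decides the outcome.
    """
    for obs in observations:
        if obs.ready:
            return "succeeded"
        if obs.restart_count >= max_restarts:
            # Looks like a crash loop: stop retrying, set a status and
            # back off to the last working version.
            return "rollback"
    return "in-progress"
```

The point is only that a crash loop during an upgrade becomes an explicit outcome with its own status, rather than an upgrade that stays pending forever.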
If any of this is already resolved, or if this is just PEBKAC on our end, please let me know! :)
Thanks!