k8s: Terraform deployment for Azure clusters #18
base: main
Conversation
We also need to understand how we're doing access control in production - I expect there's a group or two we need to grant access to.
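If we do end up granting cluster access to a group or two, that could be expressed in the Terraform as well; this is only a minimal sketch, assuming an AAD group object ID variable and a cluster resource named `cluster`, neither of which is actually in this PR:

```hcl
# Sketch only: grant an AAD group user access to each regional cluster.
# The variable and the cluster resource name are assumptions, not part of this PR.
resource "azurerm_role_assignment" "cluster_users" {
  for_each             = azurerm_kubernetes_cluster.cluster

  scope                = each.value.id
  role_definition_name = "Azure Kubernetes Service Cluster User Role"
  principal_id         = var.admin_group_object_id
}
```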
This is a great idea which we definitely need. Currently all the clusters were created manually with the Azure web UI, and also created at different times so the exact VM type/sizes may be different between clusters.
The credentials for cmdline admin of Azure clusters are in Ansible (kernelci-builder2 repo) where we configure the az login setup and connect it up so that kubectl can manage jobs.
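As an aside, Terraform can also surface the kubeconfig for each cluster directly from the AKS resources, as an alternative to fetching credentials through az login; a minimal sketch, assuming the clusters are created with for_each over regions under a resource named `cluster` (not the actual names in this PR):

```hcl
# Sketch only: expose one admin kubeconfig per regional cluster, keyed by region,
# so whatever drives kubectl can consume it.  Resource name is an assumption.
output "kube_config" {
  value     = { for region, c in azurerm_kubernetes_cluster.cluster : region => c.kube_config_raw }
  sensitive = true
}
```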
k8s/azure/aks-cluster.tf (outdated diff)
```hcl
  name                  = "workers"
  kubernetes_cluster_id = each.value.id
  # FIXME: This is a very small node, what are we using?
```
For the "normal" builders, our current clusters are 8-core. Either Standard_D8s_vX
(seems we have some v3, some v5 for recently created clusters.)
For the "big" builders, they're 32-core. Standard_F32s_v2
Thanks. I'm wondering if we should either standardise on the 32 core instances for everything (and pack more jobs on there) or take a hit to the allmodconfig builds and standardise on 16 cores (though I think the pahole builds need the big machines so we probably need to keep 32 cores). We should also figure out if we need the big builders to be separate clusters or if we can just have 2 nodegroups on the same cluster - the latter seems better since it would give the scheduler more flexibility, and we can still use nodeSelectors on the jobs to force jobs onto one of the nodegroups.
This is one area where Karpenter makes life a whole lot easier than cluster-autoscaler.
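If we go the two-nodegroups-per-cluster route, a rough sketch of what the second pool and its label could look like; the label key, sizes and names here are illustrative, not what the PR currently contains:

```hcl
# Sketch: a second, larger pool in the same cluster for the resource-hungry builds.
# Label key and vm_size are assumptions for illustration only.
resource "azurerm_kubernetes_cluster_node_pool" "big" {
  for_each              = azurerm_kubernetes_cluster.cluster

  name                  = "big"
  kubernetes_cluster_id = each.value.id
  vm_size               = "Standard_F32s_v2"  # 32-core nodes for pahole/allmodconfig builds
  enable_auto_scaling   = true
  min_count             = 1
  max_count             = 10

  node_labels = { "kernelci.org/builder" = "big" }
}
```

The small pool would carry a corresponding label, and jobs that need the big machines would set a matching nodeSelector; everything else is left for the scheduler to place on either pool.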
This provides a Terraform configuration for deploying our Kubernetes clusters to Azure. We deploy an identical cluster to each of a list of regions, with one small node for admin purposes (due to a requirement to not use spot instances for the main node group) and two autoscaling groups, one with small 8 core nodes for most jobs and one with bigger nodes for the more resource intensive ones. This is different to our current scheme where each cluster has a single node group and we direct jobs in Jenkins. With this scheme we allow the Kubernetes scheduler to place jobs, or we can still direct them to specific node sizes using nodeSelector in the jobs and the labels that are assigned to the nodegroups. This is a more Kubernetes way of doing things and decouples further from Jenkins.

Signed-off-by: Mark Brown <[email protected]>
Just pushed an update which should have the cluster configuration usable (scaling from 1..10 nodes per nodegroup, that might need revisiting?) in what should be the same regions we currently use. This is a bit different to what we currently use, but as the commit log covers it is a more Kubernetes way of doing things so I've left it as it is.

For deployment someone would need to create the Azure storage container referenced in the config, or just comment out the use of the azurerm storage backend.

I've not done anything about authentication. It looks like that's done by having a fixed service principal configured which fetches the Kubernetes credentials from Azure as I suggest in the README; if that SP is the same one used to create the clusters this should hopefully be usable as-is, though ideally it'd be a separate role that Jenkins uses to connect to the clusters. I can't properly test as there's a bunch of quota limits on the Azure account I have which prevent me deploying.
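For reference, the storage backend in question is the standard azurerm backend block; a sketch with placeholder names (the real resource group, storage account and container would need to exist before terraform init, or the block commented out):

```hcl
# Sketch only: remote state in Azure blob storage.  All names below are
# placeholders, not the values referenced in this PR's config.
terraform {
  backend "azurerm" {
    resource_group_name  = "kernelci-tf-state"
    storage_account_name = "kernelcitfstate"
    container_name       = "tfstate"
    key                  = "azure-clusters.tfstate"
  }
}
```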
This provides a Terraform configuration for deploying our
Kubernetes clusters to Azure. We deploy an identical cluster to
each of a list of regions, with one small node for admin purposes
due to a requirement to not use spot instances for the main node
group, and an autoscaling node group with the actual
worker nodes.
This needs updates to reflect our actual cluster configurations
(which I don't currently know), and for the storage space for the
Terraform state.
Signed-off-by: Mark Brown [email protected]