
[occm] Multi region openstack cluster #2595

Open
wants to merge 1 commit into
base: master

Conversation

sergelogvinov
Contributor

@sergelogvinov sergelogvinov commented May 14, 2024

What this PR does / why we need it:

OpenStack CCM multi-region support for clouds that share a single Identity provider.

Which issue this PR fixes (if applicable):
fixes #1924

Special notes for reviewers:

CCM config changes:

[Global]
auth-url=https://auth.openstack.example.com/v3/
region=REGION1
# new param 'regions' can be specified multiple times
regions=REGION1
regions=REGION2
regions=REGION3

Optionally, this can be set in cloud.conf:

clouds:
  kubernetes:
    auth:
      auth_url: https://auth.openstack.example.com/v3
    region_name: "REGION1"
    regions:
      - REGION1
      - REGION2
      - REGION3

During the initialization process, OCCM checks for the existence of providerID. If providerID does not exist, it defaults to using node.name, as it did previously. Additionally, if the node has the label topology.kubernetes.io/region, OCCM will prioritize using this region as the first one to check. This approach ensures that in the event of a region outage, OCCM can continue to function.
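
For illustration, a minimal sketch of that region-preference ordering (hypothetical helper name, not this PR's actual code): the region from the node label, if present and configured, is searched first, followed by the remaining configured regions.

    // Hypothetical sketch, not this PR's code: order the configured regions so
    // the region from the topology.kubernetes.io/region label (if present and
    // known) is searched first, and the remaining regions afterwards.
    package sketch

    import "slices"

    func searchOrder(configured []string, preferred string) []string {
        ordered := make([]string, 0, len(configured))
        if preferred != "" && slices.Contains(configured, preferred) {
            ordered = append(ordered, preferred)
        }
        for _, r := range configured {
            if r != preferred {
                ordered = append(ordered, r)
            }
        }
        return ordered
    }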

In addition, we can assist CCM in locating the node by providing kubelet parameters:

  • --provider-id=openstack:///$InstanceID - the InstanceID is available from the metadata service
  • --provider-id=openstack://$REGION/$InstanceID - if you can determine the region (by default the metadata server does not expose this information); a parsing sketch follows after this list
  • --node-labels=topology.kubernetes.io/region=$REGION - sets the preferred region as a node label; OCCM will then search for the node in this region first
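
For illustration, a hedged sketch of how the two provider-id formats above could be split into an instance ID and an optional region (the helper name here is hypothetical; the PR's real parser is instanceIDFromProviderID and may differ in detail):

    // Hypothetical sketch, not this PR's code: split "openstack:///<id>" or
    // "openstack://<region>/<id>" into (instanceID, region). An empty region
    // means "region unknown, search all configured regions".
    package sketch

    import (
        "fmt"
        "strings"
    )

    func splitProviderID(providerID string) (instanceID, region string, err error) {
        rest, ok := strings.CutPrefix(providerID, "openstack://")
        if !ok {
            return "", "", fmt.Errorf("unexpected providerID format: %q", providerID)
        }
        if i := strings.Index(rest, "/"); i >= 0 {
            // "openstack:///<id>" yields an empty region; "openstack://<region>/<id>" keeps it.
            return rest[i+1:], rest[:i], nil
        }
        return rest, "", nil
    }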

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label May 14, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kayrus for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 14, 2024
@k8s-ci-robot
Contributor

Hi @sergelogvinov. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 14, 2024
@sergelogvinov
Contributor Author

@mdbooth can you take a look at this PR?
I probably need to add more configuration checks.

Thanks.

Contributor

@mdbooth mdbooth left a comment


I haven't taken a deep look into this, but I much prefer this in principle: it only tells cloud-provider things we know to be true. This makes me much more confident that this will continue to work correctly as cloud-provider evolves.

Would you still run multiple CCMs, or switch to a single active CCM?

Comment on lines +60 to +75
	for _, region := range os.regions {
		opt := os.epOpts
		opt.Region = region

		compute[region], err = client.NewComputeV2(os.provider, opt)
		if err != nil {
			klog.Errorf("unable to access compute v2 API : %v", err)
			return nil, false
		}

		network[region], err = client.NewNetworkV2(os.provider, opt)
		if err != nil {
			klog.Errorf("unable to access network v2 API : %v", err)
			return nil, false
		}
Contributor


@pierreprinetti how much work is performed when initialising a new service client? Is it local-only, or do we have to go back to keystone?

I might be inclined to initialise this lazily anyway, tbh.

Contributor


Similar thought: maybe defer initializing them until there's real usage?

Contributor Author


I've had an issue with lazy initialization in Proxmox: a region might not exist, yet during the OCCM rollout it starts without errors, so the Kubernetes administrator will think that all configuration is correct.

So we can check all regions here and crash if needed. What do you think?

Member


Apologies for the late response. Building a ProviderClient requires a Keystone roundtrip; building ServiceClients is cheap.
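
For illustration, a hedged sketch of that split using upstream gophercloud (v1-style API; the function names are from the gophercloud/openstack package, not from this PR): the Keystone round trip happens once when the ProviderClient authenticates, and per-region ServiceClients are then derived locally from the cached token and service catalog.

    // Hedged sketch using gophercloud v1-style APIs, not this PR's code.
    package sketch

    import (
        "fmt"

        "github.com/gophercloud/gophercloud"
        "github.com/gophercloud/gophercloud/openstack"
    )

    func buildComputeClients(regions []string) (map[string]*gophercloud.ServiceClient, error) {
        opts, err := openstack.AuthOptionsFromEnv()
        if err != nil {
            return nil, err
        }

        // One Keystone round trip: authenticate the ProviderClient.
        provider, err := openstack.AuthenticatedClient(opts)
        if err != nil {
            return nil, err
        }

        // Cheap, local-only: derive a per-region ServiceClient from the catalog.
        compute := make(map[string]*gophercloud.ServiceClient, len(regions))
        for _, region := range regions {
            compute[region], err = openstack.NewComputeV2(provider, gophercloud.EndpointOpts{Region: region})
            if err != nil {
                return nil, fmt.Errorf("compute v2 client for region %s: %w", region, err)
            }
        }
        return compute, nil
    }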

Comment on lines 178 to 188
if node.Spec.ProviderID == "" {
return i.getInstanceByName(node)
}

instanceID, instanceRegion, err := instanceIDFromProviderID(node.Spec.ProviderID)
if err != nil {
return nil, err
return nil, "", err
}

if instanceRegion == "" {
return i.getInstanceByID(instanceID, node.Labels[v1.LabelTopologyRegion])
Contributor


I'd probably be a bit more explicit with where we're looking for stuff here. IIUC there are 2 possible places we can get a specific region from:

  • providerID
  • LabelTopologyRegion

Both may be unset because the node has not yet been adopted by the node-controller.

providerID may not contain a region either because it was set before we became multi-region, or because it was set by kubelet without a region and it's immutable.

But the end result is that either we know the region or we don't. If we know the region we should look only in that region. If we don't know the region we should look everywhere.

How about logic something like:

instanceID, instanceRegion, err := instanceIDFromProviderID(node.Spec.ProviderID)
..err omitted...

if instanceRegion == "" {
  instanceRegion = node.Labels[v1.LabelTopologyRegion]
}

var searchRegions []string
if instanceRegion != "" {
  if !slices.Contains(i.regions, instanceRegion) {
    return ...bad region error...
  }
  searchRegions = []string{instanceRegion}
} else {
  searchRegions = ..all the regions, preferred first...
}

for _, region := range searchRegions {
  mc := ...
  if instanceID != "" {
    getInstanceByID()
  } else {
    getInstanceByName()
  }
  mc.ObserveRequest()
}

Contributor Author


Thank you, this is a very good idea.
I've changed the implementation. One thought, though - I cannot trust LabelTopologyRegion: something could change it, and node-lifecycle would then remove the node (for instance on a reboot/upgrade event)...

So I can use LabelTopologyRegion only as the preferred region, and check this region first.
Thanks.

@MatthieuFin
Contributor

Hi @sergelogvinov
I propose an implementation of multi-cloud support for cinder-csi-plugin, which offers support for multiple OpenStack clusters, not only multiple regions. I haven't taken a look at the OCCM implementation yet, but is it possible to adapt it to support multiple cloud definitions instead of only multiple regions?

@sergelogvinov
Contributor Author


Thank you for this PR, it is very interesting. Can we have a call/chat in slack #provider-openstack (Serge Logvinov)?

@jichenjc
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 15, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 2, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 7, 2024
@sergelogvinov
Contributor Author

I've rebased the PR. All tests passed, and I've tested manually too.

Can you take a look please, @jichenjc @mdbooth?
It would be great to merge this change into the upcoming release...

Thanks.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 4, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 18, 2024
@sergelogvinov
Contributor Author

Is there anything else we can do here? @jichenjc @mdbooth @kayrus

We had a conversation about how we should initialize the OpenStack clients:

	for _, region := range os.regions {
		opt := os.epOpts
		opt.Region = region

		compute[region], err = client.NewComputeV2(os.provider, opt)
		if err != nil {
			klog.Errorf("unable to access compute v2 API : %v", err)
			return nil, false
		}

		network[region], err = client.NewNetworkV2(os.provider, opt)
		if err != nil {
			klog.Errorf("unable to access network v2 API : %v", err)
			return nil, false
		}

It seems to be a similar process to the one we followed in cinder-csi-plugin.
I believe @MatthieuFin and I can introduce multi-OpenStack authentication support after this PR.

    [Global]
    auth-url="https://auth.cloud.openstackcluster.region-default.local/v3"
    username="region-default-username"
    password="region-default-password"
    region="default"
    tenant-id="region-default-tenant-id"
    tenant-name="region-default-tenant-name"
    domain-name="Default"
    
    [Global "region-one"]
    auth-url="https://auth.cloud.openstackcluster.region-one.local/v3"
    username="region-one-username"
    password="region-one-password"
    region="one"
    tenant-id="region-one-tenant-id"
    tenant-name="region-one-tenant-name"
    domain-name="Default"

Thanks.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 22, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
release-note-none (Denotes a PR that doesn't merit a release note.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[occm] Multi region support
6 participants