Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CURATOR-688. SharedCount will be never updated successful when version of ZNode is overflow. #478

Merged
merged 4 commits into from
Jun 4, 2024

Conversation

Hexiaoqiao
Copy link
Contributor

@Hexiaoqiao Hexiaoqiao commented Aug 31, 2023

I met one issue which will never update znode value successfully when integer overflow (-2147483648) of znode data version using curator to invoke SharedCount#trySetCount(VersionedValue, int).
After dig the limitation logic and found that here could be the root cause.

https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/shared/SharedValue.java#L196-L209

My environment is, curator version: 2.10.0, zookeeper version: 3.4.6

Ref - CURATOR-688

@Hexiaoqiao
Copy link
Contributor Author

Sorry without new unit test here. Because I want to init one znode with version Integer.MAX_VALUE but no interface found. It will be helpful if someone give some guide to cover this case.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this change are you able to manually reproduce the problem ?

@@ -196,7 +196,7 @@ public boolean trySetValue(VersionedValue<byte[]> previous, byte[] newValue) thr
private void updateValue(int version, byte[] bytes) {
while (true) {
VersionedValue<byte[]> current = currentValue.get();
if (current.getVersion() >= version) {
if (current.getVersion() >= version && version != Integer.MIN_VALUE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens after we reach this new condition ?
is the system stuck ?

also, as "version" is a constant here, could early exit ? in the beginning of the method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eolivelli , we try to build tmp zookeeper version and create zonde with version Integer.MAX_VALUE force and this case can reappear when trySetCount this znode. When meet this issue, SharedCount#trySetCount(VersionedValue<java.lang.Integer>, int) will always return false, then application will be stuck.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addendum: I still try to dig if it meet wrong version before updateValue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that VersionedValue.version are Stat.version from ZooKeeper. Then it is "known" to overflow in finite time in possible large setData frequency. Given default visibility of VersionedValue's visibility, I think we probably can enhance this a bit with extra zxid. But the crucial problem comes from ZooKeeper, it only do atomic check against Stat.version which is a 32-bit integer and overflow finally for long running frequently modified data. And we will finally run into corner case checkAndIncVersion exposed, -1 is not an atomic condition which means we could write wrong node data in case of wraparound and contention.

I really hope a check_zxid style setData. I am always fearing of setData with version from old incarnation.

There is a similar case in https://lists.apache.org/thread/4o3rl49rdj5y0134df922zgc8clyt86s, which overflow Stat.cversion. I believed ZooKeeper was not designed and suitable for certain usages, performance degradation is desireable in wrong usages. But these overflow more sounds like bugs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really hope a check_zxid style setData.

Totally +1. It should push zookeeper to changes, which is another thing. If need we should file another thread to discuss.

Addendum, here I didn't traverse cversion and aclversion if also have the same issue at curator side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But when current.getVersion() reach -1, the next trySetCount is indeterminate.

Sorry, I don't get this meaning, the server side indead to check if version is -1. But if both version and currentVersion are -1, it will increase as expect. Do you mean that some other logic I missed at curator side? Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is checkAndIncVersion in ZooKeeper side. When the expectedVersion is -1, it means setData blindly.

In Curator side, that is when getVersionedValue reports -1, the next trySetValue will ship -1 as expectedVersion to server which is apparently not what we want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my bad. We both say the same code segment, but I misunderstand it. I will add some javadoc for this changes moment later. Thanks again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One corner case when review code back, if the version overflow at server side and it is negative value such as -100 now, then restart application and currentValue will be initialized at curator side, and the version number will be set to UNINITIALIZED_VERSION which is -1 now. Then SharedCount will never be updated because current.getVersion() >= version && version != Integer.MIN_VALUE always true now.
cc @kezhuw @eolivelli What do you think about?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess -1 was chose as UNINITIALIZED_VERSION because of it is minimum in absent of overflow. So, now we can have Integer.MIN_VALUE as UNINITIALIZED_VERSION 😮‍💨 .

@Hexiaoqiao
Copy link
Contributor Author

Some conclusion,
a. zookeeper server side only add 1 for version when update znode. And it allows integer overflow.
b. curator side can't be compatible with integer overflow because it compares old version and new version here and not consider the negative version number.
https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/shared/SharedValue.java#L196-L209
c. The issue will be triggered from user interface both SharedCount#trySetCount and SharedCount#start(). When znode version meet Integer.MIN_VALUE SharedCount will be never update because condition current.getVersion() >= version always true.
d. While add condition current.getVersion() >= version && version != Integer.MIN_VALUE, this case could be resolved. And update znode as expected.

@Hexiaoqiao
Copy link
Contributor Author

One corner case didn't considered, and try to update this PR and trigger CI again.

@Hexiaoqiao
Copy link
Contributor Author

fix checkstyle and trigger ci again.

@Hexiaoqiao
Copy link
Contributor Author

Hi @eolivelli @kezhuw , anymore suggestions here?

@kezhuw kezhuw self-requested a review September 1, 2023 10:36
@@ -196,8 +196,12 @@ public boolean trySetValue(VersionedValue<byte[]> previous, byte[] newValue) thr
private void updateValue(int version, byte[] bytes) {
while (true) {
VersionedValue<byte[]> current = currentValue.get();
if (current.getVersion() >= version) {
// A newer version was concurrently set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure wether SharedValue was designed to work with multiple owners, but I saw there is a background watcher to update value and also there is no rule to forbid concurrent usages. So, I assume it should work well in case of concurrency.

Then, let me assume a situation:

  1. current.getVersion is Integer.MAX_VALUE.
  2. Thread1 call trySetValue and succeed to get overflowed version Integer.MIN_VALUE, but the call to updateValue is somewhat delayed.
  3. Thread2 (assume watcher, which runs in ZooKeeper thread if I am not wrong) call updateVersion with version Integer.MIN_VALUE + 1. According to the code, this will be ignored.
  4. Thread1 call updateValue to continue its task with version Integer.MIN_VALUE. It succeeds.
  5. That is all, assume no changes anymore. I know it may not realistic.

current stores dated version while the javadoc says "All clients watching the same path will have the up-to-date value (considering ZK's normal consistency guarantees)".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced in the overflow case either. @tisonkun @Hexiaoqiao

For the overflow case,

  • +1 to deprecate VersionedValue#getVersion so to warn clients about the "ordering assumption" if any about version.
  • +1 to a viable workaround if any and/or exception.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should refactor updateValue a bit to compare ordering using Stat.mzxid. This way we are not fearing this overflow issue. I am stupid in reviwing without a deep thought, sorry for that. So my finally points are:

  • Deprecate VersionedValue#getVersion to warn clients about "ordering assumptions" and "overflow behavior".
  • Refactor updateValue to order using Stat.mzxid.
  • Throw exception in case of -1 Stat.version in trySetValue. I am positive to ZOOKEEPER-4743.
  • Document somehow about "overflow" and exception case in trySetValue.

Besides above, should we expose a VersionedValue#getZxid for client usage ?

Any thoughts @tisonkun @eolivelli @Hexiaoqiao ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pushed new commits to go through above direction. Could you please take a look @Hexiaoqiao @tisonkun @eolivelli ?

@kezhuw
Copy link
Member

kezhuw commented Sep 2, 2023

@Hexiaoqiao Sorry for the delay. I am stuck about and also fearing this overflow things. There are not simple bugs, there are limitations. Hard to fix, only fragile workaround for best wish. I have some chaos thoughts for this issue:

  1. Curator guarantees no ordering-ness about VersionedValue.getVersion. That is good. Better to document this out of order issue or make it private in future.
  2. The promise about "up-to-date" should be guaranteed.
  3. Silent "never update" is not good which is the goal for us to fix here.

I am ok to a workaround but I am also think exception(e.g. let caller know they hit an implementation limitation) is a good fallback except for background watcher. Though, I am not positive to a good workaround. I think there are few choices when encountering this hard limitation.

  1. Push foward underlying implementation. In this case, a check_zxid version of setData in ZooKeeper.
  2. Rewrites something in caller side. In this case, either curator side or your application.
  3. Workaround possibly fragile.

All these suggestions happened in https://lists.apache.org/thread/4o3rl49rdj5y0134df922zgc8clyt86s. cc @li4wang

Besides above, did you find this in production ? What is your use case ? @Hexiaoqiao

@Hexiaoqiao
Copy link
Contributor Author

@kezhuw Thanks for your detailed comments. I totally agree that this issue is not very easy/simple to fix perfectly. But for my case, this improvement could fix it when try to reproduce.

Besides above, did you find this in production ? What is your use case ?

Of course YES. The corresponding code snippet as following shows.

https://github.com/apache/hadoop/blob/a6c2526c6c7ccfc4483f58695aedf0cb892b4c4d/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/token/delegation/ZKDelegationTokenSecretManager.java#L506-L516

I would like to give a brief explanation (which is dependent by Hadoop project).
A. For HDFS RBF architecture[1], all Routers share the same states which is managed by zookeeper, and Token is one of these states.
B. About token[2], one Router generate token using increment sequence number shared by all Routers and update corresponding znode value (/zkdtsm/ZKDTSMRoot/ZKDTSMSeqNumRoot), let's name it Z.
C. Where token number will be million/day or more, and the version of znode Z will overflow only years after create(maybe less than three years if 5 million/day tokens generated.)

At first, I want to fix it at Hadoop side, but it is not smoothy and could not fix the root cause.

For this PR, my first thought is fix at curator side first, then try to improve it at both zookeeper and curator side as the solution you mentioned above. Any thoughts? Thanks.

[1] https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs-rbf/HDFSRouterFederation.html
[2] http://hortonworks.com/wp-content/uploads/2011/08/adding_security_to_apache_hadoop.pdf

@kezhuw
Copy link
Member

kezhuw commented Sep 4, 2023

@Hexiaoqiao Thanks your sharing, that is important!

I found some potential problem in the usages, though I haven't take a deep in look. incrSharedCount uses batch which is what Ted Dunning suggested in above mail thread. That is good. But it is somewhat error-prone in implementation.

  1. The counter itself is a 32-bit integer, so it will overflow finally regard less of Stat.version.
  2. The default batchSize is small. I got two defaults. public static final int DEFAULT_SEQ_NUM_BATCH_SIZE = 10; and public static final int ZK_DTSM_TOKEN_SEQNUM_BATCH_SIZE_DEFAULT = 1;.

So, for your use case, I suggest:

  1. Rewrites sequence number as 64-bit integer. I am not sure what the consequence of this. But 32-bit in your case will overflow finally. It could overflowed already hiding by overflow of Stat.version.
  2. Uses a sensible default batch size and refuse small batch size. This is easy in your case, but you will get negative sequence number cause of above.

Stat.version is 32-bit, it is not good to do a 1 to 1 mapping to application sequence number.

@Hexiaoqiao
Copy link
Contributor Author

@kezhuw Great response. Some concerns here.

  • Rewrites sequence number as 64-bit integer. I am not sure what the consequence of this. But 32-bit in your case will overflow finally. It could overflowed already hiding by overflow of Stat.version.
  • Uses a sensible default batch size and refuse small batch size. This is easy in your case, but you will get negative sequence number cause of above.

Actually sequence number can be compatible with 32-bit integer overflow, which only should be distinguishable number without other conditions. IMO it is OK even if overflow, actually I try to test it on our test env and it works fine. So it is not hiding issue here.
Another side, I totally agree that set the default configuration ZK_DTSM_TOKEN_SEQNUM_BATCH_SIZE_DEFAULT to some bigger number (such as 10 or much more) could bypass this issue, however it does not solve root cause about Curator. Right?
Thanks again.

@kezhuw
Copy link
Member

kezhuw commented Sep 4, 2023

IMO it is OK even if overflow, actually I try to test it on our test env and it works fine. So it is not hiding issue here.

@Hexiaoqiao Glad to hear, that is good!

however it does not solve root cause about Curator. Right?

I am open to a workaround as long as it fit. For a workaround to work, I think we should guarantee:

  1. The promise about "up-to-date" should be guaranteed.
  2. The promise about "Changes the shared value only if its value has not changed since the version specified by newValue" should be hold. This means we have to throw exception if previous.getVersion() equals to -1 in call chain of trySetCount.

I feel hopeless about viable workaround, but not against one. Anyway please go ahead, I would be happy to hear good news.

@Hexiaoqiao
Copy link
Contributor Author

Thanks @kezhuw for your suggestions and sorry for the late response.

The promise about "up-to-date" should be guaranteed.
The promise about "Changes the shared value only if its value has not changed since the version specified by newValue" should be hold. This means we have to throw exception if previous.getVersion() equals to -1 in call chain of trySetCount.

I am not sure if I understood this point clearly. 'we have to throw exception if previous.getVersion() equals to -1 in call chain of trySetCount', Will it never update successfully when dataversion is back to -1 if we need to guarantee this rule?

@kezhuw
Copy link
Member

kezhuw commented Sep 13, 2023

'we have to throw exception if previous.getVersion() equals to -1 in call chain of trySetCount', Will it never update successfully when dataversion is back to -1 if we need to guarantee this rule?

Yes, and no. When previous.getVersion reach to -1, trySetCount has few choices:

  1. Exception to caller for limitation of the implementation, that is Stat.version overflowed and wraparound to -1 and setData with -1 version is a blind update.
  2. Keep silent and do a blind update.

The later behavior is a bug due to the contract what trySetCount try to express. As a library, I don't think Curator should do that for callers silently.

For the "no" part, callers can risk themselves by doing a blind update through setCount.

@Hexiaoqiao
Copy link
Contributor Author

The later behavior is a bug due to the contract what trySetCount try to express. As a library, I don't think Curator should do that for callers silently.

+1, Agree. So how about expose the choice to end user and offer some configuration items to set? IMO, some sensitive data should not be updated blindly, but some case we should offer solution to bypass this bug, such as Token for Hadoop as mentioned above, which can update smoothly even blindly. What do you think about?
TBH, I am not familiar with implement about Curator, so would you mind to list some demo code snippet about configuration if this solution could be approved.

@tisonkun
Copy link
Member

tisonkun commented Sep 15, 2023

TBH, I am not familiar with implement about Curator, so would you mind to list some demo code snippet about configuration if this solution could be approved.

Somehow you can copy the shared count implementation..It's not quite a lot of code.

Since this requirement is quite customized, I'm afraid that it's not suitable to hack into Curator. The root cause is integer overflow that is backed by ZK. (check_zxid may be complicated and ZK's zxid also suffers from integer overflow while that is a leader election issue; due to Jute's limitation, we cannot extend version from int to long or add a new field while keep compatibility.)

Also, I don't understand actually how this patch "fixes" the issue. You add more strict condition to exit but the original issue is hang?

@kezhuw
Copy link
Member

kezhuw commented Sep 15, 2023

So how about expose the choice to end user and offer some configuration items to set? IMO, some sensitive data should not be updated blindly, but some case we should offer solution to bypass this bug

I am -0 to this approach. Exception is the default, in anyway. Clients can bypass themselves easily on exception if they encounter or aware of this. I don't think it is worth for Curator to do that for this limitation(either ZooKeepr or Curator implementation from perspective) in case of awareness from clients. The important things for Curator here from my side are documentation and exception in code. Anything beyond that are probably overkill.

@tisonkun
Copy link
Member

I understand the condition now and agree that this patch should work.

While it can be better for ZK to jump from version -2 to 0 so that in an overflow loop we don't handle -1 hole, the restart and get versioned value = -1 remains and set data concurrently doesn't hurt as long as we read the updated data finally.

@tisonkun
Copy link
Member

reproduce the problem

@eolivelli The tricky point here is that ZK doesn't expose API to manually edit node version so you should change the data for Int.MAX times which can be time-consuming..

@kezhuw
Copy link
Member

kezhuw commented Sep 15, 2023

While it can be better for ZK to jump from version -2 to 0 so that in an overflow loop we don't handle -1 hole, the restart and get versioned value = -1 remains and set data concurrently doesn't hurt as long as we read the updated data finally.

Good idea! I just saw ZOOKEEPER-4743.

@Hexiaoqiao
Copy link
Contributor Author

@tisonkun Glad to see you here.

While it can be better for ZK to jump from version -2 to 0 so that in an overflow loop we don't handle -1 hole, the restart and get versioned value = -1 remains and set data concurrently doesn't hurt as long as we read the updated data finally.

It will be more smooth while ZK to skip version -1, Strong +1. From my first thought, it would be the next step after we fix at Curator side. Now it is great to see that other guys already try to push this forward.
For ZOOKEEPER-4743, I just suggest that we should consider to solve dataVersion/aclVersion/cversion 32-integer overflow at once.
Back to this PR, what should we do before ZOOKEEPRER-4743 check in, just wait ZOOKEEPER-4743 to be ready or other ways? Thanks.

@kezhuw
Copy link
Member

kezhuw commented Sep 19, 2023

what should we do before ZOOKEEPRER-4743 check in, just wait ZOOKEEPER-4743 to be ready or other ways?

I suggest we use Stat.mzxid to guard ordering in updateValue. (I posted in another thread https://github.com/apache/curator/pull/478/files#r1327946265).

The current approach does not sound correct to me(https://github.com/apache/curator/pull/478/files#r1313839760). And I think it is error-prone.

When Stat.version is -1, exception is enough for us to go. ZOOKEEPRER-4743 is another story, Curator should work irrespective of ZooKeeper version.

@tisonkun
Copy link
Member

Back to this PR, what should we do before ZOOKEEPRER-4743 check in, just wait ZOOKEEPER-4743 to be ready or other ways? Thanks.

They are irrelevant things.

For this patch, I may give another look as well as the comments above. @Hexiaoqiao you can try to address @kezhuw's comments/

@kezhuw
Copy link
Member

kezhuw commented Sep 27, 2023

Hi @Hexiaoqiao, do you still work on this ? Or should I take over this ? I plan to resort to Stat.mzxid in updateValue.

@Hexiaoqiao
Copy link
Contributor Author

Sorry for the late response since I took a long vacation. Please feel free to take over it if interested.

@kezhuw
Copy link
Member

kezhuw commented Oct 12, 2023

Thank you @Hexiaoqiao ! I have pushed a commit to using Stat.mzxid. Would you mind to take a look ?

@tisonkun
Copy link
Member

tisonkun commented Jun 3, 2024

@Hexiaoqiao @kezhuw @eolivelli I'll review this patch in this week. I'd like to try include this patch in 5.7.0.

while (true) {
VersionedValue<byte[]> current = currentValue.get();
if (current.getVersion() >= version) {
// A newer version was concurrently set.
if (current.getZxid() >= zxid) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if only one client here and it goes through the overflow bound?

Said current.getZxid() == MAX_VALUE and zxid == MIN_VALUE, the update action will be skipped and the value will never updated.

Or we have different assumption on zxid?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is paranoid -:).

ZooKeeper guarantees a total order of messages, and it also guarantees a total order of proposals. ZooKeeper exposes the total ordering using a ZooKeeper transaction id (zxid). All proposals will be stamped with a zxid when it is proposed and exactly reflects the total ordering. -- https://zookeeper.apache.org/doc/r3.9.0/zookeeperInternals.html

Every change to the ZooKeeper state receives a stamp in the form of a zxid (ZooKeeper Transaction Id). This exposes the total ordering of all changes to ZooKeeper. Each change will have a unique zxid and if zxid1 is smaller than zxid2 then zxid1 happened before zxid2. -- https://zookeeper.apache.org/doc/r3.9.0/zookeeperProgrammers.html

In case of above situation, I believed that ZooKeeper is doomed to failure. The "never updated" should be negligible in case of the disaster.

@kezhuw kezhuw merged commit a9a8020 into apache:master Jun 4, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants