CURATOR-688. SharedCount will be never updated successful when version of ZNode is overflow. #478

Hexiaoqiao · 2023-08-31T07:53:03Z

I met one issue which will never update znode value successfully when integer overflow (-2147483648) of znode data version using curator to invoke SharedCount#trySetCount(VersionedValue, int).
After dig the limitation logic and found that here could be the root cause.

https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/shared/SharedValue.java#L196-L209

My environment is, curator version: 2.10.0, zookeeper version: 3.4.6

Ref - CURATOR-688

Hexiaoqiao · 2023-08-31T07:56:02Z

Sorry without new unit test here. Because I want to init one znode with version Integer.MAX_VALUE but no interface found. It will be helpful if someone give some guide to cover this case.

eolivelli

with this change are you able to manually reproduce the problem ?

eolivelli · 2023-08-31T08:01:25Z

curator-recipes/src/main/java/org/apache/curator/framework/recipes/shared/SharedValue.java

@@ -196,7 +196,7 @@ public boolean trySetValue(VersionedValue<byte[]> previous, byte[] newValue) thr
    private void updateValue(int version, byte[] bytes) {
        while (true) {
            VersionedValue<byte[]> current = currentValue.get();
-            if (current.getVersion() >= version) {
+            if (current.getVersion() >= version && version != Integer.MIN_VALUE) {


what happens after we reach this new condition ?
is the system stuck ?

also, as "version" is a constant here, could early exit ? in the beginning of the method

Thanks @eolivelli , we try to build tmp zookeeper version and create zonde with version Integer.MAX_VALUE force and this case can reappear when trySetCount this znode. When meet this issue, SharedCount#trySetCount(VersionedValue<java.lang.Integer>, int) will always return false, then application will be stuck.

Addendum: I still try to dig if it meet wrong version before updateValue.

Seems that VersionedValue.version are Stat.version from ZooKeeper. Then it is "known" to overflow in finite time in possible large setData frequency. Given default visibility of VersionedValue's visibility, I think we probably can enhance this a bit with extra zxid. But the crucial problem comes from ZooKeeper, it only do atomic check against Stat.version which is a 32-bit integer and overflow finally for long running frequently modified data. And we will finally run into corner case checkAndIncVersion exposed, -1 is not an atomic condition which means we could write wrong node data in case of wraparound and contention.

I really hope a check_zxid style setData. I am always fearing of setData with version from old incarnation.

There is a similar case in https://lists.apache.org/thread/4o3rl49rdj5y0134df922zgc8clyt86s, which overflow Stat.cversion. I believed ZooKeeper was not designed and suitable for certain usages, performance degradation is desireable in wrong usages. But these overflow more sounds like bugs.

I really hope a check_zxid style setData.

Totally +1. It should push zookeeper to changes, which is another thing. If need we should file another thread to discuss.

Addendum, here I didn't traverse cversion and aclversion if also have the same issue at curator side.

But when current.getVersion() reach -1, the next trySetCount is indeterminate.

Sorry, I don't get this meaning, the server side indead to check if version is -1. But if both version and currentVersion are -1, it will increase as expect. Do you mean that some other logic I missed at curator side? Thanks.

It is checkAndIncVersion in ZooKeeper side. When the expectedVersion is -1, it means setData blindly.

In Curator side, that is when getVersionedValue reports -1, the next trySetValue will ship -1 as expectedVersion to server which is apparently not what we want.

Yes, my bad. We both say the same code segment, but I misunderstand it. I will add some javadoc for this changes moment later. Thanks again.

One corner case when review code back, if the version overflow at server side and it is negative value such as -100 now, then restart application and currentValue will be initialized at curator side, and the version number will be set to UNINITIALIZED_VERSION which is -1 now. Then SharedCount will never be updated because current.getVersion() >= version && version != Integer.MIN_VALUE always true now.
cc @kezhuw @eolivelli What do you think about?

I guess -1 was chose as UNINITIALIZED_VERSION because of it is minimum in absent of overflow. So, now we can have Integer.MIN_VALUE as UNINITIALIZED_VERSION 😮‍💨 .

Hexiaoqiao · 2023-08-31T09:09:52Z

Some conclusion,
a. zookeeper server side only add 1 for version when update znode. And it allows integer overflow.
b. curator side can't be compatible with integer overflow because it compares old version and new version here and not consider the negative version number.
https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/shared/SharedValue.java#L196-L209
c. The issue will be triggered from user interface both SharedCount#trySetCount and SharedCount#start(). When znode version meet Integer.MIN_VALUE SharedCount will be never update because condition current.getVersion() >= version always true.
d. While add condition current.getVersion() >= version && version != Integer.MIN_VALUE, this case could be resolved. And update znode as expected.

Hexiaoqiao · 2023-08-31T12:47:08Z

One corner case didn't considered, and try to update this PR and trigger CI again.

Hexiaoqiao · 2023-09-01T02:22:54Z

fix checkstyle and trigger ci again.

Hexiaoqiao · 2023-09-01T10:27:08Z

Hi @eolivelli @kezhuw , anymore suggestions here?

kezhuw · 2023-09-02T14:10:36Z

curator-recipes/src/main/java/org/apache/curator/framework/recipes/shared/SharedValue.java

@@ -196,8 +196,12 @@ public boolean trySetValue(VersionedValue<byte[]> previous, byte[] newValue) thr
    private void updateValue(int version, byte[] bytes) {
        while (true) {
            VersionedValue<byte[]> current = currentValue.get();
-            if (current.getVersion() >= version) {
-                // A newer version was concurrently set.


I am not sure wether SharedValue was designed to work with multiple owners, but I saw there is a background watcher to update value and also there is no rule to forbid concurrent usages. So, I assume it should work well in case of concurrency.

Then, let me assume a situation:

current.getVersion is Integer.MAX_VALUE.

Thread1 call trySetValue and succeed to get overflowed version Integer.MIN_VALUE, but the call to updateValue is somewhat delayed.

Thread2 (assume watcher, which runs in ZooKeeper thread if I am not wrong) call updateVersion with version Integer.MIN_VALUE + 1. According to the code, this will be ignored.

Thread1 call updateValue to continue its task with version Integer.MIN_VALUE. It succeeds.

That is all, assume no changes anymore. I know it may not realistic.

current stores dated version while the javadoc says "All clients watching the same path will have the up-to-date value (considering ZK's normal consistency guarantees)".

I am not convinced in the overflow case either. @tisonkun @Hexiaoqiao

For the overflow case,

+1 to deprecate VersionedValue#getVersion so to warn clients about the "ordering assumption" if any about version.

+1 to a viable workaround if any and/or exception.

I think we should refactor updateValue a bit to compare ordering using Stat.mzxid. This way we are not fearing this overflow issue. I am stupid in reviwing without a deep thought, sorry for that. So my finally points are:

Deprecate VersionedValue#getVersion to warn clients about "ordering assumptions" and "overflow behavior".

Refactor updateValue to order using Stat.mzxid.

Throw exception in case of -1 Stat.version in trySetValue. I am positive to ZOOKEEPER-4743.

Document somehow about "overflow" and exception case in trySetValue.

Besides above, should we expose a VersionedValue#getZxid for client usage ?

Any thoughts @tisonkun @eolivelli @Hexiaoqiao ?

I have pushed new commits to go through above direction. Could you please take a look @Hexiaoqiao @tisonkun @eolivelli ?

kezhuw · 2023-09-02T14:54:40Z

@Hexiaoqiao Sorry for the delay. I am stuck about and also fearing this overflow things. There are not simple bugs, there are limitations. Hard to fix, only fragile workaround for best wish. I have some chaos thoughts for this issue:

Curator guarantees no ordering-ness about VersionedValue.getVersion. That is good. Better to document this out of order issue or make it private in future.
The promise about "up-to-date" should be guaranteed.
Silent "never update" is not good which is the goal for us to fix here.

I am ok to a workaround but I am also think exception(e.g. let caller know they hit an implementation limitation) is a good fallback except for background watcher. Though, I am not positive to a good workaround. I think there are few choices when encountering this hard limitation.

Push foward underlying implementation. In this case, a check_zxid version of setData in ZooKeeper.
Rewrites something in caller side. In this case, either curator side or your application.
Workaround possibly fragile.

All these suggestions happened in https://lists.apache.org/thread/4o3rl49rdj5y0134df922zgc8clyt86s. cc @li4wang

Besides above, did you find this in production ? What is your use case ? @Hexiaoqiao

Hexiaoqiao · 2023-09-03T15:07:12Z

@kezhuw Thanks for your detailed comments. I totally agree that this issue is not very easy/simple to fix perfectly. But for my case, this improvement could fix it when try to reproduce.

Besides above, did you find this in production ? What is your use case ?

Of course YES. The corresponding code snippet as following shows.

https://github.com/apache/hadoop/blob/a6c2526c6c7ccfc4483f58695aedf0cb892b4c4d/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/token/delegation/ZKDelegationTokenSecretManager.java#L506-L516

I would like to give a brief explanation (which is dependent by Hadoop project).
A. For HDFS RBF architecture[1], all Routers share the same states which is managed by zookeeper, and Token is one of these states.
B. About token[2], one Router generate token using increment sequence number shared by all Routers and update corresponding znode value (/zkdtsm/ZKDTSMRoot/ZKDTSMSeqNumRoot), let's name it Z.
C. Where token number will be million/day or more, and the version of znode Z will overflow only years after create(maybe less than three years if 5 million/day tokens generated.)

At first, I want to fix it at Hadoop side, but it is not smoothy and could not fix the root cause.

For this PR, my first thought is fix at curator side first, then try to improve it at both zookeeper and curator side as the solution you mentioned above. Any thoughts? Thanks.

[1] https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs-rbf/HDFSRouterFederation.html
[2] http://hortonworks.com/wp-content/uploads/2011/08/adding_security_to_apache_hadoop.pdf

kezhuw · 2023-09-04T00:25:03Z

@Hexiaoqiao Thanks your sharing, that is important!

I found some potential problem in the usages, though I haven't take a deep in look. incrSharedCount uses batch which is what Ted Dunning suggested in above mail thread. That is good. But it is somewhat error-prone in implementation.

The counter itself is a 32-bit integer, so it will overflow finally regard less of Stat.version.
The default batchSize is small. I got two defaults. public static final int DEFAULT_SEQ_NUM_BATCH_SIZE = 10; and public static final int ZK_DTSM_TOKEN_SEQNUM_BATCH_SIZE_DEFAULT = 1;.

So, for your use case, I suggest:

Rewrites sequence number as 64-bit integer. I am not sure what the consequence of this. But 32-bit in your case will overflow finally. It could overflowed already hiding by overflow of Stat.version.
Uses a sensible default batch size and refuse small batch size. This is easy in your case, but you will get negative sequence number cause of above.

Stat.version is 32-bit, it is not good to do a 1 to 1 mapping to application sequence number.

Hexiaoqiao · 2023-09-04T02:57:15Z

@kezhuw Great response. Some concerns here.

Rewrites sequence number as 64-bit integer. I am not sure what the consequence of this. But 32-bit in your case will overflow finally. It could overflowed already hiding by overflow of Stat.version.

Uses a sensible default batch size and refuse small batch size. This is easy in your case, but you will get negative sequence number cause of above.

Actually sequence number can be compatible with 32-bit integer overflow, which only should be distinguishable number without other conditions. IMO it is OK even if overflow, actually I try to test it on our test env and it works fine. So it is not hiding issue here.
Another side, I totally agree that set the default configuration ZK_DTSM_TOKEN_SEQNUM_BATCH_SIZE_DEFAULT to some bigger number (such as 10 or much more) could bypass this issue, however it does not solve root cause about Curator. Right?
Thanks again.

kezhuw · 2023-09-04T06:58:04Z

IMO it is OK even if overflow, actually I try to test it on our test env and it works fine. So it is not hiding issue here.

@Hexiaoqiao Glad to hear, that is good!

however it does not solve root cause about Curator. Right?

I am open to a workaround as long as it fit. For a workaround to work, I think we should guarantee:

The promise about "up-to-date" should be guaranteed.
The promise about "Changes the shared value only if its value has not changed since the version specified by newValue" should be hold. This means we have to throw exception if previous.getVersion() equals to -1 in call chain of trySetCount.

I feel hopeless about viable workaround, but not against one. Anyway please go ahead, I would be happy to hear good news.

Hexiaoqiao · 2023-09-13T13:28:21Z

Thanks @kezhuw for your suggestions and sorry for the late response.

The promise about "up-to-date" should be guaranteed.
The promise about "Changes the shared value only if its value has not changed since the version specified by newValue" should be hold. This means we have to throw exception if previous.getVersion() equals to -1 in call chain of trySetCount.

I am not sure if I understood this point clearly. 'we have to throw exception if previous.getVersion() equals to -1 in call chain of trySetCount', Will it never update successfully when dataversion is back to -1 if we need to guarantee this rule?

kezhuw · 2023-09-13T15:03:36Z

'we have to throw exception if previous.getVersion() equals to -1 in call chain of trySetCount', Will it never update successfully when dataversion is back to -1 if we need to guarantee this rule?

Yes, and no. When previous.getVersion reach to -1, trySetCount has few choices:

Exception to caller for limitation of the implementation, that is Stat.version overflowed and wraparound to -1 and setData with -1 version is a blind update.
Keep silent and do a blind update.

The later behavior is a bug due to the contract what trySetCount try to express. As a library, I don't think Curator should do that for callers silently.

For the "no" part, callers can risk themselves by doing a blind update through setCount.

Hexiaoqiao · 2023-09-14T03:27:03Z

The later behavior is a bug due to the contract what trySetCount try to express. As a library, I don't think Curator should do that for callers silently.

+1, Agree. So how about expose the choice to end user and offer some configuration items to set? IMO, some sensitive data should not be updated blindly, but some case we should offer solution to bypass this bug, such as Token for Hadoop as mentioned above, which can update smoothly even blindly. What do you think about?
TBH, I am not familiar with implement about Curator, so would you mind to list some demo code snippet about configuration if this solution could be approved.

tisonkun · 2023-09-15T04:35:11Z

TBH, I am not familiar with implement about Curator, so would you mind to list some demo code snippet about configuration if this solution could be approved.

Somehow you can copy the shared count implementation..It's not quite a lot of code.

Since this requirement is quite customized, I'm afraid that it's not suitable to hack into Curator. The root cause is integer overflow that is backed by ZK. (check_zxid may be complicated and ZK's zxid also suffers from integer overflow while that is a leader election issue; due to Jute's limitation, we cannot extend version from int to long or add a new field while keep compatibility.)

Also, I don't understand actually how this patch "fixes" the issue. You add more strict condition to exit but the original issue is hang?

kezhuw · 2023-09-15T06:32:49Z

So how about expose the choice to end user and offer some configuration items to set? IMO, some sensitive data should not be updated blindly, but some case we should offer solution to bypass this bug

I am -0 to this approach. Exception is the default, in anyway. Clients can bypass themselves easily on exception if they encounter or aware of this. I don't think it is worth for Curator to do that for this limitation(either ZooKeepr or Curator implementation from perspective) in case of awareness from clients. The important things for Curator here from my side are documentation and exception in code. Anything beyond that are probably overkill.

tisonkun · 2023-09-15T08:45:25Z

I understand the condition now and agree that this patch should work.

While it can be better for ZK to jump from version -2 to 0 so that in an overflow loop we don't handle -1 hole, the restart and get versioned value = -1 remains and set data concurrently doesn't hurt as long as we read the updated data finally.

tisonkun · 2023-09-15T08:51:20Z

reproduce the problem

@eolivelli The tricky point here is that ZK doesn't expose API to manually edit node version so you should change the data for Int.MAX times which can be time-consuming..

kezhuw · 2023-09-15T13:49:38Z

While it can be better for ZK to jump from version -2 to 0 so that in an overflow loop we don't handle -1 hole, the restart and get versioned value = -1 remains and set data concurrently doesn't hurt as long as we read the updated data finally.

Good idea! I just saw ZOOKEEPER-4743.

Hexiaoqiao · 2023-09-18T02:27:02Z

@tisonkun Glad to see you here.

While it can be better for ZK to jump from version -2 to 0 so that in an overflow loop we don't handle -1 hole, the restart and get versioned value = -1 remains and set data concurrently doesn't hurt as long as we read the updated data finally.

It will be more smooth while ZK to skip version -1, Strong +1. From my first thought, it would be the next step after we fix at Curator side. Now it is great to see that other guys already try to push this forward.
For ZOOKEEPER-4743, I just suggest that we should consider to solve dataVersion/aclVersion/cversion 32-integer overflow at once.
Back to this PR, what should we do before ZOOKEEPRER-4743 check in, just wait ZOOKEEPER-4743 to be ready or other ways? Thanks.

kezhuw · 2023-09-19T13:29:31Z

what should we do before ZOOKEEPRER-4743 check in, just wait ZOOKEEPER-4743 to be ready or other ways?

I suggest we use Stat.mzxid to guard ordering in updateValue. (I posted in another thread https://github.com/apache/curator/pull/478/files#r1327946265).

The current approach does not sound correct to me(https://github.com/apache/curator/pull/478/files#r1313839760). And I think it is error-prone.

When Stat.version is -1, exception is enough for us to go. ZOOKEEPRER-4743 is another story, Curator should work irrespective of ZooKeeper version.

tisonkun · 2023-09-20T17:30:20Z

Back to this PR, what should we do before ZOOKEEPRER-4743 check in, just wait ZOOKEEPER-4743 to be ready or other ways? Thanks.

They are irrelevant things.

For this patch, I may give another look as well as the comments above. @Hexiaoqiao you can try to address @kezhuw's comments/

kezhuw · 2023-09-27T09:12:21Z

Hi @Hexiaoqiao, do you still work on this ? Or should I take over this ? I plan to resort to Stat.mzxid in updateValue.

Hexiaoqiao · 2023-10-09T02:23:21Z

Sorry for the late response since I took a long vacation. Please feel free to take over it if interested.

kezhuw · 2023-10-12T04:11:13Z

Thank you @Hexiaoqiao ! I have pushed a commit to using Stat.mzxid. Would you mind to take a look ?

…n of ZNode is overflow.

…n overflowed

tisonkun · 2024-06-03T10:32:33Z

@Hexiaoqiao @kezhuw @eolivelli I'll review this patch in this week. I'd like to try include this patch in 5.7.0.

Signed-off-by: tison <[email protected]>

tisonkun · 2024-06-03T11:17:13Z

curator-recipes/src/main/java/org/apache/curator/framework/recipes/shared/SharedValue.java

        while (true) {
            VersionedValue<byte[]> current = currentValue.get();
-            if (current.getVersion() >= version) {
-                // A newer version was concurrently set.
+            if (current.getZxid() >= zxid) {


What if only one client here and it goes through the overflow bound?

Said current.getZxid() == MAX_VALUE and zxid == MIN_VALUE, the update action will be skipped and the value will never updated.

Or we have different assumption on zxid?

This is paranoid -:).

ZooKeeper guarantees a total order of messages, and it also guarantees a total order of proposals. ZooKeeper exposes the total ordering using a ZooKeeper transaction id (zxid). All proposals will be stamped with a zxid when it is proposed and exactly reflects the total ordering. -- https://zookeeper.apache.org/doc/r3.9.0/zookeeperInternals.html

Every change to the ZooKeeper state receives a stamp in the form of a zxid (ZooKeeper Transaction Id). This exposes the total ordering of all changes to ZooKeeper. Each change will have a unique zxid and if zxid1 is smaller than zxid2 then zxid1 happened before zxid2. -- https://zookeeper.apache.org/doc/r3.9.0/zookeeperProgrammers.html

In case of above situation, I believed that ZooKeeper is doomed to failure. The "never updated" should be negligible in case of the disaster.

eolivelli requested changes Aug 31, 2023

View reviewed changes

Hexiaoqiao force-pushed the CURATOR-688 branch from 74deeb7 to 052450f Compare August 31, 2023 12:44

Hexiaoqiao force-pushed the CURATOR-688 branch from 052450f to 7173103 Compare September 1, 2023 02:21

kezhuw self-requested a review September 1, 2023 10:36

kezhuw reviewed Sep 2, 2023

View reviewed changes

kezhuw requested review from eolivelli and tisonkun October 1, 2023 10:36

CURATOR-688. SharedCount will be never updated successful when versio…

ff801a8

…n of ZNode is overflow.

kezhuw force-pushed the CURATOR-688 branch from 75d3a5e to 3c84466 Compare May 23, 2024 04:51

CURATOR-688. Fix SharedCount/SharedValue not update after Stat.versio…

174a8a1

…n overflowed

kezhuw force-pushed the CURATOR-688 branch from 3c84466 to 174a8a1 Compare May 23, 2024 05:07

Merge branch 'master' into CURATOR-688

849eaa4

tidy redundant suppression

d333087

Signed-off-by: tison <[email protected]>

tisonkun reviewed Jun 3, 2024

View reviewed changes

tisonkun approved these changes Jun 3, 2024

View reviewed changes

kezhuw merged commit a9a8020 into apache:master Jun 4, 2024
10 checks passed

CURATOR-688. SharedCount will be never updated successful when version of ZNode is overflow. #478

CURATOR-688. SharedCount will be never updated successful when version of ZNode is overflow. #478

Conversation

Hexiaoqiao commented Aug 31, 2023 • edited by tisonkun Loading

Hexiaoqiao commented Aug 31, 2023

eolivelli left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Hexiaoqiao commented Aug 31, 2023

Hexiaoqiao commented Aug 31, 2023

Hexiaoqiao commented Sep 1, 2023

Hexiaoqiao commented Sep 1, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kezhuw commented Sep 2, 2023 • edited Loading

Hexiaoqiao commented Sep 3, 2023

kezhuw commented Sep 4, 2023 • edited Loading

Hexiaoqiao commented Sep 4, 2023

kezhuw commented Sep 4, 2023

Hexiaoqiao commented Sep 13, 2023

kezhuw commented Sep 13, 2023

Hexiaoqiao commented Sep 14, 2023

tisonkun commented Sep 15, 2023 • edited Loading

kezhuw commented Sep 15, 2023

tisonkun commented Sep 15, 2023

tisonkun commented Sep 15, 2023

kezhuw commented Sep 15, 2023

Hexiaoqiao commented Sep 18, 2023

kezhuw commented Sep 19, 2023

tisonkun commented Sep 20, 2023

kezhuw commented Sep 27, 2023

Hexiaoqiao commented Oct 9, 2023

kezhuw commented Oct 12, 2023

tisonkun commented Jun 3, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Hexiaoqiao commented Aug 31, 2023 •

edited by tisonkun

Loading

kezhuw commented Sep 2, 2023 •

edited

Loading

kezhuw commented Sep 4, 2023 •

edited

Loading

tisonkun commented Sep 15, 2023 •

edited

Loading