Connection remains in INACTIVE state for a prolonged period of time. #2930
Labels
for: team-attention
An issue we need to discuss as a team to make progress
status: mre-available
Minimal Reproducible Example is available
status: waiting-for-triage
Bug Report
I noticed that some of the cached write connections returned by getWriteConnection(slot) in PooledClusterConnectionProvider.java remain in an inactive state indefinitely. This differs from getReadConnection(slot), where we check whether a connection is active before returning it to the caller.
Current Behavior
Connections remain in an inactive state, so every request fails with "Currently not connected. Commands are rejected." when the rejectCommandsWhenInactive config is in effect.
Possible Solution
I made some changes to getWriteConnection(slot), similar to the read path, and that effectively fixed the issue. Essentially, whenever a node connection is requested, I check its status and return a new connection if the cached one is marked as inactive. I understand this is somewhat inefficient, since connections that might have become active again on their own are also closed as part of this fix.
Below are the snippets of the code I modified.
From PooledClusterConnectionProvider.java
From AsyncConnectionProvider.java
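As a rough, self-contained illustration of the idea described above (this is not the actual Lettuce code), the check-and-replace step could look something like the sketch below. The names NodeConnection, FakeConnection, and WriteConnectionCache are hypothetical stand-ins invented for this example, and isActive() is an assumed status check rather than a confirmed Lettuce method.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntFunction;

/** Hypothetical stand-in for a cluster node connection; not a Lettuce type. */
interface NodeConnection {
    boolean isActive();
    void close();
}

/** Trivial in-memory connection used only to exercise the sketch. */
final class FakeConnection implements NodeConnection {
    volatile boolean active = true;
    volatile boolean closed = false;

    @Override
    public boolean isActive() {
        return active && !closed;
    }

    @Override
    public void close() {
        closed = true;
    }
}

/** Hypothetical cache of write connections keyed by slot. */
final class WriteConnectionCache {
    private final Map<Integer, NodeConnection> bySlot = new ConcurrentHashMap<>();
    private final IntFunction<NodeConnection> connector;

    WriteConnectionCache(IntFunction<NodeConnection> connector) {
        this.connector = connector;
    }

    /**
     * Returns the cached connection for the slot if it is still active.
     * Otherwise closes the stale entry and replaces it with a fresh connection.
     */
    NodeConnection getWriteConnection(int slot) {
        return bySlot.compute(slot, (key, cached) -> {
            if (cached != null && cached.isActive()) {
                return cached; // healthy connection: reuse it
            }
            if (cached != null) {
                cached.close(); // inactive: drop it even if it might have recovered later
            }
            return connector.apply(key); // open a replacement for this slot
        });
    }
}
```

The trade-off is the one mentioned above: an inactive connection is closed eagerly, even if it would have reconnected on its own. In this sketch the replacement is created inside compute(), which keeps the swap atomic per slot but blocks other callers for that slot while connecting.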
Additional context
We baked this fix for over a month in an industry-standard codebase with QPS on the order of millions, without any issues. Logging showed multiple instances where this fix let a request go through instead of failing with the exception mentioned above.
I would like your thoughts on this and if this is something we can turn into a pull request.