Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Blocking commands may block permanently due to lost wakeup #2473

Open
1 of 2 tasks
PokIsemaine opened this issue Aug 7, 2024 · 12 comments
Open
1 of 2 tasks

Blocking commands may block permanently due to lost wakeup #2473

PokIsemaine opened this issue Aug 7, 2024 · 12 comments
Labels
bug type bug

Comments

@PokIsemaine
Copy link
Contributor

Search before asking

  • I had searched in the issues and found no similar issues.

Version

unstable

Minimal reproduce step

I discovered this issue while testing a PR(#2332), even though the PR increased the likelihood of it occurring, the problem actually existed beforehand.
It is difficult to reproduce manually without modifying the code, and it almost exclusively occurs during Go's integration tests.
Here's an example with TestRegression:

func TestRegression(t *testing.T) {
    srv := util.StartServer(t, map[string]string{})
    defer srv.Close()

    ctx := context.Background()
    rdb := srv.NewClient()
    defer func() { require.NoError(t, rdb.Close()) }()

    c := srv.NewTCPClient()
    defer func() { require.NoError(t, c.Close()) }()

    proto := "*3\r\n$5\r\nBLPOP\r\n$6\r\nhandle\r\n$1\r\n0\r\n"
    require.NoError(t, c.Write(fmt.Sprintf("%s%s", proto, proto)))

    resList := []string{"*2", "$6", "handle", "$1", "a"}

    v := rdb.RPush(ctx, "handle", "a")
    require.EqualValues(t, 1, v.Val())
    for _, res := range resList {
        c.MustRead(t, res)
    }

    v = rdb.RPush(ctx, "handle", "a")
    require.EqualValues(t, 1, v.Val())

    for _, res := range resList {
        c.MustRead(t, res)
    }
}

By printing some information, I was able to trace the error:

Start Execute BPOP
TryPopFormList

CommandPush::Execute
WakeupBlockingConns: handle, n_conns: 1
CommandPush::Execute OK

BlockingCommander::StartBlocking
BlockKeys

The issue occurs here:

// CommandBPop
  Status Execute(Server *srv, Connection *conn, std::string *output) override {
    srv_ = srv;
    InitConnection(conn);

    auto s = TryPopFromList();
    if (s.ok() || !s.IsNotFound()) {
      return Status::OK();  // error has already output in TryPopFromList
    }
    // <==HERE: Push is done after TryPopFromList, but not yet Blocking
    return StartBlocking(timeout_, output);
  }

To manually reproduce the issue consistently, you can add a sleep and complete a PUSH within 5 seconds:

  Status Execute(Server *srv, Connection *conn, std::string *output) override {
    srv_ = srv;
    InitConnection(conn);

    auto s = TryPopFromList();
    if (s.ok() || !s.IsNotFound()) {
      return Status::OK();  // error has already output in TryPopFromList
    }
    std::cout << "Sleep for 5 seconds" << std::endl;
    std::this_thread::sleep_for(std::chrono::seconds(5));
    std::cout << "Wake up" << std::endl;
    return StartBlocking(timeout_, output);
  }

image

What did you expect to see?

Blocking commands will not be blocked after new key values ​​are added.

What did you see instead?

Blocking command blocks permanently.

Anything Else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@git-hulk
Copy link
Member

git-hulk commented Aug 7, 2024

@PokIsemaine I'm not sure if I understood your point correctly:

  1. Client-A sends B[L|R]POP command to Kvrocks and no element in the list at that time
  2. Client-B publishes an element but Client-A hasn't entered the blocking mode, so Client-B's message won't wake up the Client-A.
  3. Client-A needs to be pending until the next publish message

Is that right?

@PokIsemaine
Copy link
Contributor Author

@PokIsemaine I'm not sure if I understood your point correctly:

  1. Client-A sends B[L|R]POP command to Kvrocks and no element in the list at that time
  2. Client-B publishes an element but Client-A hasn't entered the blocking mode, so Client-B's message won't wake up the Client-A.
  3. Client-A needs to be pending until the next publish message

Is that right?

Yeah

@git-hulk
Copy link
Member

git-hulk commented Aug 7, 2024

I see. Thank you for raising this issue. I knew this issue indeed, but didn't find a proper way to solve the data race. It's nice to have an issue to track this.

PokIsemaine added a commit to PokIsemaine/kvrocks that referenced this issue Aug 7, 2024
@PokIsemaine
Copy link
Contributor Author

/assign

@mapleFU
Copy link
Member

mapleFU commented Oct 29, 2024

I forgot this a bit, is this issue fixed?

@mapleFU
Copy link
Member

mapleFU commented Oct 29, 2024

@RiversJin tells me and I just realized that Blocking commands might need a mechenism for lock and blocking. Some commands, like bzpop, would block on multiple keys. If multiple bzpop comming with same key and different order, would this being a dead lock?

@PokIsemaine
Copy link
Contributor Author

PokIsemaine commented Oct 29, 2024

@RiversJin tells me and I just realized that Blocking commands might need a mechenism for lock and blocking. Some commands, like bzpop, would block on multiple keys. If multiple bzpop comming with same key and different order, would this being a dead lock?

We can try using MultiLockGuard to get the locks in the same order

@PokIsemaine
Copy link
Contributor Author

I forgot this a bit, is this issue fixed?

It's probably not fixed yet, but I think it might be resolved after #2620. I'll test this part later.

@mapleFU
Copy link
Member

mapleFU commented Oct 29, 2024

We can try using MultiLockGuard to get the locks in the same order

That's not lock, I think "BlockKeys" might following the same ordering? Or we should have a very well-defined order for "Lock-acquire - blockOnKey - LockRelease". The later part can be solved by #2620 or being a different patch

The current problem is that the sequence is "Lock-acquire - LockRelease - blockOnKey", which between release and blockOn the whole things can be done by another thread

@mapleFU
Copy link
Member

mapleFU commented Oct 29, 2024

The blockKeys itself might not blocking, my bad, this can(and might should) be handled by well-designed sequence

@PokIsemaine
Copy link
Contributor Author

But after #2620, the lock is released after the blockkey

@mapleFU
Copy link
Member

mapleFU commented Oct 29, 2024

Aha, you're right. So this should be fixed after that. Maybe we can also denote that the lock should be held when StartBlocking and OnWrite called

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug type bug
Projects
None yet
Development

No branches or pull requests

3 participants