v1.x RFC: pull based approach #430

freeekanayaka · 2023-06-07T19:08:36Z

The v0.x series of libraft has been around for a few years now, and during that time we gained experience around its strengths and weaknesses.

I believe there's now enough data that we could aim at introducing a v1.x series with better long-term design, extensibility and maintainability.

I have a few ideas around this topic which I've been ruminating around for quite some time now, and I'd like to get feedback about them. I'll start by listing what I consider the aspect of the current design that would need to be improved and I'll then add possible approaches. It's not anything radical, but rather a change in the control flow that should however make things more flexible, composable and understandable.

Aspects of v0.x design that could be improved

`struct raft_io` and `struct raft_fsm`

The initial goal of the struct raft_io interface was to make it easy to implement the network and storage parts of struct raft, while keeping those concerns hidden from the point of view of struct raft consumers, which would get a reasonably friendly/intuitive API.

The same can be said of struct raft_fsm, which was meant to offer an easy way for struct raft consumers to plug in their application FSM.

However, we observed that there are actually several flavors/styles of FSMs and I/O that we want to support: synchronous vs aynchronous FSM snapshots, in-memory vs disk-based FSMs, chunked snapshots, io_uring-style I/O (see below). All that puts a strain on struct raft_io and struct raft_fsm that must be made generic enough to accommodate all these various mechanics, and that leads to increased complexity.

push model

The current callback-based style (which I'll call "push model", because we have user's code that gets called by raft, instead of raft code that gets called by the user) was mainly inspired by libuv and other similar asynchronous frameworks in other languages (Javascript and Python).

That might be fine for writing asynchronous network clients or mostly-stateless web services (with request/response dynamics and state managed by a db). However it seems to complicate the control flow a bit for complex stateful engines like struct raft.

This is something that was discussed in libuv itself, however it is harder for them to move away from that because of backward compatibility, see libuv/leps#3 and libuv/libuv#6.

Interestingly, io_uring is instead based on a pull model, which is more flexible and efficient for asynchronous systems, because it gives control to the user and not to the framework.

Possible ideas for v1.x

I think both the two concerns above can be addressed by moving to a pull model.

Essentially the control flow would be inverted: struct raft would be a pure state machine which does not make any call to external code, instead the user would be responsible to drive struct raft and advance its state based on external events (network I/O, FSM, user requests). Such design is actually similar to other raft implementation like etcd's raft.

The benefit will be to decouple struct raft from any "rigid" struct raft_io or struct raft_fsm interfaces, which would then be in the hand of the consumer and not dictated by libraft (although libraft would still provide conveniences for that). Later down the road it would also make it possible to have a pure and idiomatic io_uring-based implementation (no libuv), which takes full advantage of all the features that the io_uring approach has with respect to epoll and friends (pull/completion-based vs push/readiness-based I/O see also https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023 from io_uring's author).

The current logic of struct raft wouldn't need to change much, as this change mainly concerns the "border" of struct raft, i.e. its user-facing APIs that would become pull based instead of push based. I have some ideas about how the pull based API could look like, but first wanted to gather feedback around the high-level approach. I can surely provide concrete API sketches if they help to clarify.

Implementation remarks

It would be possible to work towards v1.x slowly and incrementally, without any abrupt change. Being v1.x more flexible and general than v0.x, we could keep supporting v0.x for as long as we want, with some glue/compatibility code.

The text was updated successfully, but these errors were encountered:

cole-miller · 2023-06-07T19:43:28Z

I like the general idea! I agree that a "pull model" is more elegant and probably more flexible pleasant to maintain.

How do you see filesystem I/O working in this model? When libraft decides it's time to persist a log entry or take a snapshot, would that actually be done by the caller (i.e. dqlite)?

freeekanayaka · 2023-06-07T20:25:44Z

I like the general idea! I agree that a "pull model" is more elegant and probably more flexible pleasant to maintain.

Yeah, I feel the same. There are tradeoffs, but they are probably worth.

How do you see filesystem I/O working in this model? When libraft decides it's time to persist a log entry or take a snapshot, would that actually be done by the caller (i.e. dqlite)?

I would like this to be similar to io_uring's design. In that case there are 2 queues, submission queue and completion queue. The part that wants to do I/O (user code) places its request in the submission queue, then the part that executes the I/O (the kernel) fetches the request from the submission queue and places the result on the completion queue. The nice thing is that the two sides are decoupled: there's no hard constraint about what happens when, user code can place requests in the submission queue when it wants and the kernel fetches them when it wants (of course those steps can be optionally synchronized). The same for the completion queue.

When struct raft decides it's time to persist an entry it would place a request into something equivalent to the submission queue, then the code that is driving struct raft would pull the request from the submission queue and place it back into something equivalent to a completion queue when done. The nice part here is that typically struct raft will place into the submission queue a bunch of requests in a single "step" of the state machine, not a single request (e.g. persist an entry, send messages, etc). This way all those requests can be "reaped" at the same time and submitted efficiently by the driver code (e.g. with io_uring).

I would not expect dqlite to execute all this driving directly (that'd be a lot), I still think libraft should offer some supporting code external to struct raft to make users' job easier (that code would perform logic similar to our current libuv-based struct raft_io-based implementation). However, there would not be any rigidly pre-defined interface to adhere to (like struct raft_io and struct raft_fsm), so in principle callers could just decide to drive struct raft directly, or to implement a slightly different version of this supporting code, more apt to their use case.

MathieuBordere · 2023-06-08T13:37:02Z

Could you sketch how the API would work and how dqlite would consume it? From my perspective, it has to be a direct win for dqlite. Ideally (still imo) dqlite ultimately will only have 1 way of operating (disk probably) and 1 way of doing things with the raft library and having a very general libraft might not currently fit in dqlite's goals.

freeekanayaka · 2023-06-08T20:57:16Z

Could you sketch how the API would work and how dqlite would consume it?

For simplicity let's first start focusing only on the aspects of the API strictly relevant to dqlite as a consumer.

One important thing that I'd like to reiterate is that all this can be done in stages and very gradually, over a relatively long period of time, during which we might refine things as we learn and explore more.

Stage 0 would be probably no change at all for dqlite itself, in how it consumes libraft. Externally libraft would keep the exact same v0.x API that we have right now, but internally libraft could have already a v1.x API that should make struct raft inner mechanics a bit easier to reason about (I'll expand on this separately to keep the focus on dqlite). Since the v1.x API is more flexible, the v0.x API would be just a thin shim around it, that we can keep around as long as we deem fit.

Stage 1 could be dqlite beginning to move away from the push model and adopt the pull model. Basically there are only two v0.x APIs that dqlite (or any other consumer for that matter) depends on:

int raft_apply(struct raft *r,
               struct raft_apply *req,
               const struct raft_buffer bufs[],
               const unsigned n,
               raft_apply_cb cb);

and

struct raft_fsm; /* Definition omitted for brevity */

I'll treat them separately.

`raft_apply()`

I think the raft_apply() part is basically fine and would be unchanged, except for one important detail which I feel is a fundamental misdesign of v0.x (regardless of v1.x): the raft_apply_cb callback. I believe that's wrong not so much because it's a push model, but more importantly because it "splits" the consumer control flow into two. Let me explain.

In Raft the "idiomatic" (and arguably most natural) way/place to handle a command which has been committed is when applying it to the FSM: that's where the Raft application updates its state and replies to the client. The fact that we have a raft_apply callback is unfortunate because it means there are 2 places/hooks where that handling occurs, instead of one. That doesn't encourage dqlite/consumers to implement a saner control flow.

So I'd suggest to change the raft_apply() signature to:

int raft_apply(struct raft *r, const struct raft_buffer bufs[], unsigned n);

The change for dqlite would be that we'd move the raft_apply_cb logic to the dqlite FSM, so we'd have a single entry point/context when handling committed commands and a simpler control flow.

(instead of changing the existing signature we could introduce a separate new function, sayraft_submit() or something like that, and drop raft_apply only later down the road when dropping v0.x support).

`struct raft_fsm`

This is where we'd move more into a pull model and take advantage of the associated code/interface decoupling.

Instead of forcing a specific interface via struct raft_fsm methods, which struct raft is meant to call, dqlite would "poll" struct raft to see if there's anything to do.

The struct raft state machine would have a queue where it pushes operations to be performed by the application (dqlite). This would be similar to io_uring's submission queue. An item in the queue would be a struct with "abstract" and "concrete" definitions (the same pattern libuv uses for handles, and that we already use too, to some extent) :

/* Codes for the various types of operations that can be pushed to the queue */
enum raft_op_type {
    RAFT_APPLY_COMMAND,
    RAFT_TAKE_SNAPSHOT,
    /* ... */
};

/* Common fields across all operation structs */
#define RAFT_OP_FIELDS      \
    int version;            \
    enum raft_op_type type

/* Abstract operation struct */
struct raft_op {
    RAFT_OP_FIELDS;
}

/* Instruct the consumer to apply a given command to its FSM */
struct raft_apply_command
{
    RAFT_OP_FIELDS;
    raft_index index;
    const struct raft_buffer *command;
};

/* Instruct the consumer to take a snapshot of its FSM and persist it along with the given metadata */
struct raft_take_snapshot
{
    RAFT_OP_FIELDS;
    raft_index index; /* Index of last committed entry */
    raft_term term; /* Term of last committed entry */
    struct raft_configuration configuration; /* Last committed configuration */
    raft_index configuration_index; /* Index of the configuration */
};

The operation queue could be consumed with an API like:

/* Return the next operation in the queue, or NULL if empty. */
struct raft_op *raft_poll(struct raft *r);

The way dqlite consumes the queue is not dictated, but since we use libuv, a natural way would be to use a libuv uv_check_t handle to poll struct raft after I/O has happened, e.g.:

struct raft *r; /* handle to dqlite's raft object */
struct raft_op *op;
while ((op = raft_poll(r))) {
    switch (op->type) {
        case RAFT_APPLY_COMMAND:
             /* Apply command to dqlite's FSM */
            startApplyingCommand(op);
            break;
        case RAFT_TAKE_SNAPSHOT:
            /* Take a snapshot of dqlite's FSM and then store it on disk */
            startTakingSnapshot(op);
            break;
    }
}

Note that dqlite would be completely free to carry out the operation as it wants. For example for snapshots there would be no "prescribed" sequence of events or APIs: the can be taken synchronously or asynchronously, using a threadpool or not, saved in chunks or not, etc. Same for applying the command, that can be synchronous or not (e.g. for storing state on disk).

All dqlite would need to do would be to inform struct raft when it has completed the operation that was requested. For that we could have an API like:

int raft_done(struct raft *r, struct raft_op* op, int status);

which would be the equivalent of io_uring's completion queue.

This would roughly be how a "stage 1" dqlite adoption of v1.x could be like. In a further "stage 2" development we could turn more parts of dqlite into a push-based model, even parts that are not directly raft-related. That should make reasoning about control flow easier than callbacks.

For sake of brevity for now I'm not including a sketch of the part of the API that would be the pull-based equivalent of struct raft_io, but it'd be basically just an extension of the above mechanism (which is the pull-based equivalent of struct raft_fsm). I can provide it as a follow up though.

From my perspective, it has to be a direct win for dqlite.

I believe a v1.x design along these lines would grant dqlite the flexibility to evolve with less coupling to struct raft and with less need to change struct raft in response to new dqlite needs or approaches. That seems a win for dqlite because sometimes the current struct raft_io/struct raft_fsm interfaces just seem to get in the way.

Ideally (still imo) dqlite ultimately will only have 1 way of operating (disk probably) and 1 way of doing things with the raft library and having a very general libraft might not currently fit in dqlite's goals.

The goal would be to make code structure/flows more understandable, composable and manageable (both for libraft and possibly later for dqlite, e.g. with a broader "stage 2"), that should help reducing bugs and making it easier to fix them when they occur.

freeekanayaka · 2023-06-30T09:24:26Z

/* Codes for the various types of operations that can be pushed to the queue */
enum raft_op_type {
    RAFT_APPLY_COMMAND,
    RAFT_TAKE_SNAPSHOT,
    /* ... */
};

/* Common fields across all operation structs */
#define RAFT_OP_FIELDS      \
    int version;            \
    enum raft_op_type type

/* Instruct the consumer to take a snapshot of its FSM and persist it along with the given metadata */
struct raft_take_snapshot
{
    RAFT_OP_FIELDS;
    raft_index index; /* Index of last committed entry */
    raft_term term; /* Term of last committed entry */
    struct raft_configuration configuration; /* Last committed configuration */
    raft_index configuration_index; /* Index of the configuration */
};

Note that dqlite would be completely free to carry out the operation as it wants. For example for snapshots there would be no "prescribed" sequence of events or APIs: the can be taken synchronously or asynchronously, using a threadpool or not, saved in chunks or not, etc. Same for applying the command, that can be synchronous or not (e.g. for storing state on disk).

Just wanted to highlight that as I mentioned during our call this v1 version of the API would avoid the tight application coupling we have in v0 when it comes to taking snapshots: in the sketch above there is no mandated buffer that the application has to use, and no mandated way in which snapshots must be taken and persisted, so dqlite could for example write its snapshot to disk incrementally with whatever best strategy is appropriate, or if the dqlite FSM gets eventually backed by storage a copy-on-write approach could be used.

Signed-off-by: Free Ekanayaka <[email protected]>

MathieuBordere added Maybe Undecided whether in scope for the project Feature New feature, not a bug labels Jun 12, 2023

freeekanayaka mentioned this issue Jun 28, 2023

[question] Understanding how to add new servers #440

Closed

freeekanayaka mentioned this issue Nov 15, 2023

replication: Try to reinstate entries that have been truncated away #483

Merged

freeekanayaka mentioned this issue Dec 29, 2023

async snapshot install support cowsql/raft#123

Open

cole-miller closed this as completed Feb 13, 2024

cole-miller mentioned this issue Feb 14, 2024

Mothball the project #497

Merged

freeekanayaka referenced this issue in cowsql/raft Mar 21, 2024

raft: Require batch attribute to be set when firing RAFT_SUBMIT

669cc92

Signed-off-by: Free Ekanayaka <[email protected]>

freeekanayaka mentioned this issue Apr 7, 2024

Functionallity for snapshot suggestion cowsql/raft#185

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v1.x RFC: pull based approach #430

v1.x RFC: pull based approach #430

freeekanayaka commented Jun 7, 2023

cole-miller commented Jun 7, 2023

freeekanayaka commented Jun 7, 2023 •

edited

Loading

MathieuBordere commented Jun 8, 2023

freeekanayaka commented Jun 8, 2023 •

edited

Loading

freeekanayaka commented Jun 30, 2023

v1.x RFC: pull based approach #430

v1.x RFC: pull based approach #430

Comments

freeekanayaka commented Jun 7, 2023

Aspects of v0.x design that could be improved

struct raft_io and struct raft_fsm

push model

Possible ideas for v1.x

Implementation remarks

cole-miller commented Jun 7, 2023

freeekanayaka commented Jun 7, 2023 • edited Loading

MathieuBordere commented Jun 8, 2023

freeekanayaka commented Jun 8, 2023 • edited Loading

raft_apply()

struct raft_fsm

freeekanayaka commented Jun 30, 2023

`struct raft_io` and `struct raft_fsm`

freeekanayaka commented Jun 7, 2023 •

edited

Loading

freeekanayaka commented Jun 8, 2023 •

edited

Loading

`raft_apply()`

`struct raft_fsm`