Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use custom error types and remove error downcasting #52

Open
Stebalien opened this issue Jan 20, 2022 · 5 comments
Open

Use custom error types and remove error downcasting #52

Stebalien opened this issue Jan 20, 2022 · 5 comments
Labels
cleanup enhancement New feature or request P3

Comments

@Stebalien
Copy link
Member

Currently, we have a lot of "anyhow" errors and rely on error downcasting to figure out the right exit code. Unfortunately, this makes it very difficult to figure out what the exit code should be.

Instead, we'd ideally:

  1. Abort immediately once we know the right exit code (power actor tests part 3 #282).
  2. Return situation-specific errors from state types like partitions & deadlines, so we can figure out the correct exit code.
@jennijuju jennijuju transferred this issue from filecoin-project/ref-fvm Mar 7, 2022
@jennijuju jennijuju added this to the Network v16 milestone Mar 8, 2022
@jennijuju
Copy link
Member

jennijuju commented Mar 8, 2022

from zen: make sense to do before the audit, but shouldn't be a blocker

@jennijuju jennijuju modified the milestones: Network v16, Network v17 Mar 8, 2022
@anorth anorth moved this to Todo in Network nv17 Aug 11, 2022
@anorth
Copy link
Member

anorth commented Aug 11, 2022

#197 , #214 does a heap of the work for this (not yet merged, needs stewardship)

@anorth anorth moved this from Todo to Opportunistic in Network nv17 Aug 11, 2022
@anorth anorth added enhancement New feature or request and removed kind/improvement labels Sep 1, 2022
@jennijuju jennijuju removed this from Network nv17 Nov 25, 2022
@jennijuju jennijuju moved this to 🏗 In progress in Network v18 Nov 25, 2022
@jennijuju jennijuju removed this from Network v18 Nov 25, 2022
@anorth anorth removed this from the Network v17 milestone Mar 30, 2023
@anorth
Copy link
Member

anorth commented Jul 21, 2023

I've spent some time investigating how to improve things here, but it turns out to be quite complex. In our current situation we have:

  • A pattern of [Result].map_err(|e| e.downcast_default(...)) for adapting non-ActorErrors into ActorError. It's quite verbose.
  • A trait AsActorError implemented for Result<T, E: Display> with context_code() and similar methods to adapt into an Result<T, ActorError> with a specified exit code. These are concise to use but don't attempt any downcasting to discover wrapped exit codes.

Some problems with this state:

  • The map and downcast pattern isn't uniformly used. There are plenty of instances of mapping to a concrete ActorError without attempting to downcast. The pattern is not obvious enough to apply consistently.
  • The downcasting implementation relies on global knowledge of all the types that could be wrapping an ActorError. It's incomplete, e.g. it's missing fvm_ipld_kamt::Error and the frc46 token StateError (which can wrap an fvm_ipld_hamt::Error). In Rust, we'd need to import all these error types into the module defining the trait in order to implement it, and could easily miss new ones. The approach doesn't scale to more different error types.
  • Using AsActorError::context_code() can fail to propagate an ActorError nested inside, e.g. hamt::Error::Dynamic.
  • Because ActorError implements Display, the context_code() method can be inadvertently used to overwrite an exit code even when it's not hidden inside another error type

Why do we have downcasting at all? There are some places where we can generate an ActorError with exit code but not propagate it directly, usually due to library code. E.g. the HAMT/AMT for_each() method with return an Error::Dynamic containing an anyhow::Error that's actually an ActorError raised from the callback.

In #1322 I tried replacing a bunch of map and downcast patterns with more concise trait methods. It's on hold at present because a few tests fail due to changes in exit codes, as a code generated inside for_each was trashed (e.g. this USR_ILLEGAL_ARGUMENT in remove partitions).
#1338 is another attempt that aimed to avoid inadvertent trashing of exit codes, but it's stuck by the realisation that downcasting needs global knowledge of all the error types that could hide an ActorError, which it doesn't have.

So what shall we do? Some options:

  1. Try to get it "right", with the global knowledge of error types and robust downcasting. We can probably still find more concise patterns than map_err+downcast. Support generating ActorError deep in the code and propagating with ?. Something like leaning into Another attempt at error cleanup that preserves more exit codes #1338
  2. Give up on preserving exit codes generated low down in the code. Support ActorError deep in the code and ? propagation, but usually only ILLEGAL_STATE. Other exit codes can be generated in top-level actor code. Something like Remove legacy error downcasting from miner actor #1322.
  3. Implement specific error types for state objects that enumerate error conditions. Stop returning ActorError from state methods, instead only generate exit code in top-level actor business logic.
  4. Make the abort() syscall available in state methods (perhaps via a new very slim runtime trait passed to them) and call it directly in state code, with no opportunity for actor code to handle the error or adapt the exit code. Implement mock runtime and test VM etc to use panic or similar to support this.

@Stebalien
Copy link
Member Author

E.g. the HAMT/AMT for_each() method with return an Error::Dynamic containing an anyhow::Error that's actually an ActorError raised from the callback.

We should implement iteration with external iterators instead to fix this issue. I have a WIP patch locally that I can try to cleanup and ship if there's interest.

That should remove a large part of this issue.

Give up on preserving exit codes generated low down in the code.

In many cases, I believe this is the right thing to do. Most of these errors should just be "illegal state".

Implement specific error types for state objects that enumerate error conditions. Stop returning ActorError from state methods, instead only generate exit code in top-level actor business logic.

IMO, this is the "correct" solution. It's just really annoying which is why most rust users just use crates like anyhow and yolo it.

But yeah, we can do this where it matters. I.e.:

  1. Implement custom error types.
  2. Implement From<CustomErrorType> for ActorError wherever we care about conversions.

Make the abort() syscall available in state methods (perhaps via a new very slim runtime trait passed to them) and call it directly in state code, with no opportunity for actor code to handle the error or adapt the exit code. Implement mock runtime and test VM etc to use panic or similar to support this.

I like the idea of aborting early but... I'd prefer not to do this in all cases:

  1. IMO, we should abort directly if some state invariant is violated. Specifically: state block doesn't exist, state fails to decode, etc.
  2. IMO, we shouldn't abort directly if, e.g., the user passes an illegal argument, something isn't found, etc. (the actor may want to react to this in some way?).

However, I'm concerned about using the runtime here as it may be hard to thread it through to the very bottom (and makes it harder to reason about what the code can do). Instead, I'd consider exposing it as a bare function that:

  1. Panics when compiled to a native target. We can smuggle the exit code through the panic and catch the panic at the top-level.
  2. Calls the exit syscall otherwise.

@anorth
Copy link
Member

anorth commented Aug 1, 2023

External iterators would be great. And then we could "give up" on doing better than "illegal state" for remaining hard cases.

Your proposal re aborting sounds reasonable, and could clean up some boilerplate. In mechanism, why not just call the runtime's 'abort' anyway and have that panic (and catch it out the other side). Conditional compilation and static cling already causes headaches in this repo, I would rather avoid adding any more that we can.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cleanup enhancement New feature or request P3
Projects
None yet
Development

No branches or pull requests

3 participants