Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aws::Rails::SqsActiveJob#execute captures StandardError which is too wide #120

Open
jeropaul opened this issue May 14, 2024 · 8 comments
Open
Labels
feature-request A feature should be added or improved. p3 This is a minor priority issue

Comments

@jeropaul
Copy link

Steps to reproduce

  1. Set up AWS RDS to manage master passwords and rotate
  2. Configure your database details via environment properties
  3. Wait for your DB password to change

Once step 3 occurs my ActiveRecord jobs start throwing ActiveRecord::NoDatabaseError which is a subclass of StandardError

The current behaviour of Aws::Rails::SqsActiveJob#execute captures StandardError.

Expected behavior

  1. I would expect an ActiveRecord::NoDatabaseError NOT to be caught.
  2. I would expect the poller to die.

In my case the poller process finishing would kill the container which would be rescheduled with refreshed environment properties.

Actual behavior

The poller continues consuming from the queue, retrying messages at a great rate of speed.

System configuration

Rails version: 7.1

Ruby version: 3.3

@jterapin
Copy link
Contributor

Thank you for the submitting this issue! We will investigate and get back to you.

@alextwoods
Copy link
Contributor

This is a bit tricky - the logic for retrying StandardErrors is that arbitrary user code may throw transient StandardErrors which generally should be retried.

You can disable this behavior by configuring standard_error_retry to false (this logic was added recently in #115). However, this discards jobs rather than causing the entire poller to fail and stop running - in general, the poller was meant to be robust to errors in jobs. However, this seems like a special case where we want to detect such errors and actually kill the poller. I think we would need to come up with a list of these exceptions. ActiveRecordError - the superclass of NoDatabaseError - is likely too broad.

@jeropaul
Copy link
Author

I don't expect that it will be possible to come up with a full and complete list of error classes that should be ignored.

Personally I'd be happy if I could configure a list of error classes that should be re-raised by the executor with some really tight default that I can expand over time.

In terms of prior art and it is a slightly different use case but off the top of my head NewRelic allows error classes to be ignored via configuration

Which for one of my apps looks like:

  error_collector:
    ignore_errors: "ActionController::RoutingError,Sinatra::NotFound,ActionDispatch::Http::MimeNegotiation::InvalidType"

TLDR:

  1. continue to catch StandardError
  2. re-raise if the class is included in a configured list

@alextwoods
Copy link
Contributor

I've been doing some more investigation and thinking on this. I agree we can't really come up with a complete or useful list of error classes that should be considered fatal. (Side note: fatal may be a better description than ignored, since its some what the opposite. Those errors should cause the poller to raise the error and cause the worker to exit rather than being ignored).

I think rather that configuring a list, we should follow the user configured error handlers. That error handler could choose to re-raise specific exceptions, causing the poller to fail and exit.

I've marked this as a feature request and we'll create an internal story to track it. Or alternatively, we're happy to accept PRs!

@alextwoods alextwoods added the feature-request A feature should be added or improved. label May 15, 2024
@jeropaul
Copy link
Author

we should follow the user configured error handlers.

Can you elaborate further in regards to what this solution looks like?

@jeropaul
Copy link
Author

Are you referring to https://guides.rubyonrails.org/active_job_basics.html#exceptions where each job is expected to handle errors and anything not handled by a job causes the poller to fail.

@alextwoods
Copy link
Contributor

No - when a job has un-handled errors, the job should fail but should not cause the entire poller to fail.

What I was thinking is that the executor's error handler could be configured by a user, similar to what is supported in Shoryuken's Exception handler.

What I'm thinking:

# interface is a callable with 3 args: exception, the sqs message (so that delete could be called) and the job
my_handler = proc do |exception, message, job|
  if should_terminate_poller?(exception) # some custom logic that inspects the exception
    raise exception # re-raise the exception, causing the poller to terminate
  end
  # some other custom handling
end
Aws::Rails::SqsActiveJob.configure do |config|
  config.exception_handlers = [my_handler, DefaultErrorHandler]
end

@jeropaul
Copy link
Author

jeropaul commented Jun 7, 2024

This is looking to be a bigger change than I was expecting!

The current behaviour of Concurrent::RubyThreadPoolExecutor which is the concrete implementation of Concurrent::ThreadPoolExecutor used by Aws::Rails::SqsActiveJob::Executor does not propagate exceptions. Even when Thread.abort_on_exception=true is configured. There are some details and work arounds discussed in ruby-concurrency/concurrent-ruby#634 which has been open since 2017.

But the short of it is the current implementation provides no way for an exception raised when processing a message to terminate the polling process!

@RanVaknin RanVaknin added the p3 This is a minor priority issue label Jul 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request A feature should be added or improved. p3 This is a minor priority issue
Projects
None yet
Development

No branches or pull requests

4 participants