Stop crawler when we have been hit by a WAF protection #711

Open
benoit74 opened this issue Nov 1, 2024 · 0 comments

benoit74 commented Nov 1, 2024

Some websites sit behind a WAF protection (e.g. Cloudflare) which triggers only after a number of page loads.

In other words, the crawl progresses fine, and then suddenly all requests are blocked and finish with an error code. For instance, with Cloudflare, all requests suddenly end up with 403 errors.

Currently, the crawler just continues (wasting processing time) and finishes with a normal exit code, as if all pages had been crawled.

I would like the crawler to stop much sooner when such behavior happens, and exit with an error code.

This should probably be configurable, but by default, if 100 pages in a row all finish with 4xx or 5xx error codes, we can assume the crawler has been blocked by a WAF. In my case, the crawler had ~4700 entries in the queue, and it looks like they all finished with a 403 error code.
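
To make the proposed default concrete, here is a rough sketch of the check in Python. It is only an illustration of the "N failed pages in a row" rule, not the crawler's actual code: `ConsecutiveFailureGuard`, `CrawlBlockedError` and `max_consecutive_failures` are hypothetical names, and I assume the crawler can report each page's final HTTP status to something like `record()`.

```python
from dataclasses import dataclass


class CrawlBlockedError(RuntimeError):
    """Hypothetical error: the crawl looks blocked (e.g. by a WAF)."""


@dataclass
class ConsecutiveFailureGuard:
    # Hypothetical option: how many 4xx/5xx pages in a row are tolerated.
    max_consecutive_failures: int = 100
    consecutive_failures: int = 0

    def record(self, status_code: int) -> None:
        """Record one page result and raise once the failure streak hits the limit."""
        if 400 <= status_code <= 599:
            self.consecutive_failures += 1
        else:
            # Any page that did not end in 4xx/5xx resets the streak.
            self.consecutive_failures = 0

        if self.consecutive_failures >= self.max_consecutive_failures:
            raise CrawlBlockedError(
                f"{self.consecutive_failures} pages in a row finished with "
                "4xx/5xx status codes; assuming the crawler has been blocked"
            )
```

The caller would catch `CrawlBlockedError`, stop processing the rest of the queue, and exit with a non-zero code. Resetting the counter on any successful page keeps isolated 404s on an otherwise healthy site from triggering the stop.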
