For some websites, there is WAF protection (e.g. Cloudflare) which triggers only after a certain number of page loads.
In other words, the crawl progresses fine, and then suddenly all requests are blocked and finish with an error code. With Cloudflare, for instance, all requests suddenly end up with 403 errors.
Currently, the crawler just continues (wasting processing time) and finishes with a normal exit code (all pages have been crawled).
I would like the crawler to stop much sooner when this behavior occurs, and exit with an error code.
This should probably be configurable, but by default, if 100 pages in a row all finish with 4xx or 5xx error codes, we can assume the crawler has been blocked by a WAF. In my case, the crawler had ~4700 pages in the queue, and it looks like they all finished with a 403 error code.
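A minimal sketch of what such a guard could look like (the option name `maxConsecutiveErrorResponses` and the `onResponse` hook are hypothetical, not part of the crawler's current API):

```ts
// Hypothetical guard: abort the crawl once N consecutive requests finish
// with a 4xx/5xx status, on the assumption that a WAF has started blocking us.
// The default threshold of 100 matches the value suggested above.

interface WafGuardOptions {
  maxConsecutiveErrorResponses: number; // e.g. 100 by default
}

class WafBlockDetector {
  private consecutiveErrors = 0;

  constructor(private readonly options: WafGuardOptions) {}

  // Call this for every finished request with its HTTP status code.
  // Returns true when the crawl should be aborted.
  onResponse(statusCode: number): boolean {
    if (statusCode >= 400) {
      this.consecutiveErrors += 1;
    } else {
      this.consecutiveErrors = 0; // any successful response resets the streak
    }
    return this.consecutiveErrors >= this.options.maxConsecutiveErrorResponses;
  }
}

// Usage sketch inside the crawl loop (illustrative only):
// const detector = new WafBlockDetector({ maxConsecutiveErrorResponses: 100 });
// ...
// if (detector.onResponse(response.status)) {
//   console.error("Aborting: likely blocked by a WAF (100 consecutive 4xx/5xx responses)");
//   process.exit(1); // non-zero exit code so scripts/CI can detect the failure
// }
```

The key point is that any successful response resets the counter, so transient errors scattered through the crawl would not trigger the abort.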