Archiving particular site kills networking on host machine #703

Open
vitorio opened this issue Oct 15, 2024 · 0 comments

I feel like this is a 500-mile-email or OpenOffice-won't-print-on-Tuesdays situation, but nevertheless: I have a URL that eventually causes the browsertrix-crawler Docker image to kill networking on the host machine. I'm using the latest image, and this happens every time with this URL on both WSL2 and Ubuntu 20.04.
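
For context, the invocation is nothing unusual; it's along the lines of the standard example from the browsertrix-crawler README, with the URL swapped in (redacted here, and the collection name is just a placeholder, so the exact flags may not match my run verbatim):

    # approximate invocation; URL redacted, collection name is a placeholder
    docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url <REDACTED-URL> --generateWACZ --collection test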

The crawl log eventually reports ~500 "Frame check timed out" messages, then ~7000 "continueResponse failed" messages, followed by ~1000 "Protocol error" exceptions. Once the crawl finally returns, neither subsequent runs of the Docker image nor the host machine itself can resolve any hostnames, whether in a browser, with ping in a terminal window, or anywhere else. It looks like DNS resolution is what gets killed rather than all networking, but I didn't test extensively (WSL2 could be recovered by restarting networking on the host machine; Ubuntu required a full reboot).

Opening the URL directly in Firefox and Chrome, the page never seems to finish loading. It's just this particular URL, a private bookmarking-site capture of a page from 2014. The nearest Wayback Machine capture of the original page doesn't exhibit this issue, and neither does a wget of the capture. Still, it doesn't seem like a crash in browsertrix-crawler should bring down DNS (or more) on the host machine.

I'm not sure of the best way to triage this further: the crawl log is 3.5 MB, and I'd prefer not to share a URL into a private account in a public issue queue if possible.

Here's a link to a redacted copy of the crawl log. I can re-run the capture attempt on a non-critical machine if there are other logs that would be useful; just let me know what they are. If there's a private way to provide the URL, I'm happy to do that as well.

Thanks!
