We've run into two issues while trying to recompress and re-index some of our older ARCs.
1): When running warcio recompress IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz we get:
IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz could not be read as a WARC or ARC
Could anyone elaborate on what's going on here, or suggest a possible workaround?
2): For some of the ARCs that are successfully recompressed, we get this error after running the cdxj-indexer:
UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 403: ordinal not in range(128)
We've hand-checked a few of these ARCs, and the offending resource always seems to be a binary image. Any suggestions on how to move forward? I can also post the first error in warcio if that's more appropriate.
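In case it helps with triage, here is a rough sketch of two checks that could be run on the affected files, using only the Python standard library. It is not warcio's or cdxj-indexer's own logic. The helper names `sniff_archive` and `first_non_ascii` are hypothetical, and the sniffing assumes that an ARC file opens with a `filedesc://` version block while a WARC record opens with `WARC/`.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip member


def sniff_archive(path):
    """Guess 'warc', 'arc', or 'unknown' from the first bytes of a file.

    Hypothetical helper: it only inspects the start of the first record,
    so a file that passes can still be damaged further in.
    """
    with open(path, "rb") as fh:
        head = fh.read(2)
    # Read through gzip only if the file really starts with the gzip magic.
    opener = gzip.open if head == GZIP_MAGIC else open
    try:
        with opener(path, "rb") as fh:
            start = fh.read(16)
    except (OSError, EOFError, zlib.error):
        # Corrupt or truncated gzip data ends up here.
        return "unknown"
    if start.startswith(b"WARC/"):
        return "warc"
    if start.startswith(b"filedesc://"):
        return "arc"
    return "unknown"


def first_non_ascii(text):
    """Return (index, character) of the first non-ASCII character, or None."""
    for index, char in enumerate(text):
        if ord(char) > 127:
            return index, char
    return None
```

Running `sniff_archive` on the file from 1) would at least show whether the gzip layer or the first record is what fails to parse. For 2), `first_non_ascii` applied to a suspect header line or URL would locate a character such as `'\xed'`, which cannot be encoded as ASCII.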