When loading a snapshot from S3, we intermittently hit botocore's NoCredentialsError ("Unable to locate credentials"). The failure occurs at random intervals.
The issue looks very similar to aio-libs/aiobotocore#1006.
It did not happen when running <=5 processes (an assumption based on tests with varying process counts), but the error is consistent when running >5 processes.
Experimented with adding retries with exponential backoff when restoring the snapshot.
Tried different versions of aiobotocore.
Verified from the logs that the _credential value is present and that credentials are available:
/0 [6]:[2023-05-14 00:49:02,211][aiobotocore.credentials][INFO] - Found credentials from IAM Role:
The issue doesn't happen when the credentials are set via ~/.aws/credentials file or environment variables.
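The report does not include the retry code that was tried. A minimal sketch of what such a wrapper might look like is below; `retry_with_backoff`, its parameters, and the delay values are hypothetical and are not part of torchsnapshot or aiobotocore.

```python
import asyncio
import random


async def retry_with_backoff(op, retry_on, max_attempts=5,
                             base_delay=0.5, max_delay=8.0):
    """Await op() until it succeeds, sleeping longer after each failure.

    op        -- zero-argument callable returning an awaitable (hypothetical)
    retry_on  -- exception class or tuple of classes that should trigger a retry
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await op()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # The delay doubles on every attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter keeps many processes from retrying in lockstep.
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
```

In the setting described here, `op` would wrap the snapshot read and `retry_on` would be `botocore.exceptions.NoCredentialsError`. A wrapper like this only masks the symptom; it does not explain why credentials found from the IAM role are intermittently unavailable at signing time.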
NOTE:
I don't see the failure when I update the S3 storage_plugin to use the boto3 S3 client or botocore.session.
That test ran for about 2 hours (~100 checkpoints).
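The modified plugin is not shown in the report. As a rough sketch of the idea, assuming the plugin's async read is kept and only the client is swapped, a blocking boto3-style client could be called from a worker thread so the event loop is not stalled. `read_object` and its parameters are hypothetical names for illustration.

```python
import asyncio


async def read_object(client, bucket, key):
    """Fetch an object's bytes using a blocking, boto3-style S3 client.

    client -- any object exposing get_object(Bucket=..., Key=...) whose
              response carries a readable "Body" (as boto3's S3 client does)
    """
    loop = asyncio.get_running_loop()

    def _fetch():
        # Both calls block, so they run in the default thread pool
        # instead of on the event loop thread.
        response = client.get_object(Bucket=bucket, Key=key)
        return response["Body"].read()

    return await loop.run_in_executor(None, _fetch)
```

With boto3 installed, `client` would be created once per process via `boto3.client("s3")` and reused across reads. This trades aiobotocore's native async I/O for thread-pool concurrency, so throughput may differ from the original plugin.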
Logs:
checkpointing_ddp/0 [3]:Traceback (most recent call last):
checkpointing_ddp/0 [3]: File "/home/User/torchsnapshot/torchsnapshot/scheduler.py", line 369, in read_buffer
checkpointing_ddp/0 [3]: await self.storage.read(read_io=read_io)
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,589][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-35' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155640>()]>>
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,590][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [3]: File "/home/User/torchsnapshot/torchsnapshot/storage_plugins/s3.py", line 60, in read
checkpointing_ddp/0 [3]: response = await client.get_object(
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/client.py", line 354, in _make_api_call
checkpointing_ddp/0 [3]: http, parsed_response = await self._make_request(
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/client.py", line 379, in _make_request
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,610][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,589][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [3]: return await self._endpoint.make_request(
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/endpoint.py", line 96, in _send_request
checkpointing_ddp/0 [3]: request = await self.create_request(request_dict, operation_model)
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/endpoint.py", line 84, in create_request
checkpointing_ddp/0 [0]:task: <Task pending name='Task-36' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155790>()]>>
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,634][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-37' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155550>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-38' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978007c10>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-39' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978007ac0>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-40' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f596ea95fa0>()]>>
checkpointing_ddp/0 [3]: await self._event_emitter.emit(
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/hooks.py", line 66, in _emit
checkpointing_ddp/0 [3]: response = await resolve_awaitable(handler(**kwargs))
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/_helpers.py", line 15, in resolve_awaitable
checkpointing_ddp/0 [3]: return await obj
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/signers.py", line 24, in handler
checkpointing_ddp/0 [3]: return await self.sign(operation_name, request)
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/signers.py", line 82, in sign
checkpointing_ddp/0 [3]: auth.add_auth(request)
checkpointing_ddp/0 [3]: File "/opt/conda/envs/User/lib/python3.9/site-packages/botocore/auth.py", line 418, in add_auth
checkpointing_ddp/0 [3]: raise NoCredentialsError()
checkpointing_ddp/0 [3]:botocore.exceptions.NoCredentialsError: Unable to locate credentials
hbikki changed the title from "[S3 storage_plugin] Seiing No credential issue at random intervals when saving / restoring snapshot from S3." to "[S3 storage_plugin] Seeing No credential issue at random intervals when saving / restoring snapshot from S3." on May 16, 2023
Thanks for reporting @hbikki. You mentioned in the aio-libs issue that "when reading/writing to S3 with process count > 5 for versions 2.4.2". Curious if you had success with other versions?
Versions
pytorch = 2.0.0+cu117
torchx-nightly>=2023.3.15
torchsnapshot=0.1.0