[S3 storage_plugin] Seeing No credential issue at random intervals when saving / restoring snapshot from S3. #142

Open
hbikki opened this issue May 16, 2023 · 2 comments

hbikki commented May 16, 2023

🐛 Describe the bug

When loading a snapshot from S3, we see a NoCredentialsError at random intervals.
The issue is very similar to aio-libs/aiobotocore#1006.
It did not happen when running <=5 processes (an assumption based on running tests with varying process counts), but the error is consistent when running >5 processes.

 Snapshot.take(path=str(save_dir), app_state=app_state)
  • Experimented with adding retries with exponential backoff when restoring the snapshot.
  • Tried using different versions of aiobotocore.
  • Verified from the logs that the _credential value is present.
  • Verified from the logs that credentials are available:
    /0 [6]:[2023-05-14 00:49:02,211][aiobotocore.credentials][INFO] - Found credentials from IAM Role:
  • The issue doesn't happen when the credentials are set via the ~/.aws/credentials file or environment variables.
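
For readers who want to reproduce the retry experiment from the first bullet, a wrapper along the following lines is one way to do it. This is a minimal sketch, not the code used in the report: `retry_with_backoff` and all of its parameters are hypothetical names, and the delays are arbitrary defaults.

```python
import random
import time


def retry_with_backoff(op, *, attempts=5, base_delay=0.5, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call op(); on failure wait base_delay * 2**attempt (capped) and retry.

    Hypothetical helper for illustration; not part of torchsnapshot.
    """
    for attempt in range(attempts):
        try:
            return op()
        except retry_on:
            if attempt == attempts - 1:
                raise  # attempts exhausted: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter so that many ranks do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The operation to retry is passed as a zero-argument callable, for example a lambda wrapping the restore call. Narrowing `retry_on` to the specific exception type avoids retrying unrelated failures, assuming that exception reaches the caller unchanged.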

NOTE:
I don't see the failure when I update the S3 storage_plugin to use a boto3 S3 client or botocore.session.
Testing time was about 2 hours (~100 checkpoints).
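
The exact plugin change is not shown in this report. As a rough sketch of the idea in the note, a reader that takes a blocking boto3-style client and offloads `get_object` to a worker thread could look like the code below. `ReadRequest` and `BlockingClientS3Reader` are hypothetical names used only for illustration, the key layout (`root/path`) is an assumption, and the client is injected so the sketch runs without AWS access.

```python
import asyncio
import io
from dataclasses import dataclass, field


@dataclass
class ReadRequest:
    """Hypothetical stand-in for the plugin's read request object."""
    path: str
    buf: io.BytesIO = field(default_factory=io.BytesIO)


class BlockingClientS3Reader:
    """Reads objects through a blocking client without stalling the event loop."""

    def __init__(self, client, bucket, root):
        self.client = client  # e.g. a boto3 S3 client (assumption)
        self.bucket = bucket
        self.root = root

    def _get(self, key):
        # Blocking call; the response body is a file-like stream.
        response = self.client.get_object(Bucket=self.bucket, Key=key)
        return response["Body"].read()

    async def read(self, read_io):
        key = f"{self.root}/{read_io.path}"
        loop = asyncio.get_running_loop()
        # Run the blocking download in the default thread pool executor.
        data = await loop.run_in_executor(None, self._get, key)
        read_io.buf = io.BytesIO(data)
```

With boto3 installed, the client would typically be created once per process with `boto3.client("s3")` and passed in. Whether this matches the change that was tested here is not confirmed by the report.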

Logs:

checkpointing_ddp/0 [3]:Traceback (most recent call last):
checkpointing_ddp/0 [3]:  File "/home/User/torchsnapshot/torchsnapshot/scheduler.py", line 369, in read_buffer
checkpointing_ddp/0 [3]:    await self.storage.read(read_io=read_io)
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,589][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-35' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155640>()]>>
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,590][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [3]:  File "/home/User/torchsnapshot/torchsnapshot/storage_plugins/s3.py", line 60, in read
checkpointing_ddp/0 [3]:    response = await client.get_object(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/client.py", line 354, in _make_api_call
checkpointing_ddp/0 [3]:    http, parsed_response = await self._make_request(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/client.py", line 379, in _make_request
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,610][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,589][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [3]:    return await self._endpoint.make_request(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/endpoint.py", line 96, in _send_request
checkpointing_ddp/0 [3]:    request = await self.create_request(request_dict, operation_model)
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/endpoint.py", line 84, in create_request
checkpointing_ddp/0 [0]:task: <Task pending name='Task-36' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155790>()]>>
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,634][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-37' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155550>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-38' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978007c10>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-39' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978007ac0>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-40' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f596ea95fa0>()]>>
checkpointing_ddp/0 [3]:    await self._event_emitter.emit(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/hooks.py", line 66, in _emit
checkpointing_ddp/0 [3]:    response = await resolve_awaitable(handler(**kwargs))
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/_helpers.py", line 15, in resolve_awaitable
checkpointing_ddp/0 [3]:    return await obj
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/signers.py", line 24, in handler
checkpointing_ddp/0 [3]:    return await self.sign(operation_name, request)
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/signers.py", line 82, in sign
checkpointing_ddp/0 [3]:    auth.add_auth(request)
checkpointing_ddp/0 [3]:  File "/opt/conda/envs/User/lib/python3.9/site-packages/botocore/auth.py", line 418, in add_auth
checkpointing_ddp/0 [3]:    raise NoCredentialsError()
checkpointing_ddp/0 [3]:botocore.exceptions.NoCredentialsError: Unable to locate credentials
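
The log above interleaves output from several ranks, so the traceback from rank 3 is split by messages from ranks 0 and 6. A small filter such as the following can pull out a single rank's lines. It is a sketch that assumes every line has the `role/replica [rank]:message` prefix seen above; `lines_for_rank` is a hypothetical helper name.

```python
import re

# Matches prefixes such as "checkpointing_ddp/0 [3]:" at the start of a line.
LINE = re.compile(r"^(?P<role>\S+) \[(?P<rank>\d+)\]:(?P<msg>.*)$")


def lines_for_rank(log_text, rank):
    """Return the message part of every log line emitted by the given rank."""
    out = []
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m and int(m.group("rank")) == rank:
            out.append(m.group("msg"))
    return out
```

Calling `lines_for_rank(log_text, 3)` on the log above yields the rank 3 traceback as one contiguous block.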


Versions

pytorch = 2.0.0+cu117
torchx-nightly>=2023.3.15
torchsnapshot=0.1.0

@hbikki hbikki changed the title [S3 storage_plugin] Seiing No credential issue at random intervals when saving / restoring snapshot from S3. [S3 storage_plugin] Seeing No credential issue at random intervals when saving / restoring snapshot from S3. May 16, 2023
@yifuwang
Contributor

Thanks for reporting @hbikki. You mentioned in the aio-libs issue that "when reading/writing to S3 with process count > 5 for versions 2.4.2". Curious if you had success with other versions?

@hbikki
Author

hbikki commented May 25, 2023

No, it isn't working even with different versions of aiobotocore.
