Support Multipart Range Requests in S3Transfer's download_file #3466

forrestfwilliams · 2022-10-21T21:40:50Z

Describe the feature

Boto3 supports ranged GET requests and multipart downloads, but it is not possible to perform a multipart download over a specific byte range. This results in slow download times when you are trying to download, for example, a 1 GB range of data from a 4 GB file in S3. It would be great if a range argument were added to TransferConfig, which could then be passed to a download_file call. This would download the specified range of data, but would use multipart downloading if the range size exceeds the multipart_threshold.
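As a rough illustration of the requested behaviour, the sketch below plans the HTTP Range headers for a ranged download: a single request when the range is smaller than the threshold, and several part requests otherwise. The function names (split_range, range_header, plan_download) are hypothetical and are not part of boto3 or s3transfer; only the multipart_threshold and multipart_chunksize names mirror existing TransferConfig settings, and the "smaller than the threshold" cutoff is an assumption about how such a feature might behave.

```python
# Hypothetical helpers: none of these functions exist in boto3 or s3transfer.

def split_range(start, end, part_size):
    """Split the inclusive byte range [start, end] into parts of at most part_size bytes."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    parts = []
    offset = start
    while offset <= end:
        last = min(offset + part_size - 1, end)
        parts.append((offset, last))
        offset = last + 1
    return parts


def range_header(first, last):
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={first}-{last}"


def plan_download(start, end, multipart_threshold, multipart_chunksize):
    """Return the Range header values needed to fetch [start, end].

    Assumption: ranges smaller than multipart_threshold use one request,
    larger ranges are split into multipart_chunksize pieces.
    """
    size = end - start + 1
    if size < multipart_threshold:
        return [range_header(start, end)]
    return [range_header(a, b) for a, b in split_range(start, end, multipart_chunksize)]
```

For example, plan_download(0, 9, 8, 4) yields three part requests covering bytes 0-3, 4-7, and 8-9, which could then be fetched concurrently.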

Use Case

I work at the Alaska Satellite Facility, where we distribute large amounts of remote sensing data to users across the globe via AWS. Many of these datasets come in legacy formats, such as zip files, that are not cloud-friendly. Due to the highly structured nature of these datasets, we can identify byte ranges that contain subsets of data that our users would be interested in downloading directly. However, since these subsets are still large (~1 GB within a larger 4 GB zip file), and multipart downloads are not supported for range requests, we cannot offer extraction of these subsets with low latency.

Proposed Solution

I have developed a workaround that uses aiobotocore to issue concurrent GET requests for the desired range of data. This can be found within this benchmarking script. This is still much slower than the native multipart read.
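The benchmarking script itself is not reproduced here. As a minimal sketch of what an asyncio-based workaround along these lines could look like (not the author's actual code), the function below issues concurrent ranged get_object calls and joins the parts in order. It assumes an aiobotocore-style async S3 client is passed in; the function name download_range_async and its part_size and max_concurrency parameters are hypothetical.

```python
import asyncio


async def download_range_async(client, bucket, key, start, end,
                               part_size=8 * 1024 * 1024, max_concurrency=10):
    """Fetch the inclusive byte range [start, end] using concurrent ranged GETs.

    `client` is assumed to behave like an aiobotocore S3 client: an awaitable
    get_object() whose "Body" is an async context manager with an async read().
    """
    # Limit how many requests are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch(first, last):
        async with semaphore:
            response = await client.get_object(
                Bucket=bucket, Key=key, Range=f"bytes={first}-{last}"
            )
            async with response["Body"] as stream:
                return await stream.read()

    # One coroutine per inclusive sub-range of at most part_size bytes.
    tasks = [
        fetch(offset, min(offset + part_size - 1, end))
        for offset in range(start, end + 1, part_size)
    ]
    # gather returns results in the order the awaitables were passed,
    # so the parts can be concatenated directly.
    return b"".join(await asyncio.gather(*tasks))
```

With aiobotocore, a client of this kind would typically be obtained with `async with get_session().create_client("s3") as client:`. Note that this sketch holds the whole range in memory, which may not suit very large ranges.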

Other Information

I have also started a discussion concerning this issue on Stack Overflow, but no one has found a good solution.

Acknowledgements

I may be able to implement this feature request
This feature might incur a breaking change

SDK version used

1.24.59

Environment details (OS name and version, etc.)

r5d.xlarge EC2 instance running the latest Amazon Linux (same region as S3 bucket)

tim-finnigan · 2022-10-25T17:46:51Z

Hi @forrestfwilliams, thanks for reaching out. It looks like this may be a duplicate of #1215. (Also, the s3transfer repo may be the best place to track these requests.) I brought this up for discussion with the team and they weren't sure about supporting multipart download over a specific range. It seems like there was some debate on that Stack Overflow post as well, although there may be some workarounds. Have you tried any workarounds, and if so, what has worked for you?

forrestfwilliams · 2022-10-27T13:36:38Z

Hi @tim-finnigan, thanks for your reply. Yes, this does look like the same issue as #1215. So far I have tried solutions using both Python's asyncio and a ThreadPoolExecutor. When accessing a 1.3 GB region of data in an open bucket from an in-region r5d.xlarge EC2 instance, the asyncio approach downloads the data in 6.28 seconds, and the ThreadPoolExecutor approach downloads it in 4.76 seconds. For comparison, using the boto3-native multipart download functionality to download the same amount of data under the same conditions takes 3.96 seconds (i.e. the ThreadPoolExecutor solution takes 1.2x the time of the native solution). These differences are further exacerbated under less-ideal download conditions.
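For readers unfamiliar with the ThreadPoolExecutor approach mentioned above, a minimal sketch might look like the following. This is an illustration under stated assumptions, not the code that produced the timings: download_range, part_size, and max_workers are hypothetical names, and the client argument is assumed to be a boto3 S3 client (or anything with a compatible get_object method).

```python
from concurrent.futures import ThreadPoolExecutor


def download_range(client, bucket, key, start, end,
                   part_size=8 * 1024 * 1024, max_workers=10):
    """Fetch the inclusive byte range [start, end] with concurrent ranged GETs.

    `client` is assumed to expose get_object(Bucket=..., Key=..., Range=...)
    and return a dict whose "Body" has a read() method, as a boto3 S3 client does.
    """
    # Split the requested range into inclusive (first, last) sub-ranges.
    parts = [
        (offset, min(offset + part_size - 1, end))
        for offset in range(start, end + 1, part_size)
    ]

    def fetch(part):
        first, last = part
        response = client.get_object(
            Bucket=bucket, Key=key, Range=f"bytes={first}-{last}"
        )
        return response["Body"].read()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the parts
        # concatenate into the original byte sequence.
        return b"".join(pool.map(fetch, parts))
```

A call such as download_range(boto3.client("s3"), bucket, key, start, end) would return the bytes of the range. The whole range is buffered in memory here; writing each part to its offset in a file would be one way to avoid that.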

Overall, this is a non-trivial difference in performance for our use case, and it would be great to work towards adding this functionality. I'm also happy to move this discussion to the s3transfer repo if that is more appropriate. Is there an open issue there along these lines?

tim-finnigan · 2022-11-16T21:55:40Z

Hi @forrestfwilliams thanks for your patience. I'll go ahead and close this issue so we can continue tracking #1215 in the boto3 repo and boto/s3transfer#248 which you opened in the s3transfer repo. I plan to bring this feature request up with the team soon for further review and feedback.

forrestfwilliams · 2022-11-18T13:32:43Z

@tim-finnigan thank you. This feature will be a major improvement for my organization, as well as for anyone trying to access subsets of data files in AWS.

forrestfwilliams added feature-request This issue requests a feature. needs-triage This issue or PR still needs to be triaged. labels Oct 21, 2022

forrestfwilliams changed the title ~~Support Range Requests in S3Transfer's download_file~~ Support Multipart Range Requests in S3Transfer's download_file Oct 21, 2022

tim-finnigan added response-requested Waiting on additional information or feedback. and removed needs-triage This issue or PR still needs to be triaged. labels Oct 25, 2022

github-actions bot removed the response-requested Waiting on additional information or feedback. label Oct 27, 2022

forrestfwilliams mentioned this issue Nov 2, 2022

Support multipart downloads when downloading large ranges via TransferManager.download() boto/s3transfer#248

Open

aBurmeseDev added the p3 This is a minor priority issue label Nov 8, 2022

tim-finnigan self-assigned this Nov 16, 2022

tim-finnigan closed this as completed Nov 16, 2022

tim-finnigan added the duplicate This issue is a duplicate. label Nov 16, 2022
