[WIP] Improve performance of querying system.jdbc.tables for Hive, Iceberg, and Delta #24110

Draft
wants to merge 4 commits into
base: master

Commits on Nov 12, 2024

  1. e914491

Commits on Nov 13, 2024

  1. Parallelize Hive tables retrieval

    The parameter for specifying the maximum number of threads fetching
    tables ("hive.metadata.parallelism") aligns with the naming convention
    used in the BigQuery connector ("bigquery.metadata.parallelism").
    
    Parallelization has been introduced in HiveMetadata rather than in
    specific metastore implementations, primarily to avoid reintroducing a
    cache storing tables for all schemas, which was removed in
    trinodb@cb4d168.
    This approach attempts to parallelize table retrieval for all metastore
    types, even though not all of them support concurrent access. Currently,
    only FileHiveMetastore lacks support for multithreaded access, so
    parallelization has no effect for it.
    
    Question: Should we consider setting the default value of
    "hive.metadata.parallelism" to 1 when using the "file" metastore?
    piotrrzysko committed Nov 13, 2024
    9388bf2
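
The bounded parallel retrieval described in this commit message can be sketched roughly as follows. This is not the code from the PR: the class name, the `listTablesInSchema` callback, and the way `parallelism` is passed in are all hypothetical, standing in for whatever HiveMetadata does with the value of "hive.metadata.parallelism".

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class ParallelTableListing
{
    private ParallelTableListing() {}

    // Hypothetical helper: lists tables of every schema using at most
    // `parallelism` threads (an illustrative stand-in for the configured
    // "hive.metadata.parallelism" value).
    static Map<String, List<String>> listTables(
            List<String> schemas,
            Function<String, List<String>> listTablesInSchema,
            int parallelism)
    {
        ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, parallelism));
        try {
            // Submit one task per schema; the fixed pool bounds concurrency.
            Map<String, CompletableFuture<List<String>>> futures = new TreeMap<>();
            for (String schema : schemas) {
                futures.put(schema, CompletableFuture.supplyAsync(() -> listTablesInSchema.apply(schema), executor));
            }
            // Wait for all tasks; join() rethrows a task failure as CompletionException.
            Map<String, List<String>> result = new TreeMap<>();
            futures.forEach((schema, future) -> result.put(schema, future.join()));
            return result;
        }
        finally {
            executor.shutdownNow();
        }
    }
}
```

With `parallelism` set to 1 the pool has a single worker, so schemas are listed one after another. That is the behavior the question about the "file" metastore default would select.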
  2. Parallelize Delta Lake tables retrieval

    Before introducing DeltaLakeMetadata::getRelationTypes,
    ConnectorMetadata::getRelationTypes was used to retrieve relation types
    for Delta Lake. The original implementation classified all tables as
    RelationType.TABLE, except those with the extended relational type
    TRINO_VIEW, which were classified as RelationType.VIEW. To preserve that
    behavior, this commit adds the resolveRelationType method.
    
    Question: Is this resolution necessary? Could we instead use the
    existing mapping between ExtendedRelationType and RelationType that's
    already encapsulated in RelationType?
    piotrrzysko committed Nov 13, 2024
    48633f5
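
To make the question in this commit message concrete, here is a minimal sketch of the two options. The enums below are simplified stand-ins, not the real Trino types: the set of constants and the `toRelationType` mapping are assumptions made for illustration, and only the TRINO_VIEW rule comes from the commit message.

```java
final class RelationTypeResolution
{
    private RelationTypeResolution() {}

    // Simplified stand-in for the connector-facing relation type.
    enum RelationType { TABLE, VIEW }

    // Simplified stand-in for the metastore-level type. The constants other
    // than TRINO_VIEW and the built-in mapping are assumptions.
    enum ExtendedRelationType
    {
        TABLE(RelationType.TABLE),
        OTHER_VIEW(RelationType.VIEW),
        TRINO_VIEW(RelationType.VIEW);

        private final RelationType relationType;

        ExtendedRelationType(RelationType relationType)
        {
            this.relationType = relationType;
        }

        RelationType toRelationType()
        {
            return relationType;
        }
    }

    // Behavior described in the commit message: only TRINO_VIEW is reported
    // as VIEW, everything else as TABLE.
    static RelationType resolveRelationType(ExtendedRelationType type)
    {
        return type == ExtendedRelationType.TRINO_VIEW ? RelationType.VIEW : RelationType.TABLE;
    }
}
```

Under these assumptions the two options differ only for views that were not created by Trino: `resolveRelationType(OTHER_VIEW)` yields TABLE, while `OTHER_VIEW.toRelationType()` yields VIEW. Whether that difference matters for Delta Lake is exactly what the question asks.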
  3. Parallelize Iceberg tables retrieval

    Parallelization has been implemented at the TrinoCatalog level, rather
    than in IcebergMetadata, because some catalogs (e.g., Nessie) seem to
    support optimized table retrieval across all schemas. Currently,
    parallelization has been added for Glue and Hive catalogs, but it can
    easily be extended to other catalogs as well.
    piotrrzysko committed Nov 13, 2024
    f7f8579
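
The design choice in the Iceberg commit, putting parallelization behind the catalog abstraction so that each catalog picks its own strategy, might look roughly like this. The interface and both implementations are hypothetical and much smaller than the real TrinoCatalog; they only illustrate why the catalog level leaves room for a bulk listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class CatalogLevelListing
{
    private CatalogLevelListing() {}

    // Hypothetical, reduced catalog abstraction.
    interface Catalog
    {
        List<String> listAllTables();
    }

    // A catalog that can only list one schema at a time, so it fans the
    // per-schema calls out over a bounded thread pool.
    static final class PerSchemaCatalog implements Catalog
    {
        private final Map<String, List<String>> tablesBySchema;
        private final int parallelism;

        PerSchemaCatalog(Map<String, List<String>> tablesBySchema, int parallelism)
        {
            this.tablesBySchema = new TreeMap<>(tablesBySchema);
            this.parallelism = Math.max(1, parallelism);
        }

        @Override
        public List<String> listAllTables()
        {
            ExecutorService executor = Executors.newFixedThreadPool(parallelism);
            try {
                List<CompletableFuture<List<String>>> futures = new ArrayList<>();
                for (String schema : tablesBySchema.keySet()) {
                    futures.add(CompletableFuture.supplyAsync(() -> qualify(schema), executor));
                }
                // Results are collected in submission order (sorted by schema).
                List<String> result = new ArrayList<>();
                futures.forEach(future -> result.addAll(future.join()));
                return result;
            }
            finally {
                executor.shutdownNow();
            }
        }

        // Simulates a per-schema listing call against the backing store.
        private List<String> qualify(String schema)
        {
            List<String> names = new ArrayList<>();
            for (String table : tablesBySchema.get(schema)) {
                names.add(schema + "." + table);
            }
            return names;
        }
    }

    // A catalog assumed to return all tables in one call, so it needs no
    // thread pool at all.
    static final class BulkCatalog implements Catalog
    {
        private final List<String> allTables;

        BulkCatalog(List<String> allTables)
        {
            this.allTables = List.copyOf(allTables);
        }

        @Override
        public List<String> listAllTables()
        {
            return allTables;
        }
    }
}
```

Had the fan-out lived in the caller instead, a catalog with a bulk listing would still be queried schema by schema, which is the cost this commit tries to avoid.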