Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect partitionValues_parsed with id & name column mapping in Delta Lake #24129

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ebyhr
Copy link
Member

@ebyhr ebyhr commented Nov 14, 2024

Description

We should write physical column names in partitionValues_parsed field on checkpoint files.

Fixes #24121

Release notes

## Delta Lake
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Nov 14, 2024
@github-actions github-actions bot added the delta-lake Delta Lake connector label Nov 14, 2024
@@ -161,7 +161,7 @@ public void write(CheckpointEntries entries, TrinoOutputFile outputFile)
}
List<DeltaLakeColumnHandle> partitionColumns = extractPartitionColumns(entries.metadataEntry(), entries.protocolEntry(), typeManager);
List<RowType.Field> partitionValuesParsedFieldTypes = partitionColumns.stream()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a corresponding test in io.trino.plugin.deltalake.transactionlog.checkpoint.TestCheckpointWriter

Copy link
Member Author

@ebyhr ebyhr Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you share scenarios you want to cover in the class? I intentionally avoided that. Both TestCheckpointWriter & TestCheckpointEntryIterator are not suitable to verify partitionValues_parsed field because AddFileEntry doesn't hold the value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about a test similar to io.trino.plugin.deltalake.transactionlog.checkpoint.TestCheckpointEntryIterator#testReadAddEntriesPartitionPruning with corresponding resource files

@@ -183,7 +183,7 @@ public RowType getAddEntryType(
List<DeltaLakeColumnHandle> partitionColumns = extractPartitionColumns(metadataEntry, protocolEntry, typeManager);
if (!partitionColumns.isEmpty()) {
List<RowType.Field> partitionValuesParsed = partitionColumns.stream()
.map(column -> RowType.field(column.columnName(), typeManager.getType(getTypeSignature(DeltaHiveTypeTranslator.toHiveType(column.type())))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add test in io.trino.plugin.deltalake.transactionlog.checkpoint.TestCheckpointEntryIterator

@ebyhr ebyhr force-pushed the ebi/delta-writer-partition-stats branch from ee3627b to a60bbd3 Compare November 14, 2024 07:26
@@ -161,7 +161,7 @@ public void write(CheckpointEntries entries, TrinoOutputFile outputFile)
}
List<DeltaLakeColumnHandle> partitionColumns = extractPartitionColumns(entries.metadataEntry(), entries.protocolEntry(), typeManager);
List<RowType.Field> partitionValuesParsedFieldTypes = partitionColumns.stream()
.map(column -> RowType.field(column.columnName(), column.type()))
.map(column -> RowType.field(column.basePhysicalColumnName(), column.type()))
Copy link
Contributor

@findinpath findinpath Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now

"PartitionValues": {
        "col-6d32b73c-d46b-47f3-aeee-b4ce2231c81f": "30"
      },
      "PartitionValues_parsed": {
        "Col456d32b73c45d46b4547f345aeee45b4ce2231c81f": 30
      }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the missing dashes.

findinpath

This comment was marked as outdated.

@findinpath findinpath self-requested a review November 14, 2024 08:43
try (TestTable table = new TestTable(
getQueryRunner()::execute,
"test_checkpoint",
"(x int, part int) WITH (checkpoint_interval = 3, column_mapping_mode = '" + columnMappingMode + "', partitioned_by = ARRAY['part'])")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add also test for other types (like Date) which has different representation in PartitionValues and PartitionValues_parsed

testPartitionValuesParsedCheckpoint(ColumnMappingMode.NONE);
}

private void testPartitionValuesParsedCheckpoint(ColumnMappingMode columnMappingMode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have also product test in TestDeltaLakeColumnMappingMode to check reading/writing checkpoints by trino/delta

@ebyhr ebyhr marked this pull request as draft November 14, 2024 13:21
@ebyhr
Copy link
Member Author

ebyhr commented Nov 14, 2024

(Changed to draft for avoiding accidental merge)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed delta-lake Delta Lake connector
4 participants