`athena.to_parquet` fails when `mode=overwrite_partitions` and `partition_cols` contains something like `hour(timestamp_col)`. · aws/aws-sdk-pandas#2845

Repository metrics

Stars: 3,560
PR merge metrics: average merge time 9d 12h; 24 merged PRs in 30d

Description

Describe the bug

When using s3.to_parquet to update a parquet file that is partitioned by a time interval or a timestamp "attribute" (such as year, month, hour, etc.), the function fails. With mode=overwrite_partitions, the implementation assumes that every value in partition_cols is the name of a parquet / table column, so it does not find something like hour(column) among the dataframe columns.

I think the problem is this line, which uses the function delete_from_iceberg_table, which expects column names.
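The mismatch described above can be illustrated without AWS access. This is a minimal sketch, not the library's code: `missing_partition_cols` and the sample column names are hypothetical, and the membership check only stands in for whatever lookup the real implementation performs.

```python
# Hypothetical stand-ins for a dataframe's columns and the user's partition spec.
df_columns = ["id", "timestamp_col", "value"]
partition_cols = ["hour(timestamp_col)"]


def missing_partition_cols(partition_cols, df_columns):
    """Return the partition entries that are not literal column names."""
    return [col for col in partition_cols if col not in df_columns]


missing = missing_partition_cols(partition_cols, df_columns)
print(missing)  # ['hour(timestamp_col)']
```

The transform expression is treated as a column name, so the lookup comes back empty-handed even though `timestamp_col` itself is present.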

How to Reproduce

Expected behavior

I expect the partition_cols option to accept anything that can be used to partition a parquet. In particular, anything that is accepted when the argument mode is append or overwrite instead of overwrite_partitions.

Your project

No response

Screenshots

No response

OS

Ubuntu 22.04

Python version

3.10

AWS SDK for pandas version

3.7.3

Additional context

No response

Contributor guide

Research direction: Inspect the function `delete_from_iceberg_table` in `awswrangler/athena/_write_iceberg.py` around line 452 to understand how it processes `partition_cols`. Determine if the issue can be fixed by parsing expressions like `hour(column)` to extract the column name for deletion, or by modifying the logic to handle transformed partition columns.
Tech stack: python
Domain: backend
Issue type: Bug
Difficulty: 2
Estimated time: 1-3 hours
Activity status: Active
Clarity: Mostly clear
Prerequisites: Python
Newbie friendliness: 70
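
One way to explore the first option in the research direction (parsing expressions like `hour(column)` to get the underlying column name) is sketched below. Everything here is an assumption for illustration: `PartitionSpec`, `parse_partition_col`, and `source_columns` are hypothetical helpers that do not exist in awswrangler, and the accepted syntax (`func(col)` with an optional leading integer argument, as in `bucket(16, id)`) is assumed rather than taken from the library.

```python
import re
from typing import List, NamedTuple, Optional


class PartitionSpec(NamedTuple):
    """Hypothetical container: source column plus optional transform name."""

    column: str
    transform: Optional[str]


# Matches `func(col)` or `func(N, col)`; anything else is treated as a plain name.
_TRANSFORM_RE = re.compile(r"^\s*(\w+)\s*\(\s*(?:\d+\s*,\s*)?(\w+)\s*\)\s*$")


def parse_partition_col(expr: str) -> PartitionSpec:
    """Split a partition expression into its column and transform."""
    match = _TRANSFORM_RE.match(expr)
    if match is None:
        return PartitionSpec(column=expr.strip(), transform=None)
    return PartitionSpec(column=match.group(2), transform=match.group(1).lower())


def source_columns(partition_cols: List[str]) -> List[str]:
    """Return the distinct underlying column names, preserving order."""
    seen: List[str] = []
    for expr in partition_cols:
        column = parse_partition_col(expr).column
        if column not in seen:
            seen.append(column)
    return seen


print(parse_partition_col("hour(timestamp_col)"))
print(source_columns(["hour(ts)", "region", "day(ts)"]))
```

Extracting the column name only addresses the lookup failure. Deleting the right rows would presumably still need to compare transformed values (for example, rows in the same hour) rather than raw column values, which is the second option the research direction mentions.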
