aws/aws-sdk-pandas

`athena.to_iceberg` fails when `mode=overwrite_partitions` and `partition_cols` contains something like `hour(timestamp_col)`.

Open

#2845 opened on Jun 4, 2024

Labels: backlog, bug, help wanted

Description

Describe the bug

When using `athena.to_iceberg` with `mode="overwrite_partitions"` to update a table that is partitioned by a time transform of a timestamp column (such as `year`, `month`, or `hour`), the call fails. In this mode the implementation assumes that every value in `partition_cols` is the name of a dataframe / table column, so it does not find something like `hour(column)` among the dataframe columns.

I think the problem is this line, which calls `delete_from_iceberg_table`, a function that expects plain column names.
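The mismatch described above can be illustrated without touching AWS. The following is only a minimal sketch of the suspected mechanism, not the library's actual code: `source_column` and `missing_partition_columns` are hypothetical helpers written for this illustration, and the regular expression assumes the Athena-style transform syntax (`hour(col)`, `bucket(16, col)`).

```python
import re

# Athena-style Iceberg partition transforms, e.g. hour(ts) or bucket(16, id).
# This pattern is an assumption made for the sketch, not taken from awswrangler.
_TRANSFORM_RE = re.compile(
    r"^\s*(?P<name>[A-Za-z_]+)\s*\(\s*"
    r"(?:(?P<arg>\d+)\s*,\s*)?"
    r"(?P<column>[A-Za-z_][A-Za-z0-9_]*)\s*\)\s*$"
)


def source_column(partition_expr: str) -> str:
    """Return the column a partition expression refers to (hypothetical helper).

    Plain column names are returned unchanged.
    """
    match = _TRANSFORM_RE.match(partition_expr)
    return match.group("column") if match else partition_expr.strip()


def missing_partition_columns(partition_cols, df_columns, resolve=False):
    """List partition entries that cannot be found among the dataframe columns.

    With resolve=False the entries are compared literally, which mirrors the
    behaviour suspected in this issue. With resolve=True each transform is
    first reduced to its underlying column.
    """
    names = [source_column(c) if resolve else c for c in partition_cols]
    return [name for name in names if name not in df_columns]


df_columns = ["id", "timestamp_col", "value"]
partition_cols = ["hour(timestamp_col)"]

# Literal comparison: the transform expression is not a dataframe column.
print(missing_partition_columns(partition_cols, df_columns))

# Resolving the transform first: nothing is reported as missing.
print(missing_partition_columns(partition_cols, df_columns, resolve=True))
```

Whether resolving the transform to its source column is enough for a real fix is an open question, since the delete step would presumably also need to compare transformed values rather than raw column values.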

How to Reproduce

Expected behavior

I expect the `partition_cols` option to accept anything that can be used to partition the table. In particular, it should accept anything that is already accepted when `mode` is `append` or `overwrite` instead of `overwrite_partitions`.

Your project

No response

Screenshots

No response

OS

Ubuntu 22.04

Python version

3.10

AWS SDK for pandas version

3.7.3

Additional context

No response