bucketBy

Buckets the output by the given columns. If specified, the output is laid out on the file system in a way similar to Hive's bucketing scheme, but it uses a different bucket hash function and is not compatible with Hive's bucketing.
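The idea behind bucketing can be sketched in plain Python: each row is assigned to one of numBuckets groups by hashing the bucketing column, so rows with equal values always land in the same bucket. This is a conceptual sketch only. The helper names assign_bucket and bucket_rows are hypothetical, and CRC32 is a stand-in hash chosen for illustration; it is not the hash function Spark uses, so the bucket numbers will not match Spark's actual file layout.

```python
import zlib
from collections import defaultdict


def assign_bucket(value: str, num_buckets: int) -> int:
    """Map a column value to a bucket index in [0, num_buckets)."""
    # Stand-in hash (CRC32) for illustration only; NOT Spark's hash function.
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def bucket_rows(rows, key_index: int, num_buckets: int) -> dict:
    """Group rows by the bucket of the value found at key_index."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[assign_bucket(row[key_index], num_buckets)].append(row)
    return dict(buckets)


# Same sample data as the example at the end of this page.
rows = [(100, "Alice"), (120, "Alice"), (140, "Bob")]
layout = bucket_rows(rows, key_index=1, num_buckets=2)

for bucket_id in sorted(layout):
    print(bucket_id, layout[bucket_id])
```

Both "Alice" rows end up in the same bucket because the bucket index depends only on the value of the bucketing column.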

Syntax

bucketBy(numBuckets, col, *cols)

Parameters

Parameter    Type                 Description
numBuckets   int                  The number of buckets to save.
col          str, list, or tuple  A column name, or a list or tuple of column names.
*cols        str, optional        Additional column names. Must be empty if col is a list or tuple.
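The rule for col and *cols can be illustrated with a small stand-alone helper. This is a hypothetical sketch of the documented behavior, not PySpark's implementation; the name normalize_bucket_columns and the exact error messages are assumptions made for the example.

```python
def normalize_bucket_columns(col, *cols):
    """Return the bucketing column names as a flat list of strings."""
    if isinstance(col, (list, tuple)):
        # A list or tuple already carries every name, so extra names are rejected.
        if cols:
            raise ValueError("cols must be empty when col is a list or tuple")
        names = list(col)
    else:
        names = [col, *cols]
    if not names or not all(isinstance(name, str) for name in names):
        raise TypeError("column names must be non-empty and all strings")
    return names


# Both calling styles describe the same set of bucketing columns.
print(normalize_bucket_columns("name", "age"))
print(normalize_bucket_columns(["name", "age"]))
```

With a real DataFrame, this means bucketBy(2, "name", "age") and bucketBy(2, ["name", "age"]) request the same bucketing columns.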

Returns

DataFrameWriter

Notes

Applies to file-based data sources when used in combination with DataFrameWriter.saveAsTable.

Examples

Write a DataFrame into a bucketed table, and read it back.

spark.sql("DROP TABLE IF EXISTS bucketed_table")
spark.createDataFrame([
    (100, "Alice"), (120, "Alice"), (140, "Bob")],
    schema=["age", "name"]
).write.bucketBy(2, "name").mode("overwrite").saveAsTable("bucketed_table")

spark.read.table("bucketed_table").sort("age").show()
# +---+-----+
# |age| name|
# +---+-----+
# |100|Alice|
# |120|Alice|
# |140|  Bob|
# +---+-----+

spark.sql("DROP TABLE bucketed_table")