Merk
Tilgang til denne siden krever autorisasjon. Du kan prøve å logge på eller endre kataloger.
Tilgang til denne siden krever autorisasjon. Du kan prøve å endre kataloger.
Computes aggregates and returns the result as a DataFrame.
The available aggregate functions can be:
- Built-in aggregation functions, such as
avg,max,min,sum,count. - Group aggregate pandas UDFs, created with
pyspark.sql.functions.pandas_udf.
Syntax
agg(*exprs)
Parameters
| Parameter | Type | Description |
|---|---|---|
exprs |
dict or Column | A dict mapping from column name (string) to aggregate functions (string), or a list of aggregate Column expressions. |
Returns
DataFrame
Notes
Built-in aggregation functions and group aggregate pandas UDFs cannot be mixed in a single call to this function.
When exprs is a single dict, the key is the column to perform aggregation on and the value is the aggregate function. When exprs is a list of Column expressions, each expression specifies an aggregation to compute.
Examples
import pandas as pd
from pyspark.sql import functions as sf
df = spark.createDataFrame(
[(2, "Alice"), (3, "Alice"), (5, "Bob"), (10, "Bob")], ["age", "name"])
# Group-by name, and count each group.
df.groupBy(df.name).agg({"*": "count"}).sort("name").show()
# +-----+--------+
# | name|count(1)|
# +-----+--------+
# |Alice| 2|
# | Bob| 2|
# +-----+--------+
# Group-by name, and calculate the minimum age.
df.groupBy(df.name).agg(sf.min(df.age)).sort("name").show()
# +-----+--------+
# | name|min(age)|
# +-----+--------+
# |Alice| 2|
# | Bob| 5|
# +-----+--------+
# Same as above but uses a pandas UDF.
from pyspark.sql.functions import pandas_udf
@pandas_udf('int')
def min_udf(v: pd.Series) -> int:
return v.min()
df.groupBy(df.name).agg(min_udf(df.age)).sort("name").show()
# +-----+------------+
# | name|min_udf(age)|
# +-----+------------+
# |Alice| 2|
# | Bob| 5|
# +-----+------------+