Del via


summary

Computes specified statistics for numeric and string columns. Available statistics are: count, mean, stddev, min, max, arbitrary approximate percentiles specified as a percentage (e.g., 75%).

Syntax

summary(*statistics: str)

Parameters

Parameter Type Description
statistics str, optional Column names to calculate statistics by (default All columns).

Returns

DataFrame: A new DataFrame that provides statistics for the given DataFrame.

Notes

This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

Examples

df = spark.createDataFrame(
    [("Bob", 13, 40.3, 150.5), ("Alice", 12, 37.8, 142.3), ("Tom", 11, 44.1, 142.2)],
    ["name", "age", "weight", "height"],
)
df.select("age", "weight", "height").summary().show()
# +-------+----+------------------+-----------------+
# |summary| age|            weight|           height|
# +-------+----+------------------+-----------------+
# |  count|   3|                 3|                3|
# |   mean|12.0| 40.73333333333333|            145.0|
# | stddev| 1.0|3.1722757341273704|4.763402145525822|
# |    min|  11|              37.8|            142.2|
# |    25%|  11|              37.8|            142.2|
# |    50%|  12|              40.3|            142.3|
# |    75%|  13|              44.1|            150.5|
# |    max|  13|              44.1|            150.5|
# +-------+----+------------------+-----------------+

df.select("age", "weight", "height").summary("count", "min", "25%", "75%", "max").show()
# +-------+---+------+------+
# |summary|age|weight|height|
# +-------+---+------+------+
# |  count|  3|     3|     3|
# |    min| 11|  37.8| 142.2|
# |    25%| 11|  37.8| 142.2|
# |    75%| 13|  44.1| 150.5|
# |    max| 13|  44.1| 150.5|
# +-------+---+------+------+