A distributed collection of data grouped into named columns.
A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession.
Important
A DataFrame should not be created directly through its constructor; use the creation functions on SparkSession instead.
Supports Spark Connect
Properties
| Property | Description |
|---|---|
| sparkSession | Returns the SparkSession that created this DataFrame. |
| rdd | Returns the content as an RDD of Row (Classic mode only). |
| na | Returns a DataFrameNaFunctions for handling missing values. |
| stat | Returns a DataFrameStatFunctions for statistics functions. |
| write | Interface for saving the content of the non-streaming DataFrame to external storage. |
| writeStream | Interface for saving the content of the streaming DataFrame to external storage. |
| schema | Returns the schema of this DataFrame as a StructType. |
| dtypes | Returns all column names and their data types as a list. |
| columns | Returns the names of all columns in the DataFrame as a list. |
| storageLevel | Returns the DataFrame's current storage level. |
| isStreaming | Returns True if this DataFrame contains one or more sources that continuously return data as it arrives. |
| executionInfo | Returns an ExecutionInfo object after the query has been executed. |
| plot | Returns a PySparkPlotAccessor for plotting functions. |
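For orientation, a minimal sketch (assuming an active SparkSession named `spark`) of a few of these properties in use:

```python
# Assumption: `spark` is an active SparkSession.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

print(df.columns)      # ['id', 'name']
print(df.dtypes)       # [('id', 'bigint'), ('name', 'string')]
print(df.schema)       # StructType with one StructField per column
print(df.isStreaming)  # False for a batch DataFrame
```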
Methods
Data viewing and inspection
| Method | Description |
|---|---|
| toJSON(use_unicode) | Converts the DataFrame into an RDD of strings, with each row rendered as a JSON document. |
| printSchema(level) | Prints out the schema in tree format. |
| explain(extended, mode) | Prints the (logical and physical) plans to the console for debugging purposes. |
| show(n, truncate, vertical) | Prints the first n rows of the DataFrame to the console. |
| collect() | Returns all the records in the DataFrame as a list of Row. |
| toLocalIterator(prefetchPartitions) | Returns an iterator that contains all of the rows in this DataFrame. |
| take(num) | Returns the first num rows as a list of Row. |
| tail(num) | Returns the last num rows as a list of Row. |
| head(n) | Returns the first n rows. |
| first() | Returns the first row as a Row. |
| count() | Returns the number of rows in this DataFrame. |
| isEmpty() | Checks whether the DataFrame is empty and returns a boolean. |
| describe(*cols) | Computes basic statistics for numeric and string columns. |
| summary(*statistics) | Computes the specified statistics for numeric and string columns. |
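A short sketch of the most common inspection calls, using the `people` DataFrame built in the Examples section below:

```python
# Assumption: `people` is the DataFrame from the Examples section.
people.printSchema()                     # schema as a tree
people.show(2, truncate=False)           # first 2 rows, untruncated
print(people.count())                    # number of rows
print(people.take(1))                    # [Row(...)] on the driver
people.describe("age", "salary").show()  # count/mean/stddev/min/max
```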
Temporary views
| Method | Description |
|---|---|
| createTempView(name) | Creates a local temporary view with this DataFrame. |
| createOrReplaceTempView(name) | Creates or replaces a local temporary view with this DataFrame. |
| createGlobalTempView(name) | Creates a global temporary view with this DataFrame. |
| createOrReplaceGlobalTempView(name) | Creates or replaces a global temporary view using the given name. |
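A minimal sketch of registering a view and querying it with SQL, assuming an active SparkSession named `spark` and the `people` DataFrame from the Examples section:

```python
# Local temp view: visible only within this SparkSession.
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

# Global temp view: stored in the reserved `global_temp` database and
# visible across sessions of the same application.
people.createOrReplaceGlobalTempView("people_g")
spark.sql("SELECT COUNT(*) FROM global_temp.people_g").show()
```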
Selection and projection
| Method | Description |
|---|---|
| select(*cols) | Projects a set of expressions and returns a new DataFrame. |
| selectExpr(*expr) | Projects a set of SQL expressions and returns a new DataFrame. |
| filter(condition) | Filters rows using the given condition. |
| where(condition) | Alias for filter. |
| drop(*cols) | Returns a new DataFrame without the specified columns. |
| toDF(*cols) | Returns a new DataFrame with the specified new column names. |
| withColumn(colName, col) | Returns a new DataFrame by adding a column or replacing an existing column of the same name. |
| withColumns(*colsMap) | Returns a new DataFrame by adding multiple columns or replacing existing columns of the same names. |
| withColumnRenamed(existing, new) | Returns a new DataFrame by renaming an existing column. |
| withColumnsRenamed(colsMap) | Returns a new DataFrame by renaming multiple columns. |
| withMetadata(columnName, metadata) | Returns a new DataFrame by updating an existing column with metadata. |
| metadataColumn(colName) | Selects a metadata column by its logical column name and returns it as a Column. |
| colRegex(colName) | Selects columns based on a column-name regex and returns them as a Column. |
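As a sketch of how these compose (again assuming the `people` DataFrame from the Examples section):

```python
from pyspark.sql import functions as F

(people
    .select("name", "age")                          # project two columns
    .withColumn("age_next_year", F.col("age") + 1)  # derive a new column
    .withColumnRenamed("name", "full_name")         # rename in the result
    .drop("age")                                    # remove a column
    .show())

# selectExpr accepts SQL expression strings directly.
people.selectExpr("name", "salary * 12 AS annual_salary").show()
```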
Sorting and ordering
| Method | Description |
|---|---|
| sort(*cols, **kwargs) | Returns a new DataFrame sorted by the specified column(s). |
| orderBy(*cols, **kwargs) | Alias for sort. |
| sortWithinPartitions(*cols, **kwargs) | Returns a new DataFrame with each partition sorted by the specified column(s). |
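A brief sketch, assuming the `people` DataFrame from the Examples section:

```python
from pyspark.sql import functions as F

# Sort by one column ascending and another descending.
people.sort(F.col("deptId").asc(), F.col("salary").desc()).show()

# orderBy is an alias; the keyword form covers simple cases.
people.orderBy("age", ascending=False).show()
```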
Aggregation and grouping
| Method | Description |
|---|---|
| groupBy(*cols) | Groups the DataFrame by the specified columns so that aggregations can be performed on them. |
| rollup(*cols) | Creates a multi-dimensional rollup for the current DataFrame using the specified columns. |
| cube(*cols) | Creates a multi-dimensional cube for the current DataFrame using the specified columns. |
| groupingSets(groupingSets, *cols) | Creates a multi-dimensional aggregation for the current DataFrame using the specified grouping sets. |
| agg(*exprs) | Aggregates on the entire DataFrame without groups (shorthand for df.groupBy().agg()). |
| observe(observation, *exprs) | Defines (named) metrics to observe on the DataFrame. |
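Beyond plain groupBy (shown in the Examples section), rollup adds subtotal levels. A sketch assuming the `people` DataFrame defined there:

```python
from pyspark.sql import functions as F

# rollup aggregates at (deptId, gender), (deptId), and grand-total levels;
# grouping columns are NULL at the higher levels.
(people
    .rollup("deptId", "gender")
    .agg(F.sum("salary").alias("total_salary"))
    .orderBy("deptId", "gender")
    .show())
```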
Joins
| Method | Description |
|---|---|
| join(other, on, how) | Joins with another DataFrame, using the given join expression. |
| crossJoin(other) | Returns the Cartesian product with another DataFrame. |
| lateralJoin(other, on, how) | Performs a lateral join with another DataFrame, using the given join expression. |
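A sketch of the `how` argument and of crossJoin, assuming the `people` and `department` DataFrames from the Examples section:

```python
# Keep all people, even those whose deptId has no match in department.
people.join(department, people.deptId == department.id, "left").show()

# Cartesian product: every person paired with every department.
print(people.crossJoin(department).count())  # 4 * 3 = 12
```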
Set operations
| Method | Description |
|---|---|
| union(other) | Returns a new DataFrame containing the union of rows in this and another DataFrame. |
| unionByName(other, allowMissingColumns) | Returns a new DataFrame containing the union of rows in this and another DataFrame, resolving columns by name. |
| intersect(other) | Returns a new DataFrame containing only the rows present in both this DataFrame and another DataFrame. |
| intersectAll(other) | Returns a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates. |
| subtract(other) | Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame. |
| exceptAll(other) | Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. |
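A small sketch of the set operations on two toy DataFrames (assuming an active SparkSession named `spark`):

```python
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "v"])
df2 = spark.createDataFrame([(2, "b"), (3, "c")], ["id", "v"])

df1.union(df2).show()      # positional union; duplicates are kept
df1.intersect(df2).show()  # rows present in both, deduplicated
df1.subtract(df2).show()   # rows in df1 but not in df2, deduplicated

# unionByName matches columns by name rather than by position.
df1.unionByName(df2.select("v", "id")).show()
```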
Deduplication
| Method | Description |
|---|---|
| distinct() | Returns a new DataFrame containing the distinct rows of this DataFrame. |
| dropDuplicates(subset) | Returns a new DataFrame with duplicate rows removed, optionally considering only certain columns. |
| dropDuplicatesWithinWatermark(subset) | Returns a new DataFrame with duplicate rows removed within the watermark, optionally considering only certain columns. |
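For example (assuming the `people` DataFrame from the Examples section):

```python
# Deduplicate on every column.
print(people.distinct().count())

# Keep a single (arbitrary) row per deptId.
people.dropDuplicates(["deptId"]).show()
```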
Sampling and splitting
| Method | Description |
|---|---|
| sample(withReplacement, fraction, seed) | Returns a sampled subset of this DataFrame. |
| sampleBy(col, fractions, seed) | Returns a stratified sample without replacement, based on the fraction given for each stratum. |
| randomSplit(weights, seed) | Randomly splits this DataFrame with the provided weights. |
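A brief sketch, assuming the `people` DataFrame from the Examples section:

```python
# Sample roughly half the rows; the seed makes the sample repeatable.
people.sample(withReplacement=False, fraction=0.5, seed=42).show()

# Random train/test split; weights are normalized if they don't sum to 1.
train, test = people.randomSplit([0.75, 0.25], seed=42)
print(train.count(), test.count())
```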
Partitioning
| Method | Description |
|---|---|
| coalesce(numPartitions) | Returns a new DataFrame that has exactly numPartitions partitions. |
| repartition(numPartitions, *cols) | Returns a new DataFrame hash-partitioned by the given partitioning expressions. |
| repartitionByRange(numPartitions, *cols) | Returns a new DataFrame range-partitioned by the given partitioning expressions. |
| repartitionById(numPartitions, partitionIdCol) | Returns a new DataFrame partitioned by the given partition ID expression. |
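A minimal sketch (assuming an active SparkSession named `spark`):

```python
df = spark.range(1000)

df4 = df.repartition(4, "id")  # shuffle into 4 hash partitions on id
df1 = df4.coalesce(1)          # collapse to one partition without a full shuffle

# In Classic mode, the partition count can be checked via the RDD API.
print(df4.rdd.getNumPartitions())  # 4
```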
Reshaping
| Method | Description |
|---|---|
| unpivot(ids, values, variableColumnName, valueColumnName) | Unpivots a DataFrame from wide format to long format. |
| melt(ids, values, variableColumnName, valueColumnName) | Alias for unpivot. |
| transpose(indexColumn) | Transposes the DataFrame so that the values in the specified index column become the new columns. |
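A sketch of unpivot on the `people` DataFrame from the Examples section, turning the age and salary columns into (metric, value) rows:

```python
people.unpivot(
    ids=["name"],              # identifier column(s) kept as-is
    values=["age", "salary"],  # columns to melt into rows
    variableColumnName="metric",
    valueColumnName="value",
).show()
```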
Missing data handling
| Method | Description |
|---|---|
| dropna(how, thresh, subset) | Returns a new DataFrame omitting rows with null or NaN values. |
| fillna(value, subset) | Returns a new DataFrame in which null values are replaced with the specified value. |
| replace(to_replace, value, subset) | Returns a new DataFrame replacing one value with another. |
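A small sketch on a toy DataFrame with missing values (assuming an active SparkSession named `spark`):

```python
df = spark.createDataFrame(
    [(1, None, 10.0), (2, "b", None)], ["id", "v", "score"]
)

df.dropna(subset=["v"]).show()                    # drop rows where v is null
df.fillna({"v": "unknown", "score": 0.0}).show()  # per-column fill values
df.replace("b", "B", subset=["v"]).show()         # value substitution
```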
Statistical functions
| Method | Description |
|---|---|
| approxQuantile(col, probabilities, relativeError) | Calculates approximate quantiles of numerical columns of the DataFrame. |
| corr(col1, col2, method) | Calculates the correlation of two columns of the DataFrame as a double value. |
| cov(col1, col2) | Calculates the sample covariance for the given columns, specified by their names. |
| crosstab(col1, col2) | Computes a pair-wise frequency table of the given columns. |
| freqItems(cols, support) | Finds frequent items for columns, possibly with false positives. |
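For example (assuming the `people` DataFrame from the Examples section):

```python
# Approximate median salary; a smaller relativeError costs more memory.
print(people.approxQuantile("salary", [0.5], 0.01))

# Pearson correlation and sample covariance of two numeric columns.
print(people.corr("age", "salary"))
print(people.cov("age", "salary"))

# Pair-wise frequency table.
people.crosstab("deptId", "gender").show()
```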
Schema operations
| Method | Description |
|---|---|
| to(schema) | Returns a new DataFrame where each row is reconciled to match the specified schema. |
| alias(alias) | Returns a new DataFrame with an alias set. |
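alias is most useful for disambiguating a self-join; a sketch assuming the `people` DataFrame from the Examples section:

```python
from pyspark.sql import functions as F

a = people.alias("a")
b = people.alias("b")

# Pairs of distinct colleagues within the same department.
(a.join(b, F.col("a.deptId") == F.col("b.deptId"))
    .where(F.col("a.name") < F.col("b.name"))
    .select("a.name", "b.name")
    .show())
```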
Iteration
| Method | Description |
|---|---|
| foreach(f) | Applies the function f to each Row of this DataFrame. |
| foreachPartition(f) | Applies the function f to each partition of this DataFrame. |
Caching and persistence
| Method | Description |
|---|---|
| cache() | Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER). |
| persist(storageLevel) | Sets the storage level to persist the contents of the DataFrame across operations. |
| unpersist(blocking) | Marks the DataFrame as non-persistent and removes all of its blocks from memory and disk. |
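A brief sketch (assuming an active SparkSession named `spark`):

```python
from pyspark import StorageLevel

df = spark.range(1_000_000)

df.cache()   # MEMORY_AND_DISK_DESER by default
df.count()   # the first action materializes the cache
df.unpersist()

df.persist(StorageLevel.DISK_ONLY)  # explicit storage level
print(df.storageLevel)
df.unpersist(blocking=True)         # wait until the blocks are freed
```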
Checkpointing
| Method | Description |
|---|---|
| checkpoint(eager) | Returns a checkpointed version of this DataFrame. |
| localCheckpoint(eager, storageLevel) | Returns a locally checkpointed version of this DataFrame. |
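A sketch for Classic mode, where checkpoint() requires a checkpoint directory to be configured first (the path below is a placeholder):

```python
# Assumption: Classic mode; /tmp/spark-checkpoints is a writable path.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

df = spark.range(10)
cp = df.checkpoint(eager=True)  # truncates the logical plan; writes data out

# localCheckpoint keeps data on the executors: faster, but less reliable.
lcp = df.localCheckpoint(eager=True)
```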
Streaming operations
| Method | Description |
|---|---|
| withWatermark(eventTime, delayThreshold) | Defines an event-time watermark for this DataFrame. |
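A sketch on the built-in rate source, which emits a `timestamp` column; rows arriving more than 10 minutes late are dropped from the stateful aggregation (starting the query with writeStream is omitted here):

```python
from pyspark.sql import functions as F

stream = spark.readStream.format("rate").load()

windowed = (stream
    .withWatermark("timestamp", "10 minutes")
    .groupBy(F.window("timestamp", "5 minutes"))
    .count())
```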
Optimization hints
| Method | Description |
|---|---|
| hint(name, *parameters) | Specifies a hint on the current DataFrame. |
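For example, a broadcast-join hint on the smaller side of a join (assuming the DataFrames from the Examples section):

```python
# Suggest broadcasting `department` so the join avoids a shuffle.
(people
    .join(department.hint("broadcast"), people.deptId == department.id)
    .explain())
```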
Limits and offsets
| Method | Description |
|---|---|
| limit(num) | Limits the result count to the specified number. |
| offset(num) | Returns a new DataFrame by skipping the first num rows. |
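Together these support simple pagination (offset is available in Spark 3.4+); a sketch on the `people` DataFrame from the Examples section:

```python
# Skip the 2 youngest people, then take the next 2.
people.orderBy("age").offset(2).limit(2).show()
```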
Advanced transformations
| Method | Description |
|---|---|
| transform(func, *args, **kwargs) | Returns a new DataFrame; concise syntax for chaining custom transformations. |
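For example, custom transformation functions chain cleanly (a sketch on the `people` DataFrame from the Examples section):

```python
def only_adults(df):
    return df.filter(df.age >= 18)

def add_bonus(df, pct):
    return df.withColumn("bonus", df.salary * pct)

# Extra positional/keyword arguments are forwarded to the function.
people.transform(only_adults).transform(add_bonus, 0.1).show()
```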
Conversion methods
| Method | Description |
|---|---|
| toPandas() | Returns the contents of this DataFrame as a pandas.DataFrame. |
| toArrow() | Returns the contents of this DataFrame as a PyArrow pyarrow.Table. |
| pandas_api(index_col) | Converts the existing DataFrame into a pandas-on-Spark DataFrame. |
| mapInPandas(func, schema, barrier, profile) | Maps an iterator of batches in the current DataFrame using a Python native function that operates on pandas DataFrames. |
| mapInArrow(func, schema, barrier, profile) | Maps an iterator of batches in the current DataFrame using a Python native function that operates on pyarrow.RecordBatch. |
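A sketch of toPandas and mapInPandas on the `people` DataFrame from the Examples section; note that toPandas collects everything to the driver:

```python
pdf = people.toPandas()  # fine for small results; risky for large ones

# mapInPandas streams batches through a pandas-native function; the output
# schema must describe the columns the function yields.
def double_salary(batches):
    for batch in batches:
        yield batch[["name"]].assign(double_salary=batch["salary"] * 2)

people.mapInPandas(double_salary, schema="name string, double_salary long").show()
```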
Writing data
| Method | Description |
|---|---|
| writeTo(table) | Creates a write configuration builder for v2 sources. |
| mergeInto(table, condition) | Merges a set of updates, insertions, and deletions based on a source table into a target table. |
DataFrame comparison
| Method | Description |
|---|---|
| sameSemantics(other) | Returns True when the logical query plans inside both DataFrames are equal. |
| semanticHash() | Returns a hash code of this DataFrame's logical query plan. |
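For example (assuming an active SparkSession named `spark`):

```python
from pyspark.sql import functions as F

df1 = spark.range(10).withColumn("x", F.col("id") * 2)
df2 = spark.range(10).withColumn("x", F.col("id") * 2)
df3 = spark.range(10).withColumn("x", F.col("id") + 2)

print(df1.sameSemantics(df2))  # True: same logical plan
print(df1.sameSemantics(df3))  # False: different expression
print(df1.semanticHash())      # stable across semantically equal plans
```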
Metadata and file information
| Method | Description |
|---|---|
| inputFiles() | Returns a best-effort snapshot of the files that compose this DataFrame. |
Advanced SQL features
| Method | Description |
|---|---|
| isLocal() | Returns True if the collect and take methods can be run locally. |
| asTable() | Converts the DataFrame into a TableArg object, which can be used as a table argument in a table-valued function (TVF). |
| scalar() | Returns a Column object for a scalar subquery containing exactly one row and one column. |
| exists() | Returns a Column object for an EXISTS subquery. |
Examples
Basic DataFrame operations
```python
# Create a DataFrame
people = spark.createDataFrame([
    {"deptId": 1, "age": 40, "name": "Alice", "gender": "M", "salary": 50},
    {"deptId": 1, "age": 50, "name": "Bob", "gender": "M", "salary": 100},
    {"deptId": 2, "age": 60, "name": "Sue", "gender": "F", "salary": 150},
    {"deptId": 3, "age": 20, "name": "Tom", "gender": "M", "salary": 200},
])

# Select columns
people.select("name", "age").show()

# Filter rows
people.filter(people.age > 30).show()

# Add a new column
people.withColumn("age_plus_10", people.age + 10).show()
```
Aggregation and grouping
```python
# Group by and aggregate
people.groupBy("gender").agg({"salary": "avg", "age": "max"}).show()

# Multiple aggregations
from pyspark.sql import functions as F

people.groupBy("deptId").agg(
    F.avg("salary").alias("avg_salary"),
    F.max("age").alias("max_age"),
).show()
```
Joins
```python
# Create another DataFrame
department = spark.createDataFrame([
    {"id": 1, "name": "PySpark"},
    {"id": 2, "name": "ML"},
    {"id": 3, "name": "Spark SQL"},
])

# Join DataFrames
people.join(department, people.deptId == department.id).show()
```
Complex transformations
```python
# Chained operations
result = (
    people.filter(people.age > 30)
    .join(department, people.deptId == department.id)
    .groupBy(department.name, "gender")
    .agg({"salary": "avg", "age": "max"})
    .sort("max(age)")
)
result.show()
```