

createDataFrame

Creates a DataFrame from an RDD, a list, a pandas.DataFrame, a numpy.ndarray, or a pyarrow.Table.

Syntax

createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)

Parameters

data : RDD or iterable
    An RDD of any kind of SQL data representation (Row, tuple, int, bool, dict, etc.), or a list, pandas.DataFrame, numpy.ndarray, or pyarrow.Table.
schema : DataType, str, or list, optional
    A DataType, a datatype string, or a list of column names. When a list of column names is provided, the type of each column is inferred from data. When None, the schema is inferred from data (which then requires Row, namedtuple, or dict elements). When a DataType or datatype string is provided, it must match the actual data.
samplingRatio : float, optional
    The sample ratio of rows used for schema inference when data is an RDD. If None, the first few rows are used.
verifySchema : bool, optional
    Verify the data types of every row against the schema. Enabled by default. Not supported with pyarrow.Table input or Arrow-enabled pandas conversion.

Returns

DataFrame

Notes

Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.

Examples

# Create a DataFrame from a list of tuples.
spark.createDataFrame([('Alice', 1)]).show()
# +-----+---+
# |   _1| _2|
# +-----+---+
# |Alice|  1|
# +-----+---+

# Create a DataFrame from a list of dictionaries.
spark.createDataFrame([{'name': 'Alice', 'age': 1}]).show()
# +---+-----+
# |age| name|
# +---+-----+
# |  1|Alice|
# +---+-----+

# Create a DataFrame with column names specified.
spark.createDataFrame([('Alice', 1)], ['name', 'age']).show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  1|
# +-----+---+

# Create a DataFrame with an explicit schema.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])
spark.createDataFrame([('Alice', 1)], schema).show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  1|
# +-----+---+

# Create a DataFrame with a DDL-formatted schema string.
spark.createDataFrame([('Alice', 1)], "name: string, age: int").show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  1|
# +-----+---+

# Create an empty DataFrame (schema is required when data is empty).
spark.createDataFrame([], "name: string, age: int").show()
# +----+---+
# |name|age|
# +----+---+
# +----+---+

# Create a DataFrame from Row objects.
from pyspark.sql import Row
Person = Row('name', 'age')
spark.createDataFrame([Person("Alice", 1)]).show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  1|
# +-----+---+