Del via


mapInArrow

Maps an iterator of batches in the current DataFrame using a Python native function that is performed on pyarrow.RecordBatchs both as input and output, and returns the result as a DataFrame.

Syntax

mapInArrow(func: "ArrowMapIterFunction", schema: Union[StructType, str], barrier: bool = False, profile: Optional[ResourceProfile] = None)

Parameters

Parameter Type Description
func function a Python native function that takes an iterator of pyarrow.RecordBatchs, and outputs an iterator of pyarrow.RecordBatchs.
schema DataType or str the return type of the func in PySpark. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string.
barrier bool, optional, default False Use barrier mode execution, ensuring that all Python workers in the stage will be launched concurrently.
profile ResourceProfile, optional The optional ResourceProfile to be used for mapInArrow.

Returns

DataFrame

Examples

import pyarrow as pa
df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
def filter_func(iterator):
    for batch in iterator:
        pdf = batch.to_pandas()
        yield pa.RecordBatch.from_pandas(pdf[pdf.id == 1])
df.mapInArrow(filter_func, df.schema).show()
# +---+---+
# | id|age|
# +---+---+
# |  1| 21|
# +---+---+

df.mapInArrow(filter_func, df.schema, barrier=True).collect()
# [Row(id=1, age=21)]