read (DataSourceReader)

Generates data for a given partition and returns an iterator of tuples, rows, or PyArrow RecordBatch objects.

This method is invoked once per partition to read the data. Implementing this method is required for readable data sources. You can initialize any non-serializable resources required for reading data from the data source within this method.
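As a minimal sketch of initializing a non-serializable resource inside `read`, the reader below stores only a file path and opens a SQLite connection when a partition is read. The class name `SqliteItemsReader`, the `items` table, and the even/odd split by `id` are hypothetical and chosen only for illustration. The `try`/`except` fallback defines stand-in base classes so the sketch also runs where PySpark is not installed.

```python
import sqlite3

try:
    from pyspark.sql.datasource import DataSourceReader, InputPartition
except ImportError:
    # Stand-ins (assumption) so the sketch runs without PySpark installed.
    class DataSourceReader:
        pass

    class InputPartition:
        def __init__(self, value):
            self.value = value


class SqliteItemsReader(DataSourceReader):
    """Hypothetical reader that splits a table into even and odd ids."""

    def __init__(self, db_path):
        # Keep only plain, picklable state on the reader itself.
        self.db_path = db_path

    def partitions(self):
        # Partition 0 reads even ids, partition 1 reads odd ids.
        return [InputPartition(0), InputPartition(1)]

    def read(self, partition):
        # The connection cannot be pickled, so it is opened here,
        # once per partition, rather than in __init__.
        conn = sqlite3.connect(self.db_path)
        try:
            cursor = conn.execute(
                "SELECT id, name FROM items WHERE id % 2 = ? ORDER BY id",
                (partition.value,),
            )
            for row in cursor:
                yield tuple(row)
        finally:
            # Runs when the iterator is exhausted or closed.
            conn.close()
```

Because the connection lives only inside `read`, the reader object carries nothing but a string, and each partition manages its own connection lifetime.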

Syntax

read(partition: InputPartition)

Parameters

Parameter	Type	Description
`partition`	InputPartition	The partition to read. It must be one of the partition values returned by `partitions()`.
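The relationship between `partitions()` and `read()` can be sketched as follows. `RangePartition` and `RangeReader` are hypothetical names, and the final loop only imitates, on a single machine, the one-call-per-partition behavior described above. The stand-in base classes are an assumption so the sketch runs without PySpark installed.

```python
from math import ceil

try:
    from pyspark.sql.datasource import DataSourceReader, InputPartition
except ImportError:
    # Stand-ins (assumption) so the sketch runs without PySpark installed.
    class DataSourceReader:
        pass

    class InputPartition:
        def __init__(self, value):
            self.value = value


class RangePartition(InputPartition):
    """Hypothetical partition describing the half-open range [start, end)."""

    def __init__(self, start, end):
        self.start = start
        self.end = end


class RangeReader(DataSourceReader):
    """Hypothetical reader that emits (n, n squared) for n in [0, total)."""

    def __init__(self, total, num_partitions):
        self.total = total
        self.num_partitions = num_partitions

    def partitions(self):
        # Split [0, total) into contiguous chunks of at most `step` items.
        step = max(1, ceil(self.total / self.num_partitions))
        return [
            RangePartition(start, min(start + step, self.total))
            for start in range(0, self.total, step)
        ]

    def read(self, partition):
        # `partition` is one of the objects returned by partitions().
        for n in range(partition.start, partition.end):
            yield (n, n * n)


# Local imitation of the engine: call read() once per partition.
reader = RangeReader(total=5, num_partitions=2)
rows = [row for part in reader.partitions() for row in reader.read(part)]
```

Each partition object carries just enough information for `read` to know which slice of the data it is responsible for.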

Returns

Iterator[Tuple] or Iterator[RecordBatch]

An iterator of tuples or rows. Each tuple or row becomes one row in the resulting DataFrame. The method can instead return an iterator of PyArrow RecordBatch objects if the data source supports it.

Examples

Yields tuples one at a time:

def read(self, partition: InputPartition):
    yield (partition.value, 0)
    yield (partition.value, 1)

Yields rows one at a time:

def read(self, partition: InputPartition):
    from pyspark.sql import Row
    yield Row(partition=partition.value, value=0)
    yield Row(partition=partition.value, value=1)

Yields PyArrow RecordBatch objects:

def read(self, partition: InputPartition):
    import pyarrow as pa
    data = {
        "partition": [partition.value] * 2,
        "value": [0, 1]
    }
    table = pa.Table.from_pydict(data)
    for batch in table.to_batches():
        yield batch


Last updated on 2026-04-17