Del via


to_avro

Converts a column into binary of Avro format.

If both subject and schemaRegistryAddress are provided, the function converts a column into binary of Schema Registry Avro format. The input data schema must have been registered to the given subject in Schema Registry, or the query fails at runtime.

Syntax

from pyspark.sql.avro.functions import to_avro

to_avro(data, jsonFormatSchema=None, subject=None, schemaRegistryAddress=None, options=None)

Parameters

Parameter Type Description
data pyspark.sql.Column or str The data column to serialize.
jsonFormatSchema str, optional User-specified output Avro schema in JSON string format.
subject pyspark.sql.Column or str, optional The subject in Schema Registry that the data belongs to.
schemaRegistryAddress str, optional The address (host and port) of the Schema Registry.
options dict, optional Options to control how the Avro record is serialized and configuration for the schema registry client.

Returns

pyspark.sql.Column: A new column containing the Avro-encoded binary data.

Examples

Example 1: Converting a string column to Avro binary format

from pyspark.sql.avro.functions import to_avro

data = ['SPADES']
df = spark.createDataFrame(data, "string")
df.select(to_avro(df.value).alias("avro")).show(truncate=False)
+--------------------+
|avro                |
+--------------------+
|[00 0C 53 50 41 4...|
+--------------------+

Example 2: Converting a string column to Avro using a custom JSON schema

from pyspark.sql.avro.functions import to_avro

data = ['SPADES']
df = spark.createDataFrame(data, "string")
json_format_schema = '''["null", {"type": "enum", "name": "value",
    "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]}]'''
df.select(to_avro(df.value, json_format_schema).alias("avro")).show(truncate=False)
+--------+
|avro    |
+--------+
|[02 00] |
+--------+