Skip to content

ToAvro

Overview

The ToAvro expression converts input data into Apache Avro binary format. It serves as a runtime-replaceable wrapper that delegates to the actual Avro serialization implementation at query planning time, supporting both automatic schema inference and explicit schema specification via JSON format.

Syntax

TO_AVRO(column)
TO_AVRO(column, schema_json)
// DataFrame API usage
import org.apache.spark.sql.functions._
df.select(expr("to_avro(column)"))
df.select(expr("to_avro(column, schema_json)"))

Arguments

Argument Type Description
child Expression The input expression/column to convert to Avro format
jsonFormatSchema Expression Optional JSON string representing the Avro schema to use for conversion. Must be a constant string or null

Return Type

Binary data type containing the Avro-encoded representation of the input data.

Supported Data Types

The expression accepts any input data type for the child expression, as the underlying CatalystDataToAvro implementation handles the conversion. The jsonFormatSchema parameter must be:

  • StringType (constant/foldable expressions only)
  • NullType (when schema parameter is omitted)

Algorithm

  • Validates that the jsonFormatSchema parameter is either null or a constant string expression
  • At runtime replacement phase, evaluates the schema parameter to extract the JSON schema string
  • Uses reflection to instantiate the CatalystDataToAvro class from the spark-avro module
  • Delegates actual Avro serialization to the CatalystDataToAvro expression with the provided schema
  • Throws compilation error if spark-avro module is not available on the classpath

Partitioning Behavior

This expression preserves partitioning as it operates row-by-row without requiring data redistribution:

  • Preserves existing partitioning schemes
  • Does not require shuffle operations
  • Can be applied within partition boundaries

Edge Cases

  • Null jsonFormatSchema: Uses automatic schema inference from input data structure
  • Missing spark-avro dependency: Throws QueryCompilationErrors.avroNotLoadedSqlFunctionsUnusable at query compilation time
  • Non-constant schema parameter: Fails type checking with descriptive error message about requiring constant string
  • Invalid JSON schema format: Error handling delegated to underlying CatalystDataToAvro implementation

Code Generation

This expression uses runtime replacement pattern and does not directly support code generation. Code generation support depends on the underlying CatalystDataToAvro expression that replaces this wrapper at query planning time.

Examples

-- Convert struct to Avro with automatic schema inference
SELECT TO_AVRO(struct(id, name, age)) as avro_data FROM users;

-- Convert with explicit schema
SELECT TO_AVRO(user_data, '{"type":"record","name":"User","fields":[{"name":"id","type":"int"}]}') 
FROM users;
// DataFrame API with automatic schema
import org.apache.spark.sql.functions._
df.select(expr("to_avro(struct(id, name))").alias("avro_data"))

// With explicit schema
val schema = """{"type":"record","name":"User","fields":[{"name":"id","type":"int"}]}"""
df.select(expr(s"to_avro(user_struct, '$schema')"))

See Also

  • FROM_AVRO: Expression for deserializing Avro binary data back to Spark data types
  • CatalystDataToAvro: The underlying implementation class that performs actual Avro serialization
  • RuntimeReplaceable: The trait that enables compile-time expression replacement pattern