ToProtobuf

Overview

The ToProtobuf expression converts Spark SQL struct data into Protocol Buffers (protobuf) binary format. It is a RuntimeReplaceable expression: it validates its input parameters and delegates the actual conversion to the CatalystDataToProtobuf implementation. This enables serialization of structured data to protobuf for interoperability with protobuf-based systems.

Syntax

TO_PROTOBUF(data, messageName[, descFilePath[, options]])
// DataFrame API usage through SQL expression
df.selectExpr("TO_PROTOBUF(struct_column, 'MessageName', 'path/to/descriptor.desc') as protobuf_data")

Arguments

Argument       Type                                           Description
data           StructType                                     The struct data to convert to protobuf format
messageName    StringType (constant)                          Name of the protobuf message type used for conversion
descFilePath   StringType or BinaryType (constant, optional)  File path to a protobuf descriptor file, or the descriptor's binary content
options        MapType[String, String] (constant, optional)   Configuration options for the protobuf conversion

Return Type

Returns BinaryType - the protobuf-encoded binary representation of the input struct data.

Supported Data Types

  • Input data: Must be StructType only (non-struct columns can be wrapped in a struct first; see the sketch after this list)
  • Message name: StringType (must be foldable/constant)
  • Descriptor file: StringType, BinaryType, or NullType (must be foldable/constant)
  • Options: MapType[StringType, StringType] or NullType (must be foldable/constant)
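
Because the input must be a struct, columns of other types have to be wrapped before conversion, for example with the struct function. A minimal sketch; df, the column names, the message name, and the descriptor path are all placeholders:

import org.apache.spark.sql.functions.{col, expr, struct}

// Hypothetical scalar columns `name` and `age`: wrap them into a single
// struct so the input satisfies the StructType requirement.
val withStruct = df.withColumn("person_struct", struct(col("name"), col("age")))

// 'PersonMessage' and the descriptor path are placeholder values.
val encoded = withStruct.select(
  expr("TO_PROTOBUF(person_struct, 'PersonMessage', '/path/to/person.desc')")
    .alias("protobuf_data"))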

Algorithm

  • Validates that the input data is a struct type and all parameters are constant expressions
  • Extracts the message name string from the constant expression
  • Reads descriptor file content (either from file path or direct binary data)
  • Parses options map from the constant map expression
  • Uses reflection to instantiate CatalystDataToProtobuf with the validated parameters
  • Delegates actual protobuf conversion to the replacement expression
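
The validate-then-delegate flow can be illustrated with a self-contained sketch. Everything below is hypothetical and greatly simplified relative to Spark's internals, which perform these steps inside the expression itself:

import java.nio.file.{Files, Paths}

// Illustrative sketch of the validate-then-delegate flow; all names here
// are hypothetical, not Spark's.
object ToProtobufValidationSketch {
  final case class Validated(
      messageName: String,                  // extracted from a constant expression
      descriptorBytes: Option[Array[Byte]], // file contents or inline binary
      options: Map[String, String])         // defaults to empty when absent

  def validate(
      messageName: Option[String],
      descFilePath: Option[String],
      descBinary: Option[Array[Byte]],
      options: Option[Map[String, String]]): Validated = {
    // A null message name is rejected at runtime.
    val msg = messageName.getOrElse(
      throw new IllegalArgumentException("messageName must not be null"))
    // A file path is read eagerly; inline binary content is passed through.
    val bytes = descBinary.orElse(
      descFilePath.map(p => Files.readAllBytes(Paths.get(p))))
    // The real expression would now instantiate CatalystDataToProtobuf
    // reflectively and delegate the conversion to it.
    Validated(msg, bytes, options.getOrElse(Map.empty))
  }
}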

Partitioning Behavior

  • Preserves partitioning: Yes, this is a row-level transformation that doesn't require data movement
  • Requires shuffle: No, operates independently on each row within partitions
  • Can be applied per-partition without cross-partition dependencies

Edge Cases

  • Null message name: Throws IllegalArgumentException at runtime
  • Missing protobuf library: Throws QueryCompilationErrors.protobufNotLoadedSqlFunctionsUnusable
  • Invalid descriptor file: Handled by ProtobufUtils.readDescriptorFileContent
  • Empty options map: Defaults to Map.empty when null or empty
  • Null descriptor file: Treated as None (uses default descriptor resolution)
  • Non-struct input data: Fails type checking with a NON_STRUCT_TYPE error (see the snippet below)
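
To illustrate the last case, passing a non-struct column should fail during analysis. A sketch; df and its hypothetical string column name_string are placeholders, and the exact error message varies by Spark version:

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.functions.expr

// Passing a non-struct column fails analysis-time type checking with the
// NON_STRUCT_TYPE error class.
try {
  df.select(expr("TO_PROTOBUF(name_string, 'PersonMessage')")).collect()
} catch {
  case e: AnalysisException => println(e.getMessage)
}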

Code Generation

Because ToProtobuf is a RuntimeReplaceable expression, it is never evaluated or code-generated itself. Code generation is handled by the replacement CatalystDataToProtobuf expression: ToProtobuf acts as a factory that constructs the actual implementation once the plan has been analyzed, and the replacement is visible in the query plan, as shown below.
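
One way to observe the substitution is to inspect the query plans. A sketch (df, the column, the message name, and the path are placeholders); in recent Spark versions the swap is performed by the ReplaceExpressions optimizer rule, so the replacement appears in the optimized plan:

import org.apache.spark.sql.functions.expr

val qe = df
  .select(expr("TO_PROTOBUF(person_struct, 'PersonMessage', '/path/to/person.desc')"))
  .queryExecution

// Expect the optimized plan to reference CatalystDataToProtobuf rather
// than ToProtobuf.
println(qe.optimizedPlan)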

Examples

-- Basic usage with message name only
SELECT TO_PROTOBUF(struct_col, 'PersonMessage') as protobuf_data
FROM table_name;

-- With descriptor file path
SELECT TO_PROTOBUF(
  struct_col, 
  'PersonMessage', 
  '/path/to/person.desc'
) as protobuf_data
FROM table_name;

-- With options map
SELECT TO_PROTOBUF(
  struct_col,
  'PersonMessage',
  '/path/to/person.desc',
  map('option1', 'value1', 'option2', 'value2')
) as protobuf_data
FROM table_name;
// DataFrame API usage
import org.apache.spark.sql.functions._

df.select(
  expr("TO_PROTOBUF(person_struct, 'PersonMessage', 'person.desc')").alias("protobuf_data")
)

// With a descriptor path and options (both must be constant expressions,
// not column references)
df.selectExpr(
  "TO_PROTOBUF(data_struct, 'MyMessage', '/path/to/my.desc', map('option1', 'value1')) as proto_binary"
)
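
In Scala, the spark-protobuf module also exposes typed helpers in org.apache.spark.sql.protobuf.functions, which avoid building SQL strings. A round-trip sketch; the message name, descriptor path, and column names are placeholders, and the module must be on the classpath:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.{from_protobuf, to_protobuf}

// struct -> protobuf binary
val encoded = df.select(
  to_protobuf(col("person_struct"), "PersonMessage", "/path/to/person.desc")
    .alias("proto_binary"))

// protobuf binary -> struct (round trip back)
val decoded = encoded.select(
  from_protobuf(col("proto_binary"), "PersonMessage", "/path/to/person.desc")
    .alias("person_struct"))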

See Also

  • FromProtobuf - Converts protobuf binary data back to struct format
  • CatalystDataToProtobuf - The actual implementation expression for protobuf conversion
  • to_json - Similar serialization function for JSON format
  • ProtobufUtils - Utility functions for protobuf descriptor handling