UserDefinedGenerator¶

Overview¶

A generator expression that produces multiple output rows using a user-provided lambda function. It converts Catalyst internal rows to Scala rows, applies the user function, and returns an iterable collection of internal rows for further processing.

Syntax¶

// DataFrame API usage - typically used internally
UserDefinedGenerator(elementSchema, function, children)

Arguments¶

Argument	Type	Description
elementSchema	StructType	Schema definition for the output rows produced by the generator
function	Row => IterableOnce[InternalRow]	User-defined lambda function that takes a Row and returns an iterable collection of InternalRows
children	Seq[Expression]	Child expressions that provide input data to the generator function

Return Type¶

Returns IterableOnce[InternalRow] - a collection of internal rows that can be iterated over to produce multiple output rows from a single input row.

Supported Data Types¶

All Catalyst data types supported through child expressions
Input types determined by the data types of child expressions
Output types must conform to the provided elementSchema
Supports User Defined Types (UDT) through schema-based conversion

Algorithm¶

Lazily initializes converters on first evaluation call
Creates an InterpretedProjection from child expressions to extract input values
Builds a Catalyst-to-Scala converter based on child expression schemas and data types
Projects input row through child expressions to get intermediate values
Converts intermediate internal row to Scala Row using type converters
Applies user-defined function to the converted Scala row
Returns the iterable collection of InternalRows produced by the function

Partitioning Behavior¶

How this expression affects partitioning:

Does not preserve partitioning as it's a generator that can produce variable numbers of output rows
May require shuffle depending on downstream operations that depend on generated data distribution
Generator nature means output cardinality is unpredictable and depends on user function behavior

Edge Cases¶

Null handling depends on the user-defined function implementation - no built-in null safety
Empty input behavior: if child expressions produce null/empty values, behavior depends on user function
Lazy initialization means first evaluation may have higher latency due to converter setup
Schema mismatch between elementSchema and actual function output can cause runtime failures
Function exceptions are propagated up without additional error handling

Code Generation¶

This expression uses CodegenFallback trait, meaning it does not support Tungsten code generation and always falls back to interpreted evaluation mode for safety and flexibility with user-defined functions.

Examples¶

// Example DataFrame API usage - internal usage
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.InternalRow

val schema = StructType(Seq(
  StructField("value", IntegerType, false)
))

val generator = UserDefinedGenerator(
  elementSchema = schema,
  function = (row: Row) => {
    val input = row.getInt(0)
    (1 to input).map(i => InternalRow(i))
  },
  children = Seq(col("number"))
)

-- UserDefinedGenerator is typically not directly accessible in SQL
-- It's used internally by higher-level constructs like table-valued functions