PythonUDTF¶

Overview¶

PythonUDTF represents a User-Defined Table Function (UDTF) implemented in Python that generates multiple rows from input data. It extends UnevaluableGenerator and PythonFuncExpression, allowing Python functions to be executed as table-generating functions within Spark SQL queries.

Syntax¶

-- SQL syntax
SELECT * FROM python_udtf_function(arg1, arg2, ...) [PARTITION BY (partition_cols)]

-- With table arguments
SELECT * FROM python_udtf_function(TABLE(table_arg), scalar_arg, ...)

// DataFrame API usage (typically used internally)
// PythonUDTF expressions are usually created through SQL interface

Arguments¶

Argument	Type	Description
`name`	`String`	The name of the Python UDTF function
`func`	`PythonFunction`	The Python function implementation containing the actual UDTF logic
`elementSchema`	`StructType`	The schema definition of the output rows generated by the UDTF
`pickledAnalyzeResult`	`Option[Array[Byte]]`	Optional pickled analysis result from Python for optimization purposes
`children`	`Seq[Expression]`	The input expressions/arguments passed to the UDTF
`evalType`	`Int`	The evaluation type identifier for the Python function
`udfDeterministic`	`Boolean`	Whether the UDTF produces deterministic results for the same inputs
`resultId`	`ExprId`	Unique identifier for the expression result (cosmetic, doesn't affect computation)
`pythonUDTFPartitionColumnIndexes`	`Option[PythonUDTFPartitionColumnIndexes]`	Zero-based indexes of PARTITION BY expressions within TABLE arguments
`tableArguments`	`Option[Seq[Boolean]]`	Flags indicating which input arguments are table arguments vs scalar arguments

Return Type¶

Returns a collection of rows conforming to the elementSchema (StructType). Each execution can generate zero, one, or multiple output rows.

Supported Data Types¶

All Spark SQL data types supported by the Python-Spark interface
Scalar arguments: primitive types, complex types (arrays, maps, structs)
Table arguments: complete table/DataFrame inputs
Output schema must be representable as a StructType with supported column types

Algorithm¶

Expression is marked as UnevaluableGenerator, meaning it cannot be directly evaluated in normal expression context
Actual evaluation occurs in specialized physical operators that handle generator expressions
Python function is executed in Python worker processes with serialized input data
Results are deserialized back to Spark rows according to elementSchema
Partitioning behavior is controlled by pythonUDTFPartitionColumnIndexes when table arguments are present

Partitioning Behavior¶

Partition Preservation: Does not preserve input partitioning due to the generator nature
Shuffle Requirements: May require shuffle if PARTITION BY clauses are specified with table arguments
Distribution: Output rows are distributed based on the partitioning strategy of the containing query plan
Table Arguments: When table arguments with PARTITION BY are used, data is repartitioned according to pythonUDTFPartitionColumnIndexes

Edge Cases¶

Null Handling: Null handling behavior is determined by the Python function implementation
Empty Input: Can generate zero rows if the Python function returns empty results
Table Argument Boundaries: Partition boundaries for table arguments are strictly enforced
Schema Mismatch: Runtime errors occur if Python function output doesn't match elementSchema
Determinism: Non-deterministic UDTFs may produce different results across executions, affecting caching and optimization

Code Generation¶

Code Generation Support: No - extends UnevaluableGenerator which bypasses Tungsten code generation
Execution Mode: Always uses interpreted mode through Python worker processes
Serialization: Relies on Python-Spark serialization protocols for data exchange
Performance: Subject to serialization overhead and inter-process communication costs

Examples¶

-- Basic Python UDTF usage
SELECT * FROM python_split_function('hello,world,spark', ',');

-- With PARTITION BY on table argument  
SELECT * FROM python_window_function(TABLE(sales_data) PARTITION BY (region), 'sum');

-- Mixed scalar and table arguments
SELECT * FROM python_analyze_table(TABLE(user_events) PARTITION BY (user_id), 'daily', 7);

// Scala/internal usage (typically not user-facing)
val pythonUDTF = PythonUDTF(
  name = "my_udtf",
  func = pythonFunction,
  elementSchema = StructType(Seq(StructField("result", StringType))),
  pickledAnalyzeResult = None,
  children = Seq(inputExpr),
  evalType = PythonEvalType.SQL_TABLE_UDF,
  udfDeterministic = true
)