PythonUDTF¶
Overview¶
PythonUDTF represents a User-Defined Table Function (UDTF) implemented in Python that generates multiple rows from input data. It extends UnevaluableGenerator and PythonFuncExpression, allowing Python functions to be executed as table-generating functions within Spark SQL queries.
Syntax¶
-- SQL syntax
SELECT * FROM python_udtf_function(arg1, arg2, ...)

-- With table arguments (PARTITION BY attaches to the TABLE argument)
SELECT * FROM python_udtf_function(TABLE(table_arg) [PARTITION BY (partition_cols)], scalar_arg, ...)
// DataFrame API usage (typically internal)
// PythonUDTF expressions are usually created through the SQL interface when a
// registered Python UDTF is invoked (see the registration sketch below)
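For the SQL forms above to resolve, the UDTF must first be registered under a name. The following is a minimal sketch, assuming PySpark 3.5+ (where the Python UDTF API is available); the function name matches the `python_split_function` used in the examples below, and the column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udtf

@udtf(returnType="token: string, pos: int")
class SplitFunction:
    def eval(self, text: str, delimiter: str):
        # Each yielded tuple becomes one output row.
        for i, token in enumerate(text.split(delimiter)):
            yield (token, i)

spark = SparkSession.builder.getOrCreate()

# Register under the name used in the SQL syntax above (illustrative name).
spark.udtf.register("python_split_function", SplitFunction)
spark.sql("SELECT * FROM python_split_function('hello,world,spark', ',')").show()
```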
Arguments¶
| Argument | Type | Description |
|---|---|---|
| `name` | `String` | The name of the Python UDTF function |
| `func` | `PythonFunction` | The Python function implementation containing the actual UDTF logic |
| `elementSchema` | `StructType` | The schema definition of the output rows generated by the UDTF |
| `pickledAnalyzeResult` | `Option[Array[Byte]]` | Optional pickled analysis result from Python for optimization purposes |
| `children` | `Seq[Expression]` | The input expressions/arguments passed to the UDTF |
| `evalType` | `Int` | The evaluation type identifier for the Python function |
| `udfDeterministic` | `Boolean` | Whether the UDTF produces deterministic results for the same inputs |
| `resultId` | `ExprId` | Unique identifier for the expression result (cosmetic; does not affect computation) |
| `pythonUDTFPartitionColumnIndexes` | `Option[PythonUDTFPartitionColumnIndexes]` | Zero-based indexes of PARTITION BY expressions within TABLE arguments |
| `tableArguments` | `Option[Seq[Boolean]]` | Flags indicating which input arguments are table arguments vs. scalar arguments |
Return Type¶
Returns a collection of rows conforming to `elementSchema` (a `StructType`). Each invocation of the UDTF can generate zero, one, or multiple output rows (see the sketch below).
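A hedged Python-side sketch of this contract, assuming PySpark 3.5+: each `eval` call may emit any number of rows, and every emitted row must match the declared output schema. The schema and logic here are illustrative:

```python
from pyspark.sql.functions import udtf

@udtf(returnType="n: int, square: int")
class Squares:
    def eval(self, count: int):
        # Zero rows when there is nothing to produce ...
        if count is None or count <= 0:
            return
        # ... otherwise one output row per generated value.
        for n in range(count):
            yield (n, n * n)

# Squares(lit(3)) would produce rows (0, 0), (1, 1), (2, 4);
# Squares(lit(0)) would produce no rows at all.
```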
Supported Data Types¶
- All Spark SQL data types supported by the Python-Spark interface
- Scalar arguments: primitive types, complex types (arrays, maps, structs)
- Table arguments: complete table/DataFrame inputs
- Output schema must be representable as a StructType with supported column types (see the sketch below)
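As a hedged illustration of the last point (assuming PySpark 3.5+), the output schema may include complex column types as long as they map to Spark SQL types; the schema and names below are illustrative:

```python
from pyspark.sql.functions import udtf

# The DDL string below is parsed into a StructType (the elementSchema) with an
# array column and a struct column alongside a primitive column.
@udtf(returnType="id: int, tags: array<string>, stats: struct<min: double, max: double>")
class Summarize:
    def eval(self, id: int, values: list):
        if not values:
            return
        yield (id, [str(v) for v in values], (float(min(values)), float(max(values))))
```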
Algorithm¶
- The expression extends `UnevaluableGenerator`, meaning it cannot be evaluated directly in a normal expression context
- Actual evaluation occurs in specialized physical operators that handle generator expressions
- The Python function is executed in Python worker processes with serialized input data
- Results are deserialized back into Spark rows according to `elementSchema` (see the analyze sketch below)
- Partitioning behavior is controlled by `pythonUDTFPartitionColumnIndexes` when table arguments are present
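The `elementSchema` need not be fixed at registration time. A hedged sketch, assuming PySpark 3.5+ where a UDTF class may define a static `analyze` method: the output schema is computed on the Python side during analysis, and its pickled result is what the `pickledAnalyzeResult` argument described above carries. Names are illustrative:

```python
from pyspark.sql.functions import udtf
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult

class RepeatColumns:
    @staticmethod
    def analyze(n: AnalyzeArgument) -> AnalyzeResult:
        # Runs on the Python side during analysis; the returned AnalyzeResult
        # (including the computed output schema) is pickled and attached to the
        # plan so it can be reused at execution time.
        width = int(n.value)  # n must be a constant (foldable) argument
        return AnalyzeResult(
            schema=StructType([StructField(f"c{i}", IntegerType()) for i in range(width)])
        )

    def eval(self, n: int):
        yield tuple(range(n))

# No returnType here: the schema comes from analyze() instead.
repeat_columns = udtf(RepeatColumns)
```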
Partitioning Behavior¶
- Partition Preservation: Does not preserve input partitioning due to the generator nature
- Shuffle Requirements: May require shuffle if PARTITION BY clauses are specified with table arguments
- Distribution: Output rows are distributed based on the partitioning strategy of the containing query plan
- Table Arguments: When table arguments with PARTITION BY are used, data is repartitioned according to `pythonUDTFPartitionColumnIndexes` (see the sketch below)
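A hedged sketch of the table-argument case, assuming PySpark 3.5+ semantics where all rows of one partition are consumed by the same UDTF instance: per-partition state can be accumulated in `eval` and emitted from `terminate`. The table and column names (`sales_data`, `region`, `amount`) and the registered name are illustrative:

```python
from pyspark.sql import Row
from pyspark.sql.functions import udtf

@udtf(returnType="region: string, total: double")
class SumByRegion:
    def __init__(self):
        self._region = None
        self._total = 0.0

    def eval(self, row: Row):
        # Called once per input row of the current partition.
        self._region = row["region"]
        self._total += float(row["amount"] or 0.0)

    def terminate(self):
        # Called once per partition, after all of its rows have been consumed.
        if self._region is not None:
            yield (self._region, self._total)

# spark.udtf.register("sum_by_region", SumByRegion)
# SELECT * FROM sum_by_region(TABLE(sales_data) PARTITION BY region)
```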
Edge Cases¶
- Null Handling: Null handling behavior is determined by the Python function implementation
- Empty Input: Can generate zero rows if the Python function returns empty results
- Table Argument Boundaries: Partition boundaries for table arguments are strictly enforced
- Schema Mismatch: Runtime errors occur if the Python function's output does not match `elementSchema` (see the sketch after this list)
- Determinism: Non-deterministic UDTFs may produce different results across executions, affecting caching and optimization
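A hedged illustration of the null-handling and schema-mismatch bullets, assuming PySpark 3.5+: every yielded row must have exactly the declared columns, and NULL handling is whatever the implementation chooses. Names are illustrative:

```python
from pyspark.sql.functions import udtf

@udtf(returnType="word: string, length: int")
class Tokens:
    def eval(self, text: str):
        if text is None:
            # NULL inputs arrive as Python None; here NULL yields zero rows,
            # but the choice is up to the implementation.
            return
        for word in text.split():
            yield (word, len(word))   # matches the two declared columns
            # yield (word,)           # would fail at runtime: arity mismatch
```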
Code Generation¶
- Code Generation Support: No; extends `UnevaluableGenerator`, which bypasses Tungsten code generation
- Execution Mode: Always uses interpreted mode, executing through Python worker processes
- Serialization: Relies on Python-Spark serialization protocols for data exchange
- Performance: Subject to serialization overhead and inter-process communication costs (see the Arrow sketch below)
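One way to reduce the serialization overhead noted above is Arrow-based data exchange. This is a hedged sketch assuming PySpark 3.5+, where Arrow-optimized Python UDTFs are available; they still execute in Python workers rather than via Tungsten codegen:

```python
from pyspark.sql.functions import udtf

# useArrow=True requests Arrow-based serialization between the JVM and the
# Python worker, which can reduce per-row conversion costs for larger outputs.
@udtf(returnType="n: int", useArrow=True)
class Numbers:
    def eval(self, count: int):
        for n in range(count or 0):
            yield (n,)

# The same behavior can be requested globally (subject to version support):
# spark.conf.set("spark.sql.execution.pythonUDTF.arrow.enabled", "true")
```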
Examples¶
-- Basic Python UDTF usage
SELECT * FROM python_split_function('hello,world,spark', ',');
-- With PARTITION BY on table argument
SELECT * FROM python_window_function(TABLE(sales_data) PARTITION BY (region), 'sum');
-- Mixed scalar and table arguments
SELECT * FROM python_analyze_table(TABLE(user_events) PARTITION BY (user_id), 'daily', 7);
// Scala/internal usage (typically not user-facing); pythonFunction and inputExpr
// are placeholders for an existing PythonFunction and input Expression.
import org.apache.spark.api.python.PythonEvalType
import org.apache.spark.sql.catalyst.expressions.PythonUDTF
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val pythonUDTF = PythonUDTF(
  name = "my_udtf",
  func = pythonFunction,                    // the Python UDTF implementation
  elementSchema = StructType(Seq(StructField("result", StringType))),
  pickledAnalyzeResult = None,              // no Python-side analyze result
  children = Seq(inputExpr),                // input expressions/arguments
  evalType = PythonEvalType.SQL_TABLE_UDF,  // evaluation type for table UDFs
  udfDeterministic = true                   // deterministic for identical inputs
)
See Also¶
- `Generator` - Base interface for row-generating expressions
- `PythonFunction` - Python function wrapper interface
- `UnevaluableGenerator` - Parent class for generators that cannot be evaluated directly
- `PythonFuncExpression` - Mixin for Python function expressions
- SQL Table Functions documentation