Skip to content

PythonUDTF

Overview

PythonUDTF represents a User-Defined Table Function (UDTF) implemented in Python that generates multiple rows from input data. It extends UnevaluableGenerator and PythonFuncExpression, allowing Python functions to be executed as table-generating functions within Spark SQL queries.

Syntax

-- SQL syntax
SELECT * FROM python_udtf_function(arg1, arg2, ...) [PARTITION BY (partition_cols)]

-- With table arguments
SELECT * FROM python_udtf_function(TABLE(table_arg), scalar_arg, ...)
// DataFrame API usage (typically used internally)
// PythonUDTF expressions are usually created through SQL interface

Arguments

Argument Type Description
name String The name of the Python UDTF function
func PythonFunction The Python function implementation containing the actual UDTF logic
elementSchema StructType The schema definition of the output rows generated by the UDTF
pickledAnalyzeResult Option[Array[Byte]] Optional pickled analysis result from Python for optimization purposes
children Seq[Expression] The input expressions/arguments passed to the UDTF
evalType Int The evaluation type identifier for the Python function
udfDeterministic Boolean Whether the UDTF produces deterministic results for the same inputs
resultId ExprId Unique identifier for the expression result (cosmetic, doesn't affect computation)
pythonUDTFPartitionColumnIndexes Option[PythonUDTFPartitionColumnIndexes] Zero-based indexes of PARTITION BY expressions within TABLE arguments
tableArguments Option[Seq[Boolean]] Flags indicating which input arguments are table arguments vs scalar arguments

Return Type

Returns a collection of rows conforming to the elementSchema (StructType). Each execution can generate zero, one, or multiple output rows.

Supported Data Types

  • All Spark SQL data types supported by the Python-Spark interface
  • Scalar arguments: primitive types, complex types (arrays, maps, structs)
  • Table arguments: complete table/DataFrame inputs
  • Output schema must be representable as a StructType with supported column types

Algorithm

  • Expression is marked as UnevaluableGenerator, meaning it cannot be directly evaluated in normal expression context
  • Actual evaluation occurs in specialized physical operators that handle generator expressions
  • Python function is executed in Python worker processes with serialized input data
  • Results are deserialized back to Spark rows according to elementSchema
  • Partitioning behavior is controlled by pythonUDTFPartitionColumnIndexes when table arguments are present

Partitioning Behavior

  • Partition Preservation: Does not preserve input partitioning due to the generator nature
  • Shuffle Requirements: May require shuffle if PARTITION BY clauses are specified with table arguments
  • Distribution: Output rows are distributed based on the partitioning strategy of the containing query plan
  • Table Arguments: When table arguments with PARTITION BY are used, data is repartitioned according to pythonUDTFPartitionColumnIndexes

Edge Cases

  • Null Handling: Null handling behavior is determined by the Python function implementation
  • Empty Input: Can generate zero rows if the Python function returns empty results
  • Table Argument Boundaries: Partition boundaries for table arguments are strictly enforced
  • Schema Mismatch: Runtime errors occur if Python function output doesn't match elementSchema
  • Determinism: Non-deterministic UDTFs may produce different results across executions, affecting caching and optimization

Code Generation

  • Code Generation Support: No - extends UnevaluableGenerator which bypasses Tungsten code generation
  • Execution Mode: Always uses interpreted mode through Python worker processes
  • Serialization: Relies on Python-Spark serialization protocols for data exchange
  • Performance: Subject to serialization overhead and inter-process communication costs

Examples

-- Basic Python UDTF usage
SELECT * FROM python_split_function('hello,world,spark', ',');

-- With PARTITION BY on table argument  
SELECT * FROM python_window_function(TABLE(sales_data) PARTITION BY (region), 'sum');

-- Mixed scalar and table arguments
SELECT * FROM python_analyze_table(TABLE(user_events) PARTITION BY (user_id), 'daily', 7);
// Scala/internal usage (typically not user-facing)
val pythonUDTF = PythonUDTF(
  name = "my_udtf",
  func = pythonFunction,
  elementSchema = StructType(Seq(StructField("result", StringType))),
  pickledAnalyzeResult = None,
  children = Seq(inputExpr),
  evalType = PythonEvalType.SQL_TABLE_UDF,
  udfDeterministic = true
)

See Also

  • Generator - Base interface for row-generating expressions
  • PythonFunction - Python function wrapper interface
  • UnevaluableGenerator - Parent class for non-directly-evaluable generators
  • PythonFuncExpression - Mixin for Python function expressions
  • SQL Table Functions documentation