TransformExpression¶
Overview¶
TransformExpression is a Spark Catalyst expression that represents partition transform functions used in partitioned tables. It wraps a BoundFunction and provides compatibility checking between transform expressions to enable optimizations like storage-partitioned joins.
Syntax¶
This expression is primarily used internally by Spark for partition transforms and is not directly accessible through SQL or DataFrame API. It represents transforms like:
-- Examples of underlying transforms this expression represents
bucket(32, column)
year(date_column)
month(date_column)
Arguments¶
| Argument | Type | Description |
|---|---|---|
| function | BoundFunction | The transform function itself used for compatibility decisions |
| children | Seq[Expression] | Child expressions that provide input values |
| numBucketsOpt | Option[Int] | Number of buckets for bucket transforms, None for other transforms |
Return Type¶
The return type is determined by the underlying BoundFunction.resultType(). The expression is always nullable regardless of input nullability.
Supported Data Types¶
Supported data types depend on the underlying BoundFunction:
- Bucket transforms: typically numeric and string types
- Date/time transforms (year, month, day, hour): timestamp and date types
- Identity transforms: any data type
Algorithm¶
The expression evaluation follows these steps:
- Attempts to resolve the function as a
ScalarFunctionusingV2ExpressionUtils - For bucket transforms, prepends the number of buckets as a literal argument
- Evaluates the resolved scalar function against the input row
- Throws
QueryExecutionErrorif the function cannot be resolved or evaluated - Falls back to interpreted mode as code generation is not supported
Partitioning Behavior¶
This expression is specifically designed for partitioning operations:
- Preserves partitioning when used in partition schemes
- Enables storage-partitioned joins when compatible transforms are detected
- Compatible transforms can avoid shuffle operations in join scenarios
- Function compatibility is determined through
ReducibleFunctionrelationships
Edge Cases¶
- Null handling: Expression is always nullable and returns null for null inputs
- Function resolution failure: Throws
QueryExecutionErrorif the underlying function cannot be resolved - Incompatible functions: Non-reducible functions are never considered compatible
- Bucket mismatch: Bucket transforms with different bucket counts are incompatible unless reducible
- Code generation: Always falls back to interpreted mode, throws error if code generation is attempted
Code Generation¶
This expression does not support Tungsten code generation and always throws a QueryExecutionError when doGenCode is called. All evaluation is performed in interpreted mode through the resolved scalar function.
Examples¶
// Internal usage - not directly accessible via public APIs
// Represents partition transforms in table definitions
// Bucket transform with 32 buckets
TransformExpression(
bucketFunction,
Seq(col("user_id")),
Some(32)
)
// Year transform for date partitioning
TransformExpression(
yearFunction,
Seq(col("created_date")),
None
)
See Also¶
BoundFunction- The underlying function interfaceReducibleFunction- Interface for compatible transform functionsScalarFunction- Scalar function implementations used for evaluationV2ExpressionUtils- Utility for resolving scalar functions- Storage-partitioned joins optimization