SortPrefix¶
Overview¶
The SortPrefix expression computes a 64-bit long prefix for each row's sort key. It extracts a sortable prefix from the underlying data type so that the sorter can compare prefixes first, which is usually much cheaper than comparing full keys. The full comparison is only needed when two prefixes are equal.
Syntax¶
This is an internal Catalyst expression that is not directly accessible via SQL or the DataFrame API. Spark constructs it internally when it executes a sort (for example, for ORDER BY), so there is no user-facing syntax.
Arguments¶
| Argument | Type | Description |
|---|---|---|
| child | SortOrder | The sort order expression containing the actual expression to generate prefix for and sort direction/null ordering |
Return Type¶
LongType - Always returns a 64-bit long value that represents a sortable prefix of the input data.
Supported Data Types¶
- Boolean: Converted to 0 (false) or 1 (true)
- Integral Types: Byte, Short, Int, Long - cast to long
- Temporal Types: Date, Timestamp, TimestampNTZ, TimeType - cast to long
- Interval Types: AnsiIntervalType - cast to long
- Floating Point: Float, Double - converted using DoublePrefixComparator
- String: UTF8String - converted using StringPrefixComparator
- Binary: Byte arrays - converted using BinaryPrefixComparator
- Decimal: Various strategies based on precision and scale
- Other Types: Default to 0L prefix
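To make the string and binary entries above concrete, here is a minimal sketch of one common way to build a byte-based prefix: pack the first 8 bytes into a long, most significant byte first, and compare the results as unsigned values. The object and method names (`BytePrefixSketch`, `bytePrefix`, `stringPrefix`, `comparePrefixes`) are hypothetical and do not come from Spark. The sketch illustrates the technique only and is not Spark's actual StringPrefixComparator or BinaryPrefixComparator code.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical illustration; not part of Spark.
object BytePrefixSketch {
  // Packs the first 8 bytes into a long, most significant byte first.
  // Inputs shorter than 8 bytes are zero-padded on the right.
  def bytePrefix(bytes: Array[Byte]): Long = {
    var prefix = 0L
    var i = 0
    while (i < 8) {
      val b = if (i < bytes.length) bytes(i) & 0xFFL else 0L
      prefix = (prefix << 8) | b
      i += 1
    }
    prefix
  }

  // Strings use their UTF-8 encoding as the byte source.
  def stringPrefix(s: String): Long =
    bytePrefix(s.getBytes(StandardCharsets.UTF_8))

  // Bytes >= 0x80 set the sign bit, so prefixes must be compared as unsigned.
  def comparePrefixes(a: Long, b: Long): Int =
    java.lang.Long.compareUnsigned(a, b)
}
```

Two values that share their first 8 bytes get equal prefixes, which is why a sorter still needs a full comparison to break ties.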
Algorithm¶
- Evaluates the expression wrapped by the child SortOrder to get the raw value
- Returns null if the input value is null
- Applies type-specific prefix calculation based on the child's data type
- For decimals, uses different strategies based on precision/scale to fit in long or falls back to double conversion
- For floating point, uses DoublePrefixComparator to handle NaN and infinity correctly
- For strings and binary data, uses specialized comparators to generate meaningful prefixes
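The steps above can be sketched as a null check followed by a type-based dispatch. This is a simplified model, not Spark's code: `SimpleType` and its cases are hypothetical stand-ins for Catalyst data types, `Option[Long]` stands in for a nullable result, and only the boolean, integral, and default cases are modeled.

```scala
// Hypothetical illustration; not part of Spark.
object PrefixDispatchSketch {
  // Stand-ins for a handful of Catalyst data types.
  sealed trait SimpleType
  case object BoolT extends SimpleType
  case object ByteT extends SimpleType
  case object IntT extends SimpleType
  case object LongT extends SimpleType
  case object OtherT extends SimpleType

  // Returns None for a null input, otherwise the 64-bit prefix.
  def prefix(value: Any, dataType: SimpleType): Option[Long] =
    if (value == null) None
    else Some(dataType match {
      case BoolT  => if (value.asInstanceOf[Boolean]) 1L else 0L
      case ByteT  => value.asInstanceOf[Byte].toLong
      case IntT   => value.asInstanceOf[Int].toLong
      case LongT  => value.asInstanceOf[Long]
      case OtherT => 0L // unsupported types carry no ordering information
    })
}
```

With a constant 0L prefix, every comparison for an unsupported type ends in a tie, so ordering depends entirely on the full comparison.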
Partitioning Behavior¶
- Preserves Partitioning: Yes, this is a unary expression that doesn't change data distribution
- Requires Shuffle: No, operates on individual rows independently
Edge Cases¶
- Null Handling: Returns null for null inputs; provides a nullValue property for sort operations based on null ordering (NullsFirst/NullsLast) and sort direction
- Decimal Overflow: For high-precision decimals that can't fit in long, falls back to Long.MinValue for negative values and Long.MaxValue for positive values
- Floating Point Special Values: NaN and infinity are handled by DoublePrefixComparator
- Default Case: Unsupported types default to 0L prefix
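The floating point and decimal edge cases can be illustrated with a small sketch. All names here (`EdgeCaseSketch`, `doublePrefix`, `decimalPrefix`) are hypothetical. The double mapping is a standard bit trick (normalize -0.0, then flip the bits so that an unsigned comparison matches numeric order) and is assumed, not confirmed, to resemble what DoublePrefixComparator does. The decimal part assumes an 18-digit limit for values stored in a long and HALF_UP rounding, and it covers only the case where the integral digits fit within that limit.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Hypothetical illustration; not part of Spark.
object EdgeCaseSketch {
  // Maps a double to a long whose unsigned order matches numeric order,
  // with NaN sorting above positive infinity.
  def doublePrefix(value: Double): Long = {
    val normalized = if (value == -0.0) 0.0 else value // treat -0.0 as 0.0
    val bits = java.lang.Double.doubleToLongBits(normalized) // canonical NaN
    // Negative values: flip all bits. Non-negative values: flip the sign bit.
    val mask = -(bits >>> 63) | Long.MinValue
    bits ^ mask
  }

  // Assumption: at most 18 decimal digits are kept in the long prefix.
  private val MaxLongDigits = 18

  // Rescales a decimal so it fits in 18 digits; saturates if rounding overflows.
  def decimalPrefix(value: JBigDecimal, precision: Int, scale: Int): Long = {
    val integralDigits = precision - scale
    require(integralDigits <= MaxLongDigits, "sketch covers only this case")
    val targetScale = MaxLongDigits - integralDigits
    val rounded = value.setScale(targetScale, RoundingMode.HALF_UP)
    if (rounded.precision() <= MaxLongDigits) rounded.unscaledValue().longValueExact()
    else if (rounded.signum() < 0) Long.MinValue
    else Long.MaxValue
  }
}
```

Prefixes from `doublePrefix` are meant to be compared with `java.lang.Long.compareUnsigned`, while the decimal prefixes are ordinary signed longs.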
Code Generation¶
Supports full code generation (Tungsten). The doGenCode method generates optimized Java code for each supported data type, avoiding function call overhead and enabling CPU-efficient execution.
Examples¶
```sql
-- Not directly accessible in SQL
-- Generated automatically during ORDER BY operations
SELECT * FROM table ORDER BY column_name;
```

```scala
// Not directly accessible in DataFrame API
// Generated automatically during sort operations
df.orderBy($"column_name")
```
See Also¶
- SortOrder - The parent expression that contains sort direction and null ordering
- DoublePrefixComparator - Handles floating point prefix generation
- StringPrefixComparator - Handles string prefix generation
- BinaryPrefixComparator - Handles binary data prefix generation
- Sort-related expressions in Catalyst optimizer