SortPrefix¶

Overview¶

The SortPrefix expression generates a 64-bit prefix (a long) for a sort key so that Spark can sort faster. The prefix is a fixed-width, order-preserving summary of the key: the sorter compares prefixes first and falls back to comparing the full keys only when two prefixes are equal, which avoids most full-key comparisons.
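The idea of comparing a cheap prefix before the full key can be illustrated with a small, self-contained sketch. It does not use Spark's classes: the `Keyed` record, the `toyPrefix` function (which simply takes the first character of the key), and the `ordering` value are hypothetical names chosen for illustration, and Spark's actual prefixes are computed differently.

```scala
object PrefixFirstSketch {
  // Hypothetical record: a precomputed 64-bit prefix stored next to the full key.
  final case class Keyed(prefix: Long, key: String)

  // Toy prefix (an assumption for this sketch): the first character's code, or 0 if empty.
  def toyPrefix(key: String): Long =
    if (key.isEmpty) 0L else key.charAt(0).toLong

  // Compare the cheap prefixes first; look at the full keys only when they tie.
  val ordering: Ordering[Keyed] = new Ordering[Keyed] {
    def compare(a: Keyed, b: Keyed): Int = {
      val byPrefix = java.lang.Long.compare(a.prefix, b.prefix)
      if (byPrefix != 0) byPrefix else a.key.compareTo(b.key)
    }
  }

  // Attach a prefix to every key, sort with the prefix-first ordering, drop the prefix.
  def sortKeys(keys: Seq[String]): Seq[String] =
    keys.map(k => Keyed(toyPrefix(k), k)).sorted(ordering).map(_.key)
}
```

Calling `PrefixFirstSketch.sortKeys(Seq("pear", "apple", "plum", "fig"))` settles every comparison by prefix alone except the one between "pear" and "plum", which share a first character and therefore need the full-key comparison.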

Syntax¶

This is an internal Catalyst expression with no SQL or DataFrame API syntax. Spark creates it internally when it sets up a physical sort (for example, for ORDER BY or orderBy); users never write it directly.

Arguments¶

Argument	Type	Description
child	SortOrder	The sort order expression containing the actual expression to generate prefix for and sort direction/null ordering

Return Type¶

LongType - For non-null input, returns a 64-bit long value that represents a sortable prefix of the input data. Null input yields null.

Supported Data Types¶

Boolean: Converted to 0 (false) or 1 (true)
Integral Types: Byte, Short, Int, Long - cast to long
Temporal Types: Date, Timestamp, TimestampNTZ, TimeType - cast to long
Interval Types: AnsiIntervalType - cast to long
Floating Point: Float, Double - converted using DoublePrefixComparator
String: UTF8String - converted using StringPrefixComparator
Binary: Byte arrays - converted using BinaryPrefixComparator
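Decimal: The unscaled long value when the precision fits in a long (up to 18 digits); otherwise a rescaled long or a double-based prefix, depending on precision and scale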
Other Types: Default to 0L prefix
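The comparator-based conversions for floating point, string, and binary values can be sketched as follows. This is a minimal illustration written from the descriptions above, not a copy of Spark's comparators: the object and method names are hypothetical, and it assumes that the resulting prefixes are compared as unsigned 64-bit values.

```scala
import java.nio.charset.StandardCharsets

object PrefixEncodingSketch {
  // Assumption: prefixes produced below are compared as unsigned 64-bit values.
  def compare(a: Long, b: Long): Int = java.lang.Long.compareUnsigned(a, b)

  // Doubles: treat -0.0 as 0.0, then flip the sign bit of non-negative values and
  // every bit of negative values, so that unsigned order matches numeric order.
  def doublePrefix(value: Double): Long = {
    val normalized = if (value == 0.0) 0.0 else value // the test is true for 0.0 and -0.0
    val bits = java.lang.Double.doubleToLongBits(normalized) // maps all NaNs to one bit pattern
    val mask = -(bits >>> 63) | Long.MinValue
    bits ^ mask
  }

  // Bytes: pack up to the first 8 bytes, most significant byte first, zero-padded.
  def bytesPrefix(bytes: Array[Byte]): Long = {
    var prefix = 0L
    var i = 0
    while (i < 8) {
      val b = if (i < bytes.length) bytes(i) & 0xFFL else 0L
      prefix = (prefix << 8) | b
      i += 1
    }
    prefix
  }

  // Strings: the same packing applied to the UTF-8 encoding of the string.
  def stringPrefix(s: String): Long = bytesPrefix(s.getBytes(StandardCharsets.UTF_8))
}
```

With this encoding, negative infinity sorts below every finite value and NaN sorts above positive infinity. Two strings that share their first 8 bytes get the same prefix, so the sorter would still have to compare the full values to order them.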

Algorithm¶

Evaluates the child expression to get the raw value
Returns null if the input value is null
Applies type-specific prefix calculation based on the child's data type
For decimals, uses different strategies based on precision/scale to fit in long or falls back to double conversion
For floating point, uses DoublePrefixComparator to handle NaN and infinity correctly
For strings and binary data, uses specialized comparators to generate meaningful prefixes
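The first three steps above can be sketched in plain Scala. This is a simplified model under stated assumptions, not the Catalyst implementation: `KeyType` and its cases are hypothetical stand-ins for Spark's data types, `Option[Long]` stands in for a nullable long, and only the boolean, long-backed, and default branches are shown.

```scala
object SortPrefixEvalSketch {
  // Hypothetical stand-ins for a few of the supported data type groups.
  sealed trait KeyType
  case object BooleanKey extends KeyType
  case object LongLikeKey extends KeyType     // integral and other long-backed values
  case object UnsupportedKey extends KeyType  // any type without a dedicated prefix

  // `raw` is the already-evaluated child value; null input produces no prefix.
  def prefix(keyType: KeyType, raw: Any): Option[Long] =
    if (raw == null) None
    else keyType match {
      case BooleanKey     => Some(if (raw.asInstanceOf[Boolean]) 1L else 0L)
      case LongLikeKey    => Some(raw.asInstanceOf[java.lang.Number].longValue())
      case UnsupportedKey => Some(0L)
    }
}
```

For instance, `SortPrefixEvalSketch.prefix(SortPrefixEvalSketch.LongLikeKey, 42)` yields `Some(42L)`, while a null input yields `None` regardless of the key type.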

Partitioning Behavior¶

Preserves Partitioning: Yes, this is a unary expression that doesn't change data distribution
Requires Shuffle: No, operates on individual rows independently

Edge Cases¶

Null Handling: Returns null for null inputs; the nullValue property supplies the prefix that sort operations use for nulls, chosen from the null ordering (NullsFirst/NullsLast) and the sort direction
Decimal Overflow: For decimals that are rescaled to fit in a long, a value that still does not fit after rescaling saturates to Long.MinValue if negative and Long.MaxValue otherwise; decimals whose integer part is too wide for a long use the double-based prefix instead
Floating Point Special Values: NaN and infinity are handled by DoublePrefixComparator
Default Case: Unsupported types default to 0L prefix
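The decimal overflow case can be made concrete with a short sketch built on `java.math.BigDecimal`. It is an illustration under assumptions rather than Spark's code: the object and method names are hypothetical, 18 is assumed to be the largest digit count that always fits in a long, and half-up rounding is assumed for the rescaling step.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object DecimalPrefixSketch {
  // Assumption: at most 18 decimal digits are guaranteed to fit in a signed 64-bit long.
  val MaxLongDigits = 18

  // Rescale a decimal of the declared (precision, scale) so that it uses at most
  // 18 digits in total, then saturate if rounding pushed it past that limit.
  def rescaledPrefix(value: JBigDecimal, precision: Int, scale: Int): Long = {
    require(precision - scale <= MaxLongDigits, "integer part must fit in 18 digits")
    val targetScale = MaxLongDigits - (precision - scale)
    val rounded = value.setScale(targetScale, RoundingMode.HALF_UP)
    if (rounded.precision() <= MaxLongDigits) rounded.unscaledValue().longValueExact()
    else if (rounded.signum() < 0) Long.MinValue
    else Long.MaxValue
  }
}
```

In this sketch, a value with 20 fractional nines declared as precision 20 and scale 20 rounds up to 1 followed by 18 zeros, which needs 19 digits, so the prefix saturates to `Long.MaxValue`.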

Code Generation¶

Supports code generation. The doGenCode method emits Java code specialized for the child's data type, so the prefix is computed in generated code rather than through interpreted evaluation.

Examples¶

-- Not directly accessible in SQL
-- Generated automatically during ORDER BY operations
SELECT * FROM table ORDER BY column_name;

// Not directly accessible in the DataFrame API
// Generated automatically during sort operations
// (the $"..." column syntax assumes spark.implicits._ has been imported)
df.orderBy($"column_name")