CollationAwareXxHash64

Overview

CollationAwareXxHash64 is a Spark Catalyst expression that computes a 64-bit xxHash (XXH64) value over its inputs while respecting string collation rules. It extends the standard hash expression so that strings which compare equal under their collation (for example, strings differing only in case under a case-insensitive collation) produce the same hash. This makes it suitable for operations that must hash collation-equivalent strings consistently.
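
The idea of hashing a collation-aware form of a string can be sketched outside Spark. The example below is only an illustration under stated assumptions: it uses the JDK's `java.text.Collator` at primary strength as a stand-in for a case-insensitive Spark collation, and `java.util.Arrays.hashCode` as a stand-in for XXH64. The object and method names are hypothetical and are not part of Spark.

```scala
import java.text.Collator
import java.util.Locale

object CollationKeySketch {
  // JDK collator at PRIMARY strength: ignores case and accent differences.
  // This is a stand-in for a Spark collation, not Spark's own implementation.
  private val collator: Collator = {
    val c = Collator.getInstance(Locale.ENGLISH)
    c.setStrength(Collator.PRIMARY)
    c
  }

  // Bytes of the collation key; strings that compare equal yield equal bytes.
  def keyBytes(s: String): Array[Byte] =
    collator.getCollationKey(s).toByteArray

  // Stand-in hash over the key bytes (NOT XXH64), widened to a Long.
  def collationAwareHash(s: String): Long =
    java.util.Arrays.hashCode(keyBytes(s)).toLong
}
```

With this sketch, `CollationKeySketch.collationAwareHash("Hello")` and `CollationKeySketch.collationAwareHash("hello")` return the same value, because both strings map to the same key bytes before hashing.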

Syntax

-- Internal expression, typically used by Spark in aggregations and joins
-- No direct SQL syntax is available

// Catalyst expression constructor (internal API, not a public DataFrame function)
CollationAwareXxHash64(children: Seq[Expression], seed: Long)

Arguments

Argument	Type	Description
children	Seq[Expression]	Sequence of expressions to hash
seed	Long	Seed value for the hash function

Return Type

Returns a LongType value containing the 64-bit hash.

Supported Data Types

Supports all data types that can be converted to bytes for hashing:

Numeric types (IntegerType, LongType, DoubleType, etc.)
String types with collation awareness
Binary data
Complex types (arrays, maps, structs)
Timestamp and date types

Algorithm

The expression evaluates using the following process:

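Evaluates each child expression in order, starting with the seed as the running hash
For string values, hashes a collation-aware representation so that strings equal under the collation hash identically
Hashes the bytes of each value with the XXH64 algorithm, reading them through Spark's unsafe memory access
Uses the hash of each child as the seed for the next child, so every input contributes to the result
Returns the final 64-bit hash as a long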
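
The chaining described above can be sketched in plain Scala. This is a minimal illustration, not Spark's code: a seeded FNV-1a hash stands in for XXH64 to keep the block short, and the `Collation` trait with its `Binary` and `LowerCase` cases is a hypothetical model of collation, not a Spark type.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Locale

object ChainedHashSketch {
  // Stand-in 64-bit hash (seeded FNV-1a). This is NOT XXH64.
  def hashBytes(bytes: Array[Byte], seed: Long): Long = {
    var h = 0xcbf29ce484222325L ^ seed
    for (b <- bytes) {
      h ^= (b & 0xffL)
      h *= 0x100000001b3L
    }
    h
  }

  // Hypothetical collation model: key(s) is the form of the string that gets hashed.
  sealed trait Collation { def key(s: String): String }
  case object Binary extends Collation {
    def key(s: String): String = s
  }
  case object LowerCase extends Collation {
    def key(s: String): String = s.toLowerCase(Locale.ROOT)
  }

  // Hash the values in order; each result becomes the seed for the next value.
  def hashAll(values: Seq[String], collation: Collation, seed: Long): Long =
    values.foldLeft(seed) { (running, s) =>
      hashBytes(collation.key(s).getBytes(UTF_8), running)
    }
}
```

Under the `LowerCase` model, `hashAll(Seq("Hello", "World"), LowerCase, 42L)` equals `hashAll(Seq("HELLO", "world"), LowerCase, 42L)`, while under `Binary` the two inputs hash differently.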

Partitioning Behavior

This expression affects partitioning in the following ways:

Preserves partitioning when used consistently across operations
May require shuffle when collation rules change partition boundaries
Supports hash-based partitioning schemes with collation awareness
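
To illustrate why collation awareness matters for hash-based partitioning, the sketch below maps a 64-bit hash to a partition index with a non-negative modulo. It is an assumption-laden illustration rather than Spark's partitioner: `caseInsensitiveHash` is a hypothetical stand-in for a collation-aware hash, and nothing here claims that Spark assigns partitions with this expression.

```scala
import java.util.Locale

object PartitionSketch {
  // Hypothetical stand-in for a collation-aware hash: hash the lower-cased key.
  def caseInsensitiveHash(s: String): Long =
    s.toLowerCase(Locale.ROOT).hashCode.toLong

  // Map a 64-bit hash to a partition index in [0, numPartitions).
  // floorMod keeps the result non-negative even for negative hashes.
  def partitionFor(hash: Long, numPartitions: Int): Int =
    Math.floorMod(hash, numPartitions.toLong).toInt
}
```

Keys that are equal under the collation, such as "Hello" and "HELLO" here, land in the same partition, which is what a join or aggregation on a collated column needs.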

Edge Cases

Null inputs do not make the result null: a null child is skipped and leaves the running hash unchanged
Empty strings are hashed according to their collation representation
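Strings that compare equal under their collation (for example, "Hello" and "hello" under a case-insensitive collation) produce the same hash; the same string hashed under different collations is not guaranteed to
A seed of 0 is a valid seed and gets no special treatment; Spark's built-in xxhash64 SQL function uses a seed of 42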
Overflow is not applicable as hash functions produce bounded output
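
The null, empty-string, and seed cases can be made concrete with a small sketch. It assumes that a null input is skipped and leaves the running hash unchanged, which is an assumption about how Spark's hash expressions behave rather than something verified here. `None` models a SQL NULL, and a seeded FNV-1a hash stands in for XXH64; all names are hypothetical.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object NullSkippingHashSketch {
  // Stand-in seeded 64-bit hash (FNV-1a). This is NOT XXH64.
  def hashBytes(bytes: Array[Byte], seed: Long): Long =
    bytes.foldLeft(0xcbf29ce484222325L ^ seed) { (h, b) =>
      (h ^ (b & 0xffL)) * 0x100000001b3L
    }

  // None models a SQL NULL: it is skipped and the running hash is left as is.
  def hashAll(values: Seq[Option[String]], seed: Long): Long =
    values.foldLeft(seed) {
      case (running, Some(s)) => hashBytes(s.getBytes(UTF_8), running)
      case (running, None)    => running
    }
}
```

In this model a row of only nulls hashes to the seed itself, an empty string still changes the running hash, and changing the seed changes the result for the same input.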

Code Generation

This expression supports Tungsten code generation. The generated code reads bytes through unsafe memory access and calls Spark's XXH64 implementation, which is written in Java rather than native code.

Examples

-- Not directly accessible in SQL
-- Used internally by Spark for collation-aware operations

// Internal usage in Spark operations
import org.apache.spark.sql.catalyst.expressions._

// Two string literals to hash
val expr1 = Literal("Hello")
val expr2 = Literal("World")
// Build the expression over both children with seed 42
val hashExpr = CollationAwareXxHash64(Seq(expr1, expr2), seed = 42L)