CollationAwareXxHash64¶
Overview¶
CollationAwareXxHash64 is a Spark Catalyst expression that computes 64-bit XXHash values for input data while respecting string collation rules. It extends the standard hash functionality to handle collation-aware string comparisons, making it suitable for operations that need consistent hashing across different string representations that are collation-equivalent.
Syntax¶
Arguments¶
| Argument | Type | Description |
|---|---|---|
| children | Seq[Expression] | Sequence of expressions to hash |
| seed | Long | Seed value for the hash function |
Return Type¶
Returns a LongType representing the 64-bit hash value.
Supported Data Types¶
Supports all data types that can be converted to bytes for hashing:
- Numeric types (IntegerType, LongType, DoubleType, etc.)
- String types with collation awareness
- Binary data
- Complex types (arrays, maps, structs)
- Timestamp and date types
Algorithm¶
The expression evaluates using the following process:
- Converts input expressions to their byte representations
- Applies collation-aware transformations for string data
- Uses XXH64 algorithm to compute hash values from unsafe byte arrays
- Combines multiple input expressions using the seed value
- Returns a 64-bit long hash result
Partitioning Behavior¶
This expression affects partitioning in the following ways: - Preserves partitioning when used consistently across operations - May require shuffle when collation rules change partition boundaries - Supports hash-based partitioning schemes with collation awareness
Edge Cases¶
- Null values are handled according to Spark's null propagation rules
- Empty strings are hashed according to their collation representation
- Different collations of the same logical string produce the same hash
- Seed value of 0 uses the default XXH64 initialization
- Overflow is not applicable as hash functions produce bounded output
Code Generation¶
This expression supports Tungsten code generation for optimal performance. It generates unsafe memory access code for direct byte manipulation and leverages the native XXH64 implementation for maximum throughput.
Examples¶
// Internal usage in Spark operations
import org.apache.spark.sql.catalyst.expressions._
val expr1 = Literal("Hello")
val expr2 = Literal("World")
val hashExpr = CollationAwareXxHash64(Seq(expr1, expr2), seed = 42L)
See Also¶
XxHash64- Standard XXHash64 implementation without collation awarenessMurmur3Hash- Alternative hash function for non-collation casesHashPartitioning- Partitioning strategy that may use this expression