Skip to content

CollationAwareXxHash64

Overview

CollationAwareXxHash64 is a Spark Catalyst expression that computes 64-bit XXHash values for input data while respecting string collation rules. It extends the standard hash functionality to handle collation-aware string comparisons, making it suitable for operations that need consistent hashing across different string representations that are collation-equivalent.

Syntax

-- Internal expression, typically used in aggregations and joins
-- No direct SQL syntax available
// DataFrame API usage (internal)
CollationAwareXxHash64(children: Seq[Expression], seed: Long)

Arguments

Argument Type Description
children Seq[Expression] Sequence of expressions to hash
seed Long Seed value for the hash function

Return Type

Returns a LongType representing the 64-bit hash value.

Supported Data Types

Supports all data types that can be converted to bytes for hashing:

  • Numeric types (IntegerType, LongType, DoubleType, etc.)
  • String types with collation awareness
  • Binary data
  • Complex types (arrays, maps, structs)
  • Timestamp and date types

Algorithm

The expression evaluates using the following process:

  • Converts input expressions to their byte representations
  • Applies collation-aware transformations for string data
  • Uses XXH64 algorithm to compute hash values from unsafe byte arrays
  • Combines multiple input expressions using the seed value
  • Returns a 64-bit long hash result

Partitioning Behavior

This expression affects partitioning in the following ways: - Preserves partitioning when used consistently across operations - May require shuffle when collation rules change partition boundaries - Supports hash-based partitioning schemes with collation awareness

Edge Cases

  • Null values are handled according to Spark's null propagation rules
  • Empty strings are hashed according to their collation representation
  • Different collations of the same logical string produce the same hash
  • Seed value of 0 uses the default XXH64 initialization
  • Overflow is not applicable as hash functions produce bounded output

Code Generation

This expression supports Tungsten code generation for optimal performance. It generates unsafe memory access code for direct byte manipulation and leverages the native XXH64 implementation for maximum throughput.

Examples

-- Not directly accessible in SQL
-- Used internally by Spark for collation-aware operations
// Internal usage in Spark operations
import org.apache.spark.sql.catalyst.expressions._

val expr1 = Literal("Hello")
val expr2 = Literal("World") 
val hashExpr = CollationAwareXxHash64(Seq(expr1, expr2), seed = 42L)

See Also

  • XxHash64 - Standard XXHash64 implementation without collation awareness
  • Murmur3Hash - Alternative hash function for non-collation cases
  • HashPartitioning - Partitioning strategy that may use this expression