XxHash64

Overview

The XxHash64 expression computes a 64-bit hash of one or more input values using the XXH64 algorithm. The result is deterministic for a given input and seed, which makes it useful for data distribution, bucketing, and detecting accidental changes to data. XXH64 is a non-cryptographic hash, so it should not be relied on for security-sensitive integrity checks.

Syntax

xxhash64(expr1[, expr2, ...])

Arguments

Argument	Type	Description
expr1, expr2, ...	Any supported type	One or more expressions to hash
seed	Long	Not a SQL argument. It is an internal parameter of the XxHash64 expression, fixed at 42L when the function is called from SQL or the DataFrame API

Return Type

LongType - Returns a 64-bit signed long integer representing the hash value.
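
Because the result is a signed Long, a well-distributed hash is negative about half of the time, while other XXH64 tools often print the same 64 bits as an unsigned decimal or hex value. A minimal sketch of converting between the two views, using only JDK calls (the object and method names are illustrative and not part of Spark):

```scala
// Illustrative helper; SignedHashView and its methods are hypothetical names, not Spark APIs.
object SignedHashView {
  // Unsigned decimal rendering of the same 64 bits.
  def toUnsignedDecimal(hash: Long): String =
    java.lang.Long.toUnsignedString(hash)

  // Zero-padded, 16-digit hex rendering of the same 64 bits.
  def toHex(hash: Long): String = f"$hash%016x"
}
```

For instance, SignedHashView.toHex(-1L) yields ffffffffffffffff, which is the form to compare against a tool that reports unsigned hex digests.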

Supported Data Types

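Most Spark SQL data types are supported as input, including:

Numeric types (IntegerType, LongType, DoubleType, etc.)
String types (StringType)
Binary types (BinaryType)
Complex types (ArrayType, StructType; MapType only when spark.sql.legacy.allowHashOnMapType is set to true, since Spark 3.0 and later reject map inputs to hash expressions by default)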
Date and timestamp types
Boolean type

Algorithm

The expression is evaluated as follows:

Initializes the running hash with the seed (42L by default)
Hashes the inputs from left to right, passing the running hash as the seed for the next input, so argument order affects the result
Uses the XXH64 implementation in the XXH64 class
Delegates the per-type hash computation to the XxHash64Function.hash() method
Ignores string collation rules (isCollationAware = false), so strings that compare equal under a non-binary collation can still hash to different values
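
One way to picture how several inputs end up in a single hash is a left fold in which each result seeds the next step. The following is a minimal sketch, assuming Spark threads the running hash through the inputs in order and leaves it unchanged for nulls. It is a simplified model for Long inputs only, not Spark's source code: XxHashChainSketch, hashLong, and hashAll are illustrative names, and hashLong follows the published XXH64 steps for a single 8-byte value.

```scala
// Simplified model of seed chaining; NOT Spark's implementation.
object XxHashChainSketch {
  // XXH64 prime constants from the xxHash specification.
  private val Prime1 = 0x9E3779B185EBCA87L
  private val Prime2 = 0xC2B2AE3D27D4EB4FL
  private val Prime3 = 0x165667B19E3779F9L
  private val Prime4 = 0x85EBCA77C2B2AE63L
  private val Prime5 = 0x27D4EB2F165667C5L

  // XXH64 of one 8-byte value: short-input path followed by the final avalanche.
  def hashLong(input: Long, seed: Long): Long = {
    var h = seed + Prime5 + 8L
    h ^= java.lang.Long.rotateLeft(input * Prime2, 31) * Prime1
    h = java.lang.Long.rotateLeft(h, 27) * Prime1 + Prime4
    avalanche(h)
  }

  // Final mixing step that spreads every input bit across the output.
  private def avalanche(h0: Long): Long = {
    var h = h0
    h ^= h >>> 33
    h *= Prime2
    h ^= h >>> 29
    h *= Prime3
    h ^= h >>> 32
    h
  }

  // Each input is hashed with the previous result as its seed.
  // None stands in for SQL NULL and leaves the running hash unchanged (assumption).
  def hashAll(values: Seq[Option[Long]], seed: Long = 42L): Long =
    values.foldLeft(seed) {
      case (running, Some(v)) => hashLong(v, running)
      case (running, None)    => running
    }
}
```

Under this model, hashAll(Seq(Some(1L), Some(2L))) equals hashLong(2L, hashLong(1L, 42L)), and an all-null input list returns the seed itself.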

Partitioning Behavior

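How this expression interacts with partitioning:

It is evaluated row by row, so computing it does not trigger a shuffle or change the existing partitioning
It is deterministic, so the same input always maps to the same hash value and therefore to the same bucket
The result is signed, so derive a bucket index with pmod(xxhash64(...), n) rather than %, which can return negative values
Its good distribution properties make it a reasonable basis for custom bucketing or partitioning keys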
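
When a hash is reduced to a bucket index, the sign of the result matters because the return type is a signed Long. A small sketch of the difference between the remainder operator and a floor-based modulus, which is what Spark SQL's pmod function provides on the SQL side (HashBuckets and bucketOf are hypothetical names used only for illustration):

```scala
// Illustrative only: bucketOf is a hypothetical helper, not a Spark API.
object HashBuckets {
  // floorMod returns a value in [0, numBuckets) for any hash when numBuckets > 0,
  // whereas the % operator keeps the sign of the hash.
  def bucketOf(hash: Long, numBuckets: Int): Int = {
    require(numBuckets > 0, "numBuckets must be positive")
    java.lang.Math.floorMod(hash, numBuckets.toLong).toInt
  }
}
```

With a negative hash such as -7L, the expression -7L % 10 evaluates to -7, while HashBuckets.bucketOf(-7L, 10) evaluates to 3, so only the latter is safe to use as an array index or bucket number.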

Edge Cases

Null handling: A null input leaves the running hash unchanged, so xxhash64(NULL) returns the seed value (42)
Empty collections: Empty arrays, maps, or structs produce deterministic hash values
Seed variation: Different seed values produce unrelated hash values for the same input
Cross-platform consistency: Hash values do not depend on the platform; consistency across Spark versions is expected but should be verified before relying on persisted hash values

Code Generation

This expression supports Spark's Catalyst code generation (Tungsten) for optimized runtime performance, inheriting code generation capabilities from the HashExpression base class.

Examples

-- Hash a single string value
SELECT xxhash64('Spark');

-- Hash multiple values with default seed
SELECT xxhash64('Spark', array(123), 2);

-- A trailing number is hashed as another input; it is not a seed
SELECT xxhash64('data', 12345);

-- Derive a non-negative bucket number (pmod avoids negative results)
SELECT pmod(xxhash64(user_id), 10) AS partition_key FROM users;

// DataFrame API usage
import org.apache.spark.sql.functions._

df.select(xxhash64(col("name"), col("age")))

// In a SQL expression string; the literal 12345 is hashed as a third input, not used as a seed
df.selectExpr("xxhash64(name, age, 12345)")

// Repartition by a non-negative bucket derived from the hash
df.repartition(expr("pmod(xxhash64(user_id), 100)"))