HiveHash

Overview

The HiveHash expression computes a hash value of one or more input arguments using Hive's hashing algorithm. It returns a 32-bit integer hash code that combines the hash values of all input expressions using a polynomial rolling hash with multiplier 31.
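
Conceptually, the combine step can be sketched in a few lines of Scala (an illustration of the multiplier-31 scheme only, not Spark's actual implementation, which lives in HiveHashFunction):

// Minimal sketch of the multiplier-31 combine step (illustrative only)
def combineHashes(childHashes: Seq[Int]): Int =
  childHashes.foldLeft(0)((acc, childHash) => 31 * acc + childHash)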

Syntax

hive_hash(expr1, expr2, ...)
// DataFrame API
col("column").expr("hive_hash(col1, col2)")

Arguments

Argument            Type   Description
expr1, expr2, ...   Any    One or more expressions of any supported data type to hash

Return Type

IntegerType - Returns a 32-bit signed integer hash value.

Supported Data Types

The expression supports all Spark SQL data types:

  • Primitive types: Integer, Long, Float, Double, Boolean, Byte, Short
  • String types (with collation awareness)
  • Binary data
  • Decimal types
  • Timestamp and Date types
  • Calendar intervals
  • Complex types: Arrays, Maps, Structs

Algorithm

  • Initializes hash value to seed 0
  • For each input expression, computes its individual hash using HiveHashFunction.hash()
  • Combines hashes using polynomial rolling hash: hash = (31 * hash) + childHash
  • Uses collation-aware hashing for string types when applicable
  • Handles nested complex types recursively: arrays use a 31x rolling multiplier, maps XOR each key hash with its value hash (see the sketch after this list)
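
A rough sketch of the nested-type rules, assuming the element hashes have already been computed (the real logic lives in HiveHashFunction.hash()):

// Illustrative sketch of the nested-type rules above, operating on
// already-computed element hashes (not Spark's actual code)
def arrayHash(elementHashes: Seq[Int]): Int =
  elementHashes.foldLeft(0)((acc, h) => 31 * acc + h)  // arrays: 31x rolling combine

def mapHash(entryHashes: Seq[(Int, Int)]): Int =
  entryHashes.foldLeft(0) { case (acc, (kh, vh)) => acc + (kh ^ vh) }  // maps: key hash XOR value hash per entry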

Partitioning Behavior

  • Preserves partitioning when used as a partitioning expression
  • Does not require shuffle when applied to already partitioned data
  • Commonly used for bucketing and data distribution in Hive-compatible scenarios (see the sketch below)
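
For example, an explicit bucket column derived from hive_hash can drive repartitioning (a sketch; df, the column names, and the bucket count of 16 are hypothetical, and this assumes hive_hash is available in SQL expressions as shown above):

import org.apache.spark.sql.functions.{col, expr, lit, pmod}

// Derive an explicit bucket id from hive_hash and repartition on it so
// that rows with the same customer_id land in the same partition.
val bucketed = df
  .withColumn("bucket", pmod(expr("hive_hash(customer_id)"), lit(16)))
  .repartition(16, col("bucket"))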

Edge Cases

  • Null inputs contribute 0 to the final hash (nulls are handled safely)
  • Empty arrays and maps contribute only their base (seed) hash, since there are no elements to fold in
  • The expression itself never returns null (it always produces an integer result; demonstrated after this list)
  • String collation affects hash computation for non-binary-equal collations
  • Decimal values are normalized before hashing to ensure consistent results
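
A quick way to observe the null behavior (a sketch; df and the column names are hypothetical):

import org.apache.spark.sql.functions.expr

// hive_hash is defined even when inputs are NULL: a null child simply
// contributes 0 to the running hash instead of propagating.
df.select(
  expr("hive_hash(nullable_col)").alias("h1"),
  expr("hive_hash(nullable_col, other_col)").alias("h2")
)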

Code Generation

This expression fully supports Tungsten code generation:

  • Generates optimized Java code for hash computation (inspectable as shown below)
  • Splits complex expressions across multiple methods to avoid JVM method size limits
  • Uses unsafe memory operations for efficient byte array hashing
  • Falls back to interpreted mode only in exceptional cases
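
The generated code can be inspected with Spark's debug helpers (a sketch; df and the column names are hypothetical):

import org.apache.spark.sql.execution.debug._
import org.apache.spark.sql.functions.expr

// Print the Java code Tungsten generates for this plan, including the
// hash computation, to stdout.
df.select(expr("hive_hash(col1, col2)")).debugCodegen()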

Examples

-- Hash a single column
SELECT hive_hash(user_id) FROM users;

-- Hash multiple columns for bucketing
SELECT hive_hash(customer_id, order_date) FROM orders;

-- Hash complex data types
SELECT hive_hash(array_col, map_col, struct_col) FROM complex_table;
// DataFrame API usage
import org.apache.spark.sql.functions._

df.select(expr("hive_hash(col1, col2)").alias("hash_value"))

// For bucketing (note: bucketBy takes plain column names, and Spark's
// built-in bucketing applies its own hash to them)
df.write
  .bucketBy(10, "customer_id")
  .saveAsTable("bucketed_table")

See Also

  • hash() - Spark's default hash function (MurmurHash3)
  • xxhash64() - 64-bit xxHash algorithm
  • md5() - MD5 cryptographic hash
  • sha1(), sha2() - SHA cryptographic hash functions
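
To compare these functions on the same input, the built-in hash and xxhash64 Column functions can be placed alongside hive_hash (a sketch; df and user_id are hypothetical):

import org.apache.spark.sql.functions.{col, expr, hash, xxhash64}

// Three hash families over the same column: Murmur3 (hash), xxHash64,
// and Hive's algorithm via the SQL expression.
df.select(
  hash(col("user_id")).alias("murmur3"),
  xxhash64(col("user_id")).alias("xxhash64"),
  expr("hive_hash(user_id)").alias("hive_hash")
)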