HllSketchEstimate¶

Overview¶

The HllSketchEstimate expression extracts the cardinality estimate from a HyperLogLog (HLL) sketch binary representation. It takes a serialized HLL sketch as input and returns the estimated distinct count as a long integer, providing approximate cardinality estimation with high efficiency and low memory usage.

Syntax¶

hll_sketch_estimate(sketch_binary)

// DataFrame API
df.select(expr("hll_sketch_estimate(hll_column)"))

Arguments¶

Argument	Type	Description
sketch_binary	BinaryType	Serialized HyperLogLog sketch as a byte array

Return Type¶

LongType - Returns the cardinality estimate as a long integer, rounded to the nearest whole number.

Supported Data Types¶

Input: BinaryType only - expects serialized HLL sketch data
Output: LongType - estimated distinct count

Algorithm¶

Deserializes the input byte array using Apache DataSketches HllSketch.heapify() method
Wraps the byte array in a Memory object for efficient processing
Calls getEstimate() on the deserialized HLL sketch to obtain the cardinality estimate
Rounds the floating-point estimate to the nearest long integer using Math.round()
Throws QueryExecutionErrors.hllInvalidInputSketchBuffer() if deserialization fails due to invalid sketch data

Partitioning Behavior¶

Preserves partitioning: Yes, this is a deterministic function that operates row-by-row
Requires shuffle: No, evaluation is local to each partition
Can be pushed down to individual partitions without affecting correctness

Edge Cases¶

Null handling: Returns null if input is null (null-intolerant behavior)
Invalid sketch data: Throws QueryExecutionErrors.hllInvalidInputSketchBuffer for corrupted or invalid binary data
Memory errors: Catches SketchesArgumentException and java.lang.Error during deserialization
Empty sketches: Valid empty HLL sketches return estimate of 0
Precision bounds: Estimate accuracy depends on the original HLL sketch configuration parameters

Code Generation¶

This expression uses CodegenFallback, meaning it does not support Tungsten code generation and falls back to interpreted evaluation mode. The nullSafeEval method is called directly during query execution.

Examples¶

-- Estimate distinct count from HLL sketch
SELECT hll_sketch_estimate(hll_sketch_agg(col)) FROM VALUES (1), (1), (2), (2), (3) tab(col);
-- Result: 3

-- Use with pre-computed HLL sketches
SELECT customer_segment, hll_sketch_estimate(user_sketch) 
FROM customer_analytics 
WHERE date_partition = '2023-01-01';

// DataFrame API usage
import org.apache.spark.sql.functions._

// Estimate from aggregated sketch
df.select(expr("hll_sketch_estimate(hll_sketch_agg(user_id))"))

// Estimate from stored sketch column
df.select($"segment", expr("hll_sketch_estimate(stored_hll_sketch)"))