HllSketchAgg¶

Overview¶

HllSketchAgg is an aggregate expression that creates HyperLogLog (HLL) sketches for approximate distinct counting. It uses the DataSketches library to maintain a probabilistic data structure that can estimate cardinality with configurable precision, making it efficient for large-scale distinct count operations.

Syntax¶

hll_sketch_agg(column [, lgConfigK])

// DataFrame API
df.agg(expr("hll_sketch_agg(column, lgConfigK)"))

Arguments¶

Argument	Type	Description
column	IntegerType, LongType, StringType, BinaryType	The column or expression to create the HLL sketch from
lgConfigK	IntegerType	Optional. Log base 2 of K, the number of buckets. Larger values give higher precision at the cost of a larger sketch. Must be a constant in the range 4 to 21. Defaults to HllSketch.DEFAULT_LG_K (12)
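
To make the effect of lgConfigK concrete, the sketch below computes the bucket count K = 2^lgConfigK, an approximate relative standard error, and the register footprint for 8-bit registers. The object `HllPrecision` and the helper `describePrecision` are hypothetical names. The error figure uses the textbook HyperLogLog bound of roughly 1.04 / sqrt(K), which is an assumption here; the estimators in DataSketches may give somewhat different numbers.

```scala
// Rough precision/memory trade-off for a given lgConfigK.
// Assumption: the classic HyperLogLog error bound 1.04 / sqrt(K);
// this formula is not taken from the DataSketches implementation.
object HllPrecision {
  final case class Precision(buckets: Int, relStdError: Double, registerBytes: Int)

  // Hypothetical helper, for illustration only.
  def describePrecision(lgConfigK: Int): Precision = {
    require(lgConfigK >= 4 && lgConfigK <= 21, "lgConfigK must be in [4, 21]")
    val k = 1 << lgConfigK                  // number of buckets
    val rse = 1.04 / math.sqrt(k.toDouble)  // textbook relative standard error
    Precision(k, rse, k)                    // 8-bit registers: one byte per bucket
  }

  def main(args: Array[String]): Unit =
    Seq(8, 12, 14).foreach { lg =>
      val p = describePrecision(lg)
      println(f"lgConfigK=$lg%2d buckets=${p.buckets}%6d rse=${p.relStdError * 100}%.2f%% bytes=${p.registerBytes}")
    }
}
```

Each increment of lgConfigK doubles the register storage while reducing the expected error by a factor of about sqrt(2).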

Return Type¶

BinaryType - Returns a serialized HLL sketch as a byte array that can be stored, transmitted, or further processed.

Supported Data Types¶

IntegerType
LongType
StringType (including collated strings; trim collations are supported)
BinaryType

Note: Floating point types are explicitly not supported due to precision issues. Array and other complex types are not implemented.

Algorithm¶

Creates an HLL sketch whose precision is set by the lgConfigK parameter
Uses the HLL_8 target type (one 8-bit register per bucket), which favors update and merge speed over the more compact HLL_4 and HLL_6 encodings
Updates the sketch by hashing each input value and updating the corresponding bucket
Merges partial sketches with Union operations when combining results across partitions
Serializes the final sketch to an updatable byte array for storage
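
The steps above can be illustrated with a toy HyperLogLog that keeps one byte per register. This is a simplified sketch under stated assumptions, not the DataSketches implementation: the class `ToyHll`, the use of `MurmurHash3`, and the classic estimator with a linear-counting correction are all choices made for illustration.

```scala
import scala.util.hashing.MurmurHash3

// Toy HyperLogLog with one byte per register (conceptually similar to HLL_8).
// Simplified for illustration; this is NOT the DataSketches implementation.
final class ToyHll(val lgK: Int) {
  // The alpha constant used in estimate is only valid for K >= 128.
  require(lgK >= 7 && lgK <= 21, "toy sketch supports lgK in [7, 21]")
  private val k = 1 << lgK
  val registers: Array[Byte] = new Array[Byte](k)

  // Hash the value, pick a bucket from the low lgK bits, and record the
  // position of the first set bit in the remaining bits.
  def update(value: String): Unit = {
    val bytes = value.getBytes("UTF-8")
    // Combine two 32-bit hashes with different seeds into one 64-bit hash.
    val h = (MurmurHash3.bytesHash(bytes, 1).toLong << 32) |
      (MurmurHash3.bytesHash(bytes, 2).toLong & 0xffffffffL)
    val bucket = (h & (k - 1)).toInt
    val rest = h >>> lgK
    // The guard bit caps the rank at 64 - lgK + 1.
    val rank = (java.lang.Long.numberOfTrailingZeros(rest | (1L << (64 - lgK))) + 1).toByte
    if (rank > registers(bucket)) registers(bucket) = rank
  }

  // Union: element-wise maximum of the registers (same lgK required).
  def merge(other: ToyHll): Unit = {
    require(other.lgK == lgK, "toy merge requires equal lgK")
    var i = 0
    while (i < k) {
      if (other.registers(i) > registers(i)) registers(i) = other.registers(i)
      i += 1
    }
  }

  // Classic HyperLogLog estimate with the small-range (linear counting) correction.
  def estimate: Double = {
    val alpha = 0.7213 / (1.0 + 1.079 / k)
    val sum = registers.map(r => math.pow(2.0, -r.toDouble)).sum
    val raw = alpha * k * k / sum
    val zeros = registers.count(_ == 0)
    if (raw <= 2.5 * k && zeros > 0) k * math.log(k.toDouble / zeros) else raw
  }

  // "Serialization" here is just a copy of the register array.
  def toByteArray: Array[Byte] = registers.clone()
}

object ToyHllDemo {
  def main(args: Array[String]): Unit = {
    val sketch = new ToyHll(12)
    // 100000 updates, but only 20000 distinct values.
    (1 to 100000).foreach(i => sketch.update(s"user-${i % 20000}"))
    println(f"estimate ~ ${sketch.estimate}%.0f (true distinct count: 20000)")
  }
}
```

Because a register only ever grows to the maximum rank seen, repeated values leave the sketch unchanged, which is what makes it suitable for distinct counting.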

Partitioning Behavior¶

How this expression affects partitioning:

Does not preserve partitioning (this is an aggregate function)
Requires a shuffle to merge the partial sketches built on different partitions
Uses Union-based merging to combine sketches across partitions
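
The following self-contained simulation suggests why Union-based merging is safe across partitions: building one partial sketch per partition and merging them yields the same registers as a single pass over all rows. The register logic is a toy model written for this example (the names `PartitionMergeDemo`, `partialSketch`, and `union` are hypothetical), not Spark's or DataSketches' code.

```scala
import scala.util.hashing.MurmurHash3

// Simulates partial aggregation per partition followed by a merge step.
// Toy register logic for illustration; not Spark's or DataSketches' code.
object PartitionMergeDemo {
  val LgK = 10
  val K: Int = 1 << LgK

  // Build a register array (one byte per bucket) from one partition's values.
  def partialSketch(values: Iterator[Long]): Array[Byte] = {
    val regs = new Array[Byte](K)
    values.foreach { v =>
      val h = MurmurHash3.stringHash(v.toString) // 32-bit hash
      val bucket = h & (K - 1)                   // low LgK bits select the bucket
      val rank = (Integer.numberOfTrailingZeros((h >>> LgK) | (1 << (32 - LgK))) + 1).toByte
      if (rank > regs(bucket)) regs(bucket) = rank
    }
    regs
  }

  // Union of two sketches: element-wise maximum of the registers.
  def union(a: Array[Byte], b: Array[Byte]): Array[Byte] =
    a.zip(b).map { case (x, y) => if (x >= y) x else y }

  def main(args: Array[String]): Unit = {
    val data = (1L to 50000L).map(_ % 7000)
    val partitions = data.grouped(12500).toSeq // four simulated partitions
    val merged = partitions.map(p => partialSketch(p.iterator)).reduce(union(_, _))
    val singlePass = partialSketch(data.iterator)
    // Merging partial sketches gives the same registers as one pass over all rows.
    println(merged.sameElements(singlePass)) // prints "true"
  }
}
```

The element-wise maximum is associative and commutative, so the order in which partial sketches arrive after the shuffle does not affect the result.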

Edge Cases¶

Null values are ignored and do not contribute to the sketch
Empty byte arrays are ignored during updates
Strings that are collation-equal to the empty string are ignored
Non-constant lgConfigK parameter throws QueryExecutionError
Unsupported data types throw SparkUnsupportedOperationException with error class "_LEGACY_ERROR_TEMP_3121"
The expression is marked as non-nullable (always returns a result)
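
The input-filtering rules above can be modeled as a small predicate. This is a minimal sketch, assuming plain emptiness stands in for collation-aware equality and a standard `UnsupportedOperationException` stands in for Spark's exception type; `HllInputFilter` and `shouldUpdate` are hypothetical names, not part of Spark.

```scala
// Models which input values would contribute to the sketch.
// Hypothetical helper for illustration; not Spark's actual code.
object HllInputFilter {
  def shouldUpdate(value: Any): Boolean = value match {
    case null            => false      // nulls are ignored
    case b: Array[Byte]  => b.nonEmpty // empty byte arrays are ignored
    case s: String       => s.nonEmpty // simplification: ignores collation rules
    case _: Int | _: Long => true
    case other =>
      // Stand-in for the unsupported-type error described above.
      throw new UnsupportedOperationException(s"Unsupported input type: ${other.getClass.getName}")
  }

  def main(args: Array[String]): Unit = {
    val inputs: Seq[Any] = Seq(null, "", "a", Array.emptyByteArray, Array[Byte](1), 42, 7L)
    inputs.foreach(v => println(shouldUpdate(v)))
  }
}
```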

Code Generation¶

This expression does not support Tungsten code generation. It extends TypedImperativeAggregate, which manages its aggregation buffer imperatively rather than through code-generated aggregation, so it is evaluated in interpreted mode.

Examples¶

-- Basic HLL sketch with default precision
SELECT hll_sketch_agg(user_id) FROM events;

-- HLL sketch with custom precision (lgConfigK = 12)
SELECT hll_sketch_agg(user_id, 12) FROM events GROUP BY date;

-- Using with string data
SELECT hll_sketch_agg(email_domain, 10) FROM users;

// DataFrame API usage
import org.apache.spark.sql.functions._

// Default precision
df.agg(expr("hll_sketch_agg(user_id)"))

// Custom precision
df.groupBy("category").agg(expr("hll_sketch_agg(product_id, 14)"))

// With binary data
df.select(expr("hll_sketch_agg(binary_column, 8)"))
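
Since the result is a serialized sketch rather than a number, it is usually combined with the companion SQL functions `hll_sketch_estimate` and `hll_union_agg`. The example below is a sketch of that workflow, assuming Spark 3.5 or later on the classpath and a local session; the object `HllSketchUsage`, its helper methods, and the `day` and `user_id` columns are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object HllSketchUsage {
  // One serialized sketch (BinaryType) per day.
  def dailySketches(events: DataFrame): DataFrame =
    events.groupBy("day").agg(expr("hll_sketch_agg(user_id, 12)").as("sketch"))

  // Union the daily sketches and estimate the overall distinct count.
  def totalEstimate(events: DataFrame): Long =
    dailySketches(events)
      .agg(expr("hll_sketch_estimate(hll_union_agg(sketch))"))
      .head()
      .getLong(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("hll-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: (day, user_id)
    val events = Seq(("d1", 1), ("d1", 2), ("d1", 2), ("d2", 2), ("d2", 3)).toDF("day", "user_id")

    // Approximate distinct users per day.
    dailySketches(events)
      .select($"day", expr("hll_sketch_estimate(sketch)").as("approx_users"))
      .show()

    // Approximate distinct users across all days, from the stored sketches.
    println(s"approx distinct users overall: ${totalEstimate(events)}")
    spark.stop()
  }
}
```

Storing one sketch per group and unioning them later avoids rescanning the raw rows when a coarser rollup is needed. All sketches being unioned here share the same lgConfigK.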