ThetaSketchAgg

Overview

ThetaSketchAgg is a Spark SQL aggregate function that creates a Theta sketch for approximate distinct counting and set operations. It builds a probabilistic data structure that estimates cardinality with configurable accuracy while using a bounded amount of memory determined by the lg_nom_entries parameter.
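
For orientation, here is a minimal end-to-end sketch in Scala. The events table and user_id column are hypothetical, and the result is read back with theta_sketch_estimate (see "See Also"):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("theta-sketch-overview").getOrCreate()

// Build one Theta sketch over user_id and turn it into an approximate
// distinct count in the same query. Table and column names are hypothetical.
spark.sql(
  """
    |SELECT theta_sketch_estimate(theta_sketch_agg(user_id)) AS approx_distinct_users
    |FROM events
  """.stripMargin).show()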

Syntax

theta_sketch_agg(column [, lg_nom_entries])

Arguments

  • column (various types) - The input column or expression to sketch; see Supported Data Types below.
  • lg_nom_entries (Integer, optional) - Log base-2 of the sketch's nominal entry count (default: 12, i.e. 4096 nominal entries). Higher values increase accuracy at the cost of memory.

Return Type

Returns BinaryType - a serialized compact Theta sketch as a byte array.
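
If the bytes follow the Apache DataSketches compact Theta sketch format (which the UpdateSketch/Union terminology in the Algorithm section suggests), they can also be consumed outside Spark. The snippet below is a sketch of that, assuming datasketches-java and datasketches-memory are on the classpath; the data is made up:

import org.apache.datasketches.memory.Memory
import org.apache.datasketches.theta.Sketches
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("theta-sketch-bytes").getOrCreate()
import spark.implicits._

// Aggregate a tiny in-memory dataset and collect the serialized sketch.
val bytes: Array[Byte] = Seq(1L, 2L, 2L, 3L).toDF("user_id")
  .agg(expr("theta_sketch_agg(user_id)").as("sketch"))
  .head()
  .getAs[Array[Byte]]("sketch")

// Assumption: the bytes are a standard DataSketches compact Theta sketch,
// so they can be wrapped and queried directly with the Java library.
val sketch = Sketches.wrapSketch(Memory.wrap(bytes))
println(s"Approximate distinct count: ${sketch.getEstimate}") // ~3.0 here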

Supported Data Types

  • IntegerType (promoted to Long)
  • LongType
  • FloatType (promoted to Double)
  • DoubleType
  • StringType (with collation support)
  • BinaryType
  • ArrayType(IntegerType)
  • ArrayType(LongType)
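
As a quick illustration of the type support above, a single aggregation can sketch numeric, string, and array columns side by side (hypothetical data; one sketch per column):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("theta-sketch-types").getOrCreate()
import spark.implicits._

// Long, String, and Array[Long] columns are all accepted input types.
val df = Seq(
  (1L, "example.com", Array(1L, 2L)),
  (2L, "example.org", Array(3L))
).toDF("user_id", "domain", "item_ids")

df.agg(
  expr("theta_sketch_agg(user_id)").as("users_sketch"),
  expr("theta_sketch_agg(domain)").as("domains_sketch"),
  expr("theta_sketch_agg(item_ids)").as("item_ids_sketch")
).show(truncate = false)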

Algorithm

  • Creates an UpdateSketch with configurable nominal entries parameter
  • Updates the sketch with input values, handling type-specific conversions
  • Merges partial sketches using Union operations when combining partitions
  • Uses lazy Union creation for efficient merging across partitions
  • Serializes final result as a compact binary sketch
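
The following standalone Scala sketch mirrors that lifecycle using the Apache DataSketches Java library directly. It is an illustration of the steps above, not Spark's internal code, and the union.union call assumes a recent datasketches-java release (older releases use union.update):

import org.apache.datasketches.theta.{SetOperation, Union, UpdateSketch}

// Illustrative partial-aggregation lifecycle; not Spark's actual implementation.
val lgNomEntries = 12
val nominalEntries = 1 << lgNomEntries // default lg_nom_entries = 12 -> 4096 entries

// 1. Each partition builds its own UpdateSketch from the input values.
def buildPartial(values: Iterator[Long]): UpdateSketch = {
  val sketch = UpdateSketch.builder().setNominalEntries(nominalEntries).build()
  values.foreach(v => sketch.update(v))
  sketch
}

// 2. Partial sketches are merged with a Union created when the first merge happens.
def mergePartials(partials: Seq[UpdateSketch]): Union = {
  val union = SetOperation.builder().setNominalEntries(nominalEntries).buildUnion()
  partials.foreach(s => union.union(s)) // union.update(s) in older releases
  union
}

// 3. The final result is serialized as a compact, immutable sketch.
val partials = Seq(buildPartial(Iterator(1L, 2L, 3L)), buildPartial(Iterator(3L, 4L)))
val resultBytes: Array[Byte] = mergePartials(partials).getResult.toByteArray
println(s"Serialized sketch: ${resultBytes.length} bytes")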

Partitioning Behavior

How this expression affects partitioning:

  • Requires shuffle for final aggregation across partitions
  • Does not preserve input partitioning
  • Creates partial sketches per partition, then merges during shuffle phase
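
In practice this means you can pre-aggregate one small, fixed-size sketch per group and roll those up later instead of reshuffling raw rows. A sketch of that pattern, assuming a hypothetical events table and a two-argument scalar theta_sketch_union as listed under "See Also":

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("theta-sketch-rollup").getOrCreate()

// Pre-aggregate one sketch per region; later rollups shuffle compact sketches
// instead of raw events. Table, column, and region values are hypothetical.
spark.sql("""
  CREATE OR REPLACE TEMP VIEW region_sketches AS
  SELECT region, theta_sketch_agg(user_id) AS sketch
  FROM events
  GROUP BY region
""")

// Combine two regions pairwise; theta_sketch_union is assumed to take two sketches.
spark.sql("""
  SELECT theta_sketch_estimate(theta_sketch_union(a.sketch, b.sketch)) AS approx_users
  FROM region_sketches a
  JOIN region_sketches b
    ON a.region = 'emea' AND b.region = 'apac'
""").show()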

Edge Cases

  • Null values are ignored and do not update the sketch
  • Empty byte arrays are ignored
  • Empty arrays of supported element types are ignored
  • Strings that compare equal to the empty string under the column's collation are ignored
  • Invalid lg_nom_entries values raise a query execution error (QueryExecutionErrors)
  • Non-constant lg_nom_entries expressions are rejected
  • Unsupported data types throw SparkUnsupportedOperationException
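
For example, null inputs simply drop out of the estimate (made-up inline data; theta_sketch_estimate from "See Also" reads the result):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("theta-sketch-nulls").getOrCreate()

// Two NULLs and one duplicate: only the three distinct non-null ids should count.
spark.sql("""
  SELECT theta_sketch_estimate(theta_sketch_agg(user_id)) AS approx_distinct
  FROM VALUES (1L), (NULL), (2L), (3L), (NULL), (3L) AS t(user_id)
""").show() // expected estimate: 3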

Code Generation

This expression does not support Tungsten code generation and falls back to interpreted mode, because it extends TypedImperativeAggregate, which requires custom handling of its object-based aggregation buffer.

Examples

-- Basic usage with default parameters
SELECT theta_sketch_agg(user_id) FROM events;

-- With custom nominal entries for higher accuracy
SELECT theta_sketch_agg(user_id, 14) FROM events;

-- Sketching string values
SELECT theta_sketch_agg(email_domain) FROM users;

// DataFrame API usage
import org.apache.spark.sql.functions._

df.agg(expr("theta_sketch_agg(user_id)"))

// With custom parameters  
df.agg(expr("theta_sketch_agg(user_id, 16)"))

See Also

  • theta_sketch_estimate - Extract cardinality estimate from Theta sketch
  • theta_sketch_union - Union multiple Theta sketches
  • approx_count_distinct - Alternative approximate distinct counting