ThetaSketchEstimate¶
Overview¶
The ThetaSketchEstimate expression extracts the estimated distinct count from a serialized Theta Sketch binary representation. It takes a binary-encoded Theta Sketch (typically produced by theta_sketch_agg) and returns the approximate number of distinct elements as a rounded long value.
Syntax¶
// DataFrame API
import org.apache.spark.sql.functions._
df.select(expr("theta_sketch_estimate(sketch_column)"))
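The same function can also be invoked through the SQL interface. A minimal sketch, assuming an active SparkSession named spark and a temporary view named sketches with a binary column sketch_column (both placeholder names):
// SQL invocation via spark.sql; `sketches` and `sketch_column` are placeholders
val estimates = spark.sql(
  "SELECT theta_sketch_estimate(sketch_column) AS distinct_estimate FROM sketches")
estimates.show()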
Arguments¶
| Argument | Type | Description |
|---|---|---|
| sketch_binary | BinaryType | A serialized Theta Sketch in binary format, typically produced by theta_sketch_agg aggregation |
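To make the argument concrete, the following sketch produces the binary input with theta_sketch_agg and then reads the estimate back out of it. The SparkSession, DataFrame, and column names are placeholders, not part of the API:
import org.apache.spark.sql.functions.expr
import spark.implicits._  // assumes an active SparkSession named spark

// Aggregate raw values into a single serialized Theta Sketch (BinaryType) ...
val sketches = Seq(1, 1, 2, 3).toDF("id")
  .agg(expr("theta_sketch_agg(id)").as("sketch"))
sketches.printSchema()  // the sketch column is binary

// ... then extract the approximate distinct count from it.
sketches.select(expr("theta_sketch_estimate(sketch)").as("estimate")).show()
// Expected estimate: 3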
Return Type¶
LongType - Returns the estimated distinct count as a long integer, rounded to the nearest whole number.
Supported Data Types¶
- Input: BinaryType only - expects serialized Theta Sketch binary data
- Output: LongType - 64-bit signed integer
Algorithm¶
- Deserializes the input binary array into a Theta Sketch object using ThetaSketchUtils.wrapCompactSketch
- Calls the underlying Theta Sketch's getEstimate() method to retrieve the probabilistic distinct count estimate
- Applies Math.round() to convert the floating-point estimate to the nearest long integer
- Uses null-safe evaluation to handle null inputs properly
- Validates the binary format during deserialization and throws errors for invalid sketch data
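For intuition, here is a conceptual re-implementation of that evaluation path using the Apache DataSketches Java library directly (org.apache.datasketches artifacts assumed on the classpath). It is an illustrative sketch, not the Spark-internal ThetaSketchUtils code:
import org.apache.datasketches.memory.Memory
import org.apache.datasketches.theta.Sketches

// Wrap the serialized compact sketch, read its estimate, and round to the nearest long.
def estimateFromBytes(sketchBytes: Array[Byte]): Option[Long] =
  Option(sketchBytes).map { bytes =>          // null input yields no estimate
    Math.round(Sketches.wrapSketch(Memory.wrap(bytes)).getEstimate)
  }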
Partitioning Behavior¶
- Preserves partitioning: Yes; the function is deterministic and operates row by row, leaving the input partitioning unchanged
- Requires shuffle: No; each sketch is processed independently, with no cross-partition dependencies
- Can be applied directly after a partitioned aggregation of Theta Sketches, as shown in the sketch below
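To verify that the estimate step is a plain projection over already-aggregated sketches, the physical plan can be inspected. A hedged sketch, assuming a placeholder table category_sketches holding one pre-aggregated sketch per category:
import org.apache.spark.sql.functions.{col, expr}

// Applying the estimate is a row-by-row projection over the existing partitions;
// the plan should show no additional Exchange introduced by this select.
spark.table("category_sketches")
  .select(col("category"), expr("theta_sketch_estimate(sketch)").as("estimated_users"))
  .explain()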
Edge Cases¶
- Null handling: Returns null when the input sketch_binary is null (null-intolerant behavior)
- Invalid binary format: Throws a runtime exception if the input binary cannot be deserialized as a valid Theta Sketch
- Empty sketch: Returns 0 for a valid but empty Theta Sketch
- Precision bounds: Result accuracy depends on the precision parameters the original sketch was configured with
- Large estimates: May lose precision for extremely large cardinalities due to the double-to-long conversion
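The null and empty-sketch behaviors can be checked directly. A hedged sketch, assuming an active SparkSession named spark and the org.apache.datasketches library on the classpath for building an empty compact sketch:
import org.apache.spark.sql.functions.expr
import org.apache.datasketches.theta.UpdateSketch
import spark.implicits._

// Null input propagates: the result is null, not an error.
spark.sql("SELECT theta_sketch_estimate(CAST(NULL AS BINARY)) AS est").show()

// A valid but empty compact sketch (no updates) should estimate 0.
val emptyBytes: Array[Byte] = UpdateSketch.builder().build().compact().toByteArray
Seq(emptyBytes).toDF("sketch")
  .select(expr("theta_sketch_estimate(sketch)").as("est"))
  .show()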
Code Generation¶
This expression uses CodegenFallback, meaning it does not support Tungsten code generation and falls back to interpreted evaluation mode. All operations are performed through method calls rather than generated bytecode.
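To inspect how a plan containing this expression is compiled, Spark's codegen explain mode can be used. A minimal sketch, assuming a DataFrame df with a binary sketch column:
import org.apache.spark.sql.functions.expr

// Prints the code generated for the surrounding plan; this expression itself is
// evaluated through its interpreted eval path because of CodegenFallback.
df.select(expr("theta_sketch_estimate(sketch)").as("estimate"))
  .explain("codegen")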
Examples¶
-- Estimate distinct count from aggregated theta sketch
SELECT theta_sketch_estimate(theta_sketch_agg(col))
FROM VALUES (1), (1), (2), (2), (3) AS tab(col);
-- Result: 3
-- Using with GROUP BY
SELECT category, theta_sketch_estimate(theta_sketch_agg(user_id)) as estimated_users
FROM user_events
GROUP BY category;
-- Reusing a sketch estimate in derived columns
SELECT
theta_sketch_estimate(sketch_col) as estimate,
theta_sketch_estimate(sketch_col) / 1000000.0 as estimate_millions
FROM sketch_table;
// DataFrame API usage
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"..." column syntax; assumes an active SparkSession named spark

// Basic estimation
df.select(expr("theta_sketch_estimate(sketch_column)").as("distinct_estimate"))

// With aggregation
df.groupBy("category")
  .agg(expr("theta_sketch_agg(user_id)").as("sketch"))
  .select($"category", expr("theta_sketch_estimate(sketch)").as("estimated_users"))
See Also¶
- theta_sketch_agg - Creates Theta Sketches from raw data
- approx_count_distinct - Alternative approximate distinct count function
- Other sketch-based expressions for cardinality estimation