ThetaSketchEstimate¶
Overview¶
The ThetaSketchEstimate expression extracts the estimated distinct count from a serialized Theta Sketch binary representation. It takes a binary-encoded Theta Sketch (typically produced by theta_sketch_agg) and returns the approximate number of distinct elements as a rounded long value.
Syntax¶
// DataFrame API
import org.apache.spark.sql.functions._
df.select(expr("theta_sketch_estimate(sketch_column)"))
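The same function can also be invoked through the SQL interface. A minimal sketch, assuming an active SparkSession named spark and a temporary view named sketches with a binary column sketch_column (both placeholder names):
// SQL invocation via spark.sql; `sketches` and `sketch_column` are placeholders
val estimates = spark.sql(
  "SELECT theta_sketch_estimate(sketch_column) AS distinct_estimate FROM sketches")
estimates.show()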
Arguments¶
| Argument | Type | Description |
|---|---|---|
| sketch_binary | BinaryType | A serialized Theta Sketch in binary format, typically produced by theta_sketch_agg aggregation |
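To make the argument concrete, the following sketch produces the binary input with theta_sketch_agg and then reads the estimate back out of it. The SparkSession, DataFrame, and column names are placeholders, not part of the API:
import org.apache.spark.sql.functions.expr
import spark.implicits._  // assumes an active SparkSession named spark

// Aggregate raw values into a single serialized Theta Sketch (BinaryType) ...
val sketches = Seq(1, 1, 2, 3).toDF("id")
  .agg(expr("theta_sketch_agg(id)").as("sketch"))
sketches.printSchema()  // the sketch column is binary

// ... then extract the approximate distinct count from it.
sketches.select(expr("theta_sketch_estimate(sketch)").as("estimate")).show()
// Expected estimate: 3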
Return Type¶
LongType - Returns the estimated distinct count as a long integer, rounded to the nearest whole number.
Supported Data Types¶
- Input: BinaryType only - expects serialized Theta Sketch binary data
- Output: LongType - 64-bit signed integer
Algorithm¶
- Deserializes the input binary array into a Theta Sketch object using ThetaSketchUtils.wrapCompactSketch
- Calls the underlying Theta Sketch's getEstimate() method to retrieve the probabilistic distinct count estimate
- Applies Math.round() to convert the floating-point estimate to the nearest long integer
- Uses null-safe evaluation to handle null inputs properly
- Validates the binary format during deserialization and throws errors for invalid sketch data
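For intuition, here is a conceptual re-implementation of that evaluation path using the Apache DataSketches Java library directly (org.apache.datasketches artifacts assumed on the classpath). It is an illustrative sketch, not the Spark-internal ThetaSketchUtils code:
import org.apache.datasketches.memory.Memory
import org.apache.datasketches.theta.Sketches

// Wrap the serialized compact sketch, read its estimate, and round to the nearest long.
def estimateFromBytes(sketchBytes: Array[Byte]): Option[Long] =
  Option(sketchBytes).map { bytes =>          // null input yields no estimate
    Math.round(Sketches.wrapSketch(Memory.wrap(bytes)).getEstimate)
  }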
Partitioning Behavior¶
- Preserves partitioning: Yes; the function is deterministic and operates row by row, leaving the input partitioning unchanged
- Requires shuffle: No; each sketch is processed independently, with no cross-partition dependencies
- Can be applied directly after a partitioned aggregation of Theta Sketches, as shown in the sketch below
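To verify that the estimate step is a plain projection over already-aggregated sketches, the physical plan can be inspected. A hedged sketch, assuming a placeholder table category_sketches holding one pre-aggregated sketch per category:
import org.apache.spark.sql.functions.{col, expr}

// Applying the estimate is a row-by-row projection over the existing partitions;
// the plan should show no additional Exchange introduced by this select.
spark.table("category_sketches")
  .select(col("category"), expr("theta_sketch_estimate(sketch)").as("estimated_users"))
  .explain()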
Edge Cases¶
- Null handling: Returns null when the input sketch_binary is null (null-intolerant behavior)
- Invalid binary format: Throws a runtime exception if the input binary cannot be deserialized as a valid Theta Sketch
- Empty sketch: Returns 0 for a valid but empty Theta Sketch
- Precision bounds: Result accuracy depends on the precision parameters the original sketch was configured with
- Large estimates: May lose precision for extremely large cardinalities due to the double-to-long conversion
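The null and empty-sketch behaviors can be checked directly. A hedged sketch, assuming an active SparkSession named spark and the org.apache.datasketches library on the classpath for building an empty compact sketch:
import org.apache.spark.sql.functions.expr
import org.apache.datasketches.theta.UpdateSketch
import spark.implicits._

// Null input propagates: the result is null, not an error.
spark.sql("SELECT theta_sketch_estimate(CAST(NULL AS BINARY)) AS est").show()

// A valid but empty compact sketch (no updates) should estimate 0.
val emptyBytes: Array[Byte] = UpdateSketch.builder().build().compact().toByteArray
Seq(emptyBytes).toDF("sketch")
  .select(expr("theta_sketch_estimate(sketch)").as("est"))
  .show()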
Code Generation¶
This expression uses CodegenFallback, meaning it does not support Tungsten code generation and falls back to interpreted evaluation mode. All operations are performed through method calls rather than generated bytecode.
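To inspect how a plan containing this expression is compiled, Spark's codegen explain mode can be used. A minimal sketch, assuming a DataFrame df with a binary sketch column:
import org.apache.spark.sql.functions.expr

// Prints the code generated for the surrounding plan; this expression itself is
// evaluated through its interpreted eval path because of CodegenFallback.
df.select(expr("theta_sketch_estimate(sketch)").as("estimate"))
  .explain("codegen")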
Examples¶
-- Estimate distinct count from aggregated theta sketch
SELECT theta_sketch_estimate(theta_sketch_agg(col))
FROM VALUES (1), (1), (2), (2), (3) AS tab(col);
-- Result: 3
-- Using with GROUP BY
SELECT category, theta_sketch_estimate(theta_sketch_agg(user_id)) as estimated_users
FROM user_events
GROUP BY category;
-- Reusing a sketch estimate in derived columns
SELECT
theta_sketch_estimate(sketch_col) as estimate,
theta_sketch_estimate(sketch_col) / 1000000.0 as estimate_millions
FROM sketch_table;
// DataFrame API usage
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"..." column syntax; assumes an active SparkSession named spark

// Basic estimation
df.select(expr("theta_sketch_estimate(sketch_column)").as("distinct_estimate"))

// With aggregation
df.groupBy("category")
  .agg(expr("theta_sketch_agg(user_id)").as("sketch"))
  .select($"category", expr("theta_sketch_estimate(sketch)").as("estimated_users"))
See Also¶
- theta_sketch_agg - Creates Theta Sketches from raw data
- approx_count_distinct - Alternative approximate distinct count function
- Other sketch-based expressions for cardinality estimation