ApproxTopKCombine¶

Overview¶

ApproxTopKCombine is an internal aggregation expression used to combine intermediate results from ApproxTopK operations during the merge phase of distributed aggregation. It processes sketch data structures to maintain approximate counts of the top-K most frequent items across multiple partitions.

Syntax¶

This is an internal expression not directly callable from SQL or DataFrame API. It's automatically used during the combine phase of approx_top_k aggregation.

Arguments¶

Argument	Type	Description
child	Expression	Input expression containing serialized sketch data from partial aggregations
k	Int	Maximum number of top items to track
maxMapSize	Int	Maximum size of the internal frequency map to prevent excessive memory usage

Return Type¶

Returns a binary representation of the combined frequency sketch containing item-count pairs.

Supported Data Types¶

Accepts binary data type containing serialized sketch structures from the map phase of ApproxTopK aggregation.

Algorithm¶

Deserializes multiple frequency sketches from input binary data
Merges frequency maps from different partitions by summing counts for identical items
Maintains only the top-K items by frequency after merging
Applies maxMapSize limit to prevent memory exhaustion during processing
Serializes the combined result back to binary format for final aggregation phase

Partitioning Behavior¶

How this expression affects partitioning:

Does not preserve partitioning as it's used in shuffle-based aggregation
Operates during the combine phase before final reduce step
Requires shuffle to collect partial results from different partitions

Edge Cases¶

Null input sketches are skipped during the merge process
Empty sketches contribute no items to the final result
When combined item count exceeds maxMapSize, lower frequency items are discarded
Identical items from different partitions have their frequencies summed
If fewer than K distinct items exist, returns all available items

Code Generation¶

This expression typically falls back to interpreted mode due to the complexity of sketch deserialization and merging operations that don't benefit significantly from code generation.

Examples¶

-- Not directly callable - used internally by:
SELECT approx_top_k(column_name, 5) FROM table_name;

// Not directly callable - used internally during:
df.agg(expr("approx_top_k(column_name, 5)"))