ArrayAggregate¶

Overview¶

ArrayAggregate is a higher-order function that reduces an array to a single value by applying a merge function to each element with an accumulator, then applying a finish function to the final accumulator. It's equivalent to a fold operation followed by a transformation on arrays.

Syntax¶

aggregate(array_expr, initial_value, merge_function, finish_function)

Arguments¶

Argument	Type	Description
array_expr	ArrayType	The input array to aggregate
initial_value	AnyDataType	The initial accumulator value
merge_function	LambdaFunction	Function that takes (accumulator, element) and returns new accumulator
finish_function	LambdaFunction	Function that transforms the final accumulator to the result

Return Type¶

The data type returned by the finish_function expression.

Supported Data Types¶

Array: Any ArrayType for the input array
Initial Value: Any data type for the initial accumulator
Accumulator: Must maintain consistent type throughout merge operations
Result: Determined by the finish function's return type

Algorithm¶

Initialize accumulator with the initial_value
Iterate through each element in the input array
Apply merge_function(accumulator, element) to update the accumulator
After processing all elements, apply finish_function(accumulator) to get final result
Return null if the input array is null

Partitioning Behavior¶

How this expression affects partitioning (if applicable):

Preserves partitioning as it operates on individual array values within rows
Does not require shuffle operations
Each array is processed independently within its partition

Edge Cases¶

Null Array: Returns null if the input array is null
Empty Array: Processes zero elements, applies finish function to initial value
Null Elements: Null array elements are passed to the merge function as-is
Type Consistency: The merge function output type must match the initial value type structurally
Nullable Accumulator: Always treats accumulator as nullable for safety

Code Generation¶

This expression extends CodegenFallback, meaning it does not support Tungsten code generation and falls back to interpreted evaluation mode.

Examples¶

-- Sum array elements with final multiplication
SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x, acc -> acc * 10);
-- Result: 60

-- Concatenate strings with prefix
SELECT aggregate(array('a', 'b', 'c'), '', (acc, x) -> concat(acc, x), acc -> concat('prefix:', acc));
-- Result: 'prefix:abc'

-- Find maximum with default handling
SELECT aggregate(array(5, 2, 8, 1), 0, (acc, x) -> greatest(acc, x), acc -> acc);
-- Result: 8

// Example DataFrame API usage
import org.apache.spark.sql.functions._

df.select(
  expr("aggregate(numbers, 0, (acc, x) -> acc + x, acc -> acc * 2)")
).show()

// Using higher-order functions API
df.select(
  aggregate(
    col("array_col"), 
    lit(0), 
    (acc, x) => acc + x,
    acc => acc * 2
  )
).show()