Flatten¶

Overview¶

The flatten expression transforms a nested array (array of arrays) into a single-level array by concatenating all inner arrays. It flattens exactly one level of nesting, converting Array[Array[T]] to Array[T].

Syntax¶

flatten(array_of_arrays)

// DataFrame API
import org.apache.spark.sql.functions.flatten
df.select(flatten(col("nested_array")))

Arguments¶

Argument	Type	Description
array_of_arrays	Array[Array[T]]	A nested array where each element is itself an array

Return Type¶

Returns an array with the same element type as the inner arrays. If the input is Array[Array[T]], the output is Array[T].

Supported Data Types¶

Array of arrays with any element type (numeric, string, complex types, etc.)
The inner arrays must all have the same element type
Supports nested arrays with nullable elements

Algorithm¶

Validates that input is an array of arrays during type checking
Counts total number of elements across all inner arrays
Checks if total element count exceeds maximum array length limit
Creates a new flat array and copies elements from each inner array sequentially
Returns null if any inner array is null (null-intolerant behavior)

Partitioning Behavior¶

This expression does not affect partitioning:

Preserves existing partitioning schemes
Does not require data shuffling
Operates row-by-row within each partition

Edge Cases¶

Null handling: Returns null if the input array is null or if any inner array is null
Empty arrays: Empty inner arrays are skipped, contributing no elements to the result
Size limits: Throws QueryExecutionErrors.arrayFunctionWithElementsExceedLimitError if flattened array would exceed ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH
Type validation: Fails at analysis time if input is not an array of arrays
Nullable elements: Preserves nullability of inner array elements in the result

Code Generation¶

This expression supports Tungsten code generation (doGenCode) for optimized execution. It generates specialized code for:

Counting total elements across inner arrays
Allocating the result array with proper size
Copying elements using nested loops with bounds checking

Examples¶

-- Flatten array of arrays
SELECT flatten(array(array(1, 2), array(3, 4)));
-- Result: [1, 2, 3, 4]

-- With empty inner arrays
SELECT flatten(array(array(1, 2), array(), array(3)));
-- Result: [1, 2, 3]

-- With null handling
SELECT flatten(array(array(1, 2), null, array(3)));
-- Result: null

// DataFrame API usage
import org.apache.spark.sql.functions._

val df = Seq(
  Seq(Seq(1, 2), Seq(3, 4)),
  Seq(Seq(5), Seq(6, 7, 8))
).toDF("nested_arrays")

df.select(flatten(col("nested_arrays"))).show()
// +-------------------+
// |flatten(nested_arrays)|
// +-------------------+
// |        [1, 2, 3, 4]|
// |     [5, 6, 7, 8]   |
// +-------------------+