ArrayDistinct¶

Overview¶

The ArrayDistinct expression removes duplicate elements from an array while preserving the order of first occurrence. It handles null values by keeping only the first null encountered, and supports arrays containing any orderable data type.

Syntax¶

array_distinct(array_expr)

// DataFrame API
col("array_column").distinct()
// or using expr
expr("array_distinct(array_column)")

Arguments¶

Argument	Type	Description
array_expr	ArrayType	The input array from which to remove duplicates

Return Type¶

Returns an ArrayType with the same element type as the input array.

Supported Data Types¶

Supports arrays containing any data type that has ordering semantics:

Numeric types (IntegerType, LongType, DoubleType, etc.)
String types (StringType, VarcharType, CharType)
Date and timestamp types
Binary types
Nested types (arrays, structs, maps) that are orderable

Algorithm¶

The expression uses two different evaluation strategies based on element type:

For types with proper equals semantics: Uses SQLOpenHashSet for O(1) duplicate detection with special handling for NaN and null values
For types without proper equals: Falls back to O(n²) comparison using ordering semantics
Maintains insertion order by using ArrayBuffer to collect unique elements
Enforces maximum array size limit (MAX_ROUNDED_ARRAY_LENGTH) to prevent memory issues
Handles null deduplication by tracking whether a null has already been stored

Partitioning Behavior¶

This expression preserves partitioning characteristics:

Does not require data shuffle as it operates on individual array elements within each partition
Maintains the original partitioning scheme since it's a per-row transformation
No cross-partition dependencies

Edge Cases¶

Null arrays: Returns null when the input array itself is null (nullIntolerant = true)
Null elements: Preserves only the first null element encountered in the array
Empty arrays: Returns an empty array of the same type
NaN handling: Special logic for floating-point NaN values ensures proper deduplication
Size limits: Throws QueryExecutionErrors.arrayFunctionWithElementsExceedLimitError if result exceeds maximum array length
Non-orderable types: Fails type checking for element types that don't support ordering

Code Generation¶

Supports Tungsten code generation with optimizations:

Uses specialized hash sets for primitive types when canUseSpecializedHashSet is true
Falls back to interpreted evaluation for complex types without proper equals semantics
Generates efficient loops with minimal object allocation for supported types
Includes bounds checking and null tracking in generated code

Examples¶

-- Basic usage
SELECT array_distinct(array(1, 2, 3, 2, 1));
-- Result: [1, 2, 3]

-- With null values
SELECT array_distinct(array(1, 2, 3, null, 3, null));
-- Result: [1, 2, 3, null]

-- With strings
SELECT array_distinct(array('apple', 'banana', 'apple', 'cherry'));
-- Result: ['apple', 'banana', 'cherry']

-- Empty array
SELECT array_distinct(array());
-- Result: []

// DataFrame API usage
import org.apache.spark.sql.functions._

df.select(expr("array_distinct(items)").as("unique_items"))

// With column reference
df.select(array_distinct(col("array_column")))