Shuffle¶

Overview¶

The Shuffle expression returns a random permutation of the elements of an input array. It is a non-deterministic expression: the random number generator is initialized once per partition from the seed, so results are reproducible only when the same seed is used and the data is partitioned and ordered the same way.

Syntax¶

shuffle(array)
shuffle(array, seed)

// DataFrame API
shuffle(col("array_column"))
shuffle(col("array_column"), lit(42))

Arguments¶

Argument	Type	Description
array	ArrayType	The input array to be shuffled
seed	Expression (optional)	Random seed that makes the shuffle reproducible across runs; if omitted, a seed is generated

Return Type¶

Returns an ArrayType with the same element type and nullability as the input array.

Supported Data Types¶

Input: ArrayType containing elements of any data type
Output: Same ArrayType as input with preserved element types and nullability

Algorithm¶

Evaluates the input array expression to get the source ArrayData
Initializes a RandomIndicesGenerator using the provided seed plus partition index
Generates a random permutation of indices from 0 to array length - 1
Creates a new GenericArrayData by mapping the shuffled indices to the original array elements
Preserves null elements and array structure while only reordering positions
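
The steps above can be sketched in plain Scala without any Spark dependency. This is an illustration under stated assumptions, not Spark's implementation: `randomIndices` and `shuffleSketch` are hypothetical names, `scala.util.Random` stands in for the generator behind RandomIndicesGenerator, and the permutation is built with an inside-out Fisher-Yates pass, which may differ from what Spark does internally.

```scala
import scala.util.Random

object ShuffleSketch {
  // Hypothetical stand-in for RandomIndicesGenerator: returns a random
  // permutation of the indices 0 until length (inside-out Fisher-Yates).
  def randomIndices(length: Int, rng: Random): Array[Int] = {
    val indices = new Array[Int](length)
    var i = 0
    while (i < length) {
      val j = rng.nextInt(i + 1)          // pick a slot in [0, i]
      if (j != i) indices(i) = indices(j) // move the displaced index to slot i
      indices(j) = i                      // place the new index in slot j
      i += 1
    }
    indices
  }

  // Builds the output by reading the source through the permuted indices.
  // Elements are copied as-is, so nulls keep their count and only move position.
  def shuffleSketch[T](source: IndexedSeq[T], seed: Long, partitionIndex: Int): IndexedSeq[T] = {
    val rng = new Random(seed + partitionIndex) // seed plus partition index
    randomIndices(source.length, rng).toIndexedSeq.map(i => source(i))
  }
}
```

Calling `ShuffleSketch.shuffleSketch(IndexedSeq(1, 2, 3, 4, 5), 42L, 0)` twice returns the same permutation, because the generator is rebuilt from the same seed and partition index each time.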

Partitioning Behavior¶

Preserves partitioning: Yes, each row's array is reordered independently and rows stay in their partition
Requires a data shuffle (exchange): No, this is a local transformation per partition
Seed behavior: Uses randomSeed + partitionIndex so that partitions draw different random sequences while results stay reproducible for a fixed seed and partition layout
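
A small plain-Scala model can make the seed behavior concrete. It is a sketch only: `shuffleInPartition` is a hypothetical helper, `scala.util.Random` is a stand-in generator, and a partition is modelled as an ordered collection of rows. The point it illustrates is that one generator is created per partition from seed + partition index and is then consumed row by row, so the outcome depends on both the partition a row lands in and its position there.

```scala
import scala.util.Random

object PartitionSeedSketch {
  // Hypothetical model of one partition: every row holds an array to shuffle.
  def shuffleInPartition[T](
      rows: Vector[Vector[T]],
      seed: Long,
      partitionIndex: Int): Vector[Vector[T]] = {
    val rng = new Random(seed + partitionIndex) // one generator per partition
    rows.map(row => rng.shuffle(row))           // rows consume the generator in order
  }
}
```

Running the helper twice with the same rows, seed, and partition index yields identical output. Changing the partition index, or the order of the rows, generally changes it, which is why a fixed seed alone does not guarantee identical results after repartitioning.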

Edge Cases¶

Null array input: Returns null without processing
Empty array: Returns empty array (no elements to shuffle)
Single element array: Returns the same single-element array
Null elements: Preserves null elements within the array structure
Unresolved seed: The expression remains unresolved until a seed is supplied, either by the caller or generated during analysis
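
The null-handling cases above can be modelled in plain Scala, with `Option` standing in for SQL NULL. This is a sketch under assumptions: `shuffleNullable` is a hypothetical name, and `scala.util.Random` replaces Spark's generator, so the concrete orderings will not match Spark's.

```scala
import scala.util.Random

object EdgeCaseSketch {
  // None models a NULL array; a None inside the vector models a NULL element.
  def shuffleNullable[T](
      array: Option[Vector[Option[T]]],
      seed: Long): Option[Vector[Option[T]]] =
    // A NULL array short-circuits to NULL; otherwise only positions change.
    array.map(elems => new Random(seed).shuffle(elems))
}
```

An empty or single-element input comes back unchanged because there is only one possible ordering, and an array with null elements keeps the same number of nulls after shuffling.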

Code Generation¶

Supports Tungsten code generation through the doGenCode method:

Generates optimized Java code using nullSafeCodeGen
Creates mutable state for RandomIndicesGenerator instance
Produces inline array creation and assignment loops
Falls back to interpreted evalInternal when code generation is disabled

Examples¶

-- Basic array shuffling
SELECT shuffle(array(1, 2, 3, 4, 5));
-- Possible result: [3, 1, 5, 2, 4] (order varies between runs)

-- Deterministic shuffling with seed
SELECT shuffle(array('a', 'b', 'c'), 42);
-- Possible result: ['c', 'a', 'b'] (illustrative; the same seed and partitioning reproduce the same order)

-- Shuffling arrays with null elements
SELECT shuffle(array(1, null, 3, null, 5));
-- Possible result: [null, 1, null, 5, 3] (nulls are kept; only positions change)

// DataFrame API usage
import org.apache.spark.sql.functions._

// Basic shuffle
df.select(shuffle(col("array_col")))

// With specific seed for reproducibility (the two-argument overload is not available in older Spark releases)
df.select(shuffle(col("array_col"), lit(12345)))

// Chaining with other array operations
df.select(shuffle(array_distinct(col("array_col"))))