
ArrayRepeat

Overview

ArrayRepeat is a Spark Catalyst expression that creates an array by repeating a given element a specified number of times. It accepts an element of any data type and an integer count that determines how many times the element appears in the resulting array.

Syntax

-- SQL
array_repeat(element, count)

// DataFrame API (Scala)
df.select(array_repeat(col("element"), col("count")))

Arguments

Argument  Type     Description
element   Any      The element to be repeated in the array
count     Integer  The number of times to repeat the element

Return Type

Returns an ArrayType with the element type matching the input element's data type. The array's element nullability matches the nullability of the input element.
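
As a rough check of this nullability rule, assuming a running SparkSession named spark, a non-nullable literal element should yield an array type whose containsNull flag is false:

import org.apache.spark.sql.functions.{array_repeat, lit}

// lit("x") is a non-nullable string and the count is a literal, so the
// expected schema is array<string> with containsNull = false and a
// non-nullable result column.
val out = spark.range(1).select(array_repeat(lit("x"), 2).as("arr"))
out.printSchema()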

Supported Data Types

  • Element: Any data type (AnyDataType)
  • Count: Integer types only (IntegerType)

Algorithm

  • Evaluates the count parameter first and returns null if count is null
  • Validates that the count doesn't exceed ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH to prevent memory issues
  • Evaluates the element parameter once and reuses the same value for all array positions
  • Creates a GenericArrayData using Array.fill() to efficiently populate the array
  • Handles null elements by either setting all positions to null or copying the non-null value (see the sketch after this list)
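
A minimal interpreted sketch of these steps (an illustration only, not the actual Spark source; the error here is raised as a plain RuntimeException rather than through QueryExecutionErrors):

import org.apache.spark.sql.catalyst.util.GenericArrayData
import org.apache.spark.unsafe.array.ByteArrayMethods

// The count is inspected first; a null count short-circuits to a null result.
// The element value is computed once and reused for every array slot.
def arrayRepeatSketch(element: Any, count: Any): Any = {
  if (count == null) {
    null
  } else {
    val n = math.max(count.asInstanceOf[Int], 0)   // negative counts act like 0
    if (n > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
      throw new RuntimeException(s"Cannot create an array with $n elements")
    }
    new GenericArrayData(Array.fill[Any](n)(element))
  }
}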

Partitioning Behavior

This expression does not affect partitioning as it operates on individual rows:

  • Preserves existing partitioning since it's a row-level transformation
  • Does not require shuffle operations
  • Can be executed independently on each partition (a quick check is shown below)
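
A quick way to confirm the row-level behavior, assuming a SparkSession named spark, is to compare partition counts before and after the projection:

import org.apache.spark.sql.functions.{array_repeat, col, lit}

val df = spark.range(0, 100).repartition(8)
val repeated = df.select(array_repeat(col("id"), lit(3)).as("ids"))

// array_repeat is evaluated per row, so the select introduces no shuffle
// and leaves the partition count unchanged.
assert(df.rdd.getNumPartitions == repeated.rdd.getNumPartitions)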

Edge Cases

  • Returns null if the count parameter is null (nullable behavior depends on count nullability)
  • Returns empty array if count is 0 or negative (negative counts are treated as 0)
  • Throws QueryExecutionErrors.createArrayWithElementsExceedLimitError if count exceeds maximum array length
  • Properly handles null elements by setting all array positions to null (illustrated below)
  • Maintains the nullability characteristics of the input element type
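
A rough illustration of the null-element case (null element but non-null count), again assuming a SparkSession named spark:

import org.apache.spark.sql.functions.{array_repeat, lit}

// A null element with count 2 yields an array of two null slots,
// roughly [null, null], rather than a null array.
spark.range(1)
  .select(array_repeat(lit(null).cast("string"), 2).as("arr"))
  .show(false)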

Code Generation

This expression supports Tungsten code generation through the doGenCode method:

  • Generates optimized code for array allocation using CodeGenerator.createArrayData
  • Uses efficient loops for element assignment with CodeGenerator.setArrayElement
  • Includes null-safety checks and proper exception handling in generated code
  • Avoids interpreted evaluation overhead for better performance (the generated code can be inspected as shown below)
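
One way to inspect the whole-stage generated code for a query that uses array_repeat is Spark's codegen debug helper (a general debugging tool, not specific to this expression):

import org.apache.spark.sql.execution.debug._
import org.apache.spark.sql.functions.{array_repeat, lit}

// Prints the Java source emitted by whole-stage code generation, including
// the array allocation and the element-assignment loop for array_repeat.
spark.range(3)
  .select(array_repeat(lit("x"), 2).as("arr"))
  .debugCodegen()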

Examples

-- Create array with repeated string
SELECT array_repeat('hello', 3);
-- Result: ["hello", "hello", "hello"]

-- Create array with repeated number
SELECT array_repeat(42, 2);
-- Result: [42, 42]

-- Handle null count (returns null)
SELECT array_repeat('test', null);
-- Result: null

-- Empty array for zero count
SELECT array_repeat('item', 0);
-- Result: []
// DataFrame API usage
import org.apache.spark.sql.functions.{array_repeat, col, lit}

df.select(array_repeat(lit("value"), col("repeat_count")))

// With column values
df.select(array_repeat(col("item"), lit(5)))

See Also

  • array() - Create arrays from multiple expressions
  • explode() - Expand arrays into multiple rows
  • array_contains() - Check if array contains element
  • size() - Get array length