CreateArray¶

Overview¶

The CreateArray expression creates an array from a sequence of individual elements. It takes multiple input expressions as children and combines them into a single array value, handling type coercion to ensure all elements have a compatible common type.

Syntax¶

array(expr1, expr2, ..., exprN)

// DataFrame API
import org.apache.spark.sql.functions.array
array(col("col1"), col("col2"), lit(3))

Arguments¶

Argument	Type	Description
children	Seq[Expression]	Variable number of expressions that will become array elements
useStringTypeWhenEmpty	Boolean	Flag determining default element type for empty arrays (StringType if true, NullType if false)

Return Type¶

ArrayType with element type determined by finding the common type among all input expressions. The array is marked as containing nulls if any input expression is nullable.

Supported Data Types¶

All primitive types (numeric, string, boolean, binary, date, timestamp)
Complex types (arrays, maps, structs)
All input expressions must have compatible types that can be coerced to a common type

Algorithm¶

Validates that all input expressions have compatible data types using TypeUtils.checkForSameTypeInputExpr
Determines the common element type using TypeCoercion.findCommonTypeDifferentOnlyInNullFlags
Falls back to default element type (StringType or NullType) if no common type can be found
Evaluates each child expression and wraps results in a GenericArrayData structure
Sets containsNull flag based on whether any input expression is nullable

Partitioning Behavior¶

This expression preserves partitioning as it operates row-by-row without requiring data movement:

Preserves existing partitioning schemes
Does not require shuffle operations
Can be executed independently on each partition

Edge Cases¶

Empty arrays: Uses defaultElementType (StringType or NullType) based on useStringTypeWhenEmpty flag
Null elements: Individual null values are preserved as array elements; the array itself is never null
Type coercion: Automatically promotes compatible types (e.g., int and long become long array)
Mixed nullability: Array is marked as containing nulls if any input expression is nullable, even if actual values are non-null

Code Generation¶

This expression supports full code generation through the doGenCode method. It uses GenArrayData.genCodeToCreateArrayData to generate optimized Java code that creates the array data structure directly without interpretation overhead.

Examples¶

-- Basic array creation
SELECT array(1, 2, 3);
-- Result: [1, 2, 3]

-- Mixed compatible types (promoted to double)
SELECT array(1, 2.5, 3);
-- Result: [1.0, 2.5, 3.0]

-- String array
SELECT array('a', 'b', 'c');
-- Result: ["a", "b", "c"]

-- Array with nulls
SELECT array(1, null, 3);
-- Result: [1, null, 3]

-- Empty array
SELECT array();
-- Result: []

// DataFrame API examples
import org.apache.spark.sql.functions._

// Create array from column values
df.select(array(col("col1"), col("col2"), col("col3")))

// Mix columns and literals
df.select(array(col("name"), lit("constant"), col("other")))

// Nested in other operations
df.select(array(col("a"), col("b")).alias("arr"))
  .filter(size(col("arr")) > 0)