Stack
Overview
The Stack expression is a table-generating function that spreads a flat list of expressions across multiple rows. It takes a row count `n` followed by a series of expressions and produces `n` rows, filling each row with consecutive expressions from the list so that every output row has the same number of columns.
Syntax
-- SQL
stack(n, expr1, expr2, ..., exprk)

// DataFrame API usage (df is any existing DataFrame)
import org.apache.spark.sql.functions._
df.select(expr("stack(3, 'A', 1, 'B', 2, 'C', 3) as (col0, col1)"))
Arguments
| Argument | Type | Description |
|---|---|---|
| n | IntegerType | The number of rows to generate (must be foldable and > 0) |
| expr1, expr2, ... | Any | Variable number of expressions to stack into rows |
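Return Type
Returns rows with a `StructType` schema whose fields are named `col0`, `col1`, and so on by default. The number of columns is `ceil(k / n)`, where `k` is the number of expressions after `n`; counting `n` itself as an argument, this is `ceil((total_expressions - 1) / n)`. Each column's data type is determined from the first non-null expression in that column position.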
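The column-count rule can be checked by hand with a small sketch. This is plain Scala with no Spark dependency; the object and method names (`StackColumnCount`, `numFields`, `columnNames`) are made up for illustration and are not part of Spark's API.

```scala
object StackColumnCount {
  // totalArgs counts every argument passed to stack, including n itself.
  def numFields(totalArgs: Int, n: Int): Int = {
    require(n > 0, "n must be a positive integer")
    math.ceil((totalArgs - 1.0) / n).toInt
  }

  // Default output column names: col0, col1, ...
  def columnNames(totalArgs: Int, n: Int): Seq[String] =
    (0 until numFields(totalArgs, n)).map(i => s"col$i")

  def main(args: Array[String]): Unit = {
    // stack(3, 'A', 1, 'B', 2, 'C', 3): 7 arguments, n = 3, so 2 columns
    println(columnNames(7, 3).mkString(", "))
    // stack(2, 'X', 10, 'Y'): 4 arguments, n = 2, so still 2 columns (one slot is padded)
    println(numFields(4, 2))
  }
}
```

Dividing by a `Double` before taking the ceiling matters here: integer division would truncate `3 / 2` to `1` and lose the padded column.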
Supported Data Types
- First argument (`n`): Must be `IntegerType` and foldable (constant)
- Remaining arguments: Any data type, but expressions that map to the same column position across rows must have the same data type
Algorithm
- Validates that the first argument is a positive integer constant
- Calculates the number of output columns as `ceil((children.length - 1) / numRows)`
- Groups the input expressions into rows, filling each row with `numFields` consecutive expressions
- Generates `numRows` output rows, padding with null values if insufficient expressions are provided
- Each output row is represented as an `InternalRow` with the calculated number of fields
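The steps above can be modeled in a few lines of plain Scala. This is a minimal sketch and not Spark's implementation: it works on already-evaluated values instead of Catalyst expressions, uses `Option` (with `None` standing in for SQL NULL) instead of `InternalRow`, and the name `StackSketch` is hypothetical.

```scala
object StackSketch {
  // Models stack(n, values...) for one input row.
  // values holds the evaluated expressions that follow n.
  def stack(n: Int, values: Seq[Any]): Seq[Seq[Option[Any]]] = {
    require(n > 0, "n must be a positive integer")
    // Number of output columns; values.length plays the role of children.length - 1.
    val numFields = math.ceil(values.length.toDouble / n).toInt
    (0 until n).map { row =>
      (0 until numFields).map { col =>
        // Row-major position of this cell in the flat argument list.
        val index = row * numFields + col
        // Positions past the end of the list are padded with NULL (None here).
        if (index < values.length) Some(values(index)) else None
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Three full rows: (A, 1), (B, 2), (C, 3)
    stack(3, Seq("A", 1, "B", 2, "C", 3)).foreach(println)
    // Two rows, the second padded: (X, 10), (Y, NULL)
    stack(2, Seq("X", 10, "Y")).foreach(println)
  }
}
```

The sketch omits the type checks that Spark performs during analysis, so it accepts inputs that a real query would reject.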
Partitioning Behavior
The Stack expression is a generator function that:
- Does not repartition data: output rows stay in the same partition as the input row that produced them
- Does not require shuffle as it operates row-by-row within partitions
- Each input row can generate multiple output rows independently
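One way to picture this behavior is as a per-partition `flatMap`. The sketch below is plain Scala with nested sequences standing in for partitions; the `Reading` type and its fields are hypothetical, and the real operator works on Spark's internal row format.

```scala
object StackPartitionSketch {
  // A hypothetical input row with two measurements.
  final case class Reading(id: Int, temp: Double, humidity: Double)

  // Per-row equivalent of stack(2, 'temp', temp, 'humidity', humidity),
  // with the id column carried along.
  def unpivot(r: Reading): Seq[(Int, String, Double)] =
    Seq((r.id, "temp", r.temp), (r.id, "humidity", r.humidity))

  // Each partition is processed on its own: rows never cross partition boundaries,
  // so no shuffle is involved.
  def generate(partitions: Seq[Seq[Reading]]): Seq[Seq[(Int, String, Double)]] =
    partitions.map(_.flatMap(unpivot))

  def main(args: Array[String]): Unit = {
    val input = Seq(
      Seq(Reading(1, 20.5, 0.40), Reading(2, 21.0, 0.42)), // partition 0
      Seq(Reading(3, 19.0, 0.55))                          // partition 1
    )
    // Same number of partitions; each holds twice as many rows as before.
    generate(input).foreach(p => println(p.size))
  }
}
```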
Edge Cases
- Null handling: Null expressions are preserved as null values in the output columns
- Insufficient expressions: If fewer expressions are provided than needed to fill all rows and columns, missing positions are filled with null values
- Type mismatch: All expressions that map to the same output column position must have the same data type
- Invalid row count: The first argument must be a positive integer, otherwise a `VALUE_OUT_OF_RANGE` error is thrown
- Non-foldable n: The row count parameter must be a compile-time constant
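The validation rules above can be sketched as a single function. This is an illustrative model, not Spark's analyzer: types are represented as plain strings, `RowCount`, `Constant`, and `NonConstant` are hypothetical names, and apart from `VALUE_OUT_OF_RANGE` the messages are descriptive placeholders instead of Spark's actual error identifiers.

```scala
object StackValidationSketch {
  // Hypothetical model of the first argument: a constant or something else.
  sealed trait RowCount
  final case class Constant(value: Int) extends RowCount // foldable
  case object NonConstant extends RowCount               // e.g. a column reference

  // Returns Right(number of columns) or Left(reason) for the first failed check.
  def validate(n: RowCount, types: Seq[String]): Either[String, Int] = n match {
    case NonConstant => Left("row count is not a constant")
    case Constant(v) if v < 1 => Left("VALUE_OUT_OF_RANGE")
    case Constant(_) if types.isEmpty => Left("no expressions to stack")
    case Constant(v) =>
      val numFields = math.ceil(types.length.toDouble / v).toInt
      // The expression at position i lands in column i % numFields,
      // so it must match the type of the first expression in that column.
      val mismatch = types.indices.exists(i => types(i) != types(i % numFields))
      if (mismatch) Left("column type mismatch") else Right(numFields)
  }

  def main(args: Array[String]): Unit = {
    println(validate(Constant(2), Seq("string", "int", "string", "int"))) // valid
    println(validate(Constant(0), Seq("string")))                         // bad row count
    println(validate(Constant(2), Seq("string", "int", "int", "int")))    // column 0 mixes types
    println(validate(NonConstant, Seq("int")))                            // n is not constant
  }
}
```

The sketch treats NULL like any other type, so it does not model how null expressions are resolved against the rest of their column.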
Code Generation
Supports Tungsten code generation when numRows <= 50. For larger row counts, falls back to interpreted evaluation. The generated code creates an array of InternalRow objects and wraps them in a mutable.ArraySeq.
Examples
-- Transform 6 values into 3 rows with 2 columns each
SELECT stack(3, 'A', 1, 'B', 2, 'C', 3) as (letter, number);
-- Result:
-- letter | number
-- A | 1
-- B | 2
-- C | 3
-- With insufficient values (padded with nulls)
SELECT stack(2, 'X', 10, 'Y') as (col0, col1);
-- Result:
-- col0 | col1
-- X | 10
-- Y | NULL
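// DataFrame API usage: assumes df has columns name, age, city, and population,
// where name and city share one data type and age and population share another
import org.apache.spark.sql.functions._
df.select(expr("stack(2, name, age, city, population) as (attribute, value)"))
  .show()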
See Also
- `explode()` - Explodes arrays or maps into multiple rows
- `posexplode()` - Similar to explode but includes position information
- `lateral view` - SQL construct often used with table-generating functions