ReplicateRows¶

Overview¶

ReplicateRows is an internal Spark Catalyst generator expression that replicates a row N times, where N is specified as the first argument. This expression is solely used by the Spark optimizer to rewrite EXCEPT ALL and INTERSECT ALL queries and is not directly accessible through SQL or DataFrame APIs.

Syntax¶

This is an internal expression not exposed through public APIs. It would be constructed programmatically as:

ReplicateRows(Seq(numRowsExpression, column1Expression, column2Expression, ...))

Arguments¶

Argument	Type	Description
children.head	Expression evaluating to Long	The number of times to replicate the row (multiplier)
children.tail	Seq[Expression]	The column expressions to be replicated in each output row

Return Type¶

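Returns an iterator of InternalRow objects, one per replication, each holding the evaluated column values. The output schema is a StructType with one field per column expression (the multiplier is not part of the output); the fields are named "col0", "col1", etc., and take their data types from the corresponding input expressions.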
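
The field naming can be pictured with a small sketch in plain Scala. It assumes a hypothetical Field case class as a stand-in for StructField and uses type names as plain strings; it is an illustration of the naming scheme, not the Catalyst implementation.

```scala
// Hypothetical stand-in for StructField: a field name plus a type name.
final case class Field(name: String, dataType: String)

// childTypes lists the type of every child expression, multiplier first.
// The multiplier (head) is dropped; the rest become col0, col1, ...
def elementSchema(childTypes: Seq[String]): Seq[Field] =
  childTypes.tail.zipWithIndex.map { case (dataType, index) =>
    Field(s"col$index", dataType)
  }

// Multiplier of type bigint followed by two column expressions.
val schema = elementSchema(Seq("bigint", "string", "int"))
```

Here schema has two fields, col0 and col1, because the multiplier is excluded from the output.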

Supported Data Types¶

Supports any data type for the column expressions since it simply evaluates and replicates the values without type-specific operations. The multiplier must evaluate to a Long value.

Algorithm¶

Evaluates the first child expression to get the number of replications (numRows) as a Long
Evaluates all remaining child expressions once to get the column values
Creates a range from 0 (inclusive) to numRows (exclusive) and maps each index to a new InternalRow
Each output row is built from a fresh array holding the same evaluated column values
Returns an iterator over all the replicated rows
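
The steps above can be sketched in plain Scala. This is a simplified model under stated assumptions, not the Spark source: a row is modelled as Seq[Any] instead of InternalRow, and each child expression is modelled as a function of the input row. The names replicateRows, input, and children are illustrative only.

```scala
// Simplified model of the evaluation steps; rows are Seq[Any], not InternalRow.
def replicateRows(children: Seq[Seq[Any] => Any], input: Seq[Any]): Iterator[Seq[Any]] = {
  // Step 1: evaluate the first child to obtain the multiplier as a Long.
  val numRows = children.head(input).asInstanceOf[Long]
  // Step 2: evaluate the remaining children exactly once.
  val values: Seq[Any] = children.tail.map(child => child(input))
  // Steps 3-5: emit one copy of the values for each index in [0, numRows).
  (0L until numRows).iterator.map(_ => values)
}

// Hypothetical input row: (count, name, age).
val input: Seq[Any] = Seq(3L, "alice", 30)
// Each child simply reads one position of the input row.
val children: Seq[Seq[Any] => Any] = Seq(r => r(0), r => r(1), r => r(2))
val replicated: List[Seq[Any]] = replicateRows(children, input).toList
```

With a multiplier of 3, replicated contains three copies of the (name, age) values, and the multiplier itself does not appear in the output.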

Partitioning Behavior¶

As a generator expression that produces multiple output rows from a single input row:

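Emits every output row in the same partition as the input row that produced it
Introduces no shuffle by itself; any shuffle comes from the surrounding operators in the rewritten query plan
Changes the number of rows per partition, so partition sizes can become uneven when multipliers vary widely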

Edge Cases¶

If numRows is 0 or negative, returns an empty iterator (no rows generated)
Null values in column expressions are preserved and replicated as-is
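If the multiplier expression evaluates to null, the Scala cast to Long unboxes it to 0, so no rows are generated; a ClassCastException occurs only when the value is a non-null object of another type (for example, an Integer)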
Large multiplier values could cause memory issues since all rows are generated in memory
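
The empty-range and null-column cases can be checked with a small plain-Scala model. The helper replicate is a hypothetical stand-in for the replication step and does not use any Spark classes.

```scala
// Simplified stand-in for the replication step; not the Spark source.
def replicate(numRows: Long, values: Seq[Any]): List[Seq[Any]] =
  (0L until numRows).map(_ => values).toList

val zeroCopies = replicate(0L, Seq("a", 1))      // range [0, 0) is empty
val negativeCopies = replicate(-2L, Seq("a", 1)) // range [0, -2) is also empty
val withNull = replicate(2L, Seq(null, 1))       // null column value is copied as-is
```

Note that this model materializes all copies in a List, which mirrors the memory concern for large multipliers.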

Code Generation¶

This expression extends CodegenFallback, which means it does not support Tungsten code generation and always falls back to interpreted evaluation mode.

Examples¶

-- Not directly accessible via SQL
-- Used internally for queries like:
SELECT col1, col2 FROM table1
EXCEPT ALL
SELECT col1, col2 FROM table2
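
To see why a row replicator is useful here, EXCEPT ALL can be modelled as a multiset difference: count how many more times each distinct row occurs on the left than on the right, then emit the row that many times. The sketch below is a plain-Scala model of that idea under the assumption that the rewrite tags left rows with +1 and right rows with -1 and sums the tags per distinct row; exceptAll, table1, and table2 are illustrative names, and this is not the optimizer rule itself.

```scala
// Multiset model of `left EXCEPT ALL right`; a sketch, not the optimizer rule.
def exceptAll[A](left: Seq[A], right: Seq[A]): Seq[A] = {
  // Tag left rows with +1 and right rows with -1.
  val tagged = left.map(row => (row, 1L)) ++ right.map(row => (row, -1L))
  // Sum the tags per distinct row to get the surplus count.
  val counts = tagged.groupBy(_._1).map { case (row, tags) => (row, tags.map(_._2).sum) }
  // Keep rows with a positive surplus and replicate each one that many times.
  counts.toSeq
    .filter { case (_, n) => n > 0 }
    .flatMap { case (row, n) => Seq.fill(n.toInt)(row) }
}

val table1 = Seq(("a", 1), ("a", 1), ("a", 1), ("b", 2))
val table2 = Seq(("a", 1), ("b", 2), ("c", 3))
val result = exceptAll(table1, table2)
```

In this example ("a", 1) occurs three times on the left and once on the right, so it survives twice; ("b", 2) cancels out and ("c", 3) never appears on the left. The final replication step plays the role that ReplicateRows plays in the rewritten plan.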

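// Internal usage only - not available through the public DataFrame API.
// The optimizer constructs this expression while rewriting the query.
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.{Literal, ReplicateRows}

// Children must be Catalyst Expressions, not DataFrame Column objects.
val replicator = ReplicateRows(Seq(
  Literal(3L),                  // multiplier: replicate 3 times (must be a Long)
  UnresolvedAttribute("name"),  // first column expression
  UnresolvedAttribute("age")    // second column expression
))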