Overlay¶

Overview¶

The Overlay expression replaces a portion of a string or binary array with another string or binary array, starting at a specified position and optionally specifying the length of the portion to replace. It supports both string operations with UTF8String and binary operations with byte arrays.

Syntax¶

OVERLAY(input PLACING replace FROM pos [FOR len])

// DataFrame API usage
overlay(input, replace, pos, len)
overlay(input, replace, pos)  // len defaults to -1

Arguments¶

Argument	Type	Description
input	StringType/BinaryType	The original string or binary data to modify
replace	StringType/BinaryType	The replacement string or binary data
pos	IntegerType	The starting position (1-based) where replacement begins
len	IntegerType	Optional length of portion to replace (defaults to -1 if not specified)

Return Type¶

Returns the same data type as the input expression (StringType or BinaryType).

Supported Data Types¶

StringType with collation support (supports trim collation)
BinaryType

Input and replace arguments must be of the same type (both string or both binary).

Algorithm¶

Validates that input and replace expressions have matching data types
Uses position-based replacement starting from the specified 1-based position
If length is not specified (defaults to -1), replaces from position to end
Delegates actual calculation to Overlay.calculate static methods
Supports both UTF8String operations for text and byte array operations for binary data
Performs null-safe evaluation with null intolerance (returns null if any input is null)

Partitioning Behavior¶

This expression does not affect partitioning:

Preserves existing partitioning as it operates on individual row values
Does not require shuffle operations
Can be applied within partition boundaries

Edge Cases¶

Null handling: Expression is null-intolerant, returning null if any input argument is null
Type mismatch: Fails type checking if input and replace have different data types
Position bounds: Behavior for out-of-bounds positions is handled by the underlying calculate methods
Negative length: When len parameter is -1 (default), replaces from position to end of input
Empty inputs: Handled by the underlying UTF8String and byte array calculation methods

Code Generation¶

This expression supports Tungsten code generation through the doGenCode method, which generates optimized bytecode that directly calls org.apache.spark.sql.catalyst.expressions.Overlay.calculate methods.

Examples¶

-- Replace 'world' with 'Spark' starting at position 7
SELECT OVERLAY('Hello world' PLACING 'Spark' FROM 7 FOR 5) AS result;
-- Result: 'Hello Spark'

-- Replace from position without specifying length
SELECT OVERLAY('Hello world' PLACING 'Spark' FROM 7) AS result;
-- Result: 'Hello Spark'

// DataFrame API usage
import org.apache.spark.sql.functions._

df.select(overlay(col("text"), lit("replacement"), lit(5), lit(3)))
df.select(overlay(col("text"), lit("replacement"), lit(5)))  // No length specified