Overlay¶
Overview¶
The Overlay expression replaces a portion of a string or binary array with another string or binary array, starting at a specified position and optionally specifying the length of the portion to replace. It supports both string operations with UTF8String and binary operations with byte arrays.
Syntax¶
// DataFrame API usage
overlay(input, replace, pos, len)
overlay(input, replace, pos) // len defaults to -1
Arguments¶
| Argument | Type | Description |
|---|---|---|
| input | StringType/BinaryType | The original string or binary data to modify |
| replace | StringType/BinaryType | The replacement string or binary data |
| pos | IntegerType | The starting position (1-based) where replacement begins |
| len | IntegerType | Optional length of portion to replace (defaults to -1 if not specified) |
Return Type¶
Returns the same data type as the input expression (StringType or BinaryType).
Supported Data Types¶
- StringType with collation support (supports trim collation)
- BinaryType
Input and replace arguments must be of the same type (both string or both binary).
Algorithm¶
- Validates that input and replace expressions have matching data types
- Uses position-based replacement starting from the specified 1-based position
- If length is not specified (defaults to -1), replaces from position to end
- Delegates actual calculation to
Overlay.calculatestatic methods - Supports both UTF8String operations for text and byte array operations for binary data
- Performs null-safe evaluation with null intolerance (returns null if any input is null)
Partitioning Behavior¶
This expression does not affect partitioning:
- Preserves existing partitioning as it operates on individual row values
- Does not require shuffle operations
- Can be applied within partition boundaries
Edge Cases¶
- Null handling: Expression is null-intolerant, returning null if any input argument is null
- Type mismatch: Fails type checking if input and replace have different data types
- Position bounds: Behavior for out-of-bounds positions is handled by the underlying calculate methods
- Negative length: When len parameter is -1 (default), replaces from position to end of input
- Empty inputs: Handled by the underlying UTF8String and byte array calculation methods
Code Generation¶
This expression supports Tungsten code generation through the doGenCode method, which generates optimized bytecode that directly calls org.apache.spark.sql.catalyst.expressions.Overlay.calculate methods.
Examples¶
-- Replace 'world' with 'Spark' starting at position 7
SELECT OVERLAY('Hello world' PLACING 'Spark' FROM 7 FOR 5) AS result;
-- Result: 'Hello Spark'
-- Replace from position without specifying length
SELECT OVERLAY('Hello world' PLACING 'Spark' FROM 7) AS result;
-- Result: 'Hello Spark'
// DataFrame API usage
import org.apache.spark.sql.functions._
df.select(overlay(col("text"), lit("replacement"), lit(5), lit(3)))
df.select(overlay(col("text"), lit("replacement"), lit(5))) // No length specified
See Also¶
Substring- Extract portions of stringsReplace- Replace all occurrences of a substringConcat- Concatenate multiple strings- String manipulation functions in
string_funcsgroup