RegExpReplace¶

Overview¶

The RegExpReplace expression performs regular expression-based string replacement operations. It searches for occurrences of a regular expression pattern in a subject string and replaces them with a replacement string, starting from a specified position.

Syntax¶

REGEXP_REPLACE(subject, regexp, rep[, pos])

// DataFrame API
col("subject").regexp_replace(regexp, replacement)
// or with position
regexp_replace(col("subject"), regexp, replacement, position)

Arguments¶

Argument	Type	Description
subject	StringType	The input string to search within
regexp	StringType	The regular expression pattern to search for
rep	StringType	The replacement string
pos	IntegerType	Starting position for the search (1-based, defaults to 1)

Return Type¶

Returns the same StringType as the subject input, preserving collation settings.

Supported Data Types¶

Subject: String types with binary lowercase collation support
Pattern: String types with collation support
Replacement: String types with binary lowercase collation support
Position: Integer type only

Algorithm¶

Validates that the position parameter is foldable (constant) and greater than 0
Caches compiled regex patterns for performance when the pattern hasn't changed
Converts 1-based position to 0-based for internal Java regex operations
Uses Java's Matcher.region() to limit search scope to the specified starting position
Applies appendReplacement() and appendTail() for efficient string building
Returns original subject string if position is beyond string length

Partitioning Behavior¶

This expression preserves partitioning as it operates on individual rows without requiring data movement:

Does not require shuffle operations
Maintains existing partitioning scheme
Can be pushed down to individual partitions

Edge Cases¶

Returns null if any input parameter is null (null intolerant behavior)
Returns original subject string unchanged if position exceeds string length
Position parameter must be a constant expression (foldable) and positive
Invalid regex patterns throw compilation errors at evaluation time
Invalid replacement strings (e.g., invalid group references) throw runtime errors
Empty subject strings are handled gracefully, returning empty results

Code Generation¶

This expression supports Tungsten code generation for optimized performance. It generates specialized bytecode that:

Caches compiled regex patterns and replacement strings
Uses efficient string building with StringBuilder
Includes proper exception handling for invalid regex operations
Avoids object allocation in the hot path when possible

Examples¶

-- Basic replacement
SELECT REGEXP_REPLACE('Hello World', 'World', 'Spark');
-- Result: 'Hello Spark'

-- With position parameter
SELECT REGEXP_REPLACE('abc abc abc', 'abc', 'xyz', 5);
-- Result: 'abc xyz xyz'

-- Using capture groups
SELECT REGEXP_REPLACE('2023-12-25', '(\\d{4})-(\\d{2})-(\\d{2})', '$3/$2/$1');
-- Result: '25/12/2023'

// DataFrame API usage
import org.apache.spark.sql.functions._

df.select(regexp_replace(col("text"), "old", "new"))

// With position
df.select(regexp_replace(col("text"), "pattern", "replacement", 3))