StringTranslate¶

Overview¶

The StringTranslate expression performs character-by-character translation on a string by replacing characters found in a matching string with corresponding characters from a replacement string. It builds a translation dictionary from the matching and replacement strings, then applies this mapping to transform the source string.

Syntax¶

TRANSLATE(source_string, matching_characters, replacement_characters)

// DataFrame API
translate(source_col, matching_string, replacement_string)

Arguments¶

Argument	Type	Description
srcExpr	StringType	The source string to be translated
matchingExpr	StringType	String containing characters to be replaced
replaceExpr	StringType	String containing replacement characters

Return Type¶

Returns StringType with the same data type as the source expression, preserving collation settings.

Supported Data Types¶

All three arguments must be StringType with non-CSAI collation
Supports trim collation
Preserves the collation ID from the source expression

Algorithm¶

Builds a character translation dictionary by mapping characters from matchingExpr to corresponding positions in replaceExpr
Caches the translation dictionary when matching and replacement strings remain unchanged
Iterates through each character in the source string and applies the translation mapping
Uses collation-aware string processing through CollationSupport.StringTranslate.exec
Delegates dictionary construction to StringTranslate.buildDict utility method

Partitioning Behavior¶

This expression preserves partitioning as it operates on individual rows without requiring data movement:

Does not require shuffle operations
Maintains existing data distribution
Can be executed independently on each partition

Edge Cases¶

Null handling: Returns null if any input argument is null (null intolerant behavior)
Mismatched lengths: When replacement string is shorter than matching string, excess matching characters are handled by the buildDict implementation
Empty strings: Empty matching or replacement strings result in no character translation
Duplicate characters: If matching string contains duplicate characters, behavior depends on the dictionary building logic
Collation sensitivity: Translation respects the collation settings of the input strings

Code Generation¶

Supports Tungsten code generation for optimized performance:

Generates mutable state variables for caching translation dictionary
Optimizes dictionary rebuild checks for foldable (literal) expressions
Uses nullSafeCodeGen for efficient null handling in generated code
Falls back to interpreted mode only when code generation fails

Examples¶

-- Replace vowels with numbers
SELECT TRANSLATE('Apache Spark', 'aeiou', '12345');
-- Result: 'Ap1ch2 Sp1rk'

-- Remove characters by providing shorter replacement string
SELECT TRANSLATE('hello world', 'aeiou', '12');
-- Result: 'h2ll wrld'

// DataFrame API usage
import org.apache.spark.sql.functions.translate

df.select(translate(col("name"), "aeiou", "12345"))

// Using column expressions
df.withColumn("translated", translate($"text", lit("abc"), lit("123")))