SoundEx¶
Overview¶
The SoundEx expression implements the Soundex phonetic algorithm to generate a four-character code representing the pronunciation of a string. This algorithm is commonly used for matching names that sound similar but may have different spellings.
Syntax¶
Arguments¶
| Argument | Type | Description |
|---|---|---|
| child | StringType | The input string expression to convert to a Soundex code |
Return Type¶
StringType - Returns a four-character Soundex code (one letter followed by three digits).
Supported Data Types¶
- StringType with collation support (supports trim collation)
Algorithm¶
-
Extracts the first letter of the input string as the first character of the code
-
Maps subsequent consonants to digits based on phonetic similarity groups
-
Removes duplicate adjacent digits and vowels from the mapping
-
Pads the result to exactly 4 characters with zeros or truncates if longer
-
Returns the final four-character code in uppercase format
Partitioning Behavior¶
-
Preserves partitioning as it operates on individual rows without requiring data movement
-
Does not require shuffle operations since it's a deterministic row-level transformation
Edge Cases¶
-
Null handling: Returns null if input is null (nullIntolerant = true)
-
Empty strings: Processes empty strings according to standard Soundex algorithm rules
-
Non-alphabetic characters: Ignores numbers, punctuation, and special characters in the phonetic mapping
-
Short strings: Pads results shorter than 4 characters with trailing zeros
Code Generation¶
Supports Tungsten code generation for optimized execution. The expression uses defineCodeGen to generate efficient bytecode that calls the soundex() method directly on UTF8String objects.
Examples¶
-- Basic Soundex usage
SELECT soundex('Miller');
-- Result: M460
SELECT soundex('Smith'), soundex('Smyth');
-- Both return: S530
-- Handling similar sounding names
SELECT name, soundex(name) FROM customers WHERE soundex(name) = soundex('Johnson');
// DataFrame API usage
import org.apache.spark.sql.functions.soundex
df.select(soundex(col("last_name")))
df.filter(soundex(col("name")) === soundex(lit("Miller")))
See Also¶
- String functions:
upper,lower,trim - Phonetic matching functions and approximate string matching algorithms