Skip to content

SoundEx

Overview

The SoundEx expression implements the Soundex phonetic algorithm to generate a four-character code representing the pronunciation of a string. This algorithm is commonly used for matching names that sound similar but may have different spellings.

Syntax

SELECT soundex(string_expr)
// DataFrame API
col("column_name").soundex()

Arguments

Argument Type Description
child StringType The input string expression to convert to a Soundex code

Return Type

StringType - Returns a four-character Soundex code (one letter followed by three digits).

Supported Data Types

  • StringType with collation support (supports trim collation)

Algorithm

  • Extracts the first letter of the input string as the first character of the code

  • Maps subsequent consonants to digits based on phonetic similarity groups

  • Removes duplicate adjacent digits and vowels from the mapping

  • Pads the result to exactly 4 characters with zeros or truncates if longer

  • Returns the final four-character code in uppercase format

Partitioning Behavior

  • Preserves partitioning as it operates on individual rows without requiring data movement

  • Does not require shuffle operations since it's a deterministic row-level transformation

Edge Cases

  • Null handling: Returns null if input is null (nullIntolerant = true)

  • Empty strings: Processes empty strings according to standard Soundex algorithm rules

  • Non-alphabetic characters: Ignores numbers, punctuation, and special characters in the phonetic mapping

  • Short strings: Pads results shorter than 4 characters with trailing zeros

Code Generation

Supports Tungsten code generation for optimized execution. The expression uses defineCodeGen to generate efficient bytecode that calls the soundex() method directly on UTF8String objects.

Examples

-- Basic Soundex usage
SELECT soundex('Miller');
-- Result: M460

SELECT soundex('Smith'), soundex('Smyth');  
-- Both return: S530

-- Handling similar sounding names
SELECT name, soundex(name) FROM customers WHERE soundex(name) = soundex('Johnson');
// DataFrame API usage
import org.apache.spark.sql.functions.soundex

df.select(soundex(col("last_name")))
df.filter(soundex(col("name")) === soundex(lit("Miller")))

See Also

  • String functions: upper, lower, trim
  • Phonetic matching functions and approximate string matching algorithms