Skip to content

Mask

Overview

The Mask expression masks characters in string input by replacing different character types (uppercase, lowercase, digits, and other characters) with specified replacement characters. It provides data obfuscation functionality while preserving the original string structure and length.

Syntax

mask(input, upperChar, lowerChar, digitChar, otherChar)
mask(input, upperChar, lowerChar, digitChar)
mask(input, upperChar, lowerChar)
mask(input, upperChar)
mask(input)

Arguments

Argument Type Description
input String The input string to be masked
upperChar String Single character to replace uppercase letters (default: 'X')
lowerChar String Single character to replace lowercase letters (default: 'x')
digitChar String Single character to replace digits (default: 'n')
otherChar String Single character to replace other characters (default: null - no replacement)

Return Type

Returns the same data type as the input expression (typically StringType with collation support).

Supported Data Types

  • Input: StringTypeWithCollation (supports trim collation)

  • Replacement characters: StringTypeWithCollation (supports trim collation)

  • All replacement character expressions must be foldable (compile-time constants)

  • Each replacement character must be exactly one character long

Algorithm

  • Evaluates all child expressions to get the input string and replacement characters

  • Delegates actual transformation logic to Mask.transformInput() utility method

  • Processes each character in the input string based on its type (uppercase, lowercase, digit, other)

  • Replaces characters according to the specified replacement mappings

  • Preserves original string length and non-alphabetic/non-digit characters (unless otherChar is specified)

Partitioning Behavior

  • Preserves partitioning as it operates on individual rows without requiring data movement

  • Does not require shuffle operations

  • Can be pushed down to individual partitions for parallel execution

Edge Cases

  • Returns null if the input string is null (only first child determines nullability)

  • Non-foldable replacement character expressions cause compilation errors

  • Replacement characters with length != 1 cause validation errors

  • Empty input strings return empty strings

  • When otherChar is null (default), other characters remain unchanged

  • All replacement character expressions are evaluated even if input is null (for validation)

Code Generation

Supports Tungsten code generation through doGenCode() method. Generated code calls the static Mask.transformInput() method directly, enabling efficient execution without interpreter overhead.

Examples

-- Basic masking with defaults
SELECT mask('Hello123!') -- Returns 'Xxxxx999!'

-- Custom replacement characters  
SELECT mask('Hello123!', 'U', 'l', '#', '*') -- Returns 'Ullll###*'

-- Partial customization
SELECT mask('Hello123!', 'A', 'b') -- Returns 'Abbbb999!'
// DataFrame API usage
import org.apache.spark.sql.functions._

// Basic masking
df.select(expr("mask(name)"))

// Custom masking
df.select(expr("mask(ssn, 'X', 'x', '#', '-')"))

// Using constructors programmatically
val maskExpr = new Mask(col("input").expr)

See Also

  • String manipulation functions (substring, regexp_replace)

  • Data obfuscation and anonymization functions

  • QuinaryExpression base class for five-argument expressions