Skip to content

TryValidateUTF8

Overview

The TryValidateUTF8 expression validates whether a string contains valid UTF-8 encoded characters. It returns the original string if validation passes, or NULL if the input contains invalid UTF-8 sequences. This expression is null-intolerant, meaning it will not produce null results from valid non-null inputs.

Syntax

TRY_VALIDATE_UTF8(string_expr)
// DataFrame API usage
import org.apache.spark.sql.functions._
df.select(expr("try_validate_utf8(column_name)"))

Arguments

Argument Type Description
input StringType with collation support The string expression to validate for UTF-8 compliance

Return Type

Returns the same data type as the input (StringType with collation), or NULL if validation fails.

Supported Data Types

  • StringType with collation support
  • Supports trim collation operations
  • Input must be castable to string type through implicit casting

Algorithm

  • Takes a string input and delegates validation to ExpressionImplUtils.tryValidateUTF8String
  • Performs UTF-8 byte sequence validation on the input string
  • Returns the original string unchanged if all byte sequences are valid UTF-8
  • Returns NULL if any invalid UTF-8 byte sequences are detected
  • Uses static method invocation for the actual validation logic

Partitioning Behavior

This expression preserves partitioning behavior:

  • Does not require data shuffle as it operates on individual rows
  • Maintains existing partitioning scheme since it's a row-level transformation
  • Can be pushed down to individual partitions for parallel processing

Edge Cases

  • Null handling: Returns NULL for NULL input (nullable = true)
  • Empty string: Empty strings are considered valid UTF-8 and return empty string
  • Invalid UTF-8 sequences: Any malformed UTF-8 byte sequences result in NULL output
  • Collation support: Respects string collation settings including trim operations
  • Null intolerant: The expression itself is null-intolerant for valid inputs

Code Generation

This expression uses runtime replacement with StaticInvoke:

  • Does not generate custom code but delegates to pre-compiled static methods
  • Uses StaticInvoke to call ExpressionImplUtils.tryValidateUTF8String
  • Leverages existing Java/Scala UTF-8 validation implementations
  • May benefit from JIT compilation of the underlying static method

Examples

-- Basic usage
SELECT TRY_VALIDATE_UTF8('Hello World') AS result;
-- Returns: 'Hello World'

-- Invalid UTF-8 example
SELECT TRY_VALIDATE_UTF8(UNHEX('C0C1')) AS result;  
-- Returns: NULL

-- With NULL input
SELECT TRY_VALIDATE_UTF8(NULL) AS result;
-- Returns: NULL
// DataFrame API usage
import org.apache.spark.sql.functions._

val df = Seq(
  "Valid UTF-8 string",
  "Another valid string",
  null
).toDF("text")

df.select(expr("try_validate_utf8(text)") as "validated")
  .show()

See Also

  • VALIDATE_UTF8 - Similar validation that throws exceptions instead of returning NULL
  • UTF8 string functions for encoding/decoding operations
  • String collation functions for text processing with locale-specific rules