Skip to content

ValidateUTF8

Overview

The ValidateUTF8 expression validates that a string contains valid UTF-8 encoded characters. It returns the original string if valid, or throws an exception if invalid UTF-8 sequences are detected. This expression is implemented as a runtime replaceable expression that delegates to native implementation methods.

Syntax

validate_utf8(string_expr)
// DataFrame API
import org.apache.spark.sql.catalyst.expressions.ValidateUTF8
ValidateUTF8(col("column_name"))

Arguments

Argument Type Description
input Expression The string expression to validate for UTF-8 compliance

Return Type

Returns the same data type as the input expression (StringType with preserved collation).

Supported Data Types

  • StringType with collation support
  • Supports trim collations
  • Input must be convertible to string type through implicit casting

Algorithm

  • Accepts a string input expression through implicit cast input types
  • Delegates validation to ExpressionImplUtils.validateUTF8String via StaticInvoke
  • Performs actual UTF-8 validation in native implementation code
  • Returns the original string if validation passes
  • Throws runtime exception if invalid UTF-8 sequences are detected

Partitioning Behavior

  • Preserves partitioning as it operates on individual string values
  • Does not require shuffle operations
  • Can be executed independently on each partition

Edge Cases

  • Null handling: Returns null for null input (nullable = true, nullIntolerant = true)
  • Empty strings: Empty strings are considered valid UTF-8 and pass validation
  • Invalid UTF-8: Throws runtime exception when encountering malformed byte sequences
  • Collation preservation: Maintains the original string's collation settings

Code Generation

This expression uses runtime replacement with StaticInvoke, which supports code generation through Tungsten. The actual validation is performed by native methods in ExpressionImplUtils for optimal performance.

Examples

-- Validate UTF-8 encoding of a string column
SELECT validate_utf8(text_column) FROM table_name;

-- Use in WHERE clause to filter valid UTF-8 strings
SELECT * FROM logs WHERE validate_utf8(message) IS NOT NULL;
// DataFrame API usage
import org.apache.spark.sql.functions._
import org.apache.spark.sql.catalyst.expressions.ValidateUTF8

val df = spark.table("text_data")
val validatedDF = df.select(ValidateUTF8(col("content")))

// Using in filter operations
val cleanDF = df.filter(ValidateUTF8(col("user_input")).isNotNull)

See Also

  • String manipulation expressions
  • Character encoding functions
  • Data validation expressions
  • ImplicitCastInputTypes trait
  • RuntimeReplaceable expressions