ValidateUTF8¶
Overview¶
The ValidateUTF8 expression validates that a string contains valid UTF-8 encoded characters. It returns the original string if valid, or throws an exception if invalid UTF-8 sequences are detected. This expression is implemented as a runtime replaceable expression that delegates to native implementation methods.
Syntax¶
// DataFrame API
import org.apache.spark.sql.catalyst.expressions.ValidateUTF8
ValidateUTF8(col("column_name"))
Arguments¶
| Argument | Type | Description |
|---|---|---|
| input | Expression | The string expression to validate for UTF-8 compliance |
Return Type¶
Returns the same data type as the input expression (StringType with preserved collation).
Supported Data Types¶
- StringType with collation support
- Supports trim collations
- Input must be convertible to string type through implicit casting
Algorithm¶
- Accepts a string input expression through implicit cast input types
- Delegates validation to
ExpressionImplUtils.validateUTF8Stringvia StaticInvoke - Performs actual UTF-8 validation in native implementation code
- Returns the original string if validation passes
- Throws runtime exception if invalid UTF-8 sequences are detected
Partitioning Behavior¶
- Preserves partitioning as it operates on individual string values
- Does not require shuffle operations
- Can be executed independently on each partition
Edge Cases¶
- Null handling: Returns null for null input (nullable = true, nullIntolerant = true)
- Empty strings: Empty strings are considered valid UTF-8 and pass validation
- Invalid UTF-8: Throws runtime exception when encountering malformed byte sequences
- Collation preservation: Maintains the original string's collation settings
Code Generation¶
This expression uses runtime replacement with StaticInvoke, which supports code generation through Tungsten. The actual validation is performed by native methods in ExpressionImplUtils for optimal performance.
Examples¶
-- Validate UTF-8 encoding of a string column
SELECT validate_utf8(text_column) FROM table_name;
-- Use in WHERE clause to filter valid UTF-8 strings
SELECT * FROM logs WHERE validate_utf8(message) IS NOT NULL;
// DataFrame API usage
import org.apache.spark.sql.functions._
import org.apache.spark.sql.catalyst.expressions.ValidateUTF8
val df = spark.table("text_data")
val validatedDF = df.select(ValidateUTF8(col("content")))
// Using in filter operations
val cleanDF = df.filter(ValidateUTF8(col("user_input")).isNotNull)
See Also¶
- String manipulation expressions
- Character encoding functions
- Data validation expressions
- ImplicitCastInputTypes trait
- RuntimeReplaceable expressions