TryValidateUTF8¶
Overview¶
The TryValidateUTF8 expression validates whether a string contains valid UTF-8 encoded characters. It returns the original string if validation passes, or NULL if the input contains invalid UTF-8 sequences. This expression is null-intolerant, meaning it will not produce null results from valid non-null inputs.
Syntax¶
// DataFrame API usage
import org.apache.spark.sql.functions._
df.select(expr("try_validate_utf8(column_name)"))
Arguments¶
| Argument | Type | Description |
|---|---|---|
| input | StringType with collation support | The string expression to validate for UTF-8 compliance |
Return Type¶
Returns the same data type as the input (StringType with collation), or NULL if validation fails.
Supported Data Types¶
- StringType with collation support
- Supports trim collation operations
- Input must be castable to string type through implicit casting
Algorithm¶
- Takes a string input and delegates validation to
ExpressionImplUtils.tryValidateUTF8String - Performs UTF-8 byte sequence validation on the input string
- Returns the original string unchanged if all byte sequences are valid UTF-8
- Returns NULL if any invalid UTF-8 byte sequences are detected
- Uses static method invocation for the actual validation logic
Partitioning Behavior¶
This expression preserves partitioning behavior:
- Does not require data shuffle as it operates on individual rows
- Maintains existing partitioning scheme since it's a row-level transformation
- Can be pushed down to individual partitions for parallel processing
Edge Cases¶
- Null handling: Returns NULL for NULL input (nullable = true)
- Empty string: Empty strings are considered valid UTF-8 and return empty string
- Invalid UTF-8 sequences: Any malformed UTF-8 byte sequences result in NULL output
- Collation support: Respects string collation settings including trim operations
- Null intolerant: The expression itself is null-intolerant for valid inputs
Code Generation¶
This expression uses runtime replacement with StaticInvoke:
- Does not generate custom code but delegates to pre-compiled static methods
- Uses
StaticInvoketo callExpressionImplUtils.tryValidateUTF8String - Leverages existing Java/Scala UTF-8 validation implementations
- May benefit from JIT compilation of the underlying static method
Examples¶
-- Basic usage
SELECT TRY_VALIDATE_UTF8('Hello World') AS result;
-- Returns: 'Hello World'
-- Invalid UTF-8 example
SELECT TRY_VALIDATE_UTF8(UNHEX('C0C1')) AS result;
-- Returns: NULL
-- With NULL input
SELECT TRY_VALIDATE_UTF8(NULL) AS result;
-- Returns: NULL
// DataFrame API usage
import org.apache.spark.sql.functions._
val df = Seq(
"Valid UTF-8 string",
"Another valid string",
null
).toDF("text")
df.select(expr("try_validate_utf8(text)") as "validated")
.show()
See Also¶
VALIDATE_UTF8- Similar validation that throws exceptions instead of returning NULLUTF8string functions for encoding/decoding operations- String collation functions for text processing with locale-specific rules