IsValidUTF8¶
Overview¶
The IsValidUTF8 expression checks whether a binary or string input contains valid UTF-8 encoded data. It returns true if the input is valid UTF-8, false if it contains invalid UTF-8 sequences, and null if the input is null.
Syntax¶
Arguments¶
| Argument | Type | Description |
|---|---|---|
| input | StringType with collation support | The binary or string data to validate for UTF-8 encoding |
Return Type¶
BooleanType - Returns true for valid UTF-8, false for invalid UTF-8, null for null input.
Supported Data Types¶
- StringType with collation support (including trim collation)
- Binary data that can be cast to string
- Supports implicit type casting for compatible input types
Algorithm¶
- The expression is implemented as a
RuntimeReplaceablethat delegates to theisValidmethod on the input - Uses the
Invokeexpression to call the nativeisValidmethod withBooleanTypereturn - Performs UTF-8 validation at the byte level to detect malformed sequences
- Returns false immediately upon encountering any invalid UTF-8 byte sequence
- Handles multi-byte UTF-8 characters by validating continuation bytes
Partitioning Behavior¶
How this expression affects partitioning:
- Preserves existing partitioning as it's a deterministic row-level transformation
- Does not require shuffle operations
- Can be pushed down to individual partitions for parallel execution
Edge Cases¶
- Null handling: Returns null for null input (nullable = true, nullIntolerant = true)
- Empty input: Returns true for empty strings as they are valid UTF-8
- Binary data: Validates the raw byte sequence for UTF-8 compliance
- Collation support: Works with different string collations including trim collation
- Malformed sequences: Returns false for incomplete multi-byte sequences or invalid continuation bytes
Code Generation¶
This expression uses runtime replacement with method invocation rather than direct code generation. It delegates to the underlying isValid method implementation, which may have optimized native implementations depending on the execution engine.
Examples¶
-- Basic UTF-8 validation
SELECT IS_VALID_UTF8('Hello World');
-- true
-- Invalid UTF-8 sequence (from documentation example)
SELECT IS_VALID_UTF8(x'61C262');
-- false
-- Null input
SELECT IS_VALID_UTF8(NULL);
-- NULL
-- Valid multi-byte UTF-8
SELECT IS_VALID_UTF8('café');
-- true
// DataFrame API usage
import org.apache.spark.sql.functions._
df.select(col("text_column").expr("is_valid_utf8(text_column)").as("is_valid"))
// With binary data
df.select(expr("is_valid_utf8(binary_column)").as("utf8_valid"))
See Also¶
TRY_TO_BINARY- For safe binary conversionDECODE- For character set conversion- String manipulation functions for UTF-8 text processing