CollationKey¶
Overview¶
The CollationKey expression generates a binary collation key from a string input based on the string's collation rules. This enables efficient comparison and sorting of strings according to locale-specific collation semantics by converting strings to binary keys that can be compared byte-wise.
Syntax¶
// DataFrame API usage
import org.apache.spark.sql.catalyst.expressions.CollationKey
CollationKey(col("string_column"))
Arguments¶
| Argument | Type | Description |
|---|---|---|
| expr | Expression | String expression with collation information to generate collation key from |
Return Type¶
BinaryType - Returns a binary array representing the collation key for the input string.
Supported Data Types¶
StringTypeWithCollationwith trim collation support enabled- Input must be a string type that contains collation metadata
Algorithm¶
- Extracts the collation ID from the input string expression's data type
- Converts the input UTF8String to a collation-specific binary key using
CollationFactory.getCollationKeyBytes - The binary key preserves the collation ordering semantics, allowing byte-wise comparison
- Handles null inputs by returning null (via
nullSafeEvalmechanism) - Uses the collation ID to determine the appropriate collation rules for key generation
Partitioning Behavior¶
This expression affects partitioning behavior:
- Does not preserve existing partitioning schemes since it transforms string data to binary keys
- May require shuffle operations when used in grouping or sorting contexts
- The binary output enables consistent partitioning based on collation order
Edge Cases¶
- Null handling: Returns null when input string is null (handled by
nullSafeEval) - Empty string: Generates a valid collation key for empty strings based on collation rules
- Invalid collation ID: Relies on
CollationFactoryto handle invalid or unsupported collation identifiers - Collation mismatch: Requires input to be
StringTypeWithCollationwith trim support, enforced byExpectsInputTypes
Code Generation¶
Supports Tungsten code generation through the doGenCode method, which generates optimized Java code calling CollationFactory.getCollationKeyBytes directly with the resolved collation ID.
Examples¶
-- Generate collation keys for locale-specific string comparison
SELECT COLLATION_KEY(name) as name_key
FROM customers
WHERE name COLLATE 'en_US' > 'Smith';
// DataFrame API usage for sorting with collation
import org.apache.spark.sql.catalyst.expressions.CollationKey
import org.apache.spark.sql.functions._
val df = spark.table("customers")
val collationKeyExpr = CollationKey(col("name").expr)
df.select(col("name"), collationKeyExpr.as("name_key"))
.orderBy("name_key")
See Also¶
CollationFactory- Factory class for collation operations and key generationStringTypeWithCollation- String data type with collation metadata- Collation-aware string comparison expressions
UnaryExpression- Base class for single-argument expressions