Skip to content

Collate

Overview

The Collate expression marks a given expression with a specified collation without modifying the input data. This is a pass-through function that only updates type metadata to associate collation information with string data.

Syntax

COLLATE(expression, collation)

Arguments

Argument Type Description
expression StringTypeWithCollation The input expression to be marked with collation
collation AnyDataType The collation specification to apply to the expression

Return Type

Returns the same data type as the collation parameter, which contains the collation metadata.

Supported Data Types

  • String types with collation support (including trim collation support)
  • The collation parameter can be of any data type

Algorithm

  • Evaluates the child expression without any modification to the actual data
  • Passes through the result value unchanged during evaluation
  • Updates only the type metadata to include collation information
  • Uses simple passthrough for code generation without custom logic
  • Maintains the foldable property of the child expression

Partitioning Behavior

This expression preserves partitioning behavior since it's a metadata-only operation:

  • Does not affect data distribution or partitioning
  • No shuffle operations are required
  • Maintains the same partitioning as the input expression

Edge Cases

  • Null values are passed through unchanged from the child expression
  • Empty strings maintain their empty state with updated collation metadata
  • The expression inherits the foldable behavior from its child expression
  • Evaluation behavior is identical to the child expression since no data transformation occurs

Code Generation

This expression supports Tungsten code generation through simple passthrough. It uses child.genCode(ctx) directly and explicitly throws an error if doGenCode is called, indicating it relies entirely on the child's code generation.

Examples

-- Apply UTF8_BINARY collation to a string column
SELECT COLLATE(name, 'UTF8_BINARY') FROM users;

-- Mark a string literal with specific collation
SELECT COLLATE('Hello World', 'UTF8_LCASE') AS greeting;
// DataFrame API usage with collate function
import org.apache.spark.sql.functions._
df.select(collate(col("name"), lit("UTF8_BINARY")))

// Using in column expressions
df.withColumn("collated_name", collate(col("name"), lit("UTF8_LCASE")))

See Also

  • String collation functions
  • Type casting expressions
  • Metadata manipulation expressions