Uuid¶

Overview¶

The Uuid expression generates universally unique identifiers (UUIDs) using a pseudo-random number generator. This is a non-deterministic, stateful expression that produces different UUID values on each evaluation while maintaining reproducibility when a seed is provided.

Syntax¶

UUID()

// DataFrame API
df.withColumn("id", expr("uuid()"))

Arguments¶

Argument	Type	Description
seed (optional)	Long	Optional random seed for deterministic UUID generation across runs

Return Type¶

String (UTF8String internally) - Returns a standard UUID string format.

Supported Data Types¶

This expression does not accept input data types as it's a leaf expression (generator function). The optional seed parameter accepts numeric types that can be converted to Long.

Algorithm¶

Uses an internal RandomUUIDGenerator initialized with a seed value
The seed combines the provided seed (or random seed) with the partition index
Each evaluation calls getNextUUIDUTF8String() to generate the next UUID in sequence
The generator maintains state across evaluations within the same partition
UUIDs are generated using pseudo-random algorithms for reproducibility when seeded

Partitioning Behavior¶

How this expression affects partitioning:

Preserves existing partitioning as it's a generator expression
Does not require shuffle operations
Each partition gets its own generator instance with a unique seed (base seed + partition index)
Ensures different UUIDs across partitions even with the same base seed

Edge Cases¶

Null handling: Never returns null values (nullable = false)
Unresolved state: Expression remains unresolved until a seed is provided (explicitly or implicitly)
Partition consistency: Same partition with same seed will generate identical UUID sequences
Cross-execution determinism: Requires explicit seed for reproducible results across different Spark jobs

Code Generation¶

Supports full code generation (Tungsten). The generated code creates a mutable state variable for the RandomUUIDGenerator and initializes it during partition setup, avoiding object creation overhead during evaluation.

Examples¶

-- Generate random UUIDs
SELECT uuid() as random_id FROM table;

-- Note: SQL doesn't support direct seed specification
-- Seeding typically done through configuration or DataFrame API

// Generate random UUIDs
df.withColumn("uuid", expr("uuid()"))

// For deterministic UUIDs (via internal constructor)
// This requires direct expression construction
val seededUuid = new Uuid(lit(12345L))
df.select(Column(seededUuid).alias("deterministic_uuid"))