Randn¶

Overview¶

The Randn expression generates normally distributed random numbers (Gaussian distribution) with mean 0.0 and standard deviation 1.0. This is a non-deterministic expression that produces different values on each evaluation unless a specific seed is provided.

Syntax¶

randn()           -- Uses random seed
randn(seed)       -- Uses specified seed for reproducibility

Arguments¶

Argument	Type	Description
seed	Long	Optional. Seed value for random number generation. When provided, ensures reproducible results

Return Type¶

DoubleType - Returns a double-precision floating point number.

Supported Data Types¶

Seed argument accepts LongType expressions
No input data transformation - generates values independently

Algorithm¶

Uses XORShiftRandom as the underlying random number generator
Calls nextGaussian() method to generate normally distributed values
Each partition gets its own RNG instance initialized with seed + partitionIndex
Maintains deterministic behavior within partitions when seed is specified
Extends NondeterministicUnaryRDG for proper non-deterministic handling

Partitioning Behavior¶

How this expression affects partitioning:

Preserves existing partitioning since it doesn't depend on input data
Does not require shuffle operations
Each partition maintains independent random state for consistent results

Edge Cases¶

Null handling: Never returns null values (isNull = FalseLiteral)
Empty input: Generates values regardless of input data presence
Range: Can generate any finite double value following normal distribution
Determinism: Non-deterministic by default, deterministic when seed provided
Partition consistency: Same seed produces same sequence within each partition

Code Generation¶

Supports full code generation (Tungsten). The generated code creates a partition-local XORShiftRandom instance and directly calls nextGaussian() method, avoiding interpreted evaluation overhead.

Examples¶

-- Generate random normally distributed values
SELECT randn() as random_normal;

-- Generate reproducible random values with seed
SELECT randn(42) as seeded_random;

-- Use in data generation scenarios
SELECT id, randn(123) as noise 
FROM range(1000);

// Example DataFrame API usage
import org.apache.spark.sql.functions._

// Generate random normal values
df.select(randn().alias("random_normal"))

// With specific seed for reproducibility
df.select(randn(lit(42L)).alias("seeded_random"))

// Add noise to existing column
df.withColumn("noisy_value", col("value") + randn(seed = 123))