SparkPartitionID¶

Overview¶

The SparkPartitionID expression returns the partition ID of the current partition during query execution. This is a non-deterministic leaf expression that provides access to Spark's internal partitioning information at runtime.

Syntax¶

SPARK_PARTITION_ID()

// DataFrame API
import org.apache.spark.sql.functions._
df.select(expr("SPARK_PARTITION_ID()"))

Arguments¶

Argument	Type	Description
None	-	This function takes no arguments

Return Type¶

IntegerType - Returns an integer representing the partition ID.

Supported Data Types¶

This expression does not accept input data as it is a leaf expression. It operates independently of any input data types and always returns an integer partition ID.

Algorithm¶

The expression is initialized once per partition with the partition index value
During initialization, the partitionIndex parameter is stored in the partitionId field
For each row evaluation, it simply returns the stored partition ID value
The partition ID remains constant for all rows within the same partition
The expression is marked as Nondeterministic because the same logical row can have different partition IDs across different query executions

Partitioning Behavior¶

Preserves partitioning: This expression does not affect the existing partitioning scheme
No shuffle required: Since it's a leaf expression that only reads partition metadata, no data movement is needed
The returned value directly corresponds to Spark's internal partition numbering (0-indexed)

Edge Cases¶

Null handling: This expression never returns null (nullable = false)
Empty partitions: Even empty partitions will have a valid partition ID
Single partition: In single-partition datasets, always returns 0
Partition reassignment: If partitions are redistributed during query execution, the partition ID reflects the current partition assignment

Code Generation¶

This expression fully supports Tungsten code generation: - Uses doGenCode method to generate efficient Java code - Creates an immutable state variable partitionId in the generated code - Initializes the partition ID once per partition using addPartitionInitializationStatement - Generates inline code that directly returns the partition ID value without method calls

Examples¶

-- Get partition ID for each row
SELECT SPARK_PARTITION_ID(), * FROM my_table;

-- Count rows per partition
SELECT SPARK_PARTITION_ID() as partition_id, COUNT(*) as row_count 
FROM my_table 
GROUP BY SPARK_PARTITION_ID();

-- Filter data from specific partitions
SELECT * FROM my_table WHERE SPARK_PARTITION_ID() IN (0, 1, 2);

// DataFrame API usage
import org.apache.spark.sql.functions._

// Add partition ID column
val dfWithPartitionId = df.withColumn("partition_id", expr("SPARK_PARTITION_ID()"))

// Group by partition ID
val partitionCounts = df.groupBy(expr("SPARK_PARTITION_ID()")).count()

// Filter by partition ID
val partition0Data = df.filter(expr("SPARK_PARTITION_ID() = 0"))