GroupingID¶

Overview¶

The GroupingID expression computes a grouping identifier that indicates which columns are being grouped in GROUPING SETS, CUBE, and ROLLUP operations. It returns a bitmask where each bit represents whether a specific grouping column is included in the current grouping combination, enabling identification of different aggregation levels in hierarchical grouping operations.

Syntax¶

GROUPING_ID(column1, column2, ...)

// DataFrame API - typically used internally with cube/rollup operations
df.cube("col1", "col2").agg(expr("grouping_id(col1, col2)"))

Arguments¶

Argument	Type	Description
groupByExprs	Seq[Expression]	The sequence of expressions that represent the grouping columns for which the grouping ID is calculated

Return Type¶

Integer data type - returns a bitmask as an integer value representing the grouping combination.

Supported Data Types¶

The GroupingID expression itself works with any data types for the input columns since it operates on the grouping metadata rather than the actual column values. The input expressions can be of any supported Spark data type.

Algorithm¶

Maintains references to a virtual groupingIdAttribute column rather than the actual input columns
Acts as an Unevaluable expression, meaning it cannot be directly evaluated in normal expression contexts
The actual evaluation is handled by specialized grouping operators (Cube, Rollup, GroupingSets)
Generates a bitmask where each bit position corresponds to a grouping expression
Returns an integer representing which columns are active in the current grouping level

Partitioning Behavior¶

Does not preserve partitioning as it's typically used in aggregation contexts
Requires shuffle operations when used with CUBE, ROLLUP, or GROUPING SETS
The expression itself doesn't directly affect partitioning but is used within aggregation operators that do

Edge Cases¶

Null handling: The expression itself is marked as non-nullable (nullable = false) since it returns grouping metadata rather than data-dependent values
Empty input: When used with empty grouping expressions, would return 0
Unevaluable context: Throws evaluation exceptions if used outside of proper grouping contexts since it extends Unevaluable
Virtual column dependency: Relies on the presence of the virtual groupingIdAttribute in the execution context

Code Generation¶

This expression does not support direct code generation since it extends Unevaluable. The code generation is handled by the parent grouping operators (Aggregate with Cube/Rollup/GroupingSets) that replace this expression with actual computed values during physical plan execution.

Examples¶

-- Example with CUBE - grouping_id identifies aggregation levels
SELECT col1, col2, SUM(value), GROUPING_ID(col1, col2)
FROM table
GROUP BY CUBE(col1, col2);

-- Example with ROLLUP
SELECT dept, category, SUM(sales), GROUPING_ID(dept, category)
FROM sales_table  
GROUP BY ROLLUP(dept, category);

// DataFrame API usage
import org.apache.spark.sql.functions._

df.cube("department", "category")
  .agg(
    sum("sales").as("total_sales"),
    expr("grouping_id(department, category)").as("grouping_level")
  )