GroupingID¶
Overview¶
The GroupingID expression computes a grouping identifier that indicates which columns are being grouped in GROUPING SETS, CUBE, and ROLLUP operations. It returns a bitmask where each bit represents whether a specific grouping column is included in the current grouping combination, enabling identification of different aggregation levels in hierarchical grouping operations.
Syntax¶
// DataFrame API - typically used internally with cube/rollup operations
df.cube("col1", "col2").agg(expr("grouping_id(col1, col2)"))
Arguments¶
| Argument | Type | Description |
|---|---|---|
| groupByExprs | Seq[Expression] | The sequence of expressions that represent the grouping columns for which the grouping ID is calculated |
Return Type¶
Integer data type - returns a bitmask as an integer value representing the grouping combination.
Supported Data Types¶
The GroupingID expression itself works with any data types for the input columns since it operates on the grouping metadata rather than the actual column values. The input expressions can be of any supported Spark data type.
Algorithm¶
- Maintains references to a virtual
groupingIdAttributecolumn rather than the actual input columns - Acts as an
Unevaluableexpression, meaning it cannot be directly evaluated in normal expression contexts - The actual evaluation is handled by specialized grouping operators (Cube, Rollup, GroupingSets)
- Generates a bitmask where each bit position corresponds to a grouping expression
- Returns an integer representing which columns are active in the current grouping level
Partitioning Behavior¶
- Does not preserve partitioning as it's typically used in aggregation contexts
- Requires shuffle operations when used with CUBE, ROLLUP, or GROUPING SETS
- The expression itself doesn't directly affect partitioning but is used within aggregation operators that do
Edge Cases¶
- Null handling: The expression itself is marked as non-nullable (
nullable = false) since it returns grouping metadata rather than data-dependent values - Empty input: When used with empty grouping expressions, would return 0
- Unevaluable context: Throws evaluation exceptions if used outside of proper grouping contexts since it extends
Unevaluable - Virtual column dependency: Relies on the presence of the virtual
groupingIdAttributein the execution context
Code Generation¶
This expression does not support direct code generation since it extends Unevaluable. The code generation is handled by the parent grouping operators (Aggregate with Cube/Rollup/GroupingSets) that replace this expression with actual computed values during physical plan execution.
Examples¶
-- Example with CUBE - grouping_id identifies aggregation levels
SELECT col1, col2, SUM(value), GROUPING_ID(col1, col2)
FROM table
GROUP BY CUBE(col1, col2);
-- Example with ROLLUP
SELECT dept, category, SUM(sales), GROUPING_ID(dept, category)
FROM sales_table
GROUP BY ROLLUP(dept, category);
// DataFrame API usage
import org.apache.spark.sql.functions._
df.cube("department", "category")
.agg(
sum("sales").as("total_sales"),
expr("grouping_id(department, category)").as("grouping_level")
)
See Also¶
Grouping- Related expression for checking individual column grouping statusCube- Physical operator that utilizes GroupingIDRollup- Physical operator that utilizes GroupingIDGroupingSets- Physical operator that utilizes GroupingIDVirtualColumn.groupingIdAttribute- The virtual column reference used internally