WholeStageCodegenExec¶
Overview¶
WholeStageCodegenExec compiles a subtree of plans that support codegen together into a single Java function, eliminating virtual function calls and improving CPU efficiency. It acts as a code generation framework that combines multiple physical operators into one generated class that processes data in a pipelined fashion.
When Used¶
The query planner chooses this operator when:
- spark.sql.codegen.wholeStage is enabled (default: true)
- A subtree of operators all implement CodegenSupport and return supportCodegen = true
- No expressions contain CodegenFallback requirements
- Schema doesn't exceed spark.sql.codegen.maxFields limit
- The CollapseCodegenStages rule identifies a chain of codegen-compatible operators
Input Requirements¶
- Expected input partitioning: Inherits from child operator's output partitioning
- Expected input ordering: Inherits from child operator's output ordering
- Number of children: Unary operator (exactly one child)
Output Properties¶
- Output partitioning: Same as child's output partitioning
- Output ordering: Same as child's output ordering
- Output schema: Identical to child's output schema (pass-through)
Algorithm¶
- Generate Java source code by calling
child.produce()which recursively generates code for the entire subtree - Compile the generated code into a
BufferedRowIteratorsubclass with class name based oncodegenStageId - Create an evaluator factory that instantiates the compiled code with runtime references
- Execute by mapping over input RDDs (supports up to 2 input RDDs) using the generated iterator
- Each generated iterator processes rows in a tight loop, calling
processNext()andappend()for results - Handle fallback to interpreted execution if code generation or compilation fails
- Track code generation time and compilation statistics for monitoring
Memory Usage¶
- Spill behavior: Does not spill to disk itself (inherits child's spill behavior)
- Memory requirements: Minimal - only buffers output rows via
BufferedRowIterator.append() - Buffering: Uses
BufferedRowIteratorto buffer result rows, but actual memory usage depends on child operators
Partitioning Behavior¶
- Data distribution: Preserves child's partitioning exactly
- Shuffle requirements: No shuffling - purely a computational optimization
- Partition count: Unchanged from input
Supported Join/Aggregation Types¶
Supports any join/aggregation types that the underlying child operators support, including: - Joins: BroadcastHashJoin, ShuffledHashJoin, SortMergeJoin, BroadcastNestedLoopJoin - Aggregations: HashAggregate, SortAggregate - All other codegen-compatible operators in the subtree
Metrics¶
- pipelineTime: Total time spent executing the generated pipeline code (in milliseconds)
- Inherits all metrics from child operators
- Global code generation time tracking via
WholeStageCodegenExec.codeGenTime
Code Generation¶
This operator IS the whole-stage code generation implementation. It:
- Generates Java source code for entire operator subtrees
- Compiles code at runtime using Janino compiler
- Falls back to interpreted execution if compilation fails or bytecode size exceeds spark.sql.codegen.hugeMethodLimit
- Supports broadcasting large generated code when size exceeds threshold
Configuration Options¶
spark.sql.codegen.wholeStage- Enable/disable whole-stage codegenspark.sql.codegen.maxFields- Maximum fields in schema for codegenspark.sql.codegen.hugeMethodLimit- Bytecode size limit before disabling codegenspark.sql.codegen.fallback- Allow fallback to interpreted execution on compilation failurespark.sql.codegen.useIdInClassName- Include stage ID in generated class namesspark.sql.codegen.splitConsumeFuncByOperator- Split consume functions to avoid long methods
Edge Cases¶
- Compilation failure: Automatically falls back to child.execute() if code generation or compilation fails
- Large bytecode: Disables codegen and uses interpreted execution when compiled code exceeds huge method limit
- Complex schemas: Automatically disabled when schema has too many nested fields
- Expressions with CodegenFallback: Prevents whole-stage codegen for the subtree
Examples¶
== Physical Plan ==
*(1) Project [id#0L, (id#0L * 2) AS doubled#3L]
+- *(1) Filter (id#0L > 100)
+- *(1) Range (0, 1000, step=1, splits=8)
// The *(1) indicates WholeStageCodegenExec with codegenStageId=1
// wrapping Project -> Filter -> Range in a single generated function
See Also¶
InputAdapter- Bridges non-codegen operators into codegen subtreesCodegenSupport- Trait implemented by codegen-compatible operatorsCollapseCodegenStages- Rule that inserts WholeStageCodegenExec operators