ListQuery¶
Overview¶
ListQuery is a Catalyst expression that represents a subquery used in list-based operations, most commonly for IN and EXISTS clauses. It encapsulates a logical plan that produces a list of values to be compared against outer query expressions, enabling correlated and non-correlated subquery execution within Spark SQL.
Syntax¶
-- IN subquery
SELECT * FROM table1 WHERE column IN (SELECT column FROM table2)
-- EXISTS subquery
SELECT * FROM table1 WHERE EXISTS (SELECT 1 FROM table2 WHERE table1.id = table2.id)
// Created internally by Catalyst during query planning
// Not directly accessible via DataFrame API
Arguments¶
| Argument | Type | Description |
|---|---|---|
plan |
LogicalPlan |
The logical plan representing the subquery |
outerAttrs |
Seq[Expression] |
References to outer query attributes used in correlation |
exprId |
ExprId |
Unique identifier for this expression instance |
numCols |
Int |
Number of columns in the original plan (-1 if unset) |
joinCond |
Seq[Expression] |
Join conditions for correlated subqueries |
hint |
Option[HintInfo] |
Optional query hints for optimization |
Return Type¶
The return type depends on the number of columns in the subquery:
- Single column: Returns the data type of the first output column
- Multiple columns: Returns a
StructTypecontaining all column types
Supported Data Types¶
Supports all Spark SQL data types as the subquery can contain any valid expressions. The specific data types depend on the columns selected in the subquery plan.
Algorithm¶
- Executes the encapsulated logical plan to produce a dataset
- For single-column subqueries, returns the column values directly
- For multi-column subqueries, constructs struct values from all columns
- Handles correlation by binding outer attributes to the subquery execution context
- Applies join conditions when processing correlated subqueries
Partitioning Behavior¶
- Does not preserve partitioning as it represents a subquery operation
- May require shuffle operations depending on the subquery complexity
- Correlated subqueries typically require broadcast or shuffle joins
- The actual partitioning behavior depends on the underlying logical plan execution
Edge Cases¶
- Null handling: Nullability is undefined for standalone
ListQueryexecution and throws an assertion error in Spark 3.5+ unless legacy mode is enabled - Empty subquery: Returns empty result set, which affects
INclause evaluation (returns false for all comparisons) - Unresolved state: Expression remains unresolved until
numColsis set to a valid value (not -1) - Multi-column correlation: Properly handles complex correlation scenarios with multiple outer attributes
Code Generation¶
This expression is marked as Unevaluable, meaning it cannot be directly evaluated and does not support code generation. It must be transformed into executable operations (like joins or broadcasts) during query planning phases.
Examples¶
-- Simple IN subquery
SELECT name FROM employees
WHERE department_id IN (SELECT id FROM departments WHERE budget > 100000);
-- Correlated EXISTS subquery
SELECT * FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
-- Multi-column subquery
SELECT * FROM products
WHERE (category, supplier) IN (SELECT category, supplier FROM featured_products);
// Internal usage during Catalyst optimization
// ListQuery instances are created by the analyzer when processing subquery expressions
// Not directly instantiated in user code
See Also¶
SubqueryExpression- Base class for all subquery expressionsScalarSubquery- For scalar subquery expressionsExists- For EXISTS clause implementationsIn- For IN clause expressions that may contain ListQuery instances