ListQuery

Overview

ListQuery is a Catalyst expression that represents the subquery side of an IN predicate, as in `expr IN (SELECT ...)`. It wraps the logical plan whose rows form the list of values that outer query expressions are compared against, and it supports both correlated and non-correlated subqueries within Spark SQL. EXISTS subqueries are represented by a separate expression, `Exists`, rather than by ListQuery.

Syntax

-- IN subquery
SELECT * FROM table1 WHERE column IN (SELECT column FROM table2)

-- EXISTS subquery (shown for comparison; Catalyst represents it with Exists, not ListQuery)
SELECT * FROM table1 WHERE EXISTS (SELECT 1 FROM table2 WHERE table1.id = table2.id)

// Created internally by Catalyst during query planning
// Not directly accessible via DataFrame API

Arguments

Argument	Type	Description
`plan`	`LogicalPlan`	The logical plan representing the subquery
`outerAttrs`	`Seq[Expression]`	References to outer query attributes used in correlation
`exprId`	`ExprId`	Unique identifier for this expression instance
`numCols`	`Int`	Number of columns in the original plan (-1 if unset)
`joinCond`	`Seq[Expression]`	Join conditions for correlated subqueries
`hint`	`Option[HintInfo]`	Optional query hints for optimization

Return Type

The return type depends on the number of columns in the subquery:

Single column: Returns the data type of the first output column
Multiple columns: Returns a StructType containing all column types
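
The rule above can be sketched in plain Scala. This is a simplified model, not Spark's implementation: `SketchType`, `IntT`, `StringT`, `StructT`, and `listQueryType` are hypothetical stand-ins for Spark's data type hierarchy and the expression's return type logic.

```scala
object ReturnTypeSketch {
  // Hypothetical, simplified stand-ins for Spark SQL data types.
  sealed trait SketchType
  case object IntT extends SketchType
  case object StringT extends SketchType
  final case class StructT(fields: Seq[(String, SketchType)]) extends SketchType

  // output:  (column name, column type) pairs produced by the subquery plan.
  // numCols: number of columns the subquery originally selected.
  def listQueryType(output: Seq[(String, SketchType)], numCols: Int): SketchType = {
    require(numCols >= 1 && numCols <= output.size, "numCols must be set and within the plan output")
    if (numCols > 1) StructT(output.take(numCols)) // multiple columns: one struct value per row
    else output.head._2                            // single column: type of the first output column
  }

  def main(args: Array[String]): Unit = {
    println(listQueryType(Seq("id" -> IntT), numCols = 1))
    println(listQueryType(Seq("category" -> StringT, "supplier" -> IntT), numCols = 2))
  }
}
```

Passing `numCols` separately from the plan output mirrors the description in the Arguments table, where `numCols` refers to the original plan.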

Supported Data Types

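ListQuery itself places no restriction on data types, since the subquery can select any valid expressions. The specific data types depend on the columns selected in the subquery plan, and the enclosing IN predicate must still type-check those columns against the outer expressions.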

Algorithm

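ListQuery is never executed on its own (see Code Generation); the steps below describe how the enclosing IN predicate is processed
The analyzer resolves the encapsulated logical plan and records its column count in numCols
For single-column subqueries, the outer expression is compared against the single output column
For multi-column subqueries, the outer expressions are compared against a struct built from all output columns
Correlation is handled by tracking outer references in outerAttrs and collecting correlated predicates in joinCond
The optimizer rewrites the predicate into a join (typically a left semi join for IN and a left anti join for NOT IN), applying joinCond as additional join conditions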
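
Because a ListQuery is not evaluated directly, an IN subquery is typically turned into a left semi join before execution. The following is a minimal sketch of that idea using plain Scala collections, not Spark APIs. `Employee`, `Department`, and `namesInBigDepartments` are hypothetical names, and NULL handling is deliberately ignored.

```scala
object InSubquerySketch {
  // Hypothetical row types for illustration only.
  final case class Employee(name: String, departmentId: Int)
  final case class Department(id: Int, budget: Long)

  // Models: SELECT name FROM employees
  //         WHERE department_id IN (SELECT id FROM departments WHERE budget > 100000)
  def namesInBigDepartments(employees: Seq[Employee], departments: Seq[Department]): Seq[String] = {
    // Subquery side: evaluate the inner query once and keep the distinct keys.
    val keys: Set[Int] = departments.filter(_.budget > 100000L).map(_.id).toSet
    // Left semi join: keep each outer row at most once when a matching key exists.
    employees.filter(e => keys.contains(e.departmentId)).map(_.name)
  }

  def main(args: Array[String]): Unit = {
    val employees = Seq(Employee("Ana", 1), Employee("Bo", 2), Employee("Cy", 1))
    val departments = Seq(Department(1, 200000L), Department(2, 50000L))
    println(namesInBigDepartments(employees, departments))
  }
}
```

A semi join differs from an inner join in that duplicate keys on the subquery side never duplicate outer rows, which is why the sketch collects the keys into a `Set`.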

Partitioning Behavior

ListQuery is an expression rather than a physical operator, so it defines no output partitioning of its own
The join that replaces the IN predicate may require shuffle operations, depending on the subquery's size and complexity
Correlated subqueries typically end up as broadcast or shuffle joins
The actual partitioning behavior depends on the physical plan chosen for the underlying logical plan

Edge Cases

Null handling: Nullability is not defined for a standalone ListQuery; in Spark 3.5+, querying it raises an assertion error unless the legacy IN-subquery nullability mode is enabled
Empty subquery: The subquery produces no rows, so the IN predicate evaluates to false for every comparison
Unresolved state: The expression remains unresolved until its plan is resolved and numCols is set to a valid value (anything other than -1)
Multi-column correlation: A correlated subquery may reference several outer attributes, all of which are tracked in outerAttrs
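
The empty-subquery and NULL cases are easier to see with a small model of SQL's three-valued IN semantics. The sketch below is plain Scala and follows standard SQL rules, which may differ from a given Spark version in corner cases. `sqlIn` is a hypothetical helper, with `None` standing in for NULL or unknown.

```scala
object InSemanticsSketch {
  // SQL three-valued logic: Some(true), Some(false), or None for NULL/unknown.
  def sqlIn(value: Option[Int], list: Seq[Option[Int]]): Option[Boolean] =
    if (list.isEmpty) Some(false)             // empty subquery: false for every comparison
    else if (value.isEmpty) None              // NULL IN (non-empty list) is unknown
    else if (list.contains(value)) Some(true) // a non-NULL match was found
    else if (list.exists(_.isEmpty)) None     // no match, but the list holds a NULL: unknown
    else Some(false)                          // no match and no NULLs: false

  def main(args: Array[String]): Unit = {
    println(sqlIn(Some(1), Seq.empty))
    println(sqlIn(Some(2), Seq(Some(1), None)))
  }
}
```

The `None` results are the reason NOT IN over a subquery that can return NULLs often filters out more rows than expected.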

Code Generation

This expression is marked as Unevaluable, meaning it cannot be directly evaluated and does not support code generation. It must be transformed into executable operations (like joins or broadcasts) during query planning phases.
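
To illustrate what "unevaluable" means in practice, here is a minimal sketch in plain Scala. It is not Spark's class hierarchy: `Expr`, `UnevaluableExpr`, `Column`, and `SubqueryPlaceholder` are hypothetical names, chosen only to show an expression type whose evaluation always fails and which therefore has to be rewritten before execution.

```scala
import scala.util.Try

object UnevaluableSketch {
  // Hypothetical minimal expression interface; not Spark's Expression class.
  trait Expr {
    def eval(row: Map[String, Any]): Any
  }

  // Models the idea behind an unevaluable expression: calling eval always fails.
  trait UnevaluableExpr extends Expr {
    final def eval(row: Map[String, Any]): Any =
      throw new UnsupportedOperationException(s"Cannot evaluate expression: $this")
  }

  // An ordinary expression that reads a value from the input row.
  final case class Column(name: String) extends Expr {
    def eval(row: Map[String, Any]): Any = row(name)
  }

  // Placeholder that only describes a subquery; it must be rewritten before execution.
  final case class SubqueryPlaceholder(description: String) extends UnevaluableExpr

  def main(args: Array[String]): Unit = {
    println(Column("id").eval(Map("id" -> 7)))
    println(Try(SubqueryPlaceholder("SELECT id FROM t2").eval(Map.empty)).isFailure)
  }
}
```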

Examples

-- Simple IN subquery
SELECT name FROM employees 
WHERE department_id IN (SELECT id FROM departments WHERE budget > 100000);

-- Correlated EXISTS subquery (for comparison; represented by Exists, not ListQuery)
SELECT * FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

-- Multi-column subquery
SELECT * FROM products 
WHERE (category, supplier) IN (SELECT category, supplier FROM featured_products);

// Internal usage during Catalyst optimization
// ListQuery instances are created by the analyzer when processing subquery expressions
// Not directly instantiated in user code