InSubquery¶
Overview¶
The InSubquery expression evaluates to true if the specified values are found in the result set of a subquery. This is a predicate expression that implements the SQL IN operator with subquery functionality, allowing for membership testing against dynamically computed result sets.
Syntax¶
expression IN (SELECT ...)
(expr1, expr2, ...) IN (SELECT col1, col2, ... FROM table WHERE condition)
Arguments¶
| Argument | Type | Description |
|---|---|---|
values |
Seq[Expression] |
The expressions to test for membership in the subquery results |
query |
ListQuery |
The subquery that returns a list of values to check against |
Return Type¶
Returns BooleanType - true if the values exist in the subquery result set, false otherwise, or null if nullability conditions are met.
Supported Data Types¶
Supports any data types that can be compared for equality and have ordering semantics. The expression requires:
- Values and subquery columns must have matching data types
- All data types must support ordering operations
- Number of values must match the number of columns returned by the subquery
Algorithm¶
-
Creates a structured representation of multiple values using
CreateNamedStructfor multi-column comparisons or uses the single value directly -
Validates that the number of input values matches the number of columns returned by the subquery
-
Performs structural data type comparison between values and subquery output, ignoring nullability
-
Checks that all data types support ordering operations required for membership testing
-
Defers actual evaluation to the query execution engine as this is an
Unevaluableexpression
Partitioning Behavior¶
This expression affects partitioning as follows:
- Does not preserve partitioning due to subquery dependency
- May require shuffle operations depending on the subquery implementation
- Subquery evaluation strategy determines the actual partitioning impact
Edge Cases¶
-
Null handling: Nullability depends on the
LEGACY_IN_SUBQUERY_NULLABILITYconfiguration setting. When disabled (default), considers nullability of both values and subquery outputs. When enabled, only considers left-hand side nullability for legacy compatibility -
Length mismatch: Throws
DataTypeMismatcherror when the number of values doesn't match the number of subquery columns -
Type mismatch: Validates structural data type equality and reports specific mismatched columns in error messages
-
Empty subquery: Returns
falsewhen subquery returns no results (standard SQL behavior)
Code Generation¶
This expression is marked as Unevaluable, meaning it does not support direct code generation. The expression is transformed during query planning phases into executable operators that handle the subquery evaluation and membership testing.
Examples¶
-- Single value IN subquery
SELECT * FROM orders WHERE customer_id IN (SELECT id FROM premium_customers);
-- Multiple values IN subquery
SELECT * FROM products WHERE (category_id, supplier_id) IN (
SELECT cat_id, sup_id FROM active_combinations
);
// DataFrame API usage (typically generated internally)
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical._
// This expression is typically created during SQL parsing
// and not directly instantiated in DataFrame API
val inSubquery = InSubquery(
values = Seq(col("customer_id").expr),
query = ListQuery(subqueryPlan)
)
See Also¶
In- For IN expressions with literal value listsExists- For existential subquery predicatesScalarSubquery- For subqueries returning single valuesListQuery- The subquery container used by this expression