NthValue¶
Overview¶
NthValue is an aggregate window function that returns the value of the input expression at the nth row of the window frame. Null handling is optional: when enabled, null values are skipped when determining which row is the nth.
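For illustration, here is a minimal Scala sketch of the behavior (the local SparkSession and the toy columns id and v are assumptions for this example, not part of the original): once the frame spans the whole partition, every row sees the value of the 2nd row.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy data: three rows in a single partition
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v")

// Frame spans the entire partition, so every row sees the 2nd value: "b"
val w = Window.orderBy("id")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.select($"id", $"v", nth_value($"v", 2).over(w).as("second_v")).show()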
Syntax¶
// DataFrame API usage is typically through window functions
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// nth_value takes the input column and an Int offset (plus an optional ignoreNulls flag)
df.select(nth_value(col("column"), 2).over(Window.partitionBy("partition_col")))
Arguments¶
| Argument | Type | Description |
|---|---|---|
| input | Any | The expression whose value should be returned from the nth row |
| offset | Integer | The 1-based position of the row to retrieve (must be > 0 and foldable) |
| ignoreNulls | Boolean | Optional flag to ignore null values when counting rows (default: false; see the sketch after this table) |
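As a hedged sketch of how the ignoreNulls argument is passed through the Scala API (df, region, hire_date, and commission are hypothetical names), the three-argument form of nth_value exposes the flag directly:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Frame spans the whole partition so the nth position is counted over all rows
val w = Window.partitionBy("region").orderBy("hire_date")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// Count only non-null commission values when locating the 3rd row
df.select(nth_value(col("commission"), 3, ignoreNulls = true).over(w))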
Return Type¶
Returns the same data type as the input expression. If no nth row exists or the nth row contains null, returns null.
Supported Data Types¶
- Input: Any data type (AnyDataType)
- Offset: Integer types only (IntegerType)
- Return: Same as the input data type
Algorithm¶
- Maintains an internal counter starting at 1 for each window partition
- When ignoreNulls is false: increments the counter for every row and captures the input value when the counter equals the offset
- When ignoreNulls is true: only increments the counter for non-null input values and captures the value when the counter equals the offset for a non-null input
- Uses an aggregation buffer with two attributes: the result value and the current count
- Evaluates to the captured result value, or null if the nth position is never reached (see the sketch below)
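The following standalone Scala sketch models the (result, count) buffer logic described above on a plain sequence; it is an illustrative approximation, not the actual Catalyst implementation, and the helper name nthValue is hypothetical:

// Illustrative model of the aggregation buffer over one window partition.
// `values` stands in for the rows of a frame; None models a SQL NULL.
def nthValue[T](values: Seq[Option[T]], offset: Int, ignoreNulls: Boolean): Option[T] = {
  require(offset > 0, "offset must be a positive integer")
  var count = 1                  // counter starts at 1
  var result: Option[T] = None   // captured value, initially null
  for (v <- values) {
    if (ignoreNulls) {
      // Only non-null inputs advance the counter
      if (v.isDefined) {
        if (count == offset) result = v
        count += 1
      }
    } else {
      // Every row advances the counter
      if (count == offset) result = v
      count += 1
    }
  }
  result                         // null if the nth position was never reached
}

// nthValue(Seq(None, Some("a"), None, Some("b")), 2, ignoreNulls = true) == Some("b")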
Partitioning Behavior¶
- Preserves input partitioning, as this is a window function that operates within partitions
- Does not require a shuffle if proper partitioning already exists (see the sketch below)
- Works with UnspecifiedFrame and operates on the entire window partition by default
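As a hedged way to check the shuffle behavior (df, department, and salary are hypothetical names), compare physical plans and look for Exchange nodes feeding the Window operator:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("department").orderBy(desc("salary"))
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val withNth = df
  .repartition(col("department"))   // hash-partition by the window's partition key
  .select(nth_value(col("salary"), 2).over(w))

// Inspect the physical plan; an existing hash partitioning on "department"
// can satisfy the Window operator's distribution requirement without another shuffle.
withNth.explain()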
Edge Cases¶
- Null handling: When ignoreNulls is true, null values don't count toward the offset position
- Invalid offset: The offset must be a positive integer (> 0); this is checked during query analysis
- Non-foldable offset: The offset expression must be foldable, i.e. a constant known before execution
- Insufficient rows: Returns null if there aren't enough (non-null) rows to reach the nth position (see the sketch below)
- Empty partitions: Returns null for partitions with no rows
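A minimal sketch of the insufficient-rows case (an existing SparkSession named spark and the toy region/commission data are assumptions): with only two non-null commissions in the partition, asking for the 3rd while ignoring nulls yields null on every row.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Toy data: one partition with two non-null commissions and one null
val reps = Seq(
  ("east", Some(100.0)),
  ("east", None),
  ("east", Some(250.0))
).toDF("region", "commission")

val w = Window.partitionBy("region").orderBy("commission")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// Only two non-null commissions exist, so the 3rd (ignoring nulls) is null on every row
reps.select($"region", nth_value($"commission", 3, ignoreNulls = true).over(w)).show()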
Code Generation¶
This expression extends AggregateWindowFunction, which supports Tungsten code generation through the Catalyst framework: the update expressions and evaluation logic are code-generated rather than interpreted, for better performance.
Examples¶
-- Get the 2nd highest salary per department
-- (the explicit frame lets every row, including the first, see the whole partition)
SELECT employee_name, salary,
       nth_value(salary, 2) OVER (
         PARTITION BY department
         ORDER BY salary DESC
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS second_highest
FROM employees;
-- Get the 3rd non-null commission value, ignoring nulls
SELECT nth_value(commission, 3) IGNORE NULLS OVER (
         PARTITION BY region
         ORDER BY hire_date
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS third_commission
FROM sales_reps;
// DataFrame API usage
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Frame spans the whole partition so every row sees the 2nd-highest salary
val windowSpec = Window.partitionBy("department").orderBy(desc("salary"))
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.select(
  col("employee_name"),
  col("salary"),
  nth_value(col("salary"), 2).over(windowSpec).alias("second_highest")
)
See Also¶
- FirstValue - Gets the first value in a window frame
- LastValue - Gets the last value in a window frame
- Lag / Lead - Access values at relative offsets from the current row
- RowNumber - Assigns sequential numbers within window partitions