Last¶

Overview¶

The Last aggregate function returns the last value in a group of rows, optionally ignoring null values. The function is non-deterministic because its results depend on the order of the rows, which may be non-deterministic after a shuffle.

Syntax¶

LAST(expression [, ignoreNulls])

// DataFrame API
df.agg(last("column_name"))
df.agg(last("column_name", ignoreNulls = true))

Arguments¶

Argument	Type	Description
`child`	`Expression`	The expression to evaluate for each row
`ignoreNulls`	`Boolean`	Whether to ignore null values (default: false)

Return Type¶

Returns the same data type as the input expression. The result is always nullable (nullable = true).

Supported Data Types¶

Supports any data type (AnyDataType) for the main expression:

Numeric types (integers, decimals, floating-point)
String types
Date and timestamp types
Boolean types
Complex types (arrays, maps, structs)
Binary types

Algorithm¶

The Last aggregate function maintains two buffer attributes during evaluation:

Initializes buffer with null value for last and false for valueSet
For each row, updates the last value with the current expression value
If ignoreNulls is true, only updates when the current value is not null
If ignoreNulls is false, updates with every value including nulls
During merge operations, prefers the right-hand side buffer if it has been set

Partitioning Behavior¶

This expression affects partitioning behavior:

Does not preserve partitioning as it's an aggregate function
May require shuffle operations to collect all values in each group
Results are non-deterministic across different shuffle operations
Order dependency makes it sensitive to partition boundaries

Edge Cases¶

Null handling behavior varies based on the ignoreNulls parameter:

When ignoreNulls = false: Returns the last value encountered, even if null
When ignoreNulls = true: Skips null values and returns the last non-null value
Empty input groups return null
If all values are null and ignoreNulls = true, returns null
The valueSet flag tracks whether any valid value has been encountered

Code Generation¶

This expression extends DeclarativeAggregate, which supports Catalyst's code generation framework (Tungsten). The aggregate operations are expressed declaratively through updateExpressions, mergeExpressions, and evaluateExpression, allowing for efficient code generation.

Examples¶

-- Get the last salary in each department
SELECT department, LAST(salary) as last_salary
FROM employees 
GROUP BY department;

-- Get the last non-null salary in each department
SELECT department, LAST(salary, true) as last_non_null_salary
FROM employees 
GROUP BY department;

// DataFrame API examples
import org.apache.spark.sql.functions.last

// Get last value (including nulls)
df.groupBy("department").agg(last("salary"))

// Get last non-null value
df.groupBy("department").agg(last("salary", ignoreNulls = true))

// With column expressions
df.groupBy("department").agg(last(col("salary") * 1.1))