NullIndex¶

Overview¶

NullIndex is a window function expression that returns, for each row, the 1-based position of that row within the current run of consecutive null values, and 0 for any row whose value is non-null. It is designed specifically for the Pandas API on Spark, where it provides pandas-compatible null indexing within window operations.

Syntax¶

null_index(column_expression)

Arguments¶

Argument	Type	Description
input	Expression	The input expression to check for null values (shown as column_expression in the syntax above)

Return Type¶

IntegerType - Returns, for a null row, its 1-based position within the current run of consecutive nulls, and 0 for a non-null row.

Supported Data Types¶

All data types are supported as input since the expression only checks for null values using IsNull, which works with any data type.

Algorithm¶

- Maintains an internal counter (index), initialized to 0 at the start of each partition
- For each row of the partition, in the order defined by the window specification:
  - If the input value is null, increments the counter by 1
  - If the input value is non-null, resets the counter to 0
- Returns the counter value after the update as the result for that row
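
The steps above can be sketched in plain Scala, without Spark. This is a minimal illustration under stated assumptions, not the Catalyst implementation: the object name `NullIndexSketch` and the helper `nullIndex` are hypothetical, and `None` stands in for a SQL NULL.

```scala
object NullIndexSketch {
  // Hypothetical helper: None models SQL NULL, Some(_) models any non-null value.
  // The input is assumed to be one partition, already in window order.
  def nullIndex[A](values: Seq[Option[A]]): Seq[Int] =
    values
      .scanLeft(0) { (counter, value) =>
        // Null increments the counter; any non-null value resets it to 0.
        if (value.isEmpty) counter + 1 else 0
      }
      .tail // drop the initial counter value so there is one result per input row

  def main(args: Array[String]): Unit = {
    val input = Seq(None, Some(1), Some(2), Some(3), None, None, None, Some(5), None, None)
    println(nullIndex(input).mkString(", ")) // prints: 1, 0, 0, 0, 1, 2, 3, 0, 1, 2
  }
}
```

`scanLeft` carries the counter from row to row, which mirrors the running state described in the list; an empty input yields an empty result.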

Partitioning Behavior¶

As an AggregateWindowFunction, this expression:

- Operates within the window partitions defined by the window specification
- Requires no additional shuffle when the data is already partitioned on the window's partition columns
- Maintains its counter independently for each window partition
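
To illustrate the per-partition state, the following plain-Scala sketch restarts the counter for every partition key. It is an approximation under stated assumptions, not how Spark executes the window: the `Record` shape, its field names, and `nullIndexByPartition` are hypothetical, and partitions are processed locally rather than distributed.

```scala
object PartitionedNullIndexSketch {
  // Hypothetical record shape: partition key, ordering column, nullable value (None models NULL).
  final case class Record(key: String, id: Int, value: Option[Int])

  // Running counter over one ordered partition: null increments, non-null resets to 0.
  private def nullIndex(values: Seq[Option[Int]]): Seq[Int] =
    values.scanLeft(0)((counter, v) => if (v.isEmpty) counter + 1 else 0).tail

  // Returns (key, id, null_idx) triples; the counter starts from 0 in every partition.
  def nullIndexByPartition(records: Seq[Record]): Seq[(String, Int, Int)] =
    records
      .groupBy(_.key)  // one group per partition key
      .toSeq
      .sortBy(_._1)    // sorted by key only to make the output deterministic
      .flatMap { case (key, group) =>
        val ordered = group.sortBy(_.id) // models ORDER BY id inside the partition
        ordered.zip(nullIndex(ordered.map(_.value))).map { case (record, idx) =>
          (key, record.id, idx)
        }
      }
}
```

A run of nulls at the end of one partition therefore has no effect on the first row of the next partition.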

Edge Cases¶

- Null handling: Each null input increments the internal counter, so the first null in a run yields 1
- Non-null values: Any non-null value (including empty strings, zeros, and false) resets the counter to 0
- Empty partitions: Produce no output rows
- Integer overflow: Possible in theory, but only if a single run of consecutive nulls within one partition exceeds Integer.MAX_VALUE (2,147,483,647) rows
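
The edge cases can be checked against a small plain-Scala model. This is a sketch under assumptions rather than Spark code: `NullIndexEdgeCaseSketch`, `nullIndex`, and `incrementWraps` are hypothetical names, `None` models NULL, and the overflow helper only shows ordinary JVM Int arithmetic, not what Spark itself does at that boundary.

```scala
object NullIndexEdgeCaseSketch {
  // Running counter: null (None) increments, any non-null value resets to 0.
  def nullIndex(values: Seq[Option[Any]]): Seq[Int] =
    values.scanLeft(0)((counter, v) => if (v.isEmpty) counter + 1 else 0).tail

  // Empty string, zero and false are all non-null, so each one resets the counter.
  val mixed: Seq[Option[Any]] =
    Seq(None, Some(""), None, Some(0), None, None, Some(false), None)
  val mixedResult: Seq[Int] = nullIndex(mixed) // 1, 0, 1, 0, 1, 2, 0, 1

  // An empty partition yields no rows.
  val emptyResult: Seq[Int] = nullIndex(Seq.empty)

  // Plain JVM Int addition wraps silently at Int.MaxValue. How the Spark
  // expression behaves at this boundary is not modeled here.
  def incrementWraps(counter: Int): Int = counter + 1
}
```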

Code Generation¶

NullIndex extends AggregateWindowFunction and builds its updateExpressions from standard Catalyst expressions (If, IsNull, Literal). It therefore defines no code generation of its own and should be covered by the Tungsten code generation that already exists for those underlying expressions.
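
The idea of describing the update step as an expression tree, rather than as imperative code, can be illustrated with a toy model. Everything below is a hypothetical stand-in written for this sketch: `IntExpr`, `Counter`, `Lit`, `Plus`, and `IfInputIsNull` are not Catalyst classes, and the tree is interpreted here, whereas Spark can evaluate or compile its own expression trees.

```scala
object UpdateExpressionSketch {
  // Toy expression tree; these are NOT the Spark Catalyst classes.
  sealed trait IntExpr
  case object Counter extends IntExpr                    // current counter value
  final case class Lit(value: Int) extends IntExpr       // integer literal
  final case class Plus(left: IntExpr, right: IntExpr) extends IntExpr
  final case class IfInputIsNull(whenNull: IntExpr, otherwise: IntExpr) extends IntExpr

  // Declarative update rule: if the input is null then counter + 1, else 0.
  val update: IntExpr = IfInputIsNull(Plus(Counter, Lit(1)), Lit(0))

  // Minimal interpreter for the toy tree.
  def eval(e: IntExpr, counter: Int, inputIsNull: Boolean): Int = e match {
    case Counter             => counter
    case Lit(v)              => v
    case Plus(l, r)          => eval(l, counter, inputIsNull) + eval(r, counter, inputIsNull)
    case IfInputIsNull(a, b) =>
      if (inputIsNull) eval(a, counter, inputIsNull) else eval(b, counter, inputIsNull)
  }

  // Applies the update rule row by row, starting from an initial counter of 0.
  def run(inputs: Seq[Option[Int]]): Seq[Int] =
    inputs.scanLeft(0)((counter, v) => eval(update, counter, v.isEmpty)).tail
}
```

Because the rule is data rather than code, an engine is free to interpret it or to generate specialized code from it, which is the property the paragraph above relies on.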

Examples¶

-- Example with window function
-- (assumes null_index is resolvable as a SQL function in the current session)
SELECT
  value,
  null_index(value) OVER (ORDER BY id) AS null_idx
FROM table_name;

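// DataFrame API usage (Pandas API on Spark context)
// Assumes a SparkSession named `spark` and a DataFrame `df` with columns `id` and `value`.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.expr
import spark.implicits._ // enables the $"column" syntax

val windowSpec = Window.orderBy("id")
df.select($"value",
  expr("null_index(value)").over(windowSpec).as("null_idx"))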

Example Input/Output:

Input:  null, 1, 2, 3, null, null, null, 5, null, null
Output: 1,    0, 0, 0, 1,    2,    3,    0, 1,    2