RegrSlope¶

Overview¶

RegrSlope is a declarative aggregate expression that computes the slope of the least-squares regression line over the pairs in which both the dependent variable (y) and the independent variable (x) are non-null. The slope is the ratio of the population covariance of x and y to the population variance of x.
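Written out, the ratio described above takes the following form. The notation is introduced here for illustration only: n is the number of non-null pairs, and x̄ and ȳ are the means of x and y over those pairs.

```latex
\text{slope}
  = \frac{\operatorname{covar\_pop}(y, x)}{\operatorname{var\_pop}(x)}
  = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}
  = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\sum_{i=1}^{n}(x_i-\bar{x})^2}
```

The 1/n factors cancel, so the slope is the same whether population or sample statistics are used in the ratio.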

Syntax¶

REGR_SLOPE(y, x)

// DataFrame API
df.agg(expr("regr_slope(y_column, x_column)"))

Arguments¶

Argument	Type	Description
left (y)	DoubleType	The dependent variable (y-values) for regression calculation
right (x)	DoubleType	The independent variable (x-values) for regression calculation

Return Type¶

DoubleType - Returns a double-precision floating-point number representing the slope, or null if the variance of x is zero or there are no pairs in which both x and y are non-null.

Supported Data Types¶

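Input: Both arguments are evaluated as DoubleType; because the expression mixes in ImplicitCastInputTypes, arguments of other types are implicitly cast to DoubleType where Spark's type coercion rules allow it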
Output: DoubleType

Algorithm¶

Maintains internal state using CovPopulation for computing covariance between x and y variables
Maintains internal state using VariancePop for computing population variance of x variable
Only processes pairs where both x and y values are non-null
Calculates slope as the ratio: covariance(x,y) / variance(x)
Returns null when variance of x equals zero (vertical line case)
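The steps above can be sketched in plain Scala without any Spark dependency. This is a two-pass illustration only, not Spark's implementation, which keeps running state through CovPopulation and VariancePop. The names `RegrSlopeSketch` and `regrSlope` are hypothetical, and `Option` stands in for SQL null.

```scala
// Illustrative sketch only: plain Scala, not Spark's implementation.
// RegrSlopeSketch and regrSlope are hypothetical names; None models SQL null.
object RegrSlopeSketch {
  /** Slope of the least-squares line through the non-null (y, x) pairs, or None if undefined. */
  def regrSlope(pairs: Seq[(Option[Double], Option[Double])]): Option[Double] = {
    // Keep only pairs where both y and x are present.
    val valid = pairs.collect { case (Some(y), Some(x)) => (y, x) }
    if (valid.isEmpty) None // no usable pairs: result is null
    else {
      val n = valid.size.toDouble
      val yAvg = valid.map(_._1).sum / n
      val xAvg = valid.map(_._2).sum / n
      // Co-moment of x and y, and second central moment of x.
      // Dividing both by n would give covariance and variance; the n cancels in the ratio.
      val ck = valid.map { case (y, x) => (x - xAvg) * (y - yAvg) }.sum
      val m2 = valid.map { case (_, x) => (x - xAvg) * (x - xAvg) }.sum
      if (m2 == 0.0) None else Some(ck / m2) // zero variance of x: result is null
    }
  }
}
```

For points on the line y = 2x + 1, such as (3, 1), (5, 2), (7, 3) written as (y, x), the sketch returns a slope of 2.0. A pair with a missing y or x is skipped, and input in which every x is identical yields `None`.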

Partitioning Behavior¶

As a declarative aggregate function:

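Computes partial aggregates within each partition, then requires a shuffle to bring the partial buffers for each group (or for the whole dataset when there is no GROUP BY) together for final aggregation
Does not preserve the input partitioning, since it produces a single aggregate result per group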
Uses merge expressions to combine partial aggregates from different partitions
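To show how partial aggregates from separate partitions can be combined, here is a minimal sketch in plain Scala. It uses the standard incremental and pairwise-combination formulas for means and central moments. Spark's actual merge expressions come from CovPopulation and VariancePop and are not reproduced here. The `SlopeBuffer` class and its field names are assumptions made for illustration.

```scala
// Hypothetical buffer for illustration; names are assumptions, not Spark's internals.
// n: count of valid pairs, xAvg/yAvg: running means,
// ck: co-moment of x and y, m2: second central moment of x.
final case class SlopeBuffer(n: Double, xAvg: Double, yAvg: Double, ck: Double, m2: Double) {

  /** Fold one (y, x) pair into the buffer; a pair with a missing side leaves it unchanged. */
  def update(y: Option[Double], x: Option[Double]): SlopeBuffer = (y, x) match {
    case (Some(yv), Some(xv)) =>
      val newN = n + 1.0
      val dx = xv - xAvg
      val dy = yv - yAvg
      val newXAvg = xAvg + dx / newN
      val newYAvg = yAvg + dy / newN
      SlopeBuffer(newN, newXAvg, newYAvg, ck + dx * (yv - newYAvg), m2 + dx * (xv - newXAvg))
    case _ => this
  }

  /** Combine two partial buffers, e.g. from two partitions. */
  def merge(other: SlopeBuffer): SlopeBuffer =
    if (n == 0.0) other
    else if (other.n == 0.0) this
    else {
      val total = n + other.n
      val dx = other.xAvg - xAvg
      val dy = other.yAvg - yAvg
      SlopeBuffer(
        total,
        xAvg + dx * other.n / total,
        yAvg + dy * other.n / total,
        ck + other.ck + dx * dy * n * other.n / total,
        m2 + other.m2 + dx * dx * n * other.n / total)
    }

  /** Final evaluation: None (null) when x has zero variance or no pairs were seen. */
  def slope: Option[Double] = if (m2 == 0.0) None else Some(ck / m2)
}

object SlopeBuffer {
  val empty: SlopeBuffer = SlopeBuffer(0.0, 0.0, 0.0, 0.0, 0.0)
}
```

Folding each partition's rows into its own buffer with `update` and then combining the buffers with `merge` gives the same slope as a single pass over all rows, which is the property that lets the aggregate be computed partially before the shuffle.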

Edge Cases¶

Null handling: Ignores pairs where either x or y is null; the aggregation buffer is left unchanged for such pairs
Zero variance: Returns null when the variance of x is zero (all x values are identical)
Empty input: Returns null when no valid (non-null) pairs are available
Numerical stability: Inherits numerical behavior from underlying CovPopulation and VariancePop implementations

Code Generation¶

This is a DeclarativeAggregate expression: its initial values, update, merge, and evaluate steps are themselves Catalyst expressions rather than custom imperative code, so it participates in Catalyst's code generation (Tungsten).

Examples¶

-- Calculate regression slope for sales vs advertising spend
SELECT REGR_SLOPE(sales, advertising_spend) AS slope
FROM sales_data;

-- Group by region to get slope per region
SELECT region, REGR_SLOPE(sales, advertising_spend) AS slope
FROM sales_data
GROUP BY region;

// DataFrame API usage
import org.apache.spark.sql.functions._

df.agg(expr("regr_slope(sales, advertising_spend)").alias("slope"))

// With grouping
df.groupBy("region")
  .agg(expr("regr_slope(sales, advertising_spend)").alias("slope"))