RegrSlope
Overview
RegrSlope is a declarative aggregate expression that calculates the slope of the least-squares regression line over non-null pairs of a dependent variable (y) and an independent variable (x). The slope is the ratio of the population covariance of x and y to the population variance of x.
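Written out, the ratio takes the following form. This is the standard least-squares identity, given here as a sketch; the symbols $n$, $\bar{x}$, and $\bar{y}$ (the count and the means over the non-null pairs) are notation introduced for illustration and do not appear in the original description.

$$
\text{slope}
= \frac{\operatorname{cov}_{\text{pop}}(x, y)}{\operatorname{var}_{\text{pop}}(x)}
= \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

For the pairs $(x, y) = (1, 3), (2, 5), (3, 7)$ we have $\bar{x} = 2$ and $\bar{y} = 5$, so the numerator is $(-1)(-2) + 0 + (1)(2) = 4$, the denominator is $1 + 0 + 1 = 2$, and the slope is $4 / 2 = 2$. The factor $1/n$ cancels, which is why using population rather than sample statistics does not change the result.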
Syntax
regr_slope(y, x)
Arguments
| Argument | Type | Description |
|---|---|---|
| left (y) | DoubleType | The dependent variable (y-values) for regression calculation |
| right (x) | DoubleType | The independent variable (x-values) for regression calculation |
Return Type
DoubleType - Returns a double-precision floating-point number representing the slope, or null if there are no non-null pairs or the variance of x is zero.
Supported Data Types
- Input: Both arguments must be DoubleType (implicit casting is supported via ImplicitCastInputTypes)
- Output: DoubleType
Algorithm
- Maintains internal state using CovPopulation for computing covariance between x and y variables
- Maintains internal state using VariancePop for computing population variance of x variable
- Only processes pairs where both x and y values are non-null
- Calculates slope as the ratio: covariance(x,y) / variance(x)
- Returns null when variance of x equals zero (vertical line case)
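The steps above can be sketched in plain Scala with no Spark dependency. This is a minimal illustration under stated assumptions, not Spark's implementation: `RegrSlopeSketch`, `State`, and its field names are hypothetical, `None` stands in for SQL NULL, and the running co-moment and squared-deviation sums stand in for the state that CovPopulation and VariancePop would hold.

```scala
object RegrSlopeSketch {
  // Running state: pair count, running means, co-moment of (x, y), and
  // sum of squared deviations of x. All names here are illustrative.
  final case class State(n: Long, xAvg: Double, yAvg: Double, ck: Double, m2: Double)

  val empty: State = State(0L, 0.0, 0.0, 0.0, 0.0)

  // Fold one (y, x) pair into the state; None models SQL NULL.
  def update(s: State, y: Option[Double], x: Option[Double]): State = (y, x) match {
    case (Some(yv), Some(xv)) =>
      val n = s.n + 1
      val dx = xv - s.xAvg          // deviation from the previous x mean
      val xAvg = s.xAvg + dx / n
      val yAvg = s.yAvg + (yv - s.yAvg) / n
      // Welford-style update: previous-mean deviation times new-mean deviation.
      State(n, xAvg, yAvg, s.ck + dx * (yv - yAvg), s.m2 + dx * (xv - xAvg))
    case _ => s // a null on either side leaves the state unchanged
  }

  // covariance / variance = (ck / n) / (m2 / n) = ck / m2.
  // No pairs, or zero variance of x, yields None (SQL NULL).
  def slope(s: State): Option[Double] =
    if (s.n == 0L || s.m2 == 0.0) None else Some(s.ck / s.m2)

  def regrSlope(pairs: Seq[(Option[Double], Option[Double])]): Option[Double] =
    slope(pairs.foldLeft(empty) { case (s, (y, x)) => update(s, y, x) })
}
```

For example, `RegrSlopeSketch.regrSlope(Seq((Some(3.0), Some(1.0)), (Some(5.0), Some(2.0)), (Some(7.0), Some(3.0)), (Some(100.0), None)))` evaluates to `Some(2.0)`: the last pair is skipped because x is missing, and the remaining points lie on y = 2x + 1.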
Partitioning Behavior
As a declarative aggregate function:
- Computes partial aggregates within each partition, then shuffles them (by grouping key, when one is present) for the final aggregation
- Does not preserve input partitioning, since the output is one row per group, or a single row when there is no GROUP BY
- Uses merge expressions to combine partial aggregates from different partitions
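To illustrate how partial aggregates from different partitions can be combined without revisiting the rows, here is a minimal sketch in plain Scala. It uses the standard pairwise formulas for merging means and central moments. `MergeSketch`, `Partial`, and `summarize` are hypothetical names, and the code is assumed to resemble, not reproduce, what Spark's merge expressions compute.

```scala
object MergeSketch {
  // Partial aggregate for one partition. Field names are illustrative.
  final case class Partial(n: Long, xAvg: Double, yAvg: Double, ck: Double, m2: Double)

  // Summarize one partition's non-null (y, x) pairs directly.
  def summarize(pairs: Seq[(Double, Double)]): Partial =
    if (pairs.isEmpty) Partial(0L, 0.0, 0.0, 0.0, 0.0)
    else {
      val n = pairs.size
      val yAvg = pairs.map(_._1).sum / n
      val xAvg = pairs.map(_._2).sum / n
      val ck = pairs.map { case (y, x) => (x - xAvg) * (y - yAvg) }.sum
      val m2 = pairs.map { case (_, x) => (x - xAvg) * (x - xAvg) }.sum
      Partial(n.toLong, xAvg, yAvg, ck, m2)
    }

  // Combine two partials using only their summary statistics.
  def merge(a: Partial, b: Partial): Partial =
    if (a.n == 0L) b
    else if (b.n == 0L) a
    else {
      val n = a.n + b.n
      val dx = b.xAvg - a.xAvg
      val dy = b.yAvg - a.yAvg
      val w = a.n.toDouble * b.n / n // weight of the between-partition term
      Partial(
        n,
        a.xAvg + dx * b.n / n,
        a.yAvg + dy * b.n / n,
        a.ck + b.ck + dx * dy * w,
        a.m2 + b.m2 + dx * dx * w
      )
    }

  // Final evaluation: co-moment over squared deviations of x.
  def slope(p: Partial): Option[Double] =
    if (p.n == 0L || p.m2 == 0.0) None else Some(p.ck / p.m2)
}
```

With two partitions holding the (y, x) pairs `Seq((3.0, 1.0), (5.0, 2.0))` and `Seq((7.0, 3.0), (9.0, 4.0))`, `slope(merge(summarize(p1), summarize(p2)))` evaluates to `Some(2.0)`, the same slope a single pass over all four rows would give.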
Edge Cases
- Null handling: Ignores pairs where either x or y is null; the aggregation buffer is left unchanged for such pairs
- Zero variance: Returns null when the variance of x is zero (all x values are identical)
- Empty input: Returns null when no valid (non-null) pairs are available
- Numerical stability: Inherits numerical behavior from underlying CovPopulation and VariancePop implementations
Code Generation
RegrSlope is a DeclarativeAggregate expression: its computation is expressed through other Catalyst expressions rather than custom imperative code, which is what allows it to take part in Catalyst's code generation (Tungsten).
Examples
```sql
-- Calculate regression slope for sales vs advertising spend
SELECT REGR_SLOPE(sales, advertising_spend) AS slope
FROM sales_data;

-- Group by region to get slope per region
SELECT region, REGR_SLOPE(sales, advertising_spend) AS slope
FROM sales_data
GROUP BY region;
```
```scala
// DataFrame API usage; assumes df is an existing DataFrame with
// sales, advertising_spend, and region columns
import org.apache.spark.sql.functions._

df.agg(expr("regr_slope(sales, advertising_spend)").alias("slope"))

// With grouping
df.groupBy("region")
  .agg(expr("regr_slope(sales, advertising_spend)").alias("slope"))
```
See Also
- CovPopulation - Used internally for covariance calculation
- VariancePop - Used internally for variance calculation
- Other regression functions: regr_intercept, regr_r2, regr_count
- Linear regression statistical functions