RegrAvgX¶

Overview¶

RegrAvgX is an aggregate function that computes the average of the independent variable (x-values) in linear regression analysis. It calculates the mean of the second argument (right expression) for all rows where both the dependent and independent variables are not null.

Syntax¶

REGR_AVGX(y, x)

// DataFrame API usage
df.agg(expr("regr_avgx(y_column, x_column)"))

Arguments¶

Argument	Type	Description
y	NumericType	The dependent variable (used only for null filtering)
x	NumericType	The independent variable whose average will be calculated

Return Type¶

Returns a numeric type corresponding to the average of the independent variable values, or null if no valid pairs exist.

Supported Data Types¶

Numeric types (IntegerType, LongType, FloatType, DoubleType, DecimalType)
Both arguments must be numeric types due to inputTypes: Seq(NumericType, NumericType)

Algorithm¶

Filters out rows where either the left (y) or right (x) expression is null using And(IsNotNull(left), IsNotNull(right))
For remaining valid pairs, extracts only the right (x) values using conditional logic
Applies the Average aggregate function to compute the mean of the x-values
Returns null if no valid pairs exist after filtering

Partitioning Behavior¶

How this expression affects partitioning:

Does not preserve partitioning as it's an aggregate function that reduces multiple rows to a single value
Requires shuffle operation when used in distributed computing to collect all valid x-values for averaging

Edge Cases¶

Null handling: Rows where either y or x is null are completely excluded from the calculation
Empty input: Returns null when no rows remain after null filtering
All nulls: Returns null if all input rows contain at least one null value
Single valid pair: Returns the x-value itself as the average

Code Generation¶

This expression uses RuntimeReplaceableAggregate, which means it's replaced at runtime with a combination of Average, If, And, and IsNotNull expressions. Code generation support depends on the underlying replacement expressions, which typically support Tungsten code generation.

Examples¶

-- Calculate average of x-values in linear regression
SELECT REGR_AVGX(sales, advertising_spend) FROM quarterly_data;

-- With grouping
SELECT region, REGR_AVGX(revenue, marketing_cost) 
FROM sales_data 
GROUP BY region;

// DataFrame API usage
import org.apache.spark.sql.functions._

df.agg(expr("regr_avgx(dependent_var, independent_var)").alias("avg_x"))

// With grouping
df.groupBy("category")
  .agg(expr("regr_avgx(y_col, x_col)").alias("average_x_value"))