RegrSXX¶

Overview¶

The RegrSXX expression computes the sum of squares of the independent variable in linear regression analysis. It is an aggregate function that calculates the sum of squared deviations of the X values from their mean, which is used as a building block for regression statistics.

Syntax¶

REGR_SXX(y, x)

// DataFrame API usage
df.agg(expr("regr_sxx(y_column, x_column)"))

Arguments¶

Argument	Type	Description
left (y)	DoubleType	The dependent variable (Y values) - used for null filtering
right (x)	DoubleType	The independent variable (X values) - the variable for which sum of squares is calculated

Return Type¶

DoubleType - Returns a double precision floating point number representing the sum of squares of the X values.

Supported Data Types¶

Numeric types that can be implicitly cast to DoubleType
Both arguments must be convertible to double precision floating point numbers

Algorithm¶

Implements RuntimeReplaceableAggregate, delegating computation to RegrReplacement
Filters out rows where either left (Y) or right (X) values are null
Uses the right expression (X values) as input to the replacement aggregate
Returns null if any input pair contains null values
Calculates the sum of squared deviations using the replacement implementation

Partitioning Behavior¶

As an aggregate function:

Requires data shuffling for global aggregation across partitions
Does not preserve existing partitioning schemes
Partial aggregation may occur within partitions before final shuffle

Edge Cases¶

Null handling: Returns null if any row contains null values in either Y or X columns
Empty input: Behavior depends on the underlying RegrReplacement implementation
Single value: May return zero if only one non-null pair exists
Numeric precision: Subject to double precision floating point limitations and potential precision loss

Code Generation¶

This expression uses RuntimeReplaceableAggregate, which means:

The actual computation is delegated to the replacement expression
Code generation support depends on the underlying RegrReplacement implementation
No direct Tungsten code generation in this class itself

Examples¶

-- Calculate sum of squares for regression analysis
SELECT REGR_SXX(sales, advertising_spend) as sxx
FROM sales_data;

-- Use in combination with other regression functions
SELECT 
    REGR_SXX(y, x) as sum_squares_x,
    REGR_SXY(y, x) as sum_products,
    REGR_SYY(y, x) as sum_squares_y
FROM regression_table;

// DataFrame API usage
import org.apache.spark.sql.functions._

val df = spark.table("sales_data")
val result = df.agg(
  expr("regr_sxx(sales, advertising_spend)").alias("sum_squares_x")
)

// Multiple regression statistics
val stats = df.agg(
  expr("regr_sxx(y, x)").alias("sxx"),
  expr("regr_sxy(y, x)").alias("sxy"),
  expr("regr_syy(y, x)").alias("syy")
)