CheckConstraint¶
Overview¶
CheckConstraint is a Catalyst expression that represents a CHECK constraint in Spark SQL. It enforces data quality rules and business logic by validating that each row of a table satisfies a specified boolean condition, and it is typically declared in table definitions to preserve data integrity.
Syntax¶
-- SQL constraint definition
ALTER TABLE table_name ADD CONSTRAINT constraint_name CHECK (condition)
-- In CREATE TABLE statements
CREATE TABLE table_name (
  column1 datatype,
  column2 datatype,
  CONSTRAINT constraint_name CHECK (condition)
)
Arguments¶
| Argument | Type | Description |
|---|---|---|
| name | String | The name identifier for the constraint |
| condition | Expression | The boolean expression that defines the constraint logic |
| enforcementType | EnforcementType | Specifies how the constraint is enforced (e.g., strict, warning) |
Return Type¶
Returns BooleanType: true if the constraint is satisfied, false if it is violated.
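For instance, a condition over literals can be evaluated directly (a minimal sketch using Catalyst's eval; no columns are referenced, so an empty InternalRow suffices):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GreaterThan, Literal}

// 25 > 18 evaluates to true, so the constraint is satisfied
val satisfied = GreaterThan(Literal(25), Literal(18)).eval(InternalRow.empty)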
Supported Data Types¶
Supports all Spark SQL data types as input since the constraint condition can operate on:
- Numeric types (IntegerType, LongType, DoubleType, DecimalType, etc.)
- String types (StringType)
- Date and timestamp types (DateType, TimestampType)
- Boolean types (BooleanType)
- Complex types (ArrayType, MapType, StructType)
- Any combination of the above in complex expressions (see the example after this list)
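As an illustration of conditions over complex types, the sketch below assumes a SparkSession in scope as spark (as in spark-shell); the table and column names are hypothetical:

// CHECK conditions can traverse struct fields and inspect array sizes
spark.sql("""
  CREATE TABLE orders (
    id INT,
    address STRUCT<city: STRING, zip: STRING>,
    items ARRAY<STRING>,
    CONSTRAINT has_items CHECK (size(items) > 0),
    CONSTRAINT has_city CHECK (address.city IS NOT NULL)
  )
""")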
Algorithm¶
The CheckConstraint evaluation follows these steps (a sketch of the core loop follows the list):
- Parse and compile the constraint condition into a Catalyst expression tree
- Evaluate the condition expression against each input row
- Return a boolean result indicating whether the constraint is satisfied
- Handle null values according to three-valued logic (a null condition result typically passes)
- Cache compiled expressions for performance on repeated evaluation
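A minimal sketch of that loop, assuming a hypothetical validate helper (not part of Spark's API) over an already-compiled condition:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Expression

// Hypothetical helper illustrating steps 2-4: evaluate the condition against
// each row and treat a null result as satisfied (three-valued logic)
def validate(condition: Expression, rows: Iterator[InternalRow]): Boolean =
  rows.forall { row =>
    condition.eval(row) match {
      case null       => true  // unknown (null) passes the CHECK
      case b: Boolean => b
    }
  }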
Partitioning Behavior¶
CheckConstraint has the following partitioning characteristics (see the sketch after this list):
- Preserves existing partitioning as it operates row-by-row
- Does not require shuffle operations since validation is local to each row
- Can be applied per-partition independently
- Does not affect partition pruning or partition-wise operations
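Because validation is row-local, the physical plan gains a Filter but no Exchange. A quick check, assuming a DataFrame df with an age column:

import org.apache.spark.sql.functions.col

// The explain output shows a Filter node and no Exchange (shuffle),
// so the input partitioning is preserved
val checked = df.filter(col("age") >= 18 && col("age") <= 65)
checked.explain()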
Edge Cases¶
- Null handling: A constraint condition that evaluates to null is typically treated as satisfied, following the SQL standard (illustrated in the sketch below)
- Empty input: Returns true for empty datasets (vacuous truth)
- Complex expressions: Supports nested expressions and function calls within constraint conditions
- Runtime failures: Expression evaluation errors may cause constraint violation
- Type coercion: Automatic type casting follows standard Spark SQL coercion rules
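The null-handling rule can be observed directly; in this sketch a literal null stands in for a nullable column value:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GreaterThan, Literal}
import org.apache.spark.sql.types.IntegerType

// NULL > 0 evaluates to null (unknown) under three-valued logic, not false
val result = GreaterThan(Literal(null, IntegerType), Literal(0)).eval(InternalRow.empty)
// A CHECK constraint treats the unknown result as satisfied
val passes = (result == null) || (result == true)  // true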
Code Generation¶
CheckConstraint supports Tungsten code generation for optimal performance. The constraint condition expressions are compiled to Java bytecode when possible, falling back to interpreted mode for complex expressions that cannot be code-generated.
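To inspect what was actually compiled, Spark's debug helpers can print the generated Java source for each codegen subtree (a sketch; a DataFrame df with an age column is assumed):

import org.apache.spark.sql.execution.debug._
import org.apache.spark.sql.functions.col

// Prints the whole-stage-generated Java code; subtrees that fell back to
// interpreted evaluation do not appear in the codegen output
df.filter(col("age") >= 18).debugCodegen()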
Examples¶
-- Example SQL usage
CREATE TABLE employees (
  id INT,
  age INT,
  salary DECIMAL(10,2),
  CONSTRAINT age_check CHECK (age >= 18 AND age <= 65),
  CONSTRAINT salary_check CHECK (salary > 0)
);
-- Adding constraint to existing table
ALTER TABLE products
ADD CONSTRAINT price_positive CHECK (price > 0);
// Example DataFrame API usage; the constructor below mirrors the Arguments
// table above, and the exact signature may vary across Spark versions
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions._

val ageConstraint = CheckConstraint(
  name = "age_validation",
  condition = And(
    GreaterThanOrEqual(UnresolvedAttribute("age"), Literal(18)),
    LessThanOrEqual(UnresolvedAttribute("age"), Literal(65))
  ),
  enforcementType = EnforcementType.STRICT
)

// Apply the constraint's condition as a DataFrame filter; wrap the raw
// Catalyst expression in a Column (constructor available in classic Spark)
df.filter(new Column(ageConstraint.condition))
See Also¶
- ConstraintCharacteristic - Defines constraint metadata and properties
- Expression - Base trait for all Catalyst expressions
- BoundReference - References to bound columns in constraint expressions
- And, Or - Logical expressions commonly used in constraint conditions