Skip to content

ArraysOverlap

Overview

The ArraysOverlap expression determines whether two arrays have any elements in common. It returns true if at least one element exists in both arrays, false if no common elements are found, and null if either array contains null elements but no overlap is detected.

Syntax

arrays_overlap(array1, array2)
// DataFrame API
arrays_overlap(col("array1"), col("array2"))

Arguments

Argument Type Description
left ArrayType The first array to compare
right ArrayType The second array to compare

Return Type

BooleanType - Returns true, false, or null

Supported Data Types

Supports arrays of any element type, including:

  • Numeric types (IntegerType, LongType, DoubleType, etc.)
  • String types
  • Date and timestamp types
  • Complex types (StructType, ArrayType, MapType)
  • All primitive and reference types that support ordering or equality comparison

Algorithm

  • Optimizes performance by identifying the smaller and larger arrays
  • For data types with proper equals() implementation, uses a fast HashSet-based approach
  • For complex data types, falls back to a brute-force nested loop with ordering comparison
  • Short-circuits on first match found, returning true immediately
  • Tracks null presence throughout evaluation for proper three-valued logic

Partitioning Behavior

  • Preserves input partitioning as it operates element-wise on co-located data
  • Does not require shuffle operations
  • Can be pushed down in query optimization
  • Maintains data locality for efficient distributed processing

Edge Cases

Null Handling:

  • Returns null if any element in either array is null and no overlap is found
  • Follows SQL three-valued logic for null propagation
  • Expression is null-intolerant, meaning null inputs produce null outputs

Empty Arrays:

  • Returns false when comparing empty arrays
  • Returns false when one array is empty regardless of the other array's contents

Special Cases:

  • Arrays with different element types are cast to a common type if possible
  • Requires elements to have ordering capability for comparison
  • Uses TypeUtils.typeWithProperEquals() to determine evaluation strategy

Code Generation

Supports Tungsten code generation for optimal performance:

  • Generates specialized code paths for fast (HashSet) vs brute-force evaluation
  • Produces efficient nested loops with early termination conditions
  • Implements null-safe element access with runtime checks
  • Falls back to interpreted mode for unsupported data types

Examples

-- Basic usage
SELECT arrays_overlap(array(1, 2, 3), array(3, 4, 5)); -- true
SELECT arrays_overlap(array(1, 2), array(3, 4)); -- false

-- With string arrays
SELECT arrays_overlap(array('a', 'b'), array('b', 'c')); -- true

-- Null handling
SELECT arrays_overlap(array(1, null, 3), array(4, 5)); -- null
SELECT arrays_overlap(array(1, null, 3), array(1, 4)); -- true
// DataFrame API usage
import org.apache.spark.sql.functions._

df.select(arrays_overlap(col("tags1"), col("tags2")))

// With literal arrays
df.select(arrays_overlap(array(lit(1), lit(2)), col("numbers")))

See Also

  • array_intersect - Returns the intersection of two arrays
  • array_union - Returns the union of two arrays
  • array_except - Returns elements in first array but not in second
  • array_contains - Checks if array contains a specific element