Skip to content

RLike

Overview

RLike is a regular expression matching predicate function that determines whether a string matches a given regular expression pattern. It performs pattern matching using Java's Pattern class and returns true if the pattern is found anywhere within the input string.

Syntax

RLIKE(string_expr, pattern_expr)
// DataFrame API
col("string_column").rlike("pattern")

Arguments

Argument Type Description
left Expression The input string expression to be matched against the pattern
right Expression The regular expression pattern as a string expression

Return Type

Boolean - returns true if the pattern matches, false otherwise, or null if either input is null.

Supported Data Types

  • Input string: String/UTF8String types
  • Pattern: String/UTF8String types
  • Output: BooleanType

Algorithm

  • Compiles the right-hand expression into a Java Pattern object with collation-specific regex flags
  • Uses Pattern.matcher() to create a Matcher for the input string
  • Calls find(0) to search for the pattern starting from position 0 in the string
  • Returns true if any occurrence of the pattern is found anywhere in the string
  • Handles null values by returning null if either operand is null

Partitioning Behavior

How this expression affects partitioning:

  • Preserves existing partitioning as it operates row-by-row without requiring data movement
  • Does not require shuffle operations
  • Can be pushed down to individual partitions for parallel execution

Edge Cases

  • Null handling: Returns null if either the input string or pattern is null
  • Empty string behavior: Empty string ("") will match patterns that can match zero characters
  • Invalid regex patterns: Will throw PatternSyntaxException during compilation
  • Case sensitivity: Follows collation-specific regex flags for case matching rules
  • Unicode handling: Supports Unicode characters through UTF8String processing

Code Generation

This expression supports full Tungsten code generation with optimizations:

  • When the pattern is foldable (constant), it pre-compiles the Pattern object as a mutable state
  • For non-constant patterns, generates code to compile the pattern at runtime
  • Uses nullSafeCodeGen for dynamic pattern scenarios to handle null safety efficiently

Examples

-- Match strings containing digits
SELECT name FROM users WHERE RLIKE(name, '[0-9]+');

-- Match email patterns
SELECT email FROM contacts WHERE RLIKE(email, '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$');

-- Case-sensitive pattern matching
SELECT product FROM inventory WHERE RLIKE(product, 'iPhone.*Pro');
// DataFrame API examples
import org.apache.spark.sql.functions._

// Match strings with specific pattern
df.filter(col("description").rlike("\\d{3}-\\d{3}-\\d{4}"))

// Find records with alphanumeric codes
df.where($"code".rlike("[A-Z]{2}[0-9]{4}"))

See Also

  • Like - for simple pattern matching with wildcards
  • RegExpReplace - for pattern-based string replacement
  • RegExpExtract - for extracting substrings using regular expressions
  • StringRegexExpression - the parent class for regex-based string operations