StringInstr¶
Overview¶
The StringInstr expression finds the position of the first occurrence of a substring within a string. It returns a 1-based index where the substring begins, or 0 if the substring is not found.
Syntax¶
Arguments¶
| Argument | Type | Description |
|---|---|---|
| str | Expression (String) | The input string to search within |
| substr | Expression (String) | The substring to search for |
Return Type¶
IntegerType - Returns an integer representing the 1-based position of the substring.
Supported Data Types¶
- Input:
StringTypeNonCSAICollationwith trim collation support for both arguments - The expression supports collation-aware string matching based on the collation ID of the left operand
- Both string arguments must be compatible string types
Algorithm¶
- Extracts the collation ID from the left operand's StringType
- Uses
CollationSupport.StringInstr.exec()to perform collation-aware substring search - The internal search returns a 0-based index which is incremented by 1 for SQL compatibility
- Handles UTF8String operations for efficient string processing
- Preserves collation semantics throughout the evaluation process
Partitioning Behavior¶
This expression does not affect partitioning behavior:
- Preserves existing partitioning as it operates on individual rows
- Does not require data shuffling
- Can be executed locally on each partition
Edge Cases¶
- Null handling: Expression is null intolerant - returns null if either input is null
- Empty string behavior: Returns 1 when searching for empty string in any non-null string
- Case sensitivity: Behavior depends on the collation settings of the input string type
- Not found: Returns 0 when substring is not found in the target string
- Collation mismatch: Uses the collation ID from the left operand for comparison logic
Code Generation¶
This expression supports Tungsten code generation through the doGenCode method:
- Generates optimized bytecode using
defineCodeGen - Leverages
CollationSupport.StringInstr.genCode()for collation-aware code generation - Adds 1 to the generated result to maintain 1-based indexing
- Falls back to interpreted mode if code generation is disabled
Examples¶
-- Find position of substring
SELECT INSTR('hello world', 'world'); -- Returns 7
-- Substring not found
SELECT INSTR('hello', 'xyz'); -- Returns 0
-- Case sensitivity depends on collation
SELECT INSTR('Hello', 'hello'); -- Result varies by collation
// DataFrame API usage
import org.apache.spark.sql.functions._
df.select(col("text").instr("pattern").as("position"))
// With string literals
df.select(instr(col("message"), lit("error")).as("error_pos"))
See Also¶
SubstringIndex- Extract substring based on delimiter and occurrenceStringLocate- Similar functionality with different parameter orderingStringSplit- Split strings into arrays based on patternsRegExpExtract- Pattern-based string extraction with regular expressions