VectorCosineSimilarity¶
Overview¶
The VectorCosineSimilarity expression computes the cosine similarity between two float arrays (vectors). This expression is implemented as a RuntimeReplaceable that delegates to the VectorFunctionImplUtils.vectorCosineSimilarity method at runtime.
Syntax¶
Arguments¶
| Argument | Type | Description |
|---|---|---|
| left | array<float> |
The first vector (array of float values) |
| right | array<float> |
The second vector (array of float values) |
Return Type¶
float - Returns a float value representing the cosine similarity between the two input vectors.
Supported Data Types¶
- Input:
array<float>only - Both left and right expressions must be arrays of float type
- No automatic type coercion is performed
Algorithm¶
- Validates that both inputs are arrays of float type during analysis phase
- Delegates actual computation to
VectorFunctionImplUtils.vectorCosineSimilarityat runtime - Uses
StaticInvoketo call the external implementation method - Passes the expression's pretty name as an additional parameter for error reporting
- Type checking occurs before code generation, ensuring type safety
Partitioning Behavior¶
This expression preserves partitioning behavior:
- Does not require data shuffling
- Can be computed independently on each partition
- Row-level operation that doesn't affect data distribution
Edge Cases¶
- Type mismatch: Throws
DataTypeMismatcherror withUNEXPECTED_INPUT_TYPEsubclass if inputs are notarray<float> - Null handling: Null behavior is handled by the underlying
VectorFunctionImplUtilsimplementation - Array length mismatch: Error handling depends on the runtime implementation
- Empty arrays: Behavior determined by the runtime utility method
- Zero vectors: May result in undefined cosine similarity (division by zero in magnitude calculation)
Code Generation¶
This expression uses runtime replacement pattern:
- Does not generate inline code during Catalyst code generation
- Uses
StaticInvoketo call pre-compiled Java method - Falls back to interpreted execution through method invocation
- The actual computation is handled by
VectorFunctionImplUtils.vectorCosineSimilarity
Examples¶
-- Calculate cosine similarity between two feature vectors
SELECT vector_cosine_similarity(
array(1.0, 2.0, 3.0),
array(4.0, 5.0, 6.0)
) AS similarity;
-- Use with table columns
SELECT id, vector_cosine_similarity(features1, features2) AS sim
FROM vectors_table;
// DataFrame API usage
import org.apache.spark.sql.functions._
df.select(
expr("vector_cosine_similarity(vector_col1, vector_col2)").alias("similarity")
)
// With literal arrays
df.select(
expr("vector_cosine_similarity(array(1.0f, 2.0f), array(3.0f, 4.0f))")
)
See Also¶
- Other vector functions in the
vector_funcsgroup ArrayTypeexpressions for array manipulationStaticInvokefor understanding runtime replacement patternsVectorFunctionImplUtilsfor the actual implementation details