TryParseUrl

Overview

TryParseUrl is a Spark Catalyst expression that safely extracts components from URL strings without throwing exceptions on malformed URLs. It serves as a runtime-replaceable wrapper around ParseUrl with error handling, returning null when URL parsing fails instead of raising exceptions.
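
As a quick illustration of that behavior from the DataFrame API, here is a minimal sketch; the local SparkSession setup and the url column are assumptions made for this example, not part of the function itself:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("TryParseUrlDemo").getOrCreate()
import spark.implicits._

// One well-formed URL, one malformed string, and a null input.
val urls = Seq("https://spark.apache.org/docs", "not a url ://", null).toDF("url")

// Rows that cannot be parsed produce NULL instead of failing the query.
urls.selectExpr("url", "try_parse_url(url, 'HOST') AS host").show(truncate = false)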

Syntax

try_parse_url(url_string, part_to_extract)
try_parse_url(url_string, part_to_extract, key)

Arguments

Argument         Type    Description
url_string       String  The URL string to parse.
part_to_extract  String  The URL component to extract: HOST, PATH, QUERY, REF, PROTOCOL, FILE, AUTHORITY, or USERINFO.
key              String  Optional. The query parameter key to look up when part_to_extract is 'QUERY'.
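
For orientation, the sketch below (reusing the SparkSession from the Overview sketch) shows what each part_to_extract value yields for one sample URL; the URL and the expected values in the comments are illustrative assumptions:

// Extract every supported component from a single sample URL.
val sample = "https://user:pw@spark.apache.org:443/docs/index.html?lang=en#section"

spark.sql(s"""
  SELECT
    try_parse_url('$sample', 'PROTOCOL')  AS protocol,   -- 'https'
    try_parse_url('$sample', 'AUTHORITY') AS authority,  -- 'user:pw@spark.apache.org:443'
    try_parse_url('$sample', 'USERINFO')  AS userinfo,   -- 'user:pw'
    try_parse_url('$sample', 'HOST')      AS host,       -- 'spark.apache.org'
    try_parse_url('$sample', 'PATH')      AS path,       -- '/docs/index.html'
    try_parse_url('$sample', 'FILE')      AS file,       -- '/docs/index.html?lang=en'
    try_parse_url('$sample', 'QUERY')     AS query,      -- 'lang=en'
    try_parse_url('$sample', 'REF')       AS ref         -- 'section'
""").show(truncate = false)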

Return Type

String - Returns the extracted URL component as a string, or null if parsing fails or the requested component is not present.

Supported Data Types

  • Input: String expressions for URL and component parameters
  • Output: String type representing the extracted URL component

Algorithm

  • Delegates to the underlying ParseUrl expression with failOnError = false
  • Parses the input URL string using standard URL parsing logic
  • Extracts the requested component based on the part_to_extract parameter
  • Returns null instead of throwing exceptions when URL parsing fails
  • Handles query parameter extraction when a key is specified
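
The last point is worth a concrete sketch: with two arguments, QUERY returns the whole query string, while a third argument selects a single parameter's value (the sample URL and the expected results in the comments are assumptions):

spark.sql("""
  SELECT
    try_parse_url('https://example.com/search?q=spark&lang=en', 'QUERY')            AS full_query,   -- 'q=spark&lang=en'
    try_parse_url('https://example.com/search?q=spark&lang=en', 'QUERY', 'lang')    AS lang_value,   -- 'en'
    try_parse_url('https://example.com/search?q=spark&lang=en', 'QUERY', 'missing') AS absent_value  -- NULL
""").show(truncate = false)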

Partitioning Behavior

  • Preserves partitioning: Yes, this is a deterministic per-row transformation
  • Requires shuffle: No, operates independently on each row
  • Evaluated entirely within each partition; no data movement is required

Edge Cases

  • Null handling: Returns null when any input argument is null
  • Empty input: Returns null for empty or whitespace-only URL strings
  • Malformed URLs: Returns null instead of throwing exceptions (key difference from parse_url)
  • Invalid part names: Returns null for unrecognized part_to_extract values
  • Missing query keys: Returns null when the specified query parameter doesn't exist
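
A short sketch exercising a few of these cases, with expected results in the comments (the literal inputs are assumptions):

spark.sql("""
  SELECT
    try_parse_url(CAST(NULL AS STRING), 'HOST')            AS null_input,    -- NULL
    try_parse_url('', 'HOST')                              AS empty_input,   -- NULL
    try_parse_url('https://spark.apache.org/docs', 'FOO')  AS invalid_part   -- NULL
""").show(truncate = false)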

Code Generation

This expression extends RuntimeReplaceable, meaning it is replaced by its underlying ParseUrl implementation during query planning, before execution. Code generation support therefore depends on the underlying ParseUrl expression's codegen capabilities.
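
One way to observe the replacement is to inspect the query plan; this sketch reuses the SparkSession and implicits from the Overview sketch, and the column name url_column is an assumption:

import org.apache.spark.sql.functions.expr

val demo = Seq("https://spark.apache.org/docs").toDF("url_column")

// explain(true) prints the analyzed and optimized plans; once the wrapper has been
// replaced, the underlying parse_url expression should be visible there.
demo.select(expr("try_parse_url(url_column, 'HOST')")).explain(true)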

Examples

-- Extract hostname from URL
SELECT try_parse_url('https://spark.apache.org/docs', 'HOST');
-- Result: 'spark.apache.org'

-- Handle malformed URL gracefully
SELECT try_parse_url('not-a-valid-url', 'HOST');
-- Result: null

-- Extract query parameter
SELECT try_parse_url('https://example.com?param1=value1&param2=value2', 'QUERY', 'param1');
-- Result: 'value1'

// DataFrame API usage
import org.apache.spark.sql.functions._

df.select(expr("try_parse_url(url_column, 'HOST')").as("hostname"))

// Using with when() for null handling
df.select(
  when(expr("try_parse_url(url_column, 'HOST')").isNotNull, 
       expr("try_parse_url(url_column, 'HOST')"))
    .otherwise("unknown_host")
    .as("safe_hostname")
)
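
Because the expression already returns null on failure, the same fallback can be written more concisely with coalesce (url_column and unknown_host are carried over from the example above):

df.select(
  coalesce(expr("try_parse_url(url_column, 'HOST')"), lit("unknown_host"))
    .as("safe_hostname")
)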

See Also

  • ParseUrl - The strict version that throws exceptions on invalid URLs
  • regexp_extract - For custom URL parsing with regular expressions
  • URL manipulation functions in the url_funcs group