XmlToStructs¶

Overview¶

XmlToStructs is a Spark Catalyst expression that parses XML strings and converts them into Spark SQL struct data types. This function enables structured querying of XML data by transforming XML content into nested struct columns that can be accessed using standard SQL operations.

Syntax¶

xml_to_struct(xml_string, schema)

// DataFrame API usage
import org.apache.spark.sql.functions._
df.select(xml_to_struct($"xml_column", schema))

Arguments¶

Argument	Type	Description
xml_string	StringType	The XML string to be parsed into a struct
schema	StructType	The target schema defining the structure of the output

Return Type¶

Returns a StructType that matches the provided schema parameter, containing the parsed XML data as nested fields.

Supported Data Types¶

Input: StringType (XML string data)
Output: StructType with nested fields of various types including:
Primitive types (StringType, IntegerType, etc.)
ArrayType for XML arrays
Nested StructType for complex XML elements

Algorithm¶

Parses the input XML string using XML parsing libraries
Maps XML elements and attributes to corresponding struct fields based on the provided schema
Handles nested XML structures by creating nested struct types
Converts XML arrays (repeated elements) into Spark ArrayType columns
Applies type coercion to match the target schema data types

Partitioning Behavior¶

This expression preserves partitioning behavior:

Does not require data shuffle as it operates on individual rows
Maintains existing partition boundaries since it's a row-level transformation
Can be executed in parallel across partitions without cross-partition dependencies

Edge Cases¶

Null input: Returns null when the input XML string is null
Empty XML: Returns a struct with null fields when XML is empty or whitespace-only
Malformed XML: Throws parsing exception or returns null based on configuration
Schema mismatch: Fields not present in XML are set to null in the resulting struct
Extra XML fields: XML elements not defined in schema are ignored during parsing

Code Generation¶

This expression likely falls back to interpreted mode due to the complexity of XML parsing operations, which are difficult to express in generated Java code. The XML parsing logic requires external libraries that are better suited for interpreted execution.

Examples¶

-- Parse XML string into struct
SELECT xml_to_struct(
  '{"teacher":"Alice","student":[{"name":"Bob","rank":1},{"name":"Charlie","rank":2}]}',
  'teacher STRING, student ARRAY<STRUCT<name: STRING, rank: INT>>'
) AS parsed_data;

// DataFrame API usage
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("teacher", StringType),
  StructField("student", ArrayType(StructType(Seq(
    StructField("name", StringType),
    StructField("rank", IntegerType)
  ))))
))

df.select(xml_to_struct($"xml_column", lit(schema.json)))