JsonTuple¶

Overview¶

JsonTuple is a generator expression that extracts multiple fields from a JSON string and returns them as separate columns in a tuple format. It takes a JSON string as the first argument followed by one or more field path expressions, returning a row with the extracted values in the order specified.

Syntax¶

json_tuple(json_string, field1, field2, ...)

Arguments¶

Argument	Type	Description
json_string	String	The JSON string to parse and extract fields from
field1, field2, ...	String	One or more field names/paths to extract from the JSON

Return Type¶

Returns a StructType with columns named c0, c1, c2, etc., where each column corresponds to the extracted field values. All output columns have nullable string type matching the input JSON expression's data type.

Supported Data Types¶

Input arguments must be string types with collation support (specifically StringTypeWithCollation with trimming support). All children expressions must accept string data types.

Algorithm¶

Parses the input JSON string using an internal JsonTupleEvaluator
Evaluates field name expressions, with foldable (constant) field names pre-computed for optimization
Extracts the requested fields from the parsed JSON structure
Returns an iterable of InternalRow objects containing the extracted values
Missing or invalid fields result in null values in the corresponding output columns

Partitioning Behavior¶

As a Generator expression, JsonTuple:

Does not preserve partitioning since it can produce multiple output rows
May require shuffle operations when used in contexts that need specific partitioning
Each input row can generate one or more output rows

Edge Cases¶

Always returns at least one row (nullable = false), even for invalid JSON
Null JSON input results in null values for all extracted fields
Invalid or missing field paths return null for those specific columns
Foldable (constant) field expressions are pre-evaluated for performance optimization
Requires minimum 2 arguments (JSON string + at least one field name), throws QueryCompilationErrors.wrongNumArgsError otherwise

Code Generation¶

This expression supports Tungsten code generation. It generates optimized code that:

Pre-allocates UTF8String arrays for field names
Reuses the JsonTupleEvaluator instance via context reference
Handles null field name evaluation inline
Avoids interpreted evaluation overhead for better performance

Examples¶

-- Extract name and age from JSON
SELECT json_tuple('{"name":"John", "age":30, "city":"NYC"}', 'name', 'age') AS (name, age);
-- Returns: John, 30

-- Extract nested fields
SELECT json_tuple('{"user":{"id":123,"email":"test@example.com"}}', 'user.id', 'user.email') AS (id, email);

// DataFrame API usage
import org.apache.spark.sql.functions._
df.select(json_tuple(col("json_col"), lit("field1"), lit("field2")).as(Seq("c0", "c1")))