JsonToStructs¶
Overview¶
The JsonToStructs expression parses JSON strings into Spark SQL structured data types (StructType, ArrayType, or MapType). This expression is exposed through the from_json SQL function and enables deserialization of JSON data for structured processing within Spark.
Syntax¶
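// DataFrame API (Scala)
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._ // enables the $"col" syntax; assumes a SparkSession named spark

val schema = new StructType()
  .add("name", StringType)
  .add("age", IntegerType)

// df is an existing DataFrame with a string column named json_column
df.select(from_json($"json_column", schema))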
# PySpark API
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])
df.select(from_json("json_column", schema))
Arguments¶
| Argument | Type | Description |
|---|---|---|
| jsonStr | StringType | Column containing JSON strings to parse |
| schema | DataType, StructType, ArrayType, MapType, or DDL String | Target schema defining the structure of the parsed output |
| options | Map[String, String] (optional) | JSON parsing options (same as JSON datasource options) |
Return Type¶
Returns a column whose type matches the provided schema:
- StructType schema returns a struct column
- ArrayType schema returns an array column
- MapType schema returns a map column
Supported Data Types¶
Input:
- StringType containing valid JSON
Schema Types:
- StructType - for JSON objects
- ArrayType - for JSON arrays
- MapType with StringType keys, for example MapType(StringType, StringType) - for JSON objects with arbitrary keys
- Nested combinations of the above
Field Types within Schema:
- All primitive types (StringType, IntegerType, LongType, DoubleType, BooleanType, etc.)
- DateType and TimestampType (with configurable formatting)
- DecimalType
- BinaryType (Base64 encoded)
- Nested StructType, ArrayType, MapType
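As a small illustration of the Base64 note for BinaryType, the following plain-Python sketch (standard library only, not Spark) decodes a Base64 string taken from a JSON object. The field name payload and the sample value are made up for the example.

```python
import base64
import json

# Hypothetical record: a binary field arrives as Base64 text inside the JSON.
record = json.loads('{"payload": "aGVsbG8="}')

# Decoding the text recovers the raw bytes that a binary column would hold.
decoded = base64.b64decode(record["payload"])
print(decoded)  # b'hello'
```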
Options¶
| Option | Default | Description |
|---|---|---|
| mode | PERMISSIVE | Parse mode: PERMISSIVE, FAILFAST |
| columnNameOfCorruptRecord | value of spark.sql.columnNameOfCorruptRecord | Name of the schema field that receives the raw malformed string (PERMISSIVE mode only) |
| dateFormat | yyyy-MM-dd | Format for parsing date fields |
| timestampFormat | yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] | Format for parsing timestamp fields |
| primitivesAsString | false | Infer all primitive values as StringType (relevant to schema inference only) |
| prefersDecimal | false | Infer floating-point values as DecimalType (relevant to schema inference only) |
| allowComments | false | Allow Java/C++ style comments in JSON |
| allowUnquotedFieldNames | false | Allow unquoted JSON field names |
| allowSingleQuotes | true | Allow single quotes instead of double quotes |
| allowNumericLeadingZeros | false | Allow leading zeros in numbers |
| allowBackslashEscapingAnyCharacter | false | Allow backslash escaping any character |
| allowUnquotedControlChars | false | Allow unquoted control characters |
| multiLine | false | Parse multi-line JSON records |
| encoding | UTF-8 | Character encoding |
| locale | en-US | Locale for parsing |
Algorithm¶
- Parses the input JSON string using Jackson parser
- Validates JSON structure against the provided schema
- Converts JSON values to corresponding Spark SQL types
- Applies configured options for date/timestamp formatting and error handling
- Returns structured data matching the schema
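The steps above can be sketched in plain Python. This is a simplified model of the described behavior, not Spark's Jackson-based implementation: SCHEMA, convert, and from_json_model are hypothetical names, Python types stand in for Spark SQL types, and only flat structs with string and integer fields are handled.

```python
import json

# Hypothetical stand-in for a StructType: field name -> expected Python type.
SCHEMA = {"name": str, "age": int}


def convert(value, expected_type):
    """Return value if it has the expected type, otherwise None (a null field)."""
    # bool is a subclass of int in Python, so reject it explicitly for int fields.
    if isinstance(value, bool) and expected_type is int:
        return None
    return value if isinstance(value, expected_type) else None


def from_json_model(json_str, schema):
    """Model of the listed steps for a flat struct schema."""
    if json_str is None:
        return None  # null input returns null output
    parsed = json.loads(json_str)  # parse the JSON text
    if not isinstance(parsed, dict):
        # Simplification of this sketch: anything but a JSON object yields None.
        return None
    # Keep only schema fields, convert each value, and fill gaps with None.
    return {field: convert(parsed.get(field), typ) for field, typ in schema.items()}


print(from_json_model('{"name": "Alice", "age": 30, "extra": true}', SCHEMA))
# {'name': 'Alice', 'age': 30}
print(from_json_model('{"name": "Bob"}', SCHEMA))
# {'name': 'Bob', 'age': None}
print(from_json_model('{"name": "Eve", "age": "thirty"}', SCHEMA))
# {'name': 'Eve', 'age': None}
```

The three calls exercise an extra field that is dropped, a missing field that becomes null, and a type mismatch that becomes null.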
Partitioning Behavior¶
- Preserves existing partitioning as it operates row-wise
- Does not require shuffle operations
- Can be executed locally on each partition independently
- Timezone-aware: uses session timezone for timestamp parsing
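To see why the session time zone (spark.sql.session.timeZone) matters for timestamps that carry no offset, the sketch below uses only Python's standard library, not Spark, to interpret one timestamp string in two zones. The sample value and the two fixed offsets are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone

raw = "2024-01-15T10:00:00"  # no UTC offset in the JSON value

# Two hypothetical session time zones, modelled as fixed UTC offsets.
utc = timezone.utc
utc_minus_8 = timezone(timedelta(hours=-8))

naive = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S")
as_utc = naive.replace(tzinfo=utc)
as_minus_8 = naive.replace(tzinfo=utc_minus_8)

# The same text denotes two instants that are eight hours apart.
print(as_minus_8 - as_utc)  # 8:00:00
```

Two sessions configured with different time zones can therefore produce different timestamp values from identical JSON input.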
Edge Cases¶
Null Handling¶
- Null input returns null output
- Missing fields in JSON are set to null in the output struct
- Empty string input returns null (not an empty struct)
Parse Mode Behavior (Spark 3.0+)¶
PERMISSIVE Mode (default):
- Malformed JSON records return a row in which parseable fields are populated and unparseable fields are null
- If the schema includes the field named by columnNameOfCorruptRecord, the raw malformed JSON string is stored in that field
- Without such a field, malformed records are silently converted to rows with null fields
FAILFAST Mode:
- Throws an exception as soon as a malformed JSON record is encountered
- Does not support columnNameOfCorruptRecord option
- Useful for strict validation requirements
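The difference between the two modes can be modelled in plain Python. This is a sketch of the behavior described above, not Spark code: parse_with_mode is a hypothetical helper, the schema is reduced to a list of field names, and a record that fails to parse is treated as entirely unparseable.

```python
import json


def parse_with_mode(json_str, fields, mode="PERMISSIVE", corrupt_column=None):
    """Model PERMISSIVE and FAILFAST handling for a flat struct schema."""
    try:
        parsed = json.loads(json_str)
        if not isinstance(parsed, dict):
            raise ValueError("expected a JSON object")
    except ValueError as err:  # json.JSONDecodeError subclasses ValueError
        if mode == "FAILFAST":
            raise RuntimeError(f"malformed record: {json_str}") from err
        # PERMISSIVE: every data field becomes None ...
        row = {name: None for name in fields}
        # ... and the raw text is kept only if the schema declares the corrupt column.
        if corrupt_column in fields:
            row[corrupt_column] = json_str
        return row
    # Well-formed record: the corrupt column, if declared, stays None.
    return {name: None if name == corrupt_column else parsed.get(name) for name in fields}


bad = '{"a": invalid}'
print(parse_with_mode(bad, ["a", "_corrupt_record"], corrupt_column="_corrupt_record"))
# {'a': None, '_corrupt_record': '{"a": invalid}'}
print(parse_with_mode(bad, ["a"]))
# {'a': None}
```

Calling the same helper with mode="FAILFAST" on the malformed string raises instead of returning a row, which is the property that makes the mode suitable for strict validation.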
Spark 3.0 Breaking Changes¶
- In Spark 2.4 and below: a malformed JSON record returned a row with all fields null
- In Spark 3.0+: a malformed JSON record returns a row with successfully parsed fields populated and only the unparseable fields null
- JSON arrays cannot be parsed as StructType (use ArrayType instead)
Schema Mismatch¶
- Extra fields in JSON not in schema are ignored
- Field names are case-sensitive and must match exactly
- Type mismatches (e.g., string value for integer field) result in null for that field
Special Values¶
- JSON null maps to Spark SQL null
- Empty JSON object {} maps to a struct with all null fields
- Empty JSON array [] maps to an empty array
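These three mappings can be mirrored with Python's json module, using None as a stand-in for SQL null. The snippet is only an analogy in plain Python, and the field names are made up for the example.

```python
import json

# JSON null decodes to None, the stand-in here for SQL null.
null_field = json.loads('{"name": null}')["name"]
print(null_field)  # None

# An empty object projected onto hypothetical struct fields gives all-None fields.
fields = ["name", "age"]
empty_obj = json.loads("{}")
empty_struct = {f: empty_obj.get(f) for f in fields}
print(empty_struct)  # {'name': None, 'age': None}

# An empty array stays an empty list rather than becoming None.
empty_array = json.loads("[]")
print(empty_array)  # []
```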
Examples¶
-- Basic struct parsing
SELECT from_json('{"name":"Alice","age":30}', 'name STRING, age INT') AS parsed;
-- Result: {Alice, 30}
-- Parsing to MapType
SELECT from_json('{"key1":"value1","key2":"value2"}', 'MAP<STRING,STRING>') AS parsed;
-- Result: {key1 -> value1, key2 -> value2}
-- Array of structs
SELECT from_json('[{"a":1},{"a":2}]', 'ARRAY<STRUCT<a:INT>>') AS parsed;
-- Result: [{1}, {2}]
-- With options
SELECT from_json(
  '{"date":"2024-01-15"}',
  'date DATE',
  map('dateFormat', 'yyyy-MM-dd')
) AS parsed;
-- Result: {2024-01-15}
-- FAILFAST mode
SELECT from_json('{"a":1}', 'a INT', map('mode', 'FAILFAST')) AS parsed;
-- Handling corrupt records (PERMISSIVE mode)
SELECT from_json(
  '{"a": invalid}',
  'a INT, _corrupt_record STRING',
  map('columnNameOfCorruptRecord', '_corrupt_record')
) AS parsed;
-- Result: {null, {"a": invalid}}
// DataFrame API examples (Scala)
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._ // enables the $"col" syntax; assumes a SparkSession named spark

// Define schema
val schema = new StructType()
  .add("name", StringType)
  .add("age", IntegerType)
  .add("address", new StructType()
    .add("city", StringType)
    .add("zip", StringType))

// Parse JSON column
df.select(from_json($"json_col", schema).as("parsed"))

// Expand to multiple columns
df.select(from_json($"json_col", schema).as("parsed"))
  .select($"parsed.*")

// With options
val options = Map(
  "mode" -> "PERMISSIVE",
  "timestampFormat" -> "yyyy-MM-dd HH:mm:ss"
)
df.select(from_json($"json_col", schema, options))

// Parse to MapType for dynamic keys
df.select(from_json($"json_col", MapType(StringType, StringType)))
# PySpark examples
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, ArrayType
)

# Define schema
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("scores", ArrayType(IntegerType()))
])

# Parse and expand
df.select(from_json(col("json_col"), schema).alias("parsed")) \
    .select("parsed.*")

# With DDL string schema
df.select(from_json("json_col", "name STRING, age INT"))
See Also¶
- StructsToJson - Convert structs back to JSON strings (to_json)
- GetJsonObject - Extract single values from JSON using path expressions
- JsonTuple - Extract multiple values from JSON into columns
- SchemaOfJson - Infer schema from JSON string sample