SchemaOfJson

Overview

The SchemaOfJson expression (SQL function schema_of_json) analyzes a constant JSON string and returns the inferred schema as a DDL-formatted data type string. It parses the JSON structure and infers a Spark SQL data type for every field, including nested arrays and structs.

Syntax

SELECT schema_of_json(json_string [, options_map])

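// DataFrame API (Scala)
import org.apache.spark.sql.functions._
// The JSON argument must be a constant, so pass it with lit(...) rather than col(...)
df.select(schema_of_json(lit("""{"a": 1}""")))
// Options are passed as a java.util.Map[String, String]
df.select(schema_of_json(lit("""{"a": 01}"""), java.util.Collections.singletonMap("allowNumericLeadingZeros", "true")))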

Arguments

Argument	Type	Description
json_string	STRING	The JSON string to analyze; must be a constant (foldable) expression such as a string literal
options_map	MAP<STRING, STRING>	Optional JSON parsing options, such as `allowNumericLeadingZeros` or `allowBackslashEscapingAnyCharacter`

Return Type

Returns a STRING containing the inferred Spark SQL data type in DDL format (e.g., "STRUCT<age: BIGINT, name: STRING>").
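Because the result is a DDL-style schema, a common use is to pass it straight to from_json as the schema argument. The following is a minimal sketch, assuming Spark 3.x and a local SparkSession. The `SchemaOfJsonUsage` object, the `parseWithInferredSchema` helper, the `json_col` column, and the sample record are all hypothetical names and data made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, schema_of_json}

object SchemaOfJsonUsage {
  // Hypothetical helper: infer a schema from one sample record, then parse all rows with it.
  def parseWithInferredSchema(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Made-up sample data standing in for a real JSON column.
    val events = Seq("""{"name":"Alice","age":25}""").toDF("json_col")

    // schema_of_json needs a constant string, so pull one record to the driver first.
    val sample = events.as[String].first()
    val schemaCol = schema_of_json(sample)

    // from_json accepts the schema as a Column, so the inferred schema can be reused directly.
    events
      .select(from_json(col("json_col"), schemaCol).as("data"))
      .select("data.name", "data.age")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; spark-shell and notebooks already provide `spark`.
    val spark = SparkSession.builder().master("local[1]").appName("schema-of-json-sketch").getOrCreate()
    parseWithInferredSchema(spark).show(false)
    spark.stop()
  }
}
```

Inferring from a single sample only works when that record is representative. Fields that are missing from the sample are absent from the schema and are dropped by from_json.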

Supported Data Types

Input: STRING (JSON formatted)
Inferred types: typically BOOLEAN, BIGINT, DOUBLE, STRING, ARRAY, and STRUCT. JSON objects are inferred as STRUCT, not MAP.

Algorithm

Parses the input JSON string with the Jackson JSON parser
Traverses the JSON structure recursively to identify all fields and their types
Applies type inference rules (integers become BIGINT, fractional numbers become DOUBLE, strings remain STRING)
Handles nested structures by creating STRUCT types for objects and ARRAY types for arrays
Applies any specified parsing options during the inference process
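
The steps above can be sketched in plain Scala without any Spark dependency. This is an illustration, not Spark's implementation: the `Json` and `Inferred` types and the `InferSketch` object are hypothetical stand-ins for Jackson's token stream and Spark's `DataType` classes, and the widening rules in `merge` are a simplified assumption (equal types stay as they are, BIGINT and DOUBLE widen to DOUBLE, anything else falls back to STRING). Parsing options are left out.

```scala
// Hypothetical JSON value model standing in for Jackson's parsed tokens.
sealed trait Json
case object JNull extends Json
final case class JBool(value: Boolean) extends Json
final case class JInt(value: Long) extends Json
final case class JFloat(value: Double) extends Json
final case class JStr(value: String) extends Json
final case class JArr(items: List[Json]) extends Json
final case class JObj(fields: List[(String, Json)]) extends Json

// Hypothetical type model; Spark's real types live in org.apache.spark.sql.types.
sealed trait Inferred
case object NullT extends Inferred
case object BooleanT extends Inferred
case object BigIntT extends Inferred
case object DoubleT extends Inferred
case object StringT extends Inferred
final case class ArrayT(element: Inferred) extends Inferred
final case class StructT(fields: List[(String, Inferred)]) extends Inferred

object InferSketch {
  // Widen two types to one that can hold both (simplified assumption, see lead-in).
  def merge(a: Inferred, b: Inferred): Inferred = (a, b) match {
    case (x, y) if x == y => x
    case (NullT, other) => other
    case (other, NullT) => other
    case (BigIntT, DoubleT) | (DoubleT, BigIntT) => DoubleT
    case (ArrayT(x), ArrayT(y)) => ArrayT(merge(x, y))
    case (StructT(xs), StructT(ys)) =>
      // Union of field names; a field missing on one side merges with NullT.
      val names = (xs.map(_._1) ++ ys.map(_._1)).distinct.sorted
      val (mx, my) = (xs.toMap, ys.toMap)
      StructT(names.map(n => n -> merge(mx.getOrElse(n, NullT), my.getOrElse(n, NullT))))
    case _ => StringT
  }

  // Recursive traversal: objects become structs, arrays merge their element types.
  def infer(json: Json): Inferred = json match {
    case JNull => NullT
    case JBool(_) => BooleanT
    case JInt(_) => BigIntT
    case JFloat(_) => DoubleT
    case JStr(_) => StringT
    case JArr(items) => ArrayT(items.map(infer).foldLeft(NullT: Inferred)(merge))
    case JObj(fields) => StructT(fields.map { case (k, v) => k -> infer(v) }.sortBy(_._1))
  }

  // Render as a DDL-style string; types that stayed unresolved default to STRING.
  def render(t: Inferred): String = t match {
    case NullT | StringT => "STRING"
    case BooleanT => "BOOLEAN"
    case BigIntT => "BIGINT"
    case DoubleT => "DOUBLE"
    case ArrayT(e) => s"ARRAY<${render(e)}>"
    case StructT(fs) =>
      fs.map { case (n, ft) => s"$n: ${render(ft)}" }.mkString("STRUCT<", ", ", ">")
  }

  def main(args: Array[String]): Unit = {
    // Models {"users":[{"name":"John","scores":[95,87]}]}
    val doc = JObj(List("users" -> JArr(List(
      JObj(List("name" -> JStr("John"), "scores" -> JArr(List(JInt(95), JInt(87)))))))))
    println(render(infer(doc)))
  }
}
```

Keeping `merge` separate from `infer` makes the widening policy easy to inspect on its own: an empty array stays at `NullT` until rendering, which is why it prints with a STRING element type in this sketch.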

Partitioning Behavior

This expression is evaluated row by row and needs no data movement:

Preserves existing partitioning
Does not require shuffle operations
Can be executed independently on each partition

Edge Cases

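In Spark 3.x the input must be a non-null constant: a NULL literal or a non-foldable column reference fails analysis instead of returning NULL
Invalid JSON strings may throw parsing exceptions
Empty JSON objects return "STRUCT<>"
Empty JSON arrays return "ARRAY<STRING>" (STRING is the default element type)
Mixed-type arrays are inferred as the most general common type (for example, BIGINT and DOUBLE widen to DOUBLE; otherwise the fallback is STRING)
Numeric values with leading zeros require the `allowNumericLeadingZeros` option to parse correctly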
Very deeply nested JSON may hit recursion limits

Code Generation

This expression can take part in Whole-Stage Code Generation, but only as a wrapper: no specialized code is generated for the JSON parsing itself, which falls back to interpreted evaluation using the Jackson parser for schema inference.

Examples

-- Basic schema inference
SELECT schema_of_json('{"name":"John", "age":30}');
-- Result: STRUCT<age: BIGINT, name: STRING>

-- Array schema inference  
SELECT schema_of_json('[{"col":01}]', map('allowNumericLeadingZeros', 'true'));
-- Result: ARRAY<STRUCT<col: BIGINT>>

-- Complex nested structure
SELECT schema_of_json('{"users":[{"name":"John","scores":[95,87]}]}');
-- Result: STRUCT<users: ARRAY<STRUCT<name: STRING, scores: ARRAY<BIGINT>>>>

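// DataFrame API usage (assumes an active SparkSession named `spark`)
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("""{"name":"Alice","age":25}""").toDF("json_col")

// schema_of_json needs a constant JSON string, so col("json_col") is rejected at analysis time.
// Pull one sample value to the driver and pass it as a literal instead.
val sample = df.as[String].first()
df.select(schema_of_json(lit(sample))).show(false)

// With options: the Scala overload takes a java.util.Map[String, String], not a Column
val options = java.util.Collections.singletonMap("allowNumericLeadingZeros", "true")
df.select(schema_of_json(lit(sample), options)).show(false)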