SchemaOfCsv¶
Overview¶
The SchemaOfCsv expression analyzes a CSV string and returns its inferred schema as a string in DDL format. It parses the provided CSV data to determine the column structure and data types, returning a schema description that can be used in subsequent CSV processing operations such as `from_csv`.
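As a quick illustration, here is a minimal sketch (assuming a running `SparkSession` named `spark`) that infers a schema from a literal CSV sample:

```scala
import org.apache.spark.sql.functions.{lit, schema_of_csv}

// Infer the schema of a one-record CSV sample; the result is a DDL-format string
val inferred = spark.range(1)
  .select(schema_of_csv(lit("1,abc,2.5")).as("schema"))
  .head()
  .getString(0)

println(inferred) // e.g. STRUCT<_c0: INT, _c1: STRING, _c2: DOUBLE>
```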
Syntax¶
// DataFrame API
import org.apache.spark.sql.functions._
import scala.jdk.CollectionConverters._

// The input must be a foldable string, so pass a literal rather than a column reference
df.select(schema_of_csv(lit("1,abc,2.5")))
// The options overload takes a java.util.Map in Scala
df.select(schema_of_csv(lit("1,abc,2.5"), Map("delimiter" -> ",").asJava))
Arguments¶
| Argument | Type | Description |
|---|---|---|
| `child` | Expression (String) | The CSV string to analyze for schema inference; must be a foldable (constant) expression |
| `options` | Map[String, String] | Optional CSV parsing options (delimiter, quote character, etc.) |
Return Type¶
Returns a string in DDL format describing the inferred schema of the CSV data, for example `STRUCT<_c0: INT, _c1: STRING>`. The string encodes the column names and the data types inferred from the CSV content.
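Because the result is a plain string, it can be passed wherever a DDL schema string is accepted. A sketch (assuming a `SparkSession` named `spark` and Scala 2.13 for `scala.jdk.CollectionConverters`):

```scala
import org.apache.spark.sql.functions.{from_csv, lit, schema_of_csv}
import scala.jdk.CollectionConverters._

// The inferred DDL string can be fed directly to from_csv as its schema argument
val schema = schema_of_csv(lit("1,abc,2.5"))
val parsed = spark.range(1)
  .select(from_csv(lit("1,abc,2.5"), schema, Map.empty[String, String].asJava).as("parsed"))

parsed.printSchema() // expect a struct column with fields _c0, _c1, _c2
```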
Supported Data Types¶
- Input: Only `StringType` is supported for the CSV input
- Output: A DDL-format string describing a struct with inferred column types (INT, STRING, DOUBLE, etc.); see the sketch below
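A small sketch of how different sample values map to inferred types (assumes a `SparkSession` named `spark`; exact results depend on the Spark version and CSV options):

```scala
import org.apache.spark.sql.functions.{lit, schema_of_csv}

// Print the inferred DDL schema for a few sample inputs
Seq("1,2", "1.5,abc", "1,abc,2.5").foreach { sample =>
  val ddl = spark.range(1).select(schema_of_csv(lit(sample))).head().getString(0)
  println(s"$sample -> $ddl") // e.g. 1,2 -> STRUCT<_c0: INT, _c1: INT>
}
```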
Algorithm¶
- Evaluates the input CSV string during analysis (requires a foldable expression)
- Uses `SchemaOfCsvEvaluator` to parse and analyze the CSV content
- Infers a data type for each column based on the values in the CSV string
- Constructs a struct schema with appropriate column names and types
- Returns the schema, rendered in DDL format, as the expression's result (see the sketch below)
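Since the input is foldable, the optimizer can constant-fold the whole call. One way to observe this (a sketch, assuming a `SparkSession` named `spark`; plan output varies by version) is to inspect the plan:

```scala
import org.apache.spark.sql.functions.{lit, schema_of_csv}

// The optimized plan typically shows the folded literal result
// rather than a call to the function itself
spark.range(1)
  .select(schema_of_csv(lit("1,abc")).as("schema"))
  .explain(true)
```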
Partitioning Behavior¶
- Preserves partitioning: Yes, this is a deterministic expression that doesn't require data movement
- Requires shuffle: No, the expression operates on individual rows independently
- Can be used in partition pruning scenarios since it's foldable
Edge Cases¶
- Null handling: Raises a `DataTypeMismatch` error if the input is null; nulls are not permitted
- Empty input: Requires a non-empty, valid CSV string for schema inference
- Non-foldable input: Must be a constant/literal value; column references or other non-constant expressions fail analysis (see the sketch after this list)
- Invalid CSV format: May return a schema with default string types if parsing fails
- Type inference: Uses conservative type inference, defaulting to string for ambiguous values
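A sketch of the non-foldable case (assumes a `SparkSession` named `spark`; the column name `csv_col` is hypothetical). Passing a column reference instead of a literal is rejected during analysis:

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.functions.schema_of_csv
import spark.implicits._ // for toDF and $"..."

val df = Seq("1,abc").toDF("csv_col")

// A non-literal input fails at analysis time, not at runtime
try {
  df.select(schema_of_csv($"csv_col")).show()
} catch {
  case e: AnalysisException => println(s"Rejected as expected: ${e.getMessage}")
}
```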
Code Generation¶
This expression uses the RuntimeReplaceable pattern and does not support direct code generation. Instead, it:
- Replaces itself with an `Invoke` expression at planning time
- Uses the `SchemaOfCsvEvaluator` object for the actual evaluation
- Falls back to interpreted mode through the invoke mechanism
- Is effectively evaluated during analysis/planning, since the input must be foldable
Examples¶
-- Basic schema inference
SELECT schema_of_csv('1,abc,2.5');
-- Returns: STRUCT<_c0: INT, _c1: STRING, _c2: DOUBLE>
-- With custom options
SELECT schema_of_csv('1|abc', map('delimiter', '|'));
-- Returns: STRUCT<_c0: INT, _c1: STRING>
-- Using in CREATE TABLE
CREATE TABLE my_table
USING DELTA
AS SELECT from_csv('1,John,95.5', schema_of_csv('1,John,95.5'));
// DataFrame API usage
import org.apache.spark.sql.functions._
import scala.jdk.CollectionConverters._
import spark.implicits._ // for $"..."

// Basic schema inference
df.select(schema_of_csv(lit("1,abc,2.5")))

// With custom options (the Scala overload takes a java.util.Map)
val options = Map("delimiter" -> ";")
df.select(schema_of_csv(lit("1;abc"), options.asJava))

// Use the inferred schema for parsing
val csvData = "1,John,95.5"
val schema = schema_of_csv(lit(csvData))
df.select(from_csv($"csv_column", schema, Map.empty[String, String].asJava))
See Also¶
- `from_csv()` - Parses CSV strings using a provided schema
- `to_csv()` - Converts struct columns to CSV strings
- `schema_of_json()` - Similar schema inference for JSON data