InputFileName¶

Overview¶

The InputFileName expression returns the name of the file currently being processed during query execution. This is a non-deterministic expression that provides access to the input file path through the InputFileBlockHolder mechanism, allowing users to identify which file contributed specific rows in their query results.

Syntax¶

INPUT_FILE_NAME()

// DataFrame API
import org.apache.spark.sql.functions._
df.select(input_file_name())

Arguments¶

Argument	Type	Description
None	-	This expression takes no arguments

Return Type¶

StringType - Returns a UTF-8 encoded string containing the input file path.

Supported Data Types¶

This expression does not operate on input data types as it takes no arguments. It generates string output regardless of the underlying data being processed.

Algorithm¶

Expression extends LeafExpression with no child expressions to evaluate
During execution, calls InputFileBlockHolder.getInputFilePath to retrieve the current file path
Returns the file path as a UTF8String without any transformation
Initialization per partition is a no-op (initializeInternal does nothing)
File path context is maintained externally by Spark's execution engine

Partitioning Behavior¶

Preserves partitioning: This expression does not affect data partitioning as it only adds metadata
No shuffle required: The expression is evaluated locally on each partition without requiring data movement
Each partition will return file paths for the files it processes locally

Edge Cases¶

Null handling: Expression is marked as non-nullable (nullable = false), so it will never return null values
Missing file context: If called outside of file-based data source context, behavior depends on InputFileBlockHolder state
Multiple files per partition: When a partition processes multiple files, the returned file name corresponds to the file being processed for each specific row
Non-file data sources: May return empty or undefined results when used with data sources that don't have file-based origins

Code Generation¶

This expression supports Tungsten code generation. The generated code directly calls InputFileBlockHolder.getInputFilePath() and sets isNull = false, avoiding the overhead of interpreted evaluation in tight loops.

Examples¶

-- Get file names along with data
SELECT input_file_name() as source_file, * 
FROM parquet.`/path/to/files/*.parquet`;

-- Group by source file to analyze data distribution
SELECT input_file_name() as file, count(*) as row_count
FROM delta.`/path/to/table`
GROUP BY input_file_name();

// DataFrame API usage
import org.apache.spark.sql.functions._

val df = spark.read.parquet("/path/to/files/*.parquet")
df.select(input_file_name().alias("source_file"), col("*")).show()

// Analyze data distribution across files
df.groupBy(input_file_name().alias("file"))
  .count()
  .show()