The Core Idea

Auto Loader is Databricks' built-in mechanism for incrementally ingesting new files from cloud storage into Delta Lake. It monitors a source location, detects new files as they arrive, and processes only the new ones — without reprocessing files it has already seen.

Think of it as a smart, stateful file watcher: it tracks exactly which files have been processed using a checkpoint, so even if your pipeline restarts, it picks up exactly where it left off.

Exam Mindset
Auto Loader questions test three things: what sources it supports, how it detects new files, and when to use it instead of COPY INTO. Get those three clear and you'll handle any scenario.

Valid Auto Loader Sources

Auto Loader reads files from cloud object storage. It supports all major cloud providers and a wide range of file formats:

Supported Cloud Storage Sources

- ☁️ AWS S3 — Amazon Simple Storage Service, the most common source in AWS-based Databricks deployments. Path: s3://bucket/path
- 🔷 Azure Data Lake Storage — ADLS Gen2, the standard source for Azure Databricks workloads. Path: abfss://container@account
- 🟡 Google Cloud Storage — GCS, supported for GCP-based Databricks deployments. Path: gs://bucket/path
- 📁 DBFS — Databricks File System, backed by cloud storage; can also be used as a source. Path: dbfs:/path
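
The same readStream call works against any of these; only the path scheme changes. A minimal sketch (the container and account names here are invented):

Python — source paths (illustrative)

# Only the URI scheme differs per cloud; these names are invented.
df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/checkpoints/schema")
  .load("abfss://landing@myaccount.dfs.core.windows.net/raw/"))
# equally: s3://my-bucket/raw/, gs://my-bucket/raw/, dbfs:/raw/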

Supported File Formats

| Format | Notes |
| --- | --- |
| JSON | Schema inference works well; supports nested structures |
| CSV | Schema inference available; headers can be used |
| Parquet | Schema embedded in file — inferred automatically |
| Avro | Schema embedded; efficient for streaming ingestion |
| ORC | Columnar format; schema embedded |
| Text / Binary | Raw ingestion of unstructured files (text and binaryFile formats) |

Note that Delta is not a cloudFiles format: to stream from an existing Delta table you use plain Structured Streaming (spark.readStream.table(...)), not Auto Loader. Exams use this as a distractor.
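
For CSV sources, format-specific reader options such as header sit alongside the cloudFiles options. A minimal sketch with invented paths:

Python — CSV ingestion (illustrative)

# cloudFiles.format selects the parser; standard reader options such as
# "header" still apply. Paths are invented.
df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("header", "true")
  .option("cloudFiles.schemaLocation", "/checkpoints/schema")
  .load("s3://my-bucket/raw/csv/"))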

How Auto Loader Detects New Files

Auto Loader has two file detection modes. The exam will test which to use and why:

1. Directory Listing Mode (default)

Auto Loader periodically lists all files in the source directory and compares them against what it has already processed. Simple to set up — no cloud configuration needed.

Use directory listing when
You have a smaller volume of files, your source directory is straightforward, or you want zero cloud setup. This is the default and works out of the box.

2. File Notification Mode (recommended for scale)

Auto Loader sets up cloud event notifications (e.g. S3 Event Notifications + SQS, or Azure Event Grid) to receive instant alerts when new files arrive. No polling — files are processed as soon as they land.

Use file notification when
You have millions of files, need lower latency, or want to avoid the cost of repeatedly listing large directories. Requires cloud infrastructure permissions to set up.

| | Directory Listing | File Notification |
| --- | --- | --- |
| Setup | None | Cloud event infra required |
| Scale | Small–medium file volumes | Millions of files |
| Latency | Polling interval | Near real-time |
| Cost | LIST API calls | Event notifications (cheaper at scale) |
| Config | cloudFiles.useNotifications=false | cloudFiles.useNotifications=true |
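
Switching modes is a single option. A minimal sketch, assuming the cluster has the cloud permissions to create the notification infrastructure (the bucket path is invented):

Python — file notification mode (illustrative)

# One option flips detection from directory listing to notifications;
# Auto Loader provisions the event infrastructure (e.g. SQS on AWS)
# if the cluster has permission to create it. Bucket path is invented.
df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.useNotifications", "true")
  .option("cloudFiles.schemaLocation", "/checkpoints/schema")
  .load("s3://my-bucket/raw/events/"))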

Schema Inference and Evolution

One of Auto Loader's most powerful features is automatic schema inference and schema evolution — handling changing source schemas without manual intervention.

Schema Inference

Auto Loader samples incoming files to infer the schema automatically. The inferred schema is stored in a schema location you specify, so it doesn't re-infer on every run.

Python — basic Auto Loader setup

# "cloudFiles" is the Auto Loader source format; the actual file format
# (json, csv, ...) goes in cloudFiles.format.
df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/checkpoints/schema")
  .load("s3://my-bucket/raw/events/"))

# The checkpoint records which files have been processed; availableNow
# processes all pending files, then stops.
(df.writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoints/events")
  .trigger(availableNow=True)
  .toTable("catalog.bronze.events"))

Key exam points — format and schema
The format is always "cloudFiles" — never "json" or "csv" directly; the actual file format goes in cloudFiles.format. The schemaLocation (where the inferred schema is stored) and checkpointLocation (where file-tracking state lives) are both required in production pipelines.
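
When inference guesses a type wrong, cloudFiles.schemaHints lets you pin individual columns while everything else is still inferred. A sketch with hypothetical column names:

Python — schema hints (illustrative)

# Hypothetical columns: pin the types inference tends to miss, and let
# Auto Loader infer the rest.
df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/checkpoints/schema")
  .option("cloudFiles.schemaHints", "event_ts timestamp, amount decimal(10,2)")
  .load("s3://my-bucket/raw/events/"))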

Schema Evolution

When new columns appear in incoming files, Auto Loader can handle them automatically with schema evolution. New columns are added to the target Delta table rather than causing a pipeline failure.

Python — enabling schema evolution

# inferColumnTypes asks Auto Loader to infer real types (ints, doubles)
# instead of defaulting everything to strings for JSON/CSV.
df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/checkpoints/schema")
  .option("cloudFiles.inferColumnTypes", "true")
  .load("s3://my-bucket/raw/events/"))

# mergeSchema lets the Delta writer add new columns to the target table
# instead of failing when the incoming schema grows.
(df.writeStream
  .option("mergeSchema", "true")
  .format("delta")
  .option("checkpointLocation", "/checkpoints/events")
  .toTable("catalog.bronze.events"))
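
Evolution behavior is tunable through cloudFiles.schemaEvolutionMode. A sketch of the rescue variant, where the table schema stays fixed and unexpected fields are captured in a _rescued_data column:

Python — rescue mode (illustrative)

# "rescue": the table schema never evolves; fields that do not match it
# are preserved as JSON in the _rescued_data column instead of failing.
df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/checkpoints/schema")
  .option("cloudFiles.schemaEvolutionMode", "rescue")
  .load("s3://my-bucket/raw/events/"))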

Auto Loader vs. COPY INTO

This is the most heavily tested decision for this topic. Both load files into Delta, but they suit different scenarios:

| | Auto Loader | COPY INTO |
| --- | --- | --- |
| Execution model | Structured Streaming | Batch (SQL command) |
| File tracking | Checkpoint-based — stateful | Metadata-based — idempotent |
| Scale | Millions of files | Thousands of files |
| Schema inference | Yes — automatic + evolution | Limited |
| Trigger | Continuous or triggered | Manual / scheduled |
| Best for | Continuous ingestion pipelines | One-time or scheduled batch loads |
| Syntax | PySpark Structured Streaming | SQL command |

Exam rule of thumb
Continuous / streaming / large-scale / schema evolution → Auto Loader.
One-time load / batch / small files / SQL-only environment → COPY INTO.
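
For contrast, the COPY INTO side of that decision, shown here via spark.sql with invented table and path names. Because COPY INTO is idempotent, re-running it skips files it has already loaded:

Python — COPY INTO for comparison (illustrative)

# COPY INTO is a batch SQL command; table and path names are invented.
# Re-running it is safe because already-loaded files are skipped.
spark.sql("""
  COPY INTO catalog.bronze.events_backfill
  FROM 's3://my-bucket/raw/historical/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true')
""")
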
🏗️ Real-world note

In production Bronze layer pipelines, Auto Loader is the standard choice for landing zone ingestion — files arrive continuously from upstream systems, schemas change over time, and volumes grow to millions of files. COPY INTO is typically used for one-time historical backfills or when a data team only has SQL access and needs a simple idempotent load command.

The Auto Loader Flow

☁️ Cloud Storage (new files land) → Auto Loader (cloudFiles format) → 📋 Checkpoint (tracks processed files) → 🥉 Delta Table (Bronze layer)

Exam-Style Practice Questions

Q1 What format string do you use with spark.readStream.format() to enable Auto Loader?
Q2 A pipeline ingests millions of JSON files from S3, arriving continuously. New columns appear occasionally. Which tool is most appropriate?
Q3 What is the key difference between Auto Loader's directory listing mode and file notification mode?
Q4 A data engineer needs to do a one-time load of 500 CSV files from ADLS into a Delta table. Which approach is most appropriate?
Q5 Which option enables Auto Loader to automatically add new columns to the target Delta table when they appear in incoming files?
Q6 Which of the following is NOT a valid Auto Loader source?

Common Exam Traps

These are the mistakes the exam is designed to catch:

- Writing .format("json") instead of .format("cloudFiles"); the file format belongs in cloudFiles.format.
- Omitting cloudFiles.schemaLocation or checkpointLocation from the setup.
- Picking Auto Loader for a one-time SQL batch load, or COPY INTO for continuous large-scale ingestion.
- Assuming file notification mode is the default; directory listing is, and notifications need cloud permissions to set up.
- Treating a Delta table or a database as a valid Auto Loader source; Auto Loader ingests files from cloud object storage only.

⚡ Quick Memory Trick
Auto Loader = cloudFiles format + checkpoint + schema location. Those three things together are the complete Auto Loader setup. If any one is missing in an exam code snippet, that's the bug.