The Core Idea
Auto Loader is Databricks' built-in mechanism for incrementally ingesting new files from cloud storage into Delta Lake. It monitors a source location, detects new files as they arrive, and processes only the new ones — without reprocessing files it has already seen.
Think of it as a smart, stateful file watcher: it tracks exactly which files have been processed using a checkpoint, so even if your pipeline restarts, it picks up exactly where it left off.
Valid Auto Loader Sources
Auto Loader reads files from cloud object storage. It supports all major cloud providers and a wide range of file formats:
Supported Cloud Storage Sources
| Source | Notes | Path scheme |
|---|---|---|
| AWS S3 | Amazon Simple Storage Service — the most common source in AWS-based Databricks deployments | s3://bucket/path |
| Azure Data Lake Storage | ADLS Gen2 — the standard source for Azure Databricks workloads | abfss://container@account |
| Google Cloud Storage | GCS — supported for GCP-based Databricks deployments | gs://bucket/path |
| DBFS | Databricks File System — backed by cloud storage, can also be used as a source | dbfs:/path |

Supported File Formats
| Format | Notes |
|---|---|
| JSON | Schema inference works well; supports nested structures |
| CSV | Schema inference available; headers can be used |
| Parquet | Schema embedded in file — inferred automatically |
| Avro | Schema embedded; efficient for streaming ingestion |
| ORC | Columnar format; schema embedded |
| Delta | Can read Delta tables as a streaming source |
| Text / Binary | Raw ingestion of unstructured files |
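As a quick illustration, here is a minimal sketch of pointing Auto Loader at one of these formats, CSV with a header row. The bucket path and schema location are placeholders, not values from this guide's pipeline:

```python
# Sketch: Auto Loader reading CSV files (illustrative paths).
# The "header" option is passed through to the underlying CSV reader.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")      # the real file format goes here
    .option("header", "true")                # treat the first row as column names
    .option("cloudFiles.schemaLocation", "/checkpoints/csv_schema")
    .load("s3://my-bucket/raw/csv/"))
```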
How Auto Loader Detects New Files
Auto Loader has two file detection modes. The exam will test which to use and why:
1. Directory Listing Mode (default)
Auto Loader periodically lists all files in the source directory and compares them against what it has already processed. Simple to set up — no cloud configuration needed.
2. File Notification Mode (recommended for scale)
Auto Loader sets up cloud event notifications (e.g. S3 Event Notifications + SQS, or Azure Event Grid) to receive instant alerts when new files arrive. No polling — files are processed as soon as they land.
| | Directory Listing | File Notification |
|---|---|---|
| Setup | None | Cloud event infra required |
| Scale | Small–medium file volumes | Millions of files |
| Latency | Polling interval | Near real-time |
| Cost | LIST API calls | Event notifications (cheaper at scale) |
| Config | cloudFiles.useNotifications=false | cloudFiles.useNotifications=true |
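In code, switching modes is a single option on the stream. A minimal sketch, assuming the workspace has the cloud permissions needed to set up the event infrastructure (e.g. an SQS queue on AWS):

```python
# Sketch: enabling file notification mode (illustrative paths).
# Requires cloud event infrastructure; without the right permissions,
# the notification resources cannot be provisioned.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")  # default "false" = directory listing
    .option("cloudFiles.schemaLocation", "/checkpoints/schema")
    .load("s3://my-bucket/raw/events/"))
```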
Schema Inference and Evolution
One of Auto Loader's most powerful features is automatic schema inference and schema evolution — handling changing source schemas without manual intervention.
Schema Inference
Auto Loader samples incoming files to infer the schema automatically. The inferred schema is stored in a schema location you specify, so it doesn't re-infer on every run.
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/checkpoints/schema")
.load("s3://my-bucket/raw/events/"))
(df.writeStream
.format("delta")
.option("checkpointLocation", "/checkpoints/events")
.trigger(availableNow=True)
.toTable("catalog.bronze.events"))
"cloudFiles" — not "json" or "csv" directly. The actual file format goes in cloudFiles.format. The schemaLocation and checkpointLocation are both required for production use.
Schema Evolution
When new columns appear in incoming files, Auto Loader can handle them automatically with schema evolution. New columns are added to the target Delta table rather than causing a pipeline failure.
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/checkpoints/schema")
.option("cloudFiles.inferColumnTypes", "true")
.load("s3://my-bucket/raw/events/"))
(df.writeStream
.option("mergeSchema", "true")
.format("delta")
.option("checkpointLocation", "/checkpoints/events")
.toTable("catalog.bronze.events"))
Auto Loader vs. COPY INTO
This is the most heavily tested decision for this topic on the exam. Both load files into Delta — but they suit different scenarios:
| | Auto Loader | COPY INTO |
|---|---|---|
| Execution model | Structured Streaming | Batch (SQL command) |
| File tracking | Checkpoint-based — stateful | Metadata-based — idempotent |
| Scale | Millions of files | Thousands of files |
| Schema inference | Yes — automatic + evolution | Limited |
| Trigger | Continuous or triggered | Manual / scheduled |
| Best for | Continuous ingestion pipelines | One-time or scheduled batch loads |
| Syntax | PySpark Structured Streaming | SQL command |
Rule of thumb: continuous ingestion / large file volumes / evolving schemas → Auto Loader. One-time load / batch / small files / SQL-only environment → COPY INTO.
In production Bronze layer pipelines, Auto Loader is the standard choice for landing zone ingestion — files arrive continuously from upstream systems, schemas change over time, and volumes grow to millions of files. COPY INTO is typically used for one-time historical backfills or when a data team only has SQL access and needs a simple idempotent load command.
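For contrast, a minimal sketch of the COPY INTO pattern, here wrapped in spark.sql so it sits alongside the PySpark examples (table and path names are illustrative). Re-running it is safe, since files already loaded into the target table are skipped:

```python
# Sketch: an idempotent batch load with COPY INTO (illustrative names).
# Already-ingested files are tracked in table metadata and skipped on re-run.
spark.sql("""
    COPY INTO catalog.bronze.events
    FROM 's3://my-bucket/raw/events/'
    FILEFORMAT = JSON
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
```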
The Auto Loader Flow
New files land in cloud storage → a streaming read with the cloudFiles format picks them up → the checkpoint tracks which files have been processed → records flow into the Bronze layer.
Common Exam Traps
These are the mistakes the exam is designed to catch:
- Using "json" as the format — Auto Loader always uses "cloudFiles". The file format goes in cloudFiles.format.
- Choosing Auto Loader for a one-time batch load — COPY INTO is simpler and more appropriate for that scenario.
- Forgetting that Auto Loader uses Structured Streaming — it is not a batch read, it is a streaming source.
- Assuming file notification mode works without cloud configuration — it requires setting up event notifications (SQS, Event Grid) on the storage account.
- Thinking schemaLocation alone enables schema evolution — you also need mergeSchema=true on the write stream.
- Confusing Auto Loader with Kafka — Auto Loader is for cloud object storage only, not message queues or streaming platforms.