The Core Idea
Auto Loader is Databricks' built-in mechanism for incrementally ingesting new files from cloud storage into Delta Lake. It monitors a source location, detects new files as they arrive, and processes only the new ones — without reprocessing files it has already seen.
Think of it as a smart, stateful file watcher: it tracks exactly which files have been processed using a checkpoint, so even if your pipeline restarts, it picks up exactly where it left off.
Valid Auto Loader Sources
Auto Loader reads files from cloud object storage. It supports all major cloud providers and a wide range of file formats:
Supported Cloud Storage Sources
| Source | Notes | Path scheme |
|---|---|---|
| AWS S3 | Amazon Simple Storage Service — the most common source in AWS-based Databricks deployments | s3://bucket/path |
| Azure Data Lake Storage | ADLS Gen2 — the standard source for Azure Databricks workloads | abfss://container@account |
| Google Cloud Storage | GCS — supported for GCP-based Databricks deployments | gs://bucket/path |
| DBFS | Databricks File System — backed by cloud storage, can also be used as a source | dbfs:/path |

Supported File Formats
| Format | Notes |
|---|---|
| JSON | Schema inference works well; supports nested structures |
| CSV | Schema inference available; headers can be used |
| Parquet | Schema embedded in file — inferred automatically |
| Avro | Schema embedded; efficient for streaming ingestion |
| ORC | Columnar format; schema embedded |
| Delta | Can read Delta tables as a streaming source |
| Text / Binary | Raw ingestion of unstructured files |
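As a quick illustration, here is a minimal sketch of pointing Auto Loader at one of these formats, CSV with a header row. The bucket path and schema location are placeholders, not values from this guide's pipeline:

```python
# Sketch: Auto Loader reading CSV files (illustrative paths).
# The "header" option is passed through to the underlying CSV reader.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")      # the real file format goes here
    .option("header", "true")                # treat the first row as column names
    .option("cloudFiles.schemaLocation", "/checkpoints/csv_schema")
    .load("s3://my-bucket/raw/csv/"))
```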
How Auto Loader Detects New Files
Auto Loader has two file detection modes. The exam will test which to use and why:
1. Directory Listing Mode (default)
Auto Loader periodically lists all files in the source directory and compares them against what it has already processed. Simple to set up — no cloud configuration needed.
2. File Notification Mode (recommended for scale)
Auto Loader sets up cloud event notifications (e.g. S3 Event Notifications + SQS, or Azure Event Grid) to receive instant alerts when new files arrive. No polling — files are processed as soon as they land.
| | Directory Listing | File Notification |
|---|---|---|
| Setup | None | Cloud event infra required |
| Scale | Small–medium file volumes | Millions of files |
| Latency | Polling interval | Near real-time |
| Cost | LIST API calls | Event notifications (cheaper at scale) |
| Config | cloudFiles.useNotifications=false | cloudFiles.useNotifications=true |
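In code, switching modes is a single option on the stream. A minimal sketch, assuming the workspace has the cloud permissions needed to set up the event infrastructure (e.g. an SQS queue on AWS):

```python
# Sketch: enabling file notification mode (illustrative paths).
# Requires cloud event infrastructure; without the right permissions,
# the notification resources cannot be provisioned.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")  # default "false" = directory listing
    .option("cloudFiles.schemaLocation", "/checkpoints/schema")
    .load("s3://my-bucket/raw/events/"))
```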
Schema Inference and Evolution
One of Auto Loader's most powerful features is automatic schema inference and schema evolution — handling changing source schemas without manual intervention.
Schema Inference
Auto Loader samples incoming files to infer the schema automatically. The inferred schema is stored in a schema location you specify, so it doesn't re-infer on every run.
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/checkpoints/schema")
.load("s3://my-bucket/raw/events/"))
(df.writeStream
.format("delta")
.option("checkpointLocation", "/checkpoints/events")
.trigger(availableNow=True)
.toTable("catalog.bronze.events"))
"cloudFiles" — not "json" or "csv" directly. The actual file format goes in cloudFiles.format. The schemaLocation and checkpointLocation are both required for production use.
Schema Evolution
When new columns appear in incoming files, Auto Loader can handle them automatically with schema evolution. New columns are added to the target Delta table rather than causing a pipeline failure.
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/checkpoints/schema")
.option("cloudFiles.inferColumnTypes", "true")
.load("s3://my-bucket/raw/events/"))
(df.writeStream
.option("mergeSchema", "true")
.format("delta")
.option("checkpointLocation", "/checkpoints/events")
.toTable("catalog.bronze.events"))
Auto Loader vs. COPY INTO
This is the most heavily tested decision for this topic on the exam. Both load files into Delta — but they suit different scenarios:
| | Auto Loader | COPY INTO |
|---|---|---|
| Execution model | Structured Streaming | Batch (SQL command) |
| File tracking | Checkpoint-based — stateful | Metadata-based — idempotent |
| Scale | Millions of files | Thousands of files |
| Schema inference | Yes — automatic + evolution | Limited |
| Trigger | Continuous or triggered | Manual / scheduled |
| Best for | Continuous ingestion pipelines | One-time or scheduled batch loads |
| Syntax | PySpark Structured Streaming | SQL command |
Rule of thumb: continuous ingestion / large file volumes / evolving schemas → Auto Loader. One-time load / batch / small files / SQL-only environment → COPY INTO.
In production Bronze layer pipelines, Auto Loader is the standard choice for landing zone ingestion — files arrive continuously from upstream systems, schemas change over time, and volumes grow to millions of files. COPY INTO is typically used for one-time historical backfills or when a data team only has SQL access and needs a simple idempotent load command.
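For contrast, a minimal sketch of the COPY INTO pattern, here wrapped in spark.sql so it sits alongside the PySpark examples (table and path names are illustrative). Re-running it is safe, since files already loaded into the target table are skipped:

```python
# Sketch: an idempotent batch load with COPY INTO (illustrative names).
# Already-ingested files are tracked in table metadata and skipped on re-run.
spark.sql("""
    COPY INTO catalog.bronze.events
    FROM 's3://my-bucket/raw/events/'
    FILEFORMAT = JSON
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
```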
The Auto Loader Flow
New files land in cloud storage → a streaming read with the cloudFiles format picks them up → the checkpoint tracks which files have been processed → records flow into the Bronze layer.
Common Exam Traps
These are the mistakes the exam is designed to catch:
- Using "json" as the format — Auto Loader always uses "cloudFiles". The file format goes in cloudFiles.format.
- Choosing Auto Loader for a one-time batch load — COPY INTO is simpler and more appropriate for that scenario.
- Forgetting that Auto Loader uses Structured Streaming — it is not a batch read, it is a streaming source.
- Assuming file notification mode works without cloud configuration — it requires setting up event notifications (SQS, Event Grid) on the storage account.
- Thinking schemaLocation alone enables schema evolution — you also need mergeSchema=true on the write stream.
- Confusing Auto Loader with Kafka — Auto Loader is for cloud object storage only, not message queues or streaming platforms.