The Core Idea

This topic is about the syntax of Auto Loader — not just knowing it exists, but being able to read a code snippet and identify what each option does, what is required vs. optional, and what happens when something is missing or misconfigured.

The exam will show you Auto Loader code and ask you to spot the bug, complete the missing option, or explain what a specific line does. This article covers every option you need to know cold.

Exam Mindset
Four things are tested on syntax: the cloudFiles format, the checkpoint location, schema hints, and the rescue data column. Know all four in detail — what they do, where they go in the code, and what breaks without them.

The Complete Auto Loader Template

This is the full production-ready pattern. Every exam question is a variation of this structure:

Python — complete Auto Loader read + write stream
# READ STREAM — source config
df = (spark.readStream
  .format("cloudFiles")                      # ① always "cloudFiles"
  .option("cloudFiles.format", "json")      # ② actual file format
  .option("cloudFiles.schemaLocation", "/ckpt/schema")  # ③ schema storage
  .option("cloudFiles.inferColumnTypes", "true")   # ④ infer exact types
  .schema(mySchema)                           # ⑤ optional explicit schema (skips inference)
  .load("s3://my-bucket/landing/events/"))

# WRITE STREAM — target config
(df.writeStream
  .format("delta")
  .option("checkpointLocation", "/ckpt/events")  # ⑥ required checkpoint
  .option("mergeSchema", "true")                # ⑦ schema evolution
  .trigger(availableNow=True)               # ⑧ batch trigger
  .toTable("catalog.bronze.events"))

Each numbered line is a potential exam question. The sections below walk through the most heavily tested ones.

① The cloudFiles Format

The outer format is always "cloudFiles" — this is what activates Auto Loader. The actual file format (JSON, CSV, Parquet, etc.) is specified separately inside cloudFiles.format.

Exam trap — two formats, two places
.format("cloudFiles") goes on readStream. .option("cloudFiles.format", "json") specifies the file type. If an exam question uses .format("json") directly on a streaming read, that is the bug — it will not activate Auto Loader.
| Option | Where | Required | Example value |
| --- | --- | --- | --- |
| format("cloudFiles") | readStream | Yes | Always "cloudFiles" |
| cloudFiles.format | .option() | Yes | "json", "csv", "parquet", "avro" |
| cloudFiles.schemaLocation | .option() | Yes* | Cloud storage path for schema |
| cloudFiles.inferColumnTypes | .option() | Optional | "true" — infers exact types vs. all strings |
| cloudFiles.useNotifications | .option() | Optional | "true" for event-based detection |

* Required when using schema inference. Can be omitted only if you provide a full explicit schema.
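
Because this trap hinges on where each format string goes, here is a minimal wrong-vs-right sketch (the bucket path and schema location are illustrative):

Python — two formats, two places
# ✗ Bug: a plain JSON streaming source; Auto Loader never activates
# (a plain file stream would also require an explicit .schema())
df_wrong = (spark.readStream
  .format("json")
  .load("s3://my-bucket/landing/events/"))

# ✓ Correct: the outer format is "cloudFiles"; the file type goes in cloudFiles.format
df_right = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/ckpt/schema")
  .load("s3://my-bucket/landing/events/"))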

② Checkpointing

The checkpoint is what makes Auto Loader stateful. It records exactly which files have been processed so the pipeline can resume exactly where it left off after a restart — with no duplicate processing and no missed files.

There are two separate checkpoint-related options, and the exam tests whether you know the difference:

| Option | Where it goes | What it stores |
| --- | --- | --- |
| cloudFiles.schemaLocation | readStream | The inferred schema — column names and types. Prevents re-inferring on every run. |
| checkpointLocation | writeStream | The streaming state — which files have been processed, offsets, progress metadata. |
Exam trap — missing checkpointLocation
If checkpointLocation is missing from the writeStream, the pipeline will throw an error: checkpointLocation must be specified. This is one of the most common "what's wrong with this code" exam questions.
Key rule — know which option stores what
The two options can point to the same base directory: Auto Loader stores the inferred schema in a _schemas subdirectory, so it does not collide with the streaming checkpoint files, and the official examples reuse one checkpoint path for both. Many teams still give each its own subdirectory to keep the layout easy to inspect. Either way, be clear about which option holds the schema and which holds the streaming state.
Python — checkpoint setup
# Shared base path: the schema is written to a _schemas subdirectory alongside the stream state
.option("cloudFiles.schemaLocation", "/checkpoints/my_pipeline")
.option("checkpointLocation",        "/checkpoints/my_pipeline")

# Separate subdirectories: also valid, and easier to browse
.option("cloudFiles.schemaLocation", "/checkpoints/my_pipeline/schema")
.option("checkpointLocation",        "/checkpoints/my_pipeline/state")

③ Schema Hints

By default, Auto Loader infers the schema by sampling incoming files. But sometimes you know part of the schema in advance — or you need specific columns to be a particular type. That's where schema hints come in.

Schema hints let you override the inferred type for specific columns without providing a full explicit schema. You define hints as a comma-separated string of column_name type pairs:

Python — schema hints
df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/ckpt/schema")
  # Override specific columns — leave rest to inference
  .option("cloudFiles.schemaHints",
          "event_timestamp TIMESTAMP, user_id LONG, amount DOUBLE")
  .load("s3://my-bucket/landing/events/"))

Schema hints vs. full schema

|  | Schema hints | Full schema (.schema()) |
| --- | --- | --- |
| Syntax | cloudFiles.schemaHints option | .schema(StructType(...)) |
| Columns covered | Only specified columns — rest inferred | All columns explicitly defined |
| New columns | Still handled by inference + evolution | Go to rescue data column |
| Best for | Fixing inferred types for key columns | Fully controlled, stable schemas |
Key exam point
Schema hints do not prevent new unknown columns from being ingested — they just fix the type of known columns. Unknown columns still flow through (or go to the rescue data column if a full schema is in use).

④ The Rescue Data Column

🚑 What is the rescue data column?

When Auto Loader uses schema inference or a full explicit schema, any data that doesn't fit the schema — wrong types, unexpected columns, malformed values — is not dropped or rejected. Instead, it's captured in a special column called _rescued_data as a JSON string. This means you never silently lose data.

The rescue data column is added automatically whenever Auto Loader infers the schema, which is the default when you set cloudFiles.schemaLocation without a full explicit schema. If you do supply an explicit schema, keep the behaviour by setting the rescuedDataColumn option. Either way, you need to know what the column contains and when it fires.

What gets rescued?

| Scenario | Goes to _rescued_data? |
| --- | --- |
| Column exists in schema but value is wrong type | Yes |
| Column in file not present in schema | Yes |
| Column name has different capitalisation | Yes |
| Column in schema but missing from file | No — returns null |
| Entire row is malformed JSON | No — goes to corrupt record handling |
Python — querying the rescue data column
# Read stream with schema enforcement
from pyspark.sql.types import StructType, StructField, StringType, LongType

mySchema = StructType([
  StructField("event_id", LongType()),
  StructField("event_type", StringType()),
  StructField("user_id", LongType())
])

df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/ckpt/schema")
  .schema(mySchema)
  .load("s3://my-bucket/landing/"))

# Check for rescued data — good data quality practice
rescued = df.filter(df._rescued_data.isNotNull())
rescued.display()
Key exam point — rescue vs. schema evolution
These are complementary, not alternatives. Schema evolution (mergeSchema=true) adds new columns to the Delta table over time. The rescue data column captures data that can't fit the current schema right now. Both can be active simultaneously.
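
To make the "both at once" point concrete, here is a short sketch (paths and table name are illustrative) with evolution and rescue active in the same pipeline: new columns are merged into the Delta table over time, while values that cannot fit the current schema are captured in _rescued_data.

Python — schema evolution and rescue together
df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/ckpt/schema")  # inference: rescue column added automatically
  .load("s3://my-bucket/landing/events/"))

(df.writeStream
  .format("delta")
  .option("checkpointLocation", "/ckpt/events")
  .option("mergeSchema", "true")       # evolution: new columns are added to the target table
  .trigger(availableNow=True)
  .toTable("catalog.bronze.events"))

# Rescue: type mismatches and unexpected values land in the table's _rescued_data column
# and can be monitored with a query such as:
#   SELECT * FROM catalog.bronze.events WHERE _rescued_data IS NOT NULL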

⑤ Trigger Modes

Auto Loader is a Structured Streaming source, so the write stream trigger controls how often it processes new files:

| Trigger | Behaviour | Best for |
| --- | --- | --- |
| trigger(availableNow=True) | Processes all new files (possibly across several micro-batches), then stops | Scheduled Jobs — run on a cron, process everything, shut down |
| trigger(processingTime="5 minutes") | Runs a micro-batch every 5 minutes, continuously | Near-real-time ingestion with bounded latency |
| trigger(once=True) | Legacy version of availableNow — processes once, then stops | Older pipelines — prefer availableNow in DBR 10.1+ |
| No trigger (default) | Runs as fast as possible — continuous micro-batches | Low-latency streaming |
Key exam point — availableNow vs. once
trigger(availableNow=True) is the modern replacement for trigger(once=True). Both process all available files and stop — but availableNow is more efficient as it can use multiple micro-batches. If an exam question asks about processing all pending files in a scheduled job, availableNow is the answer.
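
For quick recall, here are the three explicit trigger calls side by side (attach exactly one of them to a writeStream):

Python — trigger syntax at a glance
# Modern batch-style run: process everything available (possibly in several micro-batches), then stop
.trigger(availableNow=True)

# Legacy single-batch equivalent; prefer availableNow on newer runtimes
.trigger(once=True)

# Continuous ingestion: start a micro-batch every 5 minutes
.trigger(processingTime="5 minutes")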
🏗️ Real-world note

In production Bronze pipelines, the typical pattern is Auto Loader with trigger(availableNow=True) triggered by a Databricks Workflow on a schedule — say, every 15 minutes. This gives you a balance between latency and cost: you're not running a continuous stream 24/7, but you're still ingesting regularly. The checkpoint ensures each run picks up exactly where the last one left off.

Exam-Style Practice Questions

Work through each question, then check your answer against the relevant section above.

Q1 The following code is missing something critical. What is wrong?

spark.readStream.format("cloudFiles").option("cloudFiles.format","json").load("s3://bucket/data/")
Q2 A data engineer sets cloudFiles.schemaLocation and checkpointLocation to the same path. What happens?
Q3 An incoming JSON file contains a column promo_code that is not in the defined schema. With Auto Loader's rescue data column enabled, what happens to this data?
Q4 You want Auto Loader to infer the schema but force event_timestamp to be a TIMESTAMP type instead of a string. Which option do you use?
Q5 A pipeline uses trigger(availableNow=True). What does this mean?
Q6 Which of the following correctly describes the difference between cloudFiles.schemaLocation and checkpointLocation?

Common Exam Traps

These are the mistakes the exam is designed to catch:

- Using .format("json") (or "csv", "parquet") directly on the streaming read instead of .format("cloudFiles"), so Auto Loader never activates.
- Putting the file type in the wrong place: it belongs in .option("cloudFiles.format", ...), not in .format().
- Omitting checkpointLocation from the writeStream, which fails with "checkpointLocation must be specified".
- Mixing up cloudFiles.schemaLocation (readStream, stores the inferred schema) and checkpointLocation (writeStream, stores the streaming state).
- Assuming schema hints define the whole schema or block new columns; they only fix the types of the columns you name.
- Assuming mismatched or unexpected data is dropped; it is captured in _rescued_data instead.
- Confusing trigger(once=True) with trigger(availableNow=True); availableNow is the modern choice for scheduled "process everything, then stop" jobs.

⚡ Quick Memory Trick
The Auto Loader checklist: cloudFiles format → cloudFiles.format → schemaLocation → checkpointLocation. Every valid Auto Loader pipeline has all four. If one is missing in an exam code snippet — that is the bug.
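
As a closing reference, here is a minimal sketch containing exactly those four items (bucket, paths, and table name are illustrative):

Python — minimal valid Auto Loader pipeline
(spark.readStream
  .format("cloudFiles")                                  # 1. activates Auto Loader
  .option("cloudFiles.format", "json")                   # 2. actual file format
  .option("cloudFiles.schemaLocation", "/ckpt/schema")   # 3. schema storage
  .load("s3://my-bucket/landing/events/")
  .writeStream
  .option("checkpointLocation", "/ckpt/state")           # 4. streaming state
  .trigger(availableNow=True)
  .toTable("catalog.bronze.events"))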