The Core Idea
This topic is about the syntax of Auto Loader — not just knowing it exists, but being able to read a code snippet and identify what each option does, what is required vs. optional, and what happens when something is missing or misconfigured.
The exam will show you Auto Loader code and ask you to spot the bug, complete the missing option, or explain what a specific line does. This article covers every option you need to know cold.
The Complete Auto Loader Template
This is the full production-ready pattern. Every exam question is a variation of this structure:
```python
# READ STREAM — source config
df = (spark.readStream
    .format("cloudFiles")                                 # ① always "cloudFiles"
    .option("cloudFiles.format", "json")                  # ② actual file format
    .option("cloudFiles.schemaLocation", "/ckpt/schema")  # ③ schema storage
    .option("cloudFiles.inferColumnTypes", "true")        # ④ infer exact types
    .schema(mySchema)                                     # ⑤ optional explicit schema
    .load("s3://my-bucket/landing/events/"))

# WRITE STREAM — target config
(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/ckpt/events")         # ⑥ required checkpoint
    .option("mergeSchema", "true")                        # ⑦ schema evolution
    .trigger(availableNow=True)                           # ⑧ batch trigger
    .toTable("catalog.bronze.events"))
```
Each numbered line is a potential exam question. Let's go through all of them.
① The cloudFiles Format
The outer format is always "cloudFiles" — this is what activates Auto Loader. The actual file format (JSON, CSV, Parquet, etc.) is specified separately inside cloudFiles.format.
.format("cloudFiles") goes on readStream. .option("cloudFiles.format", "json") specifies the file type. If an exam question uses .format("json") directly on a streaming read, that is the bug — it will not activate Auto Loader.
| Option | Where | Required | Example value |
|---|---|---|---|
| format("cloudFiles") | readStream | Yes | Always "cloudFiles" |
| cloudFiles.format | .option() | Yes | "json", "csv", "parquet", "avro" |
| cloudFiles.schemaLocation | .option() | Yes* | Cloud storage path for schema |
| cloudFiles.inferColumnTypes | .option() | Optional | "true" — infers exact types vs. all strings |
| cloudFiles.useNotifications | .option() | Optional | "true" for event-based detection |
* Required when using schema inference. Can be omitted only if you provide a full explicit schema.
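To make the trap concrete, here is the wrong and the right read side by side. This is a minimal sketch reusing the paths from the template above; it assumes schema inference, hence the schemaLocation:

```python
# ✗ Wrong: a plain streaming file source. Auto Loader is never activated.
df = spark.readStream.format("json").load("s3://my-bucket/landing/events/")

# ✓ Correct: "cloudFiles" activates Auto Loader; the file type goes in cloudFiles.format
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/ckpt/schema")
    .load("s3://my-bucket/landing/events/"))
```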
② Checkpointing
The checkpoint is what makes Auto Loader stateful. It records exactly which files have been processed so the pipeline can resume exactly where it left off after a restart — with no duplicate processing and no missed files.
There are actually two separate checkpoint-related options and the exam tests whether you know the difference:
| Option | Where it goes | What it stores |
|---|---|---|
| cloudFiles.schemaLocation | On readStream | The inferred schema — column names and types. Prevents re-inferring on every run. |
| checkpointLocation | On writeStream | The streaming state — which files have been processed, offsets, progress metadata. |
If checkpointLocation is missing from the writeStream, the pipeline throws an error: checkpointLocation must be specified. This is one of the most common "what's wrong with this code?" exam questions.
schemaLocation and checkpointLocation paths must be different. Using the same path for both will cause conflicts. Always give each its own directory.
.option("cloudFiles.schemaLocation", "/checkpoints/my_pipeline/schema")
.option("checkpointLocation", "/checkpoints/my_pipeline/state")
# ✗ Wrong — same path causes conflicts
.option("cloudFiles.schemaLocation", "/checkpoints/my_pipeline")
.option("checkpointLocation", "/checkpoints/my_pipeline")
③ Schema Hints
By default, Auto Loader infers the schema by sampling incoming files. But sometimes you know part of the schema in advance — or you need specific columns to be a particular type. That's where schema hints come in.
Schema hints let you override the inferred type for specific columns without providing a full explicit schema. You define hints as a comma-separated string of column_name type pairs:
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/ckpt/schema")
# Override specific columns — leave rest to inference
.option("cloudFiles.schemaHints",
"event_timestamp TIMESTAMP, user_id LONG, amount DOUBLE")
.load("s3://my-bucket/landing/events/"))
Schema hints vs. full schema
| | Schema Hints | Full Schema (.schema()) |
|---|---|---|
| Syntax | cloudFiles.schemaHints option | .schema(StructType(...)) |
| Columns covered | Only specified columns — rest inferred | All columns explicitly defined |
| New columns | Still handled by inference + evolution | Go to the rescue data column |
| Best for | Fixing inferred types for key columns | Fully controlled, stable schemas |
④ The Rescue Data Column
When Auto Loader uses schema inference or a full explicit schema, any data that doesn't fit the schema — wrong types, unexpected columns, malformed values — is not dropped or rejected. Instead, it's captured in a special column called _rescued_data as a JSON string. This means you never silently lose data.
The rescue data column is enabled automatically when you use cloudFiles.schemaLocation. You don't have to turn it on — but you need to know what it contains and when it fires.
What gets rescued?
| Scenario | Goes to _rescued_data? |
|---|---|
| Column exists in schema but value is wrong type | Yes |
| Column in file not present in schema | Yes |
| Column name has different capitalisation | Yes |
| Column in schema but missing from file | No — returns null |
| Entire row is malformed JSON | No — goes to corrupt record |
```python
from pyspark.sql.types import StructType, StructField, StringType, LongType

mySchema = StructType([
    StructField("event_id", LongType()),
    StructField("event_type", StringType()),
    StructField("user_id", LongType())
])

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/ckpt/schema")
    .schema(mySchema)
    .load("s3://my-bucket/landing/"))

# Check for rescued data — good data quality practice
rescued = df.filter(df._rescued_data.isNotNull())
rescued.display()
```
Don't confuse this with schema evolution. Schema evolution (mergeSchema=true) adds new columns to the Delta table over time. The rescue data column captures data that can't fit the current schema right now. Both can be active simultaneously.
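Because _rescued_data is just a JSON string, you can pull individual fields back out with ordinary JSON functions. A minimal sketch, assuming a hypothetical rescued field named promo_code:

```python
from pyspark.sql import functions as F

# _rescued_data is a JSON string; extract a hypothetical promo_code key from it
recovered = (df
    .filter(F.col("_rescued_data").isNotNull())
    .withColumn("promo_code",
                F.get_json_object("_rescued_data", "$.promo_code")))
```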
⑤ Trigger Modes
Auto Loader is a Structured Streaming source, so the write stream trigger controls how often it processes new files:
| Trigger | Behaviour | Best for |
|---|---|---|
| trigger(availableNow=True) | Processes all pending files, possibly across multiple micro-batches, then stops | Scheduled Jobs — run on a cron, process everything, shut down |
| trigger(processingTime="5 minutes") | Runs a micro-batch every 5 minutes continuously | Near-real-time ingestion with bounded latency |
| trigger(once=True) | Legacy version of availableNow — processes everything in a single micro-batch, then stops | Older pipelines — prefer availableNow in DBR 10.1+ |
| No trigger (default) | Runs as fast as possible — continuous micro-batches | Low-latency streaming |
trigger(availableNow=True) is the modern replacement for trigger(once=True). Both process all available files and stop — but availableNow is more efficient as it can use multiple micro-batches. If an exam question asks about processing all pending files in a scheduled job, availableNow is the answer.
In production Bronze pipelines, the typical pattern is Auto Loader with trigger(availableNow=True) triggered by a Databricks Workflow on a schedule — say, every 15 minutes. This gives you a balance between latency and cost: you're not running a continuous stream 24/7, but you're still ingesting regularly. The checkpoint ensures each run picks up exactly where the last one left off.
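Putting the pieces together, here is a minimal sketch of that scheduled Bronze job. The paths and table name are illustrative placeholders, not fixed conventions:

```python
# Bronze ingestion job, run on a schedule (e.g., every 15 minutes via a Workflow).
# Paths and table names below are illustrative placeholders.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/ckpt/orders/schema")
    .load("s3://my-bucket/landing/orders/"))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/ckpt/orders/state")
    .trigger(availableNow=True)   # process everything pending, then shut down
    .toTable("catalog.bronze.orders"))
```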
Exam-Style Practice Questions
Before moving on, test yourself on these:
1. What is missing from this Auto Loader read? spark.readStream.format("cloudFiles").option("cloudFiles.format","json").load("s3://bucket/data/")
2. A pipeline sets cloudFiles.schemaLocation and checkpointLocation to the same path. What happens?
3. A file arrives with a column promo_code that is not in the defined schema. With Auto Loader's rescue data column enabled, what happens to this data?
4. You need event_timestamp to be a TIMESTAMP type instead of a string. Which option do you use?
5. A pipeline uses trigger(availableNow=True). What does this mean?
6. What is the difference between cloudFiles.schemaLocation and checkpointLocation?
Common Exam Traps
These are the mistakes the exam is designed to catch:
- Using .format("json") instead of .format("cloudFiles") — always "cloudFiles" on the outer format.
- Forgetting checkpointLocation on the write stream — the pipeline will error without it.
- Using the same path for schemaLocation and checkpointLocation — they must be separate directories.
- Thinking schema hints replace the full schema — hints only override specific column types; the rest are still inferred.
- Thinking the rescue data column must be manually enabled — it is automatic when schemaLocation is set.
- Confusing trigger(once=True) with trigger(availableNow=True) — prefer availableNow on DBR 10.1+ as it is more efficient.
- Thinking mergeSchema=true and the rescue column do the same thing — schema evolution adds new columns to the table; the rescue column captures data that doesn't fit the current schema right now.