The Core Idea
This topic is about the syntax of Auto Loader — not just knowing it exists, but being able to read a code snippet and identify what each option does, what is required vs. optional, and what happens when something is missing or misconfigured.
The exam will show you Auto Loader code and ask you to spot the bug, complete the missing option, or explain what a specific line does. This article covers every option you need to know cold.
The Complete Auto Loader Template
This is the full production-ready pattern. Every exam question is a variation of this structure:
```python
# READ STREAM — source config
df = (spark.readStream
    .format("cloudFiles")                                 # ① always "cloudFiles"
    .option("cloudFiles.format", "json")                  # ② actual file format
    .option("cloudFiles.schemaLocation", "/ckpt/schema")  # ③ schema storage
    .option("cloudFiles.inferColumnTypes", "true")        # ④ infer exact types
    .schema(mySchema)                                     # ⑤ optional explicit schema
    .load("s3://my-bucket/landing/events/"))

# WRITE STREAM — target config
(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/ckpt/events")         # ⑥ required checkpoint
    .option("mergeSchema", "true")                        # ⑦ schema evolution
    .trigger(availableNow=True)                           # ⑧ batch trigger
    .toTable("catalog.bronze.events"))
```
Each numbered line is a potential exam question. Let's go through all of them.
① The cloudFiles Format
The outer format is always "cloudFiles" — this is what activates Auto Loader. The actual file format (JSON, CSV, Parquet, etc.) is specified separately inside cloudFiles.format.
.format("cloudFiles") goes on readStream. .option("cloudFiles.format", "json") specifies the file type. If an exam question uses .format("json") directly on a streaming read, that is the bug — it will not activate Auto Loader.
| Option | Where | Required | Example value |
|---|---|---|---|
| format("cloudFiles") | readStream | Yes | Always "cloudFiles" |
| cloudFiles.format | .option() | Yes | "json", "csv", "parquet", "avro" |
| cloudFiles.schemaLocation | .option() | Yes* | Cloud storage path for schema |
| cloudFiles.inferColumnTypes | .option() | Optional | "true" — infers exact types vs. all strings |
| cloudFiles.useNotifications | .option() | Optional | "true" for event-based detection |
* Required when using schema inference. Can be omitted only if you provide a full explicit schema.
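To make the trap concrete, here is the wrong and the right read side by side. This is a minimal sketch reusing the paths from the template above; it assumes schema inference, hence the schemaLocation:

```python
# ✗ Wrong: a plain streaming file source. Auto Loader is never activated.
df = spark.readStream.format("json").load("s3://my-bucket/landing/events/")

# ✓ Correct: "cloudFiles" activates Auto Loader; the file type goes in cloudFiles.format
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/ckpt/schema")
    .load("s3://my-bucket/landing/events/"))
```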
② Checkpointing
The checkpoint is what makes Auto Loader stateful. It records exactly which files have been processed so the pipeline can resume exactly where it left off after a restart — with no duplicate processing and no missed files.
There are actually two separate checkpoint-related options and the exam tests whether you know the difference:
| Option | Where it goes | What it stores |
|---|---|---|
| cloudFiles.schemaLocation | On readStream | The inferred schema — column names and types. Prevents re-inferring on every run. |
| checkpointLocation | On writeStream | The streaming state — which files have been processed, offsets, progress metadata. |
If checkpointLocation is missing from the writeStream, the pipeline throws an error: checkpointLocation must be specified. This is one of the most common "what's wrong with this code?" exam questions.
schemaLocation and checkpointLocation paths must be different. Using the same path for both will cause conflicts. Always give each its own directory.
.option("cloudFiles.schemaLocation", "/checkpoints/my_pipeline/schema")
.option("checkpointLocation", "/checkpoints/my_pipeline/state")
# ✗ Wrong — same path causes conflicts
.option("cloudFiles.schemaLocation", "/checkpoints/my_pipeline")
.option("checkpointLocation", "/checkpoints/my_pipeline")
③ Schema Hints
By default, Auto Loader infers the schema by sampling incoming files. But sometimes you know part of the schema in advance — or you need specific columns to be a particular type. That's where schema hints come in.
Schema hints let you override the inferred type for specific columns without providing a full explicit schema. You define hints as a comma-separated string of column_name type pairs:
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/ckpt/schema")
# Override specific columns — leave rest to inference
.option("cloudFiles.schemaHints",
"event_timestamp TIMESTAMP, user_id LONG, amount DOUBLE")
.load("s3://my-bucket/landing/events/"))
Schema hints vs. full schema
| | Schema Hints | Full Schema (.schema()) |
|---|---|---|
| Syntax | cloudFiles.schemaHints option | .schema(StructType(...)) |
| Columns covered | Only specified columns — rest inferred | All columns explicitly defined |
| New columns | Still handled by inference + evolution | Go to the rescue data column |
| Best for | Fixing inferred types for key columns | Fully controlled, stable schemas |
④ The Rescue Data Column
When Auto Loader uses schema inference or a full explicit schema, any data that doesn't fit the schema — wrong types, unexpected columns, malformed values — is not dropped or rejected. Instead, it's captured in a special column called _rescued_data as a JSON string. This means you never silently lose data.
The rescue data column is enabled automatically when you use cloudFiles.schemaLocation. You don't have to turn it on — but you need to know what it contains and when it fires.
What gets rescued?
| Scenario | Goes to _rescued_data? |
|---|---|
| Column exists in schema but value is wrong type | Yes |
| Column in file not present in schema | Yes |
| Column name has different capitalisation | Yes |
| Column in schema but missing from file | No — returns null |
| Entire row is malformed JSON | No — goes to corrupt record |
```python
from pyspark.sql.types import StructType, StructField, StringType, LongType

mySchema = StructType([
    StructField("event_id", LongType()),
    StructField("event_type", StringType()),
    StructField("user_id", LongType())
])

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/ckpt/schema")
    .schema(mySchema)
    .load("s3://my-bucket/landing/"))

# Check for rescued data — good data quality practice
rescued = df.filter(df._rescued_data.isNotNull())
rescued.display()
```
Don't confuse this with schema evolution. Schema evolution (mergeSchema=true) adds new columns to the Delta table over time. The rescue data column captures data that can't fit the current schema right now. Both can be active simultaneously.
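Because _rescued_data is just a JSON string, you can pull individual fields back out with ordinary JSON functions. A minimal sketch, assuming a hypothetical rescued field named promo_code:

```python
from pyspark.sql import functions as F

# _rescued_data is a JSON string; extract a hypothetical promo_code key from it
recovered = (df
    .filter(F.col("_rescued_data").isNotNull())
    .withColumn("promo_code",
                F.get_json_object("_rescued_data", "$.promo_code")))
```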
⑤ Trigger Modes
Auto Loader is a Structured Streaming source, so the write stream trigger controls how often it processes new files:
| Trigger | Behaviour | Best for |
|---|---|---|
| trigger(availableNow=True) | Processes all pending files, possibly across multiple micro-batches, then stops | Scheduled Jobs — run on a cron, process everything, shut down |
| trigger(processingTime="5 minutes") | Runs a micro-batch every 5 minutes continuously | Near-real-time ingestion with bounded latency |
| trigger(once=True) | Legacy version of availableNow — processes everything in a single micro-batch, then stops | Older pipelines — prefer availableNow in DBR 10.1+ |
| No trigger (default) | Runs as fast as possible — continuous micro-batches | Low-latency streaming |
trigger(availableNow=True) is the modern replacement for trigger(once=True). Both process all available files and stop — but availableNow is more efficient as it can use multiple micro-batches. If an exam question asks about processing all pending files in a scheduled job, availableNow is the answer.
In production Bronze pipelines, the typical pattern is Auto Loader with trigger(availableNow=True) triggered by a Databricks Workflow on a schedule — say, every 15 minutes. This gives you a balance between latency and cost: you're not running a continuous stream 24/7, but you're still ingesting regularly. The checkpoint ensures each run picks up exactly where the last one left off.
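Putting the pieces together, here is a minimal sketch of that scheduled Bronze job. The paths and table name are illustrative placeholders, not fixed conventions:

```python
# Bronze ingestion job, run on a schedule (e.g., every 15 minutes via a Workflow).
# Paths and table names below are illustrative placeholders.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/ckpt/orders/schema")
    .load("s3://my-bucket/landing/orders/"))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/ckpt/orders/state")
    .trigger(availableNow=True)   # process everything pending, then shut down
    .toTable("catalog.bronze.orders"))
```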
Exam-Style Practice Questions
Before moving on, test yourself on these:
1. What is missing from this Auto Loader read? spark.readStream.format("cloudFiles").option("cloudFiles.format","json").load("s3://bucket/data/")
2. A pipeline sets cloudFiles.schemaLocation and checkpointLocation to the same path. What happens?
3. A file arrives with a column promo_code that is not in the defined schema. With Auto Loader's rescue data column enabled, what happens to this data?
4. You need event_timestamp to be a TIMESTAMP type instead of a string. Which option do you use?
5. A pipeline uses trigger(availableNow=True). What does this mean?
6. What is the difference between cloudFiles.schemaLocation and checkpointLocation?
Common Exam Traps
These are the mistakes the exam is designed to catch:
- Using .format("json") instead of .format("cloudFiles") — always "cloudFiles" on the outer format.
- Forgetting checkpointLocation on the write stream — the pipeline will error without it.
- Using the same path for schemaLocation and checkpointLocation — they must be separate directories.
- Thinking schema hints replace the full schema — hints only override specific column types; the rest are still inferred.
- Thinking the rescue data column must be manually enabled — it is automatic when schemaLocation is set.
- Confusing trigger(once=True) with trigger(availableNow=True) — prefer availableNow on DBR 10.1+ as it is more efficient.
- Thinking mergeSchema=true and the rescue column do the same thing — schema evolution adds new columns to the table; the rescue column captures data that doesn't fit the current schema right now.