The Core Idea

When something goes wrong in a Databricks pipeline — a job is slow, a task fails, a cluster crashes — there are four distinct tools for diagnosing the problem. The exam tests whether you know which tool to reach for based on the symptom, and what each tool actually shows you.

Think of them as four layers of observability: the Spark UI for job and stage performance, event logs for historical Spark execution, cluster logs for infrastructure and driver errors, and the built-in debugger for interactive code-level inspection.

Exam Mindset
Every debugging question gives you a symptom and asks which tool to use. Map symptom → tool before reading the options. Slow query = Spark UI. Cluster crashed = cluster logs. Past job failure = event logs. Code logic error in notebook = built-in debugger.

Quick Decision Map

Use this as your first pass on any exam debugging scenario:

| Symptom | Tool |
| --- | --- |
| A job is running slow — I need to find the bottleneck | Spark UI |
| A stage has massive data skew across tasks | Spark UI · Stages tab |
| I need to see what happened in a job that already completed | Event Logs |
| The cluster terminated unexpectedly — I need to know why | Cluster Logs · Driver logs |
| A library failed to install at cluster startup | Cluster Logs · Init script logs |
| A worker node is throwing OOM errors | Cluster Logs · Executor logs |
| I want to step through notebook code and inspect variables | Built-in Debugger |
| A UDF is returning wrong values on specific rows | Built-in Debugger |

1. The Spark UI

The Spark UI is the primary tool for understanding job performance. It gives you a live, structured view of every job, stage, and task running on the cluster — showing exactly where time is being spent and where problems are occurring.

You access it from the Clusters page → Spark UI button, or directly from a running notebook via the cluster link at the top.

The Key Tabs and What They Show

📋 Jobs Tab (top-level timing): Lists every Spark action triggered. Shows duration, status, and number of stages. Start here to find which job is slow.

🔢 Stages Tab (skew · shuffle · spill): Breaks each job into stages. Shows shuffle read/write, spill to disk, and task distribution. Key for finding skew and shuffles.

🗂️ Task Details (task-level skew): Individual task timings, shown on each stage's detail page rather than a separate tab. A few tasks taking much longer than others = data skew.

🗺️ DAG Visualisation (DAG · wide transforms): Shows the lineage of transformations as a directed acyclic graph, linked from the Jobs and Stages pages. Helps spot unnecessary wide transformations and shuffle points.

💾 Storage Tab (cache · persist): Shows cached RDDs and DataFrames — size in memory, on-disk spill, and fraction cached. Check here if caching isn't working.

🌐 Environment Tab (spark config): Shows all Spark configuration properties, system properties, and classpath. Useful for confirming config settings took effect; see the snippet after this list.
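The Environment tab is where you verify settings in the UI, but you can also read a property back directly in a cell. A minimal sketch using two standard Spark SQL properties:

```python
# Confirm that configuration actually took effect on the running cluster.
# `spark` is the SparkSession provided in every Databricks notebook.
print(spark.conf.get("spark.sql.shuffle.partitions"))  # shuffle parallelism
print(spark.conf.get("spark.sql.adaptive.enabled"))    # adaptive query execution
```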

The Three Performance Problems to Spot in the Spark UI

| Problem | Where to look | What you'll see |
| --- | --- | --- |
| Data Skew | Stages tab → Tasks | Most tasks finish in 1s, but 1–2 tasks take 5+ minutes. Task duration bar chart is heavily uneven. |
| Shuffle Spill | Stages tab → Shuffle Write / Spill columns | Large values in the "Shuffle Spill (Disk)" column. Means executors ran out of memory and spilled to disk. |
| Unnecessary Shuffles | DAG Visualisation | Wide transformations (joins, groupBy) creating extra stages. Repartitioning or broadcast joins may help; see the sketch below. |
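Both skew and unnecessary shuffles can be confirmed or attacked from code. A minimal sketch, assuming a hypothetical orders DataFrame skewed on customer_id and a small customers dimension table:

```python
from pyspark.sql import functions as F

# Confirm skew: count rows per join key. A handful of keys holding most
# of the rows explains a few very slow tasks in the Stages tab.
(orders.groupBy("customer_id")
       .count()
       .orderBy(F.desc("count"))
       .show(10))

# Remove the shuffle for a small dimension table: broadcast() ships
# `customers` to every executor, turning a shuffle join into a
# map-side broadcast join with no extra stage.
joined = orders.join(F.broadcast(customers), "customer_id")
```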
Exam trap — Spark UI availability
The Spark UI is only available while the cluster is running. Once the cluster is terminated, you must use event logs to inspect past execution. This is a very common exam distinction.

2. Event Logs

Event logs are the persistent, historical record of Spark execution. When a cluster is terminated and the Spark UI is gone, event logs are how you reconstruct what happened during a job run.

Databricks automatically writes event logs to a configured storage location (DBFS or cloud storage). They can be replayed in the Spark History Server — which looks identical to the live Spark UI but shows completed jobs.
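If cluster log delivery is configured (covered in the next section), the delivered files can be listed straight from a notebook. A minimal sketch; the delivery path and cluster ID are hypothetical, and the exact directory layout depends on your log delivery settings:

```python
# List delivered log files for a (hypothetical) cluster ID.
# dbutils is available automatically in Databricks notebooks.
for entry in dbutils.fs.ls("dbfs:/cluster-logs/0123-456789-abcde1/"):
    print(entry.path)  # typically driver/, executor/, eventlog/, init_scripts/
```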

Event logs vs. Spark UI

| | Spark UI | Event Logs / History Server |
| --- | --- | --- |
| Available when | Cluster is running | Always — even after cluster termination |
| Data shown | Live and recent jobs | Historical completed jobs |
| Access via | Cluster page → Spark UI | Spark History Server or event log files |
| Storage | In-memory on driver | Persisted to DBFS or cloud storage |
| Best for | Live debugging | Post-mortem analysis of completed/failed jobs |
Key exam point
If an exam question describes a job that already completed or failed and asks how to investigate — the answer is event logs, not the Spark UI. The Spark UI would no longer be available if the cluster was terminated after the run.

3. Cluster Logs

Cluster logs capture everything happening at the infrastructure level — driver output, executor errors, library installations, and init script execution. When the issue is at the cluster level rather than the Spark job level, this is where you look.

Cluster logs are configured to deliver to a DBFS path or cloud storage location, set in the cluster's Advanced Options → Logging settings.
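The same setting can be supplied when a cluster is created through the Clusters API, via the cluster_log_conf field of the cluster spec. A minimal sketch of the relevant fragment (the name and destination path are hypothetical):

```python
# Fragment of a Clusters API cluster spec with log delivery enabled.
# Logs are delivered periodically to <destination>/<cluster-id>/.
cluster_spec = {
    "cluster_name": "nightly-etl",  # hypothetical
    "cluster_log_conf": {
        "dbfs": {"destination": "dbfs:/cluster-logs"}
    },
    # ...plus the usual required fields (spark_version, node_type_id, ...)
}
```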

Types of Cluster Logs

🚗 Driver Logs (stdout · stderr): Standard output and error from the driver node. Contains Python exceptions, print statements, and stack traces from your notebook or job code.

⚙️ Executor Logs (OOM · worker errors): Logs from individual worker nodes. OOM errors, task failures, and worker-side exceptions appear here — not in the driver logs.

🚀 Init Script Logs (install failures): Output from cluster init scripts. If a library install or environment setup fails at cluster start, the error will be here.

📦 Log4j Logs (spark internals): Detailed internal Spark framework logs. Used for deep debugging of Spark internals — rarely needed for application-level issues.

Key exam point — driver vs. executor logs
Driver logs contain your application code errors — Python exceptions, stack traces, print output. Executor logs contain worker-side errors — OOM on tasks, shuffle failures, task-level exceptions. Knowing which is which is a direct exam question.
🏗️ Real-world note

In practice, the most common cluster log use case is init script failures — a cluster refuses to start because a custom library install failed. The error is silent in the UI (cluster just shows as "failed to start") but the init script log will have the exact pip error or dependency conflict. Always check init script logs first when a cluster won't start.

4. The Built-in Debugger

The Databricks notebook has a built-in Python debugger that lets you pause execution mid-cell, step through code line by line, and inspect variable values in real time — without leaving the notebook interface.

It is activated using the standard Python breakpoint() function (Python 3.7+) or the older pdb.set_trace(). When execution hits a breakpoint, an interactive debug console appears at the bottom of the cell.
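A minimal sketch of what this looks like in a cell; the function and values are made up for illustration:

```python
# Hypothetical helper with a suspected logic error.
def normalise_amount(amount, rate):
    converted = amount * rate
    breakpoint()               # execution pauses here; a pdb console opens
    return round(converted, 2)

normalise_amount(19.99, 0.79)
# At the (Pdb) prompt:
#   p converted   -> inspect the intermediate value
#   n             -> step to the next line
#   c             -> continue to the end of the cell
```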

What you can do in the debugger

| Command | What it does |
| --- | --- |
| n | Next line — step to the next line without entering function calls |
| s | Step into — enter a function call and debug inside it |
| c | Continue — resume execution until the next breakpoint or end |
| p variable | Print — inspect the current value of any variable |
| l | List — show the surrounding source code lines |
| q | Quit — exit the debugger and stop execution |
When to use the built-in debugger
The debugger is best for logic errors in Python code — a transformation returning unexpected values, a UDF behaving incorrectly on certain rows, or a function that isn't doing what you think it is. It is not useful for Spark performance problems — use the Spark UI for those.
Exam trap — debugger scope
The built-in debugger works on Python driver-side code only. You cannot use it to step through code that runs on executor nodes (e.g. inside a UDF that gets distributed). For executor-side issues, use executor logs or add explicit logging.
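One workaround worth knowing: anything a UDF writes to stdout on a worker lands in that worker's executor logs, not in the notebook. A minimal sketch with a hypothetical UDF:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(StringType())
def classify(value):
    # breakpoint() would NOT pause here: this body runs on executors.
    # print output lands in the executor's stdout log instead.
    if value is None:
        print("classify() received a NULL input")  # visible in executor logs
        return "unknown"
    return "high" if value > 100 else "low"

df = spark.range(200).withColumn("label", classify(F.col("id")))
df.show(5)
```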

Full Tool Comparison

| Tool | Best for | Available when | Access |
| --- | --- | --- | --- |
| Spark UI | Live performance — skew, spill, slow stages, DAG inspection | Cluster running only | Clusters page → Spark UI |
| Event Logs | Historical analysis of completed/failed jobs after cluster terminates | Always (persisted to storage) | Spark History Server / DBFS |
| Cluster Logs | Cluster-level errors — driver crashes, OOM, init script failures | Always (delivered to log path) | Cluster page → View Logs / DBFS log path |
| Built-in Debugger | Python logic errors — step through, inspect variables | During interactive notebook execution | breakpoint() in a notebook cell |

Exam-Style Practice Questions

Answers follow each question in parentheses.

Q1 A Spark job completed successfully but took much longer than expected. The cluster is still running. Which tool should the data engineer use first to identify the bottleneck? (The Spark UI: the cluster is still up, so live job and stage timings are available.)
Q2 A scheduled overnight job failed. The cluster has since been terminated. How can the data engineer investigate the Spark execution? (Event logs, replayed in the Spark History Server; the Spark UI disappeared with the cluster.)
Q3 In the Spark UI Stages tab, most tasks in a stage complete in under 2 seconds, but three tasks are taking over 10 minutes. What does this indicate? (Data skew: a few partitions hold far more data than the rest.)
Q4 A cluster fails to start after a new init script was added. Where should the engineer look to find the error? (The init script logs within the cluster logs.)
Q5 A data engineer wants to pause execution inside a notebook cell and inspect the value of a variable mid-run. Which tool do they use? (The built-in debugger, activated with breakpoint().)
Q6 Worker nodes are throwing OutOfMemoryError during a large shuffle operation. Which log type contains these errors? (Executor logs: worker-side errors do not appear in driver logs.)

Common Exam Traps

These are the mistakes the exam is designed to catch:

- Reaching for the Spark UI to investigate a job on a terminated cluster. The UI is gone; use event logs and the History Server.
- Looking in driver logs for worker-side failures such as OOM errors. Those live in the executor logs.
- Expecting breakpoint() to pause inside a distributed UDF. The built-in debugger is driver-side only.
- Treating a cluster that won't start as a Spark problem. Check the init script logs first.

⚡ Quick Memory Trick
Slow job running now → Spark UI. Job already finished → Event Logs. Cluster won't start → Cluster Logs. Wrong logic in my code → Debugger. Four symptoms, four tools, one rule each.