The Core Idea
When something goes wrong in a Databricks pipeline — a job is slow, a task fails, a cluster crashes — there are four distinct tools for diagnosing the problem. The exam tests whether you know which tool to reach for based on the symptom, and what each tool actually shows you.
Think of them as four layers of observability: the Spark UI for job and stage performance, event logs for historical Spark execution, cluster logs for infrastructure and driver errors, and the built-in debugger for interactive code-level inspection.
Quick Decision Map
Use this as your first pass on any exam debugging scenario:
- Job or query running slowly → Spark UI (Jobs, Stages, and Tasks tabs)
- Need to analyse a job after the cluster terminated → Event logs / Spark History Server
- Cluster won't start, or errors at the infrastructure level → Cluster logs (init script, driver, executor logs)
- Python logic error in notebook code → Built-in debugger via breakpoint()
1. The Spark UI
The Spark UI is the primary tool for understanding job performance. It gives you a live, structured view of every job, stage, and task running on the cluster — showing exactly where time is being spent and where problems are occurring.
You access it from the Clusters page → Spark UI button, or directly from a running notebook via the cluster link at the top.
The Key Tabs and What They Show
Jobs Tab
Lists every Spark action triggered. Shows duration, status, and number of stages. Start here to find which job is slow.
Stages Tab
Breaks each job into stages. Shows shuffle read/write, spill to disk, and task distribution. Key for finding skew and shuffles.
Tasks Tab
Individual task timings within a stage. A few tasks taking much longer than others = data skew.
DAG Visualisation
Shows the lineage of transformations as a directed acyclic graph. Helps spot unnecessary wide transformations and shuffle points.
Storage Tab
Shows cached RDDs and DataFrames — size in memory, on-disk spill, and fraction cached. Check here if caching isn't working.
Environment Tab
Shows all Spark configuration properties, system properties, and classpath. Useful for confirming config settings took effect.
The Three Performance Problems to Spot in the Spark UI
| Problem | Where to look | What you'll see |
|---|---|---|
| Data Skew | Stages tab → Tasks | Most tasks finish in 1s, but 1–2 tasks take 5+ minutes. Task duration bar chart is heavily uneven. |
| Shuffle Spill | Stages tab → Shuffle Write / Spill columns | Large values in "Shuffle Spill (Disk)" column. Means executor ran out of memory and spilled to disk. |
| Unnecessary Shuffles | DAG Visualisation | Wide transformations (joins, groupBy) creating extra stages. Repartition or broadcast joins may help. |
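The skew symptom in the table above can be checked mechanically: export the per-task durations from the Tasks table and compare the slowest task to the median. A minimal sketch in plain Python (the 10× threshold is an arbitrary assumption for illustration, not a Databricks default):

```python
from statistics import median

def looks_skewed(task_durations_s, ratio=10.0):
    """Flag a stage as skewed when the slowest task takes far longer
    than the median task (e.g. most tasks ~1s, a straggler ~300s)."""
    med = median(task_durations_s)
    return max(task_durations_s) > ratio * med

# Most tasks finish in about a second; two stragglers take minutes:
durations = [1.1, 0.9, 1.0, 1.2, 310.0, 295.0]
print(looks_skewed(durations))  # → True
```

The same idea is what your eye does on the task duration bar chart: a healthy stage has a flat distribution, a skewed one has a long tail.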
2. Event Logs
Event logs are the persistent, historical record of Spark execution. When a cluster is terminated and the Spark UI is gone, event logs are how you reconstruct what happened during a job run.
Databricks automatically writes event logs to a configured storage location (DBFS or cloud storage). They can be replayed in the Spark History Server — which looks identical to the live Spark UI but shows completed jobs.
Event logs vs. Spark UI
|  | Spark UI | Event Logs / History Server |
|---|---|---|
| Available when | Cluster is running | Always — even after cluster termination |
| Data shown | Live and recent jobs | Historical completed jobs |
| Access via | Cluster page → Spark UI | Spark History Server or event log files |
| Storage | In-memory on driver | Persisted to DBFS or cloud storage |
| Best for | Live debugging | Post-mortem analysis of completed/failed jobs |
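Because a Spark event log is a newline-delimited JSON file of listener events, a post-mortem can also be scripted directly against the log files. A minimal sketch that pulls job outcomes out of a log; the two sample events are hand-written in the SparkListenerJobEnd shape to keep the example self-contained, not taken from a real run:

```python
import json

sample_log = """\
{"Event": "SparkListenerJobStart", "Job ID": 0}
{"Event": "SparkListenerJobEnd", "Job ID": 0, "Job Result": {"Result": "JobSucceeded"}}
{"Event": "SparkListenerJobStart", "Job ID": 1}
{"Event": "SparkListenerJobEnd", "Job ID": 1, "Job Result": {"Result": "JobFailed"}}
"""

def job_results(log_text):
    """Map each Job ID to its final result by scanning JobEnd events."""
    results = {}
    for line in log_text.splitlines():
        event = json.loads(line)
        if event.get("Event") == "SparkListenerJobEnd":
            results[event["Job ID"]] = event["Job Result"]["Result"]
    return results

print(job_results(sample_log))  # → {0: 'JobSucceeded', 1: 'JobFailed'}
```

For exam purposes the key point stands regardless of tooling: the log persists after the cluster is gone, so anything the live Spark UI showed can be reconstructed from it.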
3. Cluster Logs
Cluster logs capture everything happening at the infrastructure level — driver output, executor errors, library installations, and init script execution. When the issue is at the cluster level rather than the Spark job level, this is where you look.
Cluster logs are configured to deliver to a DBFS path or cloud storage location, set in the cluster's Advanced Options → Logging settings.
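In the Clusters API, the same setting appears as a cluster_log_conf block in the cluster spec. A sketch of the JSON shape, where the destination path is an example value, not a required one:

```json
{
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/cluster-logs"
    }
  }
}
```

With delivery configured, logs land under the destination path in a per-cluster subdirectory, so they survive cluster termination.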
Types of Cluster Logs
Driver Logs
Standard output and error from the driver node. Contains Python exceptions, print statements, and stack traces from your notebook or job code.
Executor Logs
Logs from individual worker nodes. OOM errors, task failures, and worker-side exceptions appear here — not in the driver logs.
Init Script Logs
Output from cluster init scripts. If a library install or environment setup fails at cluster start, the error will be here.
Log4j Logs
Detailed internal Spark framework logs. Used for deep debugging of Spark internals — rarely needed for application-level issues.
In practice, the most common cluster log use case is init script failures — a cluster refuses to start because a custom library install failed. The error is silent in the UI (cluster just shows as "failed to start") but the init script log will have the exact pip error or dependency conflict. Always check init script logs first when a cluster won't start.
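Once you have the init script log text in hand, finding the root cause is usually a matter of pulling out pip's error lines. A minimal sketch; the sample log is invented to show the shape of a typical dependency failure, not copied from a real Databricks log:

```python
def pip_failures(log_text):
    """Return pip's ERROR lines so the root cause is visible at a glance."""
    return [line for line in log_text.splitlines()
            if line.startswith("ERROR:")]

sample = """\
Collecting somepackage==1.0
ERROR: Could not find a version that satisfies the requirement somepackage==1.0
ERROR: No matching distribution found for somepackage==1.0
"""
for line in pip_failures(sample):
    print(line)
```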
4. The Built-in Debugger
The Databricks notebook has a built-in Python debugger that lets you pause execution mid-cell, step through code line by line, and inspect variable values in real time — without leaving the notebook interface.
It is activated using the standard Python breakpoint() function (Python 3.7+) or the older pdb.set_trace(). When execution hits a breakpoint, an interactive debug console appears at the bottom of the cell.
What you can do in the debugger
| Command | What it does |
|---|---|
| n | Next line — step to the next line without entering function calls |
| s | Step into — enter a function call and debug inside it |
| c | Continue — resume execution until the next breakpoint or end |
| p variable | Print — inspect the current value of any variable |
| l | List — show the surrounding source code lines |
| q | Quit — exit the debugger and stop execution |
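A minimal sketch of driver-side use. Per PEP 553, breakpoint() honours the PYTHONBREAKPOINT environment variable; the first lines set it to 0 so the cell can also run unattended, and you would delete them to actually pause and use the commands above:

```python
import os
# PEP 553: PYTHONBREAKPOINT=0 turns breakpoint() into a no-op.
# Remove these two lines when you want execution to actually pause.
os.environ["PYTHONBREAKPOINT"] = "0"

def normalise(values):
    total = sum(values)
    # Execution pauses here; try `p total` to inspect the sum,
    # `n` to step, and `c` to resume.
    breakpoint()
    return [v / total for v in values]

print(normalise([1, 2, 2]))  # → [0.2, 0.4, 0.4]
```

Remember the exam trap below: this only works for driver-side Python, so a breakpoint() inside a distributed UDF will not give you an interactive console.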
Full Tool Comparison
| Tool | Best for | Available when | Access |
|---|---|---|---|
| Spark UI | Live performance — skew, spill, slow stages, DAG inspection | Cluster running only | Clusters page → Spark UI |
| Event Logs | Historical analysis of completed/failed jobs after cluster terminates | Always (persisted to storage) | Spark History Server / DBFS |
| Cluster Logs | Cluster-level errors — driver crashes, OOM, init script failures | Always (delivered to log path) | Cluster page → View Logs / DBFS log path |
| Built-in Debugger | Python logic errors — step through, inspect variables | During interactive notebook execution | breakpoint() in a notebook cell |
Common Exam Traps
These are the mistakes the exam is designed to catch:
- Using the Spark UI for a terminated cluster — it is unavailable once the cluster is gone. Use event logs instead.
- Confusing driver logs and executor logs — driver logs have your Python exceptions; executor logs have OOM and worker-side errors.
- Using the debugger for Spark performance issues — the debugger is for Python logic errors, not query performance.
- Thinking event logs are only for errors — they record all Spark execution, successful or not, and are used for post-mortem analysis of any completed job.
- Forgetting that the built-in debugger only works on driver-side Python — code running inside distributed UDFs cannot be stepped through.
- Ignoring init script logs when a cluster won't start — library install failures always surface here, not in the Spark UI.