The Core Idea

When something goes wrong in a Databricks pipeline — a job is slow, a task fails, a cluster crashes — there are four distinct tools for diagnosing the problem. The exam tests whether you know which tool to reach for based on the symptom, and what each tool actually shows you.

Think of them as four layers of observability: the Spark UI for job and stage performance, event logs for historical Spark execution, cluster logs for infrastructure and driver errors, and the built-in debugger for interactive code-level inspection.

Exam Mindset
Every debugging question gives you a symptom and asks which tool to use. Map symptom → tool before reading the options. Slow query = Spark UI. Cluster crashed = cluster logs. Past job failure = event logs. Code logic error in notebook = built-in debugger.

Quick Decision Map

Use this as your first pass on any exam debugging scenario:

| Symptom | Tool |
| --- | --- |
| A job is running slow — I need to find the bottleneck | Spark UI |
| A stage has massive data skew across tasks | Spark UI · Stages tab |
| I need to see what happened in a job that already completed | Event Logs |
| The cluster terminated unexpectedly — I need to know why | Cluster Logs · Driver logs |
| A library failed to install at cluster startup | Cluster Logs · Init script logs |
| A worker node is throwing OOM errors | Cluster Logs · Executor logs |
| I want to step through notebook code and inspect variables | Built-in Debugger |
| A UDF is returning wrong values on specific rows | Built-in Debugger |

1. The Spark UI

The Spark UI is the primary tool for understanding job performance. It gives you a live, structured view of every job, stage, and task running on the cluster — showing exactly where time is being spent and where problems are occurring.

You access it from the Clusters page → Spark UI button, or directly from a running notebook via the cluster link at the top.

The Key Tabs and What They Show

📋 Jobs Tab (top-level timing): Lists every Spark action triggered. Shows duration, status, and number of stages. Start here to find which job is slow.

🔢 Stages Tab (skew · shuffle · spill): Breaks each job into stages. Shows shuffle read/write, spill to disk, and task distribution. Key for finding skew and shuffles.

🗂️ Task Details (task-level skew): Individual task timings, shown on each stage's detail page rather than a separate tab. A few tasks taking much longer than others = data skew.

🗺️ DAG Visualisation (DAG · wide transforms): Shows the lineage of transformations as a directed acyclic graph, linked from the Jobs and Stages pages. Helps spot unnecessary wide transformations and shuffle points.

💾 Storage Tab (cache · persist): Shows cached RDDs and DataFrames — size in memory, on-disk spill, and fraction cached. Check here if caching isn't working.

🌐 Environment Tab (spark config): Shows all Spark configuration properties, system properties, and classpath. Useful for confirming config settings took effect; see the snippet after this list.
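The Environment tab is where you verify settings in the UI, but you can also read a property back directly in a cell. A minimal sketch using two standard Spark SQL properties:

```python
# Confirm that configuration actually took effect on the running cluster.
# `spark` is the SparkSession provided in every Databricks notebook.
print(spark.conf.get("spark.sql.shuffle.partitions"))  # shuffle parallelism
print(spark.conf.get("spark.sql.adaptive.enabled"))    # adaptive query execution
```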

The Three Performance Problems to Spot in the Spark UI

| Problem | Where to look | What you'll see |
| --- | --- | --- |
| Data Skew | Stages tab → Tasks | Most tasks finish in 1s, but 1–2 tasks take 5+ minutes. Task duration bar chart is heavily uneven. |
| Shuffle Spill | Stages tab → Shuffle Write / Spill columns | Large values in the "Shuffle Spill (Disk)" column. Means executors ran out of memory and spilled to disk. |
| Unnecessary Shuffles | DAG Visualisation | Wide transformations (joins, groupBy) creating extra stages. Repartitioning or broadcast joins may help; see the sketch below. |
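Both skew and unnecessary shuffles can be confirmed or attacked from code. A minimal sketch, assuming a hypothetical orders DataFrame skewed on customer_id and a small customers dimension table:

```python
from pyspark.sql import functions as F

# Confirm skew: count rows per join key. A handful of keys holding most
# of the rows explains a few very slow tasks in the Stages tab.
(orders.groupBy("customer_id")
       .count()
       .orderBy(F.desc("count"))
       .show(10))

# Remove the shuffle for a small dimension table: broadcast() ships
# `customers` to every executor, turning a shuffle join into a
# map-side broadcast join with no extra stage.
joined = orders.join(F.broadcast(customers), "customer_id")
```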
Exam trap — Spark UI availability
The Spark UI is only available while the cluster is running. Once the cluster is terminated, you must use event logs to inspect past execution. This is a very common exam distinction.

2. Event Logs

Event logs are the persistent, historical record of Spark execution. When a cluster is terminated and the Spark UI is gone, event logs are how you reconstruct what happened during a job run.

Databricks automatically writes event logs to a configured storage location (DBFS or cloud storage). They can be replayed in the Spark History Server — which looks identical to the live Spark UI but shows completed jobs.
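If cluster log delivery is configured (covered in the next section), the delivered files can be listed straight from a notebook. A minimal sketch; the delivery path and cluster ID are hypothetical, and the exact directory layout depends on your log delivery settings:

```python
# List delivered log files for a (hypothetical) cluster ID.
# dbutils is available automatically in Databricks notebooks.
for entry in dbutils.fs.ls("dbfs:/cluster-logs/0123-456789-abcde1/"):
    print(entry.path)  # typically driver/, executor/, eventlog/, init_scripts/
```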

Event logs vs. Spark UI

| | Spark UI | Event Logs / History Server |
| --- | --- | --- |
| Available when | Cluster is running | Always — even after cluster termination |
| Data shown | Live and recent jobs | Historical completed jobs |
| Access via | Cluster page → Spark UI | Spark History Server or event log files |
| Storage | In-memory on driver | Persisted to DBFS or cloud storage |
| Best for | Live debugging | Post-mortem analysis of completed/failed jobs |
Key exam point
If an exam question describes a job that already completed or failed and asks how to investigate — the answer is event logs, not the Spark UI. The Spark UI would no longer be available if the cluster was terminated after the run.

3. Cluster Logs

Cluster logs capture everything happening at the infrastructure level — driver output, executor errors, library installations, and init script execution. When the issue is at the cluster level rather than the Spark job level, this is where you look.

Cluster logs are configured to deliver to a DBFS path or cloud storage location, set in the cluster's Advanced Options → Logging settings.
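The same setting can be supplied when a cluster is created through the Clusters API, via the cluster_log_conf field of the cluster spec. A minimal sketch of the relevant fragment (the name and destination path are hypothetical):

```python
# Fragment of a Clusters API cluster spec with log delivery enabled.
# Logs are delivered periodically to <destination>/<cluster-id>/.
cluster_spec = {
    "cluster_name": "nightly-etl",  # hypothetical
    "cluster_log_conf": {
        "dbfs": {"destination": "dbfs:/cluster-logs"}
    },
    # ...plus the usual required fields (spark_version, node_type_id, ...)
}
```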

Types of Cluster Logs

🚗 Driver Logs (stdout · stderr): Standard output and error from the driver node. Contains Python exceptions, print statements, and stack traces from your notebook or job code.

⚙️ Executor Logs (OOM · worker errors): Logs from individual worker nodes. OOM errors, task failures, and worker-side exceptions appear here — not in the driver logs.

🚀 Init Script Logs (install failures): Output from cluster init scripts. If a library install or environment setup fails at cluster start, the error will be here.

📦 Log4j Logs (spark internals): Detailed internal Spark framework logs. Used for deep debugging of Spark internals — rarely needed for application-level issues.

Key exam point — driver vs. executor logs
Driver logs contain your application code errors — Python exceptions, stack traces, print output. Executor logs contain worker-side errors — OOM on tasks, shuffle failures, task-level exceptions. Knowing which is which is a direct exam question.
🏗️ Real-world note

In practice, the most common cluster log use case is init script failures — a cluster refuses to start because a custom library install failed. The error is silent in the UI (cluster just shows as "failed to start") but the init script log will have the exact pip error or dependency conflict. Always check init script logs first when a cluster won't start.

4. The Built-in Debugger

The Databricks notebook has a built-in Python debugger that lets you pause execution mid-cell, step through code line by line, and inspect variable values in real time — without leaving the notebook interface.

It is activated using the standard Python breakpoint() function (Python 3.7+) or the older pdb.set_trace(). When execution hits a breakpoint, an interactive debug console appears at the bottom of the cell.
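A minimal sketch of what this looks like in a cell; the function and values are made up for illustration:

```python
# Hypothetical helper with a suspected logic error.
def normalise_amount(amount, rate):
    converted = amount * rate
    breakpoint()               # execution pauses here; a pdb console opens
    return round(converted, 2)

normalise_amount(19.99, 0.79)
# At the (Pdb) prompt:
#   p converted   -> inspect the intermediate value
#   n             -> step to the next line
#   c             -> continue to the end of the cell
```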

What you can do in the debugger

| Command | What it does |
| --- | --- |
| n | Next line — step to the next line without entering function calls |
| s | Step into — enter a function call and debug inside it |
| c | Continue — resume execution until the next breakpoint or end |
| p variable | Print — inspect the current value of any variable |
| l | List — show the surrounding source code lines |
| q | Quit — exit the debugger and stop execution |
When to use the built-in debugger
The debugger is best for logic errors in Python code — a transformation returning unexpected values, a UDF behaving incorrectly on certain rows, or a function that isn't doing what you think it is. It is not useful for Spark performance problems — use the Spark UI for those.
Exam trap — debugger scope
The built-in debugger works on Python driver-side code only. You cannot use it to step through code that runs on executor nodes (e.g. inside a UDF that gets distributed). For executor-side issues, use executor logs or add explicit logging.
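One workaround worth knowing: anything a UDF writes to stdout on a worker lands in that worker's executor logs, not in the notebook. A minimal sketch with a hypothetical UDF:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(StringType())
def classify(value):
    # breakpoint() would NOT pause here: this body runs on executors.
    # print output lands in the executor's stdout log instead.
    if value is None:
        print("classify() received a NULL input")  # visible in executor logs
        return "unknown"
    return "high" if value > 100 else "low"

df = spark.range(200).withColumn("label", classify(F.col("id")))
df.show(5)
```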

Full Tool Comparison

| Tool | Best for | Available when | Access |
| --- | --- | --- | --- |
| Spark UI | Live performance — skew, spill, slow stages, DAG inspection | Cluster running only | Clusters page → Spark UI |
| Event Logs | Historical analysis of completed/failed jobs after cluster terminates | Always (persisted to storage) | Spark History Server / DBFS |
| Cluster Logs | Cluster-level errors — driver crashes, OOM, init script failures | Always (delivered to log path) | Cluster page → View Logs / DBFS log path |
| Built-in Debugger | Python logic errors — step through, inspect variables | During interactive notebook execution | breakpoint() in a notebook cell |

Exam-Style Practice Questions

Answers follow each question in parentheses.

Q1 A Spark job completed successfully but took much longer than expected. The cluster is still running. Which tool should the data engineer use first to identify the bottleneck? (The Spark UI: the cluster is still up, so live job and stage timings are available.)
Q2 A scheduled overnight job failed. The cluster has since been terminated. How can the data engineer investigate the Spark execution? (Event logs, replayed in the Spark History Server; the Spark UI disappeared with the cluster.)
Q3 In the Spark UI Stages tab, most tasks in a stage complete in under 2 seconds, but three tasks are taking over 10 minutes. What does this indicate? (Data skew: a few partitions hold far more data than the rest.)
Q4 A cluster fails to start after a new init script was added. Where should the engineer look to find the error? (The init script logs within the cluster logs.)
Q5 A data engineer wants to pause execution inside a notebook cell and inspect the value of a variable mid-run. Which tool do they use? (The built-in debugger, activated with breakpoint().)
Q6 Worker nodes are throwing OutOfMemoryError during a large shuffle operation. Which log type contains these errors? (Executor logs: worker-side errors do not appear in driver logs.)

Common Exam Traps

These are the mistakes the exam is designed to catch:

- Reaching for the Spark UI to investigate a job on a terminated cluster. The UI is gone; use event logs and the History Server.
- Looking in driver logs for worker-side failures such as OOM errors. Those live in the executor logs.
- Expecting breakpoint() to pause inside a distributed UDF. The built-in debugger is driver-side only.
- Treating a cluster that won't start as a Spark problem. Check the init script logs first.

⚡ Quick Memory Trick
Slow job running now → Spark UI. Job already finished → Event Logs. Cluster won't start → Cluster Logs. Wrong logic in my code → Debugger. Four symptoms, four tools, one rule each.