The Core Idea
Databricks Connect is a client library that lets you run Spark code from your local machine — your laptop, VS Code, PyCharm, wherever you work — while the actual computation happens on a remote Databricks cluster.
Think of it as a bridge: your code stays local, your data and compute stay in Databricks. You get the full developer experience of a local IDE with the full power of a cloud cluster.
The Problem It Solves
Before Databricks Connect, your only option for writing and testing Spark code was to work directly inside a Databricks notebook. That meant:
- No local IDE — you had to use the notebook interface for everything
- No local debugging tools — breakpoints, watch windows, and step-through didn't work
- No version control integration in your editor of choice
- Every code change required uploading to Databricks to test
Databricks Connect solves this by letting you keep your existing development workflow — local IDE, local Git, local debugger — while still running Spark on a real cluster.
In enterprise data engineering teams, developers often have strict tooling requirements — VS Code with specific extensions, PyCharm Professional, or internal code review tools. Databricks Connect means teams don't have to abandon those tools to write Spark code. You write in your editor, it runs in Databricks.
How It Works
The flow is straightforward. Your local Python environment uses the Databricks Connect client to submit Spark operations to a remote cluster. Results come back to your local session.
Write code → client library → Spark execution on the cluster → results returned locally
What runs locally vs. remotely
This distinction is important — the exam may test it directly:
| Runs Locally | Runs Remotely (on cluster) |
|---|---|
| Your Python/Scala code | All Spark execution (DataFrame operations, transformations) |
| IDE, debugger, linter | Spark driver and worker nodes |
| DataFrame definitions and logic | Actual data processing and shuffles |
| Unit test runner | Reading and writing to cloud storage / Delta tables |
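The table is easy to internalize in code: building a DataFrame is a local, lazy operation, and nothing touches the cluster until an action runs. A minimal sketch (the table name and column names are placeholders):

```python
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# Local: this only assembles a logical query plan in your session.
# No data is read and no cluster work happens yet.
df = (
    spark.read.table("catalog.schema.my_table")  # placeholder table
    .filter("amount > 100")                      # placeholder column
    .groupBy("region")
    .count()
)

# Remote: the action ships the plan to the cluster, which executes
# the read, filter, and aggregation, then returns results locally.
df.show()
```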
Setting It Up
Databricks Connect (v2, for DBR 13+) is installed as a Python package. The setup replaces the standard pyspark package in your local environment:
```bash
# Remove any local pyspark first -- the two packages conflict
pip uninstall -y pyspark
pip install databricks-connect==13.3.*

# Configure with your workspace
databricks configure
```
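Under the hood, `databricks configure` prompts for your workspace URL and a personal access token and stores them in `~/.databrickscfg`. A sketch of the resulting file (all values are placeholders; adding `cluster_id` tells Databricks Connect which cluster to target):

```ini
[DEFAULT]
host       = https://my-workspace.cloud.databricks.com
token      = dapi...
cluster_id = 0123-456789-abcdefgh
```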
Once configured, you connect to a cluster using a DatabricksSession — the drop-in replacement for the standard SparkSession:
```python
from databricks.connect import DatabricksSession

# Replaces SparkSession.builder.getOrCreate()
spark = DatabricksSession.builder.getOrCreate()

# Now use spark exactly as you would in a notebook
df = spark.read.table("catalog.schema.my_table")
df.show()
```
`DatabricksSession` is the Databricks Connect equivalent of `SparkSession`: everything after `getOrCreate()` is plain PySpark, so existing code moves over unchanged. That compatibility is deliberate.
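By default the session resolves its connection from the `DEFAULT` profile or from `DATABRICKS_*` environment variables. You can also target a cluster explicitly with the v2 builder's `remote()` method; a minimal sketch, with placeholder values throughout:

```python
from databricks.connect import DatabricksSession

# All three values are placeholders for your own workspace details.
spark = DatabricksSession.builder.remote(
    host="https://my-workspace.cloud.databricks.com",
    token="dapi...",                   # personal access token
    cluster_id="0123-456789-abcdefgh"  # must be a running cluster
).getOrCreate()

print(spark.range(5).count())  # executes on the remote cluster
```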
Databricks Connect v1 vs v2
There are two major versions and the exam may reference both. Know the difference:
| | v1 (Legacy) | v2 (Current) |
|---|---|---|
| DBR Support | DBR 5.x – 12.x | DBR 13.0+ (including Serverless) |
| Install | `databricks-connect` (old) | `databricks-connect==13.x.*` |
| Session object | `SparkSession` | `DatabricksSession` |
| Unity Catalog | Limited | Full support |
| Serverless compute | No | Yes |
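In code, the difference comes down to a single line. A side-by-side sketch (v1 shown only for contrast; new projects should use v2):

```python
# v1 (legacy, DBR 12.x and below): the ordinary SparkSession,
# with the old databricks-connect package standing in for pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# v2 (DBR 13.0+): the dedicated session object from the new client
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
```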
When to Use Databricks Connect
Local Testing
Write and test Spark transformations in your IDE with real cluster compute before deploying to production.
Debugging
Use your IDE's debugger — breakpoints, variable inspection, step-through — on real Spark code.
CI/CD Pipelines
Run Spark-based unit tests in a CI pipeline without needing a full Databricks notebook environment (see the pytest sketch below).
IDE Integration
Work in VS Code, PyCharm, or any Python-capable editor with full autocomplete and code navigation.
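To make the CI/CD use case concrete, here is a minimal pytest sketch. It assumes the runner already has connection details available (environment variables or a configured profile), and `add_processing_date` is a hypothetical transformation invented for the example:

```python
import pytest
from databricks.connect import DatabricksSession
from pyspark.sql import functions as F


def add_processing_date(df):
    """Hypothetical transformation under test."""
    return df.withColumn("processing_date", F.current_date())


@pytest.fixture(scope="session")
def spark():
    # Connection details come from env vars or ~/.databrickscfg
    return DatabricksSession.builder.getOrCreate()


def test_add_processing_date(spark):
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    result = add_processing_date(df)

    # The column is added and the row count is preserved
    assert "processing_date" in result.columns
    assert result.count() == 2
```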
What Databricks Connect Cannot Do
These limitations are exam-worthy — especially the first one:
- Cannot run Spark Structured Streaming locally via Databricks Connect
- Cannot use `SparkContext` directly (use `DatabricksSession` instead)
- Cannot run `%magic` commands (these are notebook-only features)
- Cannot use `display()` or `dbutils.fs` in the same way as notebooks
- Cannot connect to a terminated cluster — the cluster must be running
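Most of these gaps have simple local stand-ins. A short sketch of common substitutions (table name is a placeholder, as above):

```python
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
df = spark.read.table("catalog.schema.my_table")

# Notebook display(df)  ->  df.show() or a small pandas sample
df.show(10)
sample = df.limit(100).toPandas()

# Notebook %sql cell    ->  spark.sql(...)
spark.sql("SELECT COUNT(*) AS n FROM catalog.schema.my_table").show()
```

For `dbutils`-style file access from a local script, the Databricks SDK's `WorkspaceClient` exposes a `dbutils` property, but that is a separate package rather than part of Databricks Connect itself.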
Databricks Connect vs. Working in Notebooks
| Scenario | Best Approach |
|---|---|
| Developing reusable PySpark modules in VS Code | Databricks Connect |
| Interactive data exploration with display() | Notebook |
| Running unit tests in a CI/CD pipeline | Databricks Connect |
| Structured Streaming development | Notebook or Job |
| Using %sql, %md, or other magic commands | Notebook |
| Debugging with breakpoints and step-through | Databricks Connect |
| Collaborating with non-technical stakeholders | Notebook |
| Building production-grade ETL modules with version control | Databricks Connect |
Common Exam Traps
These are the mistakes the exam is designed to catch:
- Thinking Spark runs locally — it always runs on the remote cluster, even with Databricks Connect.
- Using `SparkSession` instead of `DatabricksSession` — the v2 client requires the new session object.
- Assuming Databricks Connect supports Structured Streaming — it does not.
- Confusing Databricks Connect with the Databricks CLI — Connect is for running Spark code, the CLI is for managing workspace resources.
- Forgetting the cluster must be running — Databricks Connect cannot start a terminated cluster automatically.
- Thinking magic commands like `%sql` or `dbutils.fs.ls()` work in a local script — they are notebook-only.