Create a Spark cluster
You can create one or more clusters in your Azure Databricks workspace by using the Azure Databricks Workspace UI.

When creating the cluster, you can specify configuration settings, including:
- A name for the cluster.
- An access mode, which controls how users interact with the cluster:
    - Standard: Multiple users can share the cluster concurrently. Isolation between user code is enforced automatically. Suitable for collaborative data engineering and shared analytics.
    - Dedicated: The cluster is assigned exclusively to a single user or group. Required for workloads that use RDD APIs, GPU acceleration, or R language support.
- The cluster architecture, which determines how computation is distributed:
    - Multi-node: One driver node plus one or more worker nodes. Enables distributed processing and horizontal scaling for large datasets.
    - Single-node: Driver node only, with no worker nodes. Suitable for small datasets, lightweight exploration, or machine learning frameworks like scikit-learn that don't distribute across nodes.
- The version of the Databricks Runtime to be used in the cluster, which dictates the version of Spark and of individual components such as Python and Scala that get installed.
- The type of virtual machine (VM) used for the worker nodes in the cluster.
- The minimum and maximum number of worker nodes in the cluster.
- The type of VM used for the driver node in the cluster.
- Whether the cluster supports autoscaling to dynamically resize the cluster.
- How long the cluster can remain idle before being shut down automatically.
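The settings above map onto the cluster definition that the workspace UI submits to the Clusters API (`POST /api/2.1/clusters/create`). The following is a minimal sketch in Python; the cluster name, node type, and runtime version are illustrative placeholders, not recommendations, so check what is available in your workspace and region before using them.

```python
import json

# Sketch of a cluster definition for the Azure Databricks Clusters API.
# All field values below are illustrative; list the valid node types and
# Databricks Runtime versions for your workspace before choosing them.
cluster_spec = {
    "cluster_name": "etl-shared",              # a name for the cluster
    "spark_version": "15.4.x-scala2.12",       # Databricks Runtime version
    "node_type_id": "Standard_DS3_v2",         # VM type for worker nodes
    "driver_node_type_id": "Standard_DS3_v2",  # VM type for the driver node
    "autoscale": {                             # enables autoscaling between
        "min_workers": 2,                      # the minimum and
        "max_workers": 8,                      # maximum worker counts
    },
    "autotermination_minutes": 60,             # shut down after 60 idle minutes
    "data_security_mode": "USER_ISOLATION",    # standard (shared) access mode
}

print(json.dumps(cluster_spec, indent=2))
```

Omitting `autoscale` and setting a fixed `num_workers` instead gives a cluster of constant size; setting `num_workers` to 0 with the appropriate Spark configuration yields a single-node cluster.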
How Azure manages cluster resources
When you create an Azure Databricks workspace, a Databricks appliance is deployed as an Azure resource in your subscription. When you create a cluster in the workspace, you specify the types and sizes of the virtual machines (VMs) to use for both the driver and worker nodes, and some other configuration options, but Azure Databricks manages all other aspects of the cluster.
The Databricks appliance is deployed into Azure as a managed resource group within your subscription. This resource group contains the driver and worker VMs for your clusters, along with other required resources, including a virtual network, a security group, and a storage account. All metadata for your cluster, such as scheduled jobs, is stored in an Azure Database with geo-replication for fault tolerance.
Azure Databricks is split into two main planes. The control plane consists of backend services (for example, the web UI) managed by Microsoft. The compute plane is where your data workloads run, and it comes in two variants: classic compute, which uses your own Azure subscription and virtual network (offering isolation inside your subscription), and serverless compute, which runs within Databricks' managed environment but in the same Azure region as your workspace, with network and security controls that isolate customers from one another. Every workspace also has a storage account in your subscription that holds system data (notebooks, logs, job metadata), the distributed file system (DBFS), and catalog assets (if you have Unity Catalog enabled), with additional networking, firewall, and access controls to ensure security and proper isolation.

Tip
For interactive notebook development and most ETL workloads, consider serverless compute instead of classic compute clusters. Serverless compute starts in 2–6 seconds, is fully managed by Azure Databricks, and scales automatically, with no cluster configuration required. For a deeper comparison of compute options, see Select and configure compute in Azure Databricks in Microsoft Learn.
Note
You also have the option of attaching your cluster to a pool of idle nodes to reduce cluster startup time. For more information, see Pools in the Azure Databricks documentation.
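As a sketch of what pool attachment looks like in a cluster definition, the explicit worker node type is replaced by the pool's ID (the ID below is a placeholder; copy the real one from the pool's configuration page or the Instance Pools API):

```python
# Sketch: a cluster definition that draws its nodes from an instance pool.
# The instance_pool_id value is a placeholder, not a real pool ID.
pooled_cluster_spec = {
    "cluster_name": "etl-pooled",
    "spark_version": "15.4.x-scala2.12",       # illustrative runtime version
    "instance_pool_id": "0123-456789-pool00",  # takes the place of node_type_id
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 60,
}
```

Because the pool keeps idle VMs warm, a cluster created from this definition can acquire its nodes without waiting for new VMs to be provisioned.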