Use a registered community connector

Important

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.

This page shows how to use a registered community connector to ingest data from a supported source into Azure Databricks. To create a custom connector for a source that isn't supported yet, see Create a custom connector.

Requirements

  • An Azure Databricks workspace with Unity Catalog enabled
  • A connection for the source you want to ingest, or permissions to create a connection
  • Write access to a catalog and schema for the ingested tables

Create an ingestion pipeline

To use a registered community connector:

  1. In the sidebar of your Azure Databricks workspace, click + New > Add or upload data, then select the source under Community connectors.

  2. Click + Create connection or select an existing connection, then click Next.

  3. For Pipeline name, enter a name for the pipeline.

  4. For Event log location, enter a catalog name and a schema name. Azure Databricks stores the pipeline event log here. Ingested tables are also written here by default.

  5. For Root path, enter your workspace path (for example, /Workspace/Users/<your-email>/connectors). Azure Databricks clones and stores the connector source code here.

  6. Click Create pipeline.

  7. In the pipeline editor, open ingest.py and update the objects field to include the tables you want to ingest. For example:

    from databricks.labs.community_connector.pipeline import ingest
    
    pipeline_spec = {
        "connection_name": "my_stripe_connection",  # Required: UC connection name
        "objects": [
            {"table": {"source_table": "charges"}},  # Written as "charges" by default
            {"table": {"source_table": "customers",
                       "destination_table": "stripe_customers"}},  # Custom destination name
        ],
    }
    
    ingest(spark, pipeline_spec)
    
  8. Run the pipeline manually or schedule it.

Pipeline configuration options

You can configure the following options in ingest.py:

  • connection_name: Required. The name of the connection that stores authentication credentials for the source.
  • objects: Required. A list of tables to ingest. Each entry has the format {"table": {"source_table": "..."}}. You can also specify an optional destination_table inside the table object.
  • destination_catalog: The catalog where ingested tables are written. Defaults to the catalog set during pipeline creation.
  • destination_schema: The schema where ingested tables are written. Defaults to the schema set during pipeline creation.
  • scd_type: The slowly changing dimension strategy: SCD_TYPE_1, SCD_TYPE_2, or APPEND_ONLY. Defaults to SCD_TYPE_1.
  • primary_keys: Override the default primary keys for a table. Provide a list of column names.
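The options above can be combined in a single pipeline_spec. The sketch below is illustrative: the connection, catalog, schema, and table names are placeholders, and because this page doesn't specify whether scd_type and primary_keys go at the top level or inside each table object, this example places scd_type at the top level and primary_keys per table, following the per-table wording of the primary_keys option.

```python
# A fuller pipeline_spec sketch exercising the optional fields.
# All names here (connection, catalog, schema, tables, columns) are
# illustrative placeholders, not values your workspace will have.
pipeline_spec = {
    "connection_name": "my_stripe_connection",  # Required: UC connection name
    "destination_catalog": "finance",           # Optional: overrides the pipeline default
    "destination_schema": "stripe_raw",         # Optional: overrides the pipeline default
    "scd_type": "SCD_TYPE_2",                   # Optional: keep full change history
    "objects": [
        # primary_keys overrides the connector's default keys for this table
        {"table": {"source_table": "invoices",
                   "destination_table": "stripe_invoices",
                   "primary_keys": ["id"]}},
        {"table": {"source_table": "charges"}},
    ],
}

# In ingest.py you would then call, as in the basic example above:
# ingest(spark, pipeline_spec)
```

As in the step-by-step example, the spec is just a Python dictionary, so you can build or validate it programmatically before passing it to ingest().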