Set up
Here is an overview of the steps required to use Mage with a Databricks cluster:

- Set up Databricks cluster
- Build docker image
- Start Mage
- Configure project’s metadata settings
- Sample pipeline with PySpark code
- Verify everything worked
1. Set up Databricks cluster
Set up a Databricks workspace and cluster by following the Databricks documentation.

2. Build docker image
Use the Dockerfile template from mage-ai/integrations/databricks/Dockerfile. Update the databricks-connect version to match the version used in your Databricks cluster. Then build the Docker image with the command docker build -t mage_databricks .
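For reference, here is that build command as you would run it from the directory containing the Dockerfile (the tag mage_databricks is just the image name used throughout this guide; any tag works):

```bash
docker build -t mage_databricks .
```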
3. Start Mage
Type this command in your terminal to start Mage using docker (Note: demo_project is the name of your project; you can change it to anything you want):
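A minimal sketch of the run command, assuming the image tag mage_databricks from the previous step, Mage's default port 6789, and the current directory mounted as the project workspace; the exact entrypoint may differ in your setup:

```bash
docker run -it \
  -p 6789:6789 \
  -v $(pwd):/home/src \
  mage_databricks \
  mage start demo_project
```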
Then, configure databricks-connect following the Databricks guide.
4. Configure project’s metadata settings
Open your project’s metadata.yaml file located at the root of your project’s directory: demo_project/metadata.yaml (presuming your project is named demo_project).
Change the value for the key variables_dir to an S3 bucket that you want to use to store intermediate block output. For example, if your S3 bucket is named my-awesome-bucket, then the value for the key variables_dir should be s3://my-awesome-bucket.
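For example, the relevant line in demo_project/metadata.yaml would look like this (other keys in the file stay as they are):

```yaml
variables_dir: s3://my-awesome-bucket
```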
5. Sample pipeline with PySpark code
- Create a new pipeline by going to File in the top left corner of the page and then clicking New pipeline.
- Open the pipeline’s metadata.yaml file and update the type to be databricks.
- Click + Data loader, then Generic (no template) to add a new data loader block.
- Paste the first sample code block shown after this list into the new data loader block.
- Click + Data exporter, then Generic (no template) to add a new data exporter block.
- Paste the second sample code block shown after this list into the new data exporter block (change s3://bucket-name to the bucket you created in a previous step).
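A minimal sketch for the data loader block, assuming the SparkSession is passed to the block via kwargs['spark'] as in Mage’s Spark pipelines; it simply builds a small DataFrame for the exporter to write out:

```python
from pyspark.sql import Row

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data(**kwargs):
    """
    Build a small sample DataFrame on the Databricks cluster.
    """
    spark = kwargs.get('spark')
    rows = [Row(id=i, value=i * 10) for i in range(10)]
    return spark.createDataFrame(rows)
```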
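And a sketch for the data exporter block, writing the upstream DataFrame to S3 as CSV (s3://bucket-name and the output/ prefix are placeholders; point them at your own bucket):

```python
if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data(df, **kwargs):
    """
    Write the DataFrame from the upstream block to S3 as CSV files.
    """
    (
        df.write
        .option('header', 'true')
        .mode('overwrite')
        .csv('s3://bucket-name/output/')
    )
```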
6. Verify everything worked
Let’s load the data from S3 that we just created using Spark:

- Click + Data loader, then Generic (no template) to add a new data loader block.
- Paste the sample code block shown after this list into the new data loader block (change s3://bucket-name to the bucket you created in a previous step).
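A sketch for this verification block, reading the CSV files back from S3 with Spark; it assumes the same kwargs['spark'] session and the placeholder path used in the exporter sketch above:

```python
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data(**kwargs):
    """
    Read the CSV output written by the exporter block back from S3.
    """
    spark = kwargs.get('spark')
    return (
        spark.read
        .option('header', 'true')
        .csv('s3://bucket-name/output/')
    )
```

If the block runs without errors and shows the rows created in the previous step, the Databricks cluster, S3 bucket, and Mage project are wired up correctly.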