1. Introduction

Databricks is a unified data analytics platform that provides a fully managed cloud-based environment for data engineering, data science, and machine learning workloads.

This article outlines the main steps to follow for setting up a Databricks architecture on AWS.

2. Steps for Setting Up a Databricks Architecture

Here are the main steps to follow for setting up a Databricks architecture on AWS as part of the migration to Databricks:

  1. Evaluation of the Existing Environment: Start by analyzing the current infrastructure, data sources, transformations, workloads, and dependencies. This will help you understand the specific requirements and plan the migration.

  2. Design of the Databricks Architecture: Define the Databricks architecture, including the number of clusters, node configuration (instance type, number of cores, memory), storage (S3, ADLS, etc.), network connectivity, and security (IAM, virtual networks, etc.).

  3. Configuration of the AWS Environment: Set up the necessary AWS environment, such as accounts, IAM roles, virtual networks, security groups, etc.

  4. Deployment of Databricks: Deploy the Databricks environment on AWS, creating the workspace, clusters, libraries, etc.

  5. Data Migration: Migrate the source data from its current location to AWS storage (typically Amazon S3) using tools such as AWS DataSync, DistCp, or custom scripts (a minimal example of such a script is sketched after this list).

  6. Migration of Transformations: Convert the existing transformations into Databricks notebooks or Databricks pipeline tasks. You can use automated migration tools or perform a manual conversion.

  7. Testing and Validation: Test the migrated transformations and validate the results to ensure that the data and business logic are correctly transferred.

  8. Training and Documentation: Train the team on the use of Databricks and document the processes, pipelines, and best practices.

  9. Production Deployment: Once testing is successful, deploy the Databricks clusters, pipelines, and workloads into production.

  10. Monitoring and Maintenance: Establish monitoring processes, log management, updates, and maintenance of the Databricks environment.
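
For step 5, a custom migration script can be as simple as copying the exported source files into an S3 landing bucket. The following is a minimal sketch using boto3; the bucket name, prefix, and local export directory are hypothetical placeholders.

import boto3
from pathlib import Path

# Hypothetical locations: adapt to the client's actual export path and landing bucket.
SOURCE_DIR = Path("/data/exports")
TARGET_BUCKET = "company-databricks-landing"
TARGET_PREFIX = "raw/"

s3 = boto3.client("s3")

for path in SOURCE_DIR.rglob("*"):
    if path.is_file():
        key = TARGET_PREFIX + path.relative_to(SOURCE_DIR).as_posix()
        s3.upload_file(str(path), TARGET_BUCKET, key)
        print(f"Uploaded {path} to s3://{TARGET_BUCKET}/{key}")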

During the first days of the engagement, you should focus on the following points:

  • Understand the client’s business and technical requirements

  • Evaluate the existing environment

  • Design a preliminary Databricks architecture

  • Set up AWS accounts and required services

  • Deploy a basic Databricks environment for testing

3. Setup Databricks on AWS

3.1. Sign up for Databricks (AWS cloud)

Create a Databricks account and select AWS as the cloud provider; the steps below follow the Databricks trial workflow on AWS.

3.2. Create a Databricks workspace

In Databricks, click the Create a workspace button and choose the Quickstart (Recommended) option.

By clicking Start Quickstart, you will be redirected to AWS (the CloudFormation console).

3.3. Set up a stack on AWS

Next, we have to create a stack on AWS using the template provided by Databricks, databricks-trial.template.yaml.

Figure 2. AWS CloudFormation
Figure 3. Create AWS Stack

3.4. Understand the stack template

The template sets up the following resources (IAM roles, S3 buckets, and Lambda functions) to deploy a Databricks workspace on AWS:

  • IAM Role (workspaceIamRole): This role is assumed by Databricks to manage resources within your AWS account, such as creating VPCs, subnets, security groups, and EC2 instances.

  • S3 Bucket (workspaceS3Bucket): This bucket is used as the root storage location for the Databricks workspace. It stores notebook files, data, and other artifacts.

  • IAM Role (catalogIamRole): This role is used by Databricks to access the S3 bucket for the Unity Catalog metastore.

  • Lambda Function (createCredentials): This function interacts with the Databricks API to create credentials for the workspace, associating it with the workspaceIamRole.

  • Lambda Function (createStorageConfiguration): This function interacts with the Databricks API to create a storage configuration for the workspace, associating it with the workspaceS3Bucket and the catalogIamRole.

  • Lambda Function (createWorkspace): This function interacts with the Databricks API to create the Databricks workspace itself, using the credentials and storage configuration created by the previous Lambda functions.

  • Lambda Function (databricksApiFunction): This is a helper function that the other Lambda functions use to interact with the Databricks API.

  • IAM Role (functionRole): This role is assumed by the Lambda functions to perform their respective tasks.

  • S3 Bucket (LambdaZipsBucket): This bucket is used to store the Lambda function code (lambda.zip) during the deployment process.

The relationships between these resources can be summarized as follows (the API calls performed by the Lambda functions are sketched after this list):

  • The workspaceIamRole and catalogIamRole are associated with the Databricks workspace during its creation.

  • The workspaceS3Bucket is used as the root storage location for the Databricks workspace.

  • The createCredentials, createStorageConfiguration, and createWorkspace Lambda functions interact with the Databricks API to set up the required components (credentials, storage configuration, and workspace) for the Databricks deployment.

  • The databricksApiFunction Lambda function is used by the other Lambda functions to interact with the Databricks API.

  • The functionRole is assumed by the Lambda functions to perform their respective tasks.

  • The LambdaZipsBucket is used to store the Lambda function code during the deployment process.
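
To make the role of the createCredentials, createStorageConfiguration, and createWorkspace functions concrete, here is a minimal sketch of the Databricks Account API calls they roughly correspond to. The account ID, token, role ARN, bucket name, and region are placeholders, and the template’s actual payloads may differ from this sketch.

import requests

# Placeholders: supply your own Databricks account ID, account-level token,
# role ARN, root bucket, and AWS region.
ACCOUNT_ID = "<databricks-account-id>"
BASE = f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}"
HEADERS = {"Authorization": "Bearer <account-admin-token>"}

# 1. createCredentials: register the cross-account IAM role (workspaceIamRole).
creds = requests.post(f"{BASE}/credentials", headers=HEADERS, json={
    "credentials_name": "workspace-credentials",
    "aws_credentials": {"sts_role": {"role_arn": "arn:aws:iam::<aws-account-id>:role/workspaceIamRole"}},
}).json()

# 2. createStorageConfiguration: register the root S3 bucket (workspaceS3Bucket).
storage = requests.post(f"{BASE}/storage-configurations", headers=HEADERS, json={
    "storage_configuration_name": "workspace-root-storage",
    "root_bucket_info": {"bucket_name": "<workspace-s3-bucket>"},
}).json()

# 3. createWorkspace: create the workspace using the two objects above.
workspace = requests.post(f"{BASE}/workspaces", headers=HEADERS, json={
    "workspace_name": "rd",
    "aws_region": "us-east-1",
    "credentials_id": creds["credentials_id"],
    "storage_configuration_id": storage["storage_configuration_id"],
}).json()

print(workspace.get("workspace_status"))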

3.5. Create a stack on AWS

Here is the list of parameters to enter on the stack creation page:

  • Stack name, e.g. databricks-workspace-stack-9eccc

  • Account ID (AccountId), e.g. 6b922be2-a681-4ca1-91f0-4069055b61e2

  • Session Token (SessionToken): auto-generated by Databricks

  • Workspace name (WorkspaceName), e.g. rd

  • IAM role (optional)

Click the Create stack button; multiple events regarding the stack creation will then be displayed. (An equivalent scripted approach with boto3 is sketched after the figures below.)

Figure 4. CREATE IN PROGRESS
Figure 5. CREATE COMPLETE
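
If you prefer to script the stack creation instead of using the console, the same parameters can be passed with boto3. This is a sketch that assumes the parameter keys match those shown in the console (AccountId, SessionToken, WorkspaceName); the stack name and values are placeholders.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Template downloaded from the Databricks Quickstart page.
with open("databricks-trial.template.yaml") as f:
    template_body = f.read()

stack_name = "databricks-workspace-stack-9eccc"
cfn.create_stack(
    StackName=stack_name,
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "AccountId", "ParameterValue": "<databricks-account-id>"},
        {"ParameterKey": "SessionToken", "ParameterValue": "<token-generated-by-databricks>"},
        {"ParameterKey": "WorkspaceName", "ParameterValue": "rd"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the template creates IAM roles
)

# Wait for CREATE_COMPLETE, mirroring the events shown in the console.
cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)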

3.6. AWS stacks for different Databricks workspaces

The AWS stack template can be reused to create a stack for as many workspaces as required (a scripted example follows the list of possible names below).

Possible workspace names:

  • Prod for Production

  • Pre-Prod (pp) for Pre-Production

  • Dev or Rd for Development

  • Test or QA for Quality Assurance

  • Staging for Staging

or by adding a reference to the country, team, etc.:

  • Prod-EU-DataTeam

  • Dev-US-ProjectX
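
As a sketch, the scripted approach from the previous section can be looped over several environments. Note that in the Quickstart flow the SessionToken is generated per session, so in practice the placeholder values below would have to be supplied for each workspace.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

with open("databricks-trial.template.yaml") as f:
    template_body = f.read()

for workspace_name in ["Prod-EU-DataTeam", "Dev-US-ProjectX", "QA"]:
    cfn.create_stack(
        StackName=f"databricks-workspace-stack-{workspace_name.lower()}",
        TemplateBody=template_body,
        Parameters=[
            {"ParameterKey": "AccountId", "ParameterValue": "<databricks-account-id>"},
            {"ParameterKey": "SessionToken", "ParameterValue": "<per-workspace-token>"},
            {"ParameterKey": "WorkspaceName", "ParameterValue": workspace_name},
        ],
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )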

4. Platform Administration Cheat-sheet

Table 1. Administration Cheat-Sheet (Best Practice / Impact)

Enable Unity Catalog

Data governance: Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces.

Use cluster policies

Cost: Control costs with auto-termination (for all-purpose clusters), max cluster sizes, and instance type restrictions. Observability: Set custom_tags in your cluster policy to enforce tagging. Security: Restrict cluster access mode to only allow users to create Unity Catalog-enabled clusters to enforce data permissions.

Use Service Principals to connect to third-party software

Security: A service principal is a Databricks identity type that allows third-party services to authenticate directly to Databricks, not through an individual user’s credentials. If something happens to an individual user’s credentials, the third-party service won’t be interrupted.

Set up SSO

Security: Instead of having users type their email and password to log into a workspace, set up Databricks SSO so users can authenticate via your identity provider.

Set up SCIM integration

Security: Instead of adding users to Databricks manually, integrate with your identity provider to automate user provisioning and deprovisioning. When a user is removed from the identity provider, they are automatically removed from Databricks too.

Manage access control with account-level groups

Data governance: Create account-level groups so you can bulk control access to workspaces, resources, and data. This saves you from having to grant all users access to everything or grant individual users specific permissions. You can also sync groups from your identity provider to Databricks groups.

Set up IP access lists

Security: IP access lists prevent users from accessing Databricks resources from unsecured networks. Accessing a cloud service from an unsecured network can pose security risks to an enterprise, especially when the user may have authorized access to sensitive or personal data. Make sure to set up IP access lists for both your account console and your workspaces.

Configure a customer-managed VPC with regional endpoints

Security: You can use a customer-managed VPC to exercise more control over your network configurations and to comply with specific cloud security and governance standards your organization might require. Cost: Regional VPC endpoints to AWS services provide more direct connections and reduced cost compared to AWS global endpoints.

Use Databricks Secrets or a cloud provider secrets manager

Security: Using Databricks secrets allows you to securely store credentials for external data sources. Instead of entering credentials directly into a notebook, you can simply reference a secret to authenticate to a data source.

Set expiration dates on personal access tokens (PATs)

Security: Workspace admins can manage PATs for users, groups, and service principals. Setting expiration dates for PATs reduces the risk of lost tokens or long-lasting tokens that could lead to data exfiltration from the workspace.

Use system tables to monitor account usage

Observability: System tables are a Databricks-hosted analytical store of your account’s operational data, including audit logs, data lineage, and billable usage. You can use system tables for observability across your account (see the example query after this table).
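
As an illustration of the last practice, here is a sketch of a notebook query against the billable-usage system table. It assumes Unity Catalog and the system.billing schema are enabled on the workspace; verify the column names against the system table schema in your account.

# Run in a Databricks notebook (spark and display are provided by the runtime).
usage = spark.sql("""
    SELECT workspace_id,
           sku_name,
           usage_date,
           SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY workspace_id, sku_name, usage_date
    ORDER BY usage_date DESC
""")
display(usage)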