Databricks

1. Use Medallion Architecture (Gold Standard)

Databricks strongly recommends the Bronze → Silver → Gold pattern:

  • 🥉 Bronze (Raw data)
    • Ingested as-is from sources
    • Minimal transformation
  • 🥈 Silver (Cleaned data)
    • Deduplication, cleaning, standardization
  • 🥇 Gold (Business-ready data)
    • Aggregated, modeled for analytics/BI

👉 Why it matters:

  • Clear separation of concerns
  • Easier debugging
  • Better scalability

2. Use Delta Lake for Everything

Use Delta Lake tables instead of plain files or Hive tables.

Key benefits:

  • ACID transactions
  • Time travel (data versioning)
  • Schema enforcement + evolution
  • Faster queries via optimization

👉 Best practice:
Always write ETL outputs as Delta tables, not raw Parquet.


3. Efficient Data Ingestion (Auto Loader)

For streaming or incremental ingestion use:

  • Auto Loader (cloudFiles)
  • Structured Streaming

Benefits:

  • Handles schema evolution automatically
  • Scales to millions of files
  • Incremental processing

👉 Avoid full reloads whenever possible.


4. Prefer Incremental ETL (Not Full Loads)

Use:

  • Change Data Capture (CDC)
  • MERGE INTO patterns in Delta Lake

Instead of:
❌ Reprocessing entire dataset daily
Do:
✅ Process only new/changed records

Example pattern:

  • MERGE INTO silver_table USING updates

5. Optimize Spark Jobs Properly

Since Databricks runs on Spark:

Key optimizations:

  • Use partitioning wisely
  • Avoid small files (use OPTIMIZE)
  • Use broadcast joins for small datasets
  • Cache only when reused
  • Avoid Python UDFs when SQL functions exist

👉 Rule: Push computation to Spark SQL whenever possible.


6. Design Clean Data Models

  • Star schema for analytics (Gold layer)
  • Normalize only in Silver layer if needed
  • Keep transformations reusable and modular

👉 Avoid “one giant notebook doing everything”.


7. Use Workflows for Orchestration

Use Databricks-native scheduling:

  • Databricks Workflows
  • Task dependencies (DAGs)

Or external orchestrators like:

  • Apache Airflow

👉 Keep pipelines modular:

  • Ingest → Transform → Validate → Publish

8. Data Quality Enforcement

Use:

  • Expectations (built-in or custom checks)
  • Great Expectations integration
  • Row count checks, null checks, schema validation

👉 Stop bad data early in Bronze/Silver layers.


9. Monitor Everything

Track:

  • Job duration
  • Cluster usage
  • Data volume growth
  • Failure rates

Use:

  • Databricks job metrics
  • Logs + alerts
  • Observability tools like Grafana

10. Governance with Unity Catalog

Use Unity Catalog for:

  • Fine-grained access control
  • Data lineage tracking
  • Centralized metadata

👉 Critical for enterprise ETL pipelines.


11. Avoid Small File Problem

Common Databricks issue:

❌ Too many tiny files → slow queries

Fix:

  • Use OPTIMIZE
  • Enable Auto Compaction
  • Use proper batch sizes

12. Cluster & Cost Optimization

  • Use job clusters instead of all-purpose clusters
  • Enable autoscaling
  • Use spot instances where possible
  • Shutdown idle clusters

👉 Big cost savings in production pipelines.


13. Version Control Everything

  • Store notebooks / code in Git
  • Use CI/CD pipelines
  • Parameterize notebooks (avoid hardcoding)

Databricks ETL Philosophy (Important)

Modern Databricks ETL is NOT classic ETL.

Instead:

ELT + Lakehouse + Delta + Spark optimization


Summary

A strong Databricks ETL pipeline:

  • Uses Medallion architecture
  • Relies on Delta Lake
  • Processes incrementally
  • Is orchestrated via Workflows/Airflow
  • Is governed by Unity Catalog
  • Is optimized for Spark performance + cost

Databricks notebook best practices for building clean, production-ready pipelines in Databricks using Apache Spark.


1. Keep Notebooks Single-Purpose

❌ Bad

One notebook does everything:

  • ingestion
  • cleaning
  • transformation
  • reporting

✅ Good

Split into logical stages:

  • 01_ingest
  • 02_clean
  • 03_transform
  • 04_publish

👉 Why:

  • Easier debugging
  • Reusable components
  • Better job orchestration

2. Modular Design (Avoid Monoliths)

Break logic into:

  • Functions
  • Reusable utilities
  • Parameterized workflows

Example:

 
def clean_data(df):
return df.dropDuplicates().filter("value IS NOT NULL")
 

👉 Helps scaling and testing.


3. Use Widgets Instead of Hardcoding

Avoid:

 
path = "/data/sales/2026"
 

Use:

 
dbutils.widgets.text("input_path", "")
path = dbutils.widgets.get("input_path")
 

👉 Makes notebooks reusable across environments.


4. Design for Idempotency

Your notebook should be safe to rerun:

  • No duplicate inserts
  • Use MERGE INTO instead of overwrite
  • Avoid side effects

👉 Critical for production ETL reliability.


5. Optimize Spark Usage

Since notebooks run on Spark:

✔ Prefer Spark SQL functions
✔ Avoid Python UDFs
✔ Filter early (reduce data before joins)
✔ Use broadcast joins when possible


6. Always Use Delta Lake

Use Delta tables only, not raw files:

  • ACID compliance
  • Time travel
  • Schema enforcement

Example:

 
df.write.format("delta").mode("append").save("/mnt/silver/sales")
 

7. Add Data Quality Checks Early

Validate data before downstream processing:

  • Null checks
  • Duplicate detection
  • Schema validation
  • Business rules

👉 Stop bad data in Bronze/Silver layers.


8. Use Clear Logging & Debugging

Add checkpoints:

 
print(f"Row count: {df.count()}")
 

Or structured logs for production jobs.

👉 Helps debugging failures quickly.


9. Organize Notebook Structure

A good structure:

  1. Title + Purpose (Markdown)
  2. Parameters (widgets)
  3. Imports
  4. Functions
  5. Main ETL logic
  6. Validation
  7. Output write

👉 Makes notebooks readable and maintainable.


10. Never Hardcode Secrets

❌ Bad:

 
password = "mypassword"
 

✅ Good:
Use secret scope:

 
dbutils.secrets.get(scope="prod", key="db-password")
 

11. Optimize File Handling

  • Avoid many small files
  • Use OPTIMIZE on Delta tables
  • Partition properly

👉 Prevents performance degradation.


12. Keep Notebooks Lightweight

Notebooks should:

  • Orchestrate logic
  • NOT contain heavy frameworks
  • NOT act as full applications

👉 Heavy logic → move to libraries/modules


13. Use Version Control (Git Integration)

  • Store notebooks in Git
  • Track changes
  • Enable CI/CD deployment

👉 Essential for team environments.


14. Use Databricks Workflows for Execution

Don’t manually run notebooks in production.

Use:

  • Databricks Workflows
  • Task dependencies
  • Retry policies

15. Avoid Long Running Cells

  • Break large operations into steps
  • Avoid blocking .collect() on big datasets

16. Be Cost Aware

  • Use job clusters (not always-on clusters)
  • Auto-terminate idle clusters
  • Avoid unnecessary recomputation

17. Use Clear Naming Conventions

Example:

  • bronze_sales_ingest
  • silver_sales_clean
  • gold_sales_aggregation

👉 Makes pipelines self-documenting.


18. Design for Production, Not Just Exploration

Difference:

Exploration Production
Quick queries Stable pipelines
Hardcoded paths Parameterized
Manual runs Scheduled jobs
Debug prints Logging/monitoring

Summary

Good Databricks notebooks are:

  • Modular
  • Parameterized
  • Idempotent
  • Delta-based
  • Well-organized
  • Production-ready
Scroll to Top