1. Use Medallion Architecture (Gold Standard)
Databricks strongly recommends the Bronze → Silver → Gold pattern:
- 🥉 Bronze (Raw data)
- Ingested as-is from sources
- Minimal transformation
- 🥈 Silver (Cleaned data)
- Deduplication, cleaning, standardization
- 🥇 Gold (Business-ready data)
- Aggregated, modeled for analytics/BI
👉 Why it matters:
- Clear separation of concerns
- Easier debugging
- Better scalability
2. Use Delta Lake for Everything
Use Delta Lake tables instead of plain files or Hive tables.
Key benefits:
- ACID transactions
- Time travel (data versioning)
- Schema enforcement + evolution
- Faster queries via optimization
👉 Best practice:
Always write ETL outputs as Delta tables, not raw Parquet.
3. Efficient Data Ingestion (Auto Loader)
For streaming or incremental ingestion use:
- Auto Loader (cloudFiles)
- Structured Streaming
Benefits:
- Handles schema evolution automatically
- Scales to millions of files
- Incremental processing
👉 Avoid full reloads whenever possible.
4. Prefer Incremental ETL (Not Full Loads)
Use:
- Change Data Capture (CDC)
- MERGE INTO patterns in Delta Lake
Instead of:
❌ Reprocessing entire dataset daily
Do:
✅ Process only new/changed records
Example pattern:
MERGE INTO silver_table USING updates
5. Optimize Spark Jobs Properly
Since Databricks runs on Spark:
Key optimizations:
- Use partitioning wisely
- Avoid small files (use
OPTIMIZE) - Use broadcast joins for small datasets
- Cache only when reused
- Avoid Python UDFs when SQL functions exist
👉 Rule: Push computation to Spark SQL whenever possible.
6. Design Clean Data Models
- Star schema for analytics (Gold layer)
- Normalize only in Silver layer if needed
- Keep transformations reusable and modular
👉 Avoid “one giant notebook doing everything”.
7. Use Workflows for Orchestration
Use Databricks-native scheduling:
- Databricks Workflows
- Task dependencies (DAGs)
Or external orchestrators like:
- Apache Airflow
👉 Keep pipelines modular:
- Ingest → Transform → Validate → Publish
8. Data Quality Enforcement
Use:
- Expectations (built-in or custom checks)
- Great Expectations integration
- Row count checks, null checks, schema validation
👉 Stop bad data early in Bronze/Silver layers.
9. Monitor Everything
Track:
- Job duration
- Cluster usage
- Data volume growth
- Failure rates
Use:
- Databricks job metrics
- Logs + alerts
- Observability tools like Grafana
10. Governance with Unity Catalog
Use Unity Catalog for:
- Fine-grained access control
- Data lineage tracking
- Centralized metadata
👉 Critical for enterprise ETL pipelines.
11. Avoid Small File Problem
Common Databricks issue:
❌ Too many tiny files → slow queries
Fix:
- Use
OPTIMIZE - Enable Auto Compaction
- Use proper batch sizes
12. Cluster & Cost Optimization
- Use job clusters instead of all-purpose clusters
- Enable autoscaling
- Use spot instances where possible
- Shutdown idle clusters
👉 Big cost savings in production pipelines.
13. Version Control Everything
- Store notebooks / code in Git
- Use CI/CD pipelines
- Parameterize notebooks (avoid hardcoding)
Databricks ETL Philosophy (Important)
Modern Databricks ETL is NOT classic ETL.
Instead:
ELT + Lakehouse + Delta + Spark optimization
Summary
A strong Databricks ETL pipeline:
- Uses Medallion architecture
- Relies on Delta Lake
- Processes incrementally
- Is orchestrated via Workflows/Airflow
- Is governed by Unity Catalog
- Is optimized for Spark performance + cost
Databricks notebook best practices for building clean, production-ready pipelines in Databricks using Apache Spark.
1. Keep Notebooks Single-Purpose
❌ Bad
One notebook does everything:
- ingestion
- cleaning
- transformation
- reporting
✅ Good
Split into logical stages:
01_ingest02_clean03_transform04_publish
👉 Why:
- Easier debugging
- Reusable components
- Better job orchestration
2. Modular Design (Avoid Monoliths)
Break logic into:
- Functions
- Reusable utilities
- Parameterized workflows
Example:
def clean_data(df):
return df.dropDuplicates().filter("value IS NOT NULL")
👉 Helps scaling and testing.
3. Use Widgets Instead of Hardcoding
Avoid:
path = "/data/sales/2026"
Use:
dbutils.widgets.text("input_path", "")
path = dbutils.widgets.get("input_path")
👉 Makes notebooks reusable across environments.
4. Design for Idempotency
Your notebook should be safe to rerun:
- No duplicate inserts
- Use
MERGE INTOinstead of overwrite - Avoid side effects
👉 Critical for production ETL reliability.
5. Optimize Spark Usage
Since notebooks run on Spark:
✔ Prefer Spark SQL functions
✔ Avoid Python UDFs
✔ Filter early (reduce data before joins)
✔ Use broadcast joins when possible
6. Always Use Delta Lake
Use Delta tables only, not raw files:
- ACID compliance
- Time travel
- Schema enforcement
Example:
df.write.format("delta").mode("append").save("/mnt/silver/sales")
7. Add Data Quality Checks Early
Validate data before downstream processing:
- Null checks
- Duplicate detection
- Schema validation
- Business rules
👉 Stop bad data in Bronze/Silver layers.
8. Use Clear Logging & Debugging
Add checkpoints:
print(f"Row count: {df.count()}")
Or structured logs for production jobs.
👉 Helps debugging failures quickly.
9. Organize Notebook Structure
A good structure:
- Title + Purpose (Markdown)
- Parameters (widgets)
- Imports
- Functions
- Main ETL logic
- Validation
- Output write
👉 Makes notebooks readable and maintainable.
10. Never Hardcode Secrets
❌ Bad:
password = "mypassword"
✅ Good:
Use secret scope:
dbutils.secrets.get(scope="prod", key="db-password")
11. Optimize File Handling
- Avoid many small files
- Use
OPTIMIZEon Delta tables - Partition properly
👉 Prevents performance degradation.
12. Keep Notebooks Lightweight
Notebooks should:
- Orchestrate logic
- NOT contain heavy frameworks
- NOT act as full applications
👉 Heavy logic → move to libraries/modules
13. Use Version Control (Git Integration)
- Store notebooks in Git
- Track changes
- Enable CI/CD deployment
👉 Essential for team environments.
14. Use Databricks Workflows for Execution
Don’t manually run notebooks in production.
Use:
- Databricks Workflows
- Task dependencies
- Retry policies
15. Avoid Long Running Cells
- Break large operations into steps
- Avoid blocking
.collect()on big datasets
16. Be Cost Aware
- Use job clusters (not always-on clusters)
- Auto-terminate idle clusters
- Avoid unnecessary recomputation
17. Use Clear Naming Conventions
Example:
bronze_sales_ingestsilver_sales_cleangold_sales_aggregation
👉 Makes pipelines self-documenting.
18. Design for Production, Not Just Exploration
Difference:
| Exploration | Production |
|---|---|
| Quick queries | Stable pipelines |
| Hardcoded paths | Parameterized |
| Manual runs | Scheduled jobs |
| Debug prints | Logging/monitoring |
Summary
Good Databricks notebooks are:
- Modular
- Parameterized
- Idempotent
- Delta-based
- Well-organized
- Production-ready