Databricks · Data engineering tutorial

Databricks Retail Pipeline with PySpark and Delta Tables

A beginner portfolio project that documents local Databricks CLI setup, workspace authentication, CSV upload, Asset Bundle deployment, PySpark transformations, Delta table outputs, and SQL validation.

Project output: clean order Delta tables, daily revenue metrics, customer metrics, and quality-check SQL.
Local CLI install
Databricks authentication
CSV upload to volume or DBFS
Asset Bundle job deployment
PySpark to Delta tables
SQL validation screenshots

What This Teaches

A practical first Databricks workflow, with documented commands and screenshot capture points.

CLI Setup

Install and verify the Databricks CLI locally with WinGet, including the direct Windows alias path.

Asset Bundles

Deploy a Databricks job from databricks.yml and run a notebook as a repeatable workflow.

Delta Outputs

Use PySpark to clean retail orders and publish analytics-ready Delta tables with validation queries.
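To make the cleaning and metric steps concrete, here is a minimal local sketch of the kind of rules such a pipeline might apply. It uses only the Python standard library as a stand-in, so it is not the notebook's PySpark code. The column names (`order_id`, `customer_id`, `order_date`, `quantity`, `unit_price`), the sample rows, and the specific rules are all assumptions made for illustration; the real `data/retail_orders.csv` may look different.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; the real retail_orders.csv schema may differ.
SAMPLE_CSV = """order_id,customer_id,order_date,quantity,unit_price
1001,C01,2024-01-05,2,10.00
1001,C01,2024-01-05,2,10.00
1002,C02,2024-01-05,1,25.50
1003,,2024-01-06,1,5.00
1004,C01,2024-01-06,-3,4.00
1005,C03,2024-01-06,4,2.50
"""

# Assumed required fields for an order to count as valid.
REQUIRED = ("order_id", "customer_id", "order_date")


def clean_orders(rows):
    """Drop rows with missing required fields, non-positive amounts,
    or an order_id that was already seen, and derive order_total."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(not (row.get(col) or "").strip() for col in REQUIRED):
            continue
        quantity = int(row["quantity"])
        unit_price = float(row["unit_price"])
        if quantity <= 0 or unit_price <= 0:
            continue
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": row["order_id"],
            "customer_id": row["customer_id"],
            "order_date": row["order_date"],
            "quantity": quantity,
            "unit_price": unit_price,
            "order_total": round(quantity * unit_price, 2),
        })
    return cleaned


def daily_revenue(orders):
    """Sum order_total per order_date."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["order_date"]] += order["order_total"]
    return {day: round(total, 2) for day, total in sorted(totals.items())}


def customer_metrics(orders):
    """Count orders and sum spend per customer_id."""
    metrics = defaultdict(lambda: {"order_count": 0, "total_spent": 0.0})
    for order in orders:
        entry = metrics[order["customer_id"]]
        entry["order_count"] += 1
        entry["total_spent"] = round(entry["total_spent"] + order["order_total"], 2)
    return dict(metrics)


orders = clean_orders(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(daily_revenue(orders))
print(customer_metrics(orders))
```

In a PySpark notebook, rules like these would typically map to `DataFrame.dropna`, `filter`, `dropDuplicates`, and `groupBy(...).agg(...)`, with each result written out through `df.write.format("delta").saveAsTable(...)`.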

Files Included

| File | Purpose | Skill shown |
| --- | --- | --- |
| data/retail_orders.csv | Small synthetic ecommerce order export. | CSV source modeling |
| databricks.yml | Defines the Databricks Asset Bundle job. | Bundle configuration |
| notebooks/retail_orders_pipeline.py | Reads CSV data, cleans orders, derives metrics, and writes Delta tables. | PySpark transformations |
| sql/quality_checks.sql | Checks duplicates, required fields, invalid amounts, revenue summaries, and top customers. | SQL validation |
| docs/runbook-with-screenshots.md | Documents each step with screenshot filenames and expected results. | Implementation documentation |
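The checks described for sql/quality_checks.sql (duplicates, required fields, invalid amounts) can be tried locally before touching a workspace. The sketch below uses Python's built-in `sqlite3` as a stand-in for Databricks SQL. The table name `orders_clean`, its columns, and the query text are assumptions made for illustration, not the contents of the actual SQL file, and the sample rows are deliberately flawed so that each check has something to find.

```python
import sqlite3

# In-memory database standing in for a Delta table (assumption: the
# table is called orders_clean and has these four columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders_clean ("
    "order_id TEXT, customer_id TEXT, order_date TEXT, order_total REAL)"
)
conn.executemany(
    "INSERT INTO orders_clean VALUES (?, ?, ?, ?)",
    [
        ("1001", "C01", "2024-01-05", 20.0),
        ("1002", "C02", "2024-01-05", 25.5),
        ("1002", "C02", "2024-01-05", 25.5),   # deliberate duplicate order_id
        ("1003", None, "2024-01-06", 5.0),     # deliberate missing customer_id
        ("1004", "C01", "2024-01-06", -12.0),  # deliberate invalid amount
    ],
)

# Each check returns a count of offending rows or keys; 0 means the check passes.
CHECKS = {
    "duplicate_order_ids": (
        "SELECT COUNT(*) FROM ("
        "SELECT order_id FROM orders_clean "
        "GROUP BY order_id HAVING COUNT(*) > 1) AS d"
    ),
    "missing_required_fields": (
        "SELECT COUNT(*) FROM orders_clean "
        "WHERE order_id IS NULL OR customer_id IS NULL OR order_date IS NULL"
    ),
    "invalid_amounts": (
        "SELECT COUNT(*) FROM orders_clean WHERE order_total <= 0"
    ),
}


def run_checks(connection):
    """Run every check and return {check_name: offending_count}."""
    return {
        name: connection.execute(sql).fetchone()[0]
        for name, sql in CHECKS.items()
    }


results = run_checks(conn)
for name, count in results.items():
    status = "PASS" if count == 0 else "FAIL"
    print(f"{name}: {count} ({status})")
```

Writing each check as a count that should equal zero keeps the pass or fail result easy to read in a screenshot.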

Run Order

# 1. Verify CLI
databricks -v

# 2. Authenticate
databricks auth login --host https://YOUR-WORKSPACE-URL

# 3. Upload sample CSV
databricks fs cp data/retail_orders.csv dbfs:/Volumes/main/default/demo/retail_orders.csv --overwrite

# 4. Deploy and run
databricks bundle validate
databricks bundle deploy
databricks bundle run retail_orders_pipeline

# 5. Validate in Databricks SQL
# Open sql/quality_checks.sql in the Databricks SQL editor and run each query
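If the non-interactive steps need to be repeated often, one option is a small wrapper that runs them in order and stops at the first failure. This is a sketch under stated assumptions: it only reuses the commands listed in the run order, and `run_steps` and `dry_run` are hypothetical helper names. The `databricks auth login` step is left out because it is interactive. The example calls the wrapper with a dry-run stand-in so nothing is actually executed.

```python
import subprocess
from types import SimpleNamespace

# Commands copied from the run order; authentication is assumed to be done already.
STEPS = [
    ["databricks", "-v"],
    ["databricks", "fs", "cp", "data/retail_orders.csv",
     "dbfs:/Volumes/main/default/demo/retail_orders.csv", "--overwrite"],
    ["databricks", "bundle", "validate"],
    ["databricks", "bundle", "deploy"],
    ["databricks", "bundle", "run", "retail_orders_pipeline"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order and stop at the first non-zero exit code.

    `runner` must accept a command list and return an object with a
    `returncode` attribute, as subprocess.run does.
    """
    completed = []
    for cmd in steps:
        result = runner(cmd)
        if result.returncode != 0:
            raise RuntimeError(f"Step failed: {' '.join(cmd)}")
        completed.append(cmd)
    return completed


def dry_run(cmd):
    """Hypothetical stand-in runner: print the command instead of executing it."""
    print("would run:", " ".join(cmd))
    return SimpleNamespace(returncode=0)


completed = run_steps(STEPS, runner=dry_run)
```

Calling `run_steps(STEPS)` without the `runner` argument would execute the commands for real through `subprocess.run`, which requires the Databricks CLI on the PATH and an authenticated profile.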

Resume Talking Point

Built a Databricks retail data pipeline tutorial that uses the Databricks CLI, Asset Bundles, PySpark, and Delta tables to load raw CSV orders, clean and enrich records, publish analytics-ready metrics, and validate output with screenshot-documented SQL checks.