Databricks · Data engineering tutorial

Databricks Retail Pipeline with PySpark and Delta Tables

A beginner portfolio project that documents local Databricks CLI setup, workspace authentication, CSV upload, Asset Bundle deployment, PySpark transformations, Delta table outputs, and SQL validation.


Project output: clean order Delta tables, daily revenue metrics, customer metrics, and quality-check SQL.

Local CLI install

Databricks authentication

CSV upload to volume or DBFS

Asset Bundle job deployment

PySpark to Delta tables

SQL validation screenshots

What This Teaches

A practical first Databricks workflow, with documented commands and screenshot capture points.

CLI Setup

Install and verify the Databricks CLI locally with WinGet, including the direct Windows alias path.

Asset Bundles

Deploy a Databricks job from databricks.yml and run a notebook as a repeatable workflow.

Delta Outputs

Use PySpark to clean retail orders and publish analytics-ready Delta tables with validation queries.
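The cleaning and publishing step described above could look roughly like the following. This is a minimal sketch, not the contents of `notebooks/retail_orders_pipeline.py`: the column names (`order_id`, `customer_id`, `order_date`, `amount`) and both output table names are assumptions, since the CSV schema is not shown here. It uses SQL expression strings so the functions need no imports and can take the `spark` session that a Databricks notebook provides.

```python
# Assumed column names: the tutorial does not show the CSV schema.
REQUIRED_COLUMNS = ["order_id", "customer_id", "order_date", "amount"]
# Same file as the upload step, written as a volume path for Spark.
SOURCE_PATH = "/Volumes/main/default/demo/retail_orders.csv"
# Hypothetical output table names.
CLEAN_TABLE = "main.default.retail_orders_clean"
DAILY_TABLE = "main.default.daily_revenue"


def clean_orders(raw_df):
    """Type the columns, then drop incomplete rows, duplicates, and bad amounts."""
    typed = raw_df.selectExpr(
        "order_id",
        "customer_id",
        "to_date(order_date) AS order_date",
        "CAST(amount AS DOUBLE) AS amount",
    )
    return (
        typed.dropna(subset=REQUIRED_COLUMNS)
        .dropDuplicates(["order_id"])
        .filter("amount > 0")
    )


def daily_revenue(clean_df):
    """Aggregate cleaned orders into one row per order date."""
    return (
        clean_df.groupBy("order_date")
        .agg({"order_id": "count", "amount": "sum"})
        .withColumnRenamed("count(order_id)", "order_count")
        .withColumnRenamed("sum(amount)", "revenue")
    )


def run(spark):
    """Read the uploaded CSV and overwrite both Delta tables."""
    raw = spark.read.option("header", True).csv(SOURCE_PATH)
    clean = clean_orders(raw)
    clean.write.format("delta").mode("overwrite").saveAsTable(CLEAN_TABLE)
    daily_revenue(clean).write.format("delta").mode("overwrite").saveAsTable(DAILY_TABLE)
```

In a notebook, calling `run(spark)` would execute the whole flow. Overwrite mode keeps reruns repeatable for a small demo dataset; an incremental pipeline would likely need a different write strategy.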

Files Included

File	Purpose	Skill shown
`data/retail_orders.csv`	Small synthetic ecommerce order export.	CSV source modeling
`databricks.yml`	Defines the Databricks Asset Bundle job.	Bundle configuration
`notebooks/retail_orders_pipeline.py`	Reads CSV data, cleans orders, derives metrics, and writes Delta tables.	PySpark transformations
`sql/quality_checks.sql`	Checks duplicates, required fields, invalid amounts, revenue summaries, and top customers.	SQL validation
`docs/runbook-with-screenshots.md`	Documents each step with screenshot filenames and expected results.	Implementation documentation
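To illustrate the row-level checks that `sql/quality_checks.sql` is described as covering (duplicates, required fields, invalid amounts), here is a hedged sketch that builds comparable queries and runs them from a notebook. The actual SQL file is not reproduced here, so the query text, the column names, and the helper names `build_checks` and `run_checks` are all assumptions.

```python
def build_checks(table):
    """Return named SQL queries; each counts rows that violate one rule."""
    return {
        # Assumed key column: order_id should appear once per order.
        "duplicate_order_ids": (
            "SELECT COUNT(*) AS violations FROM ("
            f"SELECT order_id FROM {table} GROUP BY order_id HAVING COUNT(*) > 1"
            ") AS dupes"
        ),
        # Assumed required fields.
        "missing_required_fields": (
            f"SELECT COUNT(*) AS violations FROM {table} "
            "WHERE order_id IS NULL OR customer_id IS NULL OR order_date IS NULL"
        ),
        # Assumed rule: an order amount must be present and positive.
        "invalid_amounts": (
            f"SELECT COUNT(*) AS violations FROM {table} "
            "WHERE amount IS NULL OR amount <= 0"
        ),
    }


def run_checks(spark, table):
    """Run each check and return {check name: violation count}."""
    return {
        name: spark.sql(query).first()["violations"]
        for name, query in build_checks(table).items()
    }
```

A clean table would return zero for every check, which gives a simple pass condition to capture in a screenshot. The table name passed in is interpolated directly into the SQL, so it should come from trusted configuration only.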

Run Order

# 1. Verify CLI
databricks -v

# 2. Authenticate
databricks auth login --host https://YOUR-WORKSPACE-URL

# 3. Upload sample CSV to a Unity Catalog volume (catalog main, schema default, volume demo)
databricks fs cp data/retail_orders.csv dbfs:/Volumes/main/default/demo/retail_orders.csv --overwrite

# 4. Deploy and run
databricks bundle validate
databricks bundle deploy
databricks bundle run retail_orders_pipeline

# 5. Validate in Databricks SQL (not a CLI command):
#    open sql/quality_checks.sql in the SQL editor and run the queries
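For repeat runs, steps 1 to 4 above can be wrapped in a small driver that stops at the first failing command. This is a sketch, not part of the project files: the helper names `build_steps` and `run_steps` are hypothetical, and the commands are copied from the run order as written. Step 5 is left out because it is performed in the Databricks SQL editor rather than the CLI.

```python
import subprocess

# Placeholder host from step 2; replace with the real workspace URL.
WORKSPACE_HOST = "https://YOUR-WORKSPACE-URL"


def build_steps(host=WORKSPACE_HOST):
    """Return the CLI commands from the run order as argument lists."""
    return [
        ["databricks", "-v"],
        ["databricks", "auth", "login", "--host", host],
        [
            "databricks", "fs", "cp",
            "data/retail_orders.csv",
            "dbfs:/Volumes/main/default/demo/retail_orders.csv",
            "--overwrite",
        ],
        ["databricks", "bundle", "validate"],
        ["databricks", "bundle", "deploy"],
        ["databricks", "bundle", "run", "retail_orders_pipeline"],
    ]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order; check=True raises on a non-zero exit code."""
    for command in steps:
        print("$", " ".join(command))
        runner(command, check=True)
```

Calling `run_steps(build_steps())` from the project root would execute the sequence. Passing argument lists instead of a shell string avoids quoting problems, and the injectable `runner` makes it possible to dry-run the sequence without calling the CLI.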

Resume Talking Point

Built a Databricks retail data pipeline tutorial that uses the Databricks CLI, Asset Bundles, PySpark, and Delta tables to load raw CSV orders, clean and enrich records, publish analytics-ready metrics, and validate output with screenshot-documented SQL checks.