Data Forge Plus

By Lannon Khau

· 5 min read
This project started as an idea to process 1.3 million rows of financial data interactively. Over time, it evolved into an AI-powered web application that accepts datasets of any size, cleans them, performs feature engineering, and visualizes the results in an intuitive UI, all powered by GPT-4o, LangChain, and serverless AWS tools.

🌐 Live Architecture

  • EC2: Flask backend with Gunicorn and Nginx
  • S3: Secure file storage for datasets
  • RDS (MySQL): Stores metadata and session logs
  • Lambda: Triggers OpenAI agents for data cleaning
  • Secrets Manager: Secure API key management
  • GitHub Actions: Handles CI/CD deployment

Backend

Flask
Python
Gunicorn

Frontend

HTML
Streamlit
Jinja2
TailwindCSS

AI Agents

OpenAI GPT-4o
AI Data Science Team
LangChain

Infrastructure

S3
EC2
RDS
Lambda
Secrets Manager

CI / CD

GitHub Actions


🧪 How It Works

Step 1: Register / Login

Access the Dashboard

New users create an account or log in via a secure Flask authentication flow. Once authenticated, they land in the main dashboard.
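The registration step boils down to storing salted password hashes and checking them on login. A minimal sketch using only the standard library (the actual app's Flask auth flow may use a library such as Flask-Login; the function names here are illustrative):

```python
import hashlib
import secrets

def hash_password(password: str, salt: bytes = None) -> tuple:
    """Derive a salted PBKDF2 hash suitable for storing in a users table."""
    salt = salt or secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    """Constant-time check of a login attempt against the stored hash."""
    _, digest = hash_password(password, salt)
    return secrets.compare_digest(digest, stored)
```

Hashing on registration and comparing with `compare_digest` on login keeps plaintext passwords out of the database and avoids timing side channels.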

Step 2: Upload CSV

Flask Handles Upload to S3

Users drag and drop datasets directly into the dashboard. Files are uploaded straight to AWS S3 using secure presigned URLs, bypassing the Flask server's upload size limits.

Step 3: Clean + Engineer

Lambda + AI Agents

Users trigger an AI-powered pipeline that runs `DataCleaningAgent` and `FeatureEngineeringAgent` on the dataset asynchronously using AWS Lambda.
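The Lambda side can be sketched as a handler that receives the S3 key and chains the two agents. The `run_cleaning` / `run_feature_engineering` functions below are stand-ins; the real calls go through `DataCleaningAgent` and `FeatureEngineeringAgent` from the AI Data Science Team library:

```python
import json

def run_cleaning(key: str) -> str:
    """Stand-in for DataCleaningAgent: writes output under cleaned/."""
    return f"cleaned/{key.split('/')[-1]}"

def run_feature_engineering(key: str) -> str:
    """Stand-in for FeatureEngineeringAgent: enriches the cleaned object."""
    return key

def lambda_handler(event, context):
    """Lambda entry point: the dataset's S3 key arrives in the event payload."""
    key = event["key"]
    cleaned_key = run_feature_engineering(run_cleaning(key))
    return {"statusCode": 200, "body": json.dumps({"cleaned_key": cleaned_key})}
```

Because the pipeline runs inside Lambda, the Flask app only enqueues the job and returns immediately, keeping uploads of any size from blocking the web tier.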

Step 4: Store Output

Cleaned Data in S3 + Logs in RDS

Once processed, cleaned files are saved in a `cleaned/` prefix in S3. All job metadata, timestamps, and session info are stored in RDS for traceability.
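The traceability record written to RDS can be sketched as a small builder that derives the `cleaned/` key and timestamps the job (field names here are illustrative, not the app's actual schema):

```python
from datetime import datetime, timezone

def job_record(raw_key: str, session_id: str) -> dict:
    """Build the metadata row logged to RDS after a cleaning job completes."""
    filename = raw_key.rsplit("/", 1)[-1]
    return {
        "raw_key": raw_key,
        "cleaned_key": f"cleaned/{filename}",  # cleaned/ prefix in S3
        "session_id": session_id,
        "finished_at": datetime.now(timezone.utc).isoformat(),
    }
```

Storing both keys plus the session ID makes every output file traceable back to the upload and user session that produced it.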

Step 5: Explore with AI

Streamlit-Powered EDA

Users launch a dynamic Streamlit interface that loads the cleaned preview and visual insights. Features include charts, agentic mind maps, and NL → SQL querying.
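Once the agent has translated a natural-language question into SQL, executing it can be sketched by loading the cleaned CSV into an in-memory SQLite table (an assumption for illustration; the app may run the query through pandas or another engine instead):

```python
import csv
import io
import sqlite3

def run_sql_on_csv(csv_text: str, sql: str) -> list:
    """Load a cleaned CSV into in-memory SQLite and run the generated SQL."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{c}"' for c in header)
    conn.execute(f"CREATE TABLE dataset ({cols})")
    conn.executemany(
        f"INSERT INTO dataset VALUES ({', '.join('?' for _ in header)})", data
    )
    return conn.execute(sql).fetchall()
```

For example, "which tickers traded above 200?" might translate to `SELECT ticker FROM dataset WHERE CAST(price AS REAL) > 200`, and the result feeds directly into a Streamlit chart.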


"This is super cool! I remember struggling to bridge that gap between insights and code. A tool that shows the 'how' is a total game-changer for learning." - Tran Tien Van

πŸ“ Repository Structure

Cloudberry_AWS_Bootcamp/
│
├── Portfolio_V2/                  # Core Flask app
│   ├── app.py                     # Entrypoint Flask application
│   ├── templates/                 # Jinja2 HTML templates
│   ├── static/                    # TailwindCSS and JS
│   ├── utils/                     # AI pipeline, S3 handlers, and secrets manager
│   ├── data_forge_lite/           # Streamlit-powered exploration dashboard
│   └── requirements.txt
│
├── .github/workflows/             # GitHub Actions CI/CD scripts
├── README.txt                     # You're reading it!
└── start.sh                       # Launch script for Gunicorn

πŸ“ Run Locally

git clone https://github.com/LannonTheCannon/Cloudberry_AWS_Bootcamp.git
cd Cloudberry_AWS_Bootcamp/Portfolio_V2
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 app.py