Data Forge Plus

By Lannon Khau

· 5 min read
This project started as an idea to process 1.3 million rows of financial data interactively. Over time, it evolved into an AI-powered web application that accepts datasets of any size, cleans them, performs feature engineering, and visualizes the results in an intuitive UI, all powered by GPT-4o, LangChain, and serverless AWS tools.

🌐 Live Architecture

  • EC2: Flask backend with Gunicorn and Nginx
  • S3: Secure file storage for datasets
  • RDS (MySQL): Stores metadata and session logs
  • Lambda: Triggers OpenAI agents for data cleaning
  • Secrets Manager: Secure API key management
  • GitHub Actions: Handles CI/CD deployment

Backend

Flask
Python
Gunicorn

Frontend

HTML
Streamlit
Jinja2
TailwindCSS

AI Agents

OpenAI GPT-4o
AI Data Science Team
LangChain

Infrastructure

S3
EC2
RDS
Lambda
Secrets Manager

CI / CD

GitHub Actions


🧪 How It Works

Step 1: Register / Login

Access the Dashboard

New users create an account or log in via a secure Flask authentication flow. Once authenticated, they land in the main dashboard.
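The registration step boils down to storing salted password hashes and checking them on login. A minimal sketch using only the standard library (the actual app's Flask auth flow may use a library such as Flask-Login; the function names here are illustrative):

```python
import hashlib
import secrets

def hash_password(password: str, salt: bytes = None) -> tuple:
    """Derive a salted PBKDF2 hash suitable for storing in a users table."""
    salt = salt or secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    """Constant-time check of a login attempt against the stored hash."""
    _, digest = hash_password(password, salt)
    return secrets.compare_digest(digest, stored)
```

Hashing on registration and comparing with `compare_digest` on login keeps plaintext passwords out of the database and avoids timing side channels.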

Step 2: Upload CSV

Flask Handles Upload to S3

Users drag and drop datasets directly into the dashboard. Files are uploaded straight to AWS S3 using secure presigned URLs, bypassing the Flask server's upload size limits.

Step 3: Clean + Engineer

Lambda + AI Agents

Users trigger an AI-powered pipeline that runs `DataCleaningAgent` and `FeatureEngineeringAgent` on the dataset asynchronously using AWS Lambda.
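The Lambda side can be sketched as a handler that receives the S3 key and chains the two agents. The `run_cleaning` / `run_feature_engineering` functions below are stand-ins; the real calls go through `DataCleaningAgent` and `FeatureEngineeringAgent` from the AI Data Science Team library:

```python
import json

def run_cleaning(key: str) -> str:
    """Stand-in for DataCleaningAgent: writes output under cleaned/."""
    return f"cleaned/{key.split('/')[-1]}"

def run_feature_engineering(key: str) -> str:
    """Stand-in for FeatureEngineeringAgent: enriches the cleaned object."""
    return key

def lambda_handler(event, context):
    """Lambda entry point: the dataset's S3 key arrives in the event payload."""
    key = event["key"]
    cleaned_key = run_feature_engineering(run_cleaning(key))
    return {"statusCode": 200, "body": json.dumps({"cleaned_key": cleaned_key})}
```

Because the pipeline runs inside Lambda, the Flask app only enqueues the job and returns immediately, keeping uploads of any size from blocking the web tier.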

Step 4: Store Output

Cleaned Data in S3 + Logs in RDS

Once processed, cleaned files are saved in a `cleaned/` prefix in S3. All job metadata, timestamps, and session info are stored in RDS for traceability.
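The traceability record written to RDS can be sketched as a small builder that derives the `cleaned/` key and timestamps the job (field names here are illustrative, not the app's actual schema):

```python
from datetime import datetime, timezone

def job_record(raw_key: str, session_id: str) -> dict:
    """Build the metadata row logged to RDS after a cleaning job completes."""
    filename = raw_key.rsplit("/", 1)[-1]
    return {
        "raw_key": raw_key,
        "cleaned_key": f"cleaned/{filename}",  # cleaned/ prefix in S3
        "session_id": session_id,
        "finished_at": datetime.now(timezone.utc).isoformat(),
    }
```

Storing both keys plus the session ID makes every output file traceable back to the upload and user session that produced it.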

Step 5: Explore with AI

Streamlit-Powered EDA

Users launch a dynamic Streamlit interface that loads the cleaned preview and visual insights. Features include charts, agentic mind maps, and NL → SQL querying.
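Once the agent has translated a natural-language question into SQL, executing it can be sketched by loading the cleaned CSV into an in-memory SQLite table (an assumption for illustration; the app may run the query through pandas or another engine instead):

```python
import csv
import io
import sqlite3

def run_sql_on_csv(csv_text: str, sql: str) -> list:
    """Load a cleaned CSV into in-memory SQLite and run the generated SQL."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{c}"' for c in header)
    conn.execute(f"CREATE TABLE dataset ({cols})")
    conn.executemany(
        f"INSERT INTO dataset VALUES ({', '.join('?' for _ in header)})", data
    )
    return conn.execute(sql).fetchall()
```

For example, "which tickers traded above 200?" might translate to `SELECT ticker FROM dataset WHERE CAST(price AS REAL) > 200`, and the result feeds directly into a Streamlit chart.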


"This is super cool! I remember struggling to bridge that gap between insights and code. A tool that shows the 'how' is a total game-changer for learning." - Tran Tien Van

πŸ“ Repository Structure

Cloudberry_AWS_Bootcamp/
│
├── Portfolio_V2/                  # Core Flask app
│   ├── app.py                     # Entrypoint Flask application
│   ├── templates/                 # Jinja2 HTML templates
│   ├── static/                    # TailwindCSS and JS
│   ├── utils/                     # AI pipeline, S3 handlers, and secrets manager
│   ├── data_forge_lite/           # Streamlit-powered exploration dashboard
│   └── requirements.txt
│
├── .github/workflows/             # GitHub Actions CI/CD scripts
├── README.txt                     # You're reading it!
└── start.sh                       # Launch script for Gunicorn

πŸ“ Run Locally

git clone https://github.com/LannonTheCannon/Cloudberry_AWS_Bootcamp.git
cd Cloudberry_AWS_Bootcamp/Portfolio_V2
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 app.py