This is the full technical breakdown. Architecture evolution, code snippets, database schema, and every pit from 97 commits — all laid out. If you're building something similar, this should save you a few weeks.
The Real Pain Points
The client is a custom aluminum window and door manufacturer in Sydney. The boss is Chinese.
Most people's first assumption would be "he's drowning in emails." That's wrong. The real pain points were two things:
1. Cross-country workflow gap. Australian B2B runs on email. Customers send inquiries, floor plans, quote confirmations, and progress updates — all via email. But the boss spent years doing business in China, where everything runs through WeChat — voice messages, screenshots, phone calls. Email wasn't "too much" for him — it was foreign. He didn't know when to reply, what format to use, or how to chase a two-week-old email thread.
2. Language barrier. The boss can hold a casual English conversation, but can't write professional business emails. "Please confirm the quote at your convenience," "Attached is the revised drawing for your review" — these sentences don't come naturally. His previous workflow was typing content into ChatGPT one message at a time, copy-pasting back to Outlook. It worked, but a single email took 10-15 minutes.
So the requirement wasn't "automate my email." It was: let me drive an unfamiliar workflow (English email + file management) from a familiar interface (Chinese, messaging platform).
This requirement definition is crucial — it shaped every architecture decision that followed. The agent's core value isn't "automation," it's "bridging" — bridging language, bridging work habits, bridging platforms.
Architecture Overview
┌─────────────┐
│ Outlook │
│ (emails) │
└──────┬──────┘
│ Microsoft Graph API
▼
┌────────────────────────┐
│ Email Pipeline │
│ (every 5 min) │
│ │
│ 1. Incremental fetch │
│ 2. Rule pre-filter │
│ 3. LLM classify │
│ 4. Extract fields │
│ 5. Dedup & merge │
│ 6. Store → Supabase │
│ 7. Notify boss │
└────────────┬───────────┘
│
┌────────────▼───────────┐
│ Supabase PostgreSQL │
│ │
│ inquiries │
│ bot_sessions │
│ memories_global │
│ customer_memories │
└────────────┬───────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│ Discord │ │Telegram │ │WhatsApp │
│ Bot │ │ Bot │ │ Webhook │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└────────┬────────┘─────────┬───────┘
│ │
┌──────▼──────┐ ┌──────▼──────┐
│ Chat Agent │ │ OneDrive │
│ (LLM) │ │ Watcher │
│ 20+ tools │ │ (every 10m) │
└─────────────┘ └─────────────┘
Three layers:
- Data ingestion — Email pipeline + OneDrive watcher pull external data automatically
- Storage — Supabase holds all inquiry state, session history, and extracted memories
- Interaction — The boss talks to the agent through any messaging platform; the agent executes through 20+ tools
Architecture Evolution: Three Phases, Three Rewrites
I didn't plan it all out upfront. The project went through three architecture-level rewrites.
Phase 1: Classifier + Router (v1, lasted one day)
The v1 approach was textbook:
Boss sends message → Classifier determines intent → Route to handler → Return result
The classifier output a JSON: {"action": "stats"} or {"action": "advance"}. Hardcoded handlers executed the corresponding operation.
# v1: classifier + router (deprecated)
async def handle_message(message):
intent = await classify_intent(message.content)
handler = ROUTE_MAP.get(intent["action"])
if handler:
return await handler(message, intent)
return "Unsupported operation"
ROUTE_MAP = {
"stats": handle_stats,
"advance": handle_advance,
"draft": handle_draft,
"files": handle_files,
}
The problem? Real scenarios aren't single-step.
The boss says "check Steward's floor plan." The classifier routes to "files," searches OneDrive — nothing found. Done.
But the correct approach is: search customer → check OneDrive → not there → search email attachments → found the PDF → return it. That's a multi-step reasoning chain. A classifier can't do that.
V1 broke on day one. Commit 6a3df39 ripped out the entire classifier architecture.
Phase 2: Agentic Tool-Calling (still running)
Replaced with a ReAct pattern: the agent gets a set of tools and decides the calling order itself.
Same "check Steward's floor plan" scenario:
Agent → search_customer("Steward") → found customer
Agent → list_customer_files(customer_id) → nothing on OneDrive
Agent → search_email_attachments(customer_id) → found PDF in email
Agent → download_and_show_attachments(attachment_id) → sent image to boss
Four autonomous steps. No hardcoded routing needed.
The core loop:
async def agent_loop(user_message, session, channel):
messages = build_context(session, user_message)
for round_num in range(MAX_ROUNDS := 15):
response = await llm_call(
MODEL_TIER_3,
messages=messages,
tools=ALL_TOOL_DEFINITIONS
)
if response.has_tool_calls:
for tool_call in response.tool_calls:
result = await execute_tool(tool_call, channel)
messages.append({"role": "tool", "content": result,
"tool_call_id": tool_call.id})
messages.append(response.assistant_message)
else:
await channel.send_text(response.content)
return
await channel.send_text("⚠️ Too many steps — please simplify the request")
The trade-off: cost and safety. Every LLM call costs money, and one task calling 4-5 tools adds up. And once the agent has "free will," the ways it can go wrong multiply — more on that below.
Phase 3: Platform Expansion (Discord → Telegram → WhatsApp)
The boss initially used Discord (because I developed on Discord). But he actually uses Telegram and WeChat day-to-day.
This led to the messaging abstraction layer:
class MessageChannel(ABC):
"""Unified interface for all messaging platforms"""
@abstractmethod
async def send_text(self, text: str) -> None: ...
@abstractmethod
async def send_file(self, file_path: str, filename: str) -> None: ...
@abstractmethod
async def send_preview(self, long_text: str) -> None: ...
Three implementations:
| Platform | Char limit | Key implementation details |
|---|---|---|
| Discord | 1990 chars | Native threads, emoji formatting |
| Telegram | 4096 chars | Group chat support, boss-ID whitelist for approvals, HTML formatting |
| 4096 chars | Meta Cloud API webhook, media upload, phone verification |
send_preview() chunks long responses at paragraph boundaries so nothing gets cut mid-sentence. Chunking logic adapts per platform's character limit.
Adding a new platform (Slack, Teams) only requires implementing this interface — zero changes to agent logic.
Lesson: build the MVP on the platform your user actually uses, not the one you're comfortable with.
Three-Tier Model Routing: The Key to Cost Control
The first month I used the most capable model for everything. The bill was a wake-up call.
Core principle: every task has a minimum intelligence threshold. If a cheap model can handle it, don't use an expensive one.
| Tier | Role | Use case |
|---|---|---|
| Tier 1 | Fast & cheap small model | Email classification, memory dedup, rule filtering |
| Tier 2 | Mid-tier model | Field extraction, email composition |
| Tier 3 | Full reasoning large model | Agent reasoning, complex decisions |
# config.py
MODEL_TIER_1 = "..." # fast & cheap
MODEL_TIER_2 = "..." # mid-tier
MODEL_TIER_3 = "..." # full reasoning
All calls go through a unified LLM routing API. Shared httpx connection pool, async semaphore (max 10 concurrent), exponential backoff on 429/5xx (3 retries).
# llm_client.py
_semaphore = asyncio.Semaphore(10)
async def llm_call(model, messages, tools=None, max_retries=3):
async with _semaphore:
for attempt in range(max_retries):
try:
resp = await _http_client.post(
LLM_API_ENDPOINT,
json={"model": model, "messages": messages, "tools": tools},
headers={"Authorization": f"Bearer {API_KEY}"}
)
if resp.status_code == 429:
await asyncio.sleep(2 ** attempt)
continue
return resp.json()
except httpx.TimeoutException:
if attempt == max_retries - 1:
raise
Classifying whether an email is a real inquiry or spam — a small model with a good prompt is just as accurate as a large model, at a fraction of the cost.
JSONB vs Relational Tables
The inquiries table's most important column is context_data (JSONB):
CREATE TABLE inquiries (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email_id TEXT UNIQUE NOT NULL,
classification TEXT CHECK (classification IN ('A', 'B', 'C')),
confidence FLOAT,
status TEXT CHECK (status IN ('new', 'pending', 'scheduled', 'quoted', 'closed')),
customer_name TEXT,
customer_email TEXT,
customer_phone TEXT,
address TEXT,
suburb TEXT,
product_types TEXT[],
product_description TEXT,
dimensions TEXT,
urgency TEXT DEFAULT 'medium',
notes TEXT,
raw_subject TEXT,
raw_body TEXT,
context_data JSONB,
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now()
);
context_data structure:
{
"emails": [
{
"email_id": "AAMkAGI...",
"subject": "Quote request - sliding doors",
"body": "Hi, I need 3 sliding doors...",
"received_at": "2024-03-15T10:00:00Z"
}
],
"pipeline": {
"current_stage": "inquiry",
"stage_updated_at": "2024-03-15T10:00:00Z",
"history": [
{
"from": "new",
"to": "inquiry",
"at": "2024-03-15T10:00:00Z",
"reason": "Auto-classified as A-type",
"triggered_by": "pipeline"
}
]
},
"memories": [
{"type": "customer_note", "content": "Budget is limited, wants basic spec"}
],
"has_dimensions": true,
"glass_type": "double IGU",
"language": "en"
}
Why not relational? I started with a normalized design — emails table, pipeline_history table, memories table, foreign keys. After a week I realized the problem: every time the agent needs to reply to a customer, it needs the full email thread, pipeline state, and memories loaded into the prompt. That means JOINing 3-4 tables every time.
With JSONB, one query gets the full context. The trade-off: you lose cross-customer relational queries ("show me all emails with subject X"). For this business, that query never exists — we always operate on a single customer dimension.
Design your database around access patterns, not normal forms.
Email Pipeline: Step by Step
Runs every 5 minutes:
Step 1: Incremental Fetch
messages = await graph_client.get(
"/me/mailFolders/inbox/messages",
params={
"$filter": "isRead eq false",
"$orderby": "receivedDateTime desc",
"$top": 50,
"$select": "id,subject,body,from,receivedDateTime,hasAttachments"
}
)
Pit: Graph API pagination uses @odata.nextLink — must handle explicitly. $top defaults to 10, easy to miss emails. Timezone: API returns UTC, boss thinks in AEST. Convert at the storage layer.
Step 2: Rule Pre-filter (Zero LLM Cost)
def pre_filter(email):
if email.headers.get("X-Auto-Reply"):
return "skip"
if email.from_domain in KNOWN_SPAM_DOMAINS:
return "skip"
if "out of office" in email.subject.lower():
return "skip"
return "needs_llm"
Step 3: LLM Classification
The Tier 1 small model classifies what gets through:
- A: Needs physical measurement (has address, custom product)
- B: Can quote remotely (has dimensions, standard product)
- R: Reply to existing thread
- C: Not an inquiry (marketing, supplier, internal)
Pit: R vs A boundary. An existing customer replying to an old thread with a new project gets classified as R → merged into the old record → almost missed as a new lead. Fix: secondary check — if the reply body contains new addresses or product types, treat as A regardless.
Step 4: Field Extraction
The Tier 2 mid-tier model parses: name, email, phone, address, suburb, product types, dimensions, urgency, language.
Pit: Australian phone numbers. Mobile: 04xx. Landline: (02) xxxx. International: +61 4xx. Same number, three formats. Normalize to E.164 before storage, or dedup treats one customer as three.
Step 5: Dedup & Merge
existing = await db.search_by_email(sender_email)
if existing:
existing.context_data["emails"].append(new_email_data)
await db.update_inquiry(existing.id, context_data=existing.context_data)
else:
await db.create_inquiry(extracted_fields)
Step 6: Strip Reply Chains
Incoming emails often carry the full reply history. Passing all of it to the LLM means paying for tokens already stored in context_data.emails[].
REPLY_PATTERNS = [
r"^>.*$", # Gmail style
r"^From:.*\nSent:.*\nTo:.*", # Outlook style
r"^-+\s*Original Message\s*-+", # Generic
r"^On .+ wrote:$", # Apple Mail
]
No universal standard. My multi-pattern stripper covers the major cases; edge cases still slip through. Good enough > perfect.
Lessons from 97 Git Commits
These aren't retrospective "best practices." They're fix records from actual production failures.
Lesson 1: The Agent Lies — "Email Sent" Hallucination
Commit bed80c7: prevent false send status reports
The agent drafted an email. The tool returned draft content. Then the agent told the boss: "Email has been sent to the customer."
It hadn't been sent. The agent confused "drafted" with "sent."
In B2B, if the boss thinks an email was sent, he won't follow up. A $50,000 order could be lost to this single hallucination.
The fix wasn't adding "don't lie" to the prompt. It was forcing the tool return value:
# draft_email tool return
return "✅ DRAFT_PENDING_APPROVAL — Email drafted, NOT yet sent. Boss must reply '发送' to confirm."
This text isn't for the human — it's for the agent. On its next reasoning step, it reads this tool result and won't misjudge the state.
Additionally, send_email only executes when matching a valid draft_id:
async def send_email(draft_id: str):
draft = await get_draft(draft_id)
if not draft or draft.status != "PENDING_APPROVAL":
return "❌ No pending draft found"
await outlook.send(draft.message_id)
draft.status = "SENT"
return f"✅ Email sent to {draft.to_email}"
Lesson 2: Ambiguous Confirmation = Accidental Send
Commit 7399e7b: tighten send trigger words
The boss says "好" (ok), "可以" (sure), "ok" — in Chinese context, this might just mean "I see," not "confirm send." But the agent interpreted it as a send command.
Fix: send triggers restricted to "发送", "发", "send", "确认发送". Removed "好", "可以", "ok", "yes" entirely.
SEND_TRIGGERS = {"发送", "发", "send", "确认发送"}
# Removed: "好", "可以", "ok", "yes", "行"
Also added sent-items dedup: if an email was sent to this customer in the last 14 days, draft_email warns proactively.
Lesson 3: Agent Amnesia — Forgetting Which Customer
Commit 7195bf8: inject current customer context
Boss asks "how's Billy doing?" Agent checks Billy's record, replies. Then boss says "draft him a reply." Agent: "who's 'him'?"
Root cause: context was rebuilt each turn without carrying over "the customer currently under discussion."
Fix:
def build_context(session, new_message):
context = []
if session.current_customer:
context.append({
"role": "system",
"content": f"[Current customer: {session.current_customer.name}]"
})
context.extend(format_memories(session.memories))
context.extend(compress_session(session.messages))
context.append({"role": "user", "content": new_message})
return context
Lesson 4: Notion Was Dead Weight — Removing an Entire Layer
Commit 486fc1a: remove Notion dependency
V1 used Discord + OneDrive + Supabase + Notion. Notion served as a CRM dashboard.
The problem: the boss never opened Notion. All his interactions happened on Discord/Telegram. Notion became a sync target that only I maintained and nobody read. Every pipeline state change triggered a Notion sync, and the sync logic was complex due to Notion API limitations.
Removed the entire Notion layer on day three. Database = Supabase only. UI = messaging platforms only.
Lesson: don't add a component because it "looks professional." If the user doesn't use it, the maintenance cost is pure liability.
Lesson 5: Context Management Is a Continuous Battle
There are a dozen+ commits related to context, spanning the entire project:
286e30c: P0+P1 context optimization
- Duplicate boss messages in context (build_context timing bug)
- Customer search returning full context_data JSONB (several KB) when only name and email were needed
- Session history with no time limit — month-old conversations still in context
Estimated savings: ~200 tokens per call + 80% reduction in database transfer.
889c09c: increase context retention limit
- Over-compressed context to save tokens → agent responses became "brainless" — not enough context for good decisions
- Found the balance: last 20 messages in full + older ones as compressed summary
d433fed: preserve last agent response in full
- Boss often sends very short follow-ups: "and then?", "how much?"
- If the agent's own previous response was truncated, it can't connect these short messages
This isn't a one-time fix. Context management is something you rebalance every time you add a feature.
Lesson 6: Batch Operations Can't Go Through the Agent
Commit cfe0e41: split batch_draft_emails
The boss wanted to send emails to 92 contacts. I initially built this as an agent tool — agent reads CSV, checks dedup, generates emails, batch sends.
The agent, processing 92 rows of CSV in its context, started fabricating email addresses. Not malformed — plausible-looking addresses that simply didn't exist.
Fix: extract batch operations into deterministic Python workflows:
# workflows/batch_email.py
async def batch_email_workflow(csv_path, template):
# Step 1: Parse CSV — pure Python, zero LLM
contacts = parse_csv(csv_path)
# Step 2: Check sent emails — pure API call, zero LLM
sent = await get_sent_emails_last_n_days(14)
sent_addresses = {e.to_email for e in sent}
# Step 3: Filter
to_send = [c for c in contacts if c.email not in sent_addresses]
# Step 4: Generate emails — only this step uses LLM, one call per email
for contact in to_send:
draft = await llm_call(MODEL_TIER_2, compose_prompt(template, contact))
await save_draft(contact, draft)
return f"Generated {len(to_send)} drafts, pending approval"
The agent initiates the workflow, but the data pipeline is deterministic Python — never passes through LLM context.
Core principle: LLM makes decisions, Python moves data. Any step that doesn't need "creativity" must never touch the LLM.
Lesson 7: Message Formatting Can't Depend on Prompts
Commit e1917d1: three-layer Telegram formatting
I initially told the system prompt: "Reply in Telegram HTML format." Sometimes the agent output Markdown, sometimes HTML, sometimes a hybrid. Telegram's HTML parser is strict — one unclosed <b> tag and the entire message fails to send.
Final solution: three-layer fallback, completely prompt-independent:
def _md_to_tg_html(text: str) -> str:
"""Markdown → Telegram HTML, done in code, not LLM"""
# 1. Protect code blocks
code_blocks = []
text = re.sub(r'```(\w*)\n(.*?)```',
lambda m: _save_code_block(m, code_blocks), text, flags=re.DOTALL)
# 2. Escape HTML special chars
text = text.replace('&', '&').replace('<', '<').replace('>', '>')
# 3. Convert Markdown syntax
text = re.sub(r'\*\*(.+?)\*\*', r'<b>\1</b>', text)
text = re.sub(r'\*(.+?)\*', r'<i>\1</i>', text)
text = re.sub(r'`(.+?)`', r'<code>\1</code>', text)
# 4. Tables → mobile-friendly format
text = convert_tables_to_dots(text)
# 5. Restore code blocks
text = restore_code_blocks(text, code_blocks)
return text
async def send_text(self, text: str):
html = _md_to_tg_html(text)
try:
await self.bot.send_message(self.chat_id, html, parse_mode="HTML")
except TelegramError:
plain = strip_all_tags(html)
await self.bot.send_message(self.chat_id, plain)
Lesson: never rely on the LLM's output format. Let the LLM output freely; handle formatting in code.
Lesson 8: Discord Concurrency Bug Chain
The densest fix sequence in Git history (commits 2b8011c → eb0bf9c → 4f03d6f) — three bugs in three days, all Discord bot concurrency issues:
- Added a lock to prevent concurrent processing → lock blocked all @mentions
- Switched to
_should_respondflag → race condition, all messages flagged as "don't process" - Switched to
_imlang_handledattribute → discord.py Message objects don't support custom attributes → crash
Final fix: removed all the "clever" concurrency control. Simple approach — each message processed independently, concurrency limited by asyncio.Semaphore.
_processing_semaphore = asyncio.Semaphore(3)
async def on_message(message):
if not should_respond(message):
return
async with _processing_semaphore:
await process_message(message)
Lesson: in concurrent systems, "simple and brutal" often beats "clever and elegant."
The Complete Tool Set
20+ tools in four categories:
Query: search_customer (fuzzy search), query_stats (today's numbers), get_pending (pending items), scan_mailbox (full email thread analysis), search_emails (conditional search)
Files: list_customer_files (OneDrive listing), search_email_attachments (email attachment search), download_and_show_attachments (download + preview), read_onedrive_file (read Excel/PDF/text), create_customer_folder
Pipeline: advance_stage (progress state), audit_pipeline (consistency check)
Email: draft_email (bilingual draft), send_email (send approved draft), forward_to_factory (package and forward), batch_draft_emails, create_calendar_event
Tools defined in OpenAI function-calling format:
{
"name": "search_customer",
"description": "Fuzzy search customers by name, email, or phone",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Customer name, email, or phone to search"
}
},
"required": ["query"]
}
}
Tool count grew from 0 (hardcoded actions in v1) to 20+ incrementally. Each tool was added because the boss hit a need that existing tools couldn't serve.
The Draft-Approval Mechanism
The most critical safety design in the entire system:
1. Boss: "Reply to Billy, quote $15,000"
2. Agent → search_customer("Billy") → found
3. Agent → draft_email(to=billy@..., body=...) → bilingual preview
4. Agent shows boss:
[Chinese translation] — for boss to review content
[English body] — what actually gets sent
⚠️ NOT SENT — reply "发送" to confirm
5. Status → DRAFT_PENDING_APPROVAL
6. Boss says "发送" → send_email(draft_id=xxx)
Boss says "change it" → back to step 3
Never auto-send. In B2B manufacturing, one wrong email can lose a $50,000 order.
Memory System
After every conversation, MemoryWatcher runs asynchronously (doesn't block the response):
async def extract_and_store_memories(conversation, customer_id=None):
memories = await llm_call(
MODEL_TIER_1, # cheapest model for extraction
f"Extract long-term facts from this conversation:\n{conversation}"
)
for memory in memories:
if await is_semantic_duplicate(memory):
continue
if memory.type == "boss_pref":
# "Always include GST in quotes" → global
await store_global_memory(memory)
elif memory.type == "customer_note":
# "Billy's budget is limited" → customer-specific
await store_customer_memory(customer_id, memory)
elif memory.type == "business_rule":
# "Melbourne shipping +$200" → global
await store_global_memory(memory)
Memories inject into the next session's system prompt. The agent accumulates business knowledge over time.
Semantic dedup uses the small model — "Billy's budget is limited" and "Billy has a tight budget" are the same memory. Not perfect, but prevents memory bloat.
I added this in week three. For the first two weeks, the boss repeated the same preferences 20+ times, and the agent acted like it was day one every time. If I started over, memory system goes in on day one.
Session Management: Progressive Compression
session = {
"messages": deque(maxlen=40),
"summary": "",
"customer_context": {}
}
def compress_session(messages):
msgs = list(messages)
# Oldest 10 → summary (~200 tokens)
old = msgs[:10]
summary = await summarize(old)
# Middle 20 → keep text but truncate tool returns
middle = msgs[10:30]
for msg in middle:
if msg["role"] == "tool":
msg["content"] = truncate(msg["content"], max_chars=500)
# Newest 10 → full fidelity (including agent's own complete response)
recent = msgs[30:]
return [{"role": "system", "content": summary}] + middle + recent
Pit: Compression is lossy. Important info can get summarized away. Mitigation: MemoryWatcher extracts key facts into persistent memories before the compression window moves. Not perfect, but a practical trade-off.
Pipeline Tracking & OneDrive Sync
8-stage state machine with full audit trail:
inquiry → measure → quoting → follow_up → ordered → production → install → done
Timeout thresholds per stage:
| Stage | Timeout | Action |
|---|---|---|
| inquiry | 2 days | Remind boss |
| measure | 5 days | Schedule measurement |
| quoting | 7 days | Follow up on quote |
| follow_up | 14 days | Suggest re-contact |
| ordered | 3 days | Confirm factory received |
| production | 21 days | Check production progress |
| install | 7 days | Schedule installation |
OneDrive folders organized by stage:
OneDrive/
├── 未报价顾客/ ← inquiry, measure
│ ├── Billy Smith/
│ └── John Lee/
├── 厂家已报价/ ← quoting
│ └── Sarah Chen/
├── 已报价顾客/ ← follow_up
│ └── Mike Wang/
└── 已确定并付定金顾客/ ← ordered through done
└── David Liu/
OneDrive Watcher checks delta API every 10 minutes. If a folder moves to a new stage directory, the database pipeline status updates automatically.
Pit: OneDrive delta API can detect "something changed" but can't distinguish "file rename" from "file move" — they look identical in the response. I wrote path-comparison logic: extract parent folder name; if it changed from 未报价顾客/ to 厂家已报价/, that's a stage transition. If the parent didn't change, it's a rename — no state update triggered.
Voice Message Transcription
The boss prefers sending voice messages over typing. Added voice transcription (commits bd8bcdc → 5605df3):
Using the Whisper model for speech recognition, connected via a free API endpoint.
Flow: receive voice message → download audio → Whisper → text → process as regular text message. The boss doesn't notice the difference — voice and text work identically.
The Bilingual Factor
This isn't one module's problem — it permeates everything:
- Two sets of prompts: Chinese for understanding, English for email generation
- Error messages in Chinese: English errors confuse the boss
- Bilingual email previews: Chinese for boss to review, English for the customer
- Business culture differences: "麻烦您尽快回复" ≠ "Please reply at your earliest convenience"
- Product terminology mapping: 推拉门 = sliding door, 平开窗 = casement window
System prompt includes product knowledge (AS2047 standards, product specs, building codes) so the agent can reference correct technical parameters.
Budget 2-3x the time of a monolingual project.
Scheduling
scheduler = AsyncIOScheduler(timezone="Australia/Sydney")
scheduler.add_job(run_email_check, "interval", minutes=5)
scheduler.add_job(run_onedrive_watch, "interval", minutes=10)
scheduler.add_job(run_evening_schedule, "cron", hour=20) # 8 PM briefing + route
scheduler.start()
Every evening at 8 PM AEST:
- Query all pending A-type inquiries
- Generate optimized measurement route from addresses
- Compile briefing: customer name, address, product interest, notes
- Push to boss's active messaging platform
The boss starts each morning knowing exactly where to go.
Deployment
| Component | Choice |
|---|---|
| Agent runtime | Railway (Python worker) |
| Database | Supabase |
| LLM calls | Unified routing API |
| Voice transcription | Whisper API |
| Website | Vercel (Next.js) |
Railway deployment: worker: python main.py. Nixpacks build with CJK font support (agent generates Chinese PDFs).
Pit: Railway cold start — 10-15 second wait after idle. Scheduled jobs aren't affected, but if the boss messages during cold start, he waits. Fix: heartbeat keepalive, worker never sleeps.
Tech Stack
| Component | Technology |
|---|---|
| Runtime | Python 3.12, asyncio |
| LLM routing | Unified routing API (three-tier models) |
| Database | Supabase PostgreSQL |
| Microsoft Graph API (Outlook) | |
| Files | OneDrive API |
| Messaging | Discord.py / python-telegram-bot / Meta WhatsApp Cloud API |
| Voice | Whisper API |
| Scheduling | APScheduler |
| HTTP | httpx (async, connection pooling) |
| Deployment | Railway |
| Website | Next.js, React, Tailwind CSS, Vercel |
If I Started Over
- Start with Telegram/WhatsApp. Build where the user is, not where you're comfortable.
- Memory system on day one. An agent that can't remember is an agent users lose trust in.
- JSONB from the start. Wasted a week on relational tables.
- Batch operations as workflows from day one. LLMs fabricate data at scale.
- Format in code, not prompts. Never rely on LLM output format.
- Context management is continuous. Every new feature requires rebalancing.
- Budget 2.5x for bilingual. It's not just translation — it's dual-track interaction at every layer.
Conclusion
97 commits, 12 days (from v1 on March 25 to the latest commit on April 6). From a classifier-router toy to a production system the boss uses daily.
The core takeaway: the hard part of building an AI agent isn't the model calls. It's:
- Understanding the user's real pain (not "too many emails" — it's "unfamiliar workflow + language barrier")
- Handling LLM unreliability (hallucination, format inconsistency, context bloat)
- Balancing agent autonomy against system safety
- Continuous iteration (in 97 commits, fixes outnumber features)
Questions welcome.
Parker Kuang | QKAi Studio | qkai.com.au
Read next
Place this article inside the wider QKAI perspective
From First Principles to an Agent Harness
I did not start with LangGraph or any agent framework. I started with a real business system, and by solving scheduling, tool orchestration, messaging abstraction, and external APIs from first principles, I ended up building my own agent harness.
The Robotics Flywheel Starts Before the Robot
The real robotics flywheel starts long before the robot — with workflow redesign, data structure, AI agents, and operational discipline. The hardware comes last.
Why High Labour Costs in Australia Are Making AI Automation a Business Priority
High labour costs in Australia are pushing SMEs to rethink admin automation, operations automation, lead qualification, quote automation, and AI agent workflows before adding more headcount.
Next step
If the problem is workflow, not just content
QKAI can help assess which enquiry, quoting, follow-up, CRM, or support workflows should be redesigned before another manual layer is added.