How the AI System Was Designed
Every design decision in Orchestrate started with the same question: what does it actually take for AI to understand a vague human goal and turn it into something actionable? The answer didn't arrive fully formed. It came through three rounds of building, breaking, and rebuilding.
How the system evolved
Stage 1 — The naive test: “Act as a goal planner”
The very first prototype was a ChatGPT conversation with a role-play prompt. The output was generic, inconsistent, and often confidently wrong — but it proved the core hypothesis. People didn't just want help planning; they couldn't start planning at all.
That test established the baseline: unstructured AI with no design produces noise. The problem wasn't the model — it was the complete absence of constraints, structure, or intent.
Stage 2 — One structured prompt via the Gemini API
Stage 2 was a single, carefully engineered prompt running through the Gemini API, with a defined output format, domain-aware instructions, and built-in guardrails. Testing it against 50+ real goals broke it in three predictable ways:
| Failure | What happened | Example |
|---|---|---|
| Analysis failures | Vague goals returned generic lists with no clarification | "Get better at design" → generic task list, no questions asked |
| Breakdown inconsistencies | Domain-specific quality was unpredictable | PMP: accurate. Niche specialisations: hallucinated steps |
| Validation gaps | Unrealistic timelines accepted with no warning | "Launch a startup in 2 weeks" — approved |
One prompt was trying to do three cognitive jobs at once: understand the goal, generate the plan, and catch its own mistakes. In practice, it couldn't do all three reliably. A single prompt has a ceiling. That's what Stage 3 solved.
Stage 3 — Separation of concerns with Google ADK
The fix wasn't a smarter prompt. It was moving to an agent framework — Google ADK — and splitting the work into four specialised agents, each with one job and clear handoff rules.


A note on timing: when I built this in mid-2025, separating concerns across multiple specialised prompts was the reliable way to get consistent quality. Model capabilities have moved fast since — a single well-structured prompt with a frontier model could likely handle this now. But the design principle still holds: decomposition beats complexity, and the architecture made every piece independently testable and improvable.
In action
Key flow for the main job to be done (JTBD):
“When I feel overwhelmed and unsure how to tackle a complex goal or task, I want to easily break it down and see a clear path forward, so I can start making progress without feeling stuck.”

Architecture + prompt design: what each agent does and why
Architecture defines what each step does. Prompt design defines how it thinks. The two can't be separated.
Orchestrator — routes, never creates
Pure routing logic. Decides which agent handles each user message. Never analyses or generates content itself. Hard rule: always routes to Validator after Breakdown, no exceptions. Re-edit requests loop back to Breakdown with the original plan attached. This is still an LLM call — natural language intent isn't truly deterministic — but the prompt is constrained to routing only.
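Because the routing step is an LLM call, one way to keep the hard rules from drifting is a thin deterministic check wrapped around the model's proposed route. The sketch below is illustrative only: `Turn`, `enforce_routing`, and the agent labels are hypothetical names, not part of Google ADK, and the original system is described as enforcing these rules in the prompt itself.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical agent labels; the Orchestrator itself never produces content.
AGENTS = {"analyzer", "breakdown", "validator"}


@dataclass
class Turn:
    last_agent: Optional[str]            # agent that handled the previous step
    llm_choice: str                      # route proposed by the Orchestrator LLM
    is_reedit: bool = False              # user asked to change an existing plan
    original_plan: Optional[dict] = None


def enforce_routing(turn: Turn) -> dict:
    """Apply the hard routing rules on top of the LLM's proposed route."""
    # Hard rule: Breakdown output always goes to the Validator next.
    if turn.last_agent == "breakdown":
        return {"agent": "validator"}
    # Re-edit requests loop back to Breakdown with the original plan attached.
    if turn.is_reedit:
        if turn.original_plan is None:
            raise ValueError("re-edit request needs the original plan")
        return {"agent": "breakdown", "original_plan": turn.original_plan}
    # Otherwise accept the LLM's choice, as long as it names a known agent.
    if turn.llm_choice not in AGENTS:
        raise ValueError(f"unknown agent: {turn.llm_choice!r}")
    return {"agent": turn.llm_choice}
```

With a check like this, a routing mistake by the model after Breakdown is overridden rather than passed on to the user.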
Analyzer — asks the right questions, knows when to stop
Extracts context from the user's goal and asks clarifying questions until it has enough to proceed. The key design challenge was iteration: if a user adds context in a follow-up, the Analyzer merges it with what it already knows rather than starting over.
Confidence threshold drives the loop: below 0.7, ask another question. Above 0.9, proceed to Breakdown. In between, surface what's missing and let the user decide. These thresholds came from testing — below 0.7, breakdowns were consistently too generic. Above 0.9, additional questions didn't materially improve the output; they just frustrated users.
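The three-way threshold logic can be written down as a small function. This is a minimal sketch, assuming the Analyzer returns a confidence score between 0 and 1; the function name and the returned labels are hypothetical.

```python
ASK_BELOW = 0.7      # below this, breakdowns were consistently too generic
PROCEED_ABOVE = 0.9  # above this, extra questions stopped improving output


def next_step(confidence: float) -> str:
    """Map an Analyzer confidence score to the next action in the loop."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence < ASK_BELOW:
        return "ask_question"
    if confidence > PROCEED_ABOVE:
        return "proceed_to_breakdown"
    # 0.7 to 0.9 inclusive: show what is missing and let the user decide.
    return "surface_gaps_and_let_user_decide"
```

Scores of exactly 0.7 and 0.9 fall in the middle band here, which matches the "below 0.7" and "above 0.9" wording.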
Breakdown — UX research encoded as prompt rules
Generates the personalised task hierarchy. This is where research directly shaped AI behaviour — the prompt doesn't just ask for tasks, it enforces specific principles:
First 3 tasks: 15–30 min each. Users need an early win to overcome starting paralysis. Hard rule, not a suggestion.
Progressive difficulty: Month 1 habit-forming, Month 2–3 skill-building, Month 4+ goal-approaching. Prevents front-loading complexity.
Domain-aware confidence: marathon training scores 0.9+ (established practices); niche creative goals score 0.5–0.69 (genuine uncertainty). Mapped explicitly in the prompt.
5-task cap per milestone. Overflow tasks are surfaced separately with explicit skip-risk warnings — not silently added or dropped.
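Two of these rules (the early-win window and the task cap) are mechanical enough to check outside the prompt. The following is a sketch under assumptions: the `Task` and `Milestone` shapes are hypothetical, and the original system is described as enforcing the rules through prompt instructions, not through code like this.

```python
from dataclasses import dataclass, field

MAX_TASKS = 5               # cap per milestone
EARLY_WIN_COUNT = 3         # the first three tasks of the plan
EARLY_WIN_MINUTES = (15, 30)


@dataclass
class Task:
    title: str
    minutes: int


@dataclass
class Milestone:
    name: str
    tasks: list[Task]
    overflow: list[Task] = field(default_factory=list)


def apply_task_cap(milestone: Milestone) -> Milestone:
    """Move tasks beyond the cap into an explicit overflow list."""
    return Milestone(
        name=milestone.name,
        tasks=milestone.tasks[:MAX_TASKS],
        overflow=milestone.overflow + milestone.tasks[MAX_TASKS:],
    )


def early_win_violations(milestones: list[Milestone]) -> list[str]:
    """Titles among the plan's first three tasks outside the 15-30 min window."""
    ordered = [task for m in milestones for task in m.tasks]
    low, high = EARLY_WIN_MINUTES
    return [t.title for t in ordered[:EARLY_WIN_COUNT] if not low <= t.minutes <= high]
```

Keeping overflow as its own list means extra tasks stay visible, so they can be shown with a skip-risk warning instead of being dropped.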
Validator — last line of defence before the user sees anything
Decides whether the user should ever see the plan. Three outcomes only: approve, approve with warnings, or reject silently and route back to Breakdown. Users never see a rejected plan.
| Outcome | When | What happens |
|---|---|---|
| Approve | Realistic timeline, actionable tasks, no dangerous advice | Show plan with supportive message |
| Approve with warnings | Ambitious but possible, or specialised domain | Show plan + specific caution (e.g. "Consult a physio first") |
| Reject | Dangerous timeline, harmful advice, or major errors | Route back to Breakdown silently. User never sees it. |
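The three outcomes and the silent reject loop can be sketched as follows. The `generate` and `validate` callables stand in for the Breakdown and Validator agents, and `max_attempts` is an assumption added for illustration: the source does not say what happens if a plan keeps getting rejected.

```python
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    APPROVE = "approve"
    APPROVE_WITH_WARNINGS = "approve_with_warnings"
    REJECT = "reject"


def plan_to_show(
    generate: Callable[[], dict],
    validate: Callable[[dict], tuple[Outcome, list[str]]],
    max_attempts: int = 3,  # assumed retry limit, not from the original design
) -> Optional[dict]:
    """Return the first non-rejected plan, or None if every attempt is rejected."""
    for _ in range(max_attempts):
        plan = generate()
        outcome, warnings = validate(plan)
        if outcome is Outcome.REJECT:
            continue  # silent: regenerate, the user never sees this plan
        shown = warnings if outcome is Outcome.APPROVE_WITH_WARNINGS else []
        return {"plan": plan, "warnings": shown}
    return None  # caller decides on a fallback message
```

Returning `None` after repeated rejections keeps the "users never see a rejected plan" guarantee while leaving the fallback behaviour to the caller.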
The design decision: confidence scoring as a UX principle. In A/B testing across 30 sessions, self-reported trust jumped from 3.1 to 4.3/5 when confidence scores were shown. Users didn't need the AI to be more accurate — they needed it to be honest about what it didn't know.
Core insight: complex AI behaviours require decomposition, not just more sophisticated prompts.
How I Chose the Model
When I moved to the multi-agent architecture, it was the right moment to re-evaluate the model choice. At MVP stage, fine-tuning or training a custom model was unreasonable. I used in-context learning, shaping model behaviour through prompt design: faster to iterate, easier to adjust, no training infrastructure needed.
What I tested and how I measured it
I tested three candidate models — Claude Sonnet, GPT-4o, and Gemini Pro — against 50+ real-world goals across four domains: language learning, fitness, career progression, and creative projects. Rather than relying on benchmark scores, I measured what actually mattered:
Structured output reliability — percentage of outputs parseable automatically without post-processing
Breakdown quality — rated 1–5 on specificity, sequencing, and actionability
Response consistency — variance in quality across repeated runs with the same goal
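Two of these measures reduce to simple arithmetic. The sketch below assumes the structured output is JSON (the source only says "parseable automatically") and that quality ratings per goal are already collected; the function names are hypothetical.

```python
import json
from statistics import pvariance


def parse_rate(raw_outputs: list[str]) -> float:
    """Percentage of raw model outputs that parse as JSON without cleanup."""
    if not raw_outputs:
        raise ValueError("need at least one output")
    parsed = 0
    for raw in raw_outputs:
        try:
            json.loads(raw)
            parsed += 1
        except json.JSONDecodeError:
            pass  # counts as a structured-output failure
    return 100.0 * parsed / len(raw_outputs)


def consistency(scores_by_goal: dict[str, list[float]]) -> dict[str, float]:
    """Variance of the 1-5 quality ratings across repeated runs of each goal."""
    return {goal: pvariance(scores) for goal, scores in scores_by_goal.items()}
```

A lower variance for a goal means repeated runs produced breakdowns of more similar quality.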
Why GPT-4o for the MVP — and what I'd change now
All three models could handle the core task. Quality differences were marginal; the real differentiators were cost and how quickly I could ship. GPT-4o was the cheapest option that didn't compromise on output quality. It had no exotic technical edge: it was the pragmatic MVP choice that let me focus on the product instead of the model.
Trade-off accepted: Claude had stronger reasoning on nuanced, ambiguous goals, exactly the kind Orchestrate handles most. But I needed to move and test faster.
Gemini dropped: 50% cheaper but output inconsistency would require additional validation layers, negating the cost advantage entirely.
Next iteration: Claude Sonnet for all goal processing (the reasoning gap matters more as goals get complex), extended thinking mode for high-stakes or ambiguous goals, a lighter model for simple confirmations.
Model selection is a product decision, not just an engineering one. At MVP, the question isn't “which model is best?” — it's “which model lets me ship and learn fastest?”
I Tried to Break It Before Users Could
The mindset shift that changed everything: stop asking “will this work?” and start asking “how can I make this fail?”
Once the prompts and guardrails were in place, the system looked good on paper. But ‘technically working’ and ‘safe to ship’ are very different things. The only way to know which one you have is to try to break it — systematically, before anyone else does.
100+ inputs across 10 categories
I built a corpus of 100+ challenging inputs — manually written, with another LLM used to generate variety at scale. The goal wasn't to prove the system worked. It was to find exactly where it didn't — then categorise failures by root cause, fix each one, and re-test.
| Category | Example inputs tested |
|---|---|
| Vague goals | "Be more productive", "Get better at my job", "Improve myself" |
| Unrealistic timelines | "Lose 30 lbs in 2 weeks", "Launch a startup in 1 month" |
| Specialised domains | PMP certification, niche technical topics, highly specialised professional credentials |
| Medical / dangerous goals | Weight loss plans, diet regimens, injury recovery programmes |
| Adversarial inputs | Prompt injection attempts, jailbreak patterns, instruction override attempts |
| Missing information loops | Goals with insufficient context that could trigger infinite clarification loops |
| Multi-part / conflicting goals | "Learn 3 languages + build 5 projects + get promoted — in 2 months" |
| Unusual time expressions | "By summer", "ASAP", "soon", "eventually" |
| Resource constraint assumptions | Goals that assume unlimited time, money, or access the user may not have |
| Cultural / personal context | Goals requiring cultural assumptions, regional differences, or personal circumstances |
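A corpus organised by category lends itself to a small harness that reports a pass rate per category after each round of fixes. This is a minimal sketch: `Case`, the `expected` behaviour labels, and the `system` callable are hypothetical stand-ins for the real pipeline and its grading.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    category: str   # one of the ten categories in the table
    prompt: str     # the challenging input
    expected: str   # hypothetical behaviour label, e.g. "clarify" or "decline"


def pass_rates(cases: list[Case], system: Callable[[str], str]) -> dict[str, float]:
    """Percentage of cases per category where the system behaved as expected."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case.category] += 1
        if system(case.prompt) == case.expected:
            passed[case.category] += 1
    return {category: 100.0 * passed[category] / totals[category] for category in totals}
```

Re-running the same harness after every prompt or guardrail change makes regressions in one category visible even when the overall number improves.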
Accuracy went from 68% to 92% — here's what changed
After each round of testing, I updated the prompts and guardrails, then re-tested. The improvement wasn't from a single fix — it came from closing gaps across every category systematically.
| Metric | Before guardrails | After guardrails |
|---|---|---|
| Overall accuracy | 68% | 92% |
| Dangerous health advice blocked | 45% | 100% |
| Specialised domain confidence surfaced | 62% | 89% |
| Vague goals clarified before breakdown | 73% | 94% |
| Medical goals declined | 52% | 100% |
| Infinite loops prevented | 81% | 97% |
These results are against the test corpus I built — 100+ inputs designed to break the system. 100% on health/medical blocking means the guardrails caught every case I could think of and generate. It doesn't mean the system is unbreakable in production — it means the known failure modes are covered.
Safety: three guardrail layers designed before anyone used the product
Guardrails weren't added after something went wrong. They were designed first — because the failure modes were predictable.
Content restrictions: Hard decline on health, medical, financial, and legal advice. Redirects to professionals. No partial answers, no hedged guesses.
Reality-checking: "Launch a startup in 2 weeks" gets phased, not accepted. Overly ambitious goals trigger a scope warning rather than cheerful enablement.
Wellbeing protection: 3–5 task cap to prevent overwhelm. No streaks or gamification: both were removed after a bias mapping exercise revealed they disproportionately harm users with ADHD, chronic illness, or unpredictable schedules, exactly the people most likely to struggle with activation energy.
The principle across all three layers: redirect, don't refuse. Show what the AI can do, be honest about what it can't, and never leave the user without a next step.
Right-Sizing the Model to the Task
Getting the system to work well was one challenge. Getting it to work sustainably — at a cost that makes a real product viable — was another.
Not every step needs the same model
Every step in the pipeline uses an LLM — but not every step needs the same model or the same depth of reasoning. Matching model weight to task complexity is what keeps costs sustainable.
| Operation | Approach | Rationale |
|---|---|---|
| Request routing | LLM — routing only (Orchestrator) | Understands user intent and delegates; constrained to never generate content itself |
| Goal analysis & clarification | LLM (Analyzer) | Requires genuine language understanding and context extraction |
| Task breakdown generation | LLM (Breakdown) | Creative, personalised, domain-aware content generation |
| Feasibility validation | LLM (Validator) | Safety rules encoded in prompt; LLM applies them with nuanced judgement |
| Simple confirmations | Lighter / cheaper model (planned) | Sufficient for basic back-and-forth at a fraction of the cost |
The numbers: what it actually costs
Each goal session makes 4 LLM calls: Orchestrator routing (lightweight), Analyzer, Breakdown (heaviest), and Validator.
MVP (GPT-4o)
| Model | Input | Output |
|---|---|---|
| GPT-4o — all four agents | $2.50 per 1M tokens | $10 per 1M tokens |
Planned next iteration (Claude)
| Model | Input | Output |
|---|---|---|
| Claude Sonnet 4.5 — Analyzer, Breakdown, Validator | $3 per 1M tokens | $15 per 1M tokens |
| Claude Haiku 4.5 — Orchestrator & simple confirmations | $1 per 1M tokens | $5 per 1M tokens |
The cost optimisation in the Claude version comes from routing the Orchestrator and simple confirmations to Haiku instead of Sonnet. These two call types are short — routing is ~200 tokens in/out, confirmations ~100–300 tokens — and represent roughly a quarter of total calls per session, saving an estimated 20–25% versus running everything on Sonnet. At any realistic early-stage user volume, monthly API costs sit in the low hundreds of dollars.
These figures are based on published API pricing and are directionally accurate — exact costs require instrumented token counts from engineering and user volume projections from a PM. In a real project, I'd run this as a shared model with both.
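The per-session arithmetic can be sketched with the published prices from the tables above. Only the ~200-token routing figure comes from the text; every other token count below is a placeholder assumption to be replaced with instrumented numbers, and the short model keys are labels, not API model identifiers.

```python
# USD per 1M tokens as (input, output), taken from the pricing tables above.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "sonnet": (3.00, 15.00),
    "haiku": (1.00, 5.00),
}

# Tokens per call as (input, output). Only the orchestrator figure is from
# the text; the rest are hypothetical placeholders.
ASSUMED_TOKENS = {
    "orchestrator": (200, 200),
    "analyzer": (1500, 500),
    "breakdown": (2000, 1500),
    "validator": (2500, 300),
}


def call_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Cost in USD of one call at the published per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


def session_cost(model_for_agent: dict[str, str]) -> float:
    """Cost in USD of one four-call goal session under the assumed token counts."""
    return sum(
        call_cost(model_for_agent[agent], *ASSUMED_TOKENS[agent])
        for agent in ASSUMED_TOKENS
    )


all_sonnet = {agent: "sonnet" for agent in ASSUMED_TOKENS}
mixed = {**all_sonnet, "orchestrator": "haiku"}
saving_pct = 100 * (1 - session_cost(mixed) / session_cost(all_sonnet))
```

The saving this produces is sensitive to the token counts: short routing calls contribute in proportion to their share of tokens, not their share of calls, so measured counts are needed before quoting a figure.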
Systematic stress testing across 100+ inputs surfaced the failure modes in the test corpus before users could hit them. The gap between "technically working" and "safe to ship" was 24 percentage points.
Health, medical, and harmful advice guardrails were designed before a single line of production code — not bolted on after. Every edge case in the test corpus was caught.
Decomposing the system into specialised agents (Orchestrator, Analyzer, Breakdown, Validator) made every failure mode visible and fixable independently. Complexity managed through separation, not sophistication.
Bias mapping, failure mode prediction, and guardrail design happened before testing began. The framework covers 10 categories of ways AI goal planners can harm users, from unrealistic timelines to dangerous health advice.
What This Project Taught Me About Designing AI Products
Learn how LLMs actually work
Token costs, context limits, and model trade-offs are not engineering concerns to hand off. Every design decision in Orchestrate required understanding what the AI could and couldn't do.
Decompose the problem
One clever prompt has a ceiling. Single-agent: 65% accuracy. Multi-agent: 85%+. Break the problem into specialised agents, each with one job. The ceiling lifts when the responsibilities are separated.
Validation is a product feature
Accuracy went 68% → 92% through guardrail design, not a better model. If validation is an afterthought, you're shipping the broken version.
Define your ethical approach before you build
The bias mapping exercise revealed assumptions I didn't know I had. Features I was planning — including streaks and gamification — would have harmed the users who needed the product most.


