
Bounded LLM UX: two-hop inference, SSE streaming, structured grounding, and the product boundaries that keep a portfolio chat honest.
Most portfolios stop evolving. I wanted this one to behave a bit like a product: i18n-first, shared UI in a monorepo, and a Lucas AI layer that answers in first person about my work—only from structured context baked into the deployment. This post is for engineers who want the systems picture: inference shape, streaming, grounding, and explicit non-goals—not a tour of where files live or how to run a fork locally.
Static pages are bad at follow-ups. If your edge is judgment—scope, trade-offs, how you shipped under constraint—a PDF and a contact form do not scale curiosity. Lucas AI is a dedicated destination (/[locale]/ai): ask about experience, the site, or the feature itself, without pretending I am on the other end of the wire in real time.
The browser owns the transcript UI; the server owns policy and spend. Each send is POST /api/chat with { message, locale }—no attachments, no tool calls, no hidden fields.
The handler builds a system prompt, forwards to Groq’s OpenAI-compatible /v1/chat/completions, and proxies the upstream stream as Server-Sent Events (SSE) so tokens render incrementally without buffering the full completion in memory on the edge. That is the same pattern you would use behind any fast inference API: treat the route as a thin adapter, keep the wire format stable for the client.
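A minimal sketch of that thin-adapter shape, assuming a Next.js-style route handler. The function name buildSystemPrompt, the env variable usage, and the prompt wording are illustrative, not the actual implementation:

```typescript
const GROQ_URL = "https://api.groq.com/openai/v1/chat/completions";

// Illustrative: the real prompt embeds the serialized CONTEXT block
// and the full policy rules described later in the post.
function buildSystemPrompt(locale: string): string {
  return `You are Lucas. Answer in first person, only from CONTEXT.\nVisitor locale: ${locale}`;
}

export async function POST(req: Request): Promise<Response> {
  const { message, locale } = await req.json();
  const upstream = await fetch(GROQ_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: process.env.GROQ_MODEL ?? "llama-3.3-70b-versatile",
      stream: true,
      messages: [
        { role: "system", content: buildSystemPrompt(locale) },
        { role: "user", content: message },
      ],
    }),
  });
  // Upstream already speaks SSE; hand the stream through without buffering.
  return new Response(upstream.body, {
    headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
  });
}
```

The key property is the last line: the upstream body is a ReadableStream, so the edge never holds the full completion in memory.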
Grounding lives in application code: a typed career object is formatted once into a CONTEXT string and injected into the system message. There is no vector database and no retrieval step at request time—the model only sees what you serialized when you built the prompt. That is a capacity trade-off (you cannot answer from arbitrary docs) in exchange for predictable latency, cost, and audit surface.
The main model is llama-3.3-70b-versatile on Groq (override GROQ_MODEL). Streaming stays on; the route allows up to 60s of generation—enough for a careful answer without turning the edge into an unbounded worker.

Before the big model runs, an on-topic gate asks llama-3.1-8b-instant (configurable as CLASSIFIER_MODEL) whether the turn is in scope. It is a separate, non-streaming call with max_tokens: 5, temperature: 0, and a minimal system prompt that collapses the decision to a single token: CAREER or OFF_TOPIC.

If the verdict is OFF_TOPIC, the API never calls the large model. It streams a fixed, localized refusal as SSE so the client code path matches a “real” completion—no special-case UI branch for short-circuiting. If the classifier errors or times out, the handler fails open and runs the main model anyway: a cheap gate must not become a single point of failure for legitimate traffic.
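The gate can be sketched as a tiny call with a short deadline; the verdict parser fails open on anything unexpected. The prompt wording and the 2-second timeout are assumptions, not the real values:

```typescript
type Verdict = "CAREER" | "OFF_TOPIC";

// Collapse the classifier's raw completion to a verdict. Anything that
// is not a clear OFF_TOPIC fails open to CAREER so garbage output from
// the small model never blocks a legitimate question.
function parseVerdict(raw: string): Verdict {
  return raw.trim().toUpperCase().startsWith("OFF_TOPIC") ? "OFF_TOPIC" : "CAREER";
}

async function classify(message: string): Promise<Verdict> {
  try {
    const res = await fetch("https://api.groq.com/openai/v1/chat/completions", {
      method: "POST",
      signal: AbortSignal.timeout(2000), // cheap gate, short deadline (assumed value)
      headers: {
        Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: process.env.CLASSIFIER_MODEL ?? "llama-3.1-8b-instant",
        max_tokens: 5,
        temperature: 0,
        messages: [
          { role: "system", content: "Reply with exactly one token: CAREER or OFF_TOPIC." },
          { role: "user", content: message },
        ],
      }),
    });
    const data = await res.json();
    return parseVerdict(data.choices?.[0]?.message?.content ?? "");
  } catch {
    return "CAREER"; // error or timeout: fail open, run the main model
  }
}
```

Note that the fail-open lives in two places: the catch block for transport failures, and the parser for semantic failures (the model returning something other than the two tokens).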
The gate can still be disabled at the host (server env) when you consciously accept every turn hitting the big model—useful if the small endpoint is unhealthy, if you are load-testing the reply path alone, or if product policy changes and you temporarily drop gating. That knob lives outside the visitor’s view; it is an ops lever, not a feature flag in the UI.
Output budget: assistant max_tokens is capped (CHAT_MAX_TOKENS, clamped 256–8192, default 2048). It couples UX (“answers should end”) with unit economics on the provider bill.
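The clamp itself is a one-liner worth getting right, because a typo'd env value should degrade to the default, not to NaN. A sketch, with env parsing simplified:

```typescript
// Resolve CHAT_MAX_TOKENS from its raw env string: unparsable values
// fall back to the default, everything else is clamped to [256, 8192].
function resolveMaxTokens(raw: string | undefined): number {
  const n = Number.parseInt(raw ?? "", 10);
  if (Number.isNaN(n)) return 2048; // default budget
  return Math.min(8192, Math.max(256, n));
}
```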
Each request carries exactly two fields:

- message: the current user turn, trimmed, with a hard cap of 2,000 characters (abuse and prompt-injection surface reduction).
- locale: normalized to a supported site locale. The system prompt ends with Visitor locale: … so the model replies in the site language, not whatever language the UI chrome happens to use.

What we do not send: anything else. No prior turns, no attachments, no tool calls, no hidden fields.
So “memory” is: published CONTEXT + whatever is still visible in the client. That is a deliberate boundary: no server-side chat history, no cross-device sync, no training on conversations.
The UI persists messages in sessionStorage (lucas-ai-messages). Refresh in the same tab keeps the thread; a new tab or device starts clean. There is a second key, lucas-ai-pending, used when we redirect the user after a locale switch (more below).
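The persistence layer is deliberately thin. A sketch of its shape, where the Message type is illustrative and the store is window.sessionStorage in the browser (abstracted here behind a minimal interface):

```typescript
// Minimal key-value surface so the logic is testable off the browser;
// in production this is window.sessionStorage.
interface KV {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

type Message = { role: "user" | "assistant"; content: string };

const MESSAGES_KEY = "lucas-ai-messages";

function saveMessages(store: KV, msgs: Message[]): void {
  store.setItem(MESSAGES_KEY, JSON.stringify(msgs));
}

function loadMessages(store: KV): Message[] {
  try {
    return JSON.parse(store.getItem(MESSAGES_KEY) ?? "[]") as Message[];
  } catch {
    return []; // corrupted entry: start a clean thread instead of crashing
  }
}
```

Because sessionStorage is per-tab, the "new tab starts clean" behavior falls out of the platform for free; nothing in the app has to expire threads.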
If GROQ_API_KEY is missing, the route still returns SSE: it streams the localized config error string so the shell degrades without throwing away the layout.
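Both short-circuit paths (the localized refusal and the config error) can reuse one helper that wraps a fixed string in the same SSE chunk shape the client already parses. A sketch, assuming the client consumes OpenAI-style delta chunks:

```typescript
// Wrap a fixed string as a single OpenAI-style streaming chunk followed
// by the [DONE] sentinel, so the client's normal parse loop handles it.
function sseBody(text: string): string {
  const chunk = { choices: [{ delta: { content: text } }] };
  return `data: ${JSON.stringify(chunk)}\n\ndata: [DONE]\n\n`;
}

function fixedSse(text: string): Response {
  return new Response(sseBody(text), {
    headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
  });
}
```

This is the mechanism behind "no special-case UI branch": refusals and errors are indistinguishable from real completions at the transport layer.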
The career corpus starts as typed records—roles, stacks, initiatives, metrics, kind (work vs volunteer vs education vs personal), time ranges, and optional narrative fields. A formatter turns that into one markdown-ish block wrapped in explicit delimiters, e.g.:
--- CONTEXT (ground truth; do not contradict or extend beyond it) ---
…
--- End context ---
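A sketch of that build-time formatter. The field names on CareerEntry and the markdown shape of each record are assumptions; the delimiters are the ones from the post:

```typescript
// Illustrative record shape; the real typed career object carries more
// fields (metrics, time ranges, narrative beats).
type CareerEntry = {
  kind: "work" | "volunteer" | "education" | "personal";
  title: string;
  period: string;
  stack?: string[];
  impact?: string;
};

// Serialize the typed records once, at build time, into the delimited
// CONTEXT block injected into the system message.
function formatContext(entries: CareerEntry[]): string {
  const body = entries
    .map((e) => {
      const lines = [`## ${e.title} (${e.kind}, ${e.period})`];
      if (e.stack?.length) lines.push(`Stack: ${e.stack.join(", ")}`);
      if (e.impact) lines.push(`Impact: ${e.impact}`);
      return lines.join("\n");
    })
    .join("\n\n");
  return [
    "--- CONTEXT (ground truth; do not contradict or extend beyond it) ---",
    body,
    "--- End context ---",
  ].join("\n");
}
```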
What goes in matters as much as what stays out:
For example: context / problem / solution / impact story beats when you want the model to speak in outcomes, not buzzwords.

Layered on top, the system prompt is strict law: first person, no pretending to be live on Slack, no access to private systems, no third-person “Lucas said…”, CONTEXT as sole factual source, and no fabrication past what CONTEXT states for implementation detail.
That is how you turn “don’t hallucinate” from a vibe into testable scope: the model is only as smart as the bundle you ship, and the bundle is versioned like code.
The small model’s job is only routing, not helpfulness. CAREER is defined broadly: background, skills, shipped work, product judgment, and legitimate questions about the site itself—stack, localization pipeline, how the assistant is wired—insofar as that information exists in CONTEXT. OFF_TOPIC catches everything else (weather, homework, unrelated trivia).
Treating “meta” questions as in-scope is a product decision: a portfolio assistant should explain its own boundary conditions without opening the entire web as a knowledge source.
The route locale drives the response language (via the prompt). But users sometimes type in another language while staying on, say, the English UI.
On send, the client runs franc-min on the input (minimum length ~15 chars). If the detected language does not match the current locale’s expected ISO 639-3 mapping, we do not silently post to the API. We show an offer card: buttons to router.push(/${targetLocale}/ai) for matching locales, plus “continue in current language.” If they switch locale, we stash the pending message in sessionStorage, navigate, then auto-send after mount so the question runs with the right locale in the JSON body.
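The decision step after detection is pure and easy to sketch. Here `detected` is the ISO 639-3 code franc-min returns ("und" when detection is inconclusive); the locale map below is illustrative, not the site's actual locale list:

```typescript
// Map site locales to the ISO 639-3 codes franc-min emits.
// Illustrative subset; the real site defines its own supported locales.
const LOCALE_TO_ISO3: Record<string, string> = { en: "eng", fr: "fra", pt: "por" };

// Returns the site locale to offer a switch to, or null to post as-is
// (inconclusive detection, already aligned, or no matching locale).
function localeMismatch(detected: string, currentLocale: string): string | null {
  if (detected === "und") return null;
  if (LOCALE_TO_ISO3[currentLocale] === detected) return null;
  const target = Object.keys(LOCALE_TO_ISO3).find((l) => LOCALE_TO_ISO3[l] === detected);
  return target ?? null;
}
```

A non-null result triggers the offer card; null means the message posts immediately, which keeps the happy path free of extra friction.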
That is product behavior: align site language with the language the user is actually writing, instead of forcing the model to guess or mixing policies.
Lucas AI is a nav destination and a home section (badge, headline, CTA, example prompts)—not a floating widget over reading flow. Full-page chat keeps the pattern opt-in and avoids the “surprise copilot” anti-pattern where generative UI fights the rest of the layout for attention.
Upstream failures surface only as a coarse coded signal (for example llm_auth) to the client—no raw upstream body (avoid leaking key or model hints).

Lucas AI is not a general assistant dropped onto a marketing page. It is a narrow product: one body of facts you stand behind, one streaming answer path, a small model that only decides “in scope or not,” and a server that forgets each turn on purpose. The goal is predictable behavior—what gets sent to the provider, what you pay the inference API per request, and what the visitor is allowed to treat as factual.
If you build something like this, the leverage is not picking the biggest model. It is treating the system message and CONTEXT like a spec—plain, factual, line-auditable—rather than marketing copy, and deciding in the architecture when the large model is allowed to run at all (for example, only after an on-topic gate).