Building Lucas AI: turning a portfolio into a product


April 22, 2026 · 7 min read


Bounded LLM UX: two-hop inference, SSE streaming, structured grounding, and the product boundaries that keep a portfolio chat honest.

product · ai · frontend

Most portfolios stop evolving. I wanted this one to behave a bit like a product: i18n-first, shared UI in a monorepo, and a Lucas AI layer that answers in first person about my work—only from structured context baked into the deployment. This post is for engineers who want the systems picture: inference shape, streaming, grounding, and explicit non-goals—not a tour of where files live or how to run a fork locally.

What problem it solves

Static pages are bad at follow-ups. If your edge is judgment—scope, trade-offs, how you shipped under constraint—a PDF and a contact form do not scale curiosity. Lucas AI is a dedicated destination (/[locale]/ai): ask about experience, the site, or the feature itself, without pretending I am on the other end of the wire in real time.

High-level architecture

The browser owns the transcript UI; the server owns policy and spend. Each send is POST /api/chat with { message, locale }—no attachments, no tool calls, no hidden fields.

The handler builds a system prompt, forwards to Groq’s OpenAI-compatible /v1/chat/completions, and proxies the upstream stream as Server-Sent Events (SSE) so tokens render incrementally without buffering the full completion in memory on the edge. That is the same pattern you would use behind any fast inference API: treat the route as a thin adapter, keep the wire format stable for the client.
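The adapter shape can be sketched as a small helper (names hypothetical, not the production code): re-frame a stream of text chunks as SSE `data:` events, which the route handler returns directly without buffering.

```typescript
// Sketch: wrap a stream of text chunks in SSE framing.
// Hypothetical helper; the real route proxies Groq's stream the same way.
function toSSE(chunks: AsyncIterable<string>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream({
    async start(controller) {
      for await (const chunk of chunks) {
        // One SSE event per upstream chunk; the blank line terminates the event.
        controller.enqueue(encoder.encode(`data: ${JSON.stringify(chunk)}\n\n`));
      }
      controller.enqueue(encoder.encode("data: [DONE]\n\n"));
      controller.close();
    },
  });
}
```

The route then returns `new Response(toSSE(upstream), { headers: { "Content-Type": "text/event-stream" } })`, so the client sees a stable wire format regardless of which provider sits behind the adapter.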

Grounding lives in application code: a typed career object is formatted once into a CONTEXT string and injected into the system message. There is no vector database and no retrieval step at request time—the model only sees what you serialized when you built the prompt. That is a capacity trade-off (you cannot answer from arbitrary docs) in exchange for predictable latency, cost, and audit surface.

Models and calls

  • Main reply: default llama-3.3-70b-versatile on Groq (override GROQ_MODEL). Streaming stays on; the route allows up to 60s of generation—enough for a careful answer without turning the edge into an unbounded worker.
  • Pre-flight classifier: llama-3.1-8b-instant (configurable as CLASSIFIER_MODEL). It is a separate, non-streaming call with max_tokens: 5, temperature: 0, and a minimal system prompt that collapses the decision to a single token: CAREER or OFF_TOPIC.
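Because the verdict only matters as one of two tokens, the parsing can be defensive. A sketch (function name hypothetical) that normalizes whatever the small model returns, biased toward letting traffic through:

```typescript
type Verdict = "CAREER" | "OFF_TOPIC";

// Hypothetical helper: collapse the classifier's raw completion to a verdict.
// Anything that is not clearly OFF_TOPIC routes to the main model.
function parseVerdict(raw: string): Verdict {
  const token = raw.trim().toUpperCase();
  return token.startsWith("OFF_TOPIC") ? "OFF_TOPIC" : "CAREER";
}
```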

If the verdict is OFF_TOPIC, the API never calls the large model. It streams a fixed, localized refusal as SSE so the client code path matches a “real” completion—no special-case UI branch for short-circuiting. If the classifier errors or times out, the handler fails open and runs the main model anyway: a cheap gate must not become a single point of failure for legitimate traffic.
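The fail-open behavior is worth making explicit. A minimal sketch (helper name and timeout budget are assumptions, not the production values): race the classifier against a deadline and default to in-scope on any failure.

```typescript
// Sketch of fail-open gating (hypothetical helper): if the classifier
// errors or exceeds its deadline, treat the turn as in-scope.
async function gate(
  classify: (msg: string) => Promise<"CAREER" | "OFF_TOPIC">,
  message: string,
  timeoutMs = 1500, // assumed budget, not the production value
): Promise<"CAREER" | "OFF_TOPIC"> {
  const deadline = new Promise<"CAREER">((resolve) =>
    setTimeout(() => resolve("CAREER"), timeoutMs),
  );
  try {
    return await Promise.race([classify(message), deadline]);
  } catch {
    return "CAREER"; // classifier failure must not block legitimate traffic
  }
}
```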

The gate can still be disabled at the host (server env) when you consciously accept every turn hitting the big model—useful if the small endpoint is unhealthy, if you are load-testing the reply path alone, or if product policy changes and you temporarily drop gating. That knob lives outside the visitor’s view; it is an ops lever, not a feature flag in the UI.

Output budget: assistant max_tokens is capped (CHAT_MAX_TOKENS, clamped 256–8192, default 2048). It couples UX (“answers should end”) with unit economics on the provider bill.

What we send to the server (and what we do not)

Each request carries:

  • message: the current user turn, trimmed, with a hard cap of 2,000 characters (abuse and prompt-injection surface reduction).
  • locale: normalized to a supported site locale. The system prompt ends with Visitor locale: … so the model replies in the site language, not whatever language the UI chrome happens to use.

What we do not send:

  • No prior turns on the wire. The backend is single-turn per request: system prompt + one user message. The thread you see in the UI is not replayed to the model on each send.

So “memory” is: published CONTEXT + whatever is still visible in the client. That is a deliberate boundary: no server-side chat history, no cross-device sync, no training on conversations.
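Concretely, the payload the provider sees is always two messages. A sketch of the validation and assembly (helper name hypothetical; the cap and single-turn shape are from above):

```typescript
type ChatMessage = { role: "system" | "user"; content: string };

const MAX_CHARS = 2000;

// Sketch: validate the turn and build the single-turn payload.
// Throws on empty/oversized input; the route maps that to a 400.
function buildMessages(systemPrompt: string, message: string): ChatMessage[] {
  const trimmed = message.trim();
  if (trimmed.length === 0 || trimmed.length > MAX_CHARS) {
    throw new Error("invalid_message");
  }
  // No prior turns: the model never sees the client's visible thread.
  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: trimmed },
  ];
}
```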

Where the thread lives

The UI persists messages in sessionStorage (lucas-ai-messages). Refresh in the same tab keeps the thread; a new tab or device starts clean. There is a second key, lucas-ai-pending, used when we redirect the user after a locale switch (more below).
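The persistence layer is thin. A sketch using the key from above (the `Storage`-like interface is abstracted here so the code runs outside a browser; in the app it is `sessionStorage`):

```typescript
// Sketch of thread persistence (key from the post; Storage-like
// interface abstracted so this runs outside a browser).
type Msg = { role: "user" | "assistant"; content: string };
interface KV {
  getItem(k: string): string | null;
  setItem(k: string, v: string): void;
}

const THREAD_KEY = "lucas-ai-messages";

function saveThread(store: KV, msgs: Msg[]): void {
  store.setItem(THREAD_KEY, JSON.stringify(msgs));
}

function loadThread(store: KV): Msg[] {
  try {
    return JSON.parse(store.getItem(THREAD_KEY) ?? "[]") as Msg[];
  } catch {
    return []; // corrupted storage degrades to an empty thread
  }
}
```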

If GROQ_API_KEY is missing, the route still returns SSE: it streams the localized config error string so the shell degrades without throwing away the layout.

Structured grounding (not RAG)

The career corpus starts as typed records—roles, stacks, initiatives, metrics, kind (work vs volunteer vs education vs personal), time ranges, and optional narrative fields. A formatter turns that into one markdown-ish block wrapped in explicit delimiters, e.g.:

--- CONTEXT (ground truth; do not contradict or extend beyond it) ---

--- End context ---

What goes in matters as much as what stays out:

  • Bio, strengths, principles, working style—orientation without résumé noise.
  • Experience—per employer: scope, initiatives (what / impact / evidence), metrics, and optional context / problem / solution / impact story beats when you want the model to speak in outcomes, not buzzwords.
  • Duration summary—calendar-deduped months for professional work with rules that prevent double-counting overlaps or smuggling hobby projects into “years of experience.”
  • A self-describing “product + implementation” slice—not marketing copy, but facts you are willing to defend: hosting shape, i18n approach, how chat is invoked, streaming, classifier behavior, where state lives, privacy posture. The point of that slice is meta-questions answered from the same ground-truth object as career questions—so “how does this feature work?” does not become a blank check for the model to invent file trees or dependencies.
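The formatter itself is unremarkable by design. A sketch of the serialization step (field names are illustrative, not the real schema; only the delimiters come from the example above):

```typescript
// Sketch: serialize typed career records into the delimited CONTEXT
// block. Field names are illustrative, not the real schema.
type Role = { employer: string; scope: string; metrics: string[] };

function formatContext(bio: string, roles: Role[]): string {
  const body = [
    `## Bio\n${bio}`,
    ...roles.map(
      (r) => `## ${r.employer}\nScope: ${r.scope}\nMetrics: ${r.metrics.join("; ")}`,
    ),
  ].join("\n\n");
  return [
    "--- CONTEXT (ground truth; do not contradict or extend beyond it) ---",
    body,
    "--- End context ---",
  ].join("\n");
}
```

Because the block is built at deploy time from typed data, a factual change is a code change: reviewable, diffable, and versioned with the rest of the site.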

Layered on top, the system prompt is strict law: first person, no pretending to be live on Slack, no access to private systems, no third-person “Lucas said…”, CONTEXT as sole factual source, and no fabrication past what CONTEXT states for implementation detail.

That is how you turn “don’t hallucinate” from a vibe into testable scope: the model is only as smart as the bundle you ship, and the bundle is versioned like code.

Topic classifier (what counts as in-scope)

The small model’s job is only routing, not helpfulness. CAREER is defined broadly: background, skills, shipped work, product judgment, and legitimate questions about the site itself—stack, localization pipeline, how the assistant is wired—insofar as that information exists in CONTEXT. OFF_TOPIC catches everything else (weather, homework, unrelated trivia).

Treating “meta” questions as in-scope is a product decision: a portfolio assistant should explain its own boundary conditions without opening the entire web as a knowledge source.

Input language vs site locale (client-side)

The route locale drives response language (via prompt). But users sometimes type in another language while staying on, say, the English UI.

On send, the client runs franc-min on the input (minimum length ~15 chars). If the detected language does not match the current locale’s expected ISO 639-3 mapping, we do not silently post to the API. We show an offer card: buttons to router.push(/${targetLocale}/ai) for matching locales, plus “continue in current language.” If they switch locale, we stash the pending message in sessionStorage, navigate, then auto-send after mount so the question runs with the right locale in the JSON body.

That is product behavior: align site language with the language the user is actually writing, instead of forcing the model to guess or mixing policies.
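The decision part of that flow is a pure function once detection has run. A sketch (the locale-to-ISO mapping and threshold here are illustrative; detection itself comes from a library like franc-min, which returns ISO 639-3 codes and "und" when unsure):

```typescript
// Sketch of the mismatch decision. Detection comes from a library like
// franc-min (ISO 639-3 output); this mapping is illustrative only.
const LOCALE_TO_ISO3: Record<string, string> = { en: "eng", pt: "por", es: "spa" };

const MIN_DETECT_LENGTH = 15; // short inputs are unreliable to classify

// Returns the locale to offer a switch to, or null to post as-is.
function shouldOfferSwitch(
  input: string,
  currentLocale: string,
  detected: string, // e.g. the result of franc(input)
): string | null {
  if (input.trim().length < MIN_DETECT_LENGTH || detected === "und") return null;
  const match = Object.entries(LOCALE_TO_ISO3).find(([, iso]) => iso === detected);
  if (!match) return null;
  const [targetLocale] = match;
  return targetLocale === currentLocale ? null : targetLocale;
}
```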

UX placement

Lucas AI is a nav destination and a home section (badge, headline, CTA, example prompts)—not a floating widget over reading flow. Full-page chat keeps the pattern opt-in and avoids the “surprise copilot” anti-pattern where generative UI fights the rest of the layout for attention.

Failure modes we cared about

  • Classifier down: proceed to main model (fail open).
  • Groq 401/403/400: return JSON llm_auth to the client—no raw upstream body (avoid leaking key or model hints).
  • 5xx / 429: stream a localized generic error via mock SSE when appropriate.
  • Empty or oversized message: 400 with a stable error code.
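That mapping is easiest to keep honest as a pure function from upstream status to a client-facing plan, never forwarding the raw provider body. A sketch (type and error codes illustrative; only the 401/403/400-to-llm_auth mapping is from the list above):

```typescript
type ErrorPlan =
  | { kind: "json"; code: string } // structured error response
  | { kind: "sse"; code: string }; // localized generic error streamed as SSE

// Sketch: map an upstream provider status to a client-facing plan
// without leaking the raw upstream body. Codes are illustrative.
function planForUpstream(status: number): ErrorPlan {
  if (status === 400 || status === 401 || status === 403) {
    return { kind: "json", code: "llm_auth" };
  }
  // 429s and 5xx degrade to a localized generic error over mock SSE.
  return { kind: "sse", code: "llm_unavailable" };
}
```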

The takeaway

Lucas AI is not a general assistant dropped onto a marketing page. It is a narrow product: one body of facts you stand behind, one streaming answer path, a small model that only decides “in scope or not,” and a server that forgets each turn on purpose. The goal is predictable behavior—what gets sent to the provider, what you pay the inference API per request, and what the visitor is allowed to treat as factual.

If you build something like this, the leverage is not picking the biggest model. It is treating the system message and CONTEXT like a spec—plain, factual, line-auditable—rather than marketing copy, and deciding in the architecture when the large model is allowed to run at all (for example, only after an on-topic gate).