
Bounded LLM UX: two-hop inference, SSE streaming, structured grounding, and the product boundaries that keep a portfolio chat honest.
Most portfolios stop evolving. I wanted this one to behave a bit like a product: i18n-first, shared UI in a monorepo, and a Lucas AI layer that answers in first person about my work—only from structured context baked into the deployment. This post is for engineers who want the systems picture: inference shape, streaming, grounding, and explicit non-goals—not a tour of where files live or how to run a fork locally.
Static pages are bad at follow-ups. If your edge is judgment—scope, trade-offs, how you shipped under constraint—a PDF and a contact form do not scale curiosity. Lucas AI is a dedicated destination (/[locale]/ai): ask about experience, the site, or the feature itself, without pretending I am on the other end of the wire in real time.
The browser owns the transcript UI; the server owns policy and spend. Each send is POST /api/chat with { message, locale }—no attachments, no tool calls, no hidden fields.
The handler builds a system prompt, forwards to Groq’s OpenAI-compatible /v1/chat/completions, and proxies the upstream stream as Server-Sent Events (SSE) so tokens render incrementally without buffering the full completion in memory on the edge. That is the same pattern you would use behind any fast inference API: treat the route as a thin adapter, keep the wire format stable for the client.
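A minimal sketch of that thin-adapter shape, assuming a Next.js-style route handler. The function name buildSystemPrompt, the env variable usage, and the prompt wording are illustrative, not the actual implementation:

```typescript
const GROQ_URL = "https://api.groq.com/openai/v1/chat/completions";

// Illustrative: the real prompt embeds the serialized CONTEXT block
// and the full policy rules described later in the post.
function buildSystemPrompt(locale: string): string {
  return `You are Lucas. Answer in first person, only from CONTEXT.\nVisitor locale: ${locale}`;
}

export async function POST(req: Request): Promise<Response> {
  const { message, locale } = await req.json();
  const upstream = await fetch(GROQ_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: process.env.GROQ_MODEL ?? "llama-3.3-70b-versatile",
      stream: true,
      messages: [
        { role: "system", content: buildSystemPrompt(locale) },
        { role: "user", content: message },
      ],
    }),
  });
  // Upstream already speaks SSE; hand the stream through without buffering.
  return new Response(upstream.body, {
    headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
  });
}
```

The key property is the last line: the upstream body is a ReadableStream, so the edge never holds the full completion in memory.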
Grounding lives in application code: a typed career object is formatted once into a CONTEXT string and injected into the system message. There is no vector database and no retrieval step at request time—the model only sees what you serialized when you built the prompt. That is a capacity trade-off (you cannot answer from arbitrary docs) in exchange for predictable latency, cost, and audit surface.
The main model is llama-3.3-70b-versatile on Groq (override GROQ_MODEL). Streaming stays on; the route allows up to 60s of generation—enough for a careful answer without turning the edge into an unbounded worker.

Before the big model runs, an on-topic gate asks llama-3.1-8b-instant (configurable as CLASSIFIER_MODEL) whether the turn is in scope. It is a separate, non-streaming call with max_tokens: 5, temperature: 0, and a minimal system prompt that collapses the decision to a single token: CAREER or OFF_TOPIC.

If the verdict is OFF_TOPIC, the API never calls the large model. It streams a fixed, localized refusal as SSE so the client code path matches a “real” completion—no special-case UI branch for short-circuiting. If the classifier errors or times out, the handler fails open and runs the main model anyway: a cheap gate must not become a single point of failure for legitimate traffic.
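The gate can be sketched as a tiny call with a short deadline; the verdict parser fails open on anything unexpected. The prompt wording and the 2-second timeout are assumptions, not the real values:

```typescript
type Verdict = "CAREER" | "OFF_TOPIC";

// Collapse the classifier's raw completion to a verdict. Anything that
// is not a clear OFF_TOPIC fails open to CAREER so garbage output from
// the small model never blocks a legitimate question.
function parseVerdict(raw: string): Verdict {
  return raw.trim().toUpperCase().startsWith("OFF_TOPIC") ? "OFF_TOPIC" : "CAREER";
}

async function classify(message: string): Promise<Verdict> {
  try {
    const res = await fetch("https://api.groq.com/openai/v1/chat/completions", {
      method: "POST",
      signal: AbortSignal.timeout(2000), // cheap gate, short deadline (assumed value)
      headers: {
        Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: process.env.CLASSIFIER_MODEL ?? "llama-3.1-8b-instant",
        max_tokens: 5,
        temperature: 0,
        messages: [
          { role: "system", content: "Reply with exactly one token: CAREER or OFF_TOPIC." },
          { role: "user", content: message },
        ],
      }),
    });
    const data = await res.json();
    return parseVerdict(data.choices?.[0]?.message?.content ?? "");
  } catch {
    return "CAREER"; // error or timeout: fail open, run the main model
  }
}
```

Note that the fail-open lives in two places: the catch block for transport failures, and the parser for semantic failures (the model returning something other than the two tokens).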
The gate can still be disabled at the host (server env) when you consciously accept every turn hitting the big model—useful if the small endpoint is unhealthy, if you are load-testing the reply path alone, or if product policy changes and you temporarily drop gating. That knob lives outside the visitor’s view; it is an ops lever, not a feature flag in the UI.
Output budget: assistant max_tokens is capped (CHAT_MAX_TOKENS, clamped 256–8192, default 2048). It couples UX (“answers should end”) with unit economics on the provider bill.
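The clamp itself is a one-liner worth getting right, because a typo'd env value should degrade to the default, not to NaN. A sketch, with env parsing simplified:

```typescript
// Resolve CHAT_MAX_TOKENS from its raw env string: unparsable values
// fall back to the default, everything else is clamped to [256, 8192].
function resolveMaxTokens(raw: string | undefined): number {
  const n = Number.parseInt(raw ?? "", 10);
  if (Number.isNaN(n)) return 2048; // default budget
  return Math.min(8192, Math.max(256, n));
}
```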
Each request carries exactly two fields:

- message: the current user turn, trimmed, with a hard cap of 2,000 characters (abuse and prompt-injection surface reduction).
- locale: normalized to a supported site locale. The system prompt ends with Visitor locale: … so the model replies in the site language, not whatever language the UI chrome happens to use.

What we do not send: anything else. No prior turns, no attachments, no tool calls, no hidden fields.
So “memory” is: published CONTEXT + whatever is still visible in the client. That is a deliberate boundary: no server-side chat history, no cross-device sync, no training on conversations.
The UI persists messages in sessionStorage (lucas-ai-messages). Refresh in the same tab keeps the thread; a new tab or device starts clean. There is a second key, lucas-ai-pending, used when we redirect the user after a locale switch (more below).
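The persistence layer is deliberately thin. A sketch of its shape, where the Message type is illustrative and the store is window.sessionStorage in the browser (abstracted here behind a minimal interface):

```typescript
// Minimal key-value surface so the logic is testable off the browser;
// in production this is window.sessionStorage.
interface KV {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

type Message = { role: "user" | "assistant"; content: string };

const MESSAGES_KEY = "lucas-ai-messages";

function saveMessages(store: KV, msgs: Message[]): void {
  store.setItem(MESSAGES_KEY, JSON.stringify(msgs));
}

function loadMessages(store: KV): Message[] {
  try {
    return JSON.parse(store.getItem(MESSAGES_KEY) ?? "[]") as Message[];
  } catch {
    return []; // corrupted entry: start a clean thread instead of crashing
  }
}
```

Because sessionStorage is per-tab, the "new tab starts clean" behavior falls out of the platform for free; nothing in the app has to expire threads.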
If GROQ_API_KEY is missing, the route still returns SSE: it streams the localized config error string so the shell degrades without throwing away the layout.
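Both short-circuit paths (the localized refusal and the config error) can reuse one helper that wraps a fixed string in the same SSE chunk shape the client already parses. A sketch, assuming the client consumes OpenAI-style delta chunks:

```typescript
// Wrap a fixed string as a single OpenAI-style streaming chunk followed
// by the [DONE] sentinel, so the client's normal parse loop handles it.
function sseBody(text: string): string {
  const chunk = { choices: [{ delta: { content: text } }] };
  return `data: ${JSON.stringify(chunk)}\n\ndata: [DONE]\n\n`;
}

function fixedSse(text: string): Response {
  return new Response(sseBody(text), {
    headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
  });
}
```

This is the mechanism behind "no special-case UI branch": refusals and errors are indistinguishable from real completions at the transport layer.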
The career corpus starts as typed records—roles, stacks, initiatives, metrics, kind (work vs volunteer vs education vs personal), time ranges, and optional narrative fields. A formatter turns that into one markdown-ish block wrapped in explicit delimiters, e.g.:
--- CONTEXT (ground truth; do not contradict or extend beyond it) ---
…
--- End context ---
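A sketch of that build-time formatter. The field names on CareerEntry and the markdown shape of each record are assumptions; the delimiters are the ones from the post:

```typescript
// Illustrative record shape; the real typed career object carries more
// fields (metrics, time ranges, narrative beats).
type CareerEntry = {
  kind: "work" | "volunteer" | "education" | "personal";
  title: string;
  period: string;
  stack?: string[];
  impact?: string;
};

// Serialize the typed records once, at build time, into the delimited
// CONTEXT block injected into the system message.
function formatContext(entries: CareerEntry[]): string {
  const body = entries
    .map((e) => {
      const lines = [`## ${e.title} (${e.kind}, ${e.period})`];
      if (e.stack?.length) lines.push(`Stack: ${e.stack.join(", ")}`);
      if (e.impact) lines.push(`Impact: ${e.impact}`);
      return lines.join("\n");
    })
    .join("\n\n");
  return [
    "--- CONTEXT (ground truth; do not contradict or extend beyond it) ---",
    body,
    "--- End context ---",
  ].join("\n");
}
```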
What goes in matters as much as what stays out:
For example: context / problem / solution / impact story beats when you want the model to speak in outcomes, not buzzwords.

Layered on top, the system prompt is strict law: first person, no pretending to be live on Slack, no access to private systems, no third-person “Lucas said…”, CONTEXT as sole factual source, and no fabrication past what CONTEXT states for implementation detail.
That is how you turn “don’t hallucinate” from a vibe into testable scope: the model is only as smart as the bundle you ship, and the bundle is versioned like code.
The small model’s job is only routing, not helpfulness. CAREER is defined broadly: background, skills, shipped work, product judgment, and legitimate questions about the site itself—stack, localization pipeline, how the assistant is wired—insofar as that information exists in CONTEXT. OFF_TOPIC catches everything else (weather, homework, unrelated trivia).
Treating “meta” questions as in-scope is a product decision: a portfolio assistant should explain its own boundary conditions without opening the entire web as a knowledge source.
The route locale drives the response language (via the prompt). But users sometimes type in another language while staying on, say, the English UI.
On send, the client runs franc-min on the input (minimum length ~15 chars). If the detected language does not match the current locale’s expected ISO 639-3 mapping, we do not silently post to the API. We show an offer card: buttons to router.push(/${targetLocale}/ai) for matching locales, plus “continue in current language.” If they switch locale, we stash the pending message in sessionStorage, navigate, then auto-send after mount so the question runs with the right locale in the JSON body.
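The decision step after detection is pure and easy to sketch. Here `detected` is the ISO 639-3 code franc-min returns ("und" when detection is inconclusive); the locale map below is illustrative, not the site's actual locale list:

```typescript
// Map site locales to the ISO 639-3 codes franc-min emits.
// Illustrative subset; the real site defines its own supported locales.
const LOCALE_TO_ISO3: Record<string, string> = { en: "eng", fr: "fra", pt: "por" };

// Returns the site locale to offer a switch to, or null to post as-is
// (inconclusive detection, already aligned, or no matching locale).
function localeMismatch(detected: string, currentLocale: string): string | null {
  if (detected === "und") return null;
  if (LOCALE_TO_ISO3[currentLocale] === detected) return null;
  const target = Object.keys(LOCALE_TO_ISO3).find((l) => LOCALE_TO_ISO3[l] === detected);
  return target ?? null;
}
```

A non-null result triggers the offer card; null means the message posts immediately, which keeps the happy path free of extra friction.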
That is product behavior: align site language with the language the user is actually writing, instead of forcing the model to guess or mixing policies.
Lucas AI is a nav destination and a home section (badge, headline, CTA, example prompts)—not a floating widget over reading flow. Full-page chat keeps the pattern opt-in and avoids the “surprise copilot” anti-pattern where generative UI fights the rest of the layout for attention.
Upstream failures surface only as a coarse coded signal (for example llm_auth) to the client—no raw upstream body (avoid leaking key or model hints).

Lucas AI is not a general assistant dropped onto a marketing page. It is a narrow product: one body of facts you stand behind, one streaming answer path, a small model that only decides “in scope or not,” and a server that forgets each turn on purpose. The goal is predictable behavior—what gets sent to the provider, what you pay the inference API per request, and what the visitor is allowed to treat as factual.
If you build something like this, the leverage is not picking the biggest model. It is treating the system message and CONTEXT like a spec—plain, factual, line-auditable—rather than marketing copy, and deciding in the architecture when the large model is allowed to run at all (for example, only after an on-topic gate).