Design Proposal

NLI classifier design pass

Visual and UX improvements for the intent classifier comparison tool at nli.dearlarry.co.

Audit

What needs work

The tool works well functionally. Two classifiers, side-by-side results, 66-example benchmark that auto-runs. The issues are visual consistency and information hierarchy — the page reads like a prototype that grew organically.

1. Mixed type systems

Instrument Serif for headlines, IBM Plex for body, IBM Plex Mono for labels, Space Mono loaded but unused. Four font families is two too many for a single-page tool. The serif gives it an editorial feel that clashes with the technical content.

2. Header is underpowered

The title "Direct vs Agentic" is set in Instrument Serif at 48px — elegant but not assertive. The "INTENT CLASSIFIER" super-label at 11px mono disappears. The whole header section feels like a blog post, not a tool.

3. Info section has no container

The two-column Heuristic/NLI info floats without a card or visual boundary. At a glance it's hard to tell this is a discrete section. The badges (CLIENT-SIDE, SERVER-SIDE) are the only visual anchors.

4. Result cards lack weight

The result intent word ("agentic" / "direct") at 32px serif is soft. The confidence bars are 4px thin with no labels explaining what they mean to a new user. The gradient top-bar is subtle to the point of invisible.

5. Benchmark table is dense

66 rows across three expandable groups. Each row shows prompt text + expected + two results, but the results are tiny colored pills. The "H: 100% NLI: 100%" stat line per group is cramped and hard to parse.

6. No design system alignment

The page uses its own color tokens (const C) with no overlap with the colony design system. Space Grotesk + Space Mono are the colony standard; this page uses IBM Plex + Instrument Serif. Feels like a different project.

Typography

Consolidate to two families

Drop Instrument Serif and IBM Plex entirely. Use Space Grotesk for all display and body text, Space Mono for labels, data, and technical content. This aligns with the colony design system and gives the tool a crisper, more technical feel.

Current
  • Title: Instrument Serif 48px [cut]
  • Super-label: IBM Plex Mono 11px [cut]
  • Card headings: Instrument Serif 24px [cut]
  • Body: IBM Plex Sans 14-15px [cut]
  • Labels/data: IBM Plex Mono 10-13px [cut]
  • Result intent: Instrument Serif 32px [cut]
  • Unused: Space Mono (loaded, never rendered) [cut]

Proposed
  • Title: Space Grotesk 700, 2.4rem [new]
  • Super-label: Space Mono 700, 0.6rem, tracked [new]
  • Card headings: Space Grotesk 700, 1.1rem [new]
  • Body: Space Grotesk 400, 0.85rem [new]
  • Labels/data: Space Mono 400/700, 0.55-0.65rem [new]
  • Result intent: Space Grotesk 700, 1.5rem [new]
  • Input text: Space Mono 400, 0.78rem [keep]

The result intent word loses the serif elegance but gains legibility and coherence. The title trades height for weight: 48px becomes ~43px (2.4rem at an 18px root; at the 16px browser default it would be ~38px), and Space Grotesk 700 with tight letter-spacing hits harder than the wispy serif.
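
The proposed scale could be captured as a small set of inline-style tokens. This is a minimal sketch, not the contents of app.jsx: the `TYPE` object, its key names, and the letter-spacing values are assumptions made for illustration, while the families, weights, and sizes come from the Proposed list above.

```javascript
// Hypothetical type tokens; names and letter-spacing values are illustrative.
const GROTESK = "'Space Grotesk', sans-serif";
const MONO = "'Space Mono', monospace";

const TYPE = {
  title:        { fontFamily: GROTESK, fontWeight: 700, fontSize: "2.4rem", letterSpacing: "-0.02em" },
  superLabel:   { fontFamily: MONO, fontWeight: 700, fontSize: "0.6rem", letterSpacing: "0.12em" },
  cardHeading:  { fontFamily: GROTESK, fontWeight: 700, fontSize: "1.1rem" },
  body:         { fontFamily: GROTESK, fontWeight: 400, fontSize: "0.85rem" },
  label:        { fontFamily: MONO, fontWeight: 400, fontSize: "0.6rem" },
  resultIntent: { fontFamily: GROTESK, fontWeight: 700, fontSize: "1.5rem" },
  input:        { fontFamily: MONO, fontWeight: 400, fontSize: "0.78rem" },
};

// Convert a rem string to pixels for a given root font size,
// rounded to one decimal place.
function remToPx(rem, rootPx = 16) {
  return Math.round(parseFloat(rem) * rootPx * 10) / 10;
}

console.log(remToPx(TYPE.title.fontSize));     // 38.4
console.log(remToPx(TYPE.title.fontSize, 18)); // 43.2
```

Keeping the sizes in rem means the rendered title height depends on the page's root font size, which is worth confirming before comparing against the current 48px.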

Color

Keep the palette, tighten the system

The rose/indigo pair for direct/agentic is strong and well-separated. The green/red benchmark pass/fail is standard. The issue isn't the hues — it's how they're applied. Too many near-identical grays with no clear hierarchy.

Rose — Direct
#c0437f • Result display, confidence bars, benchmark pills
Used exclusively for "direct" classification results. Never for UI chrome or actions.
Indigo — Agentic
#6366f1 • Result display, confidence bars, benchmark pills, score weights
Used exclusively for "agentic" classification results and signal decomposition weights.
Green — Pass / Client
#059669 • Benchmark checkmarks, CLIENT-SIDE badge
Semantic "correct" and "local/instant" meanings. Dual role is fine since they never appear adjacent.
Red — Fail
#dc2626 • Benchmark crosses only
Never used for classification results. Strictly semantic "incorrect."
Blue — Server
#2563eb • SERVER-SIDE badge only
Could potentially be retired — a simple label "Server-side" without a colored badge would reduce color noise.

Proposed change: Collapse the 7 text grays (#111827 through #9ca3af) into 4 clear levels. Cut textFaint and textMedium, and fold textStrong (#1f2937) into text (#111827), since the visual difference between the two is negligible on screens. Use textMuted (#6b7280) for secondary content and text for everything primary.
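
One way to check which grays are worth keeping is to compute WCAG contrast ratios between them. The sketch below uses the standard WCAG 2.x relative-luminance formula; the hex values are the ones quoted above, and the white background is an assumption based on the card treatment described elsewhere in this proposal.

```javascript
// WCAG 2.x relative luminance of a "#rrggbb" color.
function luminance(hex) {
  const channels = [1, 3, 5].map((i) => parseInt(hex.slice(i, i + 2), 16) / 255);
  const [r, g, b] = channels.map((c) =>
    c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4)
  );
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio between two colors: 1 means identical, 21 is black on white.
function contrast(a, b) {
  const [lighter, darker] = [luminance(a), luminance(b)].sort((x, y) => y - x);
  return (lighter + 0.05) / (darker + 0.05);
}

const text = "#111827";
const textStrong = "#1f2937";
const textMuted = "#6b7280";
const white = "#ffffff"; // assumed card background

console.log(contrast(text, textStrong).toFixed(2)); // the two "primary" grays against each other
console.log(contrast(textMuted, white).toFixed(2)); // secondary text on a white card
```

The first ratio comes out close to 1.2:1, which supports treating text and textStrong as one level. The second stays above the 4.5:1 threshold WCAG AA sets for normal-size text, so textMuted remains usable for secondary content on white.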

Layout

Sharpen the sections

The page flows well top-to-bottom but the sections bleed into each other. Each section should be a visually distinct unit with clear boundaries.

Header
Make the title assertive. Space Grotesk 700 with negative letter-spacing. Drop the "vs" span color treatment (gray "vs" reads as decorative, not structural). Use a proper em-dash or just the word "vs" at full weight.
Info cards
Wrap each classifier description in a card. Same white background + border treatment as result cards. The badges stay but move to a more prominent position, left-aligned with the card title rather than floating right.
Details toggle
Rethink the "How do they work?" button. Currently it's a full-width button that looks like a CTA. Make it a text link with a disclosure triangle, or fold the content into the info cards as a second paragraph that's always visible. The heuristic signals table and NLI architecture details are useful context — hiding them behind a toggle means most users never see them.
Input area
Give the input more presence. The "TEST A PROMPT" label at 10px mono is too small. Make it the same scale as info card headers. The Randomize button should be a subtle inline action, not a separate button competing for attention.
Result cards
Thicken the top accent bar from 3px gradient to a solid 4px bar in the result color. The gradient fades to transparent at edges, which makes it almost invisible. A solid bar gives immediate color-coded feedback.
Confidence bars
Increase bar height from 4px to 6px. Add the percentage inline with the bar label (not in a separate centered line below). The current "62% confident" centered below the bars is disconnected from the data it describes.
Agreement pill
Move to between the result cards, vertically centered. Currently it's below both cards, which breaks the visual relationship. On mobile (single column), show it above the card that's currently visible.
Benchmark
Make the accuracy stats more prominent. The "97%" and "100%" numbers are there but buried. Pull them into a clear stat bar at the top of the benchmark section — two large numbers, each labeled with the classifier name and latency. The per-group stats (H: 100% NLI: 100%) should use the same format, just smaller.
Benchmark table
Tighten the row layout. The prompt text wraps awkwardly when long. Give it more width by making the result columns narrower — they only need to show a colored pill, not a full word. Use icons or abbreviated labels instead of repeating "direct" / "agentic" 66 times.
Mobile
Keep the floating switcher — it works. But style it to match: Space Mono, tracked uppercase, same border radius as cards. Currently it's visually disconnected from the rest of the page.
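
The result-card and confidence-bar changes above could be expressed as plain style objects like the following. This is a sketch under assumptions: the helper names, the track color, and the border radius are hypothetical, and only the intent colors and the 4px and 6px heights come from this proposal.

```javascript
// Intent colors from the Color section.
const INTENT = { direct: "#c0437f", agentic: "#6366f1" };

// Solid 4px top accent bar in the result color (replaces the 3px gradient).
function accentBarStyle(intent) {
  return { height: 4, background: INTENT[intent] };
}

// One confidence row: the percentage sits in the label, next to a 6px bar.
function confidenceRow(intent, score) {
  const pct = Math.round(score * 100);
  return {
    label: `${intent} ${pct}%`,
    // Track color and radius are assumptions, not values from the page.
    track: { height: 6, borderRadius: 3, background: "#e5e7eb" },
    fill: { height: 6, borderRadius: 3, width: `${pct}%`, background: INTENT[intent] },
  };
}

console.log(accentBarStyle("agentic"));
console.log(confidenceRow("agentic", 0.85).label); // "agentic 85%"
```

Putting the percentage in the row label is what allows the separate centered "X% confident" line to be dropped.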

Mockup

Proposed direction

Static mockup showing the proposed typography, layout, and color treatment. Not interactive — just enough to evaluate the direction before implementation.

Mockup: nli.dearlarry.co

Intent Classifier
Direct vs Agentic
Can a diffusion model handle this prompt directly, or does it need an LLM/agent layer first?

Heuristic (Client-side)
  • Scores prompts against handcrafted rules
  • Instant, runs in your browser
  • No model, no server needed

NLI Model (Server-side)
  • Trained on 66 labeled examples
  • Understands meaning, not just keywords
  • ~5ms on GPU

Test a prompt
take the concept of solitude and turn it into a landscape

Heuristic: agentic (direct 15%, agentic 85%)
NLI Model: agentic (direct 8%, agentic 92%)
Agree

Key differences from current: Space Grotesk throughout, info sections in cards, thicker accent bars on results, higher-contrast labels. The overall density is similar — this isn't about adding whitespace, it's about making the existing space work harder through typography and containment.

Implementation

Scope and approach

This is a CSS/font pass on a single JSX file. No structural React changes, no new components, no API changes. The inline styles in app.jsx would be updated in place.

Font swap
Replace the Google Fonts link: drop IBM Plex Mono, IBM Plex Sans, and Instrument Serif. Load Space Grotesk (400, 500, 700) and Space Mono (400, 700); Space Mono is already loaded and stays.
Color tokens
Collapse const C text grays from 7 to 4. Keep all intent colors (rose, indigo) and semantic colors (green, red) unchanged.
Info section
Wrap each classifier description in a card div with the same background: C.cardBg, border, borderRadius as result cards.
Result top bar
Change from linear-gradient(90deg, transparent, accent, transparent) to solid background: accent, increase height from 3px to 4px.
Confidence bars
Increase height from 4px to 6px. Remove the centered "X% confident" line — the per-bar percentages are sufficient.
Details section
Convert from full-width button to a disclosure triangle link. Consider making always-visible if the content is short enough.
Mobile switcher
Update font to Space Mono, match border radius and spacing to card system.

Estimated changes: ~40 lines in const C, ~30 lines of font-family swaps, ~20 lines of layout tweaks. The benchmark section and signal decomposition don't need structural changes — they just inherit the new fonts and tightened grays.
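
For the font swap, the replacement stylesheet link can be built from the two families and weights listed above. The helper name is hypothetical; the URL shape follows the Google Fonts CSS2 API, which expects weights in ascending order.

```javascript
// Build a Google Fonts CSS2 URL from a { family: [weights] } map.
function googleFontsUrl(families) {
  const params = Object.entries(families).map(([name, weights]) => {
    const sorted = [...weights].sort((a, b) => a - b); // ascending order
    return `family=${name.replace(/ /g, "+")}:wght@${sorted.join(";")}`;
  });
  return `https://fonts.googleapis.com/css2?${params.join("&")}&display=swap`;
}

const url = googleFontsUrl({
  "Space Grotesk": [400, 500, 700],
  "Space Mono": [400, 700],
});

console.log(url);
// https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;700&family=Space+Mono:wght@400;700&display=swap
```

The resulting URL would go in the `href` of the existing stylesheet link, replacing the current four-family request.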

Summary

One pass, cohesive output

This proposal is a visual consistency pass, not a redesign. The tool's structure, interactions, and information architecture are already good. The changes are:

  • Typography: 4 families → 2 (Space Grotesk + Space Mono)
  • Colors: 7 text grays → 4, intent colors unchanged
  • Layout: Info sections get cards, result bars get thickened, confidence bars get taller
  • Details: Full-width button → disclosure link
  • Alignment: Matches colony design system (Space Grotesk / Space Mono / paper palette)

The result should feel like it belongs to the same family as the other dearlarry.co tools without losing the specialized character of a comparison tool.