NLI Classifier Design Proposal

Audit

What needs work

The tool works well functionally: two classifiers, side-by-side results, and a 66-example benchmark that auto-runs. The issues are visual consistency and information hierarchy; the page reads like a prototype that grew organically.

1. Mixed type systems

Instrument Serif for headlines, IBM Plex Sans for body, IBM Plex Mono for labels, and Space Mono loaded but unused. Four font families is two too many for a single-page tool, and the serif's editorial feel clashes with the technical content.

2. Header is underpowered

The title "Direct vs Agentic" is set in Instrument Serif at 48px — elegant but not assertive. The "INTENT CLASSIFIER" super-label at 11px mono disappears. The whole header section feels like a blog post, not a tool.

3. Info section has no container

The two-column Heuristic/NLI info floats without a card or visual boundary. At a glance it's hard to tell this is a discrete section. The badges (CLIENT-SIDE, SERVER-SIDE) are the only visual anchors.

4. Result cards lack weight

The result intent word ("agentic" / "direct") at 32px serif is soft. The confidence bars are 4px thin with no labels explaining what they mean to a new user. The gradient top-bar is subtle to the point of invisible.

5. Benchmark table is dense

66 rows across three expandable groups. Each row shows prompt text + expected + two results, but the results are tiny colored pills. The "H: 100% NLI: 100%" stat line per group is cramped and hard to parse.

6. No design system alignment

The page uses its own color tokens (const C) with no overlap with the colony design system. Space Grotesk + Space Mono are the colony standard; this page uses IBM Plex + Instrument Serif. Feels like a different project.

Typography

Consolidate to two families

Drop Instrument Serif and IBM Plex entirely. Use Space Grotesk for all display and body text, Space Mono for labels, data, and technical content. This aligns with the colony design system and gives the tool a crisper, more technical feel.

Current

Title: Instrument Serif, 48px [Cut]
Super-label: IBM Plex Mono, 11px [Cut]
Card headings: Instrument Serif, 24px [Cut]
Body: IBM Plex Sans, 14-15px [Cut]
Labels/data: IBM Plex Mono, 10-13px [Cut]
Result intent: Instrument Serif, 32px [Cut]
Unused: Space Mono (loaded, never rendered) [Cut]

Proposed

Title: Space Grotesk 700, 2.4rem [New]
Super-label: Space Mono 700, 0.6rem, tracked [New]
Card headings: Space Grotesk 700, 1.1rem [New]
Body: Space Grotesk 400, 0.85rem [New]
Labels/data: Space Mono 400/700, 0.55-0.65rem [New]
Result intent: Space Grotesk 700, 1.5rem [New]
Input text: Space Mono 400, 0.78rem [Keep]

The result intent word loses the serif elegance but gains legibility and coherence. The title trades height for weight (48px → ~43px, which assumes an 18px root font size; at the browser-default 16px root, 2.4rem is 38.4px). Space Grotesk 700 with tight letter-spacing hits harder than the wispy serif.
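The proposed scale is given in rem, so the rendered pixel sizes depend on the page's root font size. A minimal sketch of the scale as a token object, with a helper to check the resulting pixel values. The `TYPE` object, its key names, and `remToPx` are hypothetical and not taken from app.jsx; the root sizes are assumptions.

```javascript
// Hypothetical type-scale tokens for the proposed system.
// Families, weights, and sizes come from the "Proposed" list; key names are illustrative.
const TYPE = {
  title:        { fontFamily: "'Space Grotesk', sans-serif", fontWeight: 700, fontSize: "2.4rem" },
  superLabel:   { fontFamily: "'Space Mono', monospace",     fontWeight: 700, fontSize: "0.6rem" },
  cardHeading:  { fontFamily: "'Space Grotesk', sans-serif", fontWeight: 700, fontSize: "1.1rem" },
  body:         { fontFamily: "'Space Grotesk', sans-serif", fontWeight: 400, fontSize: "0.85rem" },
  resultIntent: { fontFamily: "'Space Grotesk', sans-serif", fontWeight: 700, fontSize: "1.5rem" },
  input:        { fontFamily: "'Space Mono', monospace",     fontWeight: 400, fontSize: "0.78rem" },
};

// Convert a rem string such as "2.4rem" to pixels, rounded to one decimal place.
// rootPx defaults to 16, the usual browser default; the page may set a different root.
function remToPx(rem, rootPx = 16) {
  return Math.round(parseFloat(rem) * rootPx * 10) / 10;
}

console.log(remToPx(TYPE.title.fontSize));     // 38.4 at a 16px root
console.log(remToPx(TYPE.title.fontSize, 18)); // 43.2 at an 18px root
```

The ~43px title height quoted above corresponds to an 18px root; worth confirming against the page's actual root size before locking in the scale.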

Color

Keep the palette, tighten the system

The rose/indigo pair for direct/agentic is strong and well-separated. The green/red benchmark pass/fail is standard. The issue isn't the hues — it's how they're applied. Too many near-identical grays with no clear hierarchy.

Rose — Direct

#c0437f • Result display, confidence bars, benchmark pills

Used exclusively for "direct" classification results. Never for UI chrome or actions.

Indigo — Agentic

#6366f1 • Result display, confidence bars, benchmark pills, score weights

Used exclusively for "agentic" classification results and signal decomposition weights.

Green — Pass / Client

#059669 • Benchmark checkmarks, CLIENT-SIDE badge

Semantic "correct" and "local/instant" meanings. Dual role is fine since they never appear adjacent.

Red — Fail

#dc2626 • Benchmark crosses only

Never used for classification results. Strictly semantic "incorrect."

Blue — Server

#2563eb • SERVER-SIDE badge only

Could be retired: a plain "Server-side" label without a colored badge would reduce color noise.

Proposed change: Collapse the 7 text grays (#111827 through #9ca3af) into 4 clear levels. Cut textFaint and textMedium, and fold textStrong (#1f2937) into text (#111827), since the visual difference between the two is negligible on screens. Use textMuted (#6b7280) for secondary content and text for everything primary.
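One low-risk way to stage the gray collapse is a migration map, so call sites that still reference a retired token resolve to a surviving one. A rough sketch under stated assumptions: the token names and the two hex values come from this proposal, but the destination chosen for each retired token, the `GRAY_MIGRATION` map, and `resolveGray` are hypothetical.

```javascript
// Surviving gray tokens. Only the two levels with hex values stated in the proposal
// are listed; the other surviving levels are omitted rather than guessed.
const C = {
  text: "#111827",
  textMuted: "#6b7280",
};

// Assumed mapping from retired tokens to surviving ones (not confirmed by the source).
const GRAY_MIGRATION = {
  textStrong: "text",      // near-identical to text on screens
  textMedium: "textMuted",
  textFaint: "textMuted",
};

const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Resolve a gray token name, following the migration map for retired names.
function resolveGray(name) {
  const target = has(GRAY_MIGRATION, name) ? GRAY_MIGRATION[name] : name;
  if (!has(C, target)) {
    throw new Error(`Unknown color token: ${name}`);
  }
  return C[target];
}

console.log(resolveGray("textStrong")); // "#111827"
console.log(resolveGray("textFaint"));  // "#6b7280"
```

Once every call site uses a surviving token directly, the map can be deleted.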

Layout

Sharpen the sections

The page flows well top-to-bottom but the sections bleed into each other. Each section should be a visually distinct unit with clear boundaries.

Header

Make the title assertive. Space Grotesk 700 with negative letter-spacing. Drop the "vs" span color treatment (gray "vs" reads as decorative, not structural). Use a proper em-dash or just the word "vs" at full weight.

Info cards

Wrap each classifier description in a card. Same white background + border treatment as result cards. The badges stay but move to a more prominent position, left-aligned with the card title rather than floating right.

Details toggle

Rethink the "How do they work?" button. Currently it's a full-width button that looks like a CTA. Make it a text link with a disclosure triangle, or fold the content into the info cards as a second paragraph that's always visible. The heuristic signals table and NLI architecture details are useful context — hiding them behind a toggle means most users never see them.

Input area

Give the input more presence. The "TEST A PROMPT" label at 10px mono is too small. Make it the same scale as info card headers. The Randomize button should be a subtle inline action, not a separate button competing for attention.

Result cards

Thicken the top accent bar from 3px gradient to a solid 4px bar in the result color. The gradient fades to transparent at edges, which makes it almost invisible. A solid bar gives immediate color-coded feedback.

Confidence bars

Increase bar height from 4px to 6px. Add the percentage inline with the bar label (not in a separate centered line below). The current "62% confident" centered below the bars is disconnected from the data it describes.

Agreement pill

Move to between the result cards, vertically centered. Currently it's below both cards, which breaks the visual relationship. On mobile (single column), show it above the card that's currently visible.

Benchmark

Make the accuracy stats more prominent. The "97%" and "100%" numbers are there but buried. Pull them into a clear stat bar at the top of the benchmark section — two large numbers, each labeled with the classifier name and latency. The per-group stats (H: 100% NLI: 100%) should use the same format, just smaller.
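The overall stat bar and the per-group stat lines can share one calculation. A minimal sketch, assuming each benchmark row carries the expected label plus one result per classifier. The field names (`expected`, `heuristic`, `nli`) and the sample rows are made up for illustration and are not the real benchmark data.

```javascript
// Percentage of rows where the given classifier matched the expected label,
// rounded to a whole percent. Returns 0 for an empty group.
function accuracyPercent(rows, classifierKey) {
  if (rows.length === 0) return 0;
  const correct = rows.filter((row) => row[classifierKey] === row.expected).length;
  return Math.round((correct / rows.length) * 100);
}

// Same format for the top-level stat bar and the smaller per-group lines.
function statLine(rows) {
  return `H: ${accuracyPercent(rows, "heuristic")}%  NLI: ${accuracyPercent(rows, "nli")}%`;
}

// Illustrative rows only; field names and values are assumptions.
const sampleGroup = [
  { expected: "direct",  heuristic: "direct",  nli: "direct" },
  { expected: "agentic", heuristic: "agentic", nli: "agentic" },
  { expected: "agentic", heuristic: "direct",  nli: "agentic" },
  { expected: "direct",  heuristic: "direct",  nli: "direct" },
];

console.log(statLine(sampleGroup)); // "H: 75%  NLI: 100%"
```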

Benchmark table

Tighten the row layout. The prompt text wraps awkwardly when long. Give it more width by making the result columns narrower — they only need to show a colored pill, not a full word. Use icons or abbreviated labels instead of repeating "direct" / "agentic" 66 times.

Mobile

Keep the floating switcher — it works. But style it to match: Space Mono, tracked uppercase, same border radius as cards. Currently it's visually disconnected from the rest of the page.

Mockup

Proposed direction

Static mockup showing the proposed typography, layout, and color treatment. Not interactive — just enough to evaluate the direction before implementation.

Mockup — nli.dearlarry.co

Intent Classifier

Direct vs Agentic

Can a diffusion model handle this prompt directly, or does it need an LLM/agent layer first?

Heuristic

Client-side

Scores prompts against handcrafted rules
Instant — runs in your browser
No model, no server needed

NLI Model

Server-side

Trained on 66 labeled examples
Understands meaning, not just keywords
~5ms on GPU

Test a prompt

take the concept of solitude and turn it into a landscape

Heuristic: agentic
direct 15%
agentic 85%

NLI Model: agentic
direct 8%
agentic 92%

Agree

Key differences from current: Space Grotesk throughout, info sections in cards, thicker accent bars on results, higher-contrast labels. The overall density is similar — this isn't about adding whitespace, it's about making the existing space work harder through typography and containment.

Implementation

Scope and approach

This is a CSS/font pass on a single JSX file. No structural React changes, no new components, no API changes. The inline styles in app.jsx would be updated in place.

Font swap

Replace the Google Fonts link: drop IBM Plex Mono, IBM Plex Sans, and Instrument Serif. Load Space Grotesk (400, 500, 700) and keep Space Mono (400, 700).
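For reference, a small sketch that builds the replacement stylesheet URL in the Google Fonts CSS2 format. The `googleFontsUrl` helper is hypothetical, and `display=swap` is an assumed choice that the proposal does not specify.

```javascript
// Build a Google Fonts CSS2 URL from a list of families and weights.
// The CSS2 API expects weights in ascending order, separated by semicolons.
function googleFontsUrl(families) {
  const params = families
    .map(({ name, weights }) => {
      const sorted = [...weights].sort((a, b) => a - b);
      return `family=${name.replace(/ /g, "+")}:wght@${sorted.join(";")}`;
    })
    .join("&");
  return `https://fonts.googleapis.com/css2?${params}&display=swap`;
}

const fontsUrl = googleFontsUrl([
  { name: "Space Grotesk", weights: [400, 500, 700] },
  { name: "Space Mono", weights: [400, 700] },
]);

console.log(fontsUrl);
```

The resulting string would go in the `href` of the existing stylesheet link in place of the current four-family request.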

Color tokens

Collapse const C text grays from 7 to 4. Keep all intent colors (rose, indigo) and semantic colors (green, red) unchanged.

Info section

Wrap each classifier description in a card div with the same background: C.cardBg, border, borderRadius as result cards.
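Sharing one style helper between the info cards and the result cards keeps the two from drifting apart. A rough sketch: `C.cardBg` is named in this proposal, but the `C.border` token, the border width, the radius value, and the `cardStyle` helper are assumptions.

```javascript
// Stand-in for the relevant slice of const C; values here are assumed, not copied from app.jsx.
const C = {
  cardBg: "#ffffff", // the proposal describes the card background as white
  border: "#e5e7eb", // hypothetical border color
};

// Base inline style for any card; callers can pass overrides such as padding.
function cardStyle(overrides = {}) {
  return {
    background: C.cardBg,
    border: `1px solid ${C.border}`, // assumed 1px border
    borderRadius: 8,                 // assumed radius, should match the result cards
    ...overrides,
  };
}

// Info cards and result cards would both start from the same base.
const infoCard = cardStyle({ padding: 16 });
const resultCard = cardStyle({ overflow: "hidden" });

console.log(infoCard, resultCard);
```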

Result top bar

Change from linear-gradient(90deg, transparent, accent, transparent) to solid background: accent, increase height from 3px to 4px.
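The before and after of the accent bar, sketched as inline style objects. The hex values are the rose and indigo from the Color section; the `INTENT_COLORS` map and the function names are hypothetical.

```javascript
// Intent colors from the Color section; the map name is illustrative.
const INTENT_COLORS = {
  direct: "#c0437f",  // rose
  agentic: "#6366f1", // indigo
};

// Current treatment: 3px gradient that fades to transparent at both edges.
function legacyTopBarStyle(accent) {
  return {
    height: 3,
    background: `linear-gradient(90deg, transparent, ${accent}, transparent)`,
  };
}

// Proposed treatment: solid 4px bar in the result color.
function topBarStyle(intent) {
  const accent = INTENT_COLORS[intent];
  if (!accent) {
    throw new Error(`Unknown intent: ${intent}`);
  }
  return { height: 4, background: accent };
}

console.log(legacyTopBarStyle(INTENT_COLORS.agentic));
console.log(topBarStyle("agentic")); // { height: 4, background: '#6366f1' }
```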

Confidence bars

Increase height from 4px to 6px. Remove the centered "X% confident" line — the per-bar percentages are sufficient.
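With the percentage moved into the bar label, each bar can be described by one small object. A minimal sketch, assuming the classifier returns a probability between 0 and 1 per intent; the `confidenceBar` helper and its return shape are hypothetical.

```javascript
const BAR_HEIGHT = 6; // proposed height, up from 4px

// Describe one confidence bar: an inline label with the percentage,
// plus inline styles for the track and the filled portion.
function confidenceBar(label, probability, color) {
  const clamped = Math.min(1, Math.max(0, probability));
  const pct = Math.round(clamped * 100);
  return {
    label: `${label} ${pct}%`,
    trackStyle: { height: BAR_HEIGHT },
    fillStyle: { height: BAR_HEIGHT, width: `${pct}%`, background: color },
  };
}

// Values from the mockup's heuristic card.
const bars = [
  confidenceBar("direct", 0.15, "#c0437f"),
  confidenceBar("agentic", 0.85, "#6366f1"),
];

console.log(bars.map((bar) => bar.label)); // [ 'direct 15%', 'agentic 85%' ]
```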

Details section

Convert from full-width button to a disclosure triangle link. Consider making always-visible if the content is short enough.

Mobile switcher

Update font to Space Mono, match border radius and spacing to card system.

Estimated changes: ~40 lines in const C, ~30 lines of font-family swaps, ~20 lines of layout tweaks. The benchmark section and signal decomposition don't need structural changes — they just inherit the new fonts and tightened grays.

Summary

One pass, cohesive output

This proposal is a visual consistency pass, not a redesign. The tool's structure, interactions, and information architecture are already good. The changes are:

Typography: 4 families → 2 (Space Grotesk + Space Mono)
Colors: 7 text grays → 4, intent colors unchanged
Layout: Info sections get cards, result bars get thickened, confidence bars get taller
Details: Full-width button → disclosure link
Alignment: Matches colony design system (Space Grotesk / Space Mono / paper palette)

The result should feel like it belongs to the same family as the other dearlarry.co tools without losing the specialized character of a comparison tool.