Wiki surfacer precision

A hook auto-injects a wiki note on every prompt, spending context tokens you never asked it to spend. This lab measures how often the hook nags on off-topic prompts (the false-positive rate) and reads the firing threshold off the sweep.

The question

On a real dev prompt, does the surfacer fire a concept you'd actually use, and on an off-topic one, does it stay quiet? Only then does the auto-surface hook earn the context tokens it spends, instead of just nagging.

False-positive rate 0%
Threshold sweep · precision vs recall
precision@1 · 78%
recall@3 by threshold · 74% at 0.15 (fires here) · 43% at 0.25 · 26% at 0.35 · 9% at 0.45

False-positive (nag) rate · 0% across all thresholds: the highest off-topic score is 0.10, below the lowest swept threshold of 0.15, so the off-topic prompts never cross the bar.

Per-prompt · top concept vs gate
On-topic · n=23
how do durable execution frameworks help with consistency in distributed systems
0.51 hit
how do I keep my flaky tests from blocking CI
0.50 hit
using clamp to scale font sizes smoothly across viewport widths
0.40 hit
why do my multi-agent setups fail so often, is it the coordination or the synthesis
0.40 hit
wiring up a model context protocol server so the assistant can call my functions
0.38 hit
worried AI coding assistants are atrophying my skills and hurting verification
0.37 hit
how do I design a narrow stable interface that hides the messy internals
0.34 hit
how do I expose my tools and context to an AI agent over a standard protocol
0.30 hit
which two typefaces pair well to give my design system contrast and hierarchy
0.26 hit
the model is being sycophantic and agreeing with everything, how do I constrain that at inference time
0.25 hit
making my type scale readable and keeping semantic HTML for screen readers
0.24 hit
worried about a malicious dependency sneaking through code review with hidden unicode
0.24 hit
coordinating several LLM agents that keep disagreeing on subtasks
0.23 miss
looking for a single-binary UI to see my k8s topology and helm releases
0.23 hit
the agent's context window keeps filling with junk, how do I budget tokens better
0.20 hit
I want my layout to adapt without writing a dozen media query breakpoints
0.20 hit
RAG keeps missing facts that span multiple documents, any alternatives
0.17 hit
I need real-time visibility into what my cluster is actually doing
0.17 hit
running a quantized model locally but I'm hitting VRAM limits
0.17 miss
best way to set up a merge queue so the main branch stays green
0.13 miss
I want the model to pull in relevant docs at query time before answering
0.13 hit
my test suite passes locally but randomly fails on the build server, what's going on
0.11 miss
what should I think about when deciding what to stuff into the prompt
0.05 miss
Off-topic · n=9 · 0 nags
how long should I marinate chicken thighs before grilling
0.10 silent
what's a good sourdough hydration ratio for a crusty loaf
0.07 silent
best way to remove a red wine stain from a wool rug
0.05 silent
what year did the Beatles release Abbey Road
0.03 silent
is it cheaper to lease or buy a car for a 3 year horizon
0.03 silent
tips for keeping a basil plant alive on a windowsill in winter
0.03 silent
recommend a beginner-friendly hiking trail near Portland with a waterfall
0.00 silent
what's the capital of Mongolia
0.00 silent
how do I change a flat bicycle tire on a road bike
0.00 silent

The wiki surfacer comes in two modes. One is the /wiki command you run on purpose, and it costs tokens only when you ask. The other is a hook that auto-injects a relevant concept on every prompt, asked for or not. The command is easy to justify. The hook is the gamble: it pays only when the concept it surfaces is one you’d have wanted anyway. Fire it on a question about sourdough and you’ve burned context tokens and split attention for nothing. That’s the nag, and it’s the thing this measures.

The headline is the false-positive rate. There are nine off-topic prompts here, none with a matching concept anywhere in the wiki. How many make the hook fire anyway? At the recommended threshold, and at every other threshold in the sweep, zero. Precision@1 on the real dev prompts holds at 78%, meaning the top-ranked concept is one you’d use, and recall@3 peaks at 74% at the lowest threshold, then falls to 9% by 0.45. That’s why the sweep fires the hook at 0.15: low enough to keep coverage, high enough that nothing off-topic ever clears it.
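The sweep can be re-derived from the per-prompt list above. The sketch below is a minimal reconstruction, not the lab's own code: it assumes the hook fires when the top score is at or above the threshold, and that a prompt counts toward the plotted rate only when it both fires and is marked "hit". The page does not state those definitions, and the names `ON_TOPIC`, `OFF_TOPIC`, and `sweep` are made up for illustration.

```python
# (top score, marked "hit"?) for the 23 on-topic prompts, copied from the list above.
ON_TOPIC = [
    (0.51, True), (0.50, True), (0.40, True), (0.40, True), (0.38, True),
    (0.37, True), (0.34, True), (0.30, True), (0.26, True), (0.25, True),
    (0.24, True), (0.24, True), (0.23, False), (0.23, True), (0.20, True),
    (0.20, True), (0.17, True), (0.17, True), (0.17, False), (0.13, False),
    (0.13, True), (0.11, False), (0.05, False),
]
# Top scores for the 9 off-topic prompts.
OFF_TOPIC = [0.10, 0.07, 0.05, 0.03, 0.03, 0.03, 0.00, 0.00, 0.00]
THRESHOLDS = [0.15, 0.25, 0.35, 0.45]


def sweep(threshold):
    """Return (gated hit rate, nag rate) at one threshold, under the assumed definitions."""
    # On-topic prompts that clear the gate AND surfaced a usable concept.
    fired_hits = sum(1 for score, hit in ON_TOPIC if hit and score >= threshold)
    # Off-topic prompts that clear the gate: each one is a nag.
    nags = sum(1 for score in OFF_TOPIC if score >= threshold)
    return fired_hits / len(ON_TOPIC), nags / len(OFF_TOPIC)


if __name__ == "__main__":
    for t in THRESHOLDS:
        hit_rate, nag_rate = sweep(t)
        print(f"threshold {t:.2f}: hit rate {hit_rate:.0%}, nag rate {nag_rate:.0%}")
```

Under these assumptions the gated hit rate comes out at 74%, 43%, 26%, and 9% for the four thresholds, matching the recall figures in the chart, and the nag rate is 0% at each. With no gate at all, 18 of the 23 on-topic prompts are hits, which is the 78% precision@1 figure.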

The scorer is lexical. No model calls, nothing to drift, $0 a run. That last part is the point, because it means the check can gate the hook on every change to the corpus. The rule it feeds is blunt. If some later corpus or scorer change pushes the nag rate up and you can’t tune it back toward zero without gutting precision, you keep the /wiki command and kill the hook. Ship the surface people ask for; drop the one that interrupts.
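As a rough illustration of what a lexical gate and the keep-or-kill rule might look like, here is a sketch under loud assumptions. The page only says the scorer is lexical; the Jaccard word-overlap score, the stopword list, the two-entry `WIKI` corpus, the `precision_floor` value, and every function name below are hypothetical stand-ins, not the lab's implementation.

```python
import re

# Hypothetical stopword list; the real scorer's tokenization is not described.
STOPWORDS = {"a", "an", "the", "to", "of", "in", "on", "and", "my", "i",
             "how", "do", "is", "what", "for", "so", "from"}

# Hypothetical two-concept corpus, for illustration only.
WIKI = {
    "flaky-tests": "flaky tests fail randomly and block CI; quarantine and retry them",
    "fluid-typography": "use clamp to scale font sizes smoothly across viewport widths",
}


def tokens(text):
    """Lowercased word set with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def score(prompt, concept_text):
    """Jaccard overlap between prompt and concept vocabulary (assumed scorer)."""
    p, c = tokens(prompt), tokens(concept_text)
    return len(p & c) / len(p | c) if p and c else 0.0


def surface(prompt, wiki, threshold=0.15):
    """Return (concept, score) for the best match, or None when it fails the gate."""
    if not wiki:
        return None
    name, best = max(((n, score(prompt, t)) for n, t in wiki.items()),
                     key=lambda pair: pair[1])
    return (name, best) if best >= threshold else None


def nag_rate(off_topic_prompts, wiki, threshold=0.15):
    """Fraction of off-topic prompts that make the hook fire anyway."""
    fired = sum(surface(p, wiki, threshold) is not None for p in off_topic_prompts)
    return fired / len(off_topic_prompts)


def pick_threshold(sweep_rows, precision_floor=0.7):
    """Lowest threshold with zero nags and acceptable precision.

    sweep_rows holds (threshold, nag_rate, precision) tuples. Returning None
    means no setting works: keep the /wiki command and kill the hook.
    precision_floor is an assumed cutoff, not a number from the lab.
    """
    ok = [t for t, nag, precision in sweep_rows
          if nag == 0 and precision >= precision_floor]
    return min(ok) if ok else None
```

Because nothing here calls a model, a check like `nag_rate(...) == 0` is cheap enough to run on every corpus change, which is what lets it act as a gate rather than an occasional audit.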
