Labs / wiki-surface-precision

Wiki surfacer precision

A hook auto-injects a wiki note on every prompt, spending context tokens you never asked it to. This measures how often it nags on off-topic prompts (the false-positive rate) and reads the firing threshold off the sweep.

The question

On a real dev prompt the surfacer fires a concept you'd actually use; on an off-topic one it stays quiet. Only then does the auto-surface hook earn the context tokens it spends, instead of just nagging.

False-positive rate 0%

Threshold sweep · precision vs recall

precision@1 recall@3

False-positive (nag) rate · 0% across all thresholds — the off-topic prompts never cross the bar.

Per-prompt · top concept vs gate gate · 0.15 →

On-topic · n=23

how do durable execution frameworks help with consistency in distributed systems

0.51 hit

how do I keep my flaky tests from blocking CI

0.50 hit

using clamp to scale font sizes smoothly across viewport widths

0.40 hit

why do my multi-agent setups fail so often, is it the coordination or the synthesis

0.40 hit

wiring up a model context protocol server so the assistant can call my functions

0.38 hit

worried AI coding assistants are atrophying my skills and hurting verification

0.37 hit

how do I design a narrow stable interface that hides the messy internals

0.34 hit

how do I expose my tools and context to an AI agent over a standard protocol

0.30 hit

which two typefaces pair well to give my design system contrast and hierarchy

0.26 hit

the model is being sycophantic and agreeing with everything, how do I constrain that at inference time

0.25 hit

making my type scale readable and keeping semantic HTML for screen readers

0.24 hit

worried about a malicious dependency sneaking through code review with hidden unicode

0.24 hit

coordinating several LLM agents that keep disagreeing on subtasks

0.23 miss

looking for a single-binary UI to see my k8s topology and helm releases

0.23 hit

the agent's context window keeps filling with junk, how do I budget tokens better

0.20 hit

I want my layout to adapt without writing a dozen media query breakpoints

0.20 hit

RAG keeps missing facts that span multiple documents, any alternatives

0.17 hit

I need real-time visibility into what my cluster is actually doing

0.17 hit

running a quantized model locally but I'm hitting VRAM limits

0.17 miss

best way to set up a merge queue so the main branch stays green

0.13 miss

I want the model to pull in relevant docs at query time before answering

0.13 hit

my test suite passes locally but randomly fails on the build server, what's going on

0.11 miss

what should I think about when deciding what to stuff into the prompt

0.05 miss

Off-topic · n=9 · 0 nags

how long should I marinate chicken thighs before grilling

0.10 silent

what's a good sourdough hydration ratio for a crusty loaf

0.07 silent

best way to remove a red wine stain from a wool rug

0.05 silent

what year did the Beatles release Abbey Road

0.03 silent

is it cheaper to lease or buy a car for a 3 year horizon

0.03 silent

tips for keeping a basil plant alive on a windowsill in winter

0.03 silent

recommend a beginner-friendly hiking trail near Portland with a waterfall

0.00 silent

what's the capital of Mongolia

0.00 silent

how do I change a flat bicycle tire on a road bike

0.00 silent

The wiki surfacer comes in two modes. One is the /wiki command you run on purpose, and it costs tokens only when you ask. The other is a hook that auto-injects a relevant concept on every prompt, asked for or not. The command is easy to justify. The hook is the gamble: it pays only when the concept it surfaces is one you’d have wanted anyway. Fire it on a question about sourdough and you’ve burned context tokens for nothing and split attention for less. That’s the nag, and it’s the thing this measures.

The headline is the false-positive rate. There are nine off-topic prompts here, none with a matching concept anywhere in the wiki. How many make the hook fire anyway? At the recommended threshold, zero. Precision@1 on the real dev prompts holds at 78% — the top-ranked concept is one you’d use — and recall@3 is highest at the lowest threshold. That’s why the sweep fires the hook at 0.15: low enough to keep coverage, high enough that nothing off-topic ever clears it.

The scorer is lexical. No model calls, nothing to drift, $0 a run. That last part is the point, because it means the check can gate the hook on every change to the corpus. The rule it feeds is blunt. If some later corpus or scorer change pushes the nag rate up and you can’t tune it back toward zero without gutting precision, you keep the /wiki command and kill the hook. Ship the surface people ask for; drop the one that interrupts.

back to /labs