Aug 2025 · 14 min · Research

cognitive retrieval systems

exhausted parents don't need hallucinated advice at 3am. a retrieval system that knows when to answer, when to ask, and when to say 'i don't know.'

the origin

it started with a message at 3am.

i've spent years building AI systems. and i've noticed something that bothers me every time i interact with a health chatbot: the moment you need it most, it sounds like a medical textbook written by a robot having a bad day.

i'm not talking about accuracy—most systems get the facts roughly right. i'm talking about the delivery.

ask a health AI about infant sleep at 2pm when you're researching for an article. functional response. ask the same question at 3am when your baby has been screaming for two hours. exact same response. same clinical tone. same detached precision. same complete failure to acknowledge that you're operating on 3.2 hours of fragmented sleep.

we've built systems that know everything and understand nothing.

then i met miriam ende—germany's leading baby sleep expert, bestselling author, the person thousands of exhausted parents turn to when nothing else works. i watched how she responds to messages at 3am. same facts as any textbook. completely different impact.

she doesn't just answer questions. she reads the desperation between the lines. she knows that "he's been waking every 90 minutes" isn't a query—it's a cry for help.

and she can only help one parent at a time.

that night, while miriam typed her response to one mother, thousands of other parents across germany sat in darkened nurseries, phones illuminating their exhausted faces, typing the same desperate queries into google. getting 2.3 million results. some saying it's developmental. some blaming feeding schedules. some insisting the baby needs sleep training immediately or will "never learn to self-soothe."

the information exists. but at 3am, with a crying baby and a brain running on fumes, information isn't what parents need. they need what miriam provides: accurate guidance delivered with genuine understanding.

the bigger picture

this isn't just one mother's crisis. it's a population-level failure.

the numbers:

  • 20-30% of parents report infant sleep problems
  • 46% of mothers characterize their child's sleep as problematic
  • vast majority of sleep-related searches occur between 10pm and 5am

and here's the crucial detail: that's precisely when professional consultation is unavailable.

new mothers average 6.7 hours of total sleep with only 3.2 hours uninterrupted during weeks 2-7 postpartum. this level of sleep deprivation causes measurable cognitive damage:

| function             | measured change |
|----------------------|-----------------|
| reaction time        | +83.69ms delay  |
| working memory       | -23% capacity   |
| emotional regulation | -34% inhibition |
| decision-making      | compromised     |

millions of cognitively impaired parents, seeking medical guidance during hours when human experts are asleep, getting contradictory information from systems that don't understand the difference between a clinical question and a cry for help.

the insight

traditional approach to health AI: better retrieval. find more accurate sources. rank them better. surface the right information.

this assumes the problem is informational.

it's not. it's architectural.

current systems treat every query the same way regardless of context. a researcher asking about infant sleep cycles at 2pm gets the same response structure as an exhausted parent asking at 2am. same tone. same complexity. same emotional register.

this is insane. the cognitive capacity of the user is part of the context.

cognitive retrieval systems inverts this model. instead of building a system that delivers perfect information to an idealized user, i'm building one that adapts its communication strategy to the actual cognitive and emotional state of the person in front of it.

same facts. different delivery. integrated, not competing.

the science

before designing the architecture, i had to understand what happens to a parent's brain after weeks of fragmented sleep.

the neurobiology of sleep deprivation

chronic sleep deprivation—less than 6 hours per night for more than 7 consecutive days—causes measurable changes:

| cognitive function   | measured change      | source                      |
|----------------------|----------------------|-----------------------------|
| working memory       | -23% capacity        | Lim & Dinges, 2010          |
| emotional regulation | -34% inhibition      | Yoo et al., 2007            |
| decision-making      | +47% risk aversion   | Harrison & Horne, 2000      |
| language processing  | +12% processing time | Ratcliff & Van Dongen, 2009 |

these findings have direct architectural implications:

  • reduced working memory → responses must be modular, short, scannable
  • impaired emotional regulation → negative framing causes disproportionate distress
  • increased risk aversion → warnings need careful calibration to avoid panic
  • slowed processing → complex sentence structures become comprehension barriers

traditional medical writing optimizes for precision. i needed to optimize for comprehension under cognitive impairment.

contextual empathy

this led to the concept i call contextual empathy: a system's ability to dynamically adapt its communicative strategy to the inferred emotional state and cognitive load of the user.

| traditional approach              | contextual empathy              |
|-----------------------------------|---------------------------------|
| static phrases                    | dynamic tone adjustment         |
| append empathy to facts           | integrate empathy with facts    |
| rule-based ("if stressed, say X") | probabilistic, context-inferred |
| independent of content            | co-constructed with content     |

this isn't about making the AI "nicer." it's about building a system that understands that how you say something changes what is heard, especially under cognitive load.

the hard problems

building a system that adapts to user state raises hard research questions:

1. context inference under minimal data — how do we accurately infer a user's emotional and cognitive state from sparse signals—timestamp, message complexity, response latency, linguistic markers? current sentiment analysis achieves ~80% accuracy on well-formed text. parents at 3am don't write well-formed text.

2. adaptive tone modeling — LLMs can adjust tone when explicitly instructed. but can they learn to modulate empathy, urgency, validation, and complexity dynamically based on inferred context? this requires more than prompt engineering—it requires a tone model that operates as a continuous function across multiple dimensions.

3. hallucination detection in real-time — the vectara hallucination leaderboard reports that even the best models hallucinate 0.7-4.4% of the time. in medical contexts, reported rates climb to 50-82%. for a production health system, even 1% is unacceptable. how do we push the hallucination rate below that threshold without sacrificing response quality or latency?

4. evidence grounding vs. empathetic generation — strict RAG implementations constrain LLM outputs to retrieved context, improving accuracy but reducing naturalness. pure generation produces fluid responses but hallucinates. the optimal balance for health communication is not known.

5. safety without sterilization — hard safety constraints ("never recommend cry-it-out methods") can make responses robotic. soft constraints risk dangerous advice slipping through. how do we enforce medical safety while preserving communicative warmth?

these are open questions. the optimal solutions aren't known. that's what makes this research.

the architecture

six-stage pipeline. user query → context-aware, verified response.

  1. context inference — temporal, linguistic, and conversational feature analysis
  2. multi-stage retrieval — user history, knowledge base, contextual re-ranking
  3. adaptive tone modeling — multi-dimensional tone targeting based on inferred state
  4. generation — LLM response with integrated tone parameters
  5. verification — claim extraction and source entailment checking
  6. safety validation — hard rules and soft guidelines enforcement

the system captures not just what the user is asking, but who is asking and when.
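the six stages can be sketched as a simple orchestration function. this is a minimal illustration of the pipeline shape, not the production implementation — the stage names, `Context` fields, and callable interfaces are assumptions for the sketch:

```python
from dataclasses import dataclass

@dataclass
class Context:
    # Hypothetical output of stage 1 (context inference).
    stress: float          # 0-1 scale
    cognitive_load: str    # "low" | "medium" | "high"
    urgency: str           # "routine" | "concerned" | "crisis"

def answer(query: str, stages: dict) -> str:
    """Run the six pipeline stages in order.

    Each stage is a callable supplied by the caller, so the pipeline
    shape stays fixed while individual stages can be swapped out.
    """
    ctx = stages["infer_context"](query)            # 1. context inference
    docs = stages["retrieve"](query, ctx)           # 2. multi-stage retrieval
    tone = stages["model_tone"](ctx)                # 3. adaptive tone modeling
    draft = stages["generate"](query, docs, tone)   # 4. generation
    verified = stages["verify"](draft, docs)        # 5. verification
    return stages["validate_safety"](verified)      # 6. safety validation
```

the point of the shape: context is inferred once, then threaded through retrieval, tone, and generation, so every downstream stage sees the same picture of who is asking and when.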

stage 1: context inference

the system analyzes three dimensions:

temporal features — timestamp, circadian phase, time since last message

linguistic features — sentence complexity, typo frequency, emotional markers

conversational features — response latency, topic drift, question repetition

from these signals, it infers:

  • likely stress level (0-1 scale)
  • estimated cognitive load (low/medium/high)
  • urgency of need (routine/concerned/crisis)
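a toy version of this inference step, combining temporal and linguistic signals into the three outputs above. the weights and thresholds here are illustrative placeholders, not the calibrated values from the actual system:

```python
from datetime import datetime

def infer_context(message: str, sent_at: datetime, latency_s: float) -> dict:
    """Heuristic sketch: map sparse signals to stress, load, and urgency.

    Signals used: circadian phase (temporal), message brevity and
    emotional punctuation (linguistic), reply latency (conversational).
    All weights are made up for illustration.
    """
    night = sent_at.hour >= 22 or sent_at.hour < 6     # temporal: circadian low point
    terse = len(message.split()) < 12                  # linguistic: short, frantic messages
    emotional = message.count("!") + message.count("?")  # emotional markers
    rapid = latency_s < 5                              # conversational: rapid-fire replies

    stress = min(1.0, 0.3 * night + 0.2 * terse + 0.1 * emotional + 0.2 * rapid)
    load = "high" if stress > 0.6 else "medium" if stress > 0.3 else "low"
    urgency = "crisis" if stress > 0.8 else "concerned" if stress > 0.4 else "routine"
    return {"stress": round(stress, 2), "cognitive_load": load, "urgency": urgency}
```

the same question typed at 3am in seven frantic words lands in a different bucket than the same topic researched calmly at 2pm — which is exactly the distinction the static systems above fail to make.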

stage 2: multi-stage retrieval

traditional RAG retrieves once. we retrieve three times:

layer 1: individual history — has this parent asked before? what worked?

layer 2: knowledge base — miriam ende's methodology, sleep science literature

layer 3: contextual re-ranking — re-weight results based on inferred user state

a stressed parent gets simpler, more actionable results ranked higher. a researcher gets comprehensive sources.
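the layer-3 re-ranking can be sketched as a stress-weighted score. the `actionability` and `complexity` fields are hypothetical metadata on each retrieved document, assumed for this illustration:

```python
def rerank(results: list[dict], stress: float) -> list[dict]:
    """Re-weight retrieval results by inferred user state.

    High stress boosts simple, actionable documents; at zero stress
    the ranking falls back to plain relevance.
    """
    def score(r: dict) -> float:
        return r["relevance"] + stress * r["actionability"] - stress * r["complexity"]
    return sorted(results, key=score, reverse=True)
```

with high stress, a quick-tips document outranks a more relevant but dense review paper; with zero stress, relevance alone decides.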

stage 3: adaptive tone modeling

treating tone as a multi-dimensional continuous space:

| dimension     | low             | high                  |
|---------------|-----------------|-----------------------|
| warmth        | matter-of-fact  | heartfelt             |
| urgency       | relaxed         | directive             |
| complexity    | simple language | technical detail      |
| directiveness | suggestive      | prescriptive          |
| validation    | neutral         | explicitly validating |
| reassurance   | informative     | actively calming      |

based on context inference, the system sets target values for each dimension. these are passed to the LLM as continuous parameters, not discrete instructions.
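a minimal sketch of constructing and serializing such a tone vector. the mapping from state to dimensions and the prompt format are assumptions; in practice the mapping would be learned or tuned:

```python
def tone_targets(stress: float, load: str) -> dict:
    """Map inferred state to continuous tone dimensions in [0, 1].

    Illustrative mapping: higher stress raises warmth, validation,
    and reassurance; high cognitive load forces low complexity.
    """
    return {
        "warmth":        min(1.0, 0.4 + 0.6 * stress),
        "urgency":       0.3 if stress < 0.5 else 0.6,
        "complexity":    0.2 if load == "high" else 0.7,
        "directiveness": 0.5 + 0.3 * stress,
        "validation":    stress,
        "reassurance":   stress,
    }

def tone_prompt(targets: dict) -> str:
    """Serialize the tone vector into the system prompt as continuous
    values, rather than discrete labels like 'be empathetic'."""
    dims = ", ".join(f"{k}={v:.2f}" for k, v in targets.items())
    return f"Respond with the following tone targets (each 0-1): {dims}."
```

passing numbers rather than labels is the design choice: "warmth=0.94" defines a position on a continuum, where "be warm" collapses the whole dimension to one register.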

same fact, different frames:

low stress, high curiosity: "infant sleep cycles are typically 50-60 minutes, which is shorter than adult cycles. frequent waking is developmentally normal before 6 months."

high stress, low cognitive capacity: "you're not doing anything wrong. waking every 90 minutes is completely normal for a baby this age. their sleep cycles are just much shorter than ours. this will improve."

same information. radically different impact.

why claude

i chose anthropic's claude model family for three reasons:

1. constitutional AI framework — claude models embed safety constraints during training, not as post-hoc filters. for a health application, this architectural safety matters.

2. prompt injection resistance — health queries may inadvertently contain instruction-like language ("ignore the crying," "just let them scream"). claude shows stronger resistance to prompt injection than competitors.

3. multi-model orchestration:

| pipeline stage       | model      | rationale                        |
|----------------------|------------|----------------------------------|
| response generation  | sonnet 4.5 | reasoning depth, empathy, nuance |
| tone modeling        | sonnet 4.5 | subtle communication adjustments |
| claim verification   | haiku 4.5  | speed, cost efficiency for NLI   |
| context inference    | haiku 4.5  | sub-50ms real-time analysis      |
| edge case escalation | opus 4.5   | maximum reasoning for ambiguity  |

anthropic's documented pattern: "sonnet 4.5 can break down a complex problem into multi-step plans, then orchestrate a team of multiple haiku 4.5s to complete subtasks in parallel."
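the routing table above reduces to a small lookup with an escalation path. the model identifier strings are placeholders for this sketch, not verified API names:

```python
# Stage-to-tier routing; identifiers are illustrative placeholders.
ROUTES = {
    "context_inference":    "claude-haiku-4-5",
    "claim_verification":   "claude-haiku-4-5",
    "tone_modeling":        "claude-sonnet-4-5",
    "response_generation":  "claude-sonnet-4-5",
    "edge_case_escalation": "claude-opus-4-5",
}

def pick_model(stage: str, ambiguous: bool = False) -> str:
    """Route a pipeline stage to a model tier.

    Ambiguous or high-stakes cases bypass the default tier and
    escalate to the most capable model.
    """
    if ambiguous:
        return ROUTES["edge_case_escalation"]
    return ROUTES[stage]
```

the economics follow from the table: the two highest-volume stages (context inference on every message, claim verification on every generated sentence) run on the cheapest, fastest tier.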

the stack

| component        | technology                      | why                                         |
|------------------|---------------------------------|---------------------------------------------|
| orchestration    | LangChain                       | modular pipeline, native claude integration |
| primary LLM      | claude sonnet 4.5               | complex reasoning, empathy                  |
| verification LLM | claude haiku 4.5                | fast claim checking                         |
| escalation LLM   | claude opus 4.5                 | edge cases requiring maximum capability     |
| vector database  | pinecone                        | semantic search over knowledge base         |
| embedding model  | fine-tuned sentence transformer | domain-specific sleep terminology           |
| NLI model        | fine-tuned DeBERTa-v3           | claim-source entailment                     |
| API layer        | fastAPI + openAPI               | standards-compliant REST interface          |

latency budget

for a parent at 3am, every second feels like ten.

| stage               | target     |
|---------------------|------------|
| context inference   | 50ms       |
| retrieval layer 1   | 100ms      |
| retrieval layer 2   | 150ms      |
| tone modeling       | 30ms       |
| response generation | 2000ms     |
| hallucination check | 500ms      |
| total               | ~3 seconds |

sub-three-second responses maintain conversational flow without sacrificing verification rigor.
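one way to make a budget like this enforceable rather than aspirational is a per-stage timeout wrapper; a minimal asyncio sketch, with the budget values mirroring the table above:

```python
import asyncio

# Per-stage budgets in seconds (sums to ~2.83s, under the 3s total).
BUDGETS = {
    "context": 0.05, "retrieval_1": 0.10, "retrieval_2": 0.15,
    "tone": 0.03, "generation": 2.0, "verification": 0.5,
}

async def run_stage(name: str, coro):
    """Run one pipeline stage under its latency budget.

    A stage that overruns raises TimeoutError immediately, so a slow
    component surfaces as an explicit failure instead of silently
    blowing the end-to-end budget.
    """
    return await asyncio.wait_for(coro, timeout=BUDGETS[name])
```

whether an overrun should fail the request or trigger a degraded fallback (e.g. skip re-ranking, keep verification) is a product decision the wrapper leaves open.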

where it stands

iterating in public: the system is partially built, and the hard problems above define the research agenda.

what's working

  • context inference: 73% accuracy classifying stress level (low/medium/high)
  • claim verification: 87% of claims verified as entailed by source documents
  • multi-model orchestration: pipeline stages running with acceptable latency
  • knowledge base: miriam ende's methodology integrated with sleep science literature

what's hard

1. context inference accuracy — 73% isn't good enough. ground truth is hard to obtain. how do we validate that our inference matches actual user state?

2. hallucination rate — 87% verified means 13% either contradict sources or are unprovable. breakdown:

  • novel statements not in source: 8%
  • incorrect inferences: 3%
  • ambiguous phrasing: 2%

3. tone effectiveness validation — does adaptive tone actually improve comprehension and user satisfaction compared to static clinical tone? study design complete, awaiting ethics approval.

4. regulatory compliance — how do we communicate "this is not medical advice" without undermining the empathetic tone that makes the system useful?
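the claim-verification step behind the 87% figure can be sketched as an entailment loop over extracted claims. here `nli` stands in for the fine-tuned DeBERTa-v3 scorer; the function names and threshold are illustrative:

```python
def verify_claims(claims: list[str], nli, sources: list[str],
                  threshold: float = 0.9) -> dict:
    """Label each extracted claim by whether any source entails it.

    `nli(premise, hypothesis)` is any callable returning an entailment
    score in [0, 1]; claims whose best score falls below the threshold
    are flagged for revision or removal rather than shown to the user.
    """
    verified, flagged = [], []
    for claim in claims:
        best = max(nli(src, claim) for src in sources)
        (verified if best >= threshold else flagged).append(claim)
    return {"verified": verified, "flagged": flagged}
```

the three failure buckets above (novel statements, incorrect inferences, ambiguous phrasing) all land in `flagged`; distinguishing among them is the harder, still-open problem.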

the numbers

current accuracy: 73% context inference, 87% claim verification.

target: >85% context inference, >95% claim verification.

limitations

technical

1. linguistic scope — implementation tuned for german and english. other languages require separate fine-tuning.

2. population specificity — sleep norms vary across cultures. knowledge base reflects western pediatric guidelines.

3. longitudinal effects — we don't yet know if using this system improves long-term outcomes.

4. adversarial robustness — system has not been tested against adversarial prompting designed to break safety constraints.

ethical

responsibility — if the system gives advice that leads to harm, who is liable? legal frameworks for AI health counseling remain underdeveloped.

dependency risk — does availability of 24/7 AI support reduce human expert consultation? could this delay diagnosis of serious conditions?

data privacy — sleep pattern collection could reveal sensitive information about family dynamics, mental health, relationship stress.

equity of access — high-quality empathetic AI will likely be a paid service. does this exacerbate health disparities?

these questions don't have easy answers. i'm building this system with explicit uncertainty about long-term societal effects.

what's next

now

  • context inference improvement — active learning to identify high-risk patterns, iterative prompt refinement
  • tone effectiveness study — A/B testing with real parents measuring comprehension, satisfaction, behavioral outcomes
  • evidence-level display — show users the strength of evidence behind each recommendation

later

  • multilingual expansion — extend beyond german/english
  • active learning pipeline — automatically route uncertain cases to human experts for labeling
  • longitudinal tracking — correlate system usage with parent-reported outcomes over weeks/months

vision

  • proactive intervention recognition — detect patterns suggesting serious underlying conditions
  • causal inference — move beyond correlation to causation
  • personalized intervention design — generate custom sleep plans based on full family context

the point

the exhausted mother at 3:14am doesn't need another search result. she needs a system that understands that she's operating with -23% working memory capacity and -34% emotional regulation. she needs facts delivered with warmth. she needs reassurance grounded in evidence.

current AI makes her choose: accurate but cold, or warm but unreliable.

cognitive retrieval systems is my attempt to build something better—an architecture that treats empathy and accuracy not as competing objectives but as integrated capabilities.

i'm building this with miriam ende because we believe AI can do better. not AI that replaces human expertise, but AI that extends it—that can deliver evidence-based, emotionally intelligent guidance when human experts are unavailable.

for the 20-30% of parents experiencing infant sleep challenges, the difference between a cold clinical response and a warm understanding one—delivered without sacrificing accuracy—may determine whether they find the reassurance they need to make it through the night.

references

  1. Xiong, G. et al. (2024). Benchmarking RAG for Medicine. ACL 2024 Findings.
  2. Jin, Q. et al. (2023). MedCPT. Bioinformatics, 39(11).
  3. Farquhar, S. et al. (2024). Semantic Entropy. Nature, 630, 625-630.
  4. Sharma, A. et al. (2020). EPITOME Framework. EMNLP 2020.
  5. Min, S. et al. (2023). FActScore. EMNLP 2023.
  6. Omar, M. et al. (2025). LLM Hallucinations in Clinical Support. Communications Medicine, 5.
  7. Mindell, J.A. et al. (2006). Behavioral Treatment of Infant Sleep. Sleep, 29(10).
  8. Lim, J. & Dinges, D.F. (2010). A meta-analysis of the impact of short-term sleep deprivation on cognitive variables. Psychological Bulletin, 136(3).
  9. Yoo, S.S. et al. (2007). The human emotional brain without sleep. Current Biology, 17(20).
  10. Harrison, Y. & Horne, J.A. (2000). Sleep loss and temporal memory. Quarterly Journal of Experimental Psychology, 53(1).
  11. Ratcliff, R. & Van Dongen, H.P. (2009). Sleep deprivation affects multiple distinct cognitive processes. Psychonomic Bulletin & Review, 16(4).
  12. World Health Organization. (2021). AI Ethics in Health. Geneva: WHO.
  13. European Union. (2024). AI Act. Official Journal of the EU.
  14. Anthropic. (2025). Claude Model Cards.
  15. Anthropic. (2025). Constitutional AI: Harmlessness from AI Feedback.

last updated: Dec 2025