Introduction
At the GenAIHealthHack Barcelona 2026, our team built DryWeight Compass: a multi-agent LLM system designed to help nephrologists determine optimal dry weight for hemodialysis patients. Using AWS Bedrock with Claude Opus, the Model Context Protocol for database integration, and a coordinated agent architecture, we created a clinical decision support tool that synthesizes multimodal patient data into actionable insights. The result: third place and a working prototype that demonstrates how agentic AI can tackle complex medical challenges.
The Challenge
At the GenAIHealthHack in Barcelona hosted by Hospital Clínic and NTT Data, we faced a deceptively simple question:
How do you determine the optimal dry weight for hemodialysis patients?
For the uninitiated, “dry weight” is the target weight a dialysis patient should reach after treatment, their ideal weight without excess fluid. Getting this wrong has serious consequences: set it too high, and patients retain dangerous fluid leading to cardiovascular complications; set it too low, and patients experience hypotensive episodes, cramping, and poor quality of life.
Here’s the problem: despite being central to hemodialysis prescription, dry weight assessment remains highly subjective and varies significantly across clinicians. Nephrologists must mentally integrate dozens of variables across multiple sessions:
- Hemodynamic responses (blood pressure trends, relative plasma volume)
- Ultrafiltration tolerance (symptoms, cramping, intervention requirements)
- Bioimpedance analysis (tissue hydration, extracellular water ratios)
- Clinical observations (edema, weight trends, patient-reported symptoms)
With modern monitoring systems generating thousands of data points per session, manual integration becomes overwhelming. We set out to build an AI system that could do what humans struggle with: rapidly synthesize multimodal longitudinal data into actionable clinical insights.
Architecture: Multi-Agent LLM System Design
Rather than building a monolithic AI system, we architected DryWeight Compass as a coordinated multi-agent system, leveraging the latest advances in LLM orchestration frameworks.
System Architecture
Key Architectural Decisions
1. Model Context Protocol (MCP) for Tool Integration
Rather than building custom API wrappers, we leveraged Anthropic’s Model Context Protocol, a standardized interface for connecting LLMs to external systems. This gave us:
- Standardized tool definitions: DuckDB query capabilities exposed as LLM-callable tools
- Schema introspection: Agents can discover tables, columns, and relationships dynamically
- Transport abstraction: HTTP-based MCP server running locally, easily scalable to cloud deployment
```python
mcp_client = MultiServerMCPClient({
    "duckdb": {
        "transport": "http",
        "url": "http://127.0.0.1:8000/mcp"
    }
})
worker_agent_tools = await mcp_client.get_tools()
```
2. LangGraph for Stateful Agent Orchestration
We used LangGraph (LangChain’s agent framework) to manage two specialized agents:
- Core Agent: The orchestrator. Manages conversation state, routes requests to worker agents, and formats responses for clinicians. When receiving multiple patient IDs, it spins up parallel worker agents, one dedicated worker per patient, to enable concurrent analysis and report generation.
- Worker Agent: The clinical reasoning engine. Given a patient ID, it formulates SQL queries via MCP, retrieves multimodal data (hemodynamics, bioimpedance, session history), performs clinical analysis, and generates structured reports.
Why this separation? Cognitive specialization. The worker agent operates in “deep analysis” mode with detailed clinical reasoning prompts (~5000 tokens of domain knowledge). The core agent handles user interaction, keeping conversations fluid while offloading heavy computation.
```python
# Core agent with task orchestration capabilities
core_agent = create_agent(
    model=llm,
    tools=[build_report, list_reports, get_report_content],
    system_prompt=orchestrator_prompt,
    checkpointer=checkpointer
)

# Worker agent with clinical reasoning system prompt
worker_agent = create_agent(
    model=llm,
    tools=worker_agent_tools,
    system_prompt=clinical_reasoning_prompt,
    checkpointer=InMemorySaver()  # Conversational memory
)
```
3. Structured Outputs with Pydantic
Clinical recommendations require deterministic schema compliance: no freeform text where we need structured data. We defined a formal DialysisSessionReport schema using Pydantic:
```python
class DialysisSessionReport(BaseModel):
    longitudinal_summary: LongitudinalSummary
    hemodynamic_analysis: HemodynamicAnalysis
    tolerance_analysis: ToleranceAnalysis
    bioimpedance_coherence: BioimpedanceCoherence
    safety_alerts: SafetyAlerts
    global_clinical_impression: GlobalClinicalImpression
    dry_weight_recommendation: DryWeightRecommendation
```
This schema is injected into the LLM’s system prompt, ensuring every analysis follows the same structure, which is critical for clinical workflows and downstream integration with EHR systems.
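As a sketch of what that injection can look like, here is a cut-down, hypothetical model serialized with Pydantic v2's `model_json_schema()` and spliced into a prompt (field names are our illustration, not the full production schema):

```python
import json
from pydantic import BaseModel, Field

class DryWeightRecommendation(BaseModel):
    # Hypothetical subset of the full report schema
    action: str = Field(description="Maintain, Increase, or Decrease")
    confidence_pct: int = Field(ge=0, le=100)

# Serialize the JSON Schema and splice it into the system prompt so the
# LLM knows the exact structure its answer must follow.
schema_json = json.dumps(DryWeightRecommendation.model_json_schema(), indent=2)
system_prompt = f"Respond ONLY with JSON matching this schema:\n{schema_json}"
```

The model's raw output can then be validated back through the same Pydantic class, so malformed responses fail loudly instead of silently corrupting a report.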
4. DuckDB for Analytical Queries
Hemodialysis data is inherently analytical: time-series hemodynamics, session comparisons, longitudinal trends. We chose DuckDB over traditional OLTP databases because:
- Columnar storage: Efficient for analytical queries on large parquet files
- In-process execution: Runs embedded in the Python process, zero network latency
- SQL-native: LLMs can generate complex analytical queries without learning custom DSLs
- Parquet integration: Direct querying of parquet files without ETL
```sql
-- Example: Worker agent generates this query to analyze RPV trends
SELECT
    session_id,
    datetime,
    relative_plasmatic_volume,
    conductivity
FROM therapy_link
WHERE patient_id = 'XXXXXXXXXXXX'
  AND relative_plasmatic_volume IS NOT NULL
ORDER BY session_id, datetime
```
Clinical Reasoning: Prompting for Safety-Critical Domains
Building LLM systems for healthcare requires fundamentally different prompting strategies than consumer applications. Our clinical reasoning prompt (5000+ tokens) encodes:
1. Multi-Modal Data Synthesis
The agent must integrate conflicting signals:
“Neutral OH (overhydration) but hemoconcentration detected → prioritize clinical findings over single bioimpedance reading”
We teach the model to weight data sources hierarchically:
- Tier 1: Direct clinical observations (hypotension requiring intervention, severe symptoms)
- Tier 2: Objective measurements (RPV drops below 85%, ultrafiltration rate >13 mL/kg/h)
- Tier 3: Bioimpedance trends (overhydration, ECW/ICW ratios)
- Tier 4: Patient-reported symptoms (mild cramping, fatigue)
2. Minimum Clinically Meaningful Change
Medical interventions have thresholds of meaningful impact:
“Changes less than 0.5 kg are not considered clinically meaningful. If analysis suggests <0.5 kg adjustment, recommend Maintain and note evidence is below clinical relevance threshold.”
This prevents the algorithm from “chasing noise”, a common pitfall in AI-driven clinical systems.
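A minimal sketch of such a threshold guard; the function name and signature are our illustration, not the system's actual code, but the 0.5 kg threshold is the one stated above:

```python
def apply_mcid(current_dw_kg: float, suggested_dw_kg: float,
               threshold_kg: float = 0.5) -> str:
    """Suppress dry weight adjustments below the minimum
    clinically meaningful change (MCID)."""
    delta = suggested_dw_kg - current_dw_kg
    if abs(delta) < threshold_kg:
        return "Maintain (evidence below clinical relevance threshold)"
    direction = "Increase" if delta > 0 else "Decrease"
    return f"{direction} by {abs(delta):.1f} kg"

apply_mcid(72.0, 72.3)  # 0.3 kg suggestion -> Maintain
apply_mcid(72.0, 70.8)  # 1.2 kg suggestion -> Decrease by 1.2 kg
```

Encoding the rule as a hard post-processing check, rather than relying only on the prompt, gives a deterministic backstop even if the LLM ignores the instruction.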
3. Explicit Uncertainty Quantification
Every recommendation includes:
- Confidence score (percentage)
- Primary uncertainty source (e.g., “Only 1 session available, no historical dry weight data”)
- Data gaps explicitly called out (e.g., “No BCM/BIA data available”)
Example output:
```text
RECOMMENDATION: 🟢 MAINTAIN CURRENT DRY WEIGHT
Confidence: 45% — Primary uncertainty: Only 1 session available,
no historical dry weight data, no BIA data, no prior sessions for
trend analysis
```
4. Safety Guardrails
We implement multiple safety checks:
```python
class SafetyAlerts(BaseModel):
    alerts: list[str] = Field(
        default_factory=list,
        description="Explicit list of safety alerts, if present."
    )
```
The agent flags:
- UF rate >10 mL/kg/h (aggressive ultrafiltration)
- RPV <85% (critical hemoconcentration)
- Systolic BP drop >20% (hemodynamic instability)
- Patterns suggesting equipment malfunction
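The numeric guardrails above boil down to simple threshold checks. A hedged sketch (the function name and signature are hypothetical; thresholds are the ones listed above):

```python
def safety_alerts(uf_rate_ml_kg_h: float, rpv_pct: float,
                  sbp_drop_pct: float) -> list[str]:
    """Flag the threshold-based safety conditions described in the post."""
    alerts = []
    if uf_rate_ml_kg_h > 10:
        alerts.append(f"Aggressive ultrafiltration: {uf_rate_ml_kg_h} mL/kg/h")
    if rpv_pct < 85:
        alerts.append(f"Critical hemoconcentration: RPV {rpv_pct}%")
    if sbp_drop_pct > 20:
        alerts.append(f"Hemodynamic instability: SBP drop {sbp_drop_pct}%")
    return alerts

safety_alerts(13.2, 83.5, 9.0)  # triggers the UF-rate and RPV alerts
```

Deterministic checks like these can run on the raw query results before the LLM sees them, so a safety alert never depends on the model noticing the number itself.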
The Data Challenge (Without Violating Privacy)
We can’t share specifics, but here’s what working with real clinical data taught us:
Data Heterogeneity
Modern dialysis machines generate data in wildly inconsistent formats:
- Comma vs. period decimal separators (`"85,4"` vs `"85.4"`)
- Timestamps in multiple formats across machines
- Missing values represented as empty strings, `NULL`, `"---"`, or zero
- Fields with mixed units (some BIA machines report in mL, others in L)
Our approach: Basic data cleaning at query time (comma-to-period conversion, null handling with pandas) got us through the hackathon, but a proper preprocessing pipeline would be a key production requirement.
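A sketch of the kind of query-time cleaning we mean, assuming pandas; the sentinel values are illustrative examples of what machine exports contain:

```python
import pandas as pd

def clean_numeric(series: pd.Series) -> pd.Series:
    """Normalize machine-exported numeric columns: comma decimals and
    sentinel strings ("---", "NULL", empty) become proper floats/NaN."""
    return (
        series.astype(str)
        .str.strip()
        .replace({"": None, "---": None, "NULL": None})
        .str.replace(",", ".", regex=False)  # European decimal commas
        .pipe(pd.to_numeric, errors="coerce")
    )

raw = pd.Series(["85,4", "90.1", "---", ""])
cleaned = clean_numeric(raw)  # floats with NaN for the sentinels
```

In production this logic would move out of ad-hoc query-time calls and into a validated preprocessing pipeline, as noted above.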
Temporal Alignment
A single 4-hour dialysis session generates:
- ~240 hemodynamic measurements (every minute)
- 1 bioimpedance measurement (pre-session)
- 5-10 blood pressure readings
- 50+ alarms/events
Challenge: Aligning these heterogeneous time-series for pattern detection.
Solution: Resampling to a common time grid with forward-fill for sparse measurements, while carefully preserving event timing.
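A minimal pandas sketch of that alignment, with invented timestamps and values: a single sparse pre-session bioimpedance reading is forward-filled onto the minute-level hemodynamic grid.

```python
import pandas as pd

# Minute-level hemodynamic grid (dense series, toy values)
hemo = pd.DataFrame(
    {"rpv": [100.0, 98.5, 97.2]},
    index=pd.date_range("2026-01-15 08:00", periods=3, freq="1min"),
)

# One pre-session bioimpedance reading (sparse series)
bia = pd.DataFrame(
    {"overhydration_l": [1.2]},
    index=pd.to_datetime(["2026-01-15 08:00"]),
)

# Forward-fill the sparse measurement onto the dense grid
aligned = hemo.join(bia.reindex(hemo.index, method="ffill"))
```

Discrete events (alarms, interventions) would be joined as point-in-time rows rather than forward-filled, so their exact timing is preserved.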
Data Quality Assessment
Not all sessions are analytically useful:
- Incomplete sessions (early termination)
- Missing critical measurements (no RPV data)
- Equipment malfunctions (sensor dropouts)
The agent learned to assess data quality and adjust confidence accordingly:
“Session terminated early at 142 minutes, insufficient data for meaningful RPV trend analysis. Confidence reduced.”
Results & Clinical Validation
While we didn’t have time for full clinical validation during the hackathon, we achieved:
Qualitative Assessment by Clinicians
- Recommendations aligned with expert assessment in test cases
- Identified data quality issues experts missed (e.g., sensor drift in one session)
- Generated clinically coherent justifications that matched expert reasoning patterns
System Performance
- Initial report generation: 1-2 minutes per patient (analyzing session history)
- Follow-up queries: 15-25 seconds for conversational queries once the base report is generated
- Token efficiency: ~8,000-12,000 tokens per complete patient analysis (Claude Opus 4.5)
- Query performance: DuckDB analytical queries execute in <100ms on parquet files
Generated Reports
Reports included:
- Longitudinal summary (patterns across recent sessions)
- Hemodynamic analysis (RPV curves, BP trends, UF interaction)
- Tolerance analysis (symptoms, intervention frequency)
- Bioimpedance coherence (agreement/discordance with other signals)
- Safety alerts (explicit warnings)
- Global clinical impression with hydration status classification
- Dry weight recommendation with confidence score
Exported as both Markdown (for version control) and styled PDF (for clinical workflow integration).
What We’d Do Differently: Future Optimizations
1. Patient-Centric ETL Pipeline for Data Reorganization
Current challenge: Data is organized by source (separate parquet files for sessions, bioimpedance, RPV measurements, etc.). Each patient analysis requires scanning multiple files and joining across datasets.
The Problem:
- Query latency increases with dataset size (scanning all session files for one patient)
- Redundant joins across multiple parquet files
- Network I/O overhead when querying distributed storage
- Cache inefficiency (different queries touch same files)
Proposed ETL Architecture:
Build a patient-centric data lake where all data for a single patient is co-located:
Current Structure:

```text
├── sessions.parquet      (all patients, all sessions)
├── bioimpedance.parquet  (all patients, all measurements)
├── therapy_link.parquet  (all patients, all RPV data)
└── bcm.parquet           (all patients, all BCM data)
```

Proposed Structure:

```text
├── patients/
│   ├── XXXXXXXXXXXX/
│   │   ├── metadata.json        (demographics, latest dry weight)
│   │   ├── sessions.parquet     (only this patient)
│   │   ├── hemodynamics.parquet (RPV, BP, conductivity)
│   │   ├── bioimpedance.parquet (BCM/BIA measurements)
│   │   └── events.parquet       (alarms, interventions)
│   ├── XXXXXXXXXXXX/
│   └── ...
└── indexes/
    ├── patient_lookup.parquet    (fast patient search)
    └── date_range_index.parquet  (temporal queries)
```
Implementation Approach:
1. Incremental ETL with Apache Airflow/Prefect:
   - Daily batch job: Process new session data from source systems
   - Partition by patient ID using DuckDB’s `PARTITION BY` clause
   - Maintain patient-level metadata cache (last session date, data completeness flags)
2. Data Deduplication & Validation:
   - Detect duplicate records across sources
   - Validate referential integrity (all sessions have corresponding patient records)
   - Quality scoring per patient (% sessions with complete RPV data, BIA coverage)
3. Compression Strategy:
   - Use Parquet’s column-level compression (ZSTD for hemodynamics, SNAPPY for metadata)
   - Partition large time-series data by session to enable efficient pruning
4. Query Optimization:
   - Patient queries become single-file reads (10-100x faster)
   - Enable DuckDB’s `hive_partitioning` for automatic partition pruning
   - Pre-compute common aggregations (average RPV per session, UF rate trends)
Expected Performance Gains:
- Query latency: 15-25s → 2-5s per patient analysis
- Reduced token costs: Faster queries = less wait time for LLM responses
- Scalability: Linear scaling to millions of patients (partition pruning)
- Cache efficiency: Patient-centric access patterns improve cache hit rates
Trade-offs:
- Storage overhead: ~20-30% due to duplicate indexes
- ETL complexity: Requires orchestration tooling
- Data freshness: Batch processing introduces latency (acceptable for retrospective analysis)
Why it matters: When scaling beyond the hackathon prototype to production, query performance becomes critical. Patient-centric organization aligns data layout with access patterns, dramatically reducing analytical latency.
2. Retrieval-Augmented Generation (RAG) for Clinical Guidelines
Current system relies on static prompts encoding clinical knowledge. Future version would:
- Index clinical guidelines (KDOQI, KDIGO) in vector database
- Retrieve relevant guidance dynamically based on patient presentation
- Cite specific guideline sections in justifications
Why it matters: Clinical guidelines update frequently. RAG allows knowledge updates without full prompt reengineering.
3. Fine-Tuned Embedding Model for Medical Time-Series
Current approach: LLM sees SQL results as text tables.
Problems:
- Token inefficiency for long sessions
- Pattern recognition relies on LLM’s general capabilities
Future approach: Train specialized embedding model to encode hemodynamic time-series into dense vectors. Agent queries embeddings for “similar patterns across cohort,” enabling:
- “This RPV curve resembles 147 previous sessions where dry weight was 1.2kg too low”
- Outlier detection: “This conductivity trend is anomalous compared to patient’s baseline”
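A toy sketch of the retrieval step, assuming such an embedding model already exists; every vector here is invented, and a real system would use an approximate nearest-neighbor index rather than a linear scan:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical dense embeddings of RPV curves from the imagined encoder
query = np.array([0.2, 0.9, 0.1])
cohort = {
    "session_a": np.array([0.21, 0.88, 0.12]),  # similar trajectory
    "session_b": np.array([0.90, 0.10, 0.40]),  # dissimilar trajectory
}

# Rank cohort sessions by similarity to the query curve
ranked = sorted(cohort, key=lambda k: cosine_sim(query, cohort[k]), reverse=True)
```

The agent would then summarize the outcomes of the top-ranked historical sessions ("dry weight was later adjusted down in N of them") instead of reading raw time-series tables token by token.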
4. Multi-Stage Reasoning with Tool Use
Current: Single-pass analysis by worker agent.
Proposed: Decompose into specialized sub-agents:
- Data Quality Agent: Assesses completeness, identifies anomalies
- Hemodynamic Agent: Specializes in RPV/BP/UF analysis
- Bioimpedance Agent: Deep expertise in BIA interpretation
- Synthesis Agent: Integrates sub-agent outputs into final recommendation
Benefits: More specialized prompting, parallel execution, easier debugging.
5. Uncertainty Quantification via Ensemble Methods
Currently, confidence scores are LLM self-assessment. More rigorous approach:
- Generate multiple analyses with temperature variation
- Measure consistency across outputs
- Flag areas of high disagreement as uncertain
- Use prediction intervals rather than point estimates
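A minimal sketch of the consistency measurement, with hypothetical sampled outputs standing in for temperature-varied LLM runs:

```python
from collections import Counter

def ensemble_confidence(samples: list[str]) -> tuple[str, float]:
    """Return the majority recommendation and the agreement rate
    across independently sampled analyses."""
    counts = Counter(samples)
    top, n = counts.most_common(1)[0]
    return top, n / len(samples)

# Hypothetical recommendations from five temperature-varied runs
samples = ["Maintain", "Maintain", "Decrease 0.5 kg", "Maintain", "Maintain"]
rec, agreement = ensemble_confidence(samples)
```

A low agreement rate would flag the case for human review, giving a measured confidence signal instead of the model's self-reported one.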
6. Continuous Learning from Clinician Feedback
Feedback loop architecture:
- Clinician reviews AI recommendation
- Records agreement/disagreement + justification
- System logs feedback with patient context
- Periodic fine-tuning on agreement/disagreement cases
- A/B testing of prompt variations
Critical: Feedback must be structured and high-quality. Bad feedback degrades models faster than no feedback.
7. Real-Time Inference During Sessions
Current: Retrospective analysis after sessions complete.
Future: Streaming inference during active treatment:
- Alert clinicians to concerning RPV trends in real-time
- Suggest UF rate adjustments mid-session
- Predictive warnings: “Current trajectory suggests hypotension in 20-30 minutes”
Challenges:
- Infrastructure for streaming data ingestion
- Sub-second latency requirements
- Higher false-positive tolerance (interrupting treatment has costs)
8. Explainability Visualization
Current reports are text-heavy. Future versions would include:
- Interactive RPV curve annotations: Hover over inflection points to see agent’s interpretation
- Contribution analysis: Visual breakdown of which factors drove recommendation
- Counterfactual explanations: “If bioimpedance showed +1.5L OH instead of +0.5L, recommendation would change to…”
9. Infrastructure & Deployment
Prototype limitations:
- In-memory checkpointing (conversation state lost on restart)
- Local DuckDB (single-user, no concurrency)
Production architecture:
- Persistent state: Redis/PostgreSQL for conversation history
- Distributed database: MotherDuck (DuckDB cloud) or Snowflake for multi-user access
- Authentication: OAuth2 with role-based access control
- Audit logging: Complete traceability of AI recommendations for compliance
- HIPAA compliance: Encrypted data at rest/in transit, BAA with all vendors
Reflections: AI in Healthcare is Different
Building this system reinforced some hard truths about AI in clinical domains:
Accuracy isn’t enough. Calibration matters.
A model that’s 90% accurate but overconfident on its errors is dangerous. We’d rather have 80% accuracy with perfect uncertainty quantification.
Explainability is non-negotiable.
Clinicians won’t (and shouldn’t) trust black-box recommendations. Every output needs a justification they can evaluate against their clinical judgment.
Data quality determines ceiling.
The best model can’t overcome missing/incorrect data. We spent more time on data validation than on prompt engineering.
Human-in-the-loop is essential.
This system is decision support, not decision automation. Final judgment remains with clinicians who have context we can’t encode (patient preferences, comorbidities, social factors).
Edge cases are the norm.
In consumer AI, edge cases are 1%. In clinical practice, every patient is unique. The system must gracefully handle “I’ve never seen this pattern before” scenarios.
Data access remains a critical barrier to innovation.
In Europe, where privacy regulations are rigorous (and rightfully so), accessing clinical data, even anonymized, is extremely challenging for developers. This creates a paradox: the most impactful healthcare AI projects require real-world data, yet regulatory and institutional barriers make it nearly impossible to build them as open-source initiatives or accept external contributions.
This hackathon was valuable precisely because it provided rare access to genuine quality clinical data in a controlled environment. Our hope is that demonstrating what’s possible with proper data access will inspire the healthcare industry to develop frameworks that balance patient privacy with innovation, enabling more developers to contribute to life-saving technologies while maintaining the highest standards of data protection.
Technical Stack Summary
For those building similar systems:
Core Infrastructure:
- LLM: AWS Bedrock (Claude Opus 4.5) for reasoning capabilities
- Orchestration: LangChain + LangGraph for multi-agent coordination
- Tool Protocol: Model Context Protocol (MCP) for standardized integrations
- Database: DuckDB for analytical queries on parquet files
- UI: Streamlit for rapid prototyping
Data Stack:
- Storage: AWS S3 (raw CSV) → Parquet (normalized)
- Validation: Pandas for data cleaning and type normalization
- Visualization: Plotly for interactive time-series dashboards
Development:
- Schema Validation: Pydantic for structured outputs
- Export: Markdown + WeasyPrint for PDF generation
- Version Control: Git
Conclusion
While end-stage renal disease requiring dialysis affects a relatively small patient population (5.3-10.5 million people worldwide, though many lack access to treatment), this project demonstrates a critical principle: if multi-agent LLM systems can handle the complexity of hemodialysis optimization, they can be adapted to virtually any clinical domain with similar data-rich, multi-factorial decision-making challenges.
Our third-place finish at GenAIHealthHack validated this hypothesis: the architecture, prompting strategies, and clinical reasoning frameworks we developed aren’t specific to nephrology. They’re a blueprint for building AI systems that augment clinical expertise in oncology, cardiology, intensive care, and beyond.
But this is just a prototype. The gap between “impressive demo” and “production clinical tool” involves:
- Rigorous clinical validation studies
- Regulatory compliance (FDA/CE marking for clinical decision support)
- Integration with existing EHR/EMR systems
- Real-world testing across diverse patient populations
- Continuous monitoring for model drift and data quality issues
That said, the potential is enormous. If we can help nephrologists make more data-informed dry weight decisions, the downstream effects are significant:
- Fewer hypotensive episodes → better patient quality of life
- Reduced cardiovascular complications → improved long-term outcomes
- More consistent care → reduced inter-clinician variability
The future of clinical AI isn’t replacing clinicians; it’s giving them superpowers.
Acknowledgments
This project was built during GenAIHealthHack Barcelona 2026, organized by:
- Clínic Barcelona
- Universitat de Barcelona (including the Cátedra d’Innovació en Oncologia de Precisió)
- Universitat Pompeu Fabra (UPF)
- MIT Critical Data
- Món Clínic Barcelona
- MIT Spain
- Generalitat de Catalunya - Departament de Salut
With the collaboration of:
- Barcelona Supercomputing Center (BSC-CNS)
- BAPS
- AWS
And with the generous sponsorship of:
- NTT Data
- avvale
- LOGICALIS
- Palo Alto Networks
- Vodafone Business
- Lilly
- SIRT
Special thanks to the organizing team, mentors and clinical advisors who provided invaluable domain expertise, and to my teammates who made this experience possible.
Want to discuss this work? Connect with me on LinkedIn.
Disclaimer: This system is a research prototype developed during a hackathon. It has not undergone clinical validation and should not be used for actual patient care. This system was developed using anonymized clinical data provided during the hackathon. No actual patient data is disclosed in this post.