Introduction
At the GenAIHealthHack Barcelona 2026, our team built DryWeight Compass: a multi-agent LLM system designed to help nephrologists determine optimal dry weight for hemodialysis patients. Using AWS Bedrock with Claude Opus, the Model Context Protocol for database integration, and a coordinated agent architecture, we created a clinical decision support tool that synthesizes multimodal patient data into actionable insights. The result: third place and a working prototype that demonstrates how agentic AI can tackle complex medical challenges.
The Challenge
At the GenAIHealthHack in Barcelona hosted by Hospital Clínic and NTT Data, we faced a deceptively simple question:
How do you determine the optimal dry weight for hemodialysis patients?
For the uninitiated, “dry weight” is the target weight a dialysis patient should reach after treatment, their ideal weight without excess fluid. Getting this wrong has serious consequences: set it too high, and patients retain dangerous fluid leading to cardiovascular complications; set it too low, and patients experience hypotensive episodes, cramping, and poor quality of life.
Here’s the problem: despite being central to hemodialysis prescription, dry weight assessment remains highly subjective and varies significantly across clinicians. Nephrologists must mentally integrate dozens of variables across multiple sessions:
- Hemodynamic responses (blood pressure trends, relative plasma volume)
- Ultrafiltration tolerance (symptoms, cramping, intervention requirements)
- Bioimpedance analysis (tissue hydration, extracellular water ratios)
- Clinical observations (edema, weight trends, patient-reported symptoms)
With modern monitoring systems generating thousands of data points per session, manual integration becomes overwhelming. We set out to build an AI system that could do what humans struggle with: rapidly synthesize multimodal longitudinal data into actionable clinical insights.
Architecture: Multi-Agent LLM System Design
Rather than building a monolithic AI system, we architected DryWeight Compass as a coordinated multi-agent system, leveraging the latest advances in LLM orchestration frameworks.
System Architecture
Key Architectural Decisions
1. Model Context Protocol (MCP) for Tool Integration
Rather than building custom API wrappers, we leveraged Anthropic’s Model Context Protocol, a standardized interface for connecting LLMs to external systems. This gave us:
- Standardized tool definitions: DuckDB query capabilities exposed as LLM-callable tools
- Schema introspection: Agents can discover tables, columns, and relationships dynamically
- Transport abstraction: HTTP-based MCP server running locally, easily scalable to cloud deployment
```python
mcp_client = MultiServerMCPClient({
    "duckdb": {
        "transport": "http",
        "url": "http://127.0.0.1:8000/mcp"
    }
})
worker_agent_tools = await mcp_client.get_tools()
```
2. LangGraph for Stateful Agent Orchestration
We used LangGraph (LangChain’s agent framework) to manage two specialized agents:
- Core Agent: The orchestrator. Manages conversation state, routes requests to worker agents, and formats responses for clinicians. When receiving multiple patient IDs, it spins up parallel worker agents, one dedicated worker per patient, to enable concurrent analysis and report generation.
- Worker Agent: The clinical reasoning engine. Given a patient ID, it formulates SQL queries via MCP, retrieves multimodal data (hemodynamics, bioimpedance, session history), performs clinical analysis, and generates structured reports.
Why this separation? Cognitive specialization. The worker agent operates in “deep analysis” mode with detailed clinical reasoning prompts (~5000 tokens of domain knowledge). The core agent handles user interaction, keeping conversations fluid while offloading heavy computation.
```python
# Core agent with task orchestration capabilities
core_agent = create_agent(
    model=llm,
    tools=[build_report, list_reports, get_report_content],
    system_prompt=orchestrator_prompt,
    checkpointer=checkpointer
)

# Worker agent with clinical reasoning system prompt
worker_agent = create_agent(
    model=llm,
    tools=worker_agent_tools,
    system_prompt=clinical_reasoning_prompt,
    checkpointer=InMemorySaver()  # Conversational memory
)
```
3. Structured Outputs with Pydantic
Clinical recommendations require deterministic schema compliance: no freeform text where we need structured data. We defined a formal DialysisSessionReport schema using Pydantic:
```python
class DialysisSessionReport(BaseModel):
    longitudinal_summary: LongitudinalSummary
    hemodynamic_analysis: HemodynamicAnalysis
    tolerance_analysis: ToleranceAnalysis
    bioimpedance_coherence: BioimpedanceCoherence
    safety_alerts: SafetyAlerts
    global_clinical_impression: GlobalClinicalImpression
    dry_weight_recommendation: DryWeightRecommendation
```
This schema is injected into the LLM’s system prompt, ensuring every analysis follows the same structure, which is critical for clinical workflows and downstream integration with EHR systems.
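As a sketch of what that injection can look like, here is a cut-down, hypothetical model serialized with Pydantic v2's `model_json_schema()` and spliced into a prompt (field names are our illustration, not the full production schema):

```python
import json
from pydantic import BaseModel, Field

class DryWeightRecommendation(BaseModel):
    # Hypothetical subset of the full report schema
    action: str = Field(description="Maintain, Increase, or Decrease")
    confidence_pct: int = Field(ge=0, le=100)

# Serialize the JSON Schema and splice it into the system prompt so the
# LLM knows the exact structure its answer must follow.
schema_json = json.dumps(DryWeightRecommendation.model_json_schema(), indent=2)
system_prompt = f"Respond ONLY with JSON matching this schema:\n{schema_json}"
```

The model's raw output can then be validated back through the same Pydantic class, so malformed responses fail loudly instead of silently corrupting a report.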
4. DuckDB for Analytical Queries
Hemodialysis data is inherently analytical: time-series hemodynamics, session comparisons, longitudinal trends. We chose DuckDB over traditional OLTP databases because:
- Columnar storage: Efficient for analytical queries on large parquet files
- In-process execution: Runs embedded in the Python process, zero network latency
- SQL-native: LLMs can generate complex analytical queries without learning custom DSLs
- Parquet integration: Direct querying of parquet files without ETL
```sql
-- Example: Worker agent generates this query to analyze RPV trends
SELECT
    session_id,
    datetime,
    relative_plasmatic_volume,
    conductivity
FROM therapy_link
WHERE patient_id = 'XXXXXXXXXXXX'
  AND relative_plasmatic_volume IS NOT NULL
ORDER BY session_id, datetime
```
Clinical Reasoning: Prompting for Safety-Critical Domains
Building LLM systems for healthcare requires fundamentally different prompting strategies than consumer applications. Our clinical reasoning prompt (5000+ tokens) encodes:
1. Multi-Modal Data Synthesis
The agent must integrate conflicting signals:
“Neutral OH (overhydration) but hemoconcentration detected → prioritize clinical findings over single bioimpedance reading”
We teach the model to weight data sources hierarchically:
- Tier 1: Direct clinical observations (hypotension requiring intervention, severe symptoms)
- Tier 2: Objective measurements (RPV drops below 85%, ultrafiltration rate >13 mL/kg/h)
- Tier 3: Bioimpedance trends (overhydration, ECW/ICW ratios)
- Tier 4: Patient-reported symptoms (mild cramping, fatigue)
2. Minimum Clinically Meaningful Change
Medical interventions have thresholds of meaningful impact:
“Changes less than 0.5 kg are not considered clinically meaningful. If analysis suggests <0.5 kg adjustment, recommend Maintain and note evidence is below clinical relevance threshold.”
This prevents the algorithm from “chasing noise”, a common pitfall in AI-driven clinical systems.
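A minimal sketch of such a threshold guard; the function name and signature are our illustration, not the system's actual code, but the 0.5 kg threshold is the one stated above:

```python
def apply_mcid(current_dw_kg: float, suggested_dw_kg: float,
               threshold_kg: float = 0.5) -> str:
    """Suppress dry weight adjustments below the minimum
    clinically meaningful change (MCID)."""
    delta = suggested_dw_kg - current_dw_kg
    if abs(delta) < threshold_kg:
        return "Maintain (evidence below clinical relevance threshold)"
    direction = "Increase" if delta > 0 else "Decrease"
    return f"{direction} by {abs(delta):.1f} kg"

apply_mcid(72.0, 72.3)  # 0.3 kg suggestion -> Maintain
apply_mcid(72.0, 70.8)  # 1.2 kg suggestion -> Decrease by 1.2 kg
```

Encoding the rule as a hard post-processing check, rather than relying only on the prompt, gives a deterministic backstop even if the LLM ignores the instruction.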
3. Explicit Uncertainty Quantification
Every recommendation includes:
- Confidence score (percentage)
- Primary uncertainty source (e.g., “Only 1 session available, no historical dry weight data”)
- Data gaps explicitly called out (e.g., “No BCM/BIA data available”)
Example output:
```text
RECOMMENDATION: 🟢 MAINTAIN CURRENT DRY WEIGHT
Confidence: 45% — Primary uncertainty: Only 1 session available,
no historical dry weight data, no BIA data, no prior sessions for
trend analysis
```
4. Safety Guardrails
We implement multiple safety checks:
```python
class SafetyAlerts(BaseModel):
    alerts: list[str] = Field(
        default_factory=list,
        description="Explicit list of safety alerts, if present."
    )
```
The agent flags:
- UF rate >10 mL/kg/h (aggressive ultrafiltration)
- RPV <85% (critical hemoconcentration)
- Systolic BP drop >20% (hemodynamic instability)
- Patterns suggesting equipment malfunction
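The numeric guardrails above boil down to simple threshold checks. A hedged sketch (the function name and signature are hypothetical; thresholds are the ones listed above):

```python
def safety_alerts(uf_rate_ml_kg_h: float, rpv_pct: float,
                  sbp_drop_pct: float) -> list[str]:
    """Flag the threshold-based safety conditions described in the post."""
    alerts = []
    if uf_rate_ml_kg_h > 10:
        alerts.append(f"Aggressive ultrafiltration: {uf_rate_ml_kg_h} mL/kg/h")
    if rpv_pct < 85:
        alerts.append(f"Critical hemoconcentration: RPV {rpv_pct}%")
    if sbp_drop_pct > 20:
        alerts.append(f"Hemodynamic instability: SBP drop {sbp_drop_pct}%")
    return alerts

safety_alerts(13.2, 83.5, 9.0)  # triggers the UF-rate and RPV alerts
```

Deterministic checks like these can run on the raw query results before the LLM sees them, so a safety alert never depends on the model noticing the number itself.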
The Data Challenge (Without Violating Privacy)
We can’t share specifics, but here’s what working with real clinical data taught us:
Data Heterogeneity
Modern dialysis machines generate data in wildly inconsistent formats:
- Comma vs. period decimal separators (`"85,4"` vs `"85.4"`)
- Timestamps in multiple formats across machines
- Missing values represented as empty strings, `NULL`, `"---"`, or zero
- Fields with mixed units (some BIA machines report in mL, others in L)
Our approach: Basic data cleaning at query time (comma-to-period conversion, null handling with pandas) got us through the hackathon, but a proper preprocessing pipeline would be a key production requirement.
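A sketch of the kind of query-time cleaning we mean, assuming pandas; the sentinel values are illustrative examples of what machine exports contain:

```python
import pandas as pd

def clean_numeric(series: pd.Series) -> pd.Series:
    """Normalize machine-exported numeric columns: comma decimals and
    sentinel strings ("---", "NULL", empty) become proper floats/NaN."""
    return (
        series.astype(str)
        .str.strip()
        .replace({"": None, "---": None, "NULL": None})
        .str.replace(",", ".", regex=False)  # European decimal commas
        .pipe(pd.to_numeric, errors="coerce")
    )

raw = pd.Series(["85,4", "90.1", "---", ""])
cleaned = clean_numeric(raw)  # floats with NaN for the sentinels
```

In production this logic would move out of ad-hoc query-time calls and into a validated preprocessing pipeline, as noted above.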
Temporal Alignment
A single 4-hour dialysis session generates:
- ~240 hemodynamic measurements (every minute)
- 1 bioimpedance measurement (pre-session)
- 5-10 blood pressure readings
- 50+ alarms/events
Challenge: Aligning these heterogeneous time-series for pattern detection.
Solution: Resampling to a common time grid with forward-fill for sparse measurements, while carefully preserving event timing.
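A minimal pandas sketch of that alignment, with invented timestamps and values: a single sparse pre-session bioimpedance reading is forward-filled onto the minute-level hemodynamic grid.

```python
import pandas as pd

# Minute-level hemodynamic grid (dense series, toy values)
hemo = pd.DataFrame(
    {"rpv": [100.0, 98.5, 97.2]},
    index=pd.date_range("2026-01-15 08:00", periods=3, freq="1min"),
)

# One pre-session bioimpedance reading (sparse series)
bia = pd.DataFrame(
    {"overhydration_l": [1.2]},
    index=pd.to_datetime(["2026-01-15 08:00"]),
)

# Forward-fill the sparse measurement onto the dense grid
aligned = hemo.join(bia.reindex(hemo.index, method="ffill"))
```

Discrete events (alarms, interventions) would be joined as point-in-time rows rather than forward-filled, so their exact timing is preserved.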
Data Quality Assessment
Not all sessions are analytically useful:
- Incomplete sessions (early termination)
- Missing critical measurements (no RPV data)
- Equipment malfunctions (sensor dropouts)
The agent learned to assess data quality and adjust confidence accordingly:
“Session terminated early at 142 minutes, insufficient data for meaningful RPV trend analysis. Confidence reduced.”
Results & Clinical Validation
While we didn’t have time for full clinical validation during the hackathon, we achieved:
Qualitative Assessment by Clinicians
- Recommendations aligned with expert assessment in test cases
- Identified data quality issues experts missed (e.g., sensor drift in one session)
- Generated clinically coherent justifications that matched expert reasoning patterns
System Performance
- Initial report generation: 1-2 minutes per patient (analyzing session history)
- Follow-up queries: 15-25 seconds for conversational queries once the base report is generated
- Token efficiency: ~8,000-12,000 tokens per complete patient analysis (Claude Opus 4.5)
- Query performance: DuckDB analytical queries execute in <100ms on parquet files
Generated Reports
Reports included:
- Longitudinal summary (patterns across recent sessions)
- Hemodynamic analysis (RPV curves, BP trends, UF interaction)
- Tolerance analysis (symptoms, intervention frequency)
- Bioimpedance coherence (agreement/discordance with other signals)
- Safety alerts (explicit warnings)
- Global clinical impression with hydration status classification
- Dry weight recommendation with confidence score
Exported as both Markdown (for version control) and styled PDF (for clinical workflow integration).
What We’d Do Differently: Future Optimizations
1. Patient-Centric ETL Pipeline for Data Reorganization
Current challenge: Data is organized by source (separate parquet files for sessions, bioimpedance, RPV measurements, etc.). Each patient analysis requires scanning multiple files and joining across datasets.
The Problem:
- Query latency increases with dataset size (scanning all session files for one patient)
- Redundant joins across multiple parquet files
- Network I/O overhead when querying distributed storage
- Cache inefficiency (different queries touch same files)
Proposed ETL Architecture:
Build a patient-centric data lake where all data for a single patient is co-located:
Current Structure:

```text
├── sessions.parquet      (all patients, all sessions)
├── bioimpedance.parquet  (all patients, all measurements)
├── therapy_link.parquet  (all patients, all RPV data)
└── bcm.parquet           (all patients, all BCM data)
```

Proposed Structure:

```text
├── patients/
│   ├── XXXXXXXXXXXX/
│   │   ├── metadata.json        (demographics, latest dry weight)
│   │   ├── sessions.parquet     (only this patient)
│   │   ├── hemodynamics.parquet (RPV, BP, conductivity)
│   │   ├── bioimpedance.parquet (BCM/BIA measurements)
│   │   └── events.parquet       (alarms, interventions)
│   ├── XXXXXXXXXXXX/
│   └── ...
└── indexes/
    ├── patient_lookup.parquet    (fast patient search)
    └── date_range_index.parquet  (temporal queries)
```
Implementation Approach:
1. Incremental ETL with Apache Airflow/Prefect:
   - Daily batch job: Process new session data from source systems
   - Partition by patient ID using DuckDB’s `PARTITION BY` clause
   - Maintain patient-level metadata cache (last session date, data completeness flags)
2. Data Deduplication & Validation:
   - Detect duplicate records across sources
   - Validate referential integrity (all sessions have corresponding patient records)
   - Quality scoring per patient (% sessions with complete RPV data, BIA coverage)
3. Compression Strategy:
   - Use Parquet’s column-level compression (ZSTD for hemodynamics, SNAPPY for metadata)
   - Partition large time-series data by session to enable efficient pruning
4. Query Optimization:
   - Patient queries become single-file reads (10-100x faster)
   - Enable DuckDB’s `hive_partitioning` for automatic partition pruning
   - Pre-compute common aggregations (average RPV per session, UF rate trends)
Expected Performance Gains:
- Query latency: 15-25s → 2-5s per patient analysis
- Reduced token costs: Faster queries = less wait time for LLM responses
- Scalability: Linear scaling to millions of patients (partition pruning)
- Cache efficiency: Patient-centric access patterns improve cache hit rates
Trade-offs:
- Storage overhead: ~20-30% due to duplicate indexes
- ETL complexity: Requires orchestration tooling
- Data freshness: Batch processing introduces latency (acceptable for retrospective analysis)
Why it matters: When scaling beyond the hackathon prototype to production, query performance becomes critical. Patient-centric organization aligns data layout with access patterns, dramatically reducing analytical latency.
2. Retrieval-Augmented Generation (RAG) for Clinical Guidelines
Current system relies on static prompts encoding clinical knowledge. Future version would:
- Index clinical guidelines (KDOQI, KDIGO) in vector database
- Retrieve relevant guidance dynamically based on patient presentation
- Cite specific guideline sections in justifications
Why it matters: Clinical guidelines update frequently. RAG allows knowledge updates without full prompt reengineering.
3. Fine-Tuned Embedding Model for Medical Time-Series
Current approach: LLM sees SQL results as text tables.
Problems:
- Token inefficiency for long sessions
- Pattern recognition relies on LLM’s general capabilities
Future approach: Train specialized embedding model to encode hemodynamic time-series into dense vectors. Agent queries embeddings for “similar patterns across cohort,” enabling:
- “This RPV curve resembles 147 previous sessions where dry weight was 1.2kg too low”
- Outlier detection: “This conductivity trend is anomalous compared to patient’s baseline”
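A toy sketch of the retrieval step, assuming such an embedding model already exists; every vector here is invented, and a real system would use an approximate nearest-neighbor index rather than a linear scan:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical dense embeddings of RPV curves from the imagined encoder
query = np.array([0.2, 0.9, 0.1])
cohort = {
    "session_a": np.array([0.21, 0.88, 0.12]),  # similar trajectory
    "session_b": np.array([0.90, 0.10, 0.40]),  # dissimilar trajectory
}

# Rank cohort sessions by similarity to the query curve
ranked = sorted(cohort, key=lambda k: cosine_sim(query, cohort[k]), reverse=True)
```

The agent would then summarize the outcomes of the top-ranked historical sessions ("dry weight was later adjusted down in N of them") instead of reading raw time-series tables token by token.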
4. Multi-Stage Reasoning with Tool Use
Current: Single-pass analysis by worker agent.
Proposed: Decompose into specialized sub-agents:
- Data Quality Agent: Assesses completeness, identifies anomalies
- Hemodynamic Agent: Specializes in RPV/BP/UF analysis
- Bioimpedance Agent: Deep expertise in BIA interpretation
- Synthesis Agent: Integrates sub-agent outputs into final recommendation
Benefits: More specialized prompting, parallel execution, easier debugging.
5. Uncertainty Quantification via Ensemble Methods
Currently, confidence scores are LLM self-assessment. More rigorous approach:
- Generate multiple analyses with temperature variation
- Measure consistency across outputs
- Flag areas of high disagreement as uncertain
- Use prediction intervals rather than point estimates
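A minimal sketch of the consistency measurement, with hypothetical sampled outputs standing in for temperature-varied LLM runs:

```python
from collections import Counter

def ensemble_confidence(samples: list[str]) -> tuple[str, float]:
    """Return the majority recommendation and the agreement rate
    across independently sampled analyses."""
    counts = Counter(samples)
    top, n = counts.most_common(1)[0]
    return top, n / len(samples)

# Hypothetical recommendations from five temperature-varied runs
samples = ["Maintain", "Maintain", "Decrease 0.5 kg", "Maintain", "Maintain"]
rec, agreement = ensemble_confidence(samples)
```

A low agreement rate would flag the case for human review, giving a measured confidence signal instead of the model's self-reported one.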
6. Continuous Learning from Clinician Feedback
Feedback loop architecture:
- Clinician reviews AI recommendation
- Records agreement/disagreement + justification
- System logs feedback with patient context
- Periodic fine-tuning on agreement/disagreement cases
- A/B testing of prompt variations
Critical: Feedback must be structured and high-quality. Bad feedback degrades models faster than no feedback.
7. Real-Time Inference During Sessions
Current: Retrospective analysis after sessions complete.
Future: Streaming inference during active treatment:
- Alert clinicians to concerning RPV trends in real-time
- Suggest UF rate adjustments mid-session
- Predictive warnings: “Current trajectory suggests hypotension in 20-30 minutes”
Challenges:
- Infrastructure for streaming data ingestion
- Sub-second latency requirements
- Higher false-positive tolerance (interrupting treatment has costs)
8. Explainability Visualization
Current reports are text-heavy. Future versions would include:
- Interactive RPV curve annotations: Hover over inflection points to see agent’s interpretation
- Contribution analysis: Visual breakdown of which factors drove recommendation
- Counterfactual explanations: “If bioimpedance showed +1.5L OH instead of +0.5L, recommendation would change to…”
9. Infrastructure & Deployment
Prototype limitations:
- In-memory checkpointing (conversation state lost on restart)
- Local DuckDB (single-user, no concurrency)
Production architecture:
- Persistent state: Redis/PostgreSQL for conversation history
- Distributed database: MotherDuck (DuckDB cloud) or Snowflake for multi-user access
- Authentication: OAuth2 with role-based access control
- Audit logging: Complete traceability of AI recommendations for compliance
- HIPAA compliance: Encrypted data at rest/in transit, BAA with all vendors
Reflections: AI in Healthcare is Different
Building this system reinforced some hard truths about AI in clinical domains:
Accuracy isn’t enough. Calibration matters.
A model that’s 90% accurate but overconfident on its errors is dangerous. We’d rather have 80% accuracy with perfect uncertainty quantification.
Explainability is non-negotiable.
Clinicians won’t (and shouldn’t) trust black-box recommendations. Every output needs a justification they can evaluate against their clinical judgment.
Data quality determines ceiling.
The best model can’t overcome missing/incorrect data. We spent more time on data validation than on prompt engineering.
Human-in-the-loop is essential.
This system is decision support, not decision automation. Final judgment remains with clinicians who have context we can’t encode (patient preferences, comorbidities, social factors).
Edge cases are the norm.
In consumer AI, edge cases are 1%. In clinical practice, every patient is unique. The system must gracefully handle “I’ve never seen this pattern before” scenarios.
Data access remains a critical barrier to innovation.
In Europe, where privacy regulations are rigorous (and rightfully so), accessing clinical data, even anonymized, is extremely challenging for developers. This creates a paradox: the most impactful healthcare AI projects require real-world data, yet regulatory and institutional barriers make it nearly impossible to build them as open-source initiatives or accept external contributions.
This hackathon was valuable precisely because it provided rare access to genuine quality clinical data in a controlled environment. Our hope is that demonstrating what’s possible with proper data access will inspire the healthcare industry to develop frameworks that balance patient privacy with innovation, enabling more developers to contribute to life-saving technologies while maintaining the highest standards of data protection.
Technical Stack Summary
For those building similar systems:
Core Infrastructure:
- LLM: AWS Bedrock (Claude Opus 4.5) for reasoning capabilities
- Orchestration: LangChain + LangGraph for multi-agent coordination
- Tool Protocol: Model Context Protocol (MCP) for standardized integrations
- Database: DuckDB for analytical queries on parquet files
- UI: Streamlit for rapid prototyping
Data Stack:
- Storage: AWS S3 (raw CSV) → Parquet (normalized)
- Validation: Pandas for data cleaning and type normalization
- Visualization: Plotly for interactive time-series dashboards
Development:
- Schema Validation: Pydantic for structured outputs
- Export: Markdown + WeasyPrint for PDF generation
- Version Control: Git
Conclusion
While end-stage renal disease requiring dialysis affects a relatively small patient population (5.3-10.5 million people worldwide, though many lack access to treatment), this project demonstrates a critical principle: if multi-agent LLM systems can handle the complexity of hemodialysis optimization, they can be adapted to virtually any clinical domain with similar data-rich, multi-factorial decision-making challenges.
Our third-place finish at GenAIHealthHack validated this hypothesis: the architecture, prompting strategies, and clinical reasoning frameworks we developed aren’t specific to nephrology. They’re a blueprint for building AI systems that augment clinical expertise in oncology, cardiology, intensive care, and beyond.
But this is just a prototype. The gap between “impressive demo” and “production clinical tool” involves:
- Rigorous clinical validation studies
- Regulatory compliance (FDA/CE marking for clinical decision support)
- Integration with existing EHR/EMR systems
- Real-world testing across diverse patient populations
- Continuous monitoring for model drift and data quality issues
That said, the potential is enormous. If we can help nephrologists make more data-informed dry weight decisions, the downstream effects are significant:
- Fewer hypotensive episodes → better patient quality of life
- Reduced cardiovascular complications → improved long-term outcomes
- More consistent care → reduced inter-clinician variability
The future of clinical AI isn’t replacing clinicians; it’s giving them superpowers.
Acknowledgments
This project was built during GenAIHealthHack Barcelona 2026, organized by:
- Clínic Barcelona
- Universitat de Barcelona (including the Cátedra d’Innovació en Oncologia de Precisió)
- Universitat Pompeu Fabra (UPF)
- MIT Critical Data
- Món Clínic Barcelona
- MIT Spain
- Generalitat de Catalunya - Departament de Salut
With the collaboration of:
- Barcelona Supercomputing Center (BSC-CNS)
- BAPS
- AWS
And with the generous sponsorship of:
- NTT Data
- avvale
- LOGICALIS
- Palo Alto Networks
- Vodafone Business
- Lilly
- SIRT
Special thanks to the organizing team, mentors and clinical advisors who provided invaluable domain expertise, and to my teammates who made this experience possible.
Want to discuss this work? Connect with me on LinkedIn.
Disclaimer: This system is a research prototype developed during a hackathon. It has not undergone clinical validation and should not be used for actual patient care. This system was developed using anonymized clinical data provided during the hackathon. No actual patient data is disclosed in this post.