JCR Journal
Natural Language Processing
IF: 4.9
Q1

Oncology data extraction with large language models from real-world breast cancer electronic health records in Spanish

Julio Montes-Torres, Francisco J. Moreno-Barea, Leonardo Franco, Nuria Ribelles, Emilio Alba, José M. Jerez

Machine Learning with Applications, 2026, Vol. 23: 100837
Citations: 0
Views: 0
Downloads: 0
Altmetric Score: N/A
Published: 1 March 2026
Abstract

The integration of Artificial Intelligence (AI) in healthcare systems has the potential to significantly enhance patient care and streamline clinical processes. This research investigates the use of generative AI and large language models (LLMs) for oncological information extraction (IE) from real Spanish electronic health records (EHRs) to enhance clinical decision-making and research. We compared GPT-4.5 with 11 state-of-the-art, locally executable LLM-based chatbots, including Llama 3.2, Mistral-Small 3.2, and Phi-4, on the extraction of specific clinical entities from real EHR narratives. Our evaluation workflow assessed the performance of these models under computational constraints, specifically targeting the extraction of breast cancer prognostic factors. Initial findings indicate that while open-source LLMs are improving, they are not yet equivalent to human specialists in terms of Named Entity Recognition (NER) accuracy. The language of the clinical records notably influences performance: smaller models in particular struggle with Spanish text. However, with careful model selection and output post-processing, Mistral-Small 3.2 achieved a detection F1 score of over 74.7% for critical TNM information. This study highlights significant potential for generative AI in clinical IE but underscores the need for ongoing improvements, particularly in handling linguistic diversity. Locally managed open-source models are still far from performing like a human specialist, but addressing common model shortcomings can facilitate the integration of AI-driven solutions into public healthcare systems, thereby improving patient outcomes and fostering efficient data utilisation.

Keywords
Information extraction
Natural language processing
Electronic health records
Oncology
Spanish
Publication Information
Volume: 23
Pages: 100837
Published: 1 March 2026
Received: 19 June 2025
Accepted: 3 January 2026
Impact Metrics
Citations: 0
Impact Factor: 4.9
Quartile: Q1