JCR-indexed journal

Natural Language Processing

IF: 6.8

ICD-10 Neoplasm Location using Text Classification Models in Spanish Electronic Health Records

Francisco J. Moreno-Barea, Alejandro Pascual-Mellado, Héctor Mesa, Beatriz Villaescusa-Gonzalez, Emilio Alba, José M. Jerez

IEEE Journal of Biomedical and Health Informatics • 2025 • pp. 1-14

8/10/2025

Published

Authors

Fco. Javier Moreno-Barea
Corresponding author

Departamento de Lenguajes y Ciencias de la Computación, Escuela Técnica Superior de Ingeniería Informática, Universidad de Málaga, Málaga, Spain

José Jerez Aragonés

Departamento de Lenguajes y Ciencias de la Computación, Escuela Técnica Superior de Ingeniería Informática, Universidad de Málaga, Málaga, Spain

Héctor Mesa Jiménez

Departamento de Lenguajes y Ciencias de la Computación, Escuela Técnica Superior de Ingeniería Informática, Universidad de Málaga, Málaga, Spain

Alejandro Pascual-Mellado

Departamento de Lenguajes y Ciencias de la Computación, Escuela Técnica Superior de Ingeniería Informática, Universidad de Málaga, Málaga, Spain

Beatriz Villaescusa-Gonzalez

Unidad de Gestión Clínica Intercentros de Oncología, Hospitales Universitarios Regional y Virgen de la Victoria, Málaga, Spain

Emilio Alba

Unidad de Gestión Clínica Intercentros de Oncología, Hospitales Universitarios Regional y Virgen de la Victoria, Málaga, Spain

Abstract

The majority of clinical information stored in Spanish healthcare systems is found as unstructured text in electronic health records (EHRs). The automatic extraction of valuable information contained in these documents is a critical task. Valuable information for oncology clinical analysis units and Real-World Evidence studies includes the location of the neoplasm presented by a patient. This location, included in the ICD-10 coding category, can be extracted from the texts by natural language processing (NLP). This study set out to explore the classification of medical documents in Spanish for the purpose of extracting the location of the patient's primary neoplasm. A private corpus composed of 23,704 real clinical EHRs was utilised. The prediction problem was approached through a classification of 12 primary organ groupings and 29 specific locations. In order to achieve this, four NLP methodologies were developed: traditional machine learning (ML); ensemble ML; recurrent neural networks (RNN); and Transformer-based models. Our findings demonstrate that traditional ML models exhibit superior performance when compared to RNNs and Transformers. Models such as XGBoost and SVMs demonstrate remarkable efficacy, attaining an F1-score of 0.938 for the 12-class classification and an F1-score of 0.838 for the 29-class specific classification, respectively. The pre-trained RoBERTa-Base-Biomed model, which incorporates medical and clinical corpora in Spanish, demonstrates an F1-score of 0.808 for the 29-location problem. However, the Transformer model outperforms the other approaches when dealing with an external corpus, indicating a higher generalisation capacity.
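The F1-scores quoted in the abstract summarise multiclass performance in a single number. As a rough illustration of how such a figure can be computed, here is a minimal sketch that assumes macro-averaging (the abstract does not state which averaging the authors used). The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical toy labels, for illustration only (not the paper's classes).
y_true = ["breast", "lung", "colon", "lung", "breast", "colon"]
y_pred = ["breast", "lung", "lung", "lung", "breast", "colon"]
score = macro_f1(y_true, y_pred)
```

Macro-averaging weights every class equally, so rare locations count as much as common ones; a weighted or micro average would give a different number on imbalanced data.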

Keywords

Natural language processing

Cancer classification

Large language models

Transformers

Deep learning

Ensemble learning

Multiclass classification

Electronic health records

Publication Information

Pages

1-14

Published

8/10/2025

Received

1/6/2025

Accepted

1/10/2025

Impact Metrics

Citations: 0

Impact Factor: 6.8

Quartile

Views: 36

Downloads: 16

Altmetric: 5
