JCR Journal
Natural Language Processing
IF: 6.8
Q1

ICD-10 Neoplasm Location using Text Classification Models in Spanish Electronic Health Records

Francisco J. Moreno-Barea, Alejandro Pascual-Mellado, Héctor Mesa, Beatriz Villaescusa-Gonzalez, Emilio Alba, José M. Jerez

IEEE Journal of Biomedical and Health Informatics, 2025, pp. 1-14
Citations: 0
Views: 36
Downloads: 16
Altmetric Score: 5
Published: 8/10/2025
Authors

Alejandro Pascual-Mellado

Departamento de Lenguajes y Ciencias de la Computación, Escuela Técnica Superior de Ingeniería Informática, Universidad de Málaga, Málaga, Spain

Beatriz Villaescusa-Gonzalez

Unidad de Gestión Clínica Intercentros de Oncología, Hospitales Universitarios Regional y Virgen de la Victoria, Málaga, Spain

Emilio Alba

Unidad de Gestión Clínica Intercentros de Oncología, Hospitales Universitarios Regional y Virgen de la Victoria, Málaga, Spain

Abstract

Most clinical information stored in Spanish healthcare systems exists as unstructured text in electronic health records (EHRs), and automatically extracting the valuable information contained in these documents is a critical task. For oncology clinical analysis units and Real-World Evidence studies, one such piece of information is the location of the neoplasm presented by a patient. This location, encoded in the ICD-10 coding category, can be extracted from the texts by natural language processing (NLP). This study set out to explore the classification of medical documents in Spanish for the purpose of extracting the location of the patient's primary neoplasm. A private corpus composed of 23,704 real clinical EHRs was utilised. The prediction problem was approached as a classification over 12 primary organ groupings and over 29 specific locations. To this end, four NLP methodologies were developed: traditional machine learning (ML), ensemble ML, recurrent neural networks (RNNs), and Transformer-based models. Our findings demonstrate that traditional ML models outperform RNNs and Transformers: XGBoost and SVM models prove remarkably effective, attaining F1-scores of 0.938 for the 12-class classification and 0.838 for the 29-class specific classification, respectively. The pre-trained RoBERTa-Base-Biomed model, which incorporates medical and clinical corpora in Spanish, achieves an F1-score of 0.808 on the 29-location problem. However, the Transformer model outperforms the remaining approaches on an external corpus, indicating a higher generalisation capacity.
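As a minimal sketch of the kind of traditional-ML baseline the abstract describes (an SVM classifying clinical notes by neoplasm location), the snippet below trains a multiclass linear SVM over Spanish text. The TF-IDF feature choice, the toy notes, and the three-class label set are illustrative assumptions, not the study's actual setup or private corpus.

```python
# Sketch: multiclass text classification of clinical notes by neoplasm
# location, in the spirit of the SVM baseline named in the abstract.
# TF-IDF features and the toy corpus below are assumptions for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy notes, each labelled with an organ-group location.
notes = [
    "carcinoma ductal infiltrante de mama izquierda",
    "tumoración en mama derecha con adenopatías axilares",
    "adenocarcinoma de pulmón con afectación pleural",
    "nódulo pulmonar espiculado en lóbulo superior derecho",
    "adenocarcinoma de colon sigmoide estenosante",
    "neoplasia de colon ascendente con metástasis hepáticas",
]
labels = ["mama", "mama", "pulmon", "pulmon", "colon", "colon"]

# Word unigrams and bigrams capture short clinical phrases;
# LinearSVC handles the multiclass decision one-vs-rest.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LinearSVC(),
)
model.fit(notes, labels)

print(model.predict(["masa en mama con retracción del pezón"])[0])
```

The real study works at far larger scale (12 organ groupings, 29 specific locations, 23,704 EHRs), but the pipeline shape — vectorise, fit a linear multiclass model, predict a location label per document — is the same.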

Keywords
Natural language processing
Cancer classification
Large language models
Transformers
Deep learning
Ensemble learning
Multiclass classification
Electronic health records
Publication Information
Pages: 1-14
Published: 8/10/2025
Received: 1/6/2025
Accepted: 1/10/2025