Francisco J. Moreno-Barea, Alejandro Pascual-Mellado, Héctor Mesa, Beatriz Villaescusa-Gonzalez, Emilio Alba, José M. Jerez

Departamento de Lenguajes y Ciencias de la Computación, Escuela Técnica Superior de Ingeniería Informática, Universidad de Málaga, Málaga, Spain

Unidad de Gestión Clínica Intercentros de Oncología, Hospitales Universitarios Regional y Virgen de la Victoria, Málaga, Spain

Most clinical information stored in Spanish healthcare systems exists as unstructured text in electronic health records (EHRs), and automatically extracting the valuable information these documents contain is a critical task. For oncology clinical analysis units and Real-World Evidence studies, one such piece of information is the location of a patient's neoplasm. This location, which is encoded in the ICD-10 category, can be extracted from the texts with natural language processing (NLP). This study explores the classification of medical documents in Spanish with the aim of extracting the location of the patient's primary neoplasm. A private corpus of 23,704 real clinical EHRs was utilised. The prediction problem was approached as a classification over 12 primary organ groupings and over 29 specific locations. To this end, four NLP methodologies were developed: traditional machine learning (ML), ensemble ML, recurrent neural networks (RNNs), and Transformer-based models. Our findings show that, on this corpus, traditional ML models outperform RNNs and Transformers. Models such as XGBoost and SVMs attain an F1-score of 0.938 for the 12-class classification and an F1-score of 0.838 for the 29-class specific classification, respectively. The pre-trained RoBERTa-Base-Biomed model, which incorporates medical and clinical corpora in Spanish, attains an F1-score of 0.808 for the 29-location problem. However, the Transformer model outperforms the other approaches when evaluated on an external corpus, indicating a higher generalisation capacity.
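For readers less familiar with the metric reported above, the following is a minimal sketch of how an F1-score can be computed for a multi-class problem such as the 12-grouping or 29-location task. It is not the authors' evaluation code: the abstract does not state which averaging scheme was used, so both macro and support-weighted averages are shown as assumptions, and the function names and the toy labels ("breast", "lung", "colon") are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p wrongly
            fn[t] += 1  # missed the true label t
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's number of true samples."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * n for label, n in support.items()) / len(y_true)


# Hypothetical toy labels, not taken from the study's corpus.
y_true = ["breast", "breast", "lung", "colon"]
y_pred = ["breast", "lung", "lung", "colon"]
print(round(macro_f1(y_true, y_pred), 4))
print(round(weighted_f1(y_true, y_pred), 4))
```

With imbalanced classes, as is common across tumour locations, the two averages can differ noticeably, so the choice of averaging matters when comparing reported scores.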