Alejandro Pascual-Mellado, Nuria Ribelles, José M. Jerez, Francisco J. Moreno-Barea

Departamento de Lenguajes y Ciencias de la Computación, Escuela Técnica Superior de Ingeniería Informática, Universidad de Málaga, Málaga, Spain

Hospital Universitario Virgen de la Victoria, Málaga, Spain

Most clinical information in Spanish healthcare systems is stored as unstructured text in electronic medical records, so automatically extracting the valuable information these documents contain is a critical task. For clinical analysis units in oncology, one such item is the location of a patient's neoplasm. This location, which is captured by the ICD-10-ES coding category, can be extracted from the text using natural language processing. To this end, we developed methods based on Transformer models, the current state of the art in natural language processing. The results show that these models are of great help in this task. In particular, the RoBERTa-Base-Biomed model performed best, with an accuracy of 0.946, a precision of 0.920, a sensitivity (recall) of 0.898 and an F1-score of 0.908, and it performed well for most classes.
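
For readers unfamiliar with the four reported metrics, the following minimal sketch shows how they could be computed for a multi-class task such as neoplasm location coding. It assumes macro-averaging over classes, which the text above does not confirm; the function name `macro_metrics` and the toy labels are hypothetical and are not data from the study.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall and F1.

    Assumption: per-class scores are averaged with equal weight (macro).
    A class with no predictions or no true instances scores 0.0.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[t] += 1      # t was the true class but was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,   # recall is the same quantity as sensitivity
        "f1": sum(f1s) / n,
    }


# Toy example with illustrative ICD-10 categories (not study data).
y_true = ["C50", "C50", "C34", "C18"]
y_pred = ["C50", "C34", "C34", "C18"]
scores = macro_metrics(y_true, y_pred)
```

Under macro-averaging every class contributes equally, so a rare location that is poorly classified lowers the averages more than it lowers accuracy, which is one possible reason for accuracy exceeding the other three figures.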