Conference paper
Procesamiento del Lenguaje Natural
Impact factor: 0
Quartile: TBD

Named Entity Recognition for de-identifying Real-World Health Records in Spanish

Guillermo López-García, Francisco J. Moreno-Barea, Héctor Mesa, José M. Jerez, Nuria Ribelles, Emilio Alba, Francisco J. Veredas

International Conference on Computational Science, Springer, 2023, pp. 228–242
Citations: 1
Views: 850
Downloads: N/A
Altmetric score: N/A
Published: 26 June 2023
Abstract

Electronic Health Records (EHRs) have attracted renewed and growing interest as a source of information for clinical decision-making. In this context, the automatic de-identification of EHRs is an essential task, since dissociating them from personal data is a mandatory first step before they can be distributed. However, most previous studies on this subject have been conducted on English EHRs, owing to the limited availability of annotated corpora in other languages, such as Spanish. In this study, we address the automatic de-identification of medical documents in Spanish. A private corpus of 599 real-world clinical cases has been annotated with 8 different protected health information categories. We tackle the problem as a named entity recognition task and develop two deep learning-based methodologies: a first strategy based on recurrent neural networks (RNNs) and an end-to-end approach based on transformers. Additionally, we have developed a data augmentation procedure to increase the number of texts used to train the models. The results show that transformers outperform RNNs on the de-identification of Spanish clinical data. In particular, the best performance was obtained by the XLM-RoBERTa large transformer, with strict-match micro-averaged values of 0.946 for precision, 0.954 for recall and 0.95 for F1-score when trained on the augmented version of the corpus. The performance achieved by transformers in this study demonstrates the viability of applying these state-of-the-art models in real-world clinical scenarios.
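
The abstract reports strict-match, micro-averaged precision, recall and F1-score. As a minimal sketch of how such figures are commonly computed (not the authors' evaluation code), the Python below treats each entity as a tuple of document id, start offset, end offset and category, and counts a prediction as correct only when all four fields match a gold entity. The function name `strict_micro_prf`, the tuple layout and the category names in the toy data are assumptions made for illustration; the paper's 8 categories are not listed here.

```python
from typing import Iterable, Set, Tuple

# Hypothetical entity layout: (document id, start offset, end offset, category).
Entity = Tuple[str, int, int, str]


def strict_micro_prf(
    gold: Iterable[Entity], pred: Iterable[Entity]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 under strict matching.

    A prediction is a true positive only if its document, both boundaries
    and its category all equal those of a gold entity. Counts are pooled
    over all categories before the ratios are taken (micro-averaging).
    """
    gold_set: Set[Entity] = set(gold)
    pred_set: Set[Entity] = set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data with made-up categories; the second prediction has a wrong
# end offset, so it does not count under strict matching.
gold = [("doc1", 0, 10, "NAME"), ("doc1", 25, 35, "DATE"), ("doc2", 5, 12, "LOCATION")]
pred = [("doc1", 0, 10, "NAME"), ("doc1", 25, 33, "DATE"), ("doc2", 5, 12, "LOCATION")]
p, r, f = strict_micro_prf(gold, pred)

# The reported F1 is the harmonic mean of the reported precision and recall.
reported_f1 = 2 * 0.946 * 0.954 / (0.946 + 0.954)
```

Under strict matching, a prediction whose boundaries are off by even one character counts as both a false positive and a false negative, which is why it is a demanding criterion for de-identification.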

Keywords
Named Entity Recognition
Natural Language Processing
Electronic Health Records
De-identification
Spanish
Publication Information
Pages: 228–242
Published: 26 June 2023
Impact Metrics
Citations: 1
Impact factor: 0
Quartile: TBD
Views: 850