Leveraging Pre-trained and Transformer-derived Embeddings from EHRs to Characterize Heterogeneity Across Alzheimer’s Disease and Related Dementias

Published in arXiv, 2024

Recommended citation: West, M., Magdamo, C., Cheng, Y., He, Y., & Das, S. (2024). Leveraging Pre-trained and Transformer-derived Embeddings from EHRs to Characterize Heterogeneity Across Alzheimer's Disease and Related Dementias. arXiv:2404.00464.

This work applies unsupervised machine learning to electronic health records (EHRs) of patients with memory disorders to characterize heterogeneity in Alzheimer’s disease and related dementias (ADRD). Patient records are encoded with pre-trained embeddings of medical codes and with transformer-derived Clinical BERT embeddings of free-text clinical notes. Hierarchical clustering of these encodings identifies distinct patient sub-populations that share medical conditions and clinical documentation patterns. The resulting clusters suggest heterogeneous ADRD subtypes that may reflect different disease mechanisms and treatment needs. The work is an early methodological step toward precision medicine for ADRD based on routinely collected clinical data.
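
The encode-then-cluster idea described above can be illustrated with a minimal sketch. It is not the paper's implementation. The tiny hand-written vectors stand in for real pre-trained code embeddings and Clinical BERT note embeddings, and the helper names (`encode_patient`, `agglomerative`), the mean pooling, the concatenation, the cosine distance, and the average linkage are all assumptions made for illustration.

```python
import math
from itertools import combinations


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def encode_patient(code_vectors, note_vector):
    """Hypothetical patient encoding: mean-pooled medical-code embeddings
    concatenated with a single note embedding (an assumed scheme)."""
    return mean_pool(code_vectors) + list(note_vector)


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative(points, n_clusters):
    """Average-linkage agglomerative (hierarchical) clustering.

    Starts with one cluster per point and repeatedly merges the two
    clusters with the smallest mean pairwise distance until only
    n_clusters remain. Returns a list of sorted index lists.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(
                cosine_distance(points[i], points[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            dist = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or dist < best[0]:
                best = (dist, a, b)
        _, a, b = best
        # a < b always holds here, so deleting b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(cluster) for cluster in clusters]


# Toy stand-ins: 2-d "code embeddings" and 2-d "note embeddings" per patient.
patients = [
    encode_patient([[1.0, 0.0], [0.9, 0.1]], [1.0, 0.0]),
    encode_patient([[0.8, 0.2]], [0.9, 0.1]),
    encode_patient([[0.0, 1.0], [0.1, 0.9]], [0.0, 1.0]),
    encode_patient([[0.2, 0.8]], [0.1, 0.9]),
]

clusters = agglomerative(patients, n_clusters=2)
print(clusters)
```

With these toy vectors, patients 0 and 1 end up in one cluster and patients 2 and 3 in the other. On real data the embeddings would come from the pre-trained models, and a library routine for hierarchical clustering would usually replace the quadratic loop shown here.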