Abstract
Electronic medical claims (EMC) databases have been successfully used to predict the occurrence of stroke and a variety of other diseases. However, predictive performance has been inadequate for rare events because of both insufficient training samples and highly imbalanced class distributions. In this work, we aim to improve stroke prediction, especially for the young age group (25-45 years old), in a large population-based EMC database (552,898 subjects). We train a deep neural network model for young stroke prediction using a novel active data augmenter. The augmenter selects the most informative electronic health record (EHR) data samples from old-age stroke patients. This approach improves the area under the receiver operating characteristic curve (AUC) by 9.3% and 8.2% compared with training on young age group data only and training on data from all age groups, respectively. We further analyze the AUC values obtained as a function of the training data size and of the amount and type of augmented data samples.