Abstract
Electronic claims records (ECRs) are large-scale, longitudinal collections of individuals' medical service-seeking actions. Compared with in-hospital electronic medical records (EMRs), ECRs are more standardized and span multiple sites. Recent studies have shown promising results in modeling claims data for a wide range of medical applications. However, few of them address exclusion criteria in cohort selection, which are needed to extract new incident cases without prior signs, and they often place little emphasis on predicting cancer at early stages. In this work, we design a lung cancer prediction framework on ECRs that combines a rigorous exclusion design with a state-of-the-art sequence-based transformer. Furthermore, this work presents one of the first results from applying a disease prediction model to the entire population of Taiwan. On our dataset, the results show over 2.1 predictive power, 5 average positive predictive value (PPV), and 0.668 area under the curve (AUC) for all-stage lung cancer, and around 2.0 predictive power, 1 average PPV, and 0.645 AUC for early-stage lung cancer. Sub-cohort analysis could funnel selected high-precision groups into prioritized clinical examination. Onset analysis validates the effect of our exclusion criteria. This work presents comprehensive analyses of lung cancer prediction, and the proposed approach can serve as a state-of-the-art disease risk prediction framework on claims data.