Development of a pancreatic cancer prediction model using a multinational medical records database

Limor Appelbaum, MD1, Alex Berg2, Jose Cambronero2, Thurston Dang, PhD2, Charles Jin2, Lori Zhang2, Steven Kundrot3, Matvey Palchuk3, Laura A. Evans3, Irving D. Kaplan, MD1, Martin Rinard, PhD2

Beth Israel Deaconess Medical Center, Boston, MA1
Massachusetts Institute of Technology, Cambridge, MA2
TriNetX, Inc., Cambridge, MA3
Pancreatic ductal adenocarcinoma (PDAC) is the third most lethal cancer in the USA, with a 10% five-year survival rate [1]. The majority of PDAC cases are diagnosed at an advanced stage, mostly due to a lack of specific early symptoms and the absence of effective screening strategies. PDAC in its early or precursor stages is potentially resectable and curable [2]. Although screening is recommended for individuals with an inherited predisposition, who comprise less than 10% of all PDAC cases [3], there is currently no screening program for the general population.
Our prior work demonstrated that applying Machine Learning to diagnoses from Electronic Health Records (EHRs) can identify individuals at high risk for PDAC up to 1 year before their current diagnosis [4].
We aim to improve the performance of our model by using a multi-center dataset and adding lab test features.
EHR data from TriNetX, a federated health research network, were used to develop Logistic Regression (LR) models from diagnoses and lab test data collected at 32 Health Care Organizations in the US (2015-2020).
PDAC patients were identified using ICD codes and validated with tumor registry and pathology data. Patients 60-80 years old with one or more clinical encounters at least 6 months prior to PDAC diagnosis were included, using prediction time cutoffs of 180, 270, and 360 days before PDAC diagnosis (Fig. 1).
A preliminary data analysis was performed to explore lab test features that could improve model performance.
LR models were compared using Area Under the Receiver Operating Characteristic Curve (AUC); 95% Confidence Intervals were computed using empirical bootstrap over test data.
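The comparison metric can be illustrated with a minimal sketch. The abstract does not say which bootstrap variant was used, so the pivot form of the empirical bootstrap shown here (resample the test set, take differences from the point estimate, and invert them) is an assumption, as are the function names `auc` and `bootstrap_auc_ci` and the default of 1000 resamples.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound) for the test AUC.

    Assumed variant: empirical (pivot) bootstrap over test examples.
    """
    rng = random.Random(seed)
    point = auc(labels, scores)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # a resample needs both classes for AUC to be defined
        deltas.append(auc(ys, [scores[i] for i in idx]) - point)
    deltas.sort()
    lo_delta = deltas[int((alpha / 2) * (n_boot - 1))]
    hi_delta = deltas[int((1 - alpha / 2) * (n_boot - 1))]
    # Pivot interval: subtract the upper delta for the lower bound, and vice versa.
    return point, point - hi_delta, point - lo_delta
```

Because the pivot interval is not clipped, its bounds can fall slightly outside [0, 1] on small test sets; a percentile interval would be a reasonable alternative.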
We used L2-regularized LR, and performed evaluation using cross-validation. In contrast to prior published work that used predefined feature sets, we incorporated a wide range of indicators, and relied on regularization to address potential overfitting risk.
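As a rough illustration of this setup, the sketch below fits an L2-penalized logistic regression by batch gradient descent and scores it with stratified k-fold cross-validation. It is not the authors' implementation: the optimizer, learning rate, penalty strength, fold count, and all function names are assumptions chosen to keep the example dependency-free.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(model, x):
    """Predicted probability of being a case for one feature vector."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_l2_logreg(X, y, l2=1.0, lr=0.1, epochs=500):
    """Minimize mean log loss + (l2 / 2n) * ||w||^2; the bias is not penalized."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = score((w, b), xi) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        # The l2 * wj term is the gradient of the penalty; it shrinks weights toward 0.
        w = [wj - lr * (gj + l2 * wj) / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def auc(labels, scores):
    """Probability that a random case outscores a random control (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(y, k):
    """Deal case and control indices round-robin so every fold holds both classes.

    Assumes each class has at least k members.
    """
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, yi in enumerate(y) if yi == cls]
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

def cross_validated_auc(X, y, k=3, l2=1.0):
    """Mean held-out AUC across k stratified folds."""
    aucs = []
    for held_out in stratified_folds(y, k):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit_l2_logreg([X[i] for i in train], [y[i] for i in train], l2=l2)
        aucs.append(auc([y[i] for i in held_out],
                        [score(model, X[i]) for i in held_out]))
    return sum(aucs) / k
```

With a wide feature set, `l2` is the knob that trades fit against overfitting: larger values pull every coefficient toward zero, so features that carry little signal contribute little to the score.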
Censoring patient data to evaluate performance in early prediction
  • Required for clinical relevance
  • Consider 180, 270, and 360 days lead time to cancer diagnosis
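The censoring step can be sketched as follows. This is a minimal illustration under stated assumptions: events are modeled as hypothetical `(date, code)` tuples, "6 months" is approximated as 180 days, and an event falling exactly on the cutoff date is kept. How control patients were assigned a reference date is not described in the abstract, so the sketch covers only a patient with a known diagnosis date.

```python
from datetime import date, timedelta

CUTOFFS_DAYS = (180, 270, 360)  # prediction time cutoffs from the abstract

def censor_events(events, diagnosis_date, cutoff_days):
    """Keep only events recorded at least cutoff_days before diagnosis_date."""
    horizon = diagnosis_date - timedelta(days=cutoff_days)
    return [(d, code) for d, code in events if d <= horizon]

def is_eligible(age_years, events, diagnosis_date, min_lead_days=180):
    """60-80 years old with at least one encounter roughly 6 months before diagnosis."""
    has_early_encounter = len(censor_events(events, diagnosis_date, min_lead_days)) > 0
    return 60 <= age_years <= 80 and has_early_encounter

# Hypothetical patient record, for illustration only.
events = [
    (date(2019, 1, 1), "glucose"),
    (date(2019, 5, 1), "potassium"),
    (date(2019, 9, 1), "hematocrit"),
    (date(2019, 12, 1), "creatinine"),
]
diagnosis = date(2020, 1, 1)
visible = {c: censor_events(events, diagnosis, c) for c in CUTOFFS_DAYS}
```

A longer lead time leaves the model with less history per patient, which is why performance is reported separately for each cutoff.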
The x-axis shows the average number of lab test administrations per PDAC patient at the 360-day censoring cutoff; the y-axis shows the average number of administrations per control patient. The separator has a slope of 1: lab tests on or near the line are administered, on average, equally often to control patients and PDAC patients. A point’s color indicates whether the lab test was selected by an expert (orange dots) or drawn from all available lab tests. Most lab tests were administered more frequently to PDAC patients than to controls.
The figure shows the five lab tests judged to have the greatest discriminatory capacity (approximated by distance from the separator with a slope of 1). The top 5 lab tests with the highest discriminatory coefficients are the same across the 180-, 270-, and 360-day cutoffs.
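The quantities plotted in these figures can be reproduced with a short sketch. It assumes the censored lab records are available as hypothetical `(is_case, lab_name)` rows, one per administration, and that group sizes are passed separately so that patients with no administrations still count in the denominator. The distance of a point (x, y) from the slope-1 line is |x - y| / sqrt(2).

```python
import math
from collections import Counter

def mean_administrations(rows, n_cases, n_controls):
    """Map each lab test to (mean per PDAC patient, mean per control patient)."""
    case_counts, control_counts = Counter(), Counter()
    for is_case, lab in rows:
        if is_case:
            case_counts[lab] += 1
        else:
            control_counts[lab] += 1
    labs = set(case_counts) | set(control_counts)
    return {lab: (case_counts[lab] / n_cases, control_counts[lab] / n_controls)
            for lab in labs}

def rank_by_distance(means, top=5):
    """Order lab tests by distance from the slope-1 separator, largest first."""
    dist = {lab: abs(x - y) / math.sqrt(2) for lab, (x, y) in means.items()}
    return sorted(dist, key=dist.get, reverse=True)[:top]

# Made-up counts for two cases and two controls, for illustration only.
rows = ([(True, "glucose")] * 6 + [(False, "glucose")] * 2
        + [(True, "potassium")] * 3 + [(False, "potassium")] * 2
        + [(True, "sodium")] * 2 + [(False, "sodium")] * 2)
means = mean_administrations(rows, n_cases=2, n_controls=2)
top_two = rank_by_distance(means, top=2)
```

Distance from the separator is only a proxy for discriminatory value; the coefficients of the fitted LR model are the more direct measure.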
With a 360-day lead time, the test AUCs for the diagnoses-based LR, the lab-test based LR, and the combined diagnoses/lab-test model (concatenated LR model) were 0.58, 0.72 and 0.73, respectively.

Lab test administration per patient was found to be the most valuable feature for improving discrimination. 95% Confidence Intervals were computed using empirical bootstrap over test data; models were L2-regularized LR, evaluated using cross-validation.

LR models were trained and evaluated on diagnoses and labs for 25,644 patients (1,352 cases; age- and sex-paired controls).
Lab test administration per patient (i.e., which lab tests were given and how frequently) was found to be the most valuable feature for improving discrimination. For almost every type of lab test, the average number of administrations per patient was higher for PDAC patients than for controls (Fig. 2). The lab tests with the highest discriminatory coefficients included glucose, potassium, hematocrit, hemoglobin, and creatinine (Fig. 3).
With a 360-day lead time, the test AUCs for the diagnoses-based LR, the lab-test based LR, and the combined diagnoses/lab-test model (concatenated LR model) were 0.58, 0.72 and 0.73, respectively (Fig. 4).
1. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2020. CA Cancer J Clin. 2020;70(1):7-30.
2. Canto MI, Almario JA, Schulick RD, Yeo CJ, Klein A, Blackford A, et al. Risk of Neoplastic Progression in Individuals at High Risk for Pancreatic Cancer Undergoing Long-term Surveillance. Gastroenterology. 2018;155(3):740-751.e2.
3. Benzel J, Fendrich V. Familial Pancreatic Cancer. Oncol Res Treat. 2018;41(10):611-8.
4. Appelbaum L, Cambronero JP, et al. Development and validation of a pancreatic cancer risk model for the general population using electronic health records: An observational study. Eur J Cancer. 2021;143:19-30. doi: 10.1016/j.ejca.2020.10.019. PMID: 33278770.
Concatenated LR models can outperform both diagnoses-based and lab-test-based LR models and can be utilized in PDAC prediction.
Limor Appelbaum
No Relationships to Disclose

Alexandra Berg
Employment - MedCap, Envirotainer AB (I)
Leadership - MedCap (board member), Envirotainer AB (I)
Stock and Other Ownership Interests - MedCap, Envirotainer AB (I)
Jose Pablo Cambronero
No Relationships to Disclose

Thurston Hou Yeen Dang
Stock and Other Ownership Interests - Indirectly via 401K funds
Charles Chuan Jin
No Relationships to Disclose
Lori Zhang
No Relationships to Disclose

Steven Kundrot
Employment - TriNetX
Leadership - TriNetX
Stock and Other Ownership Interests - TriNetX
Matvey Palchuk
Employment - TriNetX
Laura A. Evans
Employment - TriNetX
Irving D. Kaplan
No Relationships to Disclose
Martin Rinard
No Relationships to Disclose
No specific funding was received for this research.
This research was performed as part of the authors' employment, under a no-cost collaboration agreement between BIDMC, MIT, and TriNetX.

© 2021 American Society of Clinical Oncology, Inc. Reused with permission. This abstract was accepted and previously presented at the 2021 ASCO-GI Meeting. All rights reserved.