Synthea data generator

4/16/2023

Regulation, cost, and other factors can hinder the many benefits of access to data for analysis. This impedes not only the efforts of organizations to rationalize resources (e.g., workforce analytics, resource allocation, and consumer insights), but also research into health, medicine, the social sciences, and other endeavors that benefit society at large.

Finding a balance between privacy and the potential benefits of sharing personally identifiable or sensitive data (e.g., personal health information) seems an intractable problem, putting privacy advocates at loggerheads with those who wish to use data for commercial or research purposes. There is, however, a rapidly emerging solution to the use and sharing of data across organizations and, indeed, across borders: data synthesis.

Synthetic data, as its name implies, is not actual data taken from real-world events or individuals' attributes. Rather, it is data generated by a computer (i.e., by synthetic data generation tools) that matches the key statistical properties of the real sample data. Importantly, the resulting data is not the actual data after pseudonymization or anonymization. It is artificial data that does not map back to any actual natural person.

During the Spokes Privacy Conference, Dr. Khaled El Emam, CEO of Replica Analytics, provided a highly accessible technical overview of data synthesis and synthetic data. Being able to create and share data that provides insights into cohorts and segments without impinging on privacy has profound implications for the adtech and martech ecosystem, which is struggling with privacy-centric moves by Apple, Google, and other platforms.

Developing an effective screening tool for chronic kidney disease (CKD) helps reduce morbidity and mortality as well as the cost and burden to the health system. Here we present preliminary results on the feasibility and performance of employing a Convolutional Neural Network (CNN) prediction model to detect patients who are at risk of CKD based on longitudinal data from Electronic Health Records (EHR).

Methods: A synthetic dataset containing a total of 100,000 synthetic patient records, derived from a cross-sectional cohort of Veterans from the general population, was used to train and validate the prediction model. The dataset was generated using Synthea™, a patient generator tool, and contains standard data elements that are commonly used in major Electronic Health Record systems. A total of 12,503 patients with CKD and 48,212 patients without CKD, matched by propensity score, along with 290 other features including anthropometrics, medication, comorbidities, and laboratory data, were used to train and validate the prediction model.

The CNN algorithm was designed, implemented, and tested using the synthetic dataset, achieving a precision of 0.918, recall of 0.739, specificity of 0.983, accuracy of 0.932, and AUROC of 0.937, as depicted in Figure 1. Additionally, based on the dataset, age, diabetes, elevated BMI, and medication taken, specifically 24 HR Metformin Hydrochloride, represent the most important features for predicting the onset of CKD.

Using a CNN and a synthetic veteran patient dataset, we have demonstrated a viable prediction model for detecting CKD in healthy patients at risk of CKD using longitudinal data from an EHR system. The prediction model can be easily deployed in a CKD screening program in healthcare institutions with existing EHR systems.
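
To make the reported precision, recall, specificity, and accuracy concrete, the sketch below shows how each one follows from the four cells of a confusion matrix. The counts are hypothetical: the source does not report a confusion matrix or say which subset of patients the metrics were computed on. As an assumption for illustration only, the counts are back-calculated by applying the reported recall and specificity to the full matched cohort of 12,503 CKD and 48,212 non-CKD patients. The names `ConfusionCounts` and `counts_from_rates` are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # CKD patients the model flags as CKD
    fp: int  # non-CKD patients the model flags as CKD
    tn: int  # non-CKD patients the model clears
    fn: int  # CKD patients the model misses


def precision(c: ConfusionCounts) -> float:
    # Of all patients flagged as CKD, the fraction who have CKD.
    return c.tp / (c.tp + c.fp)


def recall(c: ConfusionCounts) -> float:
    # Of all CKD patients, the fraction the model flags (sensitivity).
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    # Of all non-CKD patients, the fraction the model clears.
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    # Fraction of all patients classified correctly.
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def counts_from_rates(n_pos: int, n_neg: int,
                      recall_rate: float,
                      specificity_rate: float) -> ConfusionCounts:
    # ASSUMPTION: both rates apply to the whole cohort; the source does not say so.
    tp = round(recall_rate * n_pos)
    tn = round(specificity_rate * n_neg)
    return ConfusionCounts(tp=tp, fp=n_neg - tn, tn=tn, fn=n_pos - tp)


# Hypothetical counts derived from the cohort sizes and rates quoted in the text.
counts = counts_from_rates(12_503, 48_212, recall_rate=0.739, specificity_rate=0.983)

print(counts)
print(f"precision   = {precision(counts):.3f}")
print(f"recall      = {recall(counts):.3f}")
print(f"specificity = {specificity(counts):.3f}")
print(f"accuracy    = {accuracy(counts):.3f}")
```

Under that assumption, precision comes out at about 0.918 and accuracy at about 0.933, close to the reported 0.918 and 0.932. This suggests the four reported figures are at least mutually consistent, though it does not confirm how they were computed. AUROC is absent because it cannot be recovered from a single confusion matrix; it needs the model's scores across all decision thresholds.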
