Synthetic data allows for safe sharing in low-resource settings
By Susan Scutti, Fogarty International Center
The Kaloleni-Rabai Health and Demographic Surveillance System (KRHDSS) is embedded in seven rural and three peri-urban community health units centered around Mariakani township, Kenya. Set up by Aga Khan University (AKU) in 2017, KRHDSS holds information on more than 103,000 residents. The beauty of such a large dataset is that it collects data over time, so it can reveal health patterns affecting a community that would otherwise go undetected, says Dorcas Mwigereri, a research fellow at AKU. “We can study separate diseases, comorbidities, and also look at how one disease leads to the development of another.”
Unfortunately, accessing, using, and sharing medical data is restricted by the necessary regulations to protect patient privacy, and this constrains the development and deployment of new technologies within health systems, says Mwigereri. “How do we solve this problem? That’s where synthetic data comes in—synthetic data creates a dataset with the same statistical properties as the original data yet with minimal privacy risks.”
One way to create synthetic data is with a generative adversarial network (GAN), a type of machine learning model that can anonymize information in datasets with complex structures. So which GAN would work best in the Kenyan context? Mwigereri and her colleagues evaluated fidelity (how well a model reproduces the statistical patterns of the original data), utility (how well a model supports analysis and prediction), and privacy (how well a model protects confidential data) across three open-source GANs and found CTGAN performed best overall.
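To make the three evaluation criteria concrete, here is a minimal sketch in Python of one simple way each could be measured. It is an illustration only and does not show the metrics used in the study: the function names are hypothetical, the chosen measures (gap in column means, train-on-synthetic/test-on-real accuracy, distance to closest record) are common stand-ins, and randomly generated numbers take the place of real KRHDSS records and actual GAN output.

```python
import numpy as np

def fidelity_gap(real, synth):
    """Fidelity: mean absolute gap between column means, in units of the real std."""
    scale = real.std(axis=0) + 1e-12  # avoid division by zero
    return float(np.mean(np.abs(real.mean(axis=0) - synth.mean(axis=0)) / scale))

def utility_tstr(synth_X, synth_y, real_X, real_y):
    """Utility: fit a nearest-centroid classifier on synthetic rows, score it on real rows."""
    classes = np.unique(synth_y)
    centroids = np.stack([synth_X[synth_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(real_X[:, None, :] - centroids[None, :, :], axis=2)
    predictions = classes[dists.argmin(axis=1)]
    return float((predictions == real_y).mean())

def privacy_dcr(real, synth):
    """Privacy: distance from each synthetic row to its closest real row.
    Values near zero suggest a synthetic row may be a copy of a real record."""
    dists = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=2)
    return dists.min(axis=1)

# Toy data (assumption): two numeric columns and a binary label.
rng = np.random.default_rng(0)
real_y = rng.integers(0, 2, size=200)
real_X = rng.normal(loc=real_y[:, None] * 3.0, scale=1.0, size=(200, 2))

# Stand-in for GAN output: fresh draws from the same distribution.
synth_y = rng.integers(0, 2, size=200)
synth_X = rng.normal(loc=synth_y[:, None] * 3.0, scale=1.0, size=(200, 2))

print("fidelity gap:", fidelity_gap(real_X, synth_X))
print("utility (accuracy on real data):", utility_tstr(synth_X, synth_y, real_X, real_y))
print("median distance to closest real record:", np.median(privacy_dcr(real_X, synth_X)))
```

A smaller fidelity gap, a higher accuracy on real data, and larger distances to the closest real record would each point in the favorable direction, though in practice the three goals trade off against one another.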
Good performance within a specific context is crucial when creating synthetic data, says Mwigereri. She recalls using Teladoc, an automated, AI-enabled health care service, while studying in the U.S. “The feedback was ‘we’re not able to understand what you’re saying, please get in touch with the facility.’ My accent is different. Clearly, this model was not trained with data from my context—an African context.”
In Kenya, there are 47 tribes. Other African nations similarly include different populations. Meanwhile, individual countries do not always share a unifying language. “African researchers need to collect enough data from our people to create technologies that fit our societies, so that we can then co-create solutions with researchers in the U.S., in the UK, wherever.”
As she completes her PhD, Mwigereri continues working on two additional DS-I Africa projects in Kenya. One uses AI with data collected across five facilities to identify healthcare workers prone to depression. The other relies on electronic health records (EHRs) to distinguish women in danger of developing gestational diabetes mellitus. The data is there in the EHRs, but it was collected for clinical purposes, not research, so it’s not yet accessible to researchers, says Mwigereri.
“If we sort out the issues around data access, Africa will see improvements in the healthcare sector… he who owns the data, owns the insights.”
Article:
Synthetic data generation of health and demographic surveillance systems data: a case study in a low- and middle-income country
Publication: JAMIA Open, 2025
