Synthetic data allows for safe sharing in low-resource settings

By Susan Scutti, Fogarty International Center

The Kaloleni-Rabai Health and Demographic Surveillance System (KRHDSS) is embedded in seven rural and three peri-urban community health units centered around Mariakani township, Kenya. Set up by Aga Khan University (AKU) in 2017, KRHDSS holds information on more than 103,000 residents. The beauty of such a large dataset is that it collects data over time, so it can reveal health patterns in a community that would otherwise go undetected, says Dorcas Mwigereri, a research fellow at AKU. “We can study separate diseases, comorbidities, and also look at how one disease leads to the development of another.”

Unfortunately, accessing, using, and sharing medical data are restricted by the regulations needed to protect patient privacy, and this constrains the development and deployment of new technologies within health systems, says Mwigereri. “How do we solve this problem? That’s where synthetic data comes in. Synthetic data creates a dataset with the same statistical properties as the original data yet with minimal privacy risks.”

One way to create synthetic data is with a generative adversarial network (GAN), a type of machine learning model that can anonymize information in datasets with complex structures. So which GAN would work best in the Kenyan context? Mwigereri and her colleagues evaluated fidelity (how well a model reproduces the statistical patterns of the original data), utility (how well a model supports analysis and prediction), and privacy (how well a model protects confidential data) across three open-source GANs and found that CTGAN performed best overall.
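To make the three evaluation dimensions concrete, here is a minimal Python sketch that scores a toy synthetic table against a toy real one. It is an illustration only, not the study's method: the metric choices (total variation distance for fidelity, train-on-synthetic/test-on-real accuracy for utility, exact-copy rate for privacy), the function names, and the data are all assumptions made for this example, and no GAN is trained here.

```python
from collections import Counter

def marginal_tvd(real, synth):
    """Fidelity proxy: total variation distance between the value
    frequencies of one categorical column (0 = identical, 1 = disjoint)."""
    p, q = Counter(real), Counter(synth)
    n_p, n_q = len(real), len(synth)
    return 0.5 * sum(abs(p[v] / n_p - q[v] / n_q) for v in set(p) | set(q))

def exact_match_rate(real_rows, synth_rows):
    """Privacy proxy: share of synthetic rows that copy a real row verbatim."""
    real_set = set(real_rows)
    return sum(row in real_set for row in synth_rows) / len(synth_rows)

def tstr_accuracy(synth_rows, real_rows):
    """Utility proxy (train on synthetic, test on real): learn the most
    common label per feature value from synthetic rows, score on real rows."""
    votes = {}
    for feature, label in synth_rows:
        votes.setdefault(feature, Counter())[label] += 1
    rule = {f: c.most_common(1)[0][0] for f, c in votes.items()}
    hits = sum(rule.get(feature) == label for feature, label in real_rows)
    return hits / len(real_rows)

# Toy (feature, label) rows, invented for illustration only.
real = [("rural", 1), ("rural", 1), ("urban", 0), ("urban", 0)]
synth = [("rural", 1), ("urban", 0), ("urban", 1), ("urban", 0)]

fidelity_gap = marginal_tvd([r[0] for r in real], [s[0] for s in synth])
copy_rate = exact_match_rate(real, synth)
utility = tstr_accuracy(synth, real)
```

A lower `fidelity_gap` and `copy_rate` and a higher `utility` would point to a better generator under these simplified proxies. A real evaluation would typically compare many columns and their joint structure, and would use stronger privacy tests than exact-copy detection.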

Good performance within a specific context is crucial when creating synthetic data, says Mwigereri. She recalls using Teladoc, an automated, AI-enabled health care service, while studying in the U.S. “The feedback was ‘we’re not able to understand what you’re saying, please get in touch with the facility.’ My accent is different. Clearly, this model was not trained with data from my context—an African context.”

In Kenya, there are 47 tribes. Other African nations similarly include different populations. Meanwhile, individual countries do not always share a unifying language. “African researchers need to collect enough data from our people to create technologies that fit our societies, so that we can then co-create solutions with researchers in the U.S., in the UK, wherever.”

As she completes her PhD, Mwigereri continues working on two additional DS-I Africa projects in Kenya. One uses AI with data collected across five facilities to identify healthcare workers prone to depression. The other relies on electronic health records (EHRs) to identify women at risk of developing gestational diabetes mellitus. The data is there in the EHRs, but it was collected for clinical purposes, not research, so it’s not yet accessible to researchers, says Mwigereri.

“If we sort out the issues around data access, Africa will see improvements in the healthcare sector… he who owns the data, owns the insights.”

Article:
Synthetic data generation of health and demographic surveillance systems data: a case study in a low- and middle-income country
Publication: JAMIA Open, 2025
