Citation Link: https://doi.org/10.25819/ubsi/10806
Ensemble Learning for Dealing with Missing Data in Public Health
Source Type
Master Thesis
Author
Julian Gibas
Institute
Issue Date
2025
Abstract
Missing data in public health present challenges for evidence-based research. Convenient missing data handling methods such as complete-case analysis or single imputation tend to produce biased parameter estimates. Multiple Imputation by Chained Equations (MICE) is a more appropriate approach but struggles with statistical complexities such as non-linearity in health data. Ensemble learning methods like random forest and XGBoost offer greater flexibility for multiple imputation (MI). In particular, Multiple Imputation through XGBoost (Mixgb) is a recent development that promises to transfer the strong performance of XGBoost to missing data challenges.
This thesis evaluates three traditional methods (complete-case analysis, mean-mode-median imputation, MICE) alongside two ensemble algorithms for MI (MICE-ranger, Mixgb) through a simulation study recreating complex health data sets. Eight missing data scenarios combine Missing At Random (MAR) or Missing Not At Random (MNAR) mechanisms with missing data proportions of 10%, 20%, 30%, or 40%. Statistical complexities including non-linearity, interactions, conditional heteroskedasticity, class imbalance, and noise were incorporated. Methods were assessed on bias, confidence interval width, coverage, and a composite performance score, then applied to real-world health data.
Results indicate that Mixgb is, on average, the most robust method across all scenarios, showing the least bias and consistently good coverage. MICE-ranger performed better than MICE but worse than Mixgb. In the real-world data analysis, all MI methods generated proper imputations for low-to-moderate missing data proportions, though significant differences in individual parameter estimates occurred between methods, underscoring the need to consider different MI methods when handling missing data.
These findings suggest ensemble learning methods, particularly Mixgb, offer superior performance compared to linear methods like MICE for complex missing data scenarios. Given the heterogeneous nature of public health data, researchers should consider ensemble methods for MI as robust solutions for missing data challenges, without neglecting MICE as a possible solution to more linear missing data scenarios.
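The difference between the MAR and MNAR mechanisms named in the abstract can be illustrated with a small sketch. The abstract does not describe the thesis's actual data-generating process or amputation procedure, so everything below is an assumption made for illustration: the two-variable data, the logistic missingness model, the slope of 1.5, and the function names are all hypothetical. Under MAR the probability that `y` is missing depends on the fully observed `x`; under MNAR it depends on the unobserved value of `y` itself. The intercept is calibrated by bisection so that the expected missing proportion matches a target such as 10%, 20%, 30%, or 40%.

```python
import math
import random


def logistic(z):
    """Standard logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def calibrate_intercept(scores, target, lo=-20.0, hi=20.0, iters=60):
    """Bisection on the intercept so the mean missingness probability is close to target."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        mean_p = sum(logistic(mid + s) for s in scores) / len(scores)
        if mean_p < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def ampute(x, y, mechanism, proportion, rng):
    """Return a copy of y with None where a value was set to missing.

    MAR:  missingness is driven by the observed covariate x.
    MNAR: missingness is driven by the value of y itself.
    The slope 1.5 is an arbitrary illustrative choice.
    """
    driver = x if mechanism == "MAR" else y
    scores = [1.5 * d for d in driver]
    intercept = calibrate_intercept(scores, proportion)
    return [
        None if rng.random() < logistic(intercept + s) else value
        for s, value in zip(scores, y)
    ]


def simulate(n=20000, seed=1):
    """Hypothetical toy data: y depends linearly on x plus Gaussian noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [0.5 * xi + rng.gauss(0, 1) for xi in x]
    return x, y, rng


def complete_case_bias(y_full, y_amputed):
    """Bias of the complete-case mean of y relative to the full-data mean."""
    observed = [v for v in y_amputed if v is not None]
    return sum(observed) / len(observed) - sum(y_full) / len(y_full)


if __name__ == "__main__":
    x, y, rng = simulate()
    for mechanism in ("MAR", "MNAR"):
        for proportion in (0.1, 0.2, 0.3, 0.4):
            y_amputed = ampute(x, y, mechanism, proportion, rng)
            realized = sum(v is None for v in y_amputed) / len(y_amputed)
            bias = complete_case_bias(y, y_amputed)
            print(f"{mechanism} target={proportion:.1f} "
                  f"realized={realized:.3f} complete-case bias={bias:+.3f}")
```

In this toy setup the complete-case mean of `y` is biased under both mechanisms, because `x` and `y` are correlated. Under MAR the bias could in principle be corrected by conditioning on `x`, which is what imputation models attempt; under MNAR the information needed for the correction is itself missing.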
File(s)
Name
Masterarbeit_Gibas_Julian.pdf
Size
1.86 MB
Format
Adobe PDF
Checksum
(MD5):ad8afaa0c25636a7e76610d49192295d
Owning collection

