Dealing With Sparse Data Bias in Medical Sciences: Comprehensive Review of Methods and Applications
Abstract
This study illustrates the problem of complete and quasi-complete separation that arises with sparse data patterns in medical data. We demonstrate the failure of traditional methods and then give an overview of popular remedial approaches for reducing bias, using worked examples. Penalized maximum likelihood estimation and Bayesian methods are among the remedial tools introduced. Data from the Tehran Thyroid and Pregnancy Study, a two-phase cohort study conducted from September 2013 through February 2016, were used for illustration. The reduction in bias of the estimates showed how well these methods perform compared with the traditional method. Extremely large measures of association, such as risk ratios, together with extraordinarily wide confidence intervals, showed that traditional estimation methods are unreliable for sparse data, even though they are still widely applied and reported. In this review paper, we introduce some advanced methods, such as data augmentation, to reduce bias in the estimates.
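As a rough illustration of the separation problem and of the data-augmentation idea, the sketch below uses a hypothetical 2x2 table with a zero cell. The counts are invented for illustration and are not taken from the Tehran Thyroid and Pregnancy Study. The function names `odds_ratio` and `wald_ci` are likewise assumptions of this sketch. For a single binary exposure, adding 1/2 to every cell gives the same point estimate as Jeffreys-prior (Firth-type) penalization; with several covariates, dedicated software would be needed instead.

```python
import math


def odds_ratio(a, b, c, d, k=0.0):
    """Odds ratio (a*d)/(b*c) after adding the constant k to every cell.

    a, b: exposed subjects with / without the outcome
    c, d: unexposed subjects with / without the outcome
    k=0 gives the ordinary maximum likelihood estimate.
    """
    a, b, c, d = a + k, b + k, c + k, d + k
    if b * c == 0:
        # Zero cell in the denominator: the ML estimate does not exist as a
        # finite number (this is the separation problem).
        return math.inf if a * d > 0 else math.nan
    return (a * d) / (b * c)


def wald_ci(a, b, c, d, k=0.5, z=1.96):
    """Approximate Wald interval for the odds ratio on the augmented table.

    Requires every augmented cell to be positive. This is a simple
    approximation; it is not a profile-likelihood interval.
    """
    cells = [x + k for x in (a, b, c, d)]
    log_or = math.log(cells[0] * cells[3] / (cells[1] * cells[2]))
    se = math.sqrt(sum(1.0 / x for x in cells))
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical counts (assumed for illustration only):
# every exposed subject had the outcome, so cell b is zero.
table = (10, 0, 5, 15)

or_ml = odds_ratio(*table)            # infinite: traditional ML breaks down
or_aug = odds_ratio(*table, k=0.5)    # finite after augmentation
ci_low, ci_high = wald_ci(*table)

print("ML odds ratio:", or_ml)
print("Augmented odds ratio:", round(or_aug, 2))
print("Approximate 95% interval:", round(ci_low, 2), "to", round(ci_high, 2))
```

The augmented estimate is finite but its interval remains very wide, which reflects how little information a sparse table carries about the association.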
Issue | Vol 58, No 11 (2020) | |
Section | Original Article(s) | |
DOI | https://doi.org/10.18502/acta.v58i11.5147 | |
Keywords | Bayesian Method; Complete/Quasi-Complete Separation; Data augmentation; Penalization methods; Sparse Data Bias | |
Rights and permissions | |
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. |