Logistic regression with incomplete covariate data in complex survey sampling

5203.0: Wednesday, October 24, 2001 - 3:24 PM

Abstract #20295

Charity G Moore, MSPH, PhD¹, Stuart R Lipsitz, ScD², Cheryl L Addy, PhD³, James R Hussey, PhD³, and Donald G Edwards⁴. (1) Departments of Preventive Medicine and Neurological Sciences, Section of Biostatistics, Rush-Presbyterian-St. Luke's Medical Center, 1725 West Harrison, Suite 755, Chicago, IL 60612, 312-563-2381, cgmoore@pvm.rpslmc.edu, (2) Department of Biometry and Epidemiology, Medical University of South Carolina, (3) Department of Epidemiology and Biostatistics, University of South Carolina, (4) Department of Statistics, University of South Carolina

Many epidemiological studies use complex survey sampling to maximize information about a population while minimizing the cost of the sample. Analyses must account for the study design (stratification, clustering) to estimate parameters correctly. Item non-response occurs when a sampled individual does not have complete information on all items. The missing data mechanism can be missing completely at random (MCAR), missing at random (MAR), or non-ignorable (NI). This study focuses on situations where the missingness is MAR, that is, where the probability of having incomplete data for a variable depends on the outcome and/or the completely observed covariate(s). Simulation studies investigated the performance of four methods by comparing bias, coverage probabilities, and calculated versus empirical variances for a logistic regression model with binary outcome Y in stratified random sampling. Of the two covariates X and Z, only Z is subject to missingness. Multiple imputation (MI), re-weighted estimating equations (RWEE), and the Expectation-Maximization (EM) algorithm were compared to complete case (CC) analysis. MI performed better than CC analysis when estimating the coefficient for X, but not when estimating the coefficient for the incomplete covariate Z. The RWEE method showed good results compared to CC analysis when the missing data mechanism was correctly specified. The EM algorithm performed better than the other methods in terms of bias and coverage probabilities. It was also stable across varying levels of association between X and Z, and between Y and Z conditional on X.
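
To make the MAR setting concrete, the following is a minimal Python sketch, not the authors' simulation design. It generates a binary outcome Y with binary covariates X and Z, deletes Z with a probability that depends only on Y and X, and compares complete case (CC) analysis with a simple inverse-probability-weighted fit in the spirit of RWEE. All numeric values (sample size, coefficients, observation probabilities) and the function names `fit_logistic` and `compare` are illustrative assumptions. The sketch treats the data as a simple random sample and uses the true observation probabilities as weights; in a stratified design the survey weights would also enter the fit, and in practice the observation probabilities would have to be estimated.

```python
import numpy as np


def fit_logistic(design, y, weights=None, n_iter=25):
    """Weighted logistic regression fitted by Newton-Raphson.

    `design` must already contain an intercept column.
    """
    if weights is None:
        weights = np.ones(len(y))
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        score = design.T @ (weights * (y - p))
        info = (design * (weights * p * (1.0 - p))[:, None]).T @ design
        beta = beta + np.linalg.solve(info, score)
    return beta


def compare(n=20000, seed=0):
    """Simulate MAR missingness in Z and return (cc, ipw, true_beta)."""
    rng = np.random.default_rng(seed)

    # Covariates: X is always observed, Z is associated with X.
    x = rng.binomial(1, 0.5, n)
    z = rng.binomial(1, 0.3 + 0.3 * x)

    # Outcome model: logit P(Y=1) = -1 + 0.5*X + 1.0*Z (illustrative values).
    true_beta = np.array([-1.0, 0.5, 1.0])
    design = np.column_stack([np.ones(n), x, z])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-design @ true_beta)))

    # MAR mechanism: the chance that Z is observed depends on Y and X only.
    prob_observed = np.where((y == 1) & (x == 1), 0.3, 0.9)
    observed = rng.random(n) < prob_observed

    # Complete case analysis: drop every record with missing Z.
    cc = fit_logistic(design[observed], y[observed])

    # Re-weighted fit: complete cases weighted by 1 / P(Z observed | Y, X).
    ipw = fit_logistic(
        design[observed], y[observed], weights=1.0 / prob_observed[observed]
    )
    return cc, ipw, true_beta


if __name__ == "__main__":
    cc, ipw, true_beta = compare()
    print("true coefficients:", true_beta)
    print("complete case:    ", np.round(cc, 2))
    print("re-weighted:      ", np.round(ipw, 2))
```

Under this particular mechanism the CC estimate of the X coefficient is biased, while the re-weighted fit recovers it, because the weights match the true missingness model. This illustrates why correct specification of the missing data mechanism matters for re-weighting. The sketch does not implement MI or the EM algorithm.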

Learning Objectives: At the conclusion of the session, the participant will be able to: 1) Recognize different types of missing data that arise in public health and medical practice; 2) Recognize different ways to handle missing covariate data in logistic regression in complex survey sampling; 3) Evaluate the performance of methods used to handle missing data problems when analyzing data from stratified random sampling.

Keywords: Biostatistics, Statistics

Presenting author's disclosure statement:
Organization/institution whose products or services will be discussed: None
I do not have any significant financial interest/arrangement or affiliation with any organization/institution whose products or services are being discussed in this session.

The 129th Annual Meeting of APHA