Task 1

Mice: The key concept of MICE is to use distribution of the observed data to estimate a set of plausible values for missing data. Random components are incorporated into these estimated values to reflect their uncertainty. Multiple datasets are created and then analysed individually but identically to obtain a set of parameter estimates. Finally, estimates are combined to obtain overall estimates, variances and confidence intervals. Although MICE can be implemented under MNAR mechanisms, standard implementations assume MCAR or MAR.

Authors: Stef van Buuren, Karin Groothuis-Oudshoorn
Other contributors: Gerko Vink, Rianne Schouten, Alexander Robitzsch and many more

Expertise of the authors: Stef van Buuren is a renowned statistician and a Professor at Utrecht University in the Netherlands. His area of expertise is specifically in missing data, multiple imputation, and statistical methodology for public health and growth charts. Karina Groothuis-Oudshoorn is a professor at University of Twente, another expert in statistics and imputation.

The mice software was published in the Journal of Statistical Software (Van Buuren and Groothuis-Oudshoorn, 2011). doi:10.18637/jss.v045.i03. The first application of the method concerned missing blood pressure data (Van Buuren et. al., 1999). The term Fully Conditional Specification was introduced in 2006 to describe a general class of methods that specify imputations model for multivariate data as a set of conditional distributions (Van Buuren et. al., 2006). Further details on mixes of variables and applications can be found in the book Flexible Imputation of Missing Data. Second Edition. Chapman & Hall/CRC. Boca Raton, FL.

The mice package has a long and active development history, with its first release on CRAN dating back to 2000. It has undergone numerous version updates to add new features, improve stability, and expand its capabilities. I’m mentioning last 3 major changes here:

Changes in version 3.18.0: Fixed a long-standing issue in the internal augment() function that affected ordered factors
Changes in version 3.17.0: Imputing categorical data by predictive mean matching
Changes in version 3.16.0: Expands futuremice() functionality by allowing for external packages and user-written functions

Task 2

#In the documentation, the 'mice' technique was applied on two R-datasets: nhans and nhans2
library(mice)

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

nhanes

##    age  bmi hyp chl
## 1    1   NA  NA  NA
## 2    2 22.7   1 187
## 3    1   NA   1 187
## 4    3   NA  NA  NA
## 5    1 20.4   1 113
## 6    3   NA  NA 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2   NA  NA  NA
## 11   1   NA  NA  NA
## 12   2   NA  NA  NA
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1  NA
## 16   1   NA  NA  NA
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2  NA
## 21   1   NA  NA  NA
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1  NA
## 25   2 27.4   1 186

View(nhanes)
imp<-mice(nhanes) #Here the mice is applied on this dataset using 'pmm', as 'mice'takes 'pmm' method as by-default method

## 
##  iter imp variable
##   1   1  bmi  hyp  chl
##   1   2  bmi  hyp  chl
##   1   3  bmi  hyp  chl
##   1   4  bmi  hyp  chl
##   1   5  bmi  hyp  chl
##   2   1  bmi  hyp  chl
##   2   2  bmi  hyp  chl
##   2   3  bmi  hyp  chl
##   2   4  bmi  hyp  chl
##   2   5  bmi  hyp  chl
##   3   1  bmi  hyp  chl
##   3   2  bmi  hyp  chl
##   3   3  bmi  hyp  chl
##   3   4  bmi  hyp  chl
##   3   5  bmi  hyp  chl
##   4   1  bmi  hyp  chl
##   4   2  bmi  hyp  chl
##   4   3  bmi  hyp  chl
##   4   4  bmi  hyp  chl
##   4   5  bmi  hyp  chl
##   5   1  bmi  hyp  chl
##   5   2  bmi  hyp  chl
##   5   3  bmi  hyp  chl
##   5   4  bmi  hyp  chl
##   5   5  bmi  hyp  chl

#Here m=5 has been used as R automatically takes m=5 when nothing is mentioned about m
imp

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##   age   bmi   hyp   chl 
##    "" "pmm" "pmm" "pmm" 
## PredictorMatrix:
##     age bmi hyp chl
## age   0   1   1   1
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0

imp$imp$bmi #The imputations for BMI

##       1    2    3    4    5
## 1  33.2 22.0 26.3 22.7 25.5
## 3  30.1 27.2 35.3 28.7 22.0
## 4  25.5 24.9 24.9 27.5 24.9
## 6  21.7 21.7 20.4 24.9 22.5
## 10 25.5 22.7 26.3 27.2 25.5
## 11 25.5 33.2 26.3 30.1 22.0
## 12 30.1 22.5 20.4 26.3 25.5
## 16 28.7 33.2 29.6 27.4 22.0
## 21 25.5 35.3 20.4 27.2 29.6

complete(imp)

##    age  bmi hyp chl
## 1    1 33.2   1 184
## 2    2 22.7   1 187
## 3    1 30.1   1 187
## 4    3 25.5   2 206
## 5    1 20.4   1 113
## 6    3 21.7   2 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2 25.5   2 187
## 11   1 25.5   1 187
## 12   2 30.1   2 206
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1 187
## 16   1 28.7   1 187
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2 284
## 21   1 25.5   1 187
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1 204
## 25   2 27.4   1 186

View(nhanes2)  #This dataset has different types of variables. bmi:Continuous, hyp:Categorical, chl:continuous
# imputation on mixed data with a different method per column
mice(nhanes2, meth = c("sample", "pmm", "logreg", "norm")) #For the first variable age:'simple' method imputes missing values by randomly sampling from the observed values in the same variable.

## 
##  iter imp variable
##   1   1  bmi  hyp  chl
##   1   2  bmi  hyp  chl
##   1   3  bmi  hyp  chl
##   1   4  bmi  hyp  chl
##   1   5  bmi  hyp  chl
##   2   1  bmi  hyp  chl
##   2   2  bmi  hyp  chl
##   2   3  bmi  hyp  chl
##   2   4  bmi  hyp  chl
##   2   5  bmi  hyp  chl
##   3   1  bmi  hyp  chl
##   3   2  bmi  hyp  chl
##   3   3  bmi  hyp  chl
##   3   4  bmi  hyp  chl
##   3   5  bmi  hyp  chl
##   4   1  bmi  hyp  chl
##   4   2  bmi  hyp  chl
##   4   3  bmi  hyp  chl
##   4   4  bmi  hyp  chl
##   4   5  bmi  hyp  chl
##   5   1  bmi  hyp  chl
##   5   2  bmi  hyp  chl
##   5   3  bmi  hyp  chl
##   5   4  bmi  hyp  chl
##   5   5  bmi  hyp  chl

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##      age      bmi      hyp      chl 
##       ""    "pmm" "logreg"   "norm" 
## PredictorMatrix:
##     age bmi hyp chl
## age   0   1   1   1
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0

#bmi:'predictive mean matching' runs a regression to predict the missing value
#hyp: 'logreg' uses logistic regression as hyp is a binary categorical variable
#chl: "norm" uses bayesian linear regression

#plot
#The example does not provide any plot code, I'll plot stripplot to show the imputed and observed data
stripplot(imp, pch = 20, cex = 1.2)

The red points(imputations)fall within the range of the blue points(observed data).There are no obvious outliers or impossible values(no negative bmi or chl), which confirms the methods are working correctly. The spread of the red points in the imputed datasets looks similar to the spread of the blue points in the original data The imputed values are only at the two levels(1 and 2),which is correct.The”logreg”method successfully imputed binary outcomes.

Task 3

library(mice)
set.seed(123) 

#Creating my 'state' and 'districts' columns
states<- c(rep("West Bengal", 10), rep("Kerala", 10)) 
districts<- c("Alipurduar", "Bankura", "Birbhum", "Cooch Behar", "Dakshin Dinajpur",
               "Darjeeling", "Hooghly", "Howrah", "Jalpaiguri", "Kolkata",
               "Alappuzha", "Ernakulam", "Idukki", "Kannur", "Kasaragod",   #Only 10 West Bengal and 10 Kerala   districts have been considered here
               "Kollam", "Kottayam", "Kozhikode", "Palakkad", "Thiruvananthapuram")
indicators_list<- c("Women_Literacy", "Early_Marriage", "College_Education", 
                     "Households_with_improved_sanitation", "Teen_Mothers")  #All these social indicator values   are in %


n<- length(districts)*length(indicators_list)  
project_data_long<- data.frame(State = rep(states, each = length(indicators_list)), #This is a long data, we will make it wider (panel data) before imputation
District = rep(districts, each = length(indicators_list)),  #Creating an empty dataframe with 100 rows as we have 20 districts and for each district 5 indicators
Indicators = rep(indicators_list, times = length(districts)),
Survey_1 = numeric(n),
Survey_2 = numeric(n)
)

for (i in 1:nrow(project_data_long)) {
 state<- project_data_long$State[i]
  ind<- project_data_long$Indicators[i]
  
  if (state == "West Bengal") {
    base_values <- list( Women_Literacy = c(75, 5),  #Here I'm creating base values for WB indicators, I'll create different basevalues for Kerala
      Early_Marriage = c(25, 6), #This (25,7) means I want my early marriage variable to be in the range of 19-31%
      College_Education = c(30, 7),   #This (30,7) means I want college education(%) of women of West Bengal to be in the range of 23% to 37%
      Households_with_improved_sanitation= c(70, 10),
      Teen_Mothers = c(15, 4)
    )
  } else {
 base_values<- list(Women_Literacy= c(92, 3),Early_Marriage= c(8, 3),College_Education= c(45, 5),Households_with_improved_sanitation = c(95, 4),
      Teen_Mothers= c(5, 2)    #Kerala data
    )
  }
  
  mean_val<- base_values[[ind]][1]
  sd_val<- base_values[[ind]][2]
  project_data_long$Survey_1[i]<- rnorm(1, mean = mean_val, sd = sd_val)  #I'll create data using the mean values for both the surveys
  project_data_long$Survey_2[i]<- rnorm(1, mean = mean_val, sd = sd_val)
}


project_data_long$Survey_1[sample(1:n, size = 10)]<- NA  #Here I'm creative missing values completely at random (MCAR) for both survey1 and survey2
project_data_long$Survey_2[sample(1:n, size = 10)]<- NA  #There are several missing value schemes, MAR(Missing at random), MCAR(Missing completely at random), MNAR(Missing not at random)
View(project_data_long)

project_data_long is my synthetic dataset, I cannot impute missing values using mice technique for long data, I’ll have to convert it into a panel data to apply mice.Mice uses regression technique multiple times, for that I need to have columns of indicators,not columns of surveys.

library(tidyr)

## Warning: package 'tidyr' was built under R version 4.5.1

long_data_synth<- pivot_longer(project_data_long, 
                         cols = c("Survey_1", "Survey_2"), 
                         names_to = "Survey", 
                         values_to = "Value")
panel_data_synth<- pivot_wider(long_data_synth, 
                             id_cols = c("State", "District", "Survey"),
                             names_from = "Indicators",
                             values_from = "Value")
panel_data_synth<- panel_data_synth[order(panel_data_synth$State, 
                                     panel_data_synth$District, 
                                     -rank(panel_data_synth$Survey)), ]
View(panel_data_synth)

#I do not want to do imputation on the whole dataset as I think these social indicators are very much dependent on the demographic structure.I want to do the imputation for each state separately. 

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data_list<- split(panel_data_synth, panel_data_synth$State) #I'll create two separate dataframes for the mentioned two states

impute_state_data<- function(state_data) {
  
  id_columns<- state_data %>% select(State, District, Survey)
  numeric_data_to_impute<- state_data %>% select(Women_Literacy, Early_Marriage, College_Education,Households_with_improved_sanitation, Teen_Mothers)
  imp<- mice(numeric_data_to_impute, m = 5, method = 'pmm', printFlag = FALSE)   #I'm using predictive mean matching method
  completed_numeric_data<- complete(imp, 1)  #using first imputed dataset
  final_state_data<- bind_cols(id_columns, completed_numeric_data) #We separated state,districts,survey columns first, now I'll add them again
  return(final_state_data)
}

imputed_list<- lapply(data_list, impute_state_data)
final_imputed_panel_data<- bind_rows(imputed_list) #I have added both the states in one dataframe again

View(final_imputed_panel_data)

One interesting observation: If I had sufficiently large dataset then dplyr’s groupby=‘State’ and do() functions would have worked perfectly for mice. Here, it didn’t work as mice tried to do math on the text columns as well. It perfectly worked for me when I had more than 100 variables.I did split+lapply method here as I have only 5 variables (Indicators).

I’ll plot dot-charts for both the states showing the imputed and observed values of the indicator: Women Literacy(%)

wb_data<- final_imputed_panel_data[final_imputed_panel_data$State== "West Bengal", ]
original_wb<- panel_data_synth[panel_data_synth$State== "West Bengal", ]

point_colors<- ifelse(is.na(original_wb$Women_Literacy),"red","black")

dotchart(wb_data$Women_Literacy,labels = paste(wb_data$District, wb_data$Survey),color = point_colors,pch = 19, main = "West Bengal: Women Literacy (%))",
         xlab = "Percentage",cex = 0.7) 

legend("bottomleft", legend = c("Observed","Imputed"), col= c("black", "red"), pch = 19,cex=0.5) #I have chosen a small cex as it was obstructing the values for the last district

kerala_data<- final_imputed_panel_data[final_imputed_panel_data$State == "Kerala", ]
original_kerala<- panel_data_synth[panel_data_synth$State == "Kerala", ]

point_colors_kerala<- ifelse(is.na(original_kerala$Women_Literacy), "red", "black")

dotchart(kerala_data$Women_Literacy,labels = paste(kerala_data$District, kerala_data$Survey),color = point_colors_kerala,pch = 19,
main = "Kerala: Women Literacy (%))", xlab = "Percentage",cex = 0.7)

legend("bottomright", legend = c("Observed", "Imputed"),  col = c("black", "red"), pch = 19,cex = 0.5)

I can plot dotcharts for all the indicators for both states What else could I do? If I had data for all the districts of West Bengal and Kerala,I could divide the districts of each state into 3 categories based on each indicator using K means where K=3 We could then print the maps of both the states on the basis of the output of K-means using ‘tmap’ package. What can we do with this imputed data? If we know a policy was introduced for Kerala in between the two surveys , we can fit a DID (Difference in difference) model and check how effective the public policy was. We can apply Dr. Jennifer Hill’s mister P (multi-level regression and post-stratification) on this data to estimate state-wide averages of these indicators (there is one package called ‘misterP’ in R)

Assignment_1_MICE_package

Sabuj Ganguly

2025-09-17

Task 1

Task 2

Task 3