Missing value imputation using Linear Regression

Consider the following data:

x <- 1:10
y <- c(11,12,18,14,17, NA,NA,19,NA,27)
z <- sample(1:20, 10) #random draw with no seed set, so your z values will differ from those shown below
w <- c(seq(1,10,3), 3,5,7,6,6,9)

data <- data.frame(x,y,z,w)

data

##     x  y  z  w
## 1   1 11 10  1
## 2   2 12 15  4
## 3   3 18 20  7
## 4   4 14  1 10
## 5   5 17  8  3
## 6   6 NA 16  5
## 7   7 NA  7  7
## 8   8 19 19  6
## 9   9 NA 17  6
## 10 10 27  3  9

summary(data)

##        x               y               z               w        
##  Min.   : 1.00   Min.   :11.00   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.: 3.25   1st Qu.:13.00   1st Qu.: 7.25   1st Qu.: 4.25  
##  Median : 5.50   Median :17.00   Median :12.50   Median : 6.00  
##  Mean   : 5.50   Mean   :16.86   Mean   :11.60   Mean   : 5.80  
##  3rd Qu.: 7.75   3rd Qu.:18.50   3rd Qu.:16.75   3rd Qu.: 7.00  
##  Max.   :10.00   Max.   :27.00   Max.   :20.00   Max.   :10.00  
##                  NA's   :3

The summary shows that the variable y has 3 missing values. One way to estimate them is to predict them from the variable in the dataset that is most strongly correlated with y. To check the degree of association among the variables, we will pass the result of cor() to the symnum() function.

symnum(cor(data, use = "complete.obs")) #use = "complete.obs" keeps only the rows that have no missing values

##   x y z w
## x 1      
## y * 1    
## z     1  
## w . . . 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

The legend explains what the (.)s and (*)s inside the matrix mean: each symbol marks a range of absolute correlation. The * in the cell for x and y indicates that the correlation between x and y lies between 0.9 and 0.95. This suggests that the missing values of y can be predicted reasonably well from the values of x.
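The symbol in the matrix only gives a range, so it can help to look at the number itself. The following is a minimal sketch that recomputes the correlation between x and y from the same values as above; the name r_xy is introduced here purely for illustration.

```r
# Same x and y as in the example data
x <- 1:10
y <- c(11, 12, 18, 14, 17, NA, NA, 19, NA, 27)

# Pearson correlation using only the rows where both x and y are observed
r_xy <- cor(x, y, use = "complete.obs")

# Print the value rounded to three decimals
round(r_xy, 3)
```

The printed value should fall inside the range that the legend assigns to the * symbol.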

Before we go ahead and predict let us do a couple of things:

1. Creating a dummy variable that will indicate missing data:

#To make this task easier, let us write a helper function.
#It takes a vector as its argument and returns a binary vector with 0 where the argument has a missing value and 1 otherwise.

missDummy <- function(t)
{
  x <- numeric(length(t))  # pre-allocate a result vector of the same length
  x[!is.na(t)] <- 1        # observed values are flagged with 1
  x[is.na(t)] <- 0         # missing values are flagged with 0
  return(x)
}

#Now we will use this function to create a dummy variable that is 0 where y is missing and 1 otherwise.

data$dummy <- missDummy(data$y)

#Let us take a look at the data
data

##     x  y  z  w dummy
## 1   1 11 10  1     1
## 2   2 12 15  4     1
## 3   3 18 20  7     1
## 4   4 14  1 10     1
## 5   5 17  8  3     1
## 6   6 NA 16  5     0
## 7   7 NA  7  7     0
## 8   8 19 19  6     1
## 9   9 NA 17  6     0
## 10 10 27  3  9     1
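The same indicator can also be built without a helper function. The sketch below is an alternative, not the approach used in this post: it relies on is.na() being vectorised, and the variable name dummy is only illustrative.

```r
# Same y as in the example data
y <- c(11, 12, 18, 14, 17, NA, NA, 19, NA, 27)

# is.na() returns TRUE for missing entries; negating it and converting
# to integer gives 1 for observed values and 0 for missing ones
dummy <- as.integer(!is.na(y))

dummy
```

The result has a 0 in positions 6, 7 and 9, matching the dummy column shown above.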

2. Next, let us fit a linear model with y as the dependent variable and x as the independent variable.

lm(y ~ x, data)

## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Coefficients:
## (Intercept)            x  
##       9.743        1.509

We get an intercept of 9.743 and a slope value of 1.509.
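Rather than copying the coefficients from the printed output, they can be read from the fitted object. The following is a small sketch under the assumption that the model is stored in an object (called fit here, a name not used elsewhere in this post).

```r
# Same x and y as in the example data
x <- 1:10
y <- c(11, 12, 18, 14, 17, NA, NA, 19, NA, 27)

# With R's default na.action (na.omit), rows with a missing y are dropped before fitting
fit <- lm(y ~ x)

# coef() returns a named vector holding the intercept and the slope
b <- coef(fit)
intercept <- b[["(Intercept)"]]
slope <- b[["x"]]

round(c(intercept = intercept, slope = slope), 3)
```

Keeping the unrounded coefficients avoids the small rounding error that comes from typing 9.743 and 1.509 by hand.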

Finally, let us impute the missing values using the above model.

for(i in 1:nrow(data))
{
  if(data$dummy[i] == 0)
  {
    data$y[i] = 9.743 + 1.509*data$x[i]
  }
}
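The loop above can also be written without hard-coded coefficients. The sketch below is one possible alternative using predict(); it works on a separate copy of the data (named imputed here for illustration) and includes only the x and y columns, since z and w are not needed for the model.

```r
# Separate copy of the example data, restricted to x and y
imputed <- data.frame(
  x = 1:10,
  y = c(11, 12, 18, 14, 17, NA, NA, 19, NA, 27)
)

# Fit the model on the observed rows (rows with NA in y are dropped by default)
fit <- lm(y ~ x, data = imputed)

# Logical index of the rows whose y is missing
is_missing <- is.na(imputed$y)

# Fill only those rows with the model's predictions
imputed$y[is_missing] <- predict(fit, newdata = imputed[is_missing, , drop = FALSE])

imputed
```

Because predict() uses the unrounded coefficients, the filled values may differ from the loop's results in the later decimal places.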

All done. Let us take a final look at the data.

data

##     x      y  z  w dummy
## 1   1 11.000 10  1     1
## 2   2 12.000 15  4     1
## 3   3 18.000 20  7     1
## 4   4 14.000  1 10     1
## 5   5 17.000  8  3     1
## 6   6 18.797 16  5     0
## 7   7 20.306  7  7     0
## 8   8 19.000 19  6     1
## 9   9 23.324 17  6     0
## 10 10 27.000  3  9     1

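If we wish to keep y as whole numbers, as in the original data, we can make the following small change. Note that as.integer() drops the decimal part rather than rounding, so 18.797 becomes 18.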

data$y <- as.integer(data$y)

data

##     x  y  z  w dummy
## 1   1 11 10  1     1
## 2   2 12 15  4     1
## 3   3 18 20  7     1
## 4   4 14  1 10     1
## 5   5 17  8  3     1
## 6   6 18 16  5     0
## 7   7 20  7  7     0
## 8   8 19 19  6     1
## 9   9 23 17  6     0
## 10 10 27  3  9     1
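If the nearest whole number is preferred over truncation, round() can be applied before the conversion. The sketch below compares the two on the three imputed values from above; the vector name imputed_values is only illustrative.

```r
# The three imputed values from the table above
imputed_values <- c(18.797, 20.306, 23.324)

# as.integer() drops the decimal part
truncated <- as.integer(imputed_values)

# round() first moves each value to the nearest whole number
rounded <- as.integer(round(imputed_values))

truncated  # 18 20 23
rounded    # 19 20 23
```

Only the first value differs here, because 18.797 is the only one whose decimal part is above 0.5.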

Gourab Nath

20 January 2016