This is the output document of my seminar R Markdown on dealing with missingness

Missing data can be a non-trivial problem when analyzing a data set, and accounting for it is usually not straightforward either. If the amount of missing data is very small relative to the size of the data set, leaving out the few samples with missing features may be the best strategy in order not to bias the analysis. However, leaving out available data points deprives the analysis of some information, so depending on your situation you may want to look for other fixes before discarding potentially useful data points.
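
To make the cost of leaving out incomplete samples concrete, here is a minimal sketch in base R. The data frame `toy` and its values are made up for illustration and are not the vehicle data used below.

```r
# Hypothetical toy data (not the seminar's vehicle data)
toy <- data.frame(
  id      = 1:8,
  mileage = c(863, NA, 16330, 13, 22537, NA, 34762, 11051),
  cost    = c(66.3, 233.0, NA, 66.6, 328.7, 150.2, 90.1, 120.5)
)

complete_rows <- complete.cases(toy)      # TRUE where a row has no NA
n_dropped     <- sum(!complete_rows)      # rows that deletion would remove
pct_dropped   <- 100 * mean(!complete_rows)
reduced       <- na.omit(toy)             # the data after dropping those rows

n_dropped
pct_dropped
nrow(reduced)
```

If the percentage dropped is small, deletion may be acceptable; if it is large, imputation is worth considering.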

Let's import the data from a CSV file

library(mice)
## Warning: package 'mice' was built under R version 4.2.1
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(VIM)
## Warning: package 'VIM' was built under R version 4.2.1
## Loading required package: colorspace
## Warning: package 'colorspace' was built under R version 4.2.1
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
library(readxl)
## Warning: package 'readxl' was built under R version 4.2.1
misdat <- read.csv("C:/Users/User/Downloads/vehicleMiss.csv")
View(misdat)
str(misdat)
## 'data.frame':    1624 obs. of  7 variables:
##  $ vehicle: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fm     : int  0 10 15 0 13 21 11 5 8 1 ...
##  $ Mileage: int  863 4644 16330 13 22537 40931 34762 11051 7003 11 ...
##  $ lh     : num  1.1 2.4 4.2 1 4.5 3.1 0.7 2.9 3.4 0.7 ...
##  $ lc     : num  66.3 233 325.1 66.6 328.7 ...
##  $ mc     : num  697 120 175 0 175 ...
##  $ State  : chr  "MS" "CA" "WI" "OR" ...
summary(misdat)
##     vehicle             fm            Mileage            lh        
##  Min.   :   1.0   Min.   :-1.000   Min.   :    1   Min.   : 0.000  
##  1st Qu.: 406.8   1st Qu.: 4.000   1st Qu.: 5778   1st Qu.: 1.500  
##  Median : 812.5   Median :10.000   Median :17000   Median : 2.600  
##  Mean   : 812.5   Mean   : 9.414   Mean   :20559   Mean   : 3.294  
##  3rd Qu.:1218.2   3rd Qu.:14.000   3rd Qu.:30061   3rd Qu.: 4.300  
##  Max.   :1624.0   Max.   :23.000   Max.   :99983   Max.   :35.200  
##                                    NA's   :13      NA's   :6       
##        lc               mc            State          
##  Min.   :   0.0   Min.   :   0.0   Length:1624       
##  1st Qu.: 106.5   1st Qu.: 119.7   Class :character  
##  Median : 195.4   Median : 119.7   Mode  :character  
##  Mean   : 242.8   Mean   : 179.4                     
##  3rd Qu.: 317.8   3rd Qu.: 175.5                     
##  Max.   :3234.4   Max.   :3891.1                     
##  NA's   :8

Missing data

Find Location of Missing Values

#is.na(misdat)                   # check every cell for a missing value
which(is.na(misdat))             # positions of the missing values (column-wise linear index)
##  [1]  3267  3268  3501  3502  3503  3504  4109  4110  4111  4816  4817  4818
## [13]  4819  5701  5702  5703  6450  6451  6452  6562  6563  6564  6565  6566
## [25]  6567  6568  6569  9812  9813  9886  9887  9888  9889  9890  9891 10572
## [37] 10573 10574 11092 11342 11343 11344
#is.na(misdat$Mileage)           # check a single column for missing values
which(is.na(misdat$Mileage))     # rows where Mileage is missing
##  [1]   19   20  253  254  255  256  861  862  863 1568 1569 1570 1571
#misdat
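
The positions returned by which(is.na(misdat)) are linear indices that count down the columns. With 1624 rows, 3267 = 2 × 1624 + 19, i.e. row 19 of the third column (Mileage), which agrees with the single-column result. As a sketch on made-up data (the data frame `toy` is hypothetical), `arr.ind = TRUE` returns row and column positions directly:

```r
# Hypothetical toy data with one NA in each column
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "y", NA),
  c = c(NA, 2, 3)
)

lin <- which(is.na(toy))                    # linear (column-major) indices
pos <- which(is.na(toy), arr.ind = TRUE)    # matrix with columns "row" and "col"

# The same conversion done by hand from the linear index
n        <- nrow(toy)
row_hand <- (lin - 1) %% n + 1
col_hand <- (lin - 1) %/% n + 1

# Readable summary: which row of which variable is missing
data.frame(row = pos[, "row"], variable = names(toy)[pos[, "col"]])
```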

Count Total Missing Values

sum(is.na(misdat))               # total number of missing values across all columns
## [1] 42
sum(is.na(misdat$Mileage))       # number of missing values in a single column
## [1] 13
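
The totals above can also be broken down per column and per row in one step. A minimal base R sketch, using a made-up data frame `toy` rather than the vehicle data:

```r
# Hypothetical toy data
toy <- data.frame(
  mileage = c(863, NA, 16330, NA, 22537),
  lh      = c(1.1, 2.4, NA, NA, 4.5),
  state   = c("MS", "CA", NA, "OR", "WI")
)

na_per_col <- colSums(is.na(toy))   # missing values in each column
na_per_row <- rowSums(is.na(toy))   # missing values in each row
row_table  <- table(na_per_row)     # how many rows have 0, 1, 2, ... missing values

na_per_col
row_table
```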

Even when the data are MCAR (missing completely at random), too much missing data can be a problem. A commonly used rule of thumb for large data sets is a maximum of about 5% missing. If more than 5% of the values for a feature or sample are missing, you should probably leave that feature or sample out. We therefore check for features (columns) and samples (rows) where more than 5% of the data is missing, using a simple function.
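
As a sketch of how the 5% rule could be turned into explicit flags for both columns and rows, the following uses a made-up data frame `toy`; the helper name `pct_missing` and the variable names are assumptions for illustration.

```r
# Hypothetical toy data: 40 rows, 3 columns
toy <- data.frame(
  x1 = 1:40,
  x2 = seq(0.5, 20, by = 0.5),
  x3 = rep(c(10, 20), 20)
)
toy$x1[7] <- NA                    # 1 of 40 missing  (2.5%)
toy$x2[c(3, 7, 19, 30)] <- NA      # 4 of 40 missing  (10%)

pct_missing <- function(x) 100 * mean(is.na(x))

col_pct <- sapply(toy, pct_missing)       # percent missing per column
row_pct <- apply(toy, 1, pct_missing)     # percent missing per row

cols_over <- names(col_pct)[col_pct > 5]  # columns above the 5% threshold
rows_over <- unname(which(row_pct > 5))   # rows above the 5% threshold

cols_over
rows_over
```

Note that with only a few columns, a single missing value already puts a row above 5% (one of seven variables is about 14%), so the row-wise check effectively flags every incomplete row in a narrow data set.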

Using mice to look at the missing data pattern

The mice package provides a nice function, md.pattern(), to get a better understanding of the pattern of missing data

p <- function(x) {sum(is.na(x))/length(x)*100}   # Look for missing > 5% variables
apply(misdat, 2, p)                              # Check each column
##   vehicle        fm   Mileage        lh        lc        mc     State 
## 0.0000000 0.0000000 0.8004926 0.3694581 0.4926108 0.0000000 0.9236453
#apply(misdat, 1, p)                              # Check each row
md.pattern(misdat)                               # Missing data pattern

##      vehicle fm mc lh lc Mileage State   
## 1586       1  1  1  1  1       1     1  0
## 11         1  1  1  1  1       1     0  1
## 13         1  1  1  1  1       0     1  1
## 6          1  1  1  1  0       1     1  1
## 2          1  1  1  1  0       1     0  2
## 4          1  1  1  0  1       1     1  1
## 2          1  1  1  0  1       1     0  2
##            0  0  0  6  8      13    15 42
md.pairs(misdat)
## $rr
##         vehicle   fm Mileage   lh   lc   mc State
## vehicle    1624 1624    1611 1618 1616 1624  1609
## fm         1624 1624    1611 1618 1616 1624  1609
## Mileage    1611 1611    1611 1605 1603 1611  1596
## lh         1618 1618    1605 1618 1610 1618  1605
## lc         1616 1616    1603 1610 1616 1616  1603
## mc         1624 1624    1611 1618 1616 1624  1609
## State      1609 1609    1596 1605 1603 1609  1609
## 
## $rm
##         vehicle fm Mileage lh lc mc State
## vehicle       0  0      13  6  8  0    15
## fm            0  0      13  6  8  0    15
## Mileage       0  0       0  6  8  0    15
## lh            0  0      13  0  8  0    13
## lc            0  0      13  6  0  0    13
## mc            0  0      13  6  8  0    15
## State         0  0      13  4  6  0     0
## 
## $mr
##         vehicle fm Mileage lh lc mc State
## vehicle       0  0       0  0  0  0     0
## fm            0  0       0  0  0  0     0
## Mileage      13 13       0 13 13 13    13
## lh            6  6       6  0  6  6     4
## lc            8  8       8  8  0  8     6
## mc            0  0       0  0  0  0     0
## State        15 15      15 13 13 15     0
## 
## $mm
##         vehicle fm Mileage lh lc mc State
## vehicle       0  0       0  0  0  0     0
## fm            0  0       0  0  0  0     0
## Mileage       0  0      13  0  0  0     0
## lh            0  0       0  6  0  0     2
## lc            0  0       0  0  8  0     2
## mc            0  0       0  0  0  0     0
## State         0  0       0  2  2  0    15
marginplot(misdat[,c('Mileage', 'lc')])

library(VIM)

Plot of missing data pattern

aggr_plot <- aggr(misdat, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(misdat), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable       Count
##     State 0.009236453
##   Mileage 0.008004926
##        lc 0.004926108
##        lh 0.003694581
##   vehicle 0.000000000
##        fm 0.000000000
##        mc 0.000000000

Margin plot (scatter plot with box plots in the margins)

marginplot(misdat[c(1,2)])

A couple of notes on the parameters:

m=3 refers to the number of imputed data sets (five is the default value). maxit=5 is the number of iterations, which is also the default. meth='pmm' refers to the imputation method. In this case we are using predictive mean matching as the imputation method. Other imputation methods can be used; type methods(mice) for a list of the available imputation methods.
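
To give an idea of what predictive mean matching does, here is a simplified base R sketch for a single variable with one predictor. It is an illustration only and not the mice implementation: in particular, mice also draws the regression parameters at random, which this sketch omits. The function name `pmm_sketch`, the donor count `k`, and the simulated data are assumptions.

```r
# Simplified predictive mean matching (illustration only, not mice's code)
pmm_sketch <- function(y, x, k = 5) {
  miss <- is.na(y)
  dat  <- data.frame(y = y, x = x)
  fit  <- lm(y ~ x, data = dat[!miss, ])   # regression on the observed cases
  pred <- predict(fit, newdata = dat)      # predicted mean for every case
  obs_pred <- pred[!miss]
  obs_y    <- y[!miss]
  for (i in which(miss)) {
    # the k observed cases whose predicted mean is closest to this case
    donors <- order(abs(obs_pred - pred[i]))[seq_len(min(k, length(obs_y)))]
    # pick one donor at random and borrow its observed value
    y[i] <- obs_y[donors[sample.int(length(donors), 1)]]
  }
  y
}

set.seed(123)
x <- runif(50, 0, 100)
y <- 2 * x + rnorm(50, sd = 5)
missing_rows <- c(4, 18, 33)
y[missing_rows] <- NA

y_imp <- pmm_sketch(y, x)
y_imp[missing_rows]
```

Because each imputed value is copied from an observed case, the imputations always stay within the range of values actually seen in the data.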

impute <- mice(misdat,m=3,maxit=5,meth='pmm',seed=123)
## 
##  iter imp variable
##   1   1  Mileage  lh  lc
##   1   2  Mileage  lh  lc
##   1   3  Mileage  lh  lc
##   2   1  Mileage  lh  lc
##   2   2  Mileage  lh  lc
##   2   3  Mileage  lh  lc
##   3   1  Mileage  lh  lc
##   3   2  Mileage  lh  lc
##   3   3  Mileage  lh  lc
##   4   1  Mileage  lh  lc
##   4   2  Mileage  lh  lc
##   4   3  Mileage  lh  lc
##   5   1  Mileage  lh  lc
##   5   2  Mileage  lh  lc
##   5   3  Mileage  lh  lc
## Warning: Number of logged events: 1
print(impute)
## Class: mids
## Number of multiple imputations:  3 
## Imputation methods:
## vehicle      fm Mileage      lh      lc      mc   State 
##      ""      ""   "pmm"   "pmm"   "pmm"      ""      "" 
## PredictorMatrix:
##         vehicle fm Mileage lh lc mc State
## vehicle       0  1       1  1  1  1     0
## fm            1  0       1  1  1  1     0
## Mileage       1  1       0  1  1  1     0
## lh            1  1       1  0  1  1     0
## lc            1  1       1  1  0  1     0
## mc            1  1       1  1  1  0     0
## Number of logged events:  1 
##   it im dep     meth   out
## 1  0  0     constant State
summary(impute)
## Class: mids
## Number of multiple imputations:  3 
## Imputation methods:
## vehicle      fm Mileage      lh      lc      mc   State 
##      ""      ""   "pmm"   "pmm"   "pmm"      ""      "" 
## PredictorMatrix:
##         vehicle fm Mileage lh lc mc State
## vehicle       0  1       1  1  1  1     0
## fm            1  0       1  1  1  1     0
## Mileage       1  1       0  1  1  1     0
## lh            1  1       1  0  1  1     0
## lc            1  1       1  1  0  1     0
## mc            1  1       1  1  1  0     0
## Number of logged events:  1 
##   it im dep     meth   out
## 1  0  0     constant State
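
The logged event above lists State as "constant", and the method for State is empty although 15 of its values are missing, so State is neither imputed nor used as a predictor. This is most likely because State is stored as a character column. One possible remedy, shown here as a sketch on made-up data (the data frame `toy` and the column `region` are hypothetical), is to convert the column to a factor before calling mice(), in which case an unordered factor with more than two levels gets the default method "polyreg". With as many levels as there are states, that model may be slow or unstable, so treat this as an option to evaluate rather than a recommendation.

```r
library(mice)

# Hypothetical toy data with a categorical column containing missing values
set.seed(1)
toy <- data.frame(
  x      = rnorm(60),
  y      = rnorm(60),
  region = sample(c("A", "B", "C"), 60, replace = TRUE),
  stringsAsFactors = FALSE
)
toy$region[c(3, 17, 42)] <- NA
toy$x[c(5, 20)] <- NA

toy$region <- factor(toy$region)   # convert character to factor before imputing

imp <- mice(toy, m = 2, maxit = 2, seed = 1, printFlag = FALSE)
imp$method                         # method chosen for each column
filled <- mice::complete(imp, 1)
anyNA(filled)
```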

If you would like to check the imputed data, for instance for the variable Mileage, you need to enter the following line of code

impute$imp$Mileage
##          1     2     3
## 19   17817 20635 23179
## 20   30595  4952 16408
## 253   1336   713    45
## 254  14422 15204  7289
## 255  32750 45273 33011
## 256  25306 26553 24348
## 861   7219 18584 17334
## 862   1923  7160  1048
## 863  16093 20370 25622
## 1568 55273 55273 14231
## 1569 26646 14660 45927
## 1570     4    26    11
## 1571 30718 13923  3948

The output shows the imputed values for each observation (row numbers in the first column) within each imputed data set (numbered in the first row).

If you need to check the imputation method used for each variable, mice makes it very easy to do

impute$method            # What imputation method did we use?
## vehicle      fm Mileage      lh      lc      mc   State 
##      ""      ""   "pmm"   "pmm"   "pmm"      ""      ""
methods(mice)           # Possible imputation models provided by mice() are
## Warning in .S3methods(generic.function, class, envir): function 'mice' appears
## not to be S3 generic; found functions that look like S3 methods
##  [1] mice.impute.2l.bin              mice.impute.2l.lmer            
##  [3] mice.impute.2l.norm             mice.impute.2l.pan             
##  [5] mice.impute.2lonly.mean         mice.impute.2lonly.norm        
##  [7] mice.impute.2lonly.pmm          mice.impute.cart               
##  [9] mice.impute.jomoImpute          mice.impute.lasso.logreg       
## [11] mice.impute.lasso.norm          mice.impute.lasso.select.logreg
## [13] mice.impute.lasso.select.norm   mice.impute.lda                
## [15] mice.impute.logreg              mice.impute.logreg.boot        
## [17] mice.impute.mean                mice.impute.midastouch         
## [19] mice.impute.mnar.logreg         mice.impute.mnar.norm          
## [21] mice.impute.norm                mice.impute.norm.boot          
## [23] mice.impute.norm.nob            mice.impute.norm.predict       
## [25] mice.impute.panImpute           mice.impute.passive            
## [27] mice.impute.pmm                 mice.impute.polr               
## [29] mice.impute.polyreg             mice.impute.quadratic          
## [31] mice.impute.rf                  mice.impute.ri                 
## [33] mice.impute.sample              mice.mids                      
## [35] mice.theme                     
## see '?methods' for accessing help and source code

Now we can get back the completed data set using the complete() function. It is almost plain English:

completedData <- complete(impute,1)    # Get completed datasets (observed and imputed)

The missing values have been replaced with the imputed values in the first of the three data sets. If you wish to use another one, just change the second parameter in the complete() function.
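
As a sketch of the other ways to call complete(), the example below uses the small nhanes data set that ships with mice instead of the vehicle data (an assumption made so the example is self-contained). Passing a different number selects another imputed data set, and `action = "long"` stacks all of them with `.imp` and `.id` columns identifying the imputation and the row.

```r
library(mice)

# nhanes is a small example data set included in the mice package
imp <- mice(nhanes, m = 3, maxit = 2, seed = 123, printFlag = FALSE)

second <- mice::complete(imp, 2)                 # the second imputed data set
long   <- mice::complete(imp, action = "long")   # all imputed data sets stacked

dim(second)
table(long$.imp)                                 # rows per imputation
```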

Inspecting the distribution of original and imputed data

Let’s compare the distributions of original and imputed data using some useful plots. First of all we can use a scatter plot and plot vehicle against all the other variables

xyplot(impute, vehicle ~ fm+Mileage+lh+lc+mc+State,pch=18,cex=1)   #Scatter plot vehicle vs all
## Warning in order(as.numeric(x)): NAs introduced by coercion
## Warning in split(as.numeric(x)[ord], gg[ord]): NAs introduced by coercion
## Warning in panel.superpose(x, y, type = type, groups = groups, pch = pch, : NAs
## introduced by coercion

What we would like to see is that the shape of the magenta points (imputed) matches the shape of the blue ones (observed). The matching shape tells us that the imputed values are indeed “plausible values”. Another helpful plot is the density plot:

densityplot(impute)    # Density plot original vs imputed data set
## Warning in (function (x, darg, groups = NULL, weights = NULL, subscripts =
## TRUE, : NAs introduced by coercion
## Warning in panel.superpose(x, darg = darg, plot.points = plot.points, ref =
## FALSE, : NAs introduced by coercion

The density of the imputed data for each imputed data set is shown in magenta, while the density of the observed data is shown in blue. Again, under our previous assumptions we expect the distributions to be similar.

Another useful visual take on the distributions can be obtained using the stripplot() function that shows the distributions of the variables as individual points

stripplot(impute, pch = 20, cex = 1.2)   # Another take on the density: stripplot()

Pooling

Suppose that the next step in our analysis is to fit a linear model to the data. You may ask which imputed data set to choose. The mice package again makes it very easy to fit a model to each of the imputed data sets and then pool the results together

modelFit1 <- with(impute,lm(vehicle ~ fm+Mileage+lh+lc+mc))
summary(pool(modelFit1))    # Pooling the results and fitting a linear model
##          term      estimate    std.error  statistic       df     p.value
## 1 (Intercept)  8.394778e+02 2.442800e+01 34.3653939 1615.124 0.000000000
## 2          fm -5.397482e+00 2.613242e+00 -2.0654357 1603.673 0.039042151
## 3     Mileage  1.165737e-04 8.932166e-04  0.1305101 1581.587 0.896179496
## 4          lh -4.298063e+01 1.870605e+01 -2.2976861 1345.373 0.021732526
## 5          lc  6.915074e-01 2.442459e-01  2.8311930 1352.017 0.004706155
## 6          mc -2.458190e-02 6.075833e-02 -0.4045848 1615.829 0.685836256

The variable modelFit1 contains the results of the fitting performed over the imputed data sets, while the pool() function pools them all together.
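
The pooling step follows Rubin's rules: the pooled estimate is the mean of the per-data-set estimates, and its variance combines the average within-imputation variance with the between-imputation variance. A minimal base R sketch for a single coefficient is given below; the function name `pool_rubin` and the input numbers are made up for illustration, and the degrees-of-freedom calculation that pool() also performs is left out.

```r
# Rubin's rules for one coefficient (sketch; no degrees of freedom)
pool_rubin <- function(est, se) {
  m     <- length(est)
  q_bar <- mean(est)                 # pooled point estimate
  w     <- mean(se^2)                # within-imputation variance
  b     <- var(est)                  # between-imputation variance
  t_var <- w + (1 + 1 / m) * b       # total variance
  c(estimate = q_bar, std.error = sqrt(t_var))
}

# Hypothetical estimates and standard errors from m = 3 imputed data sets
est <- c(1.0, 1.2, 0.8)
se  <- c(0.5, 0.5, 0.5)

pooled <- pool_rubin(est, se)
pooled
```

The pooled standard error is larger than any single standard error whenever the estimates differ across imputed data sets, which is how the uncertainty due to the missing values enters the final result.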

Remember that we initialized the mice function with a specific seed, therefore the results are somewhat dependent on our initial choice. To reduce this effect, we can impute a larger number of data sets by raising the m parameter in the mice() function (the default is m=5; we used m=3 above and use m=6 here) as follows

impute2 <- mice(misdat,m=6,seed=245)   # Using more imputed data sets
## 
##  iter imp variable
##   1   1  Mileage  lh  lc
##   1   2  Mileage  lh  lc
##   1   3  Mileage  lh  lc
##   1   4  Mileage  lh  lc
##   1   5  Mileage  lh  lc
##   1   6  Mileage  lh  lc
##   2   1  Mileage  lh  lc
##   2   2  Mileage  lh  lc
##   2   3  Mileage  lh  lc
##   2   4  Mileage  lh  lc
##   2   5  Mileage  lh  lc
##   2   6  Mileage  lh  lc
##   3   1  Mileage  lh  lc
##   3   2  Mileage  lh  lc
##   3   3  Mileage  lh  lc
##   3   4  Mileage  lh  lc
##   3   5  Mileage  lh  lc
##   3   6  Mileage  lh  lc
##   4   1  Mileage  lh  lc
##   4   2  Mileage  lh  lc
##   4   3  Mileage  lh  lc
##   4   4  Mileage  lh  lc
##   4   5  Mileage  lh  lc
##   4   6  Mileage  lh  lc
##   5   1  Mileage  lh  lc
##   5   2  Mileage  lh  lc
##   5   3  Mileage  lh  lc
##   5   4  Mileage  lh  lc
##   5   5  Mileage  lh  lc
##   5   6  Mileage  lh  lc
## Warning: Number of logged events: 1
modelFit2 <- with(impute2,lm(vehicle ~ fm+Mileage+lh+lc+mc))
summary(pool(modelFit2))    # Pooling the results of the second linear model
##          term      estimate    std.error   statistic       df     p.value
## 1 (Intercept)  8.392989e+02 2.442832e+01 34.35761303 1615.705 0.000000000
## 2          fm -5.093714e+00 2.619779e+00 -1.94432974 1574.076 0.052033642
## 3     Mileage -2.855875e-05 8.972616e-04 -0.03182879 1485.501 0.974612864
## 4          lh -4.054530e+01 1.877205e+01 -2.15987577 1411.785 0.030950177
## 5          lc  6.588474e-01 2.450062e-01  2.68910507 1411.107 0.007248773
## 6          mc -2.384689e-02 6.080655e-02 -0.39217632 1615.741 0.694979662

Thank you