Seminar R Markdown output: dealing with missingness
Missing data can be a non-trivial problem when analyzing a data set, and accounting for it is usually not straightforward either. If the amount of missing data is very small relative to the size of the data set, then leaving out the few samples with missing features may be the best strategy in order not to bias the analysis. However, leaving out available data points deprives the data of some information, and depending on the situation you may want to look for other fixes before wiping out potentially useful data points from your data set.
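As a minimal sketch of the "leave out the incomplete samples" strategy, the snippet below applies listwise deletion with base R's complete.cases(). The data frame `toy` and the names `keep`, `toy_complete` and `pct_dropped` are made up for illustration and are not part of the vehicle data.

```r
# Hypothetical toy data, used only for illustration (not the vehicle data)
toy <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(10, NA, 30, 40, 50),
  g = c("a", "b", "c", NA, "e")
)

# TRUE for rows that have no missing value in any column
keep <- complete.cases(toy)

# Listwise deletion: keep only the complete rows
toy_complete <- toy[keep, ]

# Share of rows lost by listwise deletion, in percent
pct_dropped <- mean(!keep) * 100

print(toy_complete)
print(pct_dropped)
```

Checking the share of dropped rows first is a quick way to judge whether listwise deletion is affordable before trying other fixes.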
Let's import the data from a CSV file
library(mice)
## Warning: package 'mice' was built under R version 4.2.1
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(VIM)
## Warning: package 'VIM' was built under R version 4.2.1
## Loading required package: colorspace
## Warning: package 'colorspace' was built under R version 4.2.1
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
library(readxl)
## Warning: package 'readxl' was built under R version 4.2.1
misdat <- read.csv("C:/Users/User/Downloads/vehicleMiss.csv")
View(misdat)
str(misdat)
## 'data.frame': 1624 obs. of 7 variables:
## $ vehicle: int 1 2 3 4 5 6 7 8 9 10 ...
## $ fm : int 0 10 15 0 13 21 11 5 8 1 ...
## $ Mileage: int 863 4644 16330 13 22537 40931 34762 11051 7003 11 ...
## $ lh : num 1.1 2.4 4.2 1 4.5 3.1 0.7 2.9 3.4 0.7 ...
## $ lc : num 66.3 233 325.1 66.6 328.7 ...
## $ mc : num 697 120 175 0 175 ...
## $ State : chr "MS" "CA" "WI" "OR" ...
summary(misdat)
## vehicle fm Mileage lh
## Min. : 1.0 Min. :-1.000 Min. : 1 Min. : 0.000
## 1st Qu.: 406.8 1st Qu.: 4.000 1st Qu.: 5778 1st Qu.: 1.500
## Median : 812.5 Median :10.000 Median :17000 Median : 2.600
## Mean : 812.5 Mean : 9.414 Mean :20559 Mean : 3.294
## 3rd Qu.:1218.2 3rd Qu.:14.000 3rd Qu.:30061 3rd Qu.: 4.300
## Max. :1624.0 Max. :23.000 Max. :99983 Max. :35.200
## NA's :13 NA's :6
## lc mc State
## Min. : 0.0 Min. : 0.0 Length:1624
## 1st Qu.: 106.5 1st Qu.: 119.7 Class :character
## Median : 195.4 Median : 119.7 Mode :character
## Mean : 242.8 Mean : 179.4
## 3rd Qu.: 317.8 3rd Qu.: 175.5
## Max. :3234.4 Max. :3891.1
## NA's :8
Missing data
Find Location of Missing Values
#is.na(misdat) # logical matrix: TRUE where a value is missing
which(is.na(misdat)) # linear (column-wise) positions of the missing values
## [1] 3267 3268 3501 3502 3503 3504 4109 4110 4111 4816 4817 4818
## [13] 4819 5701 5702 5703 6450 6451 6452 6562 6563 6564 6565 6566
## [25] 6567 6568 6569 9812 9813 9886 9887 9888 9889 9890 9891 10572
## [37] 10573 10574 11092 11342 11343 11344
#is.na(misdat$Mileage) # check for missing values in a single column
which(is.na(misdat$Mileage)) # row numbers of the missing Mileage values
## [1] 19 20 253 254 255 256 861 862 863 1568 1569 1570 1571
#misdat
Count Total Missing Values
sum(is.na(misdat)) #count total number of missing in all column
## [1] 42
sum(is.na(misdat$Mileage)) #count total number of missing value in a single column
## [1] 13
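The positions printed by which(is.na(misdat)) are linear indices that count down the columns, so 3267 is row 19 of the third column (3267 - 2 * 1624 = 19), which matches the first missing Mileage row. A hedged sketch of two friendlier views, using a made-up data frame `toy` rather than the vehicle data: colSums() gives the count per column, and which(..., arr.ind = TRUE) gives row and column numbers directly.

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(
  id = 1:4,
  mileage = c(100, NA, 300, NA),
  state = c("CA", "WI", NA, "OR")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Row and column number of every missing cell
# (without arr.ind = TRUE, which() returns linear, column-wise indices)
na_positions <- which(is.na(toy), arr.ind = TRUE)

print(na_per_column)
print(na_positions)
```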
Even assuming the data are MCAR (missing completely at random), too much missing data can be a problem. A commonly quoted rule of thumb for large data sets is a maximum of 5% missing. If more than 5% of the values of a certain feature or sample are missing, then you should probably leave that feature or sample out. We therefore check for features (columns) and samples (rows) where more than 5% of the data is missing, using a simple function.
Using mice for looking at missing data pattern
The mice package also provides a nice function, md.pattern(), to get a better understanding of the pattern of missing data. The code below first computes the percentage of missing values per column and then calls md.pattern().
p <- function(x) {sum(is.na(x))/length(x)*100} # Look for missing > 5% variables
apply(misdat, 2, p) # Check each column
## vehicle fm Mileage lh lc mc State
## 0.0000000 0.0000000 0.8004926 0.3694581 0.4926108 0.0000000 0.9236453
#apply(misdat, 1, p) # Check each row
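A small sketch of how the 5% rule could be turned into an explicit flag for columns and rows. The data frame `toy`, the helper `pct_missing` and the variable `threshold` are illustrative names, not part of the original analysis, and the toy data are deliberately tiny so that a single missing value already exceeds the threshold.

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(
  a = c(1, NA, 3, 4, 5, 6, 7, 8, 9, 10),
  b = c(NA, NA, 3, 4, 5, 6, 7, 8, 9, 10),
  c = 1:10
)

# Percentage of missing values in a vector
pct_missing <- function(x) mean(is.na(x)) * 100

threshold <- 5  # rule-of-thumb cut-off, in percent

col_pct <- vapply(toy, pct_missing, numeric(1))  # one value per column
row_pct <- apply(toy, 1, pct_missing)            # one value per row

cols_over <- names(col_pct)[col_pct > threshold]  # columns above the cut-off
rows_over <- which(row_pct > threshold)           # row numbers above the cut-off

print(cols_over)
print(rows_over)
```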
md.pattern(misdat) # Missing data pattern
## vehicle fm mc lh lc Mileage State
## 1586 1 1 1 1 1 1 1 0
## 11 1 1 1 1 1 1 0 1
## 13 1 1 1 1 1 0 1 1
## 6 1 1 1 1 0 1 1 1
## 2 1 1 1 1 0 1 0 2
## 4 1 1 1 0 1 1 1 1
## 2 1 1 1 0 1 1 0 2
## 0 0 0 6 8 13 15 42
md.pairs(misdat)
## $rr
## vehicle fm Mileage lh lc mc State
## vehicle 1624 1624 1611 1618 1616 1624 1609
## fm 1624 1624 1611 1618 1616 1624 1609
## Mileage 1611 1611 1611 1605 1603 1611 1596
## lh 1618 1618 1605 1618 1610 1618 1605
## lc 1616 1616 1603 1610 1616 1616 1603
## mc 1624 1624 1611 1618 1616 1624 1609
## State 1609 1609 1596 1605 1603 1609 1609
##
## $rm
## vehicle fm Mileage lh lc mc State
## vehicle 0 0 13 6 8 0 15
## fm 0 0 13 6 8 0 15
## Mileage 0 0 0 6 8 0 15
## lh 0 0 13 0 8 0 13
## lc 0 0 13 6 0 0 13
## mc 0 0 13 6 8 0 15
## State 0 0 13 4 6 0 0
##
## $mr
## vehicle fm Mileage lh lc mc State
## vehicle 0 0 0 0 0 0 0
## fm 0 0 0 0 0 0 0
## Mileage 13 13 0 13 13 13 13
## lh 6 6 6 0 6 6 4
## lc 8 8 8 8 0 8 6
## mc 0 0 0 0 0 0 0
## State 15 15 15 13 13 15 0
##
## $mm
## vehicle fm Mileage lh lc mc State
## vehicle 0 0 0 0 0 0 0
## fm 0 0 0 0 0 0 0
## Mileage 0 0 13 0 0 0 0
## lh 0 0 0 6 0 0 2
## lc 0 0 0 0 8 0 2
## mc 0 0 0 0 0 0 0
## State 0 0 0 2 2 0 15
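To read the four md.pairs() matrices: for a row variable and a column variable, rr counts the rows where both are observed, rm where the row variable is observed and the column variable is missing, mr the reverse, and mm where both are missing. For example, the 13 in row vehicle, column Mileage of $rm says that vehicle is observed and Mileage is missing in 13 rows. The sketch below reproduces the same four counts with plain matrix products on a made-up data frame `toy`; it is an illustration of the idea, not a copy of the mice source.

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(
  x = c(1, NA, 3, 4),
  y = c(NA, NA, 3, 4),
  z = c(1, 2, 3, 4)
)

r <- (!is.na(toy)) + 0  # response indicator: 1 = observed, 0 = missing
m <- 1 - r              # missingness indicator: 1 = missing, 0 = observed

rr <- t(r) %*% r  # both variables observed
rm <- t(r) %*% m  # row variable observed, column variable missing
mr <- t(m) %*% r  # row variable missing, column variable observed
mm <- t(m) %*% m  # both variables missing

print(rr)
print(mm)
```

For every pair of variables the four counts add up to the number of rows, which is a handy sanity check.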
marginplot(misdat[,c('Mileage', 'lc')])
library(VIM)
aggr_plot <- aggr(misdat, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(misdat), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## State 0.009236453
## Mileage 0.008004926
## lc 0.004926108
## lh 0.003694581
## vehicle 0.000000000
## fm 0.000000000
## mc 0.000000000
marginplot(misdat[c(1,2)])
A couple of notes on the parameters:
m=3 sets the number of imputed data sets (the default is 5), and maxit=5 sets the number of iterations. meth='pmm' sets the imputation method; in this case we are using predictive mean matching. Other imputation methods can be used; type methods(mice) for a list of the available imputation methods.
impute <- mice(misdat,m=3,maxit=5,meth='pmm',seed=123)
##
## iter imp variable
## 1 1 Mileage lh lc
## 1 2 Mileage lh lc
## 1 3 Mileage lh lc
## 2 1 Mileage lh lc
## 2 2 Mileage lh lc
## 2 3 Mileage lh lc
## 3 1 Mileage lh lc
## 3 2 Mileage lh lc
## 3 3 Mileage lh lc
## 4 1 Mileage lh lc
## 4 2 Mileage lh lc
## 4 3 Mileage lh lc
## 5 1 Mileage lh lc
## 5 2 Mileage lh lc
## 5 3 Mileage lh lc
## Warning: Number of logged events: 1
print(impute)
## Class: mids
## Number of multiple imputations: 3
## Imputation methods:
## vehicle fm Mileage lh lc mc State
## "" "" "pmm" "pmm" "pmm" "" ""
## PredictorMatrix:
## vehicle fm Mileage lh lc mc State
## vehicle 0 1 1 1 1 1 0
## fm 1 0 1 1 1 1 0
## Mileage 1 1 0 1 1 1 0
## lh 1 1 1 0 1 1 0
## lc 1 1 1 1 0 1 0
## mc 1 1 1 1 1 0 0
## Number of logged events: 1
## it im dep meth out
## 1 0 0 constant State
summary(impute)
## Class: mids
## Number of multiple imputations: 3
## Imputation methods:
## vehicle fm Mileage lh lc mc State
## "" "" "pmm" "pmm" "pmm" "" ""
## PredictorMatrix:
## vehicle fm Mileage lh lc mc State
## vehicle 0 1 1 1 1 1 0
## fm 1 0 1 1 1 1 0
## Mileage 1 1 0 1 1 1 0
## lh 1 1 1 0 1 1 0
## lc 1 1 1 1 0 1 0
## mc 1 1 1 1 1 0 0
## Number of logged events: 1
## it im dep meth out
## 1 0 0 constant State
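To give a feel for what predictive mean matching does, here is a deliberately simplified, hand-rolled sketch in base R. It assumes a single predictor, made-up simulated data and k = 5 donors; all names (`y_obs`, `pred`, `donors`, `y_imp`) are illustrative. As far as I understand, the pmm method in mice additionally draws the regression parameters at random so that the uncertainty of the model is carried into the imputations; that step is left out here.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical simulated data: y depends on x, then 15 y values are removed
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
y_obs <- y
y_obs[sample(n, 15)] <- NA

miss <- is.na(y_obs)

# Step 1: regress y on x using the observed cases only
fit <- lm(y_obs ~ x, subset = !miss)

# Step 2: predicted means for every case, observed and missing
pred <- predict(fit, newdata = data.frame(x = x))

# Step 3: for each missing case, find the k observed cases with the closest
# predicted mean and borrow the observed y of one of them, chosen at random
k <- 5
y_imp <- y_obs
obs_idx <- which(!miss)  # positions of the observed cases (potential donors)
for (i in which(miss)) {
  dist <- abs(pred[obs_idx] - pred[i])
  donors <- obs_idx[order(dist)[1:k]]
  y_imp[i] <- y_obs[donors[sample.int(k, 1)]]
}

summary(y_imp)
```

Because every imputed value is copied from a real observation, the imputations always stay within the range of the observed data, which is one reason pmm is a popular default for numeric variables.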
If you would like to check the imputed data, for instance for the variable Mileage, you need to enter the following line of code
impute$imp$Mileage
## 1 2 3
## 19 17817 20635 23179
## 20 30595 4952 16408
## 253 1336 713 45
## 254 14422 15204 7289
## 255 32750 45273 33011
## 256 25306 26553 24348
## 861 7219 18584 17334
## 862 1923 7160 1048
## 863 16093 20370 25622
## 1568 55273 55273 14231
## 1569 26646 14660 45927
## 1570 4 26 11
## 1571 30718 13923 3948
The output shows the imputed values for each observation (row numbers on the left) within each imputed data set (column headers 1 to 3 at the top).
If you need to check the imputation method used for each variable, mice makes it very easy to do
impute$method # What imputation method did we use?
## vehicle fm Mileage lh lc mc State
## "" "" "pmm" "pmm" "pmm" "" ""
methods(mice) # Possible imputation models provided by mice() are
## Warning in .S3methods(generic.function, class, envir): function 'mice' appears
## not to be S3 generic; found functions that look like S3 methods
## [1] mice.impute.2l.bin mice.impute.2l.lmer
## [3] mice.impute.2l.norm mice.impute.2l.pan
## [5] mice.impute.2lonly.mean mice.impute.2lonly.norm
## [7] mice.impute.2lonly.pmm mice.impute.cart
## [9] mice.impute.jomoImpute mice.impute.lasso.logreg
## [11] mice.impute.lasso.norm mice.impute.lasso.select.logreg
## [13] mice.impute.lasso.select.norm mice.impute.lda
## [15] mice.impute.logreg mice.impute.logreg.boot
## [17] mice.impute.mean mice.impute.midastouch
## [19] mice.impute.mnar.logreg mice.impute.mnar.norm
## [21] mice.impute.norm mice.impute.norm.boot
## [23] mice.impute.norm.nob mice.impute.norm.predict
## [25] mice.impute.panImpute mice.impute.passive
## [27] mice.impute.pmm mice.impute.polr
## [29] mice.impute.polyreg mice.impute.quadratic
## [31] mice.impute.rf mice.impute.ri
## [33] mice.impute.sample mice.mids
## [35] mice.theme
## see '?methods' for accessing help and source code
Now we can get back the completed data set using the complete() function. It is almost plain English:
completedData <- complete(impute,1) # Get completed datasets (observed and imputed)
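The missing values have been replaced with the imputed values from the first of the three data sets. If you wish to use another one, just change the second parameter in the complete() function. Note that State is a character column for which mice used no imputation method (its entry is empty in the output above), so its missing values are presumably still present in the completed data.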
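A short sketch of the different ways to extract completed data. It assumes the mice package is installed and uses nhanes, a small example data set shipped with mice, instead of the vehicle data; the object names `imp`, `first`, `third` and `long` are illustrative.

```r
library(mice)

# nhanes is a small example data set that ships with mice
imp <- mice(nhanes, m = 3, maxit = 5, method = "pmm", seed = 123,
            printFlag = FALSE)

first <- complete(imp, 1)  # first completed data set
third <- complete(imp, 3)  # third completed data set

# All completed data sets stacked on top of each other;
# the .imp column says which imputation each row belongs to
long <- complete(imp, action = "long")

table(long$.imp)
```

The long format is convenient for plotting or summarising all imputations at once, while a single completed data set is only a snapshot of one plausible version of the data.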
Inspecting the distribution of original and imputed data
Let’s compare the distributions of the original and imputed data using some useful plots. First of all, we can use a scatter plot and plot vehicle against all the other variables
xyplot(impute, vehicle ~ fm+Mileage+lh+lc+mc+State,pch=18,cex=1) #Scatter plot vehicle vs all
## Warning in order(as.numeric(x)): NAs introduced by coercion
## Warning in split(as.numeric(x)[ord], gg[ord]): NAs introduced by coercion
## Warning in panel.superpose(x, y, type = type, groups = groups, pch = pch, : NAs
## introduced by coercion
What we would like to see is that the shape of the magenta points (imputed) matches the shape of the blue ones (observed). The matching shape tells us that the imputed values are indeed “plausible values”. Another helpful plot is the density plot:
densityplot(impute) # Density plot original vs imputed data set
## Warning in (function (x, darg, groups = NULL, weights = NULL, subscripts =
## TRUE, : NAs introduced by coercion
## Warning in panel.superpose(x, darg = darg, plot.points = plot.points, ref =
## FALSE, : NAs introduced by coercion
The density of the imputed data for each imputed data set is shown in magenta, while the density of the observed data is shown in blue. Again, under our previous assumptions we expect the distributions to be similar.
Another useful visual take on the distributions can be obtained using the stripplot() function that shows the distributions of the variables as individual points
stripplot(impute, pch = 20, cex = 1.2) # Another take on the density: stripplot()
Pooling
Suppose that the next step in our analysis is to fit a linear model to the data. You may ask which imputed data set to choose. The mice package again makes it very easy to fit a model to each of the imputed data sets and then pool the results together
modelFit1 <- with(impute,lm(vehicle ~ fm+Mileage+lh+lc+mc))
summary(pool(modelFit1)) # Pooling the results and fitting a linear model
## term estimate std.error statistic df p.value
## 1 (Intercept) 8.394778e+02 2.442800e+01 34.3653939 1615.124 0.000000000
## 2 fm -5.397482e+00 2.613242e+00 -2.0654357 1603.673 0.039042151
## 3 Mileage 1.165737e-04 8.932166e-04 0.1305101 1581.587 0.896179496
## 4 lh -4.298063e+01 1.870605e+01 -2.2976861 1345.373 0.021732526
## 5 lc 6.915074e-01 2.442459e-01 2.8311930 1352.017 0.004706155
## 6 mc -2.458190e-02 6.075833e-02 -0.4045848 1615.829 0.685836256
The variable modelFit1 contains the results of the fitting performed over the imputed data sets, while the pool() function pools them all together.
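The pooling step combines the per-imputation results with what are usually called Rubin's rules: the pooled estimate is the average of the estimates, and the total variance is the average within-imputation variance plus an inflated between-imputation variance. Below is a minimal sketch for a single coefficient. The input numbers and the helper name `rubin_pool` are made up for illustration, and the degrees-of-freedom calculation that pool() also reports is left out.

```r
# Hypothetical inputs: one coefficient estimated on m = 3 imputed data sets
est <- c(1.0, 1.2, 0.8)  # per-imputation estimates (made-up numbers)
se  <- c(0.5, 0.5, 0.5)  # per-imputation standard errors (made-up numbers)

rubin_pool <- function(est, se) {
  m <- length(est)
  q_bar <- mean(est)            # pooled estimate
  w <- mean(se^2)               # within-imputation variance
  b <- var(est)                 # between-imputation variance
  t_var <- w + (1 + 1 / m) * b  # total variance
  c(estimate = q_bar, std.error = sqrt(t_var))
}

pooled <- rubin_pool(est, se)
print(pooled)
```

The pooled standard error is larger than any single-imputation standard error whenever the estimates differ between imputations, which is how the extra uncertainty due to the missing data shows up in the final result.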
Remember that we initialized the mice function with a specific seed, so the results depend somewhat on that initial choice. To reduce this effect, we can impute a larger number of data sets by raising the m parameter in the mice() function (we used m=3 above; the default is m=5), as follows
impute2 <- mice(misdat,m=6,seed=245) # Using more imputed data sets
##
## iter imp variable
## 1 1 Mileage lh lc
## 1 2 Mileage lh lc
## 1 3 Mileage lh lc
## 1 4 Mileage lh lc
## 1 5 Mileage lh lc
## 1 6 Mileage lh lc
## 2 1 Mileage lh lc
## 2 2 Mileage lh lc
## 2 3 Mileage lh lc
## 2 4 Mileage lh lc
## 2 5 Mileage lh lc
## 2 6 Mileage lh lc
## 3 1 Mileage lh lc
## 3 2 Mileage lh lc
## 3 3 Mileage lh lc
## 3 4 Mileage lh lc
## 3 5 Mileage lh lc
## 3 6 Mileage lh lc
## 4 1 Mileage lh lc
## 4 2 Mileage lh lc
## 4 3 Mileage lh lc
## 4 4 Mileage lh lc
## 4 5 Mileage lh lc
## 4 6 Mileage lh lc
## 5 1 Mileage lh lc
## 5 2 Mileage lh lc
## 5 3 Mileage lh lc
## 5 4 Mileage lh lc
## 5 5 Mileage lh lc
## 5 6 Mileage lh lc
## Warning: Number of logged events: 1
modelFit2 <- with(impute2,lm(vehicle ~ fm+Mileage+lh+lc+mc))
summary(pool(modelFit2)) # Pooling the results of the second linear model
## term estimate std.error statistic df p.value
## 1 (Intercept) 8.392989e+02 2.442832e+01 34.35761303 1615.705 0.000000000
## 2 fm -5.093714e+00 2.619779e+00 -1.94432974 1574.076 0.052033642
## 3 Mileage -2.855875e-05 8.972616e-04 -0.03182879 1485.501 0.974612864
## 4 lh -4.054530e+01 1.877205e+01 -2.15987577 1411.785 0.030950177
## 5 lc 6.588474e-01 2.450062e-01 2.68910507 1411.107 0.007248773
## 6 mc -2.384689e-02 6.080655e-02 -0.39217632 1615.741 0.694979662
Thank you