Missing values can be a problem when analyzing data. Most models exclude observations with missing values, which limits the amount of information available to the analysis. This is why we have to either remove the missing values, impute them, or model them. In this example, the missing values will be imputed.
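As a rough illustration of how a model excludes incomplete observations, the sketch below fits a linear regression on R's built-in `airquality` data. That dataset is only a stand-in chosen because it ships with R and contains NAs; it is not the data used in this post. By default `lm()` drops every row that has a missing value in any model variable.

```r
# airquality ships with base R and has NAs in Ozone and Solar.R.
# It is a stand-in here, not the dataset analyzed in this post.
data(airquality)

# lm() drops incomplete rows by default (na.action = na.omit)
fit <- lm(Ozone ~ Solar.R + Wind, data = airquality)

n_total <- nrow(airquality)  # rows in the data
n_used  <- nobs(fit)         # rows actually used to fit the model

cat("Rows in data:          ", n_total, "\n")
cat("Rows used by the model:", n_used, "\n")
cat("Rows dropped:          ", n_total - n_used, "\n")
```

Comparing `nobs(fit)` with `nrow()` is a quick way to see how much information a model loses to missing values.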
The data was imported into R from Kaggle.com. Just for demonstration, values equal to 0 are treated as NAs.
## [1] 2.28471
Only about 2% of the data is missing (13 of the 569 observations in each affected variable), which is a small share, so we can still work with this dataset.
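The code behind these steps is not shown in the post, so the following is only a minimal sketch of how zeros might be recoded as NA and how the share of missing data might be computed. The data frame and its values are hypothetical, and whether the reported figure refers to rows or to cells is an assumption, so both are computed.

```r
# Toy data frame standing in for the imported data (hypothetical values)
df <- data.frame(
  diagnosis      = factor(c("B", "M", "B", "M", "B")),
  radius_mean    = c(17.9, 20.5, 19.6, 11.4, 13.5),
  concavity_mean = c(0.30, 0, 0.19, 0.24, 0.09)
)

# Recode zeros as NA in the numeric columns only
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], function(x) {
  x[which(x == 0)] <- NA
  x
})

# Percentage of rows with at least one missing value
pct_rows_missing <- mean(!complete.cases(df)) * 100

# Percentage of all cells that are missing
pct_cells_missing <- mean(is.na(df)) * 100
```

Restricting the recoding to numeric columns leaves factors such as `diagnosis` untouched.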
## id diagnosis radius_mean texture_mean
## Min. : 8670 B:357 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 M:212 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
##
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
##
## concavity_mean concave.points_mean symmetry_mean
## Min. :0.000692 Min. :0.001852 Min. :0.1060
## 1st Qu.:0.030880 1st Qu.:0.020895 1st Qu.:0.1619
## Median :0.064905 Median :0.034840 Median :0.1792
## Mean :0.090876 Mean :0.050063 Mean :0.1812
## 3rd Qu.:0.132325 3rd Qu.:0.074842 3rd Qu.:0.1957
## Max. :0.426800 Max. :0.201200 Max. :0.3040
## NA's :13 NA's :13
## fractal_dimension_mean radius_se texture_se perimeter_se
## Min. :0.04996 Min. :0.1115 Min. :0.3602 Min. : 0.757
## 1st Qu.:0.05770 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606
## Median :0.06154 Median :0.3242 Median :1.1080 Median : 2.287
## Mean :0.06280 Mean :0.4052 Mean :1.2169 Mean : 2.866
## 3rd Qu.:0.06612 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357
## Max. :0.09744 Max. :2.8730 Max. :4.8850 Max. :21.980
##
## area_se smoothness_se compactness_se
## Min. : 6.802 Min. :0.001713 Min. :0.002252
## 1st Qu.: 17.850 1st Qu.:0.005169 1st Qu.:0.013080
## Median : 24.530 Median :0.006380 Median :0.020450
## Mean : 40.337 Mean :0.007041 Mean :0.025478
## 3rd Qu.: 45.190 3rd Qu.:0.008146 3rd Qu.:0.032450
## Max. :542.200 Max. :0.031130 Max. :0.135400
##
## concavity_se concave.points_se symmetry_se
## Min. :0.000692 Min. :0.001852 Min. :0.007882
## 1st Qu.:0.015620 1st Qu.:0.007996 1st Qu.:0.015160
## Median :0.026245 Median :0.011100 Median :0.018730
## Mean :0.032639 Mean :0.012072 Mean :0.020542
## 3rd Qu.:0.042562 3rd Qu.:0.014932 3rd Qu.:0.023480
## Max. :0.396000 Max. :0.052790 Max. :0.078950
## NA's :13 NA's :13
## fractal_dimension_se radius_worst texture_worst perimeter_worst
## Min. :0.0008948 Min. : 7.93 Min. :12.02 Min. : 50.41
## 1st Qu.:0.0022480 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11
## Median :0.0031870 Median :14.97 Median :25.41 Median : 97.66
## Mean :0.0037949 Mean :16.27 Mean :25.68 Mean :107.26
## 3rd Qu.:0.0045580 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40
## Max. :0.0298400 Max. :36.04 Max. :49.54 Max. :251.20
##
## area_worst smoothness_worst compactness_worst concavity_worst
## Min. : 185.2 Min. :0.07117 Min. :0.02729 Min. :0.001845
## 1st Qu.: 515.3 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.121800
## Median : 686.5 Median :0.13130 Median :0.21190 Median :0.231400
## Mean : 880.6 Mean :0.13237 Mean :0.25427 Mean :0.278553
## 3rd Qu.:1084.0 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.386200
## Max. :4254.0 Max. :0.22260 Max. :1.05800 Max. :1.252000
## NA's :13
## concave.points_worst symmetry_worst fractal_dimension_worst
## Min. :0.008772 Min. :0.1565 Min. :0.05504
## 1st Qu.:0.065712 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.101700 Median :0.2822 Median :0.08004
## Mean :0.117286 Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.163150 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.291000 Max. :0.6638 Max. :0.20750
## NA's :13
Using this plot we can see which variables contain the missing values.
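The same information can also be checked in tabular form. The sketch below counts the NAs per variable with base R; the data frame is a hypothetical toy example that only borrows a few column names from the summary above.

```r
# Toy data frame (hypothetical values) with NAs in two variables
df <- data.frame(
  radius_mean         = c(17.9, 20.5, 19.6, 11.4),
  concavity_mean      = c(0.30, NA, 0.19, NA),
  concave.points_mean = c(0.14, NA, 0.12, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(df))

# Keep only the variables that actually contain missing values
vars_with_na <- na_counts[na_counts > 0]
print(vars_with_na)
```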
Here the values are imputed with the mice function from the mice package, using predictive mean matching. Predictive mean matching only imputes values that were actually observed in other observations, so the imputed values always fall between the minimum and maximum of the observed values.
## Warning: Number of logged events: 150
## diagnosis radius_mean texture_mean perimeter_mean
## B:357 Min. : 6.981 Min. : 9.71 Min. : 43.79
## M:212 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17
## Median :13.370 Median :18.84 Median : 86.24
## Mean :14.127 Mean :19.29 Mean : 91.97
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10
## Max. :28.110 Max. :39.28 Max. :188.50
## area_mean smoothness_mean compactness_mean concavity_mean
## Min. : 143.5 Min. :0.05263 Min. :0.01938 Min. :0.000692
## 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492 1st Qu.:0.030360
## Median : 551.1 Median :0.09587 Median :0.09263 Median :0.061550
## Mean : 654.9 Mean :0.09636 Mean :0.10434 Mean :0.089588
## 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040 3rd Qu.:0.130700
## Max. :2501.0 Max. :0.16340 Max. :0.34540 Max. :0.426800
## concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :0.001852 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.020360 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.033500 Median :0.1792 Median :0.06154
## Mean :0.049131 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.074000 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.201200 Max. :0.3040 Max. :0.09744
## radius_se texture_se perimeter_se area_se
## Min. :0.1115 Min. :0.3602 Min. : 0.757 Min. : 6.802
## 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850
## Median :0.3242 Median :1.1080 Median : 2.287 Median : 24.530
## Mean :0.4052 Mean :1.2169 Mean : 2.866 Mean : 40.337
## 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190
## Max. :2.8730 Max. :4.8850 Max. :21.980 Max. :542.200
## smoothness_se compactness_se concavity_se
## Min. :0.001713 Min. :0.002252 Min. :0.000692
## 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.015850
## Median :0.006380 Median :0.020450 Median :0.026310
## Mean :0.007041 Mean :0.025478 Mean :0.032564
## 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.042520
## Max. :0.031130 Max. :0.135400 Max. :0.396000
## concave.points_se symmetry_se fractal_dimension_se
## Min. :0.001852 Min. :0.007882 Min. :0.0008948
## 1st Qu.:0.008094 1st Qu.:0.015160 1st Qu.:0.0022480
## Median :0.011230 Median :0.018730 Median :0.0031870
## Mean :0.012203 Mean :0.020542 Mean :0.0037949
## 3rd Qu.:0.015080 3rd Qu.:0.023480 3rd Qu.:0.0045580
## Max. :0.052790 Max. :0.078950 Max. :0.0298400
## radius_worst texture_worst perimeter_worst area_worst
## Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
## 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3
## Median :14.97 Median :25.41 Median : 97.66 Median : 686.5
## Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6
## 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0
## Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
## smoothness_worst compactness_worst concavity_worst
## Min. :0.07117 Min. :0.02729 Min. :0.001845
## 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.116800
## Median :0.13130 Median :0.21190 Median :0.226700
## Mean :0.13237 Mean :0.25427 Mean :0.274329
## 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.382900
## Max. :0.22260 Max. :1.05800 Max. :1.252000
## concave.points_worst symmetry_worst fractal_dimension_worst
## Min. :0.008772 Min. :0.1565 Min. :0.05504
## 1st Qu.:0.064930 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.099930 Median :0.2822 Median :0.08004
## Mean :0.115070 Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.161400 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.291000 Max. :0.6638 Max. :0.20750
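The imputation call itself is not shown in the post, so the following is a minimal sketch of what imputing with predictive mean matching might look like. It assumes the mice package is installed and uses the built-in `airquality` data as a stand-in. The number of imputations and the seed are arbitrary choices, not the author's settings.

```r
# Assumes the mice package is installed: install.packages("mice")
library(mice)

# airquality (built into R) is a stand-in; it has NAs in Ozone and Solar.R
data(airquality)

# Create 5 imputed datasets with predictive mean matching.
# m and seed are arbitrary choices made for this sketch.
imp <- mice(airquality, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Extract the first completed dataset
completed <- complete(imp, 1)

# Check that no missing values remain
anyNA(completed)

# Imputed values are drawn from observed donors, so the range
# of the completed variable stays within the observed range
range(airquality$Ozone, na.rm = TRUE)
range(completed$Ozone)
```

Calling `summary(completed)` afterwards gives a summary of the kind shown above, without any NA rows.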
The dataset is now free of missing values. Regression analysis does not handle missing values well, so imputing them may help improve the model's predictions.
Here is an excellent example!