Intro

Missing values can be a problem when analyzing data. Most modeling functions exclude observations with missing values, which reduces the amount of information available to the analysis. This is why missing values have to be removed, imputed, or modeled explicitly. In this example, the missing values are imputed.
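
To make the cost of simply removing incomplete rows concrete, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are made up for illustration and are not part of the breast cancer data.

```r
# Toy data frame (hypothetical values, not the breast cancer data)
toy <- data.frame(
  x = c(1.2, NA, 3.1, 4.8, 5.0),
  y = c(10, 20, NA, 40, 50)
)

# Listwise deletion drops every row that has at least one NA
complete_rows <- na.omit(toy)
nrow(toy)            # 5 rows before
nrow(complete_rows)  # 3 rows after

# lm() applies the same deletion under the default na.action setting
fit <- lm(y ~ x, data = toy)
nobs(fit)            # 3 observations actually used
```

Two of the five rows are lost even though each is missing only one value, which is the information loss that imputation tries to avoid.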

library(tidyverse)
library(mice)

The data was downloaded from Kaggle.com and imported into R. For demonstration purposes only, values equal to 0 are treated as NAs.

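# stringsAsFactors = TRUE keeps diagnosis as a factor (the default changed in R 4.0)
breast_cancer <- read.csv('breast-cancer-wisconsin-data/data.csv', stringsAsFactors = TRUE)
# Drop column 33 (not used here)
breast_cancer <- breast_cancer[, -33]
# Recode every 0 as NA (demonstration only)
breast_cancer[breast_cancer == 0] <- NA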
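
The zero-to-NA replacement above relies on indexing a data frame with a logical matrix. The sketch below shows the mechanics on a small made-up data frame; the name `toy` and its values are hypothetical.

```r
# Toy data frame (hypothetical values)
toy <- data.frame(
  id = c(101, 102, 103),
  concavity = c(0.12, 0, 0.30),
  symmetry = c(0.18, 0.21, 0.16)
)

# toy == 0 is a logical matrix with the same shape as the data frame;
# using it as an index replaces only the cells that are TRUE
toy[toy == 0] <- NA

# Number of NAs per column after the replacement
na_counts <- colSums(is.na(toy))
na_counts
```

Note that the replacement is applied to every column, so it should only be used when 0 is never a legitimate value anywhere in the data.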

Exploration

# Percentage of rows that contain at least one NA
mean(!complete.cases(breast_cancer)) * 100
## [1] 2.28471

Only about 2.3% of the rows contain at least one missing value. That is a small share, but it is enough to demonstrate imputation.
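
The figure above is the share of rows with at least one NA, which is not the same as the share of individual cells that are missing. A small sketch of the difference, using a made-up data frame (`toy` and its values are assumptions for illustration):

```r
# Toy data frame (hypothetical values)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 7, 8),
  c = c(9, 10, 11, 12)
)

# Share of rows with at least one NA: 2 of 4 rows
pct_rows_incomplete <- mean(!complete.cases(toy)) * 100   # 50

# Share of individual cells that are NA: 3 of 12 cells
pct_cells_missing <- mean(is.na(toy)) * 100               # 25

# NA count per column, useful for spotting which variables are affected
na_per_column <- colSums(is.na(toy))
```

The row-based figure matters for models that drop incomplete rows, while the per-column counts show where the imputation effort will be spent.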

summary(breast_cancer)
##        id            diagnosis  radius_mean      texture_mean  
##  Min.   :     8670   B:357     Min.   : 6.981   Min.   : 9.71  
##  1st Qu.:   869218   M:212     1st Qu.:11.700   1st Qu.:16.17  
##  Median :   906024             Median :13.370   Median :18.84  
##  Mean   : 30371831             Mean   :14.127   Mean   :19.29  
##  3rd Qu.:  8813129             3rd Qu.:15.780   3rd Qu.:21.80  
##  Max.   :911320502             Max.   :28.110   Max.   :39.28  
##                                                                
##  perimeter_mean     area_mean      smoothness_mean   compactness_mean 
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##                                                                       
##  concavity_mean     concave.points_mean symmetry_mean   
##  Min.   :0.000692   Min.   :0.001852    Min.   :0.1060  
##  1st Qu.:0.030880   1st Qu.:0.020895    1st Qu.:0.1619  
##  Median :0.064905   Median :0.034840    Median :0.1792  
##  Mean   :0.090876   Mean   :0.050063    Mean   :0.1812  
##  3rd Qu.:0.132325   3rd Qu.:0.074842    3rd Qu.:0.1957  
##  Max.   :0.426800   Max.   :0.201200    Max.   :0.3040  
##  NA's   :13         NA's   :13                          
##  fractal_dimension_mean   radius_se        texture_se      perimeter_se   
##  Min.   :0.04996        Min.   :0.1115   Min.   :0.3602   Min.   : 0.757  
##  1st Qu.:0.05770        1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606  
##  Median :0.06154        Median :0.3242   Median :1.1080   Median : 2.287  
##  Mean   :0.06280        Mean   :0.4052   Mean   :1.2169   Mean   : 2.866  
##  3rd Qu.:0.06612        3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357  
##  Max.   :0.09744        Max.   :2.8730   Max.   :4.8850   Max.   :21.980  
##                                                                           
##     area_se        smoothness_se      compactness_se    
##  Min.   :  6.802   Min.   :0.001713   Min.   :0.002252  
##  1st Qu.: 17.850   1st Qu.:0.005169   1st Qu.:0.013080  
##  Median : 24.530   Median :0.006380   Median :0.020450  
##  Mean   : 40.337   Mean   :0.007041   Mean   :0.025478  
##  3rd Qu.: 45.190   3rd Qu.:0.008146   3rd Qu.:0.032450  
##  Max.   :542.200   Max.   :0.031130   Max.   :0.135400  
##                                                         
##   concavity_se      concave.points_se   symmetry_se      
##  Min.   :0.000692   Min.   :0.001852   Min.   :0.007882  
##  1st Qu.:0.015620   1st Qu.:0.007996   1st Qu.:0.015160  
##  Median :0.026245   Median :0.011100   Median :0.018730  
##  Mean   :0.032639   Mean   :0.012072   Mean   :0.020542  
##  3rd Qu.:0.042562   3rd Qu.:0.014932   3rd Qu.:0.023480  
##  Max.   :0.396000   Max.   :0.052790   Max.   :0.078950  
##  NA's   :13         NA's   :13                           
##  fractal_dimension_se  radius_worst   texture_worst   perimeter_worst 
##  Min.   :0.0008948    Min.   : 7.93   Min.   :12.02   Min.   : 50.41  
##  1st Qu.:0.0022480    1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11  
##  Median :0.0031870    Median :14.97   Median :25.41   Median : 97.66  
##  Mean   :0.0037949    Mean   :16.27   Mean   :25.68   Mean   :107.26  
##  3rd Qu.:0.0045580    3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40  
##  Max.   :0.0298400    Max.   :36.04   Max.   :49.54   Max.   :251.20  
##                                                                       
##    area_worst     smoothness_worst  compactness_worst concavity_worst   
##  Min.   : 185.2   Min.   :0.07117   Min.   :0.02729   Min.   :0.001845  
##  1st Qu.: 515.3   1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.121800  
##  Median : 686.5   Median :0.13130   Median :0.21190   Median :0.231400  
##  Mean   : 880.6   Mean   :0.13237   Mean   :0.25427   Mean   :0.278553  
##  3rd Qu.:1084.0   3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.386200  
##  Max.   :4254.0   Max.   :0.22260   Max.   :1.05800   Max.   :1.252000  
##                                                       NA's   :13        
##  concave.points_worst symmetry_worst   fractal_dimension_worst
##  Min.   :0.008772     Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.065712     1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.101700     Median :0.2822   Median :0.08004        
##  Mean   :0.117286     Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.163150     3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :0.291000     Max.   :0.6638   Max.   :0.20750        
##  NA's   :13

The missingness map produced by the call below shows which variables contain the missing values.

Amelia::missmap(breast_cancer)

Here, the values are imputed with the mice function using predictive mean matching (PMM). PMM only imputes values that were actually observed in other rows, so the imputed values always lie between the minimum and maximum of the observed values.
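
The idea behind predictive mean matching can be sketched in a few lines of base R. This is a simplified illustration on made-up data, not the algorithm as mice implements it: mice's version is more involved (for example, it also accounts for uncertainty in the regression coefficients). The variables `x`, `y` and the donor count `k` are assumptions chosen for the example.

```r
set.seed(1)

# Toy data (hypothetical): y has missing values, x is fully observed
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, NA, 8.2, 9.8, NA, 14.1, 16.0)

obs <- !is.na(y)

# Step 1: fit a regression on the observed cases and predict for all cases
fit <- lm(y ~ x, subset = obs)
pred <- predict(fit, newdata = data.frame(x = x))

# Step 2: for each missing case, find the k observed cases whose predicted
# values are closest, then copy the observed y of one donor drawn at random
k <- 3
y_imputed <- y
for (i in which(!obs)) {
  distance <- abs(pred[obs] - pred[i])
  donors <- order(distance)[seq_len(k)]
  y_imputed[i] <- y[obs][donors[sample.int(k, 1)]]
}

y_imputed
```

Because every imputed value is copied from an observed donor, the result can never fall outside the observed range, which is the property described above.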

Imputation using mice package.

# Create 3 imputed data sets with 10 iterations each; the id column is excluded
temp_data <- mice(breast_cancer[, -1], m = 3, maxit = 10, method = 'pmm', seed = 500, printFlag = FALSE)
## Warning: Number of logged events: 150
# Extract the first completed data set (mice:: avoids the clash with tidyr::complete)
impute_breast_cancer <- mice::complete(temp_data, 1)
summary(impute_breast_cancer)
##  diagnosis  radius_mean      texture_mean   perimeter_mean  
##  B:357     Min.   : 6.981   Min.   : 9.71   Min.   : 43.79  
##  M:212     1st Qu.:11.700   1st Qu.:16.17   1st Qu.: 75.17  
##            Median :13.370   Median :18.84   Median : 86.24  
##            Mean   :14.127   Mean   :19.29   Mean   : 91.97  
##            3rd Qu.:15.780   3rd Qu.:21.80   3rd Qu.:104.10  
##            Max.   :28.110   Max.   :39.28   Max.   :188.50  
##    area_mean      smoothness_mean   compactness_mean  concavity_mean    
##  Min.   : 143.5   Min.   :0.05263   Min.   :0.01938   Min.   :0.000692  
##  1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492   1st Qu.:0.030360  
##  Median : 551.1   Median :0.09587   Median :0.09263   Median :0.061550  
##  Mean   : 654.9   Mean   :0.09636   Mean   :0.10434   Mean   :0.089588  
##  3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040   3rd Qu.:0.130700  
##  Max.   :2501.0   Max.   :0.16340   Max.   :0.34540   Max.   :0.426800  
##  concave.points_mean symmetry_mean    fractal_dimension_mean
##  Min.   :0.001852    Min.   :0.1060   Min.   :0.04996       
##  1st Qu.:0.020360    1st Qu.:0.1619   1st Qu.:0.05770       
##  Median :0.033500    Median :0.1792   Median :0.06154       
##  Mean   :0.049131    Mean   :0.1812   Mean   :0.06280       
##  3rd Qu.:0.074000    3rd Qu.:0.1957   3rd Qu.:0.06612       
##  Max.   :0.201200    Max.   :0.3040   Max.   :0.09744       
##    radius_se        texture_se      perimeter_se       area_se       
##  Min.   :0.1115   Min.   :0.3602   Min.   : 0.757   Min.   :  6.802  
##  1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606   1st Qu.: 17.850  
##  Median :0.3242   Median :1.1080   Median : 2.287   Median : 24.530  
##  Mean   :0.4052   Mean   :1.2169   Mean   : 2.866   Mean   : 40.337  
##  3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357   3rd Qu.: 45.190  
##  Max.   :2.8730   Max.   :4.8850   Max.   :21.980   Max.   :542.200  
##  smoothness_se      compactness_se      concavity_se     
##  Min.   :0.001713   Min.   :0.002252   Min.   :0.000692  
##  1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.015850  
##  Median :0.006380   Median :0.020450   Median :0.026310  
##  Mean   :0.007041   Mean   :0.025478   Mean   :0.032564  
##  3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.042520  
##  Max.   :0.031130   Max.   :0.135400   Max.   :0.396000  
##  concave.points_se   symmetry_se       fractal_dimension_se
##  Min.   :0.001852   Min.   :0.007882   Min.   :0.0008948   
##  1st Qu.:0.008094   1st Qu.:0.015160   1st Qu.:0.0022480   
##  Median :0.011230   Median :0.018730   Median :0.0031870   
##  Mean   :0.012203   Mean   :0.020542   Mean   :0.0037949   
##  3rd Qu.:0.015080   3rd Qu.:0.023480   3rd Qu.:0.0045580   
##  Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400   
##   radius_worst   texture_worst   perimeter_worst    area_worst    
##  Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
##  1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11   1st Qu.: 515.3  
##  Median :14.97   Median :25.41   Median : 97.66   Median : 686.5  
##  Mean   :16.27   Mean   :25.68   Mean   :107.26   Mean   : 880.6  
##  3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40   3rd Qu.:1084.0  
##  Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
##  smoothness_worst  compactness_worst concavity_worst   
##  Min.   :0.07117   Min.   :0.02729   Min.   :0.001845  
##  1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.116800  
##  Median :0.13130   Median :0.21190   Median :0.226700  
##  Mean   :0.13237   Mean   :0.25427   Mean   :0.274329  
##  3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.382900  
##  Max.   :0.22260   Max.   :1.05800   Max.   :1.252000  
##  concave.points_worst symmetry_worst   fractal_dimension_worst
##  Min.   :0.008772     Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.064930     1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.099930     Median :0.2822   Median :0.08004        
##  Mean   :0.115070     Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.161400     3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :0.291000     Max.   :0.6638   Max.   :0.20750
Amelia::missmap(impute_breast_cancer)

The data set is now free of missing values. Many regression functions drop incomplete rows, so imputing the missing values lets the model use all observations, which may improve its predictions.
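
The code above keeps only the first of the three imputed data sets. One possible alternative, sketched here under assumptions, is to fit the regression on every imputed data set and pool the results with mice's `with()` and `pool()`. To stay self-contained, the sketch uses `nhanes`, a small example data set shipped with mice, and an arbitrary model formula rather than the breast cancer data.

```r
library(mice)

# nhanes is a small example data set shipped with mice
imp <- mice(nhanes, m = 3, maxit = 10, method = "pmm", seed = 500,
            printFlag = FALSE)

# Fit the same linear model on each of the 3 completed data sets
fits <- with(imp, lm(bmi ~ age + chl))

# Combine the per-data-set estimates with Rubin's rules
pooled <- pool(fits)
summary(pooled)
```

Pooling reflects the uncertainty introduced by imputation in the standard errors, which a single completed data set cannot do.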
