Task Test with UCI Dermatology Dataset

This Vaar Template Notebook aims to return:
- Whether data is more likely to be MCAR or MAR/MNAR
- A recommended approach for managing missing data that considers stability of model results
- Evaluation of imputation efforts based on data model experiment results

Step 1: Read in data and show data summary

Data.table package used to read in file (from website for example.)

library(data.table)
df <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/dermatology/dermatology.data', header=FALSE, stringsAsFactors = FALSE)

 Downloaded 4021 bytes...
 Downloaded 12213 bytes...
 Downloaded 16309 bytes...
 Downloaded 25964 bytes...
head(df)
summary(df)
       V1              V2              V3              V4              V5               V6        
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :2.000   Median :2.000   Median :2.000   Median :1.000   Median :0.0000   Median :0.0000  
 Mean   :2.068   Mean   :1.795   Mean   :1.549   Mean   :1.366   Mean   :0.6339   Mean   :0.4481  
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :3.000   Max.   :3.000   Max.   :3.000   Max.   :3.000   Max.   :3.0000   Max.   :3.0000  
       V7               V8              V9              V10              V11        
 Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.1667   Mean   :0.377   Mean   :0.6148   Mean   :0.5191   Mean   :0.1257  
 3rd Qu.:0.0000   3rd Qu.:0.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :3.0000   Max.   :3.000   Max.   :3.0000   Max.   :3.0000   Max.   :1.0000  
      V12              V13              V14              V15              V16       
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000   Median :2.000  
 Mean   :0.4044   Mean   :0.1393   Mean   :0.5464   Mean   :0.3361   Mean   :1.369  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:2.000  
 Max.   :3.0000   Max.   :2.0000   Max.   :3.0000   Max.   :3.0000   Max.   :3.000  
      V17             V18              V19            V20              V21              V22        
 Min.   :0.000   Min.   :0.0000   Min.   :0.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :2.000   Median :0.0000   Median :1.00   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :1.956   Mean   :0.5273   Mean   :1.29   Mean   :0.6639   Mean   :0.9918   Mean   :0.6339  
 3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:2.00   3rd Qu.:2.0000   3rd Qu.:2.0000   3rd Qu.:1.0000  
 Max.   :3.000   Max.   :3.0000   Max.   :3.00   Max.   :3.0000   Max.   :3.0000   Max.   :3.0000  
      V23              V24              V25              V26              V27        
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.2951   Mean   :0.3634   Mean   :0.3934   Mean   :0.4645   Mean   :0.4563  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :3.0000   Max.   :3.0000   Max.   :3.0000   Max.   :3.0000   Max.   :3.0000  
      V28              V29              V30              V31              V32       
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000   Median :2.000  
 Mean   :0.9536   Mean   :0.4536   Mean   :0.1038   Mean   :0.1148   Mean   :1.866  
 3rd Qu.:2.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:2.000  
 Max.   :3.0000   Max.   :3.0000   Max.   :3.0000   Max.   :3.0000   Max.   :3.000  
      V33             V34                 V35       
 Min.   :0.0000   Length:366         Min.   :1.000  
 1st Qu.:0.0000   Class :character   1st Qu.:1.000  
 Median :0.0000   Mode  :character   Median :3.000  
 Mean   :0.5546                      Mean   :2.803  
 3rd Qu.:0.0000                      3rd Qu.:4.000  
 Max.   :3.0000                      Max.   :6.000  

Step 2: Copy and Tidy Data

Tidyverse package used to rename variables. Example code included below

library(tidyverse)
library(janitor)
library(naniar)
#may need to clean variable names
#df <- df %>%
        #clean_names()
dfcp <- df #copy data

#may need to assign value NA to "?"
dfcp <- dfcp %>% replace_with_na_all(condition = ~.x == "?")

#may need to rename variables to something more meaningful
dfcp <- rename (dfcp, 
                        erythema=V1, scaling=V2, def_borders=V3, itching=V4, koebner_p=V5,
                poly_papules=V6, foll_papules=V7, oral_muc=V8, knee_elb=V9, scalp=V10,
                family_his=V11, mel_incont=V12, eosinophils=V13, pnl_infil=V14, fibrosis_pap=V15,
                exocytosis=V16, acanthosis=V17, hyperkeratosis=V18, parakeratosis=V19, clubbing_rr=V20,
                elong_rr=V21, thin_sup_ep=V22, spongi_p=V23, munro_m=V24, focal_hyper=V25,
                dis_gran_lay=V26, vacuolisation=V27, spongiosis=V28, sawtooth_r=V29, foll_hplug=V30,
                perifollicular=V31, inflam_mono_inf=V32, band_inf=V33, age=V34, class=V35)


#Where columns have missing data may need to convert from characters to integers
dfcp$age <- as.integer(dfcp$age)

Check for Duplicates

Duplicates checked using n_distinct in tidyverse package

n_distinct(dfcp)
[1] 366
#filter to check and review any duplicate records
duplicates <- dfcp %>%
  filter(duplicated(.))

print(duplicates)

Remove duplicates and any unnecessary variables for model such as id

Only if unique identifiers exist.

```r
dfcp <- dfcp %>% distinct() #691 obs remaining
dfcp <- select(dfcp,-c(sampleID)) #remove sampleID column 10 variables remain

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->



# Step 3: Visualise Missing Data and run MCAR Test

Packages used: visdat to explore missing values, ggplot to explore missing data further and naniar for geom_miss_point function and MCAR test. There will be a warning with MCAR test if non-numeric columns are present. If p-value >0.05 likely to be MCAR as not statistically significant.

The naniar test may also error due to singular data and a fix has not been found for this yet (August 2023). 


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxubGlicmFyeSh2aXNkYXQpXG5saWJyYXJ5KGdncGxvdDIpXG5saWJyYXJ5KG5hbmlhcilcblxudmlzX21pc3MoZGZjcClcbmBgYCJ9 -->

```r
library(visdat)
library(ggplot2)
library(naniar)

vis_miss(dfcp)

miss_var_summary(dfcp)

#optional plot for classification problem
ggplot(dfcp,
       aes (x = class,
            y = age,)) +
  geom_miss_point() +
  ggtitle ("Graph Showing Missing Data Pattern By Class Diagnosis")

ggplot
function (data = NULL, mapping = aes(), ..., environment = parent.frame()) 
{
    UseMethod("ggplot")
}
<bytecode: 0x0000024c1f8a0c08>
<environment: namespace:ggplot2>
mcar_test(dfcp)
Warning: NAs introduced by coercion to integer range

Step 4: Look at Outputs from Visualising the Missing Data and Answer the Following Questions

What is the p.value returned by the MCAR Test?

If the statistic is high and the p.value <0.05 it is likely to be MNAR or MAR and cannot be ignored. A p-value >0.05 indicates MCAR but other information will be considered also. The value is set as a variable in the following code. Enter “error” if MCAR Test not possible due to data singularity.

#set MCAR Test p.value as variable
mcar_presult = 0.007706574

if(mcar_presult=="error" || is.numeric(mcar_presult)){print(paste(mcar_presult, "is MCAR Test p.value variable value."))
}else{
    print("Please enter either a number or 'error' as the mcar_presult variable value.")}
[1] "0.007706574 is MCAR Test p.value variable value."

Is data more likely to be missing in some variables?

If so, this indicates that the data could be missing at random (MAR) or missing not at random (MNAR) and can therefore not be ignored. Yes or No is set as a variable in the following code.

#set likelihood variable
likelihood = "Yes"

if(likelihood=="Yes" || likelihood=="No") {print(paste(likelihood, "is variable value for likelihood."))
}else{
    print("Please enter either 'Yes' or 'No' as the variable value for likelihood.")}
[1] "Yes is variable value for likelihood."

Is data missing from the dependent variable only?

If so, this suggests that complete case analysis may be the best option. However, other factors need to be considered also. Yes or No is set as variable in the following code.

#set dependent missingness variable
dependent_only = "No"

if(dependent_only=="Yes" || dependent_only=="No") {print(paste(dependent_only, "is variable value for dependent_only."))
}else{
    print("Please enter either 'Yes' or 'No' as the variable value for dependent_only.")}
[1] "No is variable value for dependent_only."

What percentage of data is missing?

If it’s between 0.1-0.5 multiple imputation is likely to result in a more effective, unbiased model. The value is set as a variable in the following code in the format 0.000. This is taken from the missing data visualisation.

#set missingness variable value
missingness = 0.001

if(is.numeric(missingness)){print(paste(missingness, "is missingness variable value."))
}else{
    print("Please enter a number as the missingness variable value.")}
[1] "0.001 is missingness variable value."

It’s good practice to test different approaches and the Vaar Notebooks will consider these, as well as the variable values just set in recommendations.

Visualise Correlation

Visualise correlation for numerical variables.

#subset numerics for correlation
dfcp_x <- select_if(dfcp, is.numeric)
cor(dfcp_x)
                     erythema     scaling def_borders     itching     koebner_p poly_papules
erythema         1.0000000000  0.43546674  0.26008714 -0.04762282  0.0006944807   0.03357060
scaling          0.4354667435  1.00000000  0.35798716 -0.08419730 -0.0105762260  -0.07503524
def_borders      0.2600871419  0.35798716  1.00000000 -0.06256376  0.2413419135   0.32775351
itching         -0.0476228237 -0.08419730 -0.06256376  1.00000000  0.2678779155   0.40969552
koebner_p        0.0006944807 -0.01057623  0.24134191  0.26787792  1.0000000000   0.39410783
poly_papules     0.0335705974 -0.07503524  0.32775351  0.40969552  0.3941078288   1.00000000
foll_papules    -0.1095504379 -0.09240039 -0.16666188 -0.14482511 -0.1727411481  -0.13709339
oral_muc        -0.0268103478 -0.08296907  0.28668378  0.35916314  0.3925540373   0.86514164
knee_elb         0.1410081400  0.29839835  0.30230912 -0.29964455 -0.0663719325  -0.27605836
scalp            0.1866840084  0.30157745  0.26886161 -0.15563879  0.0152017709  -0.25323441
family_his       0.1720592053  0.18149121  0.11584377 -0.13661545 -0.0832472092  -0.15184095
mel_incont       0.0421264334 -0.07485579  0.31139783  0.35920665  0.3856865809   0.90704381
eosinophils      0.0752281360  0.04221112 -0.13202092  0.09543424 -0.0463655345   0.02882448
pnl_infil        0.2240953536  0.28248561  0.04133024 -0.14528813 -0.1693765837  -0.31451521
fibrosis_pap    -0.3545949735 -0.38358482 -0.27441368  0.20303243 -0.2474505345  -0.15804607
exocytosis       0.0215646506 -0.11080865 -0.20539141  0.21482183  0.1459602738   0.37964267
acanthosis       0.0641652239  0.10809525  0.18976118  0.06707992 -0.0544493385   0.12921046
hyperkeratosis  -0.0554322863  0.01314925  0.04389372 -0.00845992  0.0026350312  -0.16435492
parakeratosis    0.2100302704  0.29675159  0.36121323 -0.07556830 -0.0368008048  -0.04209759
clubbing_rr      0.1575581806  0.32074002  0.38720469 -0.23222364  0.0084714493  -0.29485667
elong_rr         0.0078193911  0.13907277  0.20170187 -0.12405752 -0.1638194895  -0.40054363
thin_sup_ep      0.1678675171  0.33034338  0.36926643 -0.24405215  0.0260637761  -0.28746755
spongi_p         0.1697726325  0.29778253  0.24620426 -0.15268946 -0.0020653460  -0.20653045
munro_m          0.1134635839  0.18636552  0.24223231 -0.05605599  0.1576464293  -0.20189321
focal_hyper     -0.0137609195 -0.08502106  0.29114708  0.36348761  0.4004093621   0.88097177
dis_gran_lay     0.1591004797  0.21148416  0.35476843 -0.19268375  0.0043273611  -0.11969837
vacuolisation    0.0025590558 -0.10952329  0.29809461  0.36765040  0.3764725106   0.91162573
spongiosis       0.0224681651  0.01215110 -0.25553602  0.01751425 -0.0059376210   0.08006255
                foll_papules    oral_muc    knee_elb        scalp  family_his  mel_incont
erythema        -0.109550438 -0.02681035  0.14100814  0.186684008  0.17205921  0.04212643
scaling         -0.092400390 -0.08296907  0.29839835  0.301577453  0.18149121 -0.07485579
def_borders     -0.166661877  0.28668378  0.30230912  0.268861611  0.11584377  0.31139783
itching         -0.144825107  0.35916314 -0.29964455 -0.155638789 -0.13661545  0.35920665
koebner_p       -0.172741148  0.39255404 -0.06637193  0.015201771 -0.08324721  0.38568658
poly_papules    -0.137093394  0.86514164 -0.27605836 -0.253234413 -0.15184095  0.90704381
foll_papules     1.000000000 -0.13239448  0.22225505 -0.003534584  0.19286625 -0.13616530
oral_muc        -0.132394477  1.00000000 -0.28346606 -0.259813343 -0.15182519  0.86923143
knee_elb         0.222255051 -0.28346606  1.00000000  0.659205582  0.34191470 -0.27231380
scalp           -0.003534584 -0.25981334  0.65920558  1.000000000  0.29272769 -0.25677941
family_his       0.192866249 -0.15182519  0.34191470  0.292727690  1.00000000 -0.16701343
mel_incont      -0.136165297  0.86923143 -0.27231380 -0.256779408 -0.16701343  1.00000000
eosinophils     -0.087452012  0.01412156 -0.19866966 -0.076956867 -0.10842974  0.04112887
pnl_infil       -0.119727872 -0.30373511  0.33170886  0.353082961  0.13019606 -0.31238600
fibrosis_pap     0.019698475 -0.14004653 -0.22742823 -0.226417506 -0.14955543 -0.16886249
exocytosis      -0.010869033  0.36013842 -0.52994616 -0.531623150 -0.20900460  0.39188890
acanthosis      -0.096591678  0.10156458  0.11671121  0.158394150  0.02329405  0.13027650
hyperkeratosis   0.189201254 -0.15517704  0.22953786  0.187032477  0.16071961 -0.17907140
parakeratosis   -0.029653561 -0.03925778  0.43995113  0.481269771  0.22197692 -0.02699312
clubbing_rr     -0.115856364 -0.28475037  0.72160203  0.769593769  0.31597361 -0.29286055
elong_rr        -0.105355791 -0.38681488  0.49606688  0.555905058  0.19442882 -0.39783202
thin_sup_ep     -0.128360885 -0.27761451  0.64466916  0.747035758  0.31773690 -0.28552145
spongi_p        -0.057282946 -0.19945156  0.44725233  0.464369643  0.11601901 -0.20513227
munro_m         -0.095856215 -0.19086090  0.51815454  0.632959102  0.22036509 -0.22297525
focal_hyper     -0.118710251  0.88435115 -0.26750757 -0.255554455 -0.16614190  0.89653341
dis_gran_lay    -0.157295676 -0.10670393  0.48174256  0.488803717  0.19690457 -0.13745221
vacuolisation   -0.139960042  0.88755250 -0.28215116 -0.252483049 -0.15549032  0.94165863
spongiosis      -0.013453716  0.12034380 -0.41320039 -0.420717713 -0.15236335  0.05817634
                eosinophils   pnl_infil fibrosis_pap  exocytosis  acanthosis hyperkeratosis
erythema         0.07522814  0.22409535  -0.35459497  0.02156465  0.06416522   -0.055432286
scaling          0.04221112  0.28248561  -0.38358482 -0.11080865  0.10809525    0.013149254
def_borders     -0.13202092  0.04133024  -0.27441368 -0.20539141  0.18976118    0.043893722
itching          0.09543424 -0.14528813   0.20303243  0.21482183  0.06707992   -0.008459920
koebner_p       -0.04636553 -0.16937658  -0.24745053  0.14596027 -0.05444934    0.002635031
poly_papules     0.02882448 -0.31451521  -0.15804607  0.37964267  0.12921046   -0.164354919
foll_papules    -0.08745201 -0.11972787   0.01969847 -0.01086903 -0.09659168    0.189201254
oral_muc         0.01412156 -0.30373511  -0.14004653  0.36013842  0.10156458   -0.155177036
knee_elb        -0.19866966  0.33170886  -0.22742823 -0.52994616  0.11671121    0.229537864
scalp           -0.07695687  0.35308296  -0.22641751 -0.53162315  0.15839415    0.187032477
family_his      -0.10842974  0.13019606  -0.14955543 -0.20900460  0.02329405    0.160719607
mel_incont       0.04112887 -0.31238600  -0.16886249  0.39188890  0.13027650   -0.179071403
eosinophils      1.00000000  0.09081832  -0.05567622  0.20595771  0.03949383   -0.060576645
pnl_infil        0.09081832  1.00000000  -0.26469366 -0.27613441 -0.03421877    0.029001955
fibrosis_pap    -0.05567622 -0.26469366   1.00000000 -0.17553618  0.16846138    0.093905239
exocytosis       0.20595771 -0.27613441  -0.17553618  1.00000000 -0.17790520   -0.193931745
acanthosis       0.03949383 -0.03421877   0.16846138 -0.17790520  1.00000000    0.012377610
hyperkeratosis  -0.06057664  0.02900196   0.09390524 -0.19393175  0.01237761    1.000000000
parakeratosis   -0.05634359  0.24927079  -0.23667160 -0.34902705  0.17866309    0.162103332
clubbing_rr     -0.18169071  0.38534905  -0.21472389 -0.62586369  0.17327137    0.235782396
elong_rr        -0.14645324  0.22734378   0.29845426 -0.70844488  0.32381243    0.213544642
thin_sup_ep     -0.18211117  0.37730869  -0.23882757 -0.63178196  0.17515211    0.233082419
spongi_p        -0.03025278  0.50094393  -0.16423664 -0.38042581  0.06147720    0.081209851
munro_m         -0.11851251  0.23581141  -0.16357131 -0.51283497  0.04967252    0.247044052
focal_hyper      0.06214859 -0.30729165  -0.16783812  0.38517164  0.11903488   -0.165868233
dis_gran_lay    -0.16683837  0.16744026  -0.19727459 -0.37488685  0.09083522    0.009853847
vacuolisation    0.03992115 -0.31757323  -0.16856923  0.38034170  0.13007071   -0.163193192
spongiosis       0.15523082 -0.12696918  -0.18266911  0.57128653 -0.07057417   -0.253059536
                parakeratosis  clubbing_rr     elong_rr thin_sup_ep     spongi_p     munro_m
erythema          0.210030270  0.157558181  0.007819391  0.16786752  0.169772632  0.11346358
scaling           0.296751593  0.320740018  0.139072765  0.33034338  0.297782529  0.18636552
def_borders       0.361213233  0.387204686  0.201701874  0.36926643  0.246204257  0.24223231
itching          -0.075568303 -0.232223641 -0.124057523 -0.24405215 -0.152689457 -0.05605599
koebner_p        -0.036800805  0.008471449 -0.163819490  0.02606378 -0.002065346  0.15764643
poly_papules     -0.042097593 -0.294856673 -0.400543631 -0.28746755 -0.206530447 -0.20189321
foll_papules     -0.029653561 -0.115856364 -0.105355791 -0.12836089 -0.057282946 -0.09585621
oral_muc         -0.039257785 -0.284750373 -0.386814879 -0.27761451 -0.199451555 -0.19086090
knee_elb          0.439951127  0.721602033  0.496066880  0.64466916  0.447252335  0.51815454
scalp             0.481269771  0.769593769  0.555905058  0.74703576  0.464369643  0.63295910
family_his        0.221976917  0.315973610  0.194428822  0.31773690  0.116019007  0.22036509
mel_incont       -0.026993124 -0.292860548 -0.397832024 -0.28552145 -0.205132274 -0.22297525
eosinophils      -0.056343591 -0.181690713 -0.146453242 -0.18211117 -0.030252780 -0.11851251
pnl_infil         0.249270792  0.385349046  0.227343781  0.37730869  0.500943929  0.23581141
fibrosis_pap     -0.236671603 -0.214723891  0.298454263 -0.23882757 -0.164236636 -0.16357131
exocytosis       -0.349027048 -0.625863694 -0.708444882 -0.63178196 -0.380425809 -0.51283497
acanthosis        0.178663090  0.173271369  0.323812431  0.17515211  0.061477197  0.04967252
hyperkeratosis    0.162103332  0.235782396  0.213544642  0.23308242  0.081209851  0.24704405
parakeratosis     1.000000000  0.535744953  0.403034687  0.52453939  0.279278795  0.35561035
clubbing_rr       0.535744953  1.000000000  0.724950967  0.87914479  0.553967947  0.67460195
elong_rr          0.403034687  0.724950967  1.000000000  0.71503280  0.424976746  0.49055928
thin_sup_ep       0.524539387  0.879144792  0.715032796  1.00000000  0.471917767  0.69235528
spongi_p          0.279278795  0.553967947  0.424976746  0.47191777  1.000000000  0.26218926
munro_m           0.355610351  0.674601951  0.490559277  0.69235528  0.262189260  1.00000000
focal_hyper      -0.030599976 -0.291792822 -0.396381588 -0.28448048 -0.204384393 -0.22216231
dis_gran_lay      0.289180711  0.590870567  0.396296690  0.54861699  0.273206792  0.43456570
vacuolisation    -0.032414741 -0.295592344 -0.403981365 -0.29070617 -0.210849038 -0.22163596
spongiosis       -0.230053233 -0.506274465 -0.584346583 -0.50881737 -0.285529173 -0.38552787
                focal_hyper dis_gran_lay vacuolisation   spongiosis   sawtooth_r   foll_hplug
erythema        -0.01376092  0.159100480   0.002559056  0.022468165 -0.005779296 -0.005449948
scaling         -0.08502106  0.211484156  -0.109523286  0.012151102 -0.114466838 -0.010518042
def_borders      0.29114708  0.354768426   0.298094613 -0.255536017  0.274575520 -0.086249885
itching          0.36348761 -0.192683746   0.367650396  0.017514247  0.376186022 -0.175866309
koebner_p        0.40040936  0.004327361   0.376472511 -0.005937621  0.381687053 -0.147954844
poly_papules     0.88097177 -0.119698366   0.911625731  0.080062554  0.895107362 -0.095476947
foll_papules    -0.11871025 -0.157295676  -0.139960042 -0.013453716 -0.139140808  0.785282153
oral_muc         0.88435115 -0.106703928   0.887552499  0.120343798  0.875209241 -0.082600606
knee_elb        -0.26750757  0.481742560  -0.282151156 -0.413200387 -0.280394744  0.232903105
scalp           -0.25555446  0.488803717  -0.252483049 -0.420717713 -0.257209536 -0.051894442
family_his      -0.16614190  0.196904566  -0.155490321 -0.152363346 -0.163069657  0.242311911
mel_incont       0.89653341 -0.137452212   0.941658627  0.058176343  0.900232313 -0.093466144
eosinophils      0.06214859 -0.166838372   0.039921147  0.155230816  0.040897588 -0.078212148
pnl_infil       -0.30729165  0.167440259  -0.317573230 -0.126969180 -0.315693292 -0.117591085
fibrosis_pap    -0.16783812 -0.197274590  -0.168569235 -0.182669110 -0.160734553 -0.083917427
exocytosis       0.38517164 -0.374886849   0.380341704  0.571286532  0.383949906  0.065998097
acanthosis       0.11903488  0.090835221   0.130070708 -0.070574172  0.097692794 -0.028502019
hyperkeratosis  -0.16586823  0.009853847  -0.163193192 -0.253059536 -0.138475670  0.096097013
parakeratosis   -0.03059998  0.289180711  -0.032414741 -0.230053233 -0.037768205  0.006592687
clubbing_rr     -0.29179282  0.590870567  -0.295592344 -0.506274465 -0.299260194 -0.127938884
elong_rr        -0.39638159  0.396296690  -0.403981365 -0.584346583 -0.406525528 -0.176316604
thin_sup_ep     -0.28448048  0.548616991  -0.290706166 -0.508817371 -0.291760717 -0.135688949
spongi_p        -0.20438439  0.273206792  -0.210849038 -0.285529173 -0.209614865 -0.092637443
munro_m         -0.22216231  0.434565699  -0.221635962 -0.385527867 -0.201407584 -0.110554716
focal_hyper      1.00000000 -0.137555831   0.909650320  0.047628296  0.884075940 -0.071255891
dis_gran_lay    -0.13755583  1.000000000  -0.144532965 -0.255348843 -0.133057964 -0.124126221
vacuolisation    0.90965032 -0.144532965   1.000000000  0.052696049  0.938397312 -0.091336544
spongiosis       0.04762830 -0.255348843   0.052696049  1.000000000  0.065280638  0.057935934
                perifollicular inflam_mono_inf    band_inf age       class
erythema           0.009539006     0.075758095 -0.00695452  NA -0.33553703
scaling           -0.011134969    -0.010977442 -0.13207915  NA -0.46868826
def_borders       -0.123947333     0.111881519  0.28257806  NA -0.39197414
itching           -0.184074144     0.062780814  0.38669107  NA  0.05477205
koebner_p         -0.164364136     0.083357727  0.38284377  NA -0.09132350
poly_papules      -0.110203624     0.228426273  0.90582234  NA  0.05778546
foll_papules       0.844928701    -0.084863861 -0.13387078  NA  0.47781329
oral_muc          -0.106426362     0.250936677  0.89234104  NA  0.05580485
knee_elb           0.257661488    -0.068621099 -0.27671119  NA -0.38330476
scalp             -0.017352246     0.035151781 -0.24997480  NA -0.53320838
family_his         0.231725064    -0.089133282 -0.16802241  NA -0.14438146
mel_incont        -0.109457565     0.259466872  0.91684833  NA  0.05739426
eosinophils       -0.079672201    -0.047391413  0.02835438  NA -0.06232307
pnl_infil         -0.109655288    -0.084324354 -0.31877602  NA -0.55019461
fibrosis_pap      -0.092746551    -0.024469781 -0.17196742  NA  0.52697644
exocytosis         0.058414149     0.143742074  0.38160792  NA  0.28343250
acanthosis        -0.079967636     0.131637294  0.08648629  NA -0.07977066
hyperkeratosis     0.087758781    -0.115426337 -0.14412413  NA -0.05442842
parakeratosis      0.011217531    -0.007437855 -0.06693765  NA -0.42074192
clubbing_rr       -0.132002313     0.009042738 -0.30187511  NA -0.66878054
elong_rr          -0.186462751     0.018176100 -0.41639380  NA -0.35792398
thin_sup_ep       -0.144208861     0.029385859 -0.29605370  NA -0.68486408
spongi_p          -0.086886974     0.064476921 -0.22129809  NA -0.44940738
munro_m           -0.112618534    -0.010896479 -0.19815587  NA -0.52099455
focal_hyper       -0.095858959     0.232228332  0.90451944  NA  0.06324106
dis_gran_lay      -0.126443390     0.081839585 -0.14118162  NA -0.42734771
vacuolisation     -0.112508002     0.254309426  0.93756140  NA  0.05540233
spongiosis         0.069198950     0.055834518  0.03601253  NA  0.21340113
 [ reached getOption("max.print") -- omitted 7 rows ]
vis_cor(dfcp_x)

library(GGally) #ggplot2 extension for pairs matrix
#if classification problem
#Change class from integer to factor
dfcp$class <- as.factor(dfcp$class)

pm <- ggpairs(dfcp, columns = 1:10, ggplot2::aes(colour = class), lower=list(combo=wrap("facethist",  
                                                                                                  binwidth=0.5)))
pm

Calculate Skew

library(psych) #for skew function
skew(dfcp_x) #using numeric subset created for correlation
 [1] -109.508412  -27.289419 -105.555722   35.648980  430.777146  655.822858 1291.344182  711.762408
 [9]  448.022741  528.509989  823.181140  681.964800 1118.986693  443.331537  855.565227   -2.978114
[17] -142.837906  443.189369   13.087247  428.595912  196.939167  450.488413  840.015652  738.376425
[25]  699.819281  591.758872  641.687669  211.976836  651.682622 1734.627933 1588.550298 -127.551644
[33]  575.026126   26.040408  147.266458

Step 5: which variables and values are important for predicting proportion of missingness?

rpart and rpart.plot packages are used to plot a simple classification tree.

library(rpart)
library(rpart.plot)

dfcp %>%
  add_prop_miss() %>%
  rpart(prop_miss_all ~ ., data=.) %>%
  prp(type=4, extra = 101, roundint=FALSE, prefix="Prop.Miss = ")

Which variable with missing data is nearest the top of the tree?

Variable value for missingness influencer set in code below. There may only be one variable missing data.

#set value for missingness influencer or only missing data variable
miss_influencer="age"
library(mice)
#fxpt <- fluxplot(dfcp)
# Variables with higher outflux may be more powerful predictors. More meaningful where data is missing from more than one variable.

Change Columns to Factors Where Necessary

dfcp <- as.data.frame(dfcp) 

dfcp <- dfcp  %>% mutate(across(c("family_his"), as.factor))

Change Columns to Integers Where Necessary

#dfcp <- dfcp  %>% mutate(across(c("var3", "var4"), as.integer))

Step 6: Create Complete Case and Multiple Imputed Datasets

Create complete dataset by deleting records with missing values and run Multiple Imputation using MICE. Try adjusting maxit or manually set suitable imputation method for each variable (e.g. meth = init$method=c(“pmm”, “pmm”, “pmm”, “pmm”, “logreg”)) if mice errors received.

dfcp_cca <- dfcp[complete.cases(dfcp), ] 

init = mice(dfcp, maxit=5) 
meth = init$method
predM = init$predictorMatrix

#create imputed data
set.seed(123)
dfcp_imp = mice(dfcp, method=meth, predictorMatrix=predM, m=5)
#create dataset after imputation
dfcp_mi <- complete(dfcp_imp)

Create visualisations to ensure complete data returned.

#check for missing data in MI and CCA dataset
vis_miss(dfcp_mi)

vis_miss(dfcp_cca)

As mice is designed for MAR data, imputations may not be possible for MNAR data with high levels of missingness. Also check for collinearity and that all variables are suitable for imputation if errors occur. For example, dates or values based on other columns can be corrected separately.

Step 7: Conduct MNAR Sensitivity Test

Use the variable with missing data which has the most influence on proportion of missingness, according to step 5 and look at the range of values in this variable (in the summary at the top of the notebook for example). Generate imputations under delta adjustment to imitate deviations from MAR (Gink and Van Buuren, no date). The code uses 0 for MAR and a reasonable range based on the variable range as the delta values to test MNAR.

max(dfcp_mi$age)
[1] 75
min(dfcp_mi$age)
[1] 0
mean(dfcp_mi$age)
[1] 36.34153

#Generate imputations under delta adjustment
delta <- c(0, +3, +6, +9, +12) #examples to be changed
imp.d <- vector("list", length(delta))
post <-dfcp_imp$post

for (i in 1:length(delta)) {
  d <- delta[i]
  cmd <- paste("imp[[j]][,i] <- imp[[j]][,i] +", d)
  post["age"] <- cmd
  imp <- mice(dfcp, post = post, maxit = 5,
              seed = i, print=FALSE)
  imp.d[[i]] <- imp
}

Inspect Imputations

The first plot is based on no delta adjustment and the second plot is the highest adjustment. The Y axis scale is set to the same value, so that differences can be visualised more easily.

densityplot(imp.d[[1]],lwd=3, ylim=c(0,0.4)) #check ylim is appropriate for variables

densityplot(imp.d[[5]],lwd=3, ylim=c(0,0.4)) #check ylim is appropriate for variables

Find out more about sensitivity analysis in the context of missing data from Gerko Vink and Stef van Buuren’s vignette mice: An approach to sensitivity analysis (Vink and van Buuren, no date).

The second density plot considers imputations under the largest adjustment, look at the blue line in particular.

Create Complete Datasets with Delta Imputations 1 to 5

Create complete datasets and check imputations have worked by visualising missing data.

#delta 1 is the dfcp_mi data
dfcp_mi_d2 <- complete(imp.d[[2]])
dfcp_mi_d3 <- complete(imp.d[[3]])
dfcp_mi_d4 <- complete(imp.d[[4]])
dfcp_mi_d5 <- complete(imp.d[[5]])


vis_miss(dfcp_mi_d2)

vis_miss(dfcp_mi_d3)

vis_miss(dfcp_mi_d4)

vis_miss(dfcp_mi_d5)

Optional Step: Remove Outliers

There can be valuable information in outliers and it is good practice to remove obvious errors only - such as impossible values for particular variables.

Step 8: Select Features

This code is an example of how to determine which variables have above 0.9 correlation, as highly-correlated features do not generally improve models. Then, remove any variables identified from all datasets.

library(caret)
#remove class, factors or dependent variables as looking for co-correlation issues
#0.9 and above
dfcp_mi_cor = subset(dfcp_mi, select = -c(class,family_his))
                
dfcp_mi_cor_res <- cor(dfcp_mi_cor)
dfcp_mi_cor_res <- as.data.frame(as.table(dfcp_mi_cor_res))

dfcp_mi_cor_res %>%  arrange(desc(Freq)) %>% filter(Freq>0.9)
dfcp_mi_cor_res %>%  arrange(desc(Freq)) %>% filter(Freq< -0.9)
NA
#remove identified variables from other datasets also

#vacuolisation
#band_inf
#mel_incont
#perifollicular

dfcp_mi = subset(dfcp_mi, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_cca = subset(dfcp_cca, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_mi_d2 = subset(dfcp_mi_d2, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_mi_d3 = subset(dfcp_mi_d3, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_mi_d4 = subset(dfcp_mi_d4, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_mi_d5 = subset(dfcp_mi_d5, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))

Optional step: Transform data

Scaling data allows models to compare relative relationships between data points more effectively. This dataset already scaled.

#scale the numeric columns in all the datasets to be used in the test models

#dfcp_mi <- dfcp_mi %>% mutate(across(where(is.numeric), scale))
#dfcp_cca <- dfcp_cca %>% mutate(across(where(is.numeric), scale))
#dfcp_mi_d5 <- dfcp_mi_d5 %>% mutate(across(where(is.numeric), scale))

Log transformation can be helpful for linear models for example.

Step 9: Build and Test Models

Create Training and Test Sets

Seeds set for reproducibility.

#mi data
set.seed(123)
index <- sample(2, nrow(dfcp_mi),
              replace = TRUE,
              prob = c(0.7, 0.3))
train_mi <- dfcp_mi[index==1,]
test_mi <- dfcp_mi[index==2,] 

#cca data
set.seed(123)
index1 <- sample(2, nrow(dfcp_cca),
                 replace = TRUE,
                 prob = c(0.7, 0.3))
train_cca <- dfcp_cca[index1==1,] 
test_cca <- dfcp_cca[index1==2,] 

#delta5 data
set.seed(123)
index5 <- sample(2, nrow(dfcp_mi_d5),
                 replace = TRUE,
                 prob = c(0.7, 0.3))
train_d5 <- dfcp_mi_d5[index5==1,] 
test_d5 <- dfcp_mi_d5[index5==2,] 

Predict Against Test Data

Start with baseline model using multiple imputation. RF classification included as code example only.

library(randomForest)
#Create x and y values
#dfcp_mi data
x <- train_mi[,-31] #remove dependent variable 
y <- as.factor(train_mi$class) 
set.seed(345) #for reproducibility

#Fit model with training data and review details
rf = randomForest(x = x,
                  y = y,
                  ntree = 500)
rf

Call:
 randomForest(x = x, y = y, ntree = 500) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 5

        OOB estimate of  error rate: 3.44%
Confusion matrix:
   1  2  3  4  5  6 class.error
1 75  0  0  0  0  0  0.00000000
2  1 44  0  3  0  0  0.08333333
3  0  0 47  0  0  0  0.00000000
4  0  4  0 28  0  0  0.12500000
5  0  0  0  0 42  0  0.00000000
6  1  0  0  0  0 17  0.05555556
#predict results
xt <- test_mi[,-31] #remove dependent variable like class
yt <-as.factor(test_mi$class) #keep class only
pred_yt <- predict(rf, newdata=xt)

#check accuracy
confusionMatrix(pred_yt, yt, mode="everything")
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3  4  5  6
         1 37  0  0  0  0  0
         2  0 13  0  2  0  0
         3  0  0 25  0  0  0
         4  0  0  0 15  0  0
         5  0  0  0  0 10  0
         6  0  0  0  0  0  2

Overall Statistics
                                          
               Accuracy : 0.9808          
                 95% CI : (0.9323, 0.9977)
    No Information Rate : 0.3558          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9748          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
Sensitivity            1.0000   1.0000   1.0000   0.8824  1.00000  1.00000
Specificity            1.0000   0.9780   1.0000   1.0000  1.00000  1.00000
Pos Pred Value         1.0000   0.8667   1.0000   1.0000  1.00000  1.00000
Neg Pred Value         1.0000   1.0000   1.0000   0.9775  1.00000  1.00000
Precision              1.0000   0.8667   1.0000   1.0000  1.00000  1.00000
Recall                 1.0000   1.0000   1.0000   0.8824  1.00000  1.00000
F1                     1.0000   0.9286   1.0000   0.9375  1.00000  1.00000
Prevalence             0.3558   0.1250   0.2404   0.1635  0.09615  0.01923
Detection Rate         0.3558   0.1250   0.2404   0.1442  0.09615  0.01923
Detection Prevalence   0.3558   0.1442   0.2404   0.1442  0.09615  0.01923
Balanced Accuracy      1.0000   0.9890   1.0000   0.9412  1.00000  1.00000
varImpPlot(rf)

Sample code block for CCA data.

#Create x and y values
#dfcp_cca data
x1 <- train_cca[,-31]
y1 <- as.factor(train_cca$class)
set.seed(345) #for reproducibility

#Fit model with training data and review details
rf1 = randomForest(x = x1,
                  y = y1,
                  ntree = 500)
rf1

Call:
 randomForest(x = x1, y = y1, ntree = 500) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 5

        OOB estimate of  error rate: 3.11%
Confusion matrix:
   1  2  3  4  5  6 class.error
1 80  0  0  0  0  0  0.00000000
2  1 40  0  2  0  0  0.06976744
3  0  0 52  0  0  0  0.00000000
4  0  4  0 30  0  0  0.11764706
5  0  0  0  0 34  0  0.00000000
6  1  0  0  0  0 13  0.07142857
#predict results
x1t <- test_cca[,-31] #remove dependent variable like class
y1t <-as.factor(test_cca$class) #keep class only
pred_y1t <- predict(rf1, newdata=x1t)

#check accuracy
confusionMatrix(pred_y1t, y1t, mode="everything")
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3  4  5  6
         1 31  0  0  0  0  0
         2  0 16  0  1  0  0
         3  0  0 19  0  0  0
         4  0  1  0 13  0  0
         5  0  0  0  0 14  0
         6  0  0  0  0  0  6

Overall Statistics
                                          
               Accuracy : 0.9802          
                 95% CI : (0.9303, 0.9976)
    No Information Rate : 0.3069          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9753          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
Sensitivity            1.0000   0.9412   1.0000   0.9286   1.0000  1.00000
Specificity            1.0000   0.9881   1.0000   0.9885   1.0000  1.00000
Pos Pred Value         1.0000   0.9412   1.0000   0.9286   1.0000  1.00000
Neg Pred Value         1.0000   0.9881   1.0000   0.9885   1.0000  1.00000
Precision              1.0000   0.9412   1.0000   0.9286   1.0000  1.00000
Recall                 1.0000   0.9412   1.0000   0.9286   1.0000  1.00000
F1                     1.0000   0.9412   1.0000   0.9286   1.0000  1.00000
Prevalence             0.3069   0.1683   0.1881   0.1386   0.1386  0.05941
Detection Rate         0.3069   0.1584   0.1881   0.1287   0.1386  0.05941
Detection Prevalence   0.3069   0.1683   0.1881   0.1386   0.1386  0.05941
Balanced Accuracy      1.0000   0.9646   1.0000   0.9585   1.0000  1.00000
varImpPlot(rf1)

Which Model Produced the Best Results?

The following code block captures the model which produced the best results, based on Kappa and core metric best suited to the problem.

# set model variable for best performance as MI or CCA
better_model = "MI"

if(better_model=="MI" || better_model=="CCA") {print(paste(better_model, "produced the best result in this data experiment test."))
}else{
    print("Please enter either 'MI' or 'CCA' as the variable value for better_model.")}
[1] "MI produced the best result in this data experiment test."

Does Highest Delta Adjustment Have Big Impact on Results?

Code block included to edit for delta model.

#Create x and y values
#dfcp_cca data
x5 <- train_d5[,-31]
y5 <- as.factor(train_d5$class)
set.seed(345) #for reproducibility

#Fit model with training data and review details
rf5 = randomForest(x = x5,
                  y = y5,
                  ntree = 500)
rf5

Call:
 randomForest(x = x5, y = y5, ntree = 500) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 5

        OOB estimate of  error rate: 3.05%
Confusion matrix:
   1  2  3  4  5  6 class.error
1 75  0  0  0  0  0  0.00000000
2  1 44  0  3  0  0  0.08333333
3  0  0 47  0  0  0  0.00000000
4  0  4  0 28  0  0  0.12500000
5  0  0  0  0 42  0  0.00000000
6  0  0  0  0  0 18  0.00000000
#test with test data
x5t <- test_d5[,-31]
y5t <-as.factor(test_d5$class)
pred_y5t <- predict(rf5, newdata=x5t)

#check accuracy
confusionMatrix(pred_y5t, y5t, mode="everything")
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3  4  5  6
         1 37  0  0  0  0  0
         2  0 13  0  2  0  0
         3  0  0 25  0  0  0
         4  0  0  0 15  0  0
         5  0  0  0  0 10  0
         6  0  0  0  0  0  2

Overall Statistics
                                          
               Accuracy : 0.9808          
                 95% CI : (0.9323, 0.9977)
    No Information Rate : 0.3558          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9748          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
Sensitivity            1.0000   1.0000   1.0000   0.8824  1.00000  1.00000
Specificity            1.0000   0.9780   1.0000   1.0000  1.00000  1.00000
Pos Pred Value         1.0000   0.8667   1.0000   1.0000  1.00000  1.00000
Neg Pred Value         1.0000   1.0000   1.0000   0.9775  1.00000  1.00000
Precision              1.0000   0.8667   1.0000   1.0000  1.00000  1.00000
Recall                 1.0000   1.0000   1.0000   0.8824  1.00000  1.00000
F1                     1.0000   0.9286   1.0000   0.9375  1.00000  1.00000
Prevalence             0.3558   0.1250   0.2404   0.1635  0.09615  0.01923
Detection Rate         0.3558   0.1250   0.2404   0.1442  0.09615  0.01923
Detection Prevalence   0.3558   0.1442   0.2404   0.1442  0.09615  0.01923
Balanced Accuracy      1.0000   0.9890   1.0000   0.9412  1.00000  1.00000
varImpPlot(rf5)

The impact result is captured as a Yes or No response in the code below.

#Is there a big impact on result?
delta_impact = "No"

if(delta_impact=="Yes" || delta_impact=="No") {print(paste(delta_impact, "is variable value for delta_impact."))
}else{
    print("Please enter either 'Yes' or 'No' as the variable value for delta_impact.")}
[1] "No is variable value for delta_impact."

Summary and Recommendations

The following code will print a summary, based on the variable information recorded in earlier code blocks.

Variable Parameters Set In Experiment

print(paste(missingness, "is missing data level."))
[1] "0.001 is missing data level."
print(paste("Is data more more likely to be missing in some variables than others?", likelihood))
[1] "Is data more more likely to be missing in some variables than others? Yes"
print(paste("Is data missing from dependent variable(s) only?", dependent_only))
[1] "Is data missing from dependent variable(s) only? No"
print(paste("Does higher delta adjustment have big impact on results?", delta_impact))
[1] "Does higher delta adjustment have big impact on results? No"

Is Data MCAR or MAR/MNAR?


if (likelihood=="No" & mcar_presult=="error") {
   hypothesis="Missing data pattern provides some support for MCAR hypothesis. However, no result available for MCAR Test."
   } else if (likelihood=="No" & mcar_presult<0.05) {
   hypothesis="Hypothesis unproven. MCAR Test and missing data pattern indicate different missing data mechanisms."
   } else if (likelihood=="No" & mcar_presult>=0.05) {
     hypothesis="MCAR hypothesis supported by MCAR Test and missing data pattern."
   } else if(likelihood=="Yes" & mcar_presult=="error") {
     hypothesis="Missing data pattern provides some support for MAR/MNAR hypothesis. However, no result available for MCAR Test."
   } else if (likelihood=="Yes" & mcar_presult<0.05) {
     hypothesis="MAR/MNAR hypothesis supported by MCAR Test and missing data pattern."
   }else if (likelihood=="Yes" & mcar_presult>=0.05) {
     hypothesis="Hypothesis unproven. MCAR Test and missing data pattern indicate different missing data mechanisms."
   }else {
   print("No hypothesis found.")
}

print(hypothesis)
[1] "MAR/MNAR hypothesis supported by MCAR Test and missing data pattern."

Consideration of Test Model Results in Data Experiment


if (better_model=="MI" & approach==data_approach1) {
   print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, there is a risk that it would not generalise well on new data.")
   } else if (better_model=="CCA" & approach==data_approach1) {
   print("For this particular data experiment, complete case analysis produced the most effective model. However, there is a risk that it would not generalise well on new data.")
   } else if (better_model=="MI" & approach==data_approach2) {
   print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, complete case analysis would be likely to produce unbiased results also.")
   } else if (better_model=="CCA" & approach==data_approach2) {
   print("For this particular data experiment, complete case analysis produced the most effective model. However, multiple imputation may produce a more effective model in some circumstances.")
   } else if (better_model=="MI" & approach==data_approach3) {
   print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This result is expected, given variables provided.")
   } else if (better_model=="CCA" & approach==data_approach3) {
   print("For this particular data experiment, complete case analysis produced the most effective model. However, caution should be noted over results due to high level of missingness.")
   } else if (better_model=="MI" & approach==data_approach4) {
   print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, caution should be taken with how missing data in dependent variables is handled, as CCA may be more reliable.")
   } else if (better_model=="CCA" & approach==data_approach4) {
   print("For this particular data experiment, complete case analysis produced the most effective model. This should be a reliable approach for this missing data problem.")
   } else if (better_model=="MI" & approach==data_approach5) {
   print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This should be a reliable approach for this missing data problem.")
   } else if (better_model=="CCA" & approach==data_approach5) {
   print("For this particular data experiment, complete case analysis produced the most effective model. However, the MCAR hypothesis is either not supported or unclear so caution is advised on the results produced.")
   } else if (better_model=="MI" & approach==data_approach6) {
   print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, it is recommended that a pattern mixture model is considered also.")
   } else if (better_model=="CCA" & approach==data_approach6) {
   print("For this particular data experiment, complete case analysis produced the most effective model. However, it is recommended that a pattern mixture model is considered also.")
   } else {
   print("No model evaluation found.")
}
[1] "For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This should be a reliable approach for this missing data problem."

References

–Last updated: August 2023
–End–

---
title: "Vaar R UCI Dermatology Dataset"
author: "Amanda Harris"
output: html_notebook
params:
  printcode: true
---

# Task Test with UCI Dermatology Dataset
This Vaar Template Notebook aims to return:  
-  Whether data is more likely to be MCAR or MAR/MNAR  
-  A recommended approach for managing missing data that considers stability of model results  
-  Evaluation of imputation efforts based on data model experiment results  


# Step 1: Read in data and show data summary  

Data.table package used to read in file (from website for example.)

```{r}
library(data.table)
df <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/dermatology/dermatology.data', header=FALSE, stringsAsFactors = FALSE)
head(df)
summary(df)
```


# Step 2: Copy and Tidy Data
Tidyverse package used to rename variables. Example code included below

```{r}
library(tidyverse)
library(janitor)
library(naniar)
#may need to clean variable names
#df <- df %>%
        #clean_names()
dfcp <- df #copy data

#may need to assign value NA to "?"
dfcp <- dfcp %>% replace_with_na_all(condition = ~.x == "?")

#may need to rename variables to something more meaningful
dfcp <- rename (dfcp, 
                        erythema=V1, scaling=V2, def_borders=V3, itching=V4, koebner_p=V5,
                poly_papules=V6, foll_papules=V7, oral_muc=V8, knee_elb=V9, scalp=V10,
                family_his=V11, mel_incont=V12, eosinophils=V13, pnl_infil=V14, fibrosis_pap=V15,
                exocytosis=V16, acanthosis=V17, hyperkeratosis=V18, parakeratosis=V19, clubbing_rr=V20,
                elong_rr=V21, thin_sup_ep=V22, spongi_p=V23, munro_m=V24, focal_hyper=V25,
                dis_gran_lay=V26, vacuolisation=V27, spongiosis=V28, sawtooth_r=V29, foll_hplug=V30,
                perifollicular=V31, inflam_mono_inf=V32, band_inf=V33, age=V34, class=V35)


#Where columns have missing data may need to convert from characters to integers
dfcp$age <- as.integer(dfcp$age)

```


## Check for Duplicates
Duplicates checked using n_distinct in tidyverse package

```{r}
n_distinct(dfcp)

#filter to check and review any duplicate records
duplicates <- dfcp %>%
  filter(duplicated(.))

print(duplicates)
```

### Remove duplicates and any unnecessary variables for model such as id
Only if unique identifiers exist.
```{r}
#dfcp <- dfcp %>% distinct() #remove duplicates
#dfcp <- select(dfcp,-c(new1)) #remove unnecessary columns
```


# Step 3: Visualise Missing Data and run MCAR Test

Packages used: visdat to explore missing values, ggplot to explore missing data further and naniar for geom_miss_point function and MCAR test. There will be a warning with MCAR test if non-numeric columns are present. If p-value >0.05 likely to be MCAR as not statistically significant.

The naniar test may also error due to singular data and a fix has not been found for this yet (August 2023). 

```{r}
library(visdat)
library(ggplot2)
library(naniar)

vis_miss(dfcp)
miss_var_summary(dfcp)

#optional plot for classification problem
ggplot(dfcp,
       aes (x = class,
            y = age,)) +
  geom_miss_point() +
  ggtitle ("Graph Showing Missing Data Pattern By Class Diagnosis")
ggplot

mcar_test(dfcp)
```
# Step 4: Look at Outputs from Visualising the Missing Data and Answer the Following Questions

#### What is the p.value returned by the MCAR Test?
If the statistic is high and the p.value <0.05 it is likely to be MNAR or MAR and cannot be ignored. A p-value >0.05 indicates MCAR but other information will be considered also. The value is set as a variable in the following code. Enter "error" if MCAR Test not possible due to data singularity.

```{r}
#set MCAR Test p.value as variable
mcar_presult = 0.007706574

if(mcar_presult=="error" || is.numeric(mcar_presult)){print(paste(mcar_presult, "is MCAR Test p.value variable value."))
}else{
    print("Please enter either a number or 'error' as the mcar_presult variable value.")}

```


#### Is data more likely to be missing in some variables?
If so, this indicates that the data could be missing at random (MAR) or missing not at random (MNAR) and can therefore not be ignored. Yes or No is set as a variable in the following code.
```{r}
#set likelihood variable
likelihood = "Yes"

if(likelihood=="Yes" || likelihood=="No") {print(paste(likelihood, "is variable value for likelihood."))
}else{
    print("Please enter either 'Yes' or 'No' as the variable value for likelihood.")}
```

#### Is data missing from the dependent variable only?
If so, this suggests that complete case analysis may be the best option. However, other factors need to be considered also. Yes or No is set as variable in the following code.

```{r}
#set dependent missingness variable
dependent_only = "No"

if(dependent_only=="Yes" || dependent_only=="No") {print(paste(dependent_only, "is variable value for dependent_only."))
}else{
    print("Please enter either 'Yes' or 'No' as the variable value for dependent_only.")}
```


#### What percentage of data is missing?
If it's between 0.1-0.5 multiple imputation is likely to result in a more effective, unbiased model. The value is set as a variable in the following code in the format 0.000. This is taken from the missing data visualisation.

```{r}
#set missingness variable value
missingness = 0.001

if(is.numeric(missingness)){print(paste(missingness, "is missingness variable value."))
}else{
    print("Please enter a number as the missingness variable value.")}
```


It's good practice to test different approaches and the Vaar Notebooks will consider these, as well as the variable values just set in recommendations.


## Visualise Correlation
Visualise correlation for numerical variables.

```{r}
#subset numerics for correlation
dfcp_x <- select_if(dfcp, is.numeric)
cor(dfcp_x)

vis_cor(dfcp_x)
```

```{r}
library(GGally) #ggplot2 extension for pairs matrix
#if classification problem
#Change class from integer to factor
dfcp$class <- as.factor(dfcp$class)

pm <- ggpairs(dfcp, columns = 1:10, ggplot2::aes(colour = class), lower=list(combo=wrap("facethist",  
                                                                                                  binwidth=0.5)))
pm

```

## Calculate Skew

```{r}
library(psych) #for skew function
skew(dfcp_x) #using numeric subset created for correlation
```

# Step 5: which variables and values are important for predicting proportion of missingness?

rpart and rpart.plot packages are used to plot a simple classification tree.

```{r}
library(rpart)
library(rpart.plot)

dfcp %>%
  add_prop_miss() %>%
  rpart(prop_miss_all ~ ., data=.) %>%
  prp(type=4, extra = 101, roundint=FALSE, prefix="Prop.Miss = ")
```

#### Which variable with missing data is nearest the top of the tree?
Variable value for missingness influencer set in code below. There may only be one variable missing data.
```{r}
#set value for missingness influencer or only missing data variable
miss_influencer="age"
```

```{r}
library(mice)
#fxpt <- fluxplot(dfcp)
# Variables with higher outflux may be more powerful predictors. More meaningful where data is missing from more than one variable.
```

#### Change Columns to Factors Where Necessary
```{r}
dfcp <- as.data.frame(dfcp) 

dfcp <- dfcp  %>% mutate(across(c("family_his"), as.factor))

```

#### Change Columns to Integers Where Necessary
```{r}
#dfcp <- dfcp  %>% mutate(across(c("var3", "var4"), as.integer))
```


# Step 6: Create Complete Case and Multiple Imputed Datasets
Create complete dataset by deleting records with missing values and run Multiple Imputation using MICE. Try adjusting maxit or manually set suitable imputation method for each variable (e.g. meth = init$method=c("pmm", "pmm", "pmm", "pmm", "logreg")) if mice errors received. 

```{r}
dfcp_cca <- dfcp[complete.cases(dfcp), ] 

init = mice(dfcp, maxit=5) 
meth = init$method
predM = init$predictorMatrix

#create imputed data
set.seed(123)
dfcp_imp = mice(dfcp, method=meth, predictorMatrix=predM, m=5)
#create dataset after imputation
dfcp_mi <- complete(dfcp_imp)
```

Create visualisations to ensure complete data returned.
```{r}
#check for missing data in MI and CCA dataset
vis_miss(dfcp_mi)
vis_miss(dfcp_cca)
```
As mice is designed for MAR data, imputations may not be possible for MNAR data with high levels of missingness. Also check for collinearity and that all variables are suitable for imputation if errors occur. For example, dates or values based on other columns can be corrected separately.

# Step 7: Conduct MNAR Sensitivity Test

Use the variable with missing data which has the most influence on proportion of missingness, according to step 5 and look at the range of values in this variable (in the summary at the top of the notebook for example). Generate imputations under delta adjustment to imitate deviations from MAR (Gink and Van Buuren, no date). The code uses 0 for MAR and a reasonable range based on the variable range as the delta values to test MNAR.
```{r}
max(dfcp_mi$age)
min(dfcp_mi$age)
mean(dfcp_mi$age)
```

```{r}

#Generate imputations under delta adjustment
delta <- c(0, +3, +6, +9, +12) #examples to be changed
imp.d <- vector("list", length(delta))
post <-dfcp_imp$post

for (i in 1:length(delta)) {
  d <- delta[i]
  cmd <- paste("imp[[j]][,i] <- imp[[j]][,i] +", d)
  post["age"] <- cmd
  imp <- mice(dfcp, post = post, maxit = 5,
              seed = i, print=FALSE)
  imp.d[[i]] <- imp
}
```

## Inspect Imputations

The first plot is based on no delta adjustment and the second plot is the highest adjustment. The Y axis scale is set to the same value, so that differences can be visualised more easily. 

```{r}
densityplot(imp.d[[1]],lwd=3, ylim=c(0,0.4)) #check ylim is appropriate for variables
densityplot(imp.d[[5]],lwd=3, ylim=c(0,0.4)) #check ylim is appropriate for variables
```

Find out more about sensitivity analysis in the context of missing data from Gerko Vink and Stef van Buuren's vignette [mice: An approach to sensitivity analysis](https://www.gerkovink.com/miceVignettes/Sensitivity_analysis/Sensitivity_analysis.html) (Vink and van Buuren, no date). 

The second density plot considers imputations under the largest adjustment, look at the blue line in particular.

## Create Complete Datasets with Delta Imputations 1 to 5
Create complete datasets and check imputations have worked by visualising missing data.

```{r}
#delta 1 is the dfcp_mi data
dfcp_mi_d2 <- complete(imp.d[[2]])
dfcp_mi_d3 <- complete(imp.d[[3]])
dfcp_mi_d4 <- complete(imp.d[[4]])
dfcp_mi_d5 <- complete(imp.d[[5]])


vis_miss(dfcp_mi_d2)
vis_miss(dfcp_mi_d3)
vis_miss(dfcp_mi_d4)
vis_miss(dfcp_mi_d5)

```

## Optional Step: Remove Outliers
There can be valuable information in outliers and it is good practice to remove obvious errors only - such as impossible values for particular variables. 


# Step 8: Select Features
This code is an example of how to determine which variables have above 0.9 correlation, as highly-correlated features do not generally improve models. Then, remove any variables identified from all datasets.

```{r}
library(caret)
#remove class, factors or dependent variables as looking for co-correlation issues
#0.9 and above
dfcp_mi_cor = subset(dfcp_mi, select = -c(class,family_his))
                
dfcp_mi_cor_res <- cor(dfcp_mi_cor)
dfcp_mi_cor_res <- as.data.frame(as.table(dfcp_mi_cor_res))

dfcp_mi_cor_res %>%  arrange(desc(Freq)) %>% filter(Freq>0.9)
dfcp_mi_cor_res %>%  arrange(desc(Freq)) %>% filter(Freq< -0.9)

```


```{r}
#remove identified variables from other datasets also

#vacuolisation
#band_inf
#mel_incont
#perifollicular

dfcp_mi = subset(dfcp_mi, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_cca = subset(dfcp_cca, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_mi_d2 = subset(dfcp_mi_d2, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_mi_d3 = subset(dfcp_mi_d3, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_mi_d4 = subset(dfcp_mi_d4, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_mi_d5 = subset(dfcp_mi_d5, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
```

## Optional step: Transform data
Scaling data allows models to compare relative relationships between data points more effectively. This dataset already scaled.

```{r}
#scale the numeric columns in all the datasets to be used in the test models

#dfcp_mi <- dfcp_mi %>% mutate(across(where(is.numeric), scale))
#dfcp_cca <- dfcp_cca %>% mutate(across(where(is.numeric), scale))
#dfcp_mi_d5 <- dfcp_mi_d5 %>% mutate(across(where(is.numeric), scale))

```

Log transformation can be helpful for linear models for example. 


# Step 9: Build and Test Models

## Create Training and Test Sets
Seeds set for reproducibility.

```{r}
#mi data
set.seed(123)
index <- sample(2, nrow(dfcp_mi),
              replace = TRUE,
              prob = c(0.7, 0.3))
train_mi <- dfcp_mi[index==1,]
test_mi <- dfcp_mi[index==2,] 

#cca data
set.seed(123)
index1 <- sample(2, nrow(dfcp_cca),
                 replace = TRUE,
                 prob = c(0.7, 0.3))
train_cca <- dfcp_cca[index1==1,] 
test_cca <- dfcp_cca[index1==2,] 

#delta5 data
set.seed(123)
index5 <- sample(2, nrow(dfcp_mi_d5),
                 replace = TRUE,
                 prob = c(0.7, 0.3))
train_d5 <- dfcp_mi_d5[index5==1,] 
test_d5 <- dfcp_mi_d5[index5==2,] 
```


## Predict Against Test Data
Start with baseline model using multiple imputation. RF classification included as code example only. 

```{r}
library(randomForest)
#Create x and y values
#dfcp_mi data
x <- train_mi[,-31] #remove dependent variable 
y <- as.factor(train_mi$class) 
set.seed(345) #for reproducibility

#Fit model with training data and review details
rf = randomForest(x = x,
                  y = y,
                  ntree = 500)
rf

#predict results
xt <- test_mi[,-31] #remove dependent variable like class
yt <-as.factor(test_mi$class) #keep class only
pred_yt <- predict(rf, newdata=xt)

#check accuracy
confusionMatrix(pred_yt, yt, mode="everything")

varImpPlot(rf)
```

Sample code block for CCA data.

```{r}
#Create x and y values
#dfcp_cca data
x1 <- train_cca[,-31]
y1 <- as.factor(train_cca$class)
set.seed(345) #for reproducibility

#Fit model with training data and review details
rf1 = randomForest(x = x1,
                  y = y1,
                  ntree = 500)
rf1

#predict results
x1t <- test_cca[,-31] #remove dependent variable like class
y1t <-as.factor(test_cca$class) #keep class only
pred_y1t <- predict(rf1, newdata=x1t)

#check accuracy
confusionMatrix(pred_y1t, y1t, mode="everything")

varImpPlot(rf1)
```
### Which Model Produced the Best Results?
The following code block captures the model which produced the best results, based on Kappa and core metric best suited to the problem.
```{r}
# set model variable for best performance as MI or CCA
better_model = "MI"

if(better_model=="MI" || better_model=="CCA") {print(paste(better_model, "produced the best result in this data experiment test."))
}else{
    print("Please enter either 'MI' or 'CCA' as the variable value for better_model.")}
```



### Does Highest Delta Adjustment Have Big Impact on Results?
Code block included to edit for delta model.

```{r}
#Create x and y values
#dfcp_d5 data
x5 <- train_d5[,-31]
y5 <- as.factor(train_d5$class)
set.seed(345) #for reproducibility

#Fit model with training data and review details
rf5 = randomForest(x = x5,
                  y = y5,
                  ntree = 500)
rf5

#test with test data
x5t <- test_d5[,-31]
y5t <-as.factor(test_d5$class)
pred_y5t <- predict(rf5, newdata=x5t)

#check accuracy
confusionMatrix(pred_y5t, y5t, mode="everything")

varImpPlot(rf5)
```

The impact result is captured as a Yes or No response in the code below.
```{r}
#Is there a big impact on result?
delta_impact = "No"

if(delta_impact=="Yes" || delta_impact=="No") {print(paste(delta_impact, "is variable value for delta_impact."))
}else{
    print("Please enter either 'Yes' or 'No' as the variable value for delta_impact.")}
```


# Summary and Recommendations
The following code will print a summary, based on the variable information recorded in earlier code blocks. 

### Variable Parameters Set In Experiment
```{r}
print(paste(missingness, "is missing data level."))
print(paste("Is data more more likely to be missing in some variables than others?", likelihood))
print(paste("Is data missing from dependent variable(s) only?", dependent_only))
print(paste("Does higher delta adjustment have big impact on results?", delta_impact))
```


### Is Data MCAR or MAR/MNAR?
```{r}

if (likelihood=="No" & mcar_presult=="error") {
   hypothesis="Missing data pattern provides some support for MCAR hypothesis. However, no result available for MCAR Test."
   } else if (likelihood=="No" & mcar_presult<0.05) {
   hypothesis="Hypothesis unproven. MCAR Test and missing data pattern indicate different missing data mechanisms."
   } else if (likelihood=="No" & mcar_presult>=0.05) {
     hypothesis="MCAR hypothesis supported by MCAR Test and missing data pattern."
   } else if(likelihood=="Yes" & mcar_presult=="error") {
     hypothesis="Missing data pattern provides some support for MAR/MNAR hypothesis. However, no result available for MCAR Test."
   } else if (likelihood=="Yes" & mcar_presult<0.05) {
     hypothesis="MAR/MNAR hypothesis supported by MCAR Test and missing data pattern."
   }else if (likelihood=="Yes" & mcar_presult>=0.05) {
     hypothesis="Hypothesis unproven. MCAR Test and missing data pattern indicate different missing data mechanisms."
   }else {
   print("No hypothesis found.")
}

print(hypothesis)
```

### Recommended Approach for Missing Data
```{r}
#define different recommended approaches
data_approach1 = "Due to overall level of missing data (or level of missing data from dependent variable only) there is a risk that a poor model will be returned, regardless of method used, that would not generalise well on new data."

data_approach2 = "As MCAR hypothesis is supported and missing data is less than 10%, complete case analysis is likely to produce unbiased results. However, multiple imputation may produce a more effective model in some circumstances."

data_approach3 = "MCAR hypothesis is supported. However, as missing data is between 10-50%, multiple imputation is likely to produce a more effective model."

data_approach4 = "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. However, as missing data is less than 10% and missing from the dependent variable only, complete case analysis may be the more reliable approach."

data_approach5 = "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. Also, as missing data is less than 50% and the delta sensitivity analysis suggests the results are relatively stable, multiple imputation is the recommended approach."

data_approach6 = "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. The delta sensitivity analysis suggests the results are not stable and indicative of MNAR data. Therefore, a pattern mixture model is recommended for further consideration. There are a couple of R mice package functions worth exploring further for this: mice.impute.ri and mice.impute.mnar.logreg (van Buuren et al., 2023)."

#define conditional statements
if (missingness>=0.5) {
   approach=data_approach1
   } else if (missingness>0.1 & dependent_only=='Yes') {
     approach=data_approach1
   } else if (hypothesis=="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness<=0.1) {
       approach=data_approach2 
   } else if (hypothesis=="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness>0.1 & missingness<0.5) {
     approach=data_approach3 
   } else if(hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness<=0.1 & dependent_only=="Yes") {
     approach=data_approach4
   } else if(hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness<=0.5 & delta_impact=="No") {
     approach=data_approach5
   } else if(hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern."  & delta_impact=="Yes") {
     approach=data_approach6
   }else {
   print("No approach found.")
}

print(approach)

```

### Consideration of Test Model Results in Data Experiment
```{r}

if (better_model=="MI" & approach==data_approach1) {
   print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, there is a risk that it would not generalise well on new data.")
   } else if (better_model=="CCA" & approach==data_approach1) {
   print("For this particular data experiment, complete case analysis produced the most effective model. However, there is a risk that it would not generalise well on new data.")
   } else if (better_model=="MI" & approach==data_approach2) {
   print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, complete case analysis would be likely to produce unbiased results also.")
   } else if (better_model=="CCA" & approach==data_approach2) {
   print("For this particular data experiment, complete case analysis produced the most effective model. However, multiple imputation may produce a more effective model in some circumstances.")
   } else if (better_model=="MI" & approach==data_approach3) {
   print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This result is expected, given variables provided.")
   } else if (better_model=="CCA" & approach==data_approach3) {
   print("For this particular data experiment, complete case analysis produced the most effective model. However, caution should be noted over results due to high level of missingness.")
   } else if (better_model=="MI" & approach==data_approach4) {
   print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, caution should be taken with how missing data in dependent variables is handled, as CCA may be more reliable.")
   } else if (better_model=="CCA" & approach==data_approach4) {
   print("For this particular data experiment, complete case analysis produced the most effective model. This should be a reliable approach for this missing data problem.")
   } else if (better_model=="MI" & approach==data_approach5) {
   print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This should be a reliable approach for this missing data problem.")
   } else if (better_model=="CCA" & approach==data_approach5) {
   print("For this particular data experiment, complete case analysis produced the most effective model. However, the MCAR hypothesis is either not supported or unclear so caution is advised on the results produced.")
   } else if (better_model=="MI" & approach==data_approach6) {
   print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, it is recommended that a pattern mixture model is considered also.")
   } else if (better_model=="CCA" & approach==data_approach6) {
   print("For this particular data experiment, complete case analysis produced the most effective model. However, it is recommended that a pattern mixture model is considered also.")
   } else {
   print("No model evaluation found.")
}
```



# References
- Ilter, N. and Guvenir, H. (1998) Dermatology, UCI Machine Learning Repository. Available at: https://doi.org/10.24432/C5FK5P.  
- van Buuren, S. et al. (2023) ‘mice: Multivariate Imputation by Chained Equations’. CRAN. Available at: https://CRAN.R-project.org/package=mice (Accessed: 10 August 2023).
- Vink, G. and van Buuren, S. (no date) mice: An approach to sensitivity analysis, www.gerkovink.com. Available at: https://www.gerkovink.com/miceVignettes/Sensitivity_analysis/Sensitivity_analysis.html (Accessed: 5 August 2023).


--Last updated: August 2023  
--End--