This Vaar Template Notebook aims to return:
- Whether the data are more likely to be MCAR or MAR/MNAR
- A recommended approach for managing missing data that considers the stability of model results
- An evaluation of imputation efforts based on the data model experiment results
The data.table package is used to read in the file (from a website, for example).
library(data.table)
df <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/dermatology/dermatology.data', header=FALSE, stringsAsFactors = FALSE)
head(df)
summary(df)
V1 V2 V3 V4 V5 V6
Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0000 Min. :0.0000
1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
Median :2.000 Median :2.000 Median :2.000 Median :1.000 Median :0.0000 Median :0.0000
Mean :2.068 Mean :1.795 Mean :1.549 Mean :1.366 Mean :0.6339 Mean :0.4481
3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :3.000 Max. :3.000 Max. :3.000 Max. :3.000 Max. :3.0000 Max. :3.0000
V7 V8 V9 V10 V11
Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.1667 Mean :0.377 Mean :0.6148 Mean :0.5191 Mean :0.1257
3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :3.0000 Max. :3.000 Max. :3.0000 Max. :3.0000 Max. :1.0000
V12 V13 V14 V15 V16
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000 Median :2.000
Mean :0.4044 Mean :0.1393 Mean :0.5464 Mean :0.3361 Mean :1.369
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:2.000
Max. :3.0000 Max. :2.0000 Max. :3.0000 Max. :3.0000 Max. :3.000
V17 V18 V19 V20 V21 V22
Min. :0.000 Min. :0.0000 Min. :0.00 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :2.000 Median :0.0000 Median :1.00 Median :0.0000 Median :0.0000 Median :0.0000
Mean :1.956 Mean :0.5273 Mean :1.29 Mean :0.6639 Mean :0.9918 Mean :0.6339
3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:2.00 3rd Qu.:2.0000 3rd Qu.:2.0000 3rd Qu.:1.0000
Max. :3.000 Max. :3.0000 Max. :3.00 Max. :3.0000 Max. :3.0000 Max. :3.0000
V23 V24 V25 V26 V27
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.2951 Mean :0.3634 Mean :0.3934 Mean :0.4645 Mean :0.4563
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :3.0000 Max. :3.0000 Max. :3.0000 Max. :3.0000 Max. :3.0000
V28 V29 V30 V31 V32
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000 Median :2.000
Mean :0.9536 Mean :0.4536 Mean :0.1038 Mean :0.1148 Mean :1.866
3rd Qu.:2.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:2.000
Max. :3.0000 Max. :3.0000 Max. :3.0000 Max. :3.0000 Max. :3.000
V33 V34 V35
Min. :0.0000 Length:366 Min. :1.000
1st Qu.:0.0000 Class :character 1st Qu.:1.000
Median :0.0000 Mode :character Median :3.000
Mean :0.5546 Mean :2.803
3rd Qu.:0.0000 3rd Qu.:4.000
Max. :3.0000 Max. :6.000
The tidyverse package is used to rename the variables; example code is included below.
library(tidyverse)
library(janitor)
library(naniar)
#may need to clean variable names
#df <- df %>%
#clean_names()
dfcp <- df #copy data
#may need to assign value NA to "?"
dfcp <- dfcp %>% replace_with_na_all(condition = ~.x == "?")
#may need to rename variables to something more meaningful
dfcp <- rename (dfcp,
erythema=V1, scaling=V2, def_borders=V3, itching=V4, koebner_p=V5,
poly_papules=V6, foll_papules=V7, oral_muc=V8, knee_elb=V9, scalp=V10,
family_his=V11, mel_incont=V12, eosinophils=V13, pnl_infil=V14, fibrosis_pap=V15,
exocytosis=V16, acanthosis=V17, hyperkeratosis=V18, parakeratosis=V19, clubbing_rr=V20,
elong_rr=V21, thin_sup_ep=V22, spongi_p=V23, munro_m=V24, focal_hyper=V25,
dis_gran_lay=V26, vacuolisation=V27, spongiosis=V28, sawtooth_r=V29, foll_hplug=V30,
perifollicular=V31, inflam_mono_inf=V32, band_inf=V33, age=V34, class=V35)
#Where columns have missing data may need to convert from characters to integers
dfcp$age <- as.integer(dfcp$age)
Duplicates are checked using n_distinct from the tidyverse (dplyr) package.
n_distinct(dfcp)
[1] 366
#filter to check and review any duplicate records
duplicates <- dfcp %>%
filter(duplicated(.))
print(duplicates)
The following example code is only needed if unique identifier columns exist (the sampleID column below is illustrative and is not part of this dataset).
dfcp <- dfcp %>% distinct() #remove any duplicate rows
dfcp <- select(dfcp,-c(sampleID)) #remove a unique identifier column such as sampleID
# Step 3: Visualise Missing Data and run MCAR Test
Packages used: visdat to explore missing values, ggplot2 to explore the missing data further, and naniar for the geom_miss_point function and the MCAR test. The MCAR test will produce a warning if non-numeric columns are present. If the p-value is > 0.05 the data are likely to be MCAR, as the result is not statistically significant.
The naniar test may also error due to singular data; a fix has not been found for this yet (August 2023).
library(visdat)
library(ggplot2)
library(naniar)
vis_miss(dfcp)
miss_var_summary(dfcp)
#optional plot for classification problem
ggplot(dfcp,
       aes(x = class,
           y = age)) +
  geom_miss_point() +
  ggtitle("Graph Showing Missing Data Pattern By Class Diagnosis")
mcar_test(dfcp)
Warning: NAs introduced by coercion to integer range
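If the warning is a concern, one option (a sketch, not part of the original chunk) is to run the test on the numeric columns only and pull the p-value out programmatically rather than typing it in below.
#sketch: MCAR test on numeric columns only, extracting the p-value from the returned tibble
dfcp_num <- select_if(dfcp, is.numeric)
mcar_res <- mcar_test(dfcp_num)
mcar_res$p.value #value to use for the mcar_presult variable below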
If the statistic is high and the p-value is < 0.05, the data are likely to be MAR or MNAR and cannot be ignored. A p-value > 0.05 indicates MCAR, but other information will also be considered. The p-value is set as a variable in the following code. Enter "error" if the MCAR test is not possible due to data singularity.
#set MCAR Test p.value as variable
mcar_presult = 0.007706574
if(mcar_presult=="error" || is.numeric(mcar_presult)){print(paste(mcar_presult, "is MCAR Test p.value variable value."))
}else{
print("Please enter either a number or 'error' as the mcar_presult variable value.")}
[1] "0.007706574 is MCAR Test p.value variable value."
Is the data more likely to be missing in some variables than others? If so, this indicates that the data could be missing at random (MAR) or missing not at random (MNAR) and therefore cannot be ignored. Yes or No is set as a variable in the following code.
#set likelihood variable
likelihood = "Yes"
if(likelihood=="Yes" || likelihood=="No") {print(paste(likelihood, "is variable value for likelihood."))
}else{
print("Please enter either 'Yes' or 'No' as the variable value for likelihood.")}
[1] "Yes is variable value for likelihood."
Is data missing from the dependent variable(s) only? If so, this suggests that complete case analysis may be the best option. However, other factors also need to be considered. Yes or No is set as a variable in the following code.
#set dependent missingness variable
dependent_only = "No"
if(dependent_only=="Yes" || dependent_only=="No") {print(paste(dependent_only, "is variable value for dependent_only."))
}else{
print("Please enter either 'Yes' or 'No' as the variable value for dependent_only.")}
[1] "No is variable value for dependent_only."
What is the overall proportion of missing data? If it is between 0.1 and 0.5, multiple imputation is likely to result in a more effective, unbiased model. The value is set as a variable in the following code in the format 0.000 and is taken from the missing data visualisation.
#set missingness variable value
missingness = 0.001
if(is.numeric(missingness)){print(paste(missingness, "is missingness variable value."))
}else{
print("Please enter a number as the missingness variable value.")}
[1] "0.001 is missingness variable value."
It is good practice to test different approaches, and the Vaar Notebooks consider these, as well as the variable values just set, in their recommendations.
Visualise correlation for numerical variables.
#subset numerics for correlation
dfcp_x <- select_if(dfcp, is.numeric)
cor(dfcp_x)
erythema scaling def_borders itching koebner_p poly_papules
erythema 1.0000000000 0.43546674 0.26008714 -0.04762282 0.0006944807 0.03357060
scaling 0.4354667435 1.00000000 0.35798716 -0.08419730 -0.0105762260 -0.07503524
def_borders 0.2600871419 0.35798716 1.00000000 -0.06256376 0.2413419135 0.32775351
itching -0.0476228237 -0.08419730 -0.06256376 1.00000000 0.2678779155 0.40969552
koebner_p 0.0006944807 -0.01057623 0.24134191 0.26787792 1.0000000000 0.39410783
poly_papules 0.0335705974 -0.07503524 0.32775351 0.40969552 0.3941078288 1.00000000
foll_papules -0.1095504379 -0.09240039 -0.16666188 -0.14482511 -0.1727411481 -0.13709339
oral_muc -0.0268103478 -0.08296907 0.28668378 0.35916314 0.3925540373 0.86514164
knee_elb 0.1410081400 0.29839835 0.30230912 -0.29964455 -0.0663719325 -0.27605836
scalp 0.1866840084 0.30157745 0.26886161 -0.15563879 0.0152017709 -0.25323441
family_his 0.1720592053 0.18149121 0.11584377 -0.13661545 -0.0832472092 -0.15184095
mel_incont 0.0421264334 -0.07485579 0.31139783 0.35920665 0.3856865809 0.90704381
eosinophils 0.0752281360 0.04221112 -0.13202092 0.09543424 -0.0463655345 0.02882448
pnl_infil 0.2240953536 0.28248561 0.04133024 -0.14528813 -0.1693765837 -0.31451521
fibrosis_pap -0.3545949735 -0.38358482 -0.27441368 0.20303243 -0.2474505345 -0.15804607
exocytosis 0.0215646506 -0.11080865 -0.20539141 0.21482183 0.1459602738 0.37964267
acanthosis 0.0641652239 0.10809525 0.18976118 0.06707992 -0.0544493385 0.12921046
hyperkeratosis -0.0554322863 0.01314925 0.04389372 -0.00845992 0.0026350312 -0.16435492
parakeratosis 0.2100302704 0.29675159 0.36121323 -0.07556830 -0.0368008048 -0.04209759
clubbing_rr 0.1575581806 0.32074002 0.38720469 -0.23222364 0.0084714493 -0.29485667
elong_rr 0.0078193911 0.13907277 0.20170187 -0.12405752 -0.1638194895 -0.40054363
thin_sup_ep 0.1678675171 0.33034338 0.36926643 -0.24405215 0.0260637761 -0.28746755
spongi_p 0.1697726325 0.29778253 0.24620426 -0.15268946 -0.0020653460 -0.20653045
munro_m 0.1134635839 0.18636552 0.24223231 -0.05605599 0.1576464293 -0.20189321
focal_hyper -0.0137609195 -0.08502106 0.29114708 0.36348761 0.4004093621 0.88097177
dis_gran_lay 0.1591004797 0.21148416 0.35476843 -0.19268375 0.0043273611 -0.11969837
vacuolisation 0.0025590558 -0.10952329 0.29809461 0.36765040 0.3764725106 0.91162573
spongiosis 0.0224681651 0.01215110 -0.25553602 0.01751425 -0.0059376210 0.08006255
foll_papules oral_muc knee_elb scalp family_his mel_incont
erythema -0.109550438 -0.02681035 0.14100814 0.186684008 0.17205921 0.04212643
scaling -0.092400390 -0.08296907 0.29839835 0.301577453 0.18149121 -0.07485579
def_borders -0.166661877 0.28668378 0.30230912 0.268861611 0.11584377 0.31139783
itching -0.144825107 0.35916314 -0.29964455 -0.155638789 -0.13661545 0.35920665
koebner_p -0.172741148 0.39255404 -0.06637193 0.015201771 -0.08324721 0.38568658
poly_papules -0.137093394 0.86514164 -0.27605836 -0.253234413 -0.15184095 0.90704381
foll_papules 1.000000000 -0.13239448 0.22225505 -0.003534584 0.19286625 -0.13616530
oral_muc -0.132394477 1.00000000 -0.28346606 -0.259813343 -0.15182519 0.86923143
knee_elb 0.222255051 -0.28346606 1.00000000 0.659205582 0.34191470 -0.27231380
scalp -0.003534584 -0.25981334 0.65920558 1.000000000 0.29272769 -0.25677941
family_his 0.192866249 -0.15182519 0.34191470 0.292727690 1.00000000 -0.16701343
mel_incont -0.136165297 0.86923143 -0.27231380 -0.256779408 -0.16701343 1.00000000
eosinophils -0.087452012 0.01412156 -0.19866966 -0.076956867 -0.10842974 0.04112887
pnl_infil -0.119727872 -0.30373511 0.33170886 0.353082961 0.13019606 -0.31238600
fibrosis_pap 0.019698475 -0.14004653 -0.22742823 -0.226417506 -0.14955543 -0.16886249
exocytosis -0.010869033 0.36013842 -0.52994616 -0.531623150 -0.20900460 0.39188890
acanthosis -0.096591678 0.10156458 0.11671121 0.158394150 0.02329405 0.13027650
hyperkeratosis 0.189201254 -0.15517704 0.22953786 0.187032477 0.16071961 -0.17907140
parakeratosis -0.029653561 -0.03925778 0.43995113 0.481269771 0.22197692 -0.02699312
clubbing_rr -0.115856364 -0.28475037 0.72160203 0.769593769 0.31597361 -0.29286055
elong_rr -0.105355791 -0.38681488 0.49606688 0.555905058 0.19442882 -0.39783202
thin_sup_ep -0.128360885 -0.27761451 0.64466916 0.747035758 0.31773690 -0.28552145
spongi_p -0.057282946 -0.19945156 0.44725233 0.464369643 0.11601901 -0.20513227
munro_m -0.095856215 -0.19086090 0.51815454 0.632959102 0.22036509 -0.22297525
focal_hyper -0.118710251 0.88435115 -0.26750757 -0.255554455 -0.16614190 0.89653341
dis_gran_lay -0.157295676 -0.10670393 0.48174256 0.488803717 0.19690457 -0.13745221
vacuolisation -0.139960042 0.88755250 -0.28215116 -0.252483049 -0.15549032 0.94165863
spongiosis -0.013453716 0.12034380 -0.41320039 -0.420717713 -0.15236335 0.05817634
eosinophils pnl_infil fibrosis_pap exocytosis acanthosis hyperkeratosis
erythema 0.07522814 0.22409535 -0.35459497 0.02156465 0.06416522 -0.055432286
scaling 0.04221112 0.28248561 -0.38358482 -0.11080865 0.10809525 0.013149254
def_borders -0.13202092 0.04133024 -0.27441368 -0.20539141 0.18976118 0.043893722
itching 0.09543424 -0.14528813 0.20303243 0.21482183 0.06707992 -0.008459920
koebner_p -0.04636553 -0.16937658 -0.24745053 0.14596027 -0.05444934 0.002635031
poly_papules 0.02882448 -0.31451521 -0.15804607 0.37964267 0.12921046 -0.164354919
foll_papules -0.08745201 -0.11972787 0.01969847 -0.01086903 -0.09659168 0.189201254
oral_muc 0.01412156 -0.30373511 -0.14004653 0.36013842 0.10156458 -0.155177036
knee_elb -0.19866966 0.33170886 -0.22742823 -0.52994616 0.11671121 0.229537864
scalp -0.07695687 0.35308296 -0.22641751 -0.53162315 0.15839415 0.187032477
family_his -0.10842974 0.13019606 -0.14955543 -0.20900460 0.02329405 0.160719607
mel_incont 0.04112887 -0.31238600 -0.16886249 0.39188890 0.13027650 -0.179071403
eosinophils 1.00000000 0.09081832 -0.05567622 0.20595771 0.03949383 -0.060576645
pnl_infil 0.09081832 1.00000000 -0.26469366 -0.27613441 -0.03421877 0.029001955
fibrosis_pap -0.05567622 -0.26469366 1.00000000 -0.17553618 0.16846138 0.093905239
exocytosis 0.20595771 -0.27613441 -0.17553618 1.00000000 -0.17790520 -0.193931745
acanthosis 0.03949383 -0.03421877 0.16846138 -0.17790520 1.00000000 0.012377610
hyperkeratosis -0.06057664 0.02900196 0.09390524 -0.19393175 0.01237761 1.000000000
parakeratosis -0.05634359 0.24927079 -0.23667160 -0.34902705 0.17866309 0.162103332
clubbing_rr -0.18169071 0.38534905 -0.21472389 -0.62586369 0.17327137 0.235782396
elong_rr -0.14645324 0.22734378 0.29845426 -0.70844488 0.32381243 0.213544642
thin_sup_ep -0.18211117 0.37730869 -0.23882757 -0.63178196 0.17515211 0.233082419
spongi_p -0.03025278 0.50094393 -0.16423664 -0.38042581 0.06147720 0.081209851
munro_m -0.11851251 0.23581141 -0.16357131 -0.51283497 0.04967252 0.247044052
focal_hyper 0.06214859 -0.30729165 -0.16783812 0.38517164 0.11903488 -0.165868233
dis_gran_lay -0.16683837 0.16744026 -0.19727459 -0.37488685 0.09083522 0.009853847
vacuolisation 0.03992115 -0.31757323 -0.16856923 0.38034170 0.13007071 -0.163193192
spongiosis 0.15523082 -0.12696918 -0.18266911 0.57128653 -0.07057417 -0.253059536
parakeratosis clubbing_rr elong_rr thin_sup_ep spongi_p munro_m
erythema 0.210030270 0.157558181 0.007819391 0.16786752 0.169772632 0.11346358
scaling 0.296751593 0.320740018 0.139072765 0.33034338 0.297782529 0.18636552
def_borders 0.361213233 0.387204686 0.201701874 0.36926643 0.246204257 0.24223231
itching -0.075568303 -0.232223641 -0.124057523 -0.24405215 -0.152689457 -0.05605599
koebner_p -0.036800805 0.008471449 -0.163819490 0.02606378 -0.002065346 0.15764643
poly_papules -0.042097593 -0.294856673 -0.400543631 -0.28746755 -0.206530447 -0.20189321
foll_papules -0.029653561 -0.115856364 -0.105355791 -0.12836089 -0.057282946 -0.09585621
oral_muc -0.039257785 -0.284750373 -0.386814879 -0.27761451 -0.199451555 -0.19086090
knee_elb 0.439951127 0.721602033 0.496066880 0.64466916 0.447252335 0.51815454
scalp 0.481269771 0.769593769 0.555905058 0.74703576 0.464369643 0.63295910
family_his 0.221976917 0.315973610 0.194428822 0.31773690 0.116019007 0.22036509
mel_incont -0.026993124 -0.292860548 -0.397832024 -0.28552145 -0.205132274 -0.22297525
eosinophils -0.056343591 -0.181690713 -0.146453242 -0.18211117 -0.030252780 -0.11851251
pnl_infil 0.249270792 0.385349046 0.227343781 0.37730869 0.500943929 0.23581141
fibrosis_pap -0.236671603 -0.214723891 0.298454263 -0.23882757 -0.164236636 -0.16357131
exocytosis -0.349027048 -0.625863694 -0.708444882 -0.63178196 -0.380425809 -0.51283497
acanthosis 0.178663090 0.173271369 0.323812431 0.17515211 0.061477197 0.04967252
hyperkeratosis 0.162103332 0.235782396 0.213544642 0.23308242 0.081209851 0.24704405
parakeratosis 1.000000000 0.535744953 0.403034687 0.52453939 0.279278795 0.35561035
clubbing_rr 0.535744953 1.000000000 0.724950967 0.87914479 0.553967947 0.67460195
elong_rr 0.403034687 0.724950967 1.000000000 0.71503280 0.424976746 0.49055928
thin_sup_ep 0.524539387 0.879144792 0.715032796 1.00000000 0.471917767 0.69235528
spongi_p 0.279278795 0.553967947 0.424976746 0.47191777 1.000000000 0.26218926
munro_m 0.355610351 0.674601951 0.490559277 0.69235528 0.262189260 1.00000000
focal_hyper -0.030599976 -0.291792822 -0.396381588 -0.28448048 -0.204384393 -0.22216231
dis_gran_lay 0.289180711 0.590870567 0.396296690 0.54861699 0.273206792 0.43456570
vacuolisation -0.032414741 -0.295592344 -0.403981365 -0.29070617 -0.210849038 -0.22163596
spongiosis -0.230053233 -0.506274465 -0.584346583 -0.50881737 -0.285529173 -0.38552787
focal_hyper dis_gran_lay vacuolisation spongiosis sawtooth_r foll_hplug
erythema -0.01376092 0.159100480 0.002559056 0.022468165 -0.005779296 -0.005449948
scaling -0.08502106 0.211484156 -0.109523286 0.012151102 -0.114466838 -0.010518042
def_borders 0.29114708 0.354768426 0.298094613 -0.255536017 0.274575520 -0.086249885
itching 0.36348761 -0.192683746 0.367650396 0.017514247 0.376186022 -0.175866309
koebner_p 0.40040936 0.004327361 0.376472511 -0.005937621 0.381687053 -0.147954844
poly_papules 0.88097177 -0.119698366 0.911625731 0.080062554 0.895107362 -0.095476947
foll_papules -0.11871025 -0.157295676 -0.139960042 -0.013453716 -0.139140808 0.785282153
oral_muc 0.88435115 -0.106703928 0.887552499 0.120343798 0.875209241 -0.082600606
knee_elb -0.26750757 0.481742560 -0.282151156 -0.413200387 -0.280394744 0.232903105
scalp -0.25555446 0.488803717 -0.252483049 -0.420717713 -0.257209536 -0.051894442
family_his -0.16614190 0.196904566 -0.155490321 -0.152363346 -0.163069657 0.242311911
mel_incont 0.89653341 -0.137452212 0.941658627 0.058176343 0.900232313 -0.093466144
eosinophils 0.06214859 -0.166838372 0.039921147 0.155230816 0.040897588 -0.078212148
pnl_infil -0.30729165 0.167440259 -0.317573230 -0.126969180 -0.315693292 -0.117591085
fibrosis_pap -0.16783812 -0.197274590 -0.168569235 -0.182669110 -0.160734553 -0.083917427
exocytosis 0.38517164 -0.374886849 0.380341704 0.571286532 0.383949906 0.065998097
acanthosis 0.11903488 0.090835221 0.130070708 -0.070574172 0.097692794 -0.028502019
hyperkeratosis -0.16586823 0.009853847 -0.163193192 -0.253059536 -0.138475670 0.096097013
parakeratosis -0.03059998 0.289180711 -0.032414741 -0.230053233 -0.037768205 0.006592687
clubbing_rr -0.29179282 0.590870567 -0.295592344 -0.506274465 -0.299260194 -0.127938884
elong_rr -0.39638159 0.396296690 -0.403981365 -0.584346583 -0.406525528 -0.176316604
thin_sup_ep -0.28448048 0.548616991 -0.290706166 -0.508817371 -0.291760717 -0.135688949
spongi_p -0.20438439 0.273206792 -0.210849038 -0.285529173 -0.209614865 -0.092637443
munro_m -0.22216231 0.434565699 -0.221635962 -0.385527867 -0.201407584 -0.110554716
focal_hyper 1.00000000 -0.137555831 0.909650320 0.047628296 0.884075940 -0.071255891
dis_gran_lay -0.13755583 1.000000000 -0.144532965 -0.255348843 -0.133057964 -0.124126221
vacuolisation 0.90965032 -0.144532965 1.000000000 0.052696049 0.938397312 -0.091336544
spongiosis 0.04762830 -0.255348843 0.052696049 1.000000000 0.065280638 0.057935934
perifollicular inflam_mono_inf band_inf age class
erythema 0.009539006 0.075758095 -0.00695452 NA -0.33553703
scaling -0.011134969 -0.010977442 -0.13207915 NA -0.46868826
def_borders -0.123947333 0.111881519 0.28257806 NA -0.39197414
itching -0.184074144 0.062780814 0.38669107 NA 0.05477205
koebner_p -0.164364136 0.083357727 0.38284377 NA -0.09132350
poly_papules -0.110203624 0.228426273 0.90582234 NA 0.05778546
foll_papules 0.844928701 -0.084863861 -0.13387078 NA 0.47781329
oral_muc -0.106426362 0.250936677 0.89234104 NA 0.05580485
knee_elb 0.257661488 -0.068621099 -0.27671119 NA -0.38330476
scalp -0.017352246 0.035151781 -0.24997480 NA -0.53320838
family_his 0.231725064 -0.089133282 -0.16802241 NA -0.14438146
mel_incont -0.109457565 0.259466872 0.91684833 NA 0.05739426
eosinophils -0.079672201 -0.047391413 0.02835438 NA -0.06232307
pnl_infil -0.109655288 -0.084324354 -0.31877602 NA -0.55019461
fibrosis_pap -0.092746551 -0.024469781 -0.17196742 NA 0.52697644
exocytosis 0.058414149 0.143742074 0.38160792 NA 0.28343250
acanthosis -0.079967636 0.131637294 0.08648629 NA -0.07977066
hyperkeratosis 0.087758781 -0.115426337 -0.14412413 NA -0.05442842
parakeratosis 0.011217531 -0.007437855 -0.06693765 NA -0.42074192
clubbing_rr -0.132002313 0.009042738 -0.30187511 NA -0.66878054
elong_rr -0.186462751 0.018176100 -0.41639380 NA -0.35792398
thin_sup_ep -0.144208861 0.029385859 -0.29605370 NA -0.68486408
spongi_p -0.086886974 0.064476921 -0.22129809 NA -0.44940738
munro_m -0.112618534 -0.010896479 -0.19815587 NA -0.52099455
focal_hyper -0.095858959 0.232228332 0.90451944 NA 0.06324106
dis_gran_lay -0.126443390 0.081839585 -0.14118162 NA -0.42734771
vacuolisation -0.112508002 0.254309426 0.93756140 NA 0.05540233
spongiosis 0.069198950 0.055834518 0.03601253 NA 0.21340113
[ reached getOption("max.print") -- omitted 7 rows ]
vis_cor(dfcp_x)
library(GGally) #ggplot2 extension for pairs matrix
#if classification problem
#Change class from integer to factor
dfcp$class <- as.factor(dfcp$class)
pm <- ggpairs(dfcp, columns = 1:10, ggplot2::aes(colour = class), lower=list(combo=wrap("facethist",
binwidth=0.5)))
pm
library(psych) #for skew function
skew(dfcp_x) #using numeric subset created for correlation
[1] -109.508412 -27.289419 -105.555722 35.648980 430.777146 655.822858 1291.344182 711.762408
[9] 448.022741 528.509989 823.181140 681.964800 1118.986693 443.331537 855.565227 -2.978114
[17] -142.837906 443.189369 13.087247 428.595912 196.939167 450.488413 840.015652 738.376425
[25] 699.819281 591.758872 641.687669 211.976836 651.682622 1734.627933 1588.550298 -127.551644
[33] 575.026126 26.040408 147.266458
The rpart and rpart.plot packages are used to plot a simple decision tree showing which variables are most associated with the proportion of missingness.
library(rpart)
library(rpart.plot)
dfcp %>%
add_prop_miss() %>%
rpart(prop_miss_all ~ ., data=.) %>%
prp(type=4, extra = 101, roundint=FALSE, prefix="Prop.Miss = ")
The variable value for the missingness influencer is set in the code below. There may be only one variable with missing data.
#set value for missingness influencer or only missing data variable
miss_influencer="age"
library(mice)
#fxpt <- fluxplot(dfcp)
# Variables with higher outflux may be more powerful predictors. More meaningful where data is missing from more than one variable.
dfcp <- as.data.frame(dfcp)
dfcp <- dfcp %>% mutate(across(c("family_his"), as.factor))
#dfcp <- dfcp %>% mutate(across(c("var3", "var4"), as.integer))
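Where more than one variable has missing data, the influx and outflux measures mentioned above can also be inspected numerically; a minimal sketch using mice::flux.
#inspect influx/outflux per variable (higher outflux suggests a more useful predictor)
flux(dfcp)[, c("influx", "outflux")]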
Create a complete-case dataset by deleting records with missing values, then run multiple imputation using MICE. If mice errors are received, try adjusting maxit or manually setting a suitable imputation method for each variable (e.g. meth = c("pmm", "pmm", "pmm", "pmm", "logreg")); a sketch of overriding the method vector follows the code below.
dfcp_cca <- dfcp[complete.cases(dfcp), ]
init = mice(dfcp, maxit=5)
meth = init$method
predM = init$predictorMatrix
#create imputed data
set.seed(123)
dfcp_imp = mice(dfcp, method=meth, predictorMatrix=predM, m=5)
#create dataset after imputation
dfcp_mi <- complete(dfcp_imp)
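If mice does report errors for particular columns, a minimal sketch of overriding the imputation method for individual variables is shown here; the choice of "pmm" for age and "logreg" for family_his is an illustrative assumption, and the commented re-run would replace the call above.
#sketch: override the imputation method for specific variables
meth["age"] <- "pmm" #predictive mean matching for the numeric age column
meth["family_his"] <- "logreg" #logistic regression for the two-level factor
#dfcp_imp = mice(dfcp, method=meth, predictorMatrix=predM, m=5, seed=123)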
Create visualisations to confirm that complete data have been returned.
#check for missing data in MI and CCA dataset
vis_miss(dfcp_mi)
vis_miss(dfcp_cca)
As mice is designed for MAR data, imputation may not be possible for MNAR data with high levels of missingness. If errors occur, also check for collinearity and confirm that all variables are suitable for imputation. For example, dates or values derived from other columns can be corrected separately.
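One way to see what mice has flagged during imputation (for example, collinear or constant variables dropped from the imputation model) is the loggedEvents component of the mids object created above; a quick check follows.
#inspect any problems mice logged during imputation (NULL if none were recorded)
dfcp_imp$loggedEvents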
Use the variable with missing data that has the most influence on the proportion of missingness, according to step 5, and look at the range of values in this variable (in the summary at the top of the notebook, for example). Generate imputations under delta adjustment to imitate deviations from MAR (Vink and van Buuren, no date). The code uses 0 for MAR and a reasonable range, based on the variable's range, as the delta values to test MNAR.
max(dfcp_mi$age)
[1] 75
min(dfcp_mi$age)
[1] 0
mean(dfcp_mi$age)
[1] 36.34153
#Generate imputations under delta adjustment
delta <- c(0, +3, +6, +9, +12) #examples to be changed
imp.d <- vector("list", length(delta))
post <- dfcp_imp$post
for (i in 1:length(delta)) {
  d <- delta[i]
  cmd <- paste("imp[[j]][,i] <- imp[[j]][,i] +", d)
  post["age"] <- cmd
  imp <- mice(dfcp, post = post, maxit = 5,
              seed = i, print = FALSE)
  imp.d[[i]] <- imp
}
The first plot is based on no delta adjustment and the second on the highest adjustment. The y-axis scale is set to the same value so that differences can be visualised more easily.
densityplot(imp.d[[1]],lwd=3, ylim=c(0,0.4)) #check ylim is appropriate for variables
densityplot(imp.d[[5]],lwd=3, ylim=c(0,0.4)) #check ylim is appropriate for variables
Find out more about sensitivity analysis in the context of missing data from Gerko Vink and Stef van Buuren's vignette mice: An approach to sensitivity analysis (Vink and van Buuren, no date).
The second density plot considers imputations under the largest adjustment; look at the blue line in particular.
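As a simple numeric complement to the density plots, the imputed age values can be compared directly across adjustments; a sketch using the first completed dataset from each run.
#compare mean age with no adjustment and with the largest adjustment
mean(complete(imp.d[[1]])$age)
mean(complete(imp.d[[5]])$age)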
Create complete datasets and check imputations have worked by visualising missing data.
#delta 1 is the dfcp_mi data
dfcp_mi_d2 <- complete(imp.d[[2]])
dfcp_mi_d3 <- complete(imp.d[[3]])
dfcp_mi_d4 <- complete(imp.d[[4]])
dfcp_mi_d5 <- complete(imp.d[[5]])
vis_miss(dfcp_mi_d2)
vis_miss(dfcp_mi_d3)
vis_miss(dfcp_mi_d4)
vis_miss(dfcp_mi_d5)
There can be valuable information in outliers, so it is good practice to remove obvious errors only, such as impossible values for particular variables; an illustrative check is included below.
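A hedged sketch of checking for and removing only clearly impossible values; the age bounds of 0 and 120 are illustrative assumptions rather than rules from the original workflow, and the removal line is left commented out.
#check for impossible ages before deciding to remove anything
sum(dfcp_mi$age < 0 | dfcp_mi$age > 120)
#dfcp_mi <- dfcp_mi %>% filter(age >= 0, age <= 120)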
This code is an example of how to determine which variables have a correlation above 0.9, as highly correlated features do not generally improve models. Any variables identified are then removed from all datasets.
library(caret)
#remove class, factors or dependent variables as looking for co-correlation issues
#0.9 and above
dfcp_mi_cor = subset(dfcp_mi, select = -c(class,family_his))
dfcp_mi_cor_res <- cor(dfcp_mi_cor)
dfcp_mi_cor_res <- as.data.frame(as.table(dfcp_mi_cor_res))
dfcp_mi_cor_res %>% arrange(desc(Freq)) %>% filter(Freq>0.9)
dfcp_mi_cor_res %>% arrange(desc(Freq)) %>% filter(Freq< -0.9)
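As an alternative to scanning the long table manually, caret::findCorrelation can suggest columns to drop; a sketch that recomputes the correlation matrix from the numeric subset above (the 0.9 cutoff mirrors the threshold used here).
#alternative: let caret suggest which columns exceed the cutoff
cor_mat <- cor(dfcp_mi_cor)
findCorrelation(cor_mat, cutoff = 0.9, names = TRUE) #candidate columns to remove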
#remove identified variables from other datasets also
#vacuolisation
#band_inf
#mel_incont
#perifollicular
dfcp_mi = subset(dfcp_mi, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_cca = subset(dfcp_cca, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_mi_d2 = subset(dfcp_mi_d2, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_mi_d3 = subset(dfcp_mi_d3, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_mi_d4 = subset(dfcp_mi_d4, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
dfcp_mi_d5 = subset(dfcp_mi_d5, select = -c(vacuolisation,band_inf,mel_incont,perifollicular))
Scaling data allows models to compare relative relationships between data points more effectively. This dataset is already scaled, so the code below is left commented out.
#scale the numeric columns in all the datasets to be used in the test models
#dfcp_mi <- dfcp_mi %>% mutate(across(where(is.numeric), scale))
#dfcp_cca <- dfcp_cca %>% mutate(across(where(is.numeric), scale))
#dfcp_mi_d5 <- dfcp_mi_d5 %>% mutate(across(where(is.numeric), scale))
Log transformation can be helpful for linear models, for example.
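A minimal sketch of how a log transformation could be applied; log1p and the choice of transforming all numeric columns are illustrative only, and the line is left commented out because this dataset's ordinal scores do not need it.
#example only: log1p-transform numeric columns before fitting a linear model
#dfcp_mi <- dfcp_mi %>% mutate(across(where(is.numeric), log1p))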
Seeds are set for reproducibility before each dataset is split into training and test sets.
#mi data
set.seed(123)
index <- sample(2, nrow(dfcp_mi),
replace = TRUE,
prob = c(0.7, 0.3))
train_mi <- dfcp_mi[index==1,]
test_mi <- dfcp_mi[index==2,]
#cca data
set.seed(123)
index1 <- sample(2, nrow(dfcp_cca),
replace = TRUE,
prob = c(0.7, 0.3))
train_cca <- dfcp_cca[index1==1,]
test_cca <- dfcp_cca[index1==2,]
#delta5 data
set.seed(123)
index5 <- sample(2, nrow(dfcp_mi_d5),
replace = TRUE,
prob = c(0.7, 0.3))
train_d5 <- dfcp_mi_d5[index5==1,]
test_d5 <- dfcp_mi_d5[index5==2,]
Start with a baseline model using the multiple imputation dataset. Random forest (RF) classification is included as a code example only.
library(randomForest)
#Create x and y values
#dfcp_mi data
x <- train_mi[,-31] #remove dependent variable
y <- as.factor(train_mi$class)
set.seed(345) #for reproducibility
#Fit model with training data and review details
rf = randomForest(x = x,
y = y,
ntree = 500)
rf
Call:
randomForest(x = x, y = y, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of error rate: 3.44%
Confusion matrix:
1 2 3 4 5 6 class.error
1 75 0 0 0 0 0 0.00000000
2 1 44 0 3 0 0 0.08333333
3 0 0 47 0 0 0 0.00000000
4 0 4 0 28 0 0 0.12500000
5 0 0 0 0 42 0 0.00000000
6 1 0 0 0 0 17 0.05555556
#predict results
xt <- test_mi[,-31] #remove dependent variable like class
yt <-as.factor(test_mi$class) #keep class only
pred_yt <- predict(rf, newdata=xt)
#check accuracy
confusionMatrix(pred_yt, yt, mode="everything")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3 4 5 6
1 37 0 0 0 0 0
2 0 13 0 2 0 0
3 0 0 25 0 0 0
4 0 0 0 15 0 0
5 0 0 0 0 10 0
6 0 0 0 0 0 2
Overall Statistics
Accuracy : 0.9808
95% CI : (0.9323, 0.9977)
No Information Rate : 0.3558
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9748
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
Sensitivity 1.0000 1.0000 1.0000 0.8824 1.00000 1.00000
Specificity 1.0000 0.9780 1.0000 1.0000 1.00000 1.00000
Pos Pred Value 1.0000 0.8667 1.0000 1.0000 1.00000 1.00000
Neg Pred Value 1.0000 1.0000 1.0000 0.9775 1.00000 1.00000
Precision 1.0000 0.8667 1.0000 1.0000 1.00000 1.00000
Recall 1.0000 1.0000 1.0000 0.8824 1.00000 1.00000
F1 1.0000 0.9286 1.0000 0.9375 1.00000 1.00000
Prevalence 0.3558 0.1250 0.2404 0.1635 0.09615 0.01923
Detection Rate 0.3558 0.1250 0.2404 0.1442 0.09615 0.01923
Detection Prevalence 0.3558 0.1442 0.2404 0.1442 0.09615 0.01923
Balanced Accuracy 1.0000 0.9890 1.0000 0.9412 1.00000 1.00000
varImpPlot(rf)
Sample code block for CCA data.
#Create x and y values
#dfcp_cca data
x1 <- train_cca[,-31]
y1 <- as.factor(train_cca$class)
set.seed(345) #for reproducibility
#Fit model with training data and review details
rf1 = randomForest(x = x1,
y = y1,
ntree = 500)
rf1
Call:
randomForest(x = x1, y = y1, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of error rate: 3.11%
Confusion matrix:
1 2 3 4 5 6 class.error
1 80 0 0 0 0 0 0.00000000
2 1 40 0 2 0 0 0.06976744
3 0 0 52 0 0 0 0.00000000
4 0 4 0 30 0 0 0.11764706
5 0 0 0 0 34 0 0.00000000
6 1 0 0 0 0 13 0.07142857
#predict results
x1t <- test_cca[,-31] #remove dependent variable like class
y1t <-as.factor(test_cca$class) #keep class only
pred_y1t <- predict(rf1, newdata=x1t)
#check accuracy
confusionMatrix(pred_y1t, y1t, mode="everything")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3 4 5 6
1 31 0 0 0 0 0
2 0 16 0 1 0 0
3 0 0 19 0 0 0
4 0 1 0 13 0 0
5 0 0 0 0 14 0
6 0 0 0 0 0 6
Overall Statistics
Accuracy : 0.9802
95% CI : (0.9303, 0.9976)
No Information Rate : 0.3069
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9753
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
Sensitivity 1.0000 0.9412 1.0000 0.9286 1.0000 1.00000
Specificity 1.0000 0.9881 1.0000 0.9885 1.0000 1.00000
Pos Pred Value 1.0000 0.9412 1.0000 0.9286 1.0000 1.00000
Neg Pred Value 1.0000 0.9881 1.0000 0.9885 1.0000 1.00000
Precision 1.0000 0.9412 1.0000 0.9286 1.0000 1.00000
Recall 1.0000 0.9412 1.0000 0.9286 1.0000 1.00000
F1 1.0000 0.9412 1.0000 0.9286 1.0000 1.00000
Prevalence 0.3069 0.1683 0.1881 0.1386 0.1386 0.05941
Detection Rate 0.3069 0.1584 0.1881 0.1287 0.1386 0.05941
Detection Prevalence 0.3069 0.1683 0.1881 0.1386 0.1386 0.05941
Balanced Accuracy 1.0000 0.9646 1.0000 0.9585 1.0000 1.00000
varImpPlot(rf1)
The following code block captures the model that produced the best results, based on Kappa and the core metric best suited to the problem.
# set model variable for best performance as MI or CCA
better_model = "MI"
if(better_model=="MI" || better_model=="CCA") {print(paste(better_model, "produced the best result in this data experiment test."))
}else{
print("Please enter either 'MI' or 'CCA' as the variable value for better_model.")}
[1] "MI produced the best result in this data experiment test."
A code block is included below to edit for the delta-adjusted model.
#Create x and y values
#dfcp_mi_d5 (delta-adjusted) data
x5 <- train_d5[,-31]
y5 <- as.factor(train_d5$class)
set.seed(345) #for reproducibility
#Fit model with training data and review details
rf5 = randomForest(x = x5,
y = y5,
ntree = 500)
rf5
Call:
randomForest(x = x5, y = y5, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of error rate: 3.05%
Confusion matrix:
1 2 3 4 5 6 class.error
1 75 0 0 0 0 0 0.00000000
2 1 44 0 3 0 0 0.08333333
3 0 0 47 0 0 0 0.00000000
4 0 4 0 28 0 0 0.12500000
5 0 0 0 0 42 0 0.00000000
6 0 0 0 0 0 18 0.00000000
#test with test data
x5t <- test_d5[,-31]
y5t <-as.factor(test_d5$class)
pred_y5t <- predict(rf5, newdata=x5t)
#check accuracy
confusionMatrix(pred_y5t, y5t, mode="everything")
Confusion Matrix and Statistics
Reference
Prediction 1 2 3 4 5 6
1 37 0 0 0 0 0
2 0 13 0 2 0 0
3 0 0 25 0 0 0
4 0 0 0 15 0 0
5 0 0 0 0 10 0
6 0 0 0 0 0 2
Overall Statistics
Accuracy : 0.9808
95% CI : (0.9323, 0.9977)
No Information Rate : 0.3558
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9748
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
Sensitivity 1.0000 1.0000 1.0000 0.8824 1.00000 1.00000
Specificity 1.0000 0.9780 1.0000 1.0000 1.00000 1.00000
Pos Pred Value 1.0000 0.8667 1.0000 1.0000 1.00000 1.00000
Neg Pred Value 1.0000 1.0000 1.0000 0.9775 1.00000 1.00000
Precision 1.0000 0.8667 1.0000 1.0000 1.00000 1.00000
Recall 1.0000 1.0000 1.0000 0.8824 1.00000 1.00000
F1 1.0000 0.9286 1.0000 0.9375 1.00000 1.00000
Prevalence 0.3558 0.1250 0.2404 0.1635 0.09615 0.01923
Detection Rate 0.3558 0.1250 0.2404 0.1442 0.09615 0.01923
Detection Prevalence 0.3558 0.1442 0.2404 0.1442 0.09615 0.01923
Balanced Accuracy 1.0000 0.9890 1.0000 0.9412 1.00000 1.00000
varImpPlot(rf5)
The impact result is captured as a Yes or No response in the code below.
#Is there a big impact on result?
delta_impact = "No"
if(delta_impact=="Yes" || delta_impact=="No") {print(paste(delta_impact, "is variable value for delta_impact."))
}else{
print("Please enter either 'Yes' or 'No' as the variable value for delta_impact.")}
[1] "No is variable value for delta_impact."
The following code will print a summary, based on the variable information recorded in earlier code blocks.
print(paste(missingness, "is missing data level."))
[1] "0.001 is missing data level."
print(paste("Is data more more likely to be missing in some variables than others?", likelihood))
[1] "Is data more more likely to be missing in some variables than others? Yes"
print(paste("Is data missing from dependent variable(s) only?", dependent_only))
[1] "Is data missing from dependent variable(s) only? No"
print(paste("Does higher delta adjustment have big impact on results?", delta_impact))
[1] "Does higher delta adjustment have big impact on results? No"
if (likelihood=="No" & mcar_presult=="error") {
hypothesis="Missing data pattern provides some support for MCAR hypothesis. However, no result available for MCAR Test."
} else if (likelihood=="No" & mcar_presult<0.05) {
hypothesis="Hypothesis unproven. MCAR Test and missing data pattern indicate different missing data mechanisms."
} else if (likelihood=="No" & mcar_presult>=0.05) {
hypothesis="MCAR hypothesis supported by MCAR Test and missing data pattern."
} else if(likelihood=="Yes" & mcar_presult=="error") {
hypothesis="Missing data pattern provides some support for MAR/MNAR hypothesis. However, no result available for MCAR Test."
} else if (likelihood=="Yes" & mcar_presult<0.05) {
hypothesis="MAR/MNAR hypothesis supported by MCAR Test and missing data pattern."
}else if (likelihood=="Yes" & mcar_presult>=0.05) {
hypothesis="Hypothesis unproven. MCAR Test and missing data pattern indicate different missing data mechanisms."
}else {
print("No hypothesis found.")
}
print(hypothesis)
[1] "MAR/MNAR hypothesis supported by MCAR Test and missing data pattern."
#define different recommended approaches
data_approach1 = "Due to overall level of missing data (or level of missing data from dependent variable only) there is a risk that a poor model will be returned, regardless of method used, that would not generalise well on new data."
data_approach2 = "As MCAR hypothesis is supported and missing data is less than 20%, complete case analysis is likely to produce unbiased results. However, multiple imputation may produce a more effective model in some circumstances."
data_approach3 = "MCAR hypothesis is supported. However, as missing data is between 20-50%, multiple imputation is likely to produce a more effective model."
data_approach4 = "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. However, as missing data is less than 20% and missing from the dependent variable only, complete case analysis may be the more reliable approach."
data_approach5 = "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. Also, as missing data is less than 50% and the delta sensitivity analysis suggests the results are relatively stable, multiple imputation is the recommended approach."
data_approach6 = "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. The delta sensitivity analysis suggests the results are not stable and indicative of MNAR data. Therefore, a pattern mixture model is recommended for further consideration. There are a couple of R mice package functions worth exploring further for this: mice.impute.ri and mice.impute.mnar.logreg (van Buuren et al., 2023)."
#define conditional statements
if (missingness>=0.5) {
approach=data_approach1
} else if (missingness>0.2 & dependent_only=='Yes') {
approach=data_approach1
} else if (hypothesis=="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness<=0.2) {
approach=data_approach2
} else if (hypothesis=="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness>0.2 & missingness<0.5) {
approach=data_approach3
} else if(hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness<=0.2 & dependent_only=="Yes") {
approach=data_approach4
} else if(hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness<=0.5 & delta_impact=="No") {
approach=data_approach5
} else if(hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern." & delta_impact=="Yes") {
approach=data_approach6
}else {
print("No approach found.")
}
print(approach)
[1] "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. Also, as missing data is less than 50% and the delta sensitivity analysis suggests the results are relatively stable, multiple imputation is the recommended approach."
if (better_model=="MI" & approach==data_approach1) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, there is a risk that it would not generalise well on new data.")
} else if (better_model=="CCA" & approach==data_approach1) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, there is a risk that it would not generalise well on new data.")
} else if (better_model=="MI" & approach==data_approach2) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, complete case analysis would be likely to produce unbiased results also.")
} else if (better_model=="CCA" & approach==data_approach2) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, multiple imputation may produce a more effective model in some circumstances.")
} else if (better_model=="MI" & approach==data_approach3) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This result is expected, given variables provided.")
} else if (better_model=="CCA" & approach==data_approach3) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, caution should be noted over results due to high level of missingness.")
} else if (better_model=="MI" & approach==data_approach4) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, caution should be taken with how missing data in dependent variables is handled, as CCA may be more reliable.")
} else if (better_model=="CCA" & approach==data_approach4) {
print("For this particular data experiment, complete case analysis produced the most effective model. This should be a reliable approach for this missing data problem.")
} else if (better_model=="MI" & approach==data_approach5) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This should be a reliable approach for this missing data problem.")
} else if (better_model=="CCA" & approach==data_approach5) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, the MCAR hypothesis is either not supported or unclear so caution is advised on the results produced.")
} else if (better_model=="MI" & approach==data_approach6) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, it is recommended that a pattern mixture model is considered also.")
} else if (better_model=="CCA" & approach==data_approach6) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, it is recommended that a pattern mixture model is considered also.")
} else {
print("No model evaluation found.")
}
[1] "For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This should be a reliable approach for this missing data problem."
–Last updated: August 2023
–End–