This is an exemplar workbook showing how the Vaar Notebook can be
applied to UCI’s Cervical Cancer (Risk Factors) dataset (Fernandes et
al., 2017) to return:
- Whether data is more likely to be MCAR or MAR/MNAR
- A recommended approach for managing missing data that considers
stability of model results
- Evaluation of imputation efforts based on data model experiment
results
Data.table package used to read in file from website.(Fernandes et al., 2017)
library(data.table)
Registered S3 method overwritten by 'data.table':
method from
print.data.table
data.table 1.14.6 using 10 threads (see ?getDTthreads). Latest news: r-datatable.com
df <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/00383/risk_factors_cervical_cancer.csv', header=TRUE, stringsAsFactors = FALSE)
Downloaded 4021 bytes...
Downloaded 8117 bytes...
Downloaded 12213 bytes...
Downloaded 16309 bytes...
Downloaded 28597 bytes...
Downloaded 40885 bytes...
Downloaded 44981 bytes...
Downloaded 53173 bytes...
Downloaded 65461 bytes...
Downloaded 77749 bytes...
Downloaded 94133 bytes...
Downloaded 98229 bytes...
Downloaded 102059 bytes...
head(df)
summary(df)
Age Number of sexual partners First sexual intercourse Num of pregnancies Smokes Smokes (years)
Min. :13.00 Length:858 Length:858 Length:858 Length:858 Length:858
1st Qu.:20.00 Class :character Class :character Class :character Class :character Class :character
Median :25.00 Mode :character Mode :character Mode :character Mode :character Mode :character
Mean :26.82
3rd Qu.:32.00
Max. :84.00
Smokes (packs/year) Hormonal Contraceptives Hormonal Contraceptives (years) IUD IUD (years)
Length:858 Length:858 Length:858 Length:858 Length:858
Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character
STDs STDs (number) STDs:condylomatosis STDs:cervical condylomatosis STDs:vaginal condylomatosis
Length:858 Length:858 Length:858 Length:858 Length:858
Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character
STDs:vulvo-perineal condylomatosis STDs:syphilis STDs:pelvic inflammatory disease STDs:genital herpes
Length:858 Length:858 Length:858 Length:858
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
STDs:molluscum contagiosum STDs:AIDS STDs:HIV STDs:Hepatitis B STDs:HPV
Length:858 Length:858 Length:858 Length:858 Length:858
Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character
STDs: Number of diagnosis STDs: Time since first diagnosis STDs: Time since last diagnosis Dx:Cancer Dx:CIN
Min. :0.00000 Length:858 Length:858 Min. :0.00000 Min. :0.00000
1st Qu.:0.00000 Class :character Class :character 1st Qu.:0.00000 1st Qu.:0.00000
Median :0.00000 Mode :character Mode :character Median :0.00000 Median :0.00000
Mean :0.08741 Mean :0.02098 Mean :0.01049
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :3.00000 Max. :1.00000 Max. :1.00000
Dx:HPV Dx Hinselmann Schiller Citology Biopsy
Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.0000
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000 Median :0.0000
Mean :0.02098 Mean :0.02797 Mean :0.04079 Mean :0.08625 Mean :0.05128 Mean :0.0641
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.0000
Lots of variables imported as characters as there is missing data and lots of boolean variables also.
str(df)
Classes ‘data.table’ and 'data.frame': 858 obs. of 36 variables:
$ Age : int 18 15 34 52 46 42 51 26 45 44 ...
$ Number of sexual partners : chr "4.0" "1.0" "1.0" "5.0" ...
$ First sexual intercourse : chr "15.0" "14.0" "?" "16.0" ...
$ Num of pregnancies : chr "1.0" "1.0" "1.0" "4.0" ...
$ Smokes : chr "0.0" "0.0" "0.0" "1.0" ...
$ Smokes (years) : chr "0.0" "0.0" "0.0" "37.0" ...
$ Smokes (packs/year) : chr "0.0" "0.0" "0.0" "37.0" ...
$ Hormonal Contraceptives : chr "0.0" "0.0" "0.0" "1.0" ...
$ Hormonal Contraceptives (years) : chr "0.0" "0.0" "0.0" "3.0" ...
$ IUD : chr "0.0" "0.0" "0.0" "0.0" ...
$ IUD (years) : chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs : chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs (number) : chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs:condylomatosis : chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs:cervical condylomatosis : chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs:vaginal condylomatosis : chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs:vulvo-perineal condylomatosis: chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs:syphilis : chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs:pelvic inflammatory disease : chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs:genital herpes : chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs:molluscum contagiosum : chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs:AIDS : chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs:HIV : chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs:Hepatitis B : chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs:HPV : chr "0.0" "0.0" "0.0" "0.0" ...
$ STDs: Number of diagnosis : int 0 0 0 0 0 0 0 0 0 0 ...
$ STDs: Time since first diagnosis : chr "?" "?" "?" "?" ...
$ STDs: Time since last diagnosis : chr "?" "?" "?" "?" ...
$ Dx:Cancer : int 0 0 0 1 0 0 0 0 1 0 ...
$ Dx:CIN : int 0 0 0 0 0 0 0 0 0 0 ...
$ Dx:HPV : int 0 0 0 1 0 0 0 0 1 0 ...
$ Dx : int 0 0 0 0 0 0 0 0 1 0 ...
$ Hinselmann : int 0 0 0 0 0 0 1 0 0 0 ...
$ Schiller : int 0 0 0 0 0 0 1 0 0 0 ...
$ Citology : int 0 0 0 0 0 0 0 0 0 0 ...
$ Biopsy : int 0 0 0 0 0 0 1 0 0 0 ...
- attr(*, ".internal.selfref")=<externalptr>
library(tidyverse)
library(janitor)
library(naniar)
#clean variable names
df <- df %>%
clean_names()
dfcp <- df #copy data
#assign value NA to "?"
dfcp <- dfcp %>% replace_with_na_all(condition = ~.x == "?")
Summary identified that st_ds_cervical_condylomatosis and st_ds_aids have no variance. Removing them from dataframe as there are lots of variables in this data and these will not add value to the model.
dfcp <- select(dfcp,-c(st_ds_aids))
dfcp <- select(dfcp,-c(st_ds_cervical_condylomatosis))
Duplicates checked using n_distinct in tidyverse package
n_distinct(dfcp)
[1] 835
#filter to check and review any duplicate records
duplicates <- dfcp %>%
filter(duplicated(.))
print(duplicates)
There are no unique identifiers, so these may not be duplicates and will not be removed.
Packages used: visdat to explore missing values, ggplot to explore
missing data further and naniar for geom_miss_point function and MCAR
test.
There will be a warning with MCAR test if non-numeric columns
are present. If p-value >0.05 likely to be MCAR as not statistically
significant. The naniar test may also error due to singular
data and a fix has not been found for this yet (August 2023). However,
the data owners (Fernandes et al., 2017) advise that several patients
did not answer questions due to privacy concerns indicating that some of
the data is likely to be MNAR.
There are four target variables in this dataset: Hinselmann, Schiller, Cytology and Biopsy.(Sindiani AM et al, 2020) investigated the risk factors associated with cervical cancer to study their possible association with the decision to take a cervical biopsy. The data problem to be tested in this notebook is the ability to accurately classify the decision to take a cervical biopsy.
library(visdat)
library(ggplot2)
vis_miss(dfcp)
Warning: `gather_()` was deprecated in tidyr 1.2.0.
Please use `gather()` instead.
miss_var_summary(dfcp)
#mcar_test(dfcp) # computationally singular
If the statistic is high and the p.value <0.05 it is likely to be MNAR or MAR and cannot be ignored. A p-value >0.05 indicates MCAR but other information will be considered also. The value is set as a variable in the following code. As the MCAR Test errored due to data singularity, “error” has been entered here.
#set MCAR Test p.value as variable
mcar_presult = "error"
if(mcar_presult=="error" || is.numeric(mcar_presult)){print(paste(mcar_presult, "is MCAR Test p.value variable value."))
}else{
print("Please enter either a number or 'error' as the mcar_presult variable value.")}
[1] "error is MCAR Test p.value variable value."
If so, this indicates that the data could be missing at random (MAR) or missing not at random (MNAR) and can therefore not be ignored. Yes or No is set as a variable in the following code.
#set likelihood variable
likelihood = "Yes"
if(likelihood=="Yes" || likelihood=="No") {print(paste(likelihood, "is variable value for likelihood."))
}else{
print("Please enter either 'Yes' or 'No' as the variable value for likelihood.")}
[1] "Yes is variable value for likelihood."
If so, this suggests that complete case analysis may be the best option. However, other factors need to be considered also. Yes or No is set as variable in the following code.
#set dependent missingness variable
dependent_only = "No"
if(dependent_only=="Yes" || dependent_only=="No") {print(paste(dependent_only, "is variable value for dependent_only."))
}else{
print("Please enter either 'Yes' or 'No' as the variable value for dependent_only.")}
[1] "No is variable value for dependent_only."
If it’s between 0.1-0.5 multiple imputation is likely to result in a more effective, unbiased model. The value is set as a variable in the following code in the format 0.000. This is taken from the missing data visualisation.
#set missingness variable value
missingness = 0.117
if(is.numeric(missingness)){print(paste(missingness, "is missingness variable value."))
}else{
print("Please enter a number as the missingness variable value.")}
[1] "0.117 is missingness variable value."
Correlation for numerical variables.There are a large number of variables and indications of co-correlation in the matrix, so feature selection will be important.
#subset numerics for correlation only
#this will remove dependent variables also
dfcp_cor <- select_if(dfcp, is.numeric)
cor(dfcp_cor)
age st_ds_number_of_diagnosis dx_cancer dx_cin dx_hpv dx hinselmann
age 1.000000000 -0.001605942 0.11033968 0.061443423 0.10172170 0.092635125 -0.003966847
st_ds_number_of_diagnosis -0.001605942 1.000000000 -0.01542288 0.008069606 -0.01542288 -0.002288585 0.076786976
dx_cancer 0.110339681 -0.015422882 1.00000000 -0.015071762 0.88650794 0.665647059 0.134263602
dx_cin 0.061443423 0.008069606 -0.01507176 1.000000000 -0.01507176 0.606938678 -0.021232519
dx_hpv 0.101721696 -0.015422882 0.88650794 -0.015071762 1.00000000 0.616327096 0.134263602
dx 0.092635125 -0.002288585 0.66564706 0.606938678 0.61632710 1.000000000 0.072214849
hinselmann -0.003966847 0.076786976 0.13426360 -0.021232519 0.13426360 0.072214849 1.000000000
schiller 0.103282769 0.130872847 0.15781160 0.009119105 0.15781160 0.098952103 0.650249194
citology -0.016862074 0.055114464 0.11344608 -0.023937652 0.11344608 0.088739964 0.192467108
biopsy 0.055955515 0.097448921 0.16090497 0.113172334 0.16090497 0.157606644 0.547416628
schiller citology biopsy
age 0.103282769 -0.01686207 0.05595552
st_ds_number_of_diagnosis 0.130872847 0.05511446 0.09744892
dx_cancer 0.157811599 0.11344608 0.16090497
dx_cin 0.009119105 -0.02393765 0.11317233
dx_hpv 0.157811599 0.11344608 0.16090497
dx 0.098952103 0.08873996 0.15760664
hinselmann 0.650249194 0.19246711 0.54741663
schiller 1.000000000 0.36148649 0.73320388
citology 0.361486486 1.00000000 0.32746639
biopsy 0.733203881 0.32746639 1.00000000
library(psych) #for skewness function
skew(dfcp_cor)
rpart and rpart.plot packages are used to plot a simple classification tree.
library(rpart)
library(rpart.plot)
dfcp %>%
add_prop_miss() %>%
rpart(prop_miss_all ~ ., data=.) %>%
prp(type=4, extra = 101, roundint=FALSE, prefix="Prop.Miss = ")
Variable value for missingness influencer set in code below.
#set value for missingness influencer
miss_influencer="st_ds_number_of_diagnosis"
library(mice)
Attaching package: ‘mice’
The following object is masked from ‘package:stats’:
filter
The following objects are masked from ‘package:base’:
cbind, rbind
fxpt <- fluxplot(dfcp)
# Variables with higher outflux may be more powerful predictors. More meaningful where data is missing from more than one variable.
dfcp <- as.data.frame(dfcp)
dfcp <- dfcp %>% mutate(across(c("smokes", "hormonal_contraceptives", "iud", "st_ds", "st_ds_condylomatosis", "st_ds_vaginal_condylomatosis", "st_ds_vulvo_perineal_condylomatosis", "st_ds_syphilis", "st_ds_pelvic_inflammatory_disease", "st_ds_genital_herpes", "st_ds_molluscum_contagiosum", "st_ds_hiv", "st_ds_hepatitis_b", "st_ds_hpv", "dx_cancer", "dx_cin", "dx_hpv", "dx", "hinselmann", "schiller", "citology", "biopsy"), as.factor))
dfcp <- dfcp %>% mutate(across(c("number_of_sexual_partners", "first_sexual_intercourse", "num_of_pregnancies", "smokes_years", "smokes_packs_year","hormonal_contraceptives_years", "iud_years", "st_ds_number","st_ds_time_since_first_diagnosis", "st_ds_time_since_last_diagnosis"), as.integer))
Create complete dataset by deleting records with missing values and run Multiple Imputation using MICE.
dfcp_cca <- dfcp[complete.cases(dfcp), ]
init = mice(dfcp, maxit=10)
meth = init$method=c("pmm", "pmm", "pmm", "pmm", "logreg", "pmm", "pmm", "logreg", "pmm", "logreg", "pmm", "logreg", "pmm", "logreg", "logreg", "logreg", "logreg", "logreg", "logreg", "logreg", "logreg", "logreg", "logreg", "pmm", "pmm", "pmm", "logreg", "logreg", "logreg", "logreg", "logreg", "logreg", "logreg", "logreg")
predM = init$predictorMatrix
#create imputed data
set.seed(123)
dfcp_imp = mice(dfcp, method=meth, predictorMatrix=predM, m=5, nnet.MaxNWts=3000)
#create dataset after imputation
dfcp_mi <- complete(dfcp_imp)
Warnings given, visualisations created to ensure complete data returned.
#check for missing data in MI and CCA dataset
library(visdat)
vis_miss(dfcp_mi)
vis_miss(dfcp_cca)
Use the variable with missing data which has the most influence on proportion of missingness, according to step 5 and look at the range of values in this variable. Generate imputations under delta adjustment to imitate deviations from MAR (Gink and Van Buuren, no date). The code uses 0 for MAR and a reasonable range based on the variable range as the delta values to test MNAR.
#Generate imputations under delta adjustment
delta <- c(0, +0.2, +0.4, +0.6, +0.8) #mean is 0.8, max is 3
imp.d <- vector("list", length(delta))
post <-dfcp_imp$post
for (i in 1:length(delta)) {
d <- delta[i]
cmd <- paste("imp[[j]][,i] <- imp[[j]][,i] +", d)
post["st_ds_number_of_diagnosis"] <- cmd
imp <- mice(dfcp, post = post, maxit = 5,
seed = i, print=FALSE)
imp.d[[i]] <- imp
}
The first plot is based on no delta adjustment and the second plot is the highest adjustment. The Y axis scale is set to the same value, so that differences can be visualised more easily.
densityplot(imp.d[[1]],lwd=3, ylim=c(0,3))
densityplot(imp.d[[5]],lwd=3, ylim=c(0,3))
Find out more about sensitivity analysis in the context of missing data from Gerko Vink and Stef van Buuren’s vignette mice: An approach to sensitivity analysis (Vink and van Buuren, no date).
The second density plot considers imputations under the largest adjustment, looking at the blue line in particular there does seem to be some change.
Create complete datasets and check imputations have worked by visualising missing data.
#delta 1 is the dfcp_mi data
dfcp_mi_d2 <- complete(imp.d[[2]])
dfcp_mi_d3 <- complete(imp.d[[3]])
dfcp_mi_d4 <- complete(imp.d[[4]])
dfcp_mi_d5 <- complete(imp.d[[5]])
vis_miss(dfcp_mi_d2)
vis_miss(dfcp_mi_d3)
vis_miss(dfcp_mi_d4)
vis_miss(dfcp_mi_d5)
There can be valuable information in outliers and it is good practice to remove obvious errors only - such as impossible values for particular variables. No obvious errors identified.
library(caret)
#subset numerics for correlation only
#this will remove dependent variables also
dfcp_mi_cor <- dfcp_mi
dfcp_mi_cor <- select_if(dfcp_mi_cor, is.numeric)
cor(dfcp_mi_cor)
Co-correlation is indicated between st_ds_time_since_first_diagnosis and st_ds_time_since_first_diagnosis. Look at further feature selection methods as number of booleans are so high.
# ensure the results are reproducible
set.seed(7)
# load the library
library(mlbench)
Warning: package ‘mlbench’ was built under R version 4.2.3
# rf control
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm against one of the target variables only
results <- rfe(dfcp_mi[,1:30],dfcp_mi[,34], sizes=c(1:30), rfeControl=control)
print(results)
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold)
Resampling performance over subset size:
The top 5 variables (out of 9):
st_ds_time_since_first_diagnosis, st_ds_time_since_last_diagnosis, st_ds_number_of_diagnosis, dx_cancer, st_ds
# list features
predictors(results)
[1] "st_ds_time_since_first_diagnosis" "st_ds_time_since_last_diagnosis" "st_ds_number_of_diagnosis"
[4] "dx_cancer" "st_ds" "st_ds_number"
[7] "dx_hpv" "dx" "st_ds_molluscum_contagiosum"
The variable st_ds_time_since_last_diagnosis was slightly lower in importance than the one it has high correlation with, so this one will be removed.
#remove redundant feature from all datasets
dfcp_mi <- dfcp_mi %>% select(-st_ds_time_since_last_diagnosis)
dfcp_cca <- dfcp_cca %>% select(-st_ds_time_since_last_diagnosis)
dfcp_mi_d2 <- dfcp_mi_d2 %>% select(-st_ds_time_since_last_diagnosis)
dfcp_mi_d3 <- dfcp_mi_d3 %>% select(-st_ds_time_since_last_diagnosis)
dfcp_mi_d4 <- dfcp_mi_d4 %>% select(-st_ds_time_since_last_diagnosis)
dfcp_mi_d5 <- dfcp_mi_d5 %>% select(-st_ds_time_since_last_diagnosis)
Scaling data allows models to compare relative relationships between data points more effectively.
#scale the numeric columns in all the datasets to be used in the test models
dfcp_mi <- dfcp_mi %>% mutate(across(where(is.numeric), scale))
dfcp_cca <- dfcp_cca %>% mutate(across(where(is.numeric), scale))
dfcp_mi_d5 <- dfcp_mi_d5 %>% mutate(across(where(is.numeric), scale))
Seeds set for reproducibility.
#mi data
set.seed(123)
index <- sample(2, nrow(dfcp_mi),
replace = TRUE,
prob = c(0.7, 0.3))
train_mi <- dfcp_mi[index==1,]
test_mi <- dfcp_mi[index==2,]
#cca data
set.seed(123)
index1 <- sample(2, nrow(dfcp_cca),
replace = TRUE,
prob = c(0.7, 0.3))
train_cca <- dfcp_cca[index1==1,]
test_cca <- dfcp_cca[index1==2,]
#delta5 data
set.seed(123)
index5 <- sample(2, nrow(dfcp_mi_d5),
replace = TRUE,
prob = c(0.7, 0.3))
train_d5 <- dfcp_mi_d5[index5==1,]
test_d5 <- dfcp_mi_d5[index5==2,]
Start with baseline model using multiple imputation. Random Forest classification for each dependent variable.
library(randomForest)
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:psych’:
outlier
The following object is masked from ‘package:dplyr’:
combine
The following object is masked from ‘package:ggplot2’:
margin
#Create x and y values
#dfcp_mi data
x <- train_mi[,-33] #remove dependent variable
y <- as.factor(train_mi$biopsy)
set.seed(345) #for reproducibility
#Fit model with training data and review details
rf = randomForest(x = x,
y = y,
ntree = 500)
rf
Call:
randomForest(x = x, y = y, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of error rate: 1.99%
Confusion matrix:
0 1 class.error
0 560 1 0.001782531
1 11 31 0.261904762
#predict results
xt <- test_mi[,-33] #remove dependent variable like class
yt <-as.factor(test_mi$biopsy) #keep class only
pred_yt <- predict(rf, newdata=xt)
#check accuracy
confusionMatrix(pred_yt, yt, mode="everything")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 242 5
1 0 8
Accuracy : 0.9804
95% CI : (0.9548, 0.9936)
No Information Rate : 0.949
P-Value [Acc > NIR] : 0.009348
Kappa : 0.7523
Mcnemar's Test P-Value : 0.073638
Sensitivity : 1.0000
Specificity : 0.6154
Pos Pred Value : 0.9798
Neg Pred Value : 1.0000
Precision : 0.9798
Recall : 1.0000
F1 : 0.9898
Prevalence : 0.9490
Detection Rate : 0.9490
Detection Prevalence : 0.9686
Balanced Accuracy : 0.8077
'Positive' Class : 0
varImpPlot(rf)
CCA data model.
#Create x and y values
#dfcp_cca data
x1 <- train_cca[,-33]
y1 <- as.factor(train_cca$biopsy)
set.seed(345) #for reproducibility
#Fit model with training data and review details
rf1 = randomForest(x = x1,
y = y1,
ntree = 500)
rf1
Call:
randomForest(x = x1, y = y1, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of error rate: 12.2%
Confusion matrix:
0 1 class.error
0 35 0 0.0000000
1 5 1 0.8333333
#predict results
x1t <- test_cca[,-33] #remove dependent variable like class
y1t <-as.factor(test_cca$biopsy) #keep class only
pred_y1t <- predict(rf1, newdata=x1t)
#check accuracy
confusionMatrix(pred_y1t, y1t, mode="everything")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 15 2
1 0 1
Accuracy : 0.8889
95% CI : (0.6529, 0.9862)
No Information Rate : 0.8333
P-Value [Acc > NIR] : 0.4027
Kappa : 0.4545
Mcnemar's Test P-Value : 0.4795
Sensitivity : 1.0000
Specificity : 0.3333
Pos Pred Value : 0.8824
Neg Pred Value : 1.0000
Precision : 0.8824
Recall : 1.0000
F1 : 0.9375
Prevalence : 0.8333
Detection Rate : 0.8333
Detection Prevalence : 0.9444
Balanced Accuracy : 0.6667
'Positive' Class : 0
varImpPlot(rf1)
The Variable Importance plots are very different across the two models, due to the data that is lost through CCA.
The following code block captures the model which produced the best results, based on Kappa and core metric of balanced accuracy.
# set model variable for best performance as MI or CCA
better_model = "MI"
if(better_model=="MI" || better_model=="CCA") {print(paste(better_model, "produced the best result in this data experiment test."))
}else{
print("Please enter either 'MI' or 'CCA' as the variable value for better_model.")}
[1] "MI produced the best result in this data experiment test."
Code block included to edit for delta model.
#Create x and y values
#dfcp_cca data
x5 <- train_d5[,-33]
y5 <- as.factor(train_d5$biopsy)
set.seed(345) #for reproducibility
#Fit model with training data and review details
rf5 = randomForest(x = x5,
y = y5,
ntree = 500)
rf5
Call:
randomForest(x = x5, y = y5, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of error rate: 5.31%
Confusion matrix:
0 1 class.error
0 551 10 0.01782531
1 22 20 0.52380952
#test with test data
x5t <- test_d5[,-33]
y5t <-as.factor(test_d5$biopsy)
pred_y5t <- predict(rf5, newdata=x5t)
#check accuracy
confusionMatrix(pred_y5t, y5t, mode="everything")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 238 3
1 4 10
Accuracy : 0.9725
95% CI : (0.9443, 0.9889)
No Information Rate : 0.949
P-Value [Acc > NIR] : 0.04973
Kappa : 0.7263
Mcnemar's Test P-Value : 1.00000
Sensitivity : 0.9835
Specificity : 0.7692
Pos Pred Value : 0.9876
Neg Pred Value : 0.7143
Precision : 0.9876
Recall : 0.9835
F1 : 0.9855
Prevalence : 0.9490
Detection Rate : 0.9333
Detection Prevalence : 0.9451
Balanced Accuracy : 0.8764
'Positive' Class : 0
varImpPlot(rf5)
The impact result is captured as a Yes or No response in the code below. The Kappa is lower but the Balanced Accuracy is higher as specificity is improved. There has been a big impact on results.
#Is there a big impact on results?
delta_impact = 'Yes'
if(delta_impact=="Yes" || delta_impact=="No") {print(paste(delta_impact, "is variable value for delta_impact."))
}else{
print("Please enter either 'Yes' or 'No' as the variable value for delta_impact.")}
[1] "Yes is variable value for delta_impact."
The following code will print a summary, based on the variable information recorded in earlier code blocks.
print(paste(missingness, "is missing data level."))
[1] "0.117 is missing data level."
print(paste("Is data more more likely to be missing in some variables than others?", likelihood))
[1] "Is data more more likely to be missing in some variables than others? Yes"
print(paste("Is data missing from dependent variable(s) only?", dependent_only))
[1] "Is data missing from dependent variable(s) only? No"
print(paste("Does higher delta adjustment have big impact on results?", delta_impact))
[1] "Does higher delta adjustment have big impact on results? Yes"
if (likelihood=="No" & mcar_presult=="error") {
hypothesis="Missing data pattern provides some support for MCAR hypothesis. However, no result available for MCAR Test."
} else if (likelihood=="No" & mcar_presult<0.05) {
hypothesis="Hypothesis unproven. MCAR Test and missing data pattern indicate different missing data mechanisms."
} else if (likelihood=="No" & mcar_presult>=0.05) {
hypothesis="MCAR hypothesis supported by MCAR Test and missing data pattern."
} else if(likelihood=="Yes" & mcar_presult=="error") {
hypothesis="Missing data pattern provides some support for MAR/MNAR hypothesis. However, no result available for MCAR Test."
} else if (likelihood=="Yes" & mcar_presult<0.05) {
hypothesis="MAR/MNAR hypothesis supported by MCAR Test and missing data pattern."
}else if (likelihood=="Yes" & mcar_presult>=0.05) {
hypothesis="Hypothesis unproven. MCAR Test and missing data pattern indicate different missing data mechanisms."
}else {
print("No hypothesis found.")
}
print(hypothesis)
[1] "Missing data pattern provides some support for MAR/MNAR hypothesis. However, no result available for MCAR Test."
#define different recommended approaches
data_approach1 = "Due to overall level of missing data (or level of missing data from dependent variable only) there is a risk that a poor model will be returned, regardless of method used, that would not generalise well on new data."
data_approach2 = "As MCAR hypothesis is supported and missing data is less than 10%, complete case analysis is likely to produce unbiased results. However, multiple imputation may produce a more effective model in some circumstances."
data_approach3 = "MCAR hypothesis is supported. However, as missing data is between 10-50%, multiple imputation is likely to produce a more effective model."
data_approach4 = "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. However, as missing data is less than 10% and missing from the dependent variable only, complete case analysis may be the more reliable approach."
data_approach5 = "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. Also, as missing data is less than 50% and the delta sensitivity analysis suggests the results are relatively stable, multiple imputation is the recommended approach."
data_approach6 = "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. The delta sensitivity analysis suggests the results are not stable and indicative of MNAR data. Therefore, a pattern mixture model is recommended for further consideration. There are a couple of R mice package functions worth exploring further for this: mice.impute.ri and mice.impute.mnar.logreg (van Buuren et al., 2023)."
#define conditional statements
if (missingness>=0.5) {
approach=data_approach1
} else if (missingness>0.1 & dependent_only=='Yes') {
approach=data_approach1
} else if (hypothesis=="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness<=0.1) {
approach=data_approach2
} else if (hypothesis=="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness>0.1 & missingness<0.5) {
approach=data_approach3
} else if(hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness<=0.1 & dependent_only=="Yes") {
approach=data_approach4
} else if(hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness<=0.5 & delta_impact=="No") {
approach=data_approach5
} else if(hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern." & delta_impact=="Yes") {
approach=data_approach6
}else {
print("No approach found.")
}
print(approach)
[1] "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. The delta sensitivity analysis suggests the results are not stable and indicative of MNAR data. Therefore, a pattern mixture model is recommended for further consideration. There are a couple of R mice package functions worth exploring further for this: mice.impute.ri and mice.impute.mnar.logreg (van Buuren et al., 2023)."
if (better_model=="MI" & approach==data_approach1) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, there is a risk that it would not generalise well on new data.")
} else if (better_model=="CCA" & approach==data_approach1) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, there is a risk that it would not generalise well on new data.")
} else if (better_model=="MI" & approach==data_approach2) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, complete case analysis would be likely to produce unbiased results also.")
} else if (better_model=="CCA" & approach==data_approach2) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, multiple imputation may produce a more effective model in some circumstances.")
} else if (better_model=="MI" & approach==data_approach3) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This result is expected, given variables provided.")
} else if (better_model=="CCA" & approach==data_approach3) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, caution should be noted over results due to high level of missingness.")
} else if (better_model=="MI" & approach==data_approach4) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, caution should be taken with how missing data in dependent variables is handled, as CCA may be more reliable.")
} else if (better_model=="CCA" & approach==data_approach4) {
print("For this particular data experiment, complete case analysis produced the most effective model. This should be a reliable approach for this missing data problem.")
} else if (better_model=="MI" & approach==data_approach5) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This should be a reliable approach for this missing data problem.")
} else if (better_model=="CCA" & approach==data_approach5) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, the MCAR hypothesis is either not supported or unclear so caution is advised on the results produced.")
} else if (better_model=="MI" & approach==data_approach6) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, it is recommended that a pattern mixture model is considered also.")
} else if (better_model=="CCA" & approach==data_approach6) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, it is recommended that a pattern mixture model is considered also.")
} else {
print("No model evaluation found.")
}
[1] "For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, it is recommended that a pattern mixture model is considered also."
–Last updated: August 2023
–End–