This is an exemplar workbook showing how the Vaar Notebook can be
applied to the UCI Breast Cancer Wisconsin (Original) dataset (Wolberg, 1992)
to return:
- Whether the data are more likely to be MCAR or MAR/MNAR
- A recommended approach for managing missing data that considers the
stability of model results
- An evaluation of imputation efforts based on data model experiment
results
The data.table package is used to read the file from the UCI website. This is a binary classification problem, diagnosing breast cancer malignancy, where the malignant diagnosis is the minority class.
library(data.table)
df <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=FALSE, stringsAsFactors = FALSE)
head(df)
summary(df)
V1 V2 V3 V4 V5
Min. : 61634 Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
1st Qu.: 870688 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
Median : 1171710 Median : 4.000 Median : 1.000 Median : 1.000 Median : 1.000
Mean : 1071704 Mean : 4.418 Mean : 3.134 Mean : 3.207 Mean : 2.807
3rd Qu.: 1238298 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 4.000
Max. :13454352 Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
V6 V7 V8 V9 V10
Min. : 1.000 Length:699 Min. : 1.000 Min. : 1.000 Min. : 1.000
1st Qu.: 2.000 Class :character 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000
Median : 2.000 Mode :character Median : 3.000 Median : 1.000 Median : 1.000
Mean : 3.216 Mean : 3.438 Mean : 2.867 Mean : 1.589
3rd Qu.: 4.000 3rd Qu.: 5.000 3rd Qu.: 4.000 3rd Qu.: 1.000
Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
V11
Min. :2.00
1st Qu.:2.00
Median :2.00
Mean :2.69
3rd Qu.:4.00
Max. :4.00
library(tidyverse) # to use pipes
#count obs per class
df %>%
count(V11)
Note in the summary of the dataframe that V7 is the only character variable; the others are integers. There are 241 malignant and 458 benign observations in the data.
The tidyverse (dplyr's rename()) is used to give the variables meaningful names.
dfcp <- df #copy data
#rename variables to something more meaningful
dfcp <- rename(dfcp,
sampleID=V1, ClumpThickness=V2, CellSize=V3, CellShape=V4, Adhesion=V5,
SingleESize=V6, BareNuclei=V7, BlandChromatin=V8, NormalNucleoli=V9,
Mitoses=V10, Diagnosis=V11)
#Convert V7 variable to integer also
#This is now named BareNuclei
dfcp$BareNuclei <- as.integer(dfcp$BareNuclei)
Warning: NAs introduced by coercion
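The warning is expected: missing entries in the source file are recorded as '?', which cannot be coerced to integer and so become NA. A small optional check of how many NAs were introduced:
#count the NAs introduced by coercing '?' to integer
sum(is.na(dfcp$BareNuclei))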
Duplicates are checked using n_distinct() from the tidyverse.
n_distinct(dfcp)
[1] 691
#filter to check and review any duplicate records
duplicates <- dfcp %>%
filter(duplicated(.))
print(duplicates)
There are 691 unique records out of 699 observations (8 duplicates printed).
dfcp <- dfcp %>% distinct() #691 obs remaining
dfcp <- select(dfcp,-c(sampleID)) #remove sampleID column 10 variables remain
Packages used: visdat to explore missing values, ggplot2 to explore the
missing data further, and naniar for the geom_miss_point() function and the
MCAR test. The MCAR test will give a warning if non-numeric columns
are present. If the p-value is >0.05, the data are likely to be MCAR, as the
result is not statistically significant.
library(visdat)
library(ggplot2)
library(naniar)
vis_miss(dfcp)
miss_var_summary(dfcp)
ggplot(dfcp,
aes(x = Diagnosis,
y = BareNuclei)) +
geom_miss_point() +
ggtitle("Graph Showing Missing Data Pattern By Class Diagnosis")
mcar_test(dfcp)
If the statistic is high and the p-value is <0.05, the data are likely to be MNAR or MAR and cannot be ignored. A p-value >0.05 indicates MCAR, but other information will be considered also. The value is set as a variable in the following code.
#set MCAR Test p.value as variable
mcar_presult = 0.2168686
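Rather than hard-coding the number, the p-value could be pulled from the test object directly; a sketch (mcar_test() returns it as a column):
#extract the p-value programmatically
mcar_presult <- mcar_test(dfcp)$p.value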
If data are more likely to be missing in some variables than others, this indicates that the data could be missing at random (MAR) or missing not at random (MNAR) and therefore cannot be ignored. Yes or No is set as a variable in the following code.
#set likelihood variable
likelihood = "Yes"
If data are missing from the dependent variable(s) only, this suggests that complete case analysis may be the best option. However, other factors need to be considered also. Yes or No is set as a variable in the following code.
#set dependent missingness variable
dependent_only = "No"
If the overall level of missingness is between 20% and 50%, multiple imputation is likely to result in a more effective, unbiased model. The value is set as a variable in the following code.
#set missingness variable value
missingness = 0.02315485
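This value matches the highest variable-level missingness (BareNuclei) reported by miss_var_summary() above. A sketch for deriving it rather than typing it in, assuming that is the quantity the notebook records:
#derive from miss_var_summary rather than hard-coding
missingness <- max(miss_var_summary(dfcp)$pct_miss) / 100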
It is good practice to test different approaches, and the Vaar Notebooks will consider these, along with the variable values just set, in their recommendations.
Visualise correlation for numerical variables.
dfcp_x <- select_if(dfcp, is.numeric)
cor(dfcp_x)
ClumpThickness CellSize CellShape Adhesion SingleESize BareNuclei BlandChromatin
ClumpThickness 1.0000000 0.6433396 0.6537521 0.4879492 0.5174478 NA 0.5610764
CellSize 0.6433396 1.0000000 0.9054195 0.7131170 0.7471112 NA 0.7595252
CellShape 0.6537521 0.9054195 1.0000000 0.6909890 0.7143934 NA 0.7384546
Adhesion 0.4879492 0.7131170 0.6909890 1.0000000 0.6084773 NA 0.6698127
SingleESize 0.5174478 0.7471112 0.7143934 0.6084773 1.0000000 NA 0.6205176
BareNuclei NA NA NA NA NA 1 NA
BlandChromatin 0.5610764 0.7595252 0.7384546 0.6698127 0.6205176 NA 1.0000000
NormalNucleoli 0.5357118 0.7272394 0.7246932 0.6024535 0.6340581 NA 0.6690593
Mitoses 0.3503536 0.4600643 0.4405924 0.4171673 0.4826440 NA 0.3438210
Diagnosis 0.7169385 0.8177198 0.8176933 0.7013708 0.6812326 NA 0.7566183
NormalNucleoli Mitoses Diagnosis
ClumpThickness 0.5357118 0.3503536 0.7169385
CellSize 0.7272394 0.4600643 0.8177198
CellShape 0.7246932 0.4405924 0.8176933
Adhesion 0.6024535 0.4171673 0.7013708
SingleESize 0.6340581 0.4826440 0.6812326
BareNuclei NA NA NA
BlandChromatin 0.6690593 0.3438210 0.7566183
NormalNucleoli 1.0000000 0.4276435 0.7155401
Mitoses 0.4276435 1.0000000 0.4241115
Diagnosis 0.7155401 0.4241115 1.0000000
vis_cor(dfcp_x)
There is a high level of correlation between the variables, with likely collinearity between CellSize and CellShape in this dataset. Note the NA row and column for BareNuclei, caused by its missing values.
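A sketch using pairwise-complete observations recovers the BareNuclei correlations without imputing anything:
#use pairwise-complete observations so BareNuclei correlations are not NA
cor(dfcp_x, use = "pairwise.complete.obs")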
library(GGally) #ggplot2 extension for pairs matrix
#Change Diagnosis from integer to factor
dfcp$Diagnosis <- as.factor(dfcp$Diagnosis)
pm <- ggpairs(dfcp, columns = 1:10, ggplot2::aes(colour = Diagnosis), lower=list(combo=wrap("facethist",
binwidth=0.5)))
pm
The majority class (benign) has many more outliers, and there is a high positive skew in the benign class in the data. There is a moderate negative skew in some of the malignant class variables.
library(psych) #for skewness function
skew(dfcp_x)
[1] 0.5889370 1.2278924 1.1586510 1.5010237 1.7115854 0.9912070 1.0979701 1.4013807 3.5291814
[10] 0.6533674
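The skewness values are returned unnamed; a small sketch to label them by variable for readability:
#label the skewness values by variable name
setNames(skew(dfcp_x), names(dfcp_x))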
The rpart and rpart.plot packages are used to plot a simple decision tree that predicts each row's proportion of missing values, identifying which variables influence missingness.
library(rpart)
library(rpart.plot)
dfcp %>%
add_prop_miss() %>%
rpart(prop_miss_all ~ ., data=.) %>%
prp(type=4, extra = 101, roundint=FALSE, prefix="Prop.Miss = ")
Adhesion is returned for this dataset. The value for the missingness influencer variable is set in the code below.
#set value for missingness influencer
miss_influencer="Adhesion"
library(mice) # fluxplot() comes from mice, loaded here before first use
fxpt <- fluxplot(dfcp)
# Variables with higher outflux may be more powerful predictors. More meaningful where data is missing from more than one variable.
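The influx and outflux values behind the plot can also be read numerically; a sketch using mice's flux():
#numeric influx/outflux values behind the plot
flux(dfcp)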
Create complete dataset by deleting records with missing values and run Multiple Imputation using MICE.
dfcp_cca <- dfcp %>%
filter(!is.na(BareNuclei))
library(mice)
init = mice(dfcp, maxit=0)
meth = init$method
predM = init$predictorMatrix
#create imputed data
set.seed(123)
dfcp_imp = mice(dfcp, method=meth, predictorMatrix=predM, m=5)
iter imp variable
1 1 BareNuclei
1 2 BareNuclei
1 3 BareNuclei
1 4 BareNuclei
1 5 BareNuclei
2 1 BareNuclei
2 2 BareNuclei
2 3 BareNuclei
2 4 BareNuclei
2 5 BareNuclei
3 1 BareNuclei
3 2 BareNuclei
3 3 BareNuclei
3 4 BareNuclei
3 5 BareNuclei
4 1 BareNuclei
4 2 BareNuclei
4 3 BareNuclei
4 4 BareNuclei
4 5 BareNuclei
5 1 BareNuclei
5 2 BareNuclei
5 3 BareNuclei
5 4 BareNuclei
5 5 BareNuclei
#create dataset after imputation
dfcp_mi <- complete(dfcp_imp)
As warnings were given, check that imputation was possible.
#check for missing data in MI and CCA dataset
vis_miss(dfcp_mi)
vis_miss(dfcp_cca)
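A quick numeric check alongside the plots, using naniar's n_miss():
n_miss(dfcp_mi) #expect 0 after imputation
n_miss(dfcp_cca) #expect 0 after row deletion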
Use the variable that has the most influence on the proportion of missingness, according to step 5, and look at the range of values in this variable. Generate imputations under delta adjustment to imitate deviations from MAR (Vink and van Buuren, no date). The code uses 0 for MAR, and a reasonable range based on the variable's range as the delta values to test MNAR.
#Generate imputations under delta adjustment
delta <- c(0, +1, +2, +3, +4)
imp.d <- vector("list", length(delta))
post <- dfcp_imp$post
for (i in 1:length(delta)) {
d <- delta[i]
cmd <- paste("imp[[j]][,i] <- imp[[j]][,i] +", d)
post["BareNuclei"] <- cmd
imp <- mice(dfcp, post = post, maxit = 5,
seed = i, print=FALSE)
imp.d[[i]] <- imp
}
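To quantify the shift each delta produces, the mean of the imputed BareNuclei values can be compared across the five runs; a sketch (each mids object stores its imputations in $imp$BareNuclei):
#mean imputed BareNuclei value per delta run
sapply(imp.d, function(x) mean(unlist(x$imp$BareNuclei)))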
The first plot is based on no delta adjustment and the second plot is the highest adjustment. The Y axis scale is set to the same value, so that differences can be visualised more easily.
densityplot(imp.d[[1]],lwd=3, ylim=c(0,4))
densityplot(imp.d[[5]],lwd=3, ylim=c(0,4))
Find out more about sensitivity analysis in the context of missing data from Gerko Vink and Stef van Buuren’s vignette mice: An approach to sensitivity analysis (Vink and van Buuren, no date).
The second density plot considers imputations under the largest adjustment, which has not had much effect on the imputations. The blue line is fairly consistent between the two plots.
Create complete datasets and check imputations have worked by visualising missing data.
#review imputations
print(imp.d[[2]]$imp$BareNuclei)
print(imp.d[[3]]$imp$BareNuclei)
print(imp.d[[4]]$imp$BareNuclei)
print(imp.d[[5]]$imp$BareNuclei)
#imp.d[[1]] (delta = 0) corresponds to the unadjusted MI data (dfcp_mi)
dfcp_mi_d2 <- complete(imp.d[[2]])
dfcp_mi_d3 <- complete(imp.d[[3]])
dfcp_mi_d4 <- complete(imp.d[[4]])
dfcp_mi_d5 <- complete(imp.d[[5]])
vis_miss(dfcp_mi_d2)
vis_miss(dfcp_mi_d3)
vis_miss(dfcp_mi_d4)
vis_miss(dfcp_mi_d5)
There can be valuable information in outliers and it is good practice to remove obvious errors only - such as impossible values for particular variables. The large number of outliers observed in the benign class in the pairs chart could be an accurate representation of the natural variability in data and diagnostic challenge.
This code is an example of how to determine which variables have correlation above 0.9, using the caret package, as highly correlated features do not generally improve models. Any variables identified are then removed from all datasets.
library(caret)
dfcp_mi_cor <- cor(dfcp_mi %>% select(-Diagnosis))
dfcp_mi <- dfcp_mi %>% select(-findCorrelation(dfcp_mi_cor, cutoff = 0.9))
#show column name difference between original data and selected features
setdiff(names(dfcp), names(dfcp_mi))
[1] "CellSize"
CellSize has been removed.
#remove CellSize from other datasets also
dfcp_cca <- dfcp_cca %>% select(-CellSize)
dfcp_mi_d2 <- dfcp_mi_d2 %>% select(-CellSize)
dfcp_mi_d3 <- dfcp_mi_d3 %>% select(-CellSize)
dfcp_mi_d4 <- dfcp_mi_d4 %>% select(-CellSize)
dfcp_mi_d5 <- dfcp_mi_d5 %>% select(-CellSize)
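The repeated select() calls could equally be collapsed by holding the datasets in a named list; a sketch equivalent to the block above (run instead of it, not after):
#equivalent, using a named list of datasets
dsets <- lapply(list(cca = dfcp_cca, d2 = dfcp_mi_d2, d3 = dfcp_mi_d3,
d4 = dfcp_mi_d4, d5 = dfcp_mi_d5),
function(d) select(d, -CellSize))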
Scaling data allows models to compare relative relationships between data points more effectively. The Breast Cancer Wisconsin (Original) dataset is already on a common 1 to 10 scale.
This notebook uses a Support Vector Machine (SVM) Model. Other models are available.
Seeds set for reproducibility.
#mi data
set.seed(123)
index <- sample(2, nrow(dfcp_mi),
replace = TRUE,
prob = c(0.7, 0.3))
train_mi <- dfcp_mi[index==1,]
test_mi <- dfcp_mi[index==2,]
#cca data
set.seed(123)
index1 <- sample(2, nrow(dfcp_cca),
replace = TRUE,
prob = c(0.7, 0.3))
train_cca <- dfcp_cca[index1==1,]
test_cca <- dfcp_cca[index1==2,]
#delta2 data
set.seed(123)
index2 <- sample(2, nrow(dfcp_mi_d2),
replace = TRUE,
prob = c(0.7, 0.3))
train_d2 <- dfcp_mi_d2[index2==1,]
test_d2 <- dfcp_mi_d2[index2==2,]
#delta3 data
set.seed(123)
index3 <- sample(2, nrow(dfcp_mi_d3),
replace = TRUE,
prob = c(0.7, 0.3))
train_d3 <- dfcp_mi_d3[index3==1,]
test_d3 <- dfcp_mi_d3[index3==2,]
#delta4 data
set.seed(123)
index4 <- sample(2, nrow(dfcp_mi_d4),
replace = TRUE,
prob = c(0.7, 0.3))
train_d4 <- dfcp_mi_d4[index4==1,]
test_d4 <- dfcp_mi_d4[index4==2,]
#delta5 data
set.seed(123)
index5 <- sample(2, nrow(dfcp_mi_d5),
replace = TRUE,
prob = c(0.7, 0.3))
train_d5 <- dfcp_mi_d5[index5==1,]
test_d5 <- dfcp_mi_d5[index5==2,]
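The six blocks above differ only in their input data; a sketch of a reusable helper that reproduces the same splits (same seed and proportions):
#equivalent reusable split helper
split_data <- function(df, seed = 123, prob = c(0.7, 0.3)) {
set.seed(seed)
idx <- sample(2, nrow(df), replace = TRUE, prob = prob)
list(train = df[idx == 1, ], test = df[idx == 2, ])
}
#e.g. splits <- split_data(dfcp_mi) reproduces train_mi and test_mi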
This code is based on the vignette Comparing SVM kernels (Sallan, 2020).
library(e1071) #using SVM model from e1071 package
Attaching package: ‘e1071’
The following objects are masked from ‘package:misty’:
kurtosis, skewness
library(knitr) #to produce kable to visualise results
accuracy <- sapply(list(train_mi, train_cca, train_d2,
train_d3, train_d4, train_d5),
function(df){
svm_models <- lapply(list("linear", "poly", "radial", "sigmoid"), function(meth){
return(svm(df[,-9], as.factor(df$Diagnosis), kernel=meth))
})
names(svm_models) <- c("linear", "poly", "radial", "sigmoid")
accuracy <- sapply(svm_models, function(x) sum(x$fitted == df$Diagnosis)/nrow(df))
names(accuracy) <- names(svm_models)
return(accuracy)
})
accuracy <- data.frame(accuracy)
colnames(accuracy) <- c('train_mi', 'train_cca', 'train_d2',
'train_d3', 'train_d4', 'train_d5')
accuracy %>%
kable(digits = 3)
| kernel | train_mi | train_cca | train_d2 | train_d3 | train_d4 | train_d5 |
|---|---|---|---|---|---|---|
| linear | 0.969 | 0.977 | 0.969 | 0.965 | 0.969 | 0.965 |
| poly | 0.969 | 0.972 | 0.967 | 0.971 | 0.971 | 0.969 |
| radial | 0.973 | 0.979 | 0.973 | 0.969 | 0.971 | 0.971 |
| sigmoid | 0.960 | 0.968 | 0.960 | 0.960 | 0.960 | 0.960 |
According to this test against the training data, the SVM with a radial kernel produces the best result.
Start with baseline model using multiple imputation.
#Create x and y values
#dfcp_mi data
x <- train_mi[,-9]
y <- as.factor(train_mi$Diagnosis)
set.seed(345) #for reproducibility
#Fit model with training data and review details
model_svm <- svm(x, y, kernel='radial')
print(model_svm)
Call:
svm.default(x = x, y = y, kernel = "radial")
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 84
summary(model_svm)
Call:
svm.default(x = x, y = y, kernel = "radial")
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 84
( 26 58 )
Number of Classes: 2
Levels:
2 4
#test with test data
xt <- test_mi[,-9]
yt <- as.factor(test_mi$Diagnosis)
pred_yt <- predict(model_svm, xt)
#check accuracy
table(pred_yt, yt)
yt
pred_yt 2 4
2 135 1
4 5 70
confusionMatrix(pred_yt, yt, mode="everything")
Confusion Matrix and Statistics
Reference
Prediction 2 4
2 135 1
4 5 70
Accuracy : 0.9716
95% CI : (0.9391, 0.9895)
No Information Rate : 0.6635
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9372
Mcnemar's Test P-Value : 0.2207
Sensitivity : 0.9643
Specificity : 0.9859
Pos Pred Value : 0.9926
Neg Pred Value : 0.9333
Precision : 0.9926
Recall : 0.9643
F1 : 0.9783
Prevalence : 0.6635
Detection Rate : 0.6398
Detection Prevalence : 0.6445
Balanced Accuracy : 0.9751
'Positive' Class : 2
The Kappa result (0.9372) shows this is an excellent model, with high accuracy and balanced accuracy. The CCA test will be considered next.
#Create x and y values
#dfcp_cca data
x1 <- train_cca[,-9]
y1 <- as.factor(train_cca$Diagnosis)
set.seed(345) #for reproducibility
#Fit model with training data and review details
model_svm1 <- svm(x1, y1, kernel='radial')
print(model_svm1)
Call:
svm.default(x = x1, y = y1, kernel = "radial")
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 72
summary(model_svm1)
Call:
svm.default(x = x1, y = y1, kernel = "radial")
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 72
( 21 51 )
Number of Classes: 2
Levels:
2 4
#test with test data
x1t <- test_cca[,-9]
y1t <- as.factor(test_cca$Diagnosis)
pred_y1t <- predict(model_svm1, x1t)
#check accuracy
table(pred_y1t, y1t)
y1t
pred_y1t 2 4
2 141 2
4 7 55
confusionMatrix(pred_y1t, y1t, mode="everything")
Confusion Matrix and Statistics
Reference
Prediction 2 4
2 141 2
4 7 55
Accuracy : 0.9561
95% CI : (0.9183, 0.9797)
No Information Rate : 0.722
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8935
Mcnemar's Test P-Value : 0.1824
Sensitivity : 0.9527
Specificity : 0.9649
Pos Pred Value : 0.9860
Neg Pred Value : 0.8871
Precision : 0.9860
Recall : 0.9527
F1 : 0.9691
Prevalence : 0.7220
Detection Rate : 0.6878
Detection Prevalence : 0.6976
Balanced Accuracy : 0.9588
'Positive' Class : 2
The following code block captures the model that produced the best results, based on Kappa and the core metric (Sensitivity), as this is a minority-class classification problem.
# set model variable for best performance as MI or CCA
better_model = 'MI'
Complete case analysis also produced an excellent model, though with a lower Kappa and lower metrics than the baseline MI model: accuracy, balanced accuracy and the F1 score are all slightly lower. Although the MCAR test result indicated that the data are MCAR, the missing data pattern, concentrated in a single variable, suggested MAR or MNAR. The better performance of the MI model over CCA, despite a low missingness rate, suggests that MAR or MNAR is more likely, or that valuable information is contained in the deleted records. To test MNAR further, the SVM model will be tested against the highest delta-adjusted imputation to see if there is a big impact on results. If CCA produces a better result, this may be due to data bias.
#Create x and y values
#dfcp_mi_d5 (highest delta) data
x5 <- train_d5[,-9]
y5 <- as.factor(train_d5$Diagnosis)
set.seed(345) #for reproducibility
#Fit model with training data and review details
model_svm5 <- svm(x5, y5, kernel='radial')
print(model_svm5)
Call:
svm.default(x = x5, y = y5, kernel = "radial")
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 86
summary(model_svm5)
Call:
svm.default(x = x5, y = y5, kernel = "radial")
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 86
( 29 57 )
Number of Classes: 2
Levels:
2 4
#test with test data
x5t <- test_d5[,-9]
y5t <- as.factor(test_d5$Diagnosis)
pred_y5t <- predict(model_svm5, x5t)
#check accuracy
table(pred_y5t, y5t)
y5t
pred_y5t 2 4
2 135 1
4 5 70
confusionMatrix(pred_y5t, y5t, mode="everything")
Confusion Matrix and Statistics
Reference
Prediction 2 4
2 135 1
4 5 70
Accuracy : 0.9716
95% CI : (0.9391, 0.9895)
No Information Rate : 0.6635
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9372
Mcnemar's Test P-Value : 0.2207
Sensitivity : 0.9643
Specificity : 0.9859
Pos Pred Value : 0.9926
Neg Pred Value : 0.9333
Precision : 0.9926
Recall : 0.9643
F1 : 0.9783
Prevalence : 0.6635
Detection Rate : 0.6398
Detection Prevalence : 0.6445
Balanced Accuracy : 0.9751
'Positive' Class : 2
There is not a big impact in this example, suggesting that results are relatively stable, that multiple imputation produces the best results, and that a pattern mixture model is not necessary. The impact result is captured as a Yes or No response in the code below.
#Is there a big impact on result?
delta_impact = 'No'
Look at which rows were misclassified by the MI and CCA models.
test_cca <- as.data.frame(test_cca) #coerce from data.table so test_cca[,9] returns a vector
misclass_mi <- which(pred_yt != test_mi[,9])
misclass_cca <- which(pred_y1t != test_cca[,9])
Print the misclassified rows from the MI and CCA model results.
print(misclass_mi)
[1] 1 2 61 70 99 197
print(misclass_cca)
[1] 1 2 15 43 58 67 79 96 135
Create objects with misclassified rows and print results.
#the model misclassified the following rows
#subset data
misclass_rows_mi <- test_mi[misclass_mi,]
misclass_rows_cca <- test_cca[misclass_cca,]
print(misclass_rows_mi)
print(misclass_rows_cca)
Whilst 5 of the 6 misclassified results from the multiple imputation model also appear in the CCA model's results, there is still room for improvement. However, several of the misclassified records have values more representative of the other diagnosis, such as higher measures for CellShape and BareNuclei.
library(ROCR) # for ROC curve
#calculations for ROC curve
predictions <- as.numeric(predict(model_svm, test_mi[,-9], type="response"))
pred <- prediction(predictions, test_mi$Diagnosis)
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, col="dodgerblue1", main="ROC Curve Shows Good Separation of Data")
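Note that the predictions above are predicted class labels coerced to numeric, which yields a single-threshold, stepped curve. A sketch using the SVM's continuous decision values gives a smoother ROC; the sign of the decision values may need flipping depending on factor level order:
#ROC from continuous decision values rather than class labels
dec <- attr(predict(model_svm, test_mi[,-9], decision.values = TRUE),
"decision.values")
pred_dec <- prediction(as.numeric(dec), test_mi$Diagnosis)
perf_dec <- performance(pred_dec, measure = "tpr", x.measure = "fpr")
plot(perf_dec, col = "dodgerblue1", main = "ROC Curve From Decision Values")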
The following code will print a summary, based on the variable information recorded in earlier code blocks.
print(paste(missingness, "is missing data level."))
[1] "0.02315485 is missing data level."
print(paste("Is data more more likely to be missing in some variables than others?", likelihood))
[1] "Is data more more likely to be missing in some variables than others? Yes"
print(paste("Is data missing from dependent variable(s) only?", dependent_only))
[1] "Is data missing from dependent variable(s) only? No"
print(paste("Does higher delta adjustment have big impact on results?", delta_impact))
[1] "Does higher delta adjustment have big impact on results? No"
if (likelihood=="No" & mcar_presult=="error") {
hypothesis="Missing data pattern provides some support for MCAR hypothesis. However, no result available for MCAR Test."
} else if (likelihood=="No" & mcar_presult<0.05) {
hypothesis="Hypothesis unproven. MCAR Test and missing data pattern indicate different missing data mechanisms."
} else if (likelihood=="No" & mcar_presult>=0.05) {
hypothesis="MCAR hypothesis supported by MCAR Test and missing data pattern."
} else if(likelihood=="Yes" & mcar_presult=="error") {
hypothesis="Missing data pattern provides some support for MAR/MNAR hypothesis. However, no result available for MCAR Test."
} else if (likelihood=="Yes" & mcar_presult<0.05) {
hypothesis="MAR/MNAR hypothesis supported by MCAR Test and missing data pattern."
} else if (likelihood=="Yes" & mcar_presult>=0.05) {
hypothesis="Hypothesis unproven. MCAR Test and missing data pattern indicate different missing data mechanisms."
} else {
print("No hypothesis found.")
}
print(hypothesis)
[1] "Hypothesis unproven. MCAR Test and missing data pattern indicate different missing data mechanisms."
#define different recommended approaches
data_approach1 = "Due to overall level of missing data (or level of missing data from dependent variable only) there is a risk that a poor model will be returned, regardless of method used, that would not generalise well on new data."
data_approach2 = "As MCAR hypothesis is supported and missing data is less than 20%, complete case analysis is likely to produce unbiased results. However, multiple imputation may produce a more effective model in some circumstances."
data_approach3 = "MCAR hypothesis is supported. However, as missing data is between 20-50%, multiple imputation is likely to produce a more effective model."
data_approach4 = "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. However, as missing data is less than 20% and missing from the dependent variable only, complete case analysis may be the more reliable approach."
data_approach5 = "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. Also, as missing data is less than 50% and the delta sensitivity analysis suggests the results are relatively stable, multiple imputation is the recommended approach."
data_approach6 = "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. The delta sensitivity analysis suggests the results are not stable and indicative of MNAR data. Therefore, a pattern mixture model is recommended for further consideration. There are a couple of R mice package functions worth exploring further for this: mice.impute.ri and mice.impute.mnar.logreg. (van Buuren et al., 2023)"
#define conditional statements
if (missingness>=0.5) {
approach=data_approach1
} else if (missingness>0.2 & dependent_only=='Yes') {
approach=data_approach1
} else if (hypothesis=="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness<=0.2) {
approach=data_approach2
} else if (hypothesis=="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness>0.2 & missingness<0.5) {
approach=data_approach3
} else if(hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness<=0.2 & dependent_only=="Yes") {
approach=data_approach4
} else if(hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern." & missingness<=0.5 & delta_impact=="No") {
approach=data_approach5
} else if(hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern." & delta_impact=="Yes") {
approach=data_approach6
} else {
print("No approach found.")
}
print(approach)
[1] "MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. Also, as missing data is less than 50% and the delta sensitivity analysis suggests the results are relatively stable, multiple imputation is the recommended approach."
if (better_model=="MI" & approach==data_approach1) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, there is a risk that it would not generalise well on new data.")
} else if (better_model=="CCA" & approach==data_approach1) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, there is a risk that it would not generalise well on new data.")
} else if (better_model=="MI" & approach==data_approach2) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, complete case analysis would be likely to produce unbiased results also.")
} else if (better_model=="CCA" & approach==data_approach2) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, multiple imputation may produce a more effective model in some circumstances.")
} else if (better_model=="MI" & approach==data_approach3) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This result is expected, given variables provided.")
} else if (better_model=="CCA" & approach==data_approach3) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, caution should be noted over results due to high level of missingness.")
} else if (better_model=="MI" & approach==data_approach4) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, caution should be taken with how missing data in dependent variables is handled, as CCA may be more reliable.")
} else if (better_model=="CCA" & approach==data_approach4) {
print("For this particular data experiment, complete case analysis produced the most effective model. This should be a reliable approach for this missing data problem.")
} else if (better_model=="MI" & approach==data_approach5) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This should be a reliable approach for this missing data problem.")
} else if (better_model=="CCA" & approach==data_approach5) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, the MCAR hypothesis is either not supported or unclear so caution is advised on the results produced.")
} else if (better_model=="MI" & approach==data_approach6) {
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, it is recommended that a pattern mixture model is considered also.")
} else if (better_model=="CCA" & approach==data_approach6) {
print("For this particular data experiment, complete case analysis produced the most effective model. However, it is recommended that a pattern mixture model is considered also.")
} else {
print("No model evaluation found.")
}
[1] "For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This should be a reliable approach for this missing data problem."
– Last updated: August 2023 –
–End–