This Vaar R Markdown
Notebook using Python aims to return:
- Whether data is more likely to be MCAR or MAR/MNAR
- A recommended approach for managing missing data that considers
stability of model results
- Evaluation of imputation efforts based on data model experiment
results
This particular Vaar notebook uses UCI’s Breast Cancer Wisconsin Original Data as an exemplar.
library(reticulate)
py_available(TRUE)
py_install("pandas")
reticulate::repl_python()
py_install("numpy")
Print summary statistics for dataframe.
import numpy as np
print(df.describe())
0 1 2 ... 8 9 10
count 6.990000e+02 699.000000 699.000000 ... 699.000000 699.000000 699.000000
mean 1.071704e+06 4.417740 3.134478 ... 2.866953 1.589413 2.689557
std 6.170957e+05 2.815741 3.051459 ... 3.053634 1.715078 0.951273
min 6.163400e+04 1.000000 1.000000 ... 1.000000 1.000000 2.000000
25% 8.706885e+05 2.000000 1.000000 ... 1.000000 1.000000 2.000000
50% 1.171710e+06 4.000000 1.000000 ... 1.000000 1.000000 2.000000
75% 1.238298e+06 6.000000 5.000000 ... 4.000000 1.000000 4.000000
max 1.345435e+07 10.000000 10.000000 ... 10.000000 10.000000 4.000000
[8 rows x 11 columns]
Count number of observations in each class if classification problem.
diagnosis = df.groupby(10)
diagnosis.count()
0 1 2 3 4 5 6 7 8 9
10
2 458 458 458 458 458 458 444 458 458 458
4 241 241 241 241 241 241 239 241 241 241
#rename variables to something more meaningful in copy of original df
dfcp = df
cols = df.columns
df.rename(columns = {cols[0]:'sampleID',cols[1]:'ClumpThickness',cols[2]:'CellSize', cols[3]:'CellShape',cols[4]:'Adhesion',cols[5]:'SingleESize',cols[6]:'BareNuclei', cols[7]:'BlandChromatin', cols[8]:'NormalNucleoli', cols[9]:'Mitoses', cols[10]:'Diagnosis'}, inplace=True)
Get the type for variables in the dataframe.
dfcp.dtypes
sampleID int64
ClumpThickness int64
CellSize int64
CellShape int64
Adhesion int64
SingleESize int64
BareNuclei float64
BlandChromatin int64
NormalNucleoli int64
Mitoses int64
Diagnosis int64
dtype: object
Check for and remove any duplicate observations if there is a unique identifier. Comment out if no identifier.
duplicates = dfcp.duplicated(keep='first')
print('Duplicates:')
Duplicates:
print(duplicates)
0 False
1 False
2 False
3 False
4 False
...
694 False
695 False
696 False
697 False
698 False
Length: 699, dtype: bool
print('\n')
print('DataFrame after keeping only the first instance of the duplicate rows:')
DataFrame after keeping only the first instance of the duplicate rows:
dfcp[~duplicates]
sampleID ClumpThickness CellSize ... NormalNucleoli Mitoses Diagnosis
0 1000025 5 1 ... 1 1 2
1 1002945 5 4 ... 2 1 2
2 1015425 3 1 ... 1 1 2
3 1016277 6 8 ... 7 1 2
4 1017023 4 1 ... 1 1 2
.. ... ... ... ... ... ... ...
694 776715 3 1 ... 1 1 2
695 841769 2 1 ... 1 1 2
696 888820 5 10 ... 10 2 4
697 897471 4 8 ... 6 1 4
698 897471 4 8 ... 4 1 4
[691 rows x 11 columns]
Remove unique identifier from dataframe as this will not add any value to model.
dfcp = dfcp.drop(columns=['sampleID'])
Count missing values in each column.
dfcp.isna().sum()
ClumpThickness 0
CellSize 0
CellShape 0
Adhesion 0
SingleESize 0
BareNuclei 16
BlandChromatin 0
NormalNucleoli 0
Mitoses 0
Diagnosis 0
dtype: int64
len(dfcp)
699
missingness = 16/699
Use seaborn module to visualise missing data.
py_install("seaborn")
import seaborn as sns
Install matplotlib module also.
py_install("matplotlib==3.6")
Import matplotlib.
import matplotlib.pyplot as plt
Visualise missing data patterns in matrix.
plt.figure(figsize=(10,6))
<Figure size 1000x600 with 0 Axes>
sns.heatmap(dfcp.isna().transpose(), cmap="Blues", cbar_kws={'label': 'Missing Data'})
<AxesSubplot: >
plt.show()
Run MCAR Test.
py_install("scipy")
py_install("r-naniar")
The following code, including Little’s MCAR Test, has been performed
in R, as no equivalent exists in Python yet (August 2023). There
will be a warning with MCAR test if non-numeric columns are present. If
p-value >0.05 likely to be MCAR as not statistically
significant.
The MCAR Test may error if data is singular, there is no fix available
for this yet (August 2023).
library(naniar)
library(reticulate)
mcar_test(py$dfcp)
If the statistic is high and the p.value <0.05 it is likely to be MNAR or MAR and cannot be ignored. A p-value >0.05 indicates MCAR but other information will be considered also. The value is set as a variable in the following code.
#set MCAR Test p.value as variable
mcar_presult = 0.2133409
if mcar_presult=="error" or isinstance(mcar_presult, (int, float, complex)):
print(f"{mcar_presult} is MCAR Test p.value variable value.")
else:
print("Please enter either a number or 'error' as the mcar_presult variable value.")
0.2133409 is MCAR Test p.value variable value.
If so, this indicates that the data could be missing at random (MAR) or missing not at random (MNAR) and can therefore not be ignored. Yes or No is set as a variable in the following code.
#set likelihood variable
likelihood = "Yes"
if likelihood=="Yes" or likelihood=="No":
print(f"{likelihood} is likelihood variable value.")
else:
print("Please enter either 'Yes' or 'No' as the likelihood variable value.")
Yes is likelihood variable value.
If so, this suggests that complete case analysis may be the best option. However, other factors need to be considered also. Yes or No is set as variable in the following code.
#set dependent missingness variable
dependent_only = "No"
if dependent_only=="Yes" or dependent_only=="No":
print(f"{dependent_only} is dependent_only variable value.")
else:
print("Please enter either 'Yes' or 'No' as the dependent_only variable value.")
No is dependent_only variable value.
If it’s between 20-50% multiple imputation is likely to result in a more effective, unbiased model. The value is set as a variable in the following code.
#print missingness variable value
print(missingness)
0.022889842632331903
It’s good practice to test different approaches and the Vaar Notebooks will consider these, as well as the variable values just set in recommendations.
Visualise correlation for numerical variables.
n=10 #number of columns
names = ['ClumpThickness','CellSize','CellShape','Adhesion','SingleESize','BareNuclei','BlandChromatin','NormalNucleoli','Mitoses','Diagnosis'] #col names
fig = plt.matshow(dfcp.corr(),)
ax = plt.gca()
ax.set_xticks(np.arange(n))
[<matplotlib.axis.XTick object at 0x00000192F1FF6220>, <matplotlib.axis.XTick object at 0x00000192F1FF6E80>, <matplotlib.axis.XTick object at 0x0000019284A26070>, <matplotlib.axis.XTick object at 0x0000019284A5E2E0>, <matplotlib.axis.XTick object at 0x0000019287792940>, <matplotlib.axis.XTick object at 0x00000192F1F96C40>, <matplotlib.axis.XTick object at 0x0000019284A39C10>, <matplotlib.axis.XTick object at 0x0000019287792EB0>, <matplotlib.axis.XTick object at 0x000001928779A9A0>, <matplotlib.axis.XTick ob
ax.set_xticklabels(names)
ject at 0x000001928779F490>]
[Text(0, 1, 'ClumpThickness'), Text(1, 1, 'CellSize'), Text(2, 1, 'CellShape'), Text(3, 1, 'Adhesion'), Text(4, 1, 'SingleESize'), Text(5, 1, 'BareNuclei'), Text(6, 1, 'BlandChromatin'), Text(7, 1, 'NormalNucleoli'), Text(8, 1, 'Mitoses'), Text(9, 1, 'Diagnosis')]
ax.set_yticks(np.arange(n))
[<matplotlib.axis.YTick object at 0x00000192F1E34970>, <matplotlib.axis.YTick object at 0x00000192F1FF6E20>, <matplotlib.axis.YTick object at 0x00000192848123D0>, <matplotlib.axis.YTick object at 0x00000192877A6280>, <matplotlib.axis.YTick object at 0x00000192877AEA60>, <matplotlib.axis.YTick object at 0x00000192877A6FA0>, <matplotlib.axis.YTick object at 0x000001928779F850>, <matplotlib.axis.YTick object at 0x000001928779AD00>, <matplotlib.axis.YTick object at 0x00000192877B6AF0>, <matplotlib.axis.YTick object at 0x00000192877BA5E0>]
[Text(0, 0, 'ClumpThickness'), Text(0, 1, 'CellSize'), Text(0, 2, 'CellShape'), Text(0, 3, 'Adhesion'), Text(0, 4, 'SingleESize'), Text(0, 5, 'BareNuclei'), Text(0, 6, 'BlandChromatin'), Text(0, 7, 'NormalNucleoli'), Text(0, 8, 'Mitoses'), Text(0, 9, 'Diagnosis')]
plt.setp([tick.label2 for tick in ax.xaxis.get_major_ticks()], rotation=40,
ha="left", va="center",rotation_mode="anchor")
[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
plt.show()
Use Seaborn’s PairGrid to visualise interactions between variables.
#use seaborn's PairGrid to visualise interactions between variables.
pairs = sns.PairGrid(dfcp, hue="Diagnosis")
pairs.map_diag(sns.histplot)
pairs.map_offdiag(sns.scatterplot)
<seaborn.axisgrid.PairGrid object at 0x00000192F1FE0880>
pairs.add_legend()
<seaborn.axisgrid.PairGrid object at 0x00000192F1FE0880>
plt.show()
Install and import scipy.
py_install("scipy")
import scipy as sci
from scipy.stats import skew
Calculate and print skewness. For variables with missing data ‘nan’ will be returned.
print(skew(dfcp, axis = 0, bias=True))
[0.59158554 1.23048876 1.15936443 1.52119475 1.70849542 nan
1.09760722 1.41920737 3.55301239 0.65315834]
Naniar (R) used for prop_miss functions, to add to dataframe temporarily for use in simple decision tree.
library(rpart)
library(rpart.plot)
py$dfcp %>%
add_prop_miss() %>%
rpart(prop_miss_all ~ ., data=.) %>%
prp(type=4, extra = 101, roundint=FALSE, prefix="Prop.Miss = ")
Set variable value for missingness influencer in code below.
#set value for missingness influencer
miss_influencer="BareNuclei"
Create complete dataset by deleting records with missing values and run iterative chained equations which return single imputation instead of multiple.
#Create CCA data
dfcp_cca = dfcp.copy()
dfcp_cca = dfcp_cca.dropna()
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
dfcp_mi = IterativeImputer(random_state=0, max_iter=5).fit_transform(dfcp)
Check that there is no missing data in new datasets.
#convert array to dataframe
dfcp_mi = pd.DataFrame(dfcp_mi)
print(dfcp_cca.isna().sum())
ClumpThickness 0
CellSize 0
CellShape 0
Adhesion 0
SingleESize 0
BareNuclei 0
BlandChromatin 0
NormalNucleoli 0
Mitoses 0
Diagnosis 0
dtype: int64
print(dfcp_mi.isna().sum())
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: int64
Add column names to imputed dataframe.
dfcp_imp = dfcp_mi
cols = dfcp_imp.columns
dfcp_imp.rename(columns = {cols[0]:'ClumpThickness',cols[1]:'CellSize', cols[2]:'CellShape',cols[3]:'Adhesion',cols[4]:'SingleESize',cols[5]:'BareNuclei', cols[6]:'BlandChromatin', cols[7]:'NormalNucleoli', cols[8]:'Mitoses', cols[9]:'Diagnosis'}, inplace=True)
Use the variable with missing data which has the most influence on proportion of missingness, according to step 5 and look at the range of values in this variable. Generate imputations under delta adjustment to imitate deviations from MAR (Gink and Van Buuren, no date). The code uses 0 and a reasonable range based on the variable range as the delta values.
#Pull out index of NaN from original data
df[df['BareNuclei'].isna()]
sampleID ClumpThickness CellSize ... NormalNucleoli Mitoses Diagnosis
23 1057013 8 4 ... 3 1 4
40 1096800 6 6 ... 8 1 2
139 1183246 1 1 ... 1 1 2
145 1184840 1 1 ... 1 1 2
158 1193683 1 1 ... 1 1 2
164 1197510 5 1 ... 1 1 2
235 1241232 3 1 ... 1 1 2
249 169356 3 1 ... 1 1 2
275 432809 3 1 ... 1 1 2
292 563649 8 8 ... 10 1 4
294 606140 1 1 ... 1 1 2
297 61634 5 4 ... 3 1 2
315 704168 4 6 ... 9 1 2
321 733639 3 1 ... 1 1 2
411 1238464 1 1 ... 1 1 2
617 1057067 1 1 ... 1 1 2
[16 rows x 11 columns]
#subset these rows and variable identified from imputed data set
delta0 = dfcp_mi.iloc[[23,40,139,145,158,164,235,249,275,292,294,297,315,321,411,617],[5]]
#set delta values then set upper clip
#create new values by delta increase
delta=[0,1,2,3,4]
delta1 = delta0.copy()
delta1.loc[delta1['BareNuclei']<10, 'BareNuclei']+=1
Repeat for other delta values
delta2 = delta1.copy()
delta2.loc[delta2['BareNuclei']<10, 'BareNuclei']+=1
delta2.clip(upper=10)
BareNuclei
23 9.116345
40 5.627595
139 3.186619
145 3.588273
158 3.262360
164 3.479783
235 3.992673
249 3.423504
275 3.644552
292 8.325052
294 3.208827
297 3.068113
315 4.168468
321 3.423504
411 3.186619
617 3.028221
delta3 = delta2.copy()
delta3.loc[delta3['BareNuclei']<10, 'BareNuclei']+=1
delta3 = delta3.clip(upper=10)
delta4 = delta3.copy()
delta4.loc[delta4['BareNuclei']<10, 'BareNuclei']+=1
delta4 = delta4.clip(upper=10)
Replace NaN with increased values to test sensitivity. Example below to change based on outputs created above.
dfcp_mi_d1 = dfcp.copy()
dfcp_mi_d1.iloc[[23],[5]] = 8.116345
dfcp_mi_d1.iloc[[40],[5]] = 4.627595
dfcp_mi_d1.iloc[[139],[5]] = 2.186619
dfcp_mi_d1.iloc[[145],[5]] = 2.588273
dfcp_mi_d1.iloc[[158],[5]] = 2.262360
dfcp_mi_d1.iloc[[164],[5]] = 2.479783
dfcp_mi_d1.iloc[[235],[5]] = 2.992673
dfcp_mi_d1.iloc[[249],[5]] = 2.423504
dfcp_mi_d1.iloc[[275],[5]] = 2.644552
dfcp_mi_d1.iloc[[292],[5]] = 7.325052
dfcp_mi_d1.iloc[[294],[5]] = 2.208827
dfcp_mi_d1.iloc[[297],[5]] = 2.068113
dfcp_mi_d1.iloc[[315],[5]] = 3.168468
dfcp_mi_d1.iloc[[321],[5]] = 2.423504
dfcp_mi_d1.iloc[[411],[5]] = 2.186619
dfcp_mi_d1.iloc[[617],[5]] = 2.028221
Repeat for other delta values.
dfcp_mi_d2 = dfcp_mi_d1.copy()
dfcp_mi_d2.iloc[[23],[5]] = 9.116345
dfcp_mi_d2.iloc[[40],[5]] = 5.627595
dfcp_mi_d2.iloc[[139],[5]] = 3.186619
dfcp_mi_d2.iloc[[145],[5]] = 3.588273
dfcp_mi_d2.iloc[[158],[5]] = 3.262360
dfcp_mi_d2.iloc[[164],[5]] = 3.479783
dfcp_mi_d2.iloc[[235],[5]] = 3.992673
dfcp_mi_d2.iloc[[249],[5]] = 3.423504
dfcp_mi_d2.iloc[[275],[5]] = 3.644552
dfcp_mi_d2.iloc[[292],[5]] = 8.325052
dfcp_mi_d2.iloc[[294],[5]] = 3.208827
dfcp_mi_d2.iloc[[297],[5]] = 3.068113
dfcp_mi_d2.iloc[[315],[5]] = 4.168468
dfcp_mi_d2.iloc[[321],[5]] = 3.423504
dfcp_mi_d2.iloc[[411],[5]] = 3.186619
dfcp_mi_d2.iloc[[617],[5]] = 3.028221
dfcp_mi_d3 = dfcp_mi_d2.copy()
dfcp_mi_d3.iloc[[23],[5]] = 10
dfcp_mi_d3.iloc[[40],[5]] = 6.627595
dfcp_mi_d3.iloc[[139],[5]] = 4.186619
dfcp_mi_d3.iloc[[145],[5]] = 4.588273
dfcp_mi_d3.iloc[[158],[5]] = 4.262360
dfcp_mi_d3.iloc[[164],[5]] = 4.479783
dfcp_mi_d3.iloc[[235],[5]] = 4.992673
dfcp_mi_d3.iloc[[249],[5]] = 4.423504
dfcp_mi_d3.iloc[[275],[5]] = 4.644552
dfcp_mi_d3.iloc[[292],[5]] = 9.325052
dfcp_mi_d3.iloc[[294],[5]] = 4.208827
dfcp_mi_d3.iloc[[297],[5]] = 4.068113
dfcp_mi_d3.iloc[[315],[5]] = 5.168468
dfcp_mi_d3.iloc[[321],[5]] = 4.423504
dfcp_mi_d3.iloc[[411],[5]] = 4.186619
dfcp_mi_d3.iloc[[617],[5]] = 4.028221
dfcp_mi_d4 = dfcp_mi_d3.copy()
dfcp_mi_d4.iloc[[23],[5]] = 10
dfcp_mi_d4.iloc[[40],[5]] = 7.627595
dfcp_mi_d4.iloc[[139],[5]] = 5.186619
dfcp_mi_d4.iloc[[145],[5]] = 5.588273
dfcp_mi_d4.iloc[[158],[5]] = 5.262360
dfcp_mi_d4.iloc[[164],[5]] = 5.479783
dfcp_mi_d4.iloc[[235],[5]] = 5.992673
dfcp_mi_d4.iloc[[249],[5]] = 5.423504
dfcp_mi_d4.iloc[[275],[5]] = 5.644552
dfcp_mi_d4.iloc[[292],[5]] = 10
dfcp_mi_d4.iloc[[294],[5]] = 5.208827
dfcp_mi_d4.iloc[[297],[5]] = 5.068113
dfcp_mi_d4.iloc[[315],[5]] = 6.168468
dfcp_mi_d4.iloc[[321],[5]] = 5.423504
dfcp_mi_d4.iloc[[411],[5]] = 5.186619
dfcp_mi_d4.iloc[[617],[5]] = 5.028221
There can be valuable information in outliers and it is good practice to remove obvious errors only - such as impossible values for particular variables. Outliers could be an accurate representation of the natural variability in data and reflective of diagnostic challenge.
This code is an example of how to determine which variables have above 0.9 correlation, as highly-correlated features do not generally improve models. Then, remove any variables identified from all datasets.
#select highly correlated features
#remove target variables from consideration
dfcp_mi_cor = dfcp_mi.copy()
dfcp_mi_cor = dfcp_mi_cor.drop(columns=['Diagnosis'])
cor = dfcp_mi_cor.corr()
print(cor)
ClumpThickness CellSize ... NormalNucleoli Mitoses
ClumpThickness 1.000000 0.644913 ... 0.535835 0.350034
CellSize 0.644913 1.000000 ... 0.722865 0.458693
CellShape 0.654589 0.906882 ... 0.719446 0.438911
Adhesion 0.486356 0.705582 ... 0.603352 0.417633
SingleESize 0.521816 0.751799 ... 0.628881 0.479101
BareNuclei 0.595850 0.691747 ... 0.582114 0.340126
BlandChromatin 0.558428 0.755721 ... 0.665878 0.344169
NormalNucleoli 0.535835 0.722865 ... 1.000000 0.428336
Mitoses 0.350034 0.458693 ... 0.428336 1.000000
[9 rows x 9 columns]
Remove from datasets if over 0.9.
dfcp_mi = dfcp_mi.drop(columns=['CellSize'])
dfcp_mi_d1 = dfcp_mi_d1.drop(columns=['CellSize'])
dfcp_mi_d2 = dfcp_mi_d2.drop(columns=['CellSize'])
dfcp_mi_d3 = dfcp_mi_d3.drop(columns=['CellSize'])
dfcp_mi_d4 = dfcp_mi_d4.drop(columns=['CellSize'])
dfcp_cca = dfcp_cca.drop(columns=['CellSize'])
Scaling data allows models to compare relative relationships between data points more effectively. Some datasets will already be scaled like this one.
Log transformation can be helpful for linear models for example.
#from sklearn.preprocessing import MinMaxScaler
#scaler = MinMaxScaler()
#dfcp_mi = scaler.fit_transform(dfcp_mi)
#dfcp_cca = scaler.fit_transform(dfcp_cca)
#dfcp_mi_d4 = scaler.fit_transform(dfcp_mi_d4)
This notebook uses a Random Forest model as an example.
Using train-test split procedure from Scikit-Learn, creating x and y values for each dataset first.
#for imputed dataset
#change class to factor also if required
from sklearn.model_selection import train_test_split
x = dfcp_mi.drop(columns=['Diagnosis'])
y = dfcp_mi['Diagnosis'].astype('category')
np.random.seed(3) #for reproducibility
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
Repeat for other datasets that will be used to train and test models.
#cca
xcca = dfcp_cca.drop(columns=['Diagnosis'])
ycca = dfcp_cca['Diagnosis'].astype('category')
np.random.seed(3) #for reproducibility
xcca_train, xcca_test, ycca_train, ycca_test = train_test_split(xcca, ycca, test_size=0.3)
#delta4
xd4 = dfcp_mi_d4.drop(columns=['Diagnosis'])
yd4 = dfcp_mi_d4['Diagnosis'].astype('category')
np.random.seed(3) #for reproducibility
xd4_train, xd4_test, yd4_train, yd4_test = train_test_split(xd4, yd4, test_size=0.3)
Start with baseline model using iterative imputation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
# random forest model creation
rfc = RandomForestClassifier()
rfc.fit(x_train,y_train)
RandomForestClassifier()
# predictions against test
np.random.seed(3)
predictions = rfc.predict(x_test)
#evaluate using confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
sens_imp = tp/(tp+fn)
spec_imp = tn/(fp+tn)
acc_imp = (tp+tn)/(tp+tn+fp+fn)
miss_preds = fp+fn
print(f'Sensitivity for imputed model is {sens_imp}.')
Sensitivity for imputed model is 0.975.
print(f'Specificity for imputed model is {spec_imp}.')
Specificity for imputed model is 0.9615384615384616.
print(f'Accuracy for imputed model is {acc_imp}.')
Accuracy for imputed model is 0.9666666666666667.
print(f'The imputed model missclassified {miss_preds} observations in test.')
The imputed model missclassified 7 observations in test.
Repeat model tests with CCA data.
#define model
rfc1 = RandomForestClassifier()
rfc1.fit(xcca_train,ycca_train)
RandomForestClassifier()
# predictions against test
np.random.seed(3)
predictions_cca = rfc1.predict(xcca_test)
#evaluate using confusion matrix
tn1, fp1, fn1, tp1 = confusion_matrix(ycca_test, predictions_cca).ravel()
#Optional calculations for classification models
sens_cca = tp1/(tp1+fn1)
spec_cca = tn1/(fp1+tn1)
acc_cca = (tp1+tn1)/(tp1+tn1+fp1+fn1)
miss_cca = fp1+fn1
print(f'Sensitivity for CCA model is {sens_cca}.')
Sensitivity for CCA model is 0.935064935064935.
print(f'Specificity for CCA model is {spec_cca}.')
Specificity for CCA model is 0.96875.
print(f'Accuracy for CCA model is {acc_cca}.')
Accuracy for CCA model is 0.9560975609756097.
print(f'The CCA model missclassified {miss_cca} observations in test.')
The CCA model missclassified 9 observations in test.
The following code block captures the model which produced the best results, based on core metric (Sensitivity) as this is a minority classification problem.
# set model variable for best performance as MI or CCA
better_model = 'MI'
if better_model=="CCA" or better_model=="MI":
print(f"{better_model} is the better model based on this data experiment.")
else:
print("Please enter either 'MI' or 'CCA' for this variable.")
MI is the better model based on this data experiment.
#define and fit model
rfc4 = RandomForestClassifier()
rfc4.fit(xd4_train,yd4_train)
RandomForestClassifier()
#predict against test
np.random.seed(3)
predictions_d4 = rfc4.predict(xd4_test)
#evaluate using confusion matrix
tn4, fp4, fn4, tp4 = confusion_matrix(yd4_test, predictions_d4).ravel()
sens_d4 = tp4/(tp4+fn4) #tpr
spec_d4 = tn4/(fp4+tn4)
acc_d4 = (tp4+tn4)/(tp4+tn4+fp4+fn4)
miss_d4 = fp4+fn4
print(f'Sensitivity for delta 4 model is {sens_d4}.')
Sensitivity for delta 4 model is 0.9625.
print(f'Specificity for delta 4 model is {spec_d4}.')
Specificity for delta 4 model is 0.9615384615384616.
print(f'Accuracy for delta 4 model is {acc_d4}.')
Accuracy for delta 4 model is 0.9619047619047619.
print(f'The delta 4 model missclassified {miss_d4} observations in test.')
The delta 4 model missclassified 8 observations in test.
The impact result is captured as a Yes or No response in the code below.
delta_impact = 'No'
if delta_impact=="Yes" or delta_impact=="No":
print(f"{delta_impact} is the value for the delta_impact variable.")
else:
print("Please enter either 'Yes' or 'No' for this variable.")
No is the value for the delta_impact variable.
The following code will print a summary, based on the variable information recorded in earlier code blocks.
Print of variable values set also.
print(f"Missing data level is: {missingness}.\n")
Missing data level is: 0.022889842632331903.
print(f"Data is more likely to be missing in some variables than others: {likelihood}.\n")
Data is more likely to be missing in some variables than others: Yes.
print(f"Little's MCAR Test p value result is: {mcar_presult}. \n")
Little's MCAR Test p value result is: 0.2133409.
print(f"Data is missing from the dependent variable only: {dependent_only}. \n")
Data is missing from the dependent variable only: No.
if likelihood=="No" and mcar_presult=="error":
hypothesis="Missing data pattern provides some support for MCAR hypothesis. However, no result available for MCAR Test."
elif likelihood=="No" and mcar_presult<0.05:
hypothesis="Hypothesis unproven. MCAR Test and missing data pattern indicate different missing data mechanisms."
elif likelihood=="No" and mcar_presult>=0.05:
hypothesis="MCAR hypothesis supported by MCAR Test and missing data pattern."
elif likelihood=="Yes" and mcar_presult=="error":
hypothesis="Missing data pattern provides some support for MAR/MNAR hypothesis. However, no result available for MCAR Test."
elif likelihood=="Yes" and mcar_presult<0.05:
hypothesis="MAR/MNAR hypothesis supported by MCAR Test and missing data pattern."
elif likelihood=="Yes" and mcar_presult>=0.05:
hypothesis="Hypothesis unproven. MCAR Test and missing data pattern indicate different missing data mechanisms."
else:
hypothesis="No hypothesis found."
print(hypothesis)
Hypothesis unproven. MCAR Test and missing data pattern indicate different missing data mechanisms.
#define conditional statements
if missingness>=0.5:
approach="Due to overall level of missing data (or level of missing data from dependent variable only) there is a risk that a poor model will be returned, regardless of method used, that would not generalise well on new data."
elif missingness>0.2 and dependent_only=='Yes':
approach="Due to overall level of missing data (or level of missing data from dependent variable only) there is a risk that a poor model will be returned, regardless of method used, that would not generalise well on new data."
elif hypothesis=="MCAR hypothesis supported by MCAR Test and missing data pattern." and missingness<=0.2:
approach="As MCAR hypothesis is supported and missing data is less than 20%, complete case analysis is likely to produce unbiased results. However, multiple imputation may produce a more effective model in some circumstances."
elif hypothesis=="MCAR hypothesis supported by MCAR Test and missing data pattern." and missingness>0.2 and missingness<0.5:
approach="MCAR hypothesis is supported. However, as missing data is between 20-50%, multiple imputation is likely to produce a more effective model."
elif hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern." and missingness<=0.2 and dependent_only=="Yes":
approach="MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. However, as missing data is less than 20% and missing from the dependent variable only, complete case analysis may be the more reliable approach."
elif hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern." and missingness<=0.5 and delta_impact=="No":
approach="MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. Also, as missing data is less than 50% and the delta sensitivity analysis suggests the results are relatively stable, multiple imputation is the recommended approach."
elif hypothesis!="MCAR hypothesis supported by MCAR Test and missing data pattern." and delta_impact=="Yes":
approach="MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. The delta sensitivity analysis suggests the results are not stable and indicative of MNAR data. Therefore, a pattern mixture model is recommended for further consideration. There are a couple of R mice package functions worth exploring further for this: mice.impute.ri and mice.impute.mnar.logreg (van Buuren et al., 2023)."
else:
approach="No approach found."
print(approach)
MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. Also, as missing data is less than 50% and the delta sensitivity analysis suggests the results are relatively stable, multiple imputation is the recommended approach.
if better_model=="MI" and approach=="Due to overall level of missing data (or level of missing data from dependent variable only) there is a risk that a poor model will be returned, regardless of method used, that would not generalise well on new data.":
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, there is a risk that it would not generalise well on new data.")
elif better_model=="CCA" and approach=="Due to overall level of missing data (or level of missing data from dependent variable only) there is a risk that a poor model will be returned, regardless of method used, that would not generalise well on new data.":
print("For this particular data experiment, complete case analysis produced the most effective model. However, there is a risk that it would not generalise well on new data.")
elif better_model=="MI" and approach=="As MCAR hypothesis is supported and missing data is less than 20%, complete case analysis is likely to produce unbiased results. However, multiple imputation may produce a more effective model in some circumstances.":
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, complete case analysis would be likely to produce unbiased results also.")
elif better_model=="CCA" and approach=="As MCAR hypothesis is supported and missing data is less than 20%, complete case analysis is likely to produce unbiased results. However, multiple imputation may produce a more effective model in some circumstances.":
print("For this particular data experiment, complete case analysis produced the most effective model. However, multiple imputation may produce a more effective model in some circumstances.")
elif better_model=="MI" and approach=="MCAR hypothesis is supported. However, as missing data is between 20-50%, multiple imputation is likely to produce a more effective model.":
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This result is expected, given variables provided.")
elif better_model=="CCA" and approach=="MCAR hypothesis is supported. However, as missing data is between 20-50%, multiple imputation is likely to produce a more effective model.":
print("For this particular data experiment, complete case analysis produced the most effective model. However, caution should be noted over results due to high level of missingness.")
elif better_model=="MI" and approach=="MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. However, as missing data is less than 20% and missing from the dependent variable only, complete case analysis may be the more reliable approach.":
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, caution should be taken with how missing data in dependent variables is handled, as CCA may be more reliable.")
elif better_model=="CCA" and approach=="MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. However, as missing data is less than 20% and missing from the dependent variable only, complete case analysis may be the more reliable approach.":
print("For this particular data experiment, complete case analysis produced the most effective model. This should be a reliable approach for this missing data problem.")
elif better_model=="MI" and approach=="MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. Also, as missing data is less than 50% and the delta sensitivity analysis suggests the results are relatively stable, multiple imputation is the recommended approach.":
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This should be a reliable approach for this missing data problem.")
elif better_model=="CCA" and approach=="MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. Also, as missing data is less than 50% and the delta sensitivity analysis suggests the results are relatively stable, multiple imputation is the recommended approach.":
print("For this particular data experiment, complete case analysis produced the most effective model. However, the MCAR hypothesis is either not supported or unclear so caution is advised on the results produced.")
elif better_model=="MI" and approach=="MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. The delta sensitivity analysis suggests the results are not stable and indicative of MNAR data. Therefore, a pattern mixture model is recommended for further consideration. There are a couple of R mice package functions worth exploring further for this: mice.impute.ri and mice.impute.mnar.logreg (van Buuren et al., 2023).":
print("For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. However, it is recommended that a pattern mixture model is considered also.")
elif better_model=="CCA" and approach=="MAR/MNAR hypothesis supported, or MCAR hypothesis unproven. The delta sensitivity analysis suggests the results are not stable and indicative of MNAR data. Therefore, a pattern mixture model is recommended for further consideration. There are a couple of R mice package functions worth exploring further for this: mice.impute.ri and mice.impute.mnar.logreg (van Buuren et al., 2023).":
print("For this particular data experiment, complete case analysis produced the most effective model. However, it is recommended that a pattern mixture model is considered also.")
else:
print("No model evaluation found.")
For this particular data experiment, Multiple Imputation produced a more effective model than complete case analysis. This should be a reliable approach for this missing data problem.
–Last updated: August 2023 –
–End–