DATA 621 Homework #3
From Work by Critical Thinking Group 3
df <- read.csv("data/crime-training-data_modified.csv")
evaluation <- read.csv("data/crime-evaluation-data_modified.csv")
Introduction
We have been given a dataset with 466 records summarizing attributes of various neighborhoods of a major city. Each record in the training dataset has a response variable that is 1 when the neighborhood’s crime rate is above the median and 0 when it is not. In all, there are 12 predictors. We will construct a logistic regression model to predict the classification of the evaluation data set.
Data Exploration
Are There Missing Values?
First, we look at the data to see whether any variables have missing values:
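The check itself is a one-liner; a minimal sketch (the original chunk is not shown, so the exact call is an assumption):

colSums(is.na(df))  # count of missing values in each column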
It looks like we have a complete data set. No need to impute values.
Splitting the Data
Next, we split our data into a training set (train) and a test set (test), using a 70-30 split between train and test, respectively.
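A split along these lines could be produced as follows. This is a sketch only: the seed and the use of caret’s createDataPartition() (caret is clearly available, since confusionMatrix() is used later) are assumptions, as the original chunk is not shown.

set.seed(123)  # assumed seed; the original value is not shown
train_idx <- createDataPartition(df$target, p = 0.7, list = FALSE)
train <- df[train_idx, ]
test <- df[-train_idx, ]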
Exploratory Data Analysis
Now, let’s look at our training data. The following table summarizes the predictor variables of the training set:
zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv
---|---|---|---|---|---|---|---|---|---|---|---
Min. : 0.00 | Min. : 0.46 | Min. :0.00000 | Min. :0.3890 | Min. :4.138 | Min. : 2.90 | Min. : 1.130 | Min. : 1.000 | Min. :188.0 | Min. :12.60 | Min. : 1.980 | Min. : 5.00
1st Qu.: 0.00 | 1st Qu.: 4.94 | 1st Qu.:0.00000 | 1st Qu.:0.4480 | 1st Qu.:5.930 | 1st Qu.: 40.15 | 1st Qu.: 2.191 | 1st Qu.: 4.000 | 1st Qu.:277.0 | 1st Qu.:17.40 | 1st Qu.: 6.635 | 1st Qu.:17.80
Median : 0.00 | Median : 8.14 | Median :0.00000 | Median :0.5200 | Median :6.229 | Median : 74.30 | Median : 3.378 | Median : 5.000 | Median :315.0 | Median :18.70 | Median :10.590 | Median :21.70
Mean :12.02 | Mean :10.48 | Mean :0.07645 | Mean :0.5466 | Mean :6.332 | Mean : 66.19 | Mean : 3.867 | Mean : 9.205 | Mean :396.5 | Mean :18.42 | Mean :12.248 | Mean :23.19
3rd Qu.:20.00 | 3rd Qu.:18.10 | 3rd Qu.:0.00000 | 3rd Qu.:0.6240 | 3rd Qu.:6.646 | 3rd Qu.: 92.60 | 3rd Qu.: 5.238 | 3rd Qu.: 8.000 | 3rd Qu.:437.0 | 3rd Qu.:20.20 | 3rd Qu.:16.215 | 3rd Qu.:26.65
Max. :95.00 | Max. :27.74 | Max. :1.00000 | Max. :0.8710 | Max. :8.780 | Max. :100.00 | Max. :10.710 | Max. :24.000 | Max. :711.0 | Max. :22.00 | Max. :37.970 | Max. :50.00
By looking at a correlation matrix, we can see which variables may be too correlated to be included together in a model as predictor variables. This will help us later during the model selection process.
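A minimal sketch of how such a matrix could be produced (the corrplot package is an assumption; base cor() needs nothing extra):

# Pairwise correlations among the training predictors
cor_mat <- cor(train[, setdiff(names(train), "target")])
round(cor_mat, 2)
# Optional heatmap-style visualization, assuming corrplot is installed:
# corrplot::corrplot(cor_mat)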
Next, we will look at each potential predictor and how it is distributed across the target variable.
The following plots show how predictors are distributed between a positive target (areas with crime rates above the median, in blue) and a negative target (areas with crime rates below the median, in red). What we are looking for are variables that offer a way to split the data into two groups.
# Density plot of each predictor, colored by target class
# (ggplot2 assumed loaded in the setup chunk)
for (var in names(train)){
  if (var != "target"){
    # Copy the variable of interest into a fixed column name for plotting
    plot_df <- train
    plot_df$x <- plot_df[, var]
    p <- ggplot(plot_df, aes(x, color = factor(target))) +
      geom_density() +
      theme_light() +
      ggtitle(var) +
      scale_color_brewer(palette = "Set1") +
      theme(legend.position = "none",
            axis.title.y = element_blank(),
            axis.title.x = element_blank())
    print(p)
  }
}
Looking at the plots above, the nox variable seems to be the best variable for dividing the data into the two groups.
Basic Model Building
We start by applying Occam’s razor and creating a baseline model with only one predictor. Any model we build beyond that will have to outperform this simplest model.
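The fitting call, reconstructed from the Call line in the summary output below:

baseline <- glm(target ~ nox, family = binomial(link = "logit"), data = train)
summary(baseline)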
Call:
glm(formula = target ~ nox, family = binomial(link = "logit"),
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1163 -0.4050 -0.1814 0.2717 2.5754
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -16.370 1.823 -8.981 <2e-16 ***
nox 30.372 3.444 8.818 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 451.70 on 326 degrees of freedom
Residual deviance: 205.67 on 325 degrees of freedom
AIC: 209.67
Number of Fisher Scoring iterations: 6
test$baseline <- ifelse(predict.glm(baseline, test, type = "response") >= 0.5, 1, 0)
cm <- confusionMatrix(factor(test$baseline), factor(test$target), positive = "1")
results <- tibble(model = "baseline", predictors = 1, F1 = cm$byClass[7],
                  deviance = baseline$deviance,
                  r2 = 1 - baseline$deviance/baseline$null.deviance,
                  aic = baseline$aic)
cm
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 50 11
1 12 66
Accuracy : 0.8345
95% CI : (0.7621, 0.8921)
No Information Rate : 0.554
P-Value [Acc > NIR] : 2.178e-12
Kappa : 0.6646
Mcnemar's Test P-Value : 1
Sensitivity : 0.8571
Specificity : 0.8065
Pos Pred Value : 0.8462
Neg Pred Value : 0.8197
Prevalence : 0.5540
Detection Rate : 0.4748
Detection Prevalence : 0.5612
Balanced Accuracy : 0.8318
'Positive' Class : 1
Our baseline model performs reasonably well, with an F1 score of 0.8516129.
Next we add all of the remaining variables to build a full model. From here we can work backwards and eliminate non-significant predictors:
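The fitting call, again reconstructed from the Call line below:

fullmodel <- glm(target ~ ., family = binomial(link = "logit"), data = train)
summary(fullmodel)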
Call:
glm(formula = target ~ ., family = binomial(link = "logit"),
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1095 -0.1955 -0.0025 0.0010 3.4349
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -45.183151 8.234576 -5.487 4.09e-08 ***
zn -0.075536 0.044071 -1.714 0.086539 .
indus -0.091581 0.057306 -1.598 0.110020
chas 0.686033 0.878327 0.781 0.434763
nox 52.841701 9.811850 5.385 7.22e-08 ***
rm -0.060835 0.895649 -0.068 0.945847
age 0.014775 0.014866 0.994 0.320276
dis 0.748904 0.277029 2.703 0.006865 **
rad 0.693492 0.192573 3.601 0.000317 ***
tax -0.004634 0.003734 -1.241 0.214522
ptratio 0.376793 0.162243 2.322 0.020211 *
lstat 0.146590 0.065105 2.252 0.024349 *
medv 0.159840 0.086270 1.853 0.063912 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 451.70 on 326 degrees of freedom
Residual deviance: 136.79 on 314 degrees of freedom
AIC: 162.79
Number of Fisher Scoring iterations: 9
test$fullmodel <- ifelse(predict.glm(fullmodel, test, type = "response") < 0.5, 0, 1)
cm <- confusionMatrix(factor(test$fullmodel), factor(test$target), positive = "1")
results <- rbind(results, tibble(model = "fullmodel",
                                 predictors = 12, F1 = cm$byClass[7],
                                 deviance = fullmodel$deviance,
                                 r2 = 1 - fullmodel$deviance/fullmodel$null.deviance,
                                 aic = fullmodel$aic))
cm
cm
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 55 6
1 7 71
Accuracy : 0.9065
95% CI : (0.8454, 0.9493)
No Information Rate : 0.554
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8104
Mcnemar's Test P-Value : 1
Sensitivity : 0.9221
Specificity : 0.8871
Pos Pred Value : 0.9103
Neg Pred Value : 0.9016
Prevalence : 0.5540
Detection Rate : 0.5108
Detection Prevalence : 0.5612
Balanced Accuracy : 0.9046
'Positive' Class : 1
The full model has an F1 score of 0.916129, which is a bit higher than before. However, many variables do not appear to be significant.
After some backward elimination of non-significant predictor variables, we arrive at the following model:
model1 <- glm(target ~ . - tax - rm - chas - age - zn - indus,
              family = binomial(link = "logit"),
              data = train)
summary(model1)
Call:
glm(formula = target ~ . - tax - rm - chas - age - zn - indus,
family = binomial(link = "logit"), data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.25157 -0.28019 -0.06287 0.00092 2.93960
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -40.36177 7.32297 -5.512 3.55e-08 ***
nox 43.62710 7.91385 5.513 3.53e-08 ***
dis 0.43922 0.19736 2.225 0.02605 *
rad 0.64560 0.14500 4.453 8.49e-06 ***
ptratio 0.38903 0.13439 2.895 0.00379 **
lstat 0.13126 0.05636 2.329 0.01986 *
medv 0.14156 0.04686 3.021 0.00252 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 451.70 on 326 degrees of freedom
Residual deviance: 149.16 on 320 degrees of freedom
AIC: 163.16
Number of Fisher Scoring iterations: 9
test$model1 <- ifelse(predict.glm(model1, test, type = "response") < 0.5, 0, 1)
cm <- confusionMatrix(factor(test$model1), factor(test$target), positive = "1")
results <- rbind(results, tibble(model = "model1",
                                 predictors = 6, F1 = cm$byClass[7],
                                 deviance = model1$deviance,
                                 r2 = 1 - model1$deviance/model1$null.deviance,
                                 aic = model1$aic))
cm
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 51 5
1 11 72
Accuracy : 0.8849
95% CI : (0.8198, 0.9328)
No Information Rate : 0.554
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7649
Mcnemar's Test P-Value : 0.2113
Sensitivity : 0.9351
Specificity : 0.8226
Pos Pred Value : 0.8675
Neg Pred Value : 0.9107
Prevalence : 0.5540
Detection Rate : 0.5180
Detection Prevalence : 0.5971
Balanced Accuracy : 0.8788
'Positive' Class : 1
Alternative Models
Feature Engineering
In this section we explore using the difference between the class distributions as a feature in itself. Instead of the raw feature, we create density features that are essentially the difference between the estimated probability densities of the positive and negative classes at each point in the feature’s range.
library(fuzzyjoin)  # ggplot2 is also required; assumed loaded earlier

density_diff <- function(data, eval_data, var, target, new_column){
  # Keep a copy of the raw variable so it can be restored after scaling
  data[[paste0(var, "temp")]] <- data[[var]]
  eval_data[[paste0(var, "temp")]] <- eval_data[[var]]
  # Standardize the variable to [0, 1] (note: each set is scaled by its own
  # min/max, so train and eval use slightly different scales)
  data[[var]] <- (data[[var]] - min(data[[var]]))/(max(data[[var]]) - min(data[[var]]))
  eval_data[[var]] <- (eval_data[[var]] - min(eval_data[[var]]))/(max(eval_data[[var]]) - min(eval_data[[var]]))
  ## Calculate density estimates for each target class
  g1 <- ggplot(data, aes(x = data[[var]], group = target, colour = target)) +
    geom_density(data = data) + xlim(min(data[[var]]), max(data[[var]]))
  gg1 <- ggplot_build(g1)
  # Extract the evaluated density curves on a shared x grid:
  # group 1 = negative class, group 2 = positive class
  x <- gg1$data[[1]]$x[gg1$data[[1]]$group == 1]
  y1 <- gg1$data[[1]]$y[gg1$data[[1]]$group == 1]
  y2 <- gg1$data[[1]]$y[gg1$data[[1]]$group == 2]
  df2 <- data.frame(x = x, ymin = pmin(y1, y2), ymax = pmax(y1, y2), side = (y1 < y2), ydiff = y2 - y1)
  ## Plot of the density difference itself
  g3 <- ggplot(df2) +
    geom_line(aes(x = x, y = ydiff, colour = side)) +
    geom_area(aes(x = x, y = ydiff, fill = side, alpha = 0.4)) +
    guides(alpha = FALSE, fill = FALSE)
  # Fuzzy-join each training observation to the nearest grid points and
  # average the density difference there
  data$join <- data[[var]]
  df2$join <- df2$x
  temp <- difference_left_join(data, df2, by = "join", max_dist = .001)
  means <- aggregate(ydiff ~ temp$join.x, temp, mean)
  colnames(means) <- c("std_var", "density_diff")
  new_data <- merge(means, data, by.x = "std_var", by.y = var)
  new_data$std_var <- NULL
  new_data$join <- NULL
  new_data[new_column] <- new_data$density_diff
  new_data$density_diff <- NULL
  ## Repeat the join for the evaluation data
  eval_data$join <- eval_data[[var]]
  df2$join <- df2$x
  temp <- difference_left_join(eval_data, df2, by = "join", max_dist = .001)
  means <- aggregate(ydiff ~ temp$join.x, temp, mean)
  colnames(means) <- c("std_var", "density_diff")
  eval_data2 <- merge(means, eval_data, by.x = "std_var", by.y = var)
  eval_data2$std_var <- NULL
  eval_data2$join <- NULL
  eval_data2[new_column] <- eval_data2$density_diff
  eval_data2$density_diff <- NULL
  # Restore the raw (unscaled) variable in both data sets
  eval_data2[[var]] <- eval_data2[[paste0(var, "temp")]]
  eval_data2[[paste0(var, "temp")]] <- NULL
  new_data[[var]] <- new_data[[paste0(var, "temp")]]
  new_data[[paste0(var, "temp")]] <- NULL
  # Return both plots and both augmented data sets
  mylist <- list(g1, g3, new_data, eval_data2)
  names(mylist) <- c("dist", "dist_diff", "new_data_training", "new_data_eval")
  return(mylist)
}
Creating Some New Variables
'data.frame': 327 obs. of 13 variables:
$ zn : num 0 0 0 30 0 0 80 22 0 22 ...
$ indus : num 19.58 19.58 18.1 4.93 2.46 ...
$ chas : int 0 1 0 0 0 0 0 0 0 0 ...
$ nox : num 0.605 0.871 0.74 0.428 0.488 0.52 0.392 0.431 0.437 0.431 ...
$ rm : num 7.93 5.4 6.49 6.39 7.16 ...
$ age : num 96.2 100 100 7.8 92.2 71.3 19.1 8.9 45 8.4 ...
$ dis : num 2.05 1.32 1.98 7.04 2.7 ...
$ rad : int 5 5 24 6 3 5 1 7 5 7 ...
$ tax : int 403 403 666 300 193 384 315 330 398 330 ...
$ ptratio: num 14.7 14.7 20.2 16.6 17.8 20.9 16.4 19.1 18.7 19.1 ...
$ lstat : num 3.7 26.82 18.85 5.19 4.82 ...
$ medv : num 50 13.4 15.4 23.7 37.9 26.5 20.9 24.8 21.4 42.8 ...
$ target : int 1 1 1 0 0 0 0 0 0 1 ...
'data.frame': 139 obs. of 16 variables:
$ zn : num 0 0 0 0 0 100 0 0 0 0 ...
$ indus : num 18.1 18.1 5.19 18.1 2.46 1.32 18.1 3.24 6.2 2.89 ...
$ chas : int 0 0 0 0 0 0 0 0 1 0 ...
$ nox : num 0.693 0.693 0.515 0.532 0.488 0.411 0.679 0.46 0.507 0.445 ...
$ rm : num 5.45 4.52 6.32 7.06 6.15 ...
$ age : num 100 100 38.1 77 68.8 40.5 95.4 32.2 66.5 62.5 ...
$ dis : num 1.49 1.66 6.46 3.41 3.28 ...
$ rad : int 24 24 5 24 3 5 24 4 8 2 ...
$ tax : int 666 666 224 666 193 256 666 430 307 276 ...
$ ptratio : num 20.2 20.2 20.2 20.2 17.8 15.1 20.2 16.9 17.4 18 ...
$ lstat : num 30.59 36.98 5.68 7.01 13.15 ...
$ medv : num 5 7 22.2 25 29.6 31.6 8.3 19.8 29 33.2 ...
$ target : int 1 1 0 1 0 0 1 0 1 0 ...
$ baseline : num 1 1 0 0 0 0 1 0 0 0 ...
$ fullmodel: num 1 1 1 1 0 0 1 0 1 0 ...
$ model1 : num 1 1 0 1 0 0 1 0 1 0 ...
## Apply the transform to each candidate predictor: first augment the
## training set, then the test set (using the updated training densities)
train <- density_diff(train, test, "nox", train$target, "nox_density")$new_data_training
train <- density_diff(train, test, "age", train$target, "age_density")$new_data_training
train <- density_diff(train, test, "indus", train$target, "indus_density")$new_data_training
train <- density_diff(train, test, "dis", train$target, "dis_density")$new_data_training
train <- density_diff(train, test, "rad", train$target, "rad_density")$new_data_training
train <- density_diff(train, test, "tax", train$target, "tax_density")$new_data_training
train <- density_diff(train, test, "ptratio", train$target, "ptratio_density")$new_data_training
train <- density_diff(train, test, "lstat", train$target, "lstat_density")$new_data_training
train <- density_diff(train, test, "medv", train$target, "medv_density")$new_data_training
test <- density_diff(train, test, "nox", train$target, "nox_density")$new_data_eval
test <- density_diff(train, test, "age", train$target, "age_density")$new_data_eval
test <- density_diff(train, test, "indus", train$target, "indus_density")$new_data_eval
test <- density_diff(train, test, "dis", train$target, "dis_density")$new_data_eval
test <- density_diff(train, test, "rad", train$target, "rad_density")$new_data_eval
test <- density_diff(train, test, "tax", train$target, "tax_density")$new_data_eval
test <- density_diff(train, test, "ptratio", train$target, "ptratio_density")$new_data_eval
test <- density_diff(train, test, "lstat", train$target, "lstat_density")$new_data_eval
test <- density_diff(train, test, "medv", train$target, "medv_density")$new_data_eval
str(train)
'data.frame': 327 obs. of 22 variables:
$ zn : num 0 0 0 0 0 0 0 0 0 0 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ rm : num 5.68 5.99 5.28 5 6.78 ...
$ target : int 1 1 1 1 1 0 1 1 1 1 ...
$ nox_density : num 1.73 1.73 1.7 1.7 1.71 ...
$ nox : num 0.693 0.693 0.7 0.7 0.679 0.609 0.718 0.693 0.74 0.679 ...
$ age_density : num 3.15 3.15 3.65 2.04 2.41 ...
$ age : num 100 100 98.1 89.5 90.8 98 76.5 85.4 100 95.6 ...
$ indus_density : num 3.58 3.58 3.58 3.58 3.58 ...
$ indus : num 18.1 18.1 18.1 18.1 18.1 ...
$ dis_density : num 3.02 3.47 3.02 3.31 3.64 ...
$ dis : num 1.43 1.59 1.43 1.52 1.82 ...
$ rad_density : num 1.54 1.54 1.54 1.54 1.54 ...
$ rad : int 24 24 24 24 24 4 24 24 24 24 ...
$ tax_density : num 1.95 1.95 1.95 1.95 1.95 ...
$ tax : int 666 666 666 666 666 711 666 666 666 666 ...
$ ptratio_density: num 3.87 3.87 3.87 3.87 3.87 ...
$ ptratio : num 20.2 20.2 20.2 20.2 20.2 20.1 20.2 20.2 20.2 20.2 ...
$ lstat_density : num 1.013 0.723 0.248 0.222 0.822 ...
$ lstat : num 23 26.8 30.8 32 25.8 ...
$ medv_density : num 0.258 0.325 0.546 0.576 0.591 ...
$ medv : num 5 5.6 7.2 7.4 7.5 8.1 8.4 8.5 8.7 9.5 ...
'data.frame': 139 obs. of 25 variables:
$ zn : num 0 0 0 0 0 0 0 0 0 0 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ rm : num 5.45 5.85 5.41 4.52 6.43 ...
$ target : int 1 1 0 1 1 1 1 1 1 1 ...
$ baseline : num 1 1 1 1 1 1 1 1 1 1 ...
$ fullmodel : num 1 1 0 1 1 1 1 1 1 1 ...
$ model1 : num 1 1 1 1 1 1 1 1 1 1 ...
$ nox_density : num 1.73 1.73 1.2 1.73 1.7 ...
$ nox : num 0.693 0.693 0.609 0.693 0.679 0.693 0.679 0.693 0.693 0.7 ...
$ age_density : num 3.15 0.24 3.62 3.15 3.15 ...
$ age : num 100 77.8 98.3 100 100 100 95.4 96 98.9 91.2 ...
$ indus_density : num 3.47 3.47 -0.11 3.47 3.47 ...
$ indus : num 18.1 18.1 27.7 18.1 18.1 ...
$ dis_density : num 2.96 2.96 3.55 3.36 3.63 ...
$ dis : num 1.49 1.5 1.76 1.66 1.83 ...
$ rad_density : num 1.54 1.54 -4.35 1.54 1.54 ...
$ rad : int 24 24 4 24 24 24 24 24 24 24 ...
$ tax_density : num 1.95 1.95 1.37 1.95 1.95 ...
$ tax : int 666 666 711 666 666 666 666 666 666 666 ...
$ ptratio_density: num 3.73 3.73 3.31 3.73 3.73 ...
$ ptratio : num 20.2 20.2 20.1 20.2 20.2 20.2 20.2 20.2 20.2 20.2 ...
$ lstat_density : num 0.2284 0.2481 0.9091 0.0852 0.3155 ...
$ lstat : num 30.6 30 24 37 29.1 ...
$ medv_density : num 0.258 0.414 0.517 0.517 0.546 ...
$ medv : num 5 6.3 7 7 7.2 7.2 8.3 8.3 8.5 8.8 ...
Example of the New Meta Feature
The first graph below shows the difference between the class distributions; the second graph shows the new derived predictive feature.
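Both graphs come straight from the list returned by density_diff(); for example, for the nox variable (a sketch that re-runs the transform purely to pull out the plot objects):

nox_example <- density_diff(train, test, "nox", train$target, "nox_density")
nox_example$dist       # the two class-conditional density curves
nox_example$dist_diff  # their difference, i.e. the new meta feature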
With our new density variables, we can run another model:
density_models <- glm(target ~ nox_density + indus_density + age_density +
                        dis_density + rad_density + tax_density +
                        ptratio_density + lstat_density + medv_density,
                      family = binomial(link = "logit"), data = train)
summary(density_models)
Call:
glm(formula = target ~ nox_density + indus_density + age_density +
dis_density + rad_density + tax_density + ptratio_density +
lstat_density + medv_density, family = binomial(link = "logit"),
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.96836 -0.31245 -0.06225 0.20194 3.01402
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8093 0.5320 1.521 0.128210
nox_density 1.1492 0.2600 4.420 9.86e-06 ***
indus_density 1.0706 0.2488 4.302 1.69e-05 ***
age_density 0.1219 0.1698 0.718 0.472916
dis_density -0.1849 0.2138 -0.865 0.387284
rad_density 0.4141 0.1226 3.379 0.000727 ***
tax_density -1.2582 0.4408 -2.854 0.004314 **
ptratio_density 0.1550 0.1524 1.017 0.309139
lstat_density -0.2456 0.2055 -1.195 0.231963
medv_density 0.5417 0.2062 2.628 0.008596 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 451.70 on 326 degrees of freedom
Residual deviance: 164.12 on 317 degrees of freedom
AIC: 184.12
Number of Fisher Scoring iterations: 7
# Note: predict() defaults to the link scale here, so this 0.5 cutoff is on
# log-odds, corresponding to a probability of roughly 0.62 rather than 0.5
test$density_models <- ifelse(predict(density_models, test) < 0.5, 0, 1)
test$density_models_yhat <- predict(density_models, test)
# No positive class is specified, so confusionMatrix() defaults to "0" as the
# positive class here, unlike the earlier models where "1" was used
cm <- confusionMatrix(factor(test$density_models), factor(test$target))
results <- rbind(results,tibble(model = "density models",
predictors = 9,
F1 = cm$byClass[7],
deviance = density_models$deviance,
r2 = 1 - density_models$deviance / density_models$null.deviance,
aic = density_models$aic))
cm
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 58 12
1 4 65
Accuracy : 0.8849
95% CI : (0.8198, 0.9328)
No Information Rate : 0.554
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.77
Mcnemar's Test P-Value : 0.08012
Sensitivity : 0.9355
Specificity : 0.8442
Pos Pred Value : 0.8286
Neg Pred Value : 0.9420
Prevalence : 0.4460
Detection Rate : 0.4173
Detection Prevalence : 0.5036
Balanced Accuracy : 0.8898
'Positive' Class : 0
Model Selection
Since the assignment states that the purpose is prediction, we will prefer the F1 score as our measure of model success.
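For reference, F1 is the harmonic mean of precision (positive predictive value) and recall (sensitivity). The baseline entry in the table below can be reproduced from its confusion matrix:

precision <- 0.8462  # baseline Pos Pred Value
recall <- 0.8571     # baseline Sensitivity
2 * precision * recall / (precision + recall)  # F1, roughly 0.8516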
model | predictors | F1 | deviance | r2 | aic
---|---|---|---|---|---
baseline | 1 | 0.8516129 | 205.6730 | 0.5446683 | 209.6730
fullmodel | 12 | 0.9161290 | 136.7893 | 0.6971673 | 162.7893
model1 | 6 | 0.9000000 | 149.1603 | 0.6697795 | 163.1603
density models | 9 | 0.8787879 | 164.1193 | 0.6366623 | 184.1193
Looking at the models together, model1 looks like the best one. Although the full model scored an F1 that was ever-so-slightly higher on our test data set, model1 has half the predictors and its scores are almost exactly the same as fullmodel’s.
pROC Output
ROC curves can give us another look at which model might be better suited for prediction. Note that for the first three models we stored only hard 0/1 classifications, so each "curve" has a single operating point and its AUC equals the model’s balanced accuracy.
Baseline
library(pROC)
par(pty = "s")
roc(test[["target"]], test[["baseline"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Call:
roc.default(response = test[["target"]], predictor = test[["baseline"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Data: test[["baseline"]] in 62 controls (test[["target"]] 0) < 77 cases (test[["target"]] 1).
Area under the curve: 0.8318
Full Model
par(pty = "s")
roc(test[["target"]], test[["fullmodel"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Call:
roc.default(response = test[["target"]], predictor = test[["fullmodel"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Data: test[["fullmodel"]] in 62 controls (test[["target"]] 0) < 77 cases (test[["target"]] 1).
Area under the curve: 0.9046
Model1
par(pty = "s")
roc(test[["target"]], test[["model1"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Call:
roc.default(response = test[["target"]], predictor = test[["model1"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Data: test[["model1"]] in 62 controls (test[["target"]] 0) < 77 cases (test[["target"]] 1).
Area under the curve: 0.8788
Density Model
par(pty = "s")
roc(test[["target"]], test[["density_models_yhat"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Call:
roc.default(response = test[["target"]], predictor = test[["density_models_yhat"]], plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Data: test[["density_models_yhat"]] in 62 controls (test[["target"]] 0) < 77 cases (test[["target"]] 1).
Area under the curve: 0.9652
Final Selection and Predictions
Having looked at the F1 statistics and the ROC curves, we will use model1 to form our final predictions. Although the density model shows the highest AUC, that curve was built from a continuous score (density_models_yhat) while the others were built from hard classifications, so the AUC values are not directly comparable; balancing F1, parsimony, and interpretability, model1 remains our choice.
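A sketch of that final scoring step, mirroring the 0.5 cutoff used on the test set (the output column name is an assumption):

evaluation$target <- ifelse(predict(model1, evaluation, type = "response") >= 0.5, 1, 0)
table(evaluation$target)  # predicted class counts for the evaluation set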