Step 1: Review the dimensions, variables, and summarize the data
#collect dimensions of the training data
dim(train)
## [1] 200000 56
#collect dimensions of the test data
dim(test)
## [1] 200000 55
#summarize training data
summary(train)
## id elevation aspect slope
## Length:200000 Min. :1859 Min. : 0.0 Min. : 0.00
## Class :character 1st Qu.:2808 1st Qu.: 58.0 1st Qu.: 9.00
## Mode :character Median :2995 Median :127.0 Median :13.00
## Mean :2959 Mean :155.4 Mean :14.11
## 3rd Qu.:3163 3rd Qu.:260.0 3rd Qu.:18.00
## Max. :3858 Max. :360.0 Max. :66.00
## horizontal_distance_to_hydrology vertical_distance_to_hydrology
## Min. : 0.0 Min. :-173.00
## 1st Qu.: 108.0 1st Qu.: 7.00
## Median : 218.0 Median : 30.00
## Mean : 269.6 Mean : 46.52
## 3rd Qu.: 384.0 3rd Qu.: 69.00
## Max. :1390.0 Max. : 601.00
## horizontal_distance_to_roadways hillshade_9am hillshade_noon
## Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.:1110 1st Qu.:198.0 1st Qu.:213.0
## Median :1998 Median :218.0 Median :226.0
## Mean :2351 Mean :212.1 Mean :223.3
## 3rd Qu.:3329 3rd Qu.:231.0 3rd Qu.:237.0
## Max. :7087 Max. :254.0 Max. :254.0
## hillshade_3pm horizontal_distance_to_fire_points wilderness_area1
## Min. : 0.0 Min. : 0 Min. :0.0000
## 1st Qu.:119.0 1st Qu.:1024 1st Qu.:0.0000
## Median :143.0 Median :1711 Median :0.0000
## Mean :142.5 Mean :1982 Mean :0.4489
## 3rd Qu.:168.0 3rd Qu.:2550 3rd Qu.:1.0000
## Max. :254.0 Max. :7168 Max. :1.0000
## wilderness_area2 wilderness_area3 wilderness_area4 soil_type1
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.000000
## Mean :0.05147 Mean :0.4358 Mean :0.06375 Mean :0.005245
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.000000
## soil_type2 soil_type3 soil_type4 soil_type5
## Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000 Median :0.00000 Median :0.000000
## Mean :0.01288 Mean :0.008475 Mean :0.02156 Mean :0.002625
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000 Max. :1.00000 Max. :1.000000
## soil_type6 soil_type7 soil_type8 soil_type9
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.01132 Mean :0.00015 Mean :0.00036 Mean :0.00201
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## soil_type10 soil_type11 soil_type12 soil_type13
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.05652 Mean :0.02101 Mean :0.05169 Mean :0.03006
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## soil_type14 soil_type15 soil_type16 soil_type17
## Min. :0.000000 Min. :0e+00 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0e+00 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0e+00 Median :0.000000 Median :0.00000
## Mean :0.001005 Mean :1e-05 Mean :0.005035 Mean :0.00598
## 3rd Qu.:0.000000 3rd Qu.:0e+00 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1e+00 Max. :1.000000 Max. :1.00000
## soil_type18 soil_type19 soil_type20 soil_type21
## Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000 Median :0.00000 Median :0.000000
## Mean :0.00329 Mean :0.006905 Mean :0.01605 Mean :0.001355
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000 Max. :1.00000 Max. :1.000000
## soil_type22 soil_type23 soil_type24 soil_type25
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.05767 Mean :0.0986 Mean :0.03691 Mean :0.00083
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.00000
## soil_type26 soil_type27 soil_type28 soil_type29
## Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. :0.000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000
## Median :0.000000 Median :0.000000 Median :0.00000 Median :0.000
## Mean :0.004265 Mean :0.001975 Mean :0.00155 Mean :0.199
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000
## Max. :1.000000 Max. :1.000000 Max. :1.00000 Max. :1.000
## soil_type30 soil_type31 soil_type32 soil_type33
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.05142 Mean :0.0443 Mean :0.09073 Mean :0.0773
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## soil_type34 soil_type35 soil_type36 soil_type37
## Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.000000 Median :0.00000 Median :0.00000
## Mean :0.002765 Mean :0.003405 Mean :0.00024 Mean :0.00052
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.000000 Max. :1.00000 Max. :1.00000
## soil_type38 soil_type39 soil_type40 cover_type
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :1.000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:1.000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :2.000
## Mean :0.02646 Mean :0.02347 Mean :0.01503 Mean :2.053
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :7.000
#plot distribution of 'cover_type' variable
ggplot(data=train, aes(cover_type))+geom_freqpoly() +
scale_x_discrete(limits=c("1","2","3", "4", "5", "6", "7")) +
labs(title= "Frequency of Cover Type in Training Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#summary of distribution of 'cover_type'
table(train$cover_type)
##
## 1 2 3 4 5 6 7
## 72879 97433 12388 936 3227 6078 7059
There are 200,000 observations in both the training and test data set. There are 56 variables in the training set which include: id; elevation; aspect; slope; 40 predictors on soil types; 4 predictors for wilderness areas; 3 hillshade measurements for morning, noon, and afternoon; both the horizontal and vertical distance to hydrology; both the horizontal distance to roadways and distance to fireways; and cover type (the response variable of interest).
At first glance, the soil type appears as a binary variable of 0/1 and limited to one soil type selection per observation. The mean value shows the prevalence of each soil type among total observations. Finally, the plot shows the distribution of cover types in the training data. Cover type #2 (n=97,453) comprises nearly half of all of the observations. Cover type #1 (n=72,879) has the second most frequent occurrence.
Step 2: Pre-processing
#reduce the number of columns/variables
#pivot soil variable to one column and filter new rows by the number of 1
train2<- train%>%
pivot_longer(cols="soil_type1":"soil_type40",names_to="soil",values_to="number")%>%
filter(number==1)
#pivot wilderness variable to one column and filter new rows by the value of 1
train2<-train2%>%
pivot_longer("wilderness_area1":"wilderness_area4",names_to="wilderness",values_to="value")%>%
filter(value==1)
#remove the 'number' and 'value' columns from the previous pre-processing step
train2 <-subset(train2, select = -c(number, value))
#convert "cover_type" to a factor variable
train2$cover_type <- as.factor(train2$cover_type)
Step 3: Split the training data
#split data into test/training partition
#create partition of 50% of the training set based on the outcome, `cover type`
train_index <- createDataPartition(y= train2$cover_type, times=1, p= 0.8, list = FALSE)
#establish a training subset based on the partitioned data
training <- train2[ train_index,]
#establish a testing subset based on observations excluded from the training subset
testing <- train2[-train_index,]
During pre-processing, I reduced the number of columns using pivot_longer to identify the singular soil type and wilderness area per observation. This reduced the variables from 56 to 14. Next, I split the original training data into 80/20 subsets, training (n= ~160,000) and testing (n=~40,000) subsets.
The following function, credited to Omer Yalcin, was used to calculate the Macro F1 score.
# create function that calculates f1 score
f1 <- function(pred, actual) {
# the input vectors should be pred and actual
# the class of interest should be 1, the other class is 0
# it's a good idea throw errors when something unexpected happens
if (length(pred) != length(actual)) stop('pred and actual have different lengths')
if (any(is.na(pred))) stop('pred has missing values')
if (any(is.na(actual))) stop('actual has missing values')
# precision and recall are each ratio of two quantities
# they have a common numerator, different denominators
# remember: https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg
# numerator = observations where both the prediction and the actual value is 1
numerator <- sum((pred + actual == 2))
# denominators
precision_denominator <- sum(pred) # total number of observations predicted as 1
recall_denominator <- sum(actual) # total number of observations that are actually 1
precision <- numerator / precision_denominator
recall <- numerator / recall_denominator
f1 <- (2 * precision * recall) / (precision + recall)
if(is.nan(f1)) f1 <- 0 # is undefined for whatever reason, just assign 0
return(f1)
}
# multi-class version
macro_f1 <- function(pred, actual) {
all_classes <- sort(unique(actual)) # get a vector of all classes to iterate over
class_specific_f1 <- numeric(length(all_classes)) # an empty vector, size equal to total number of classes to hold the class-specific F1 scores so we can get its mean later
# each iteration of loop calculates F1 score for one class
for (i in 1:length(all_classes)) {
class <- all_classes[i] # get the first class
class_pred <- ifelse(pred == class, 1, 0) # if the prediction is equal to this class, code it as 1, otherwise 0
class_actual <- ifelse(actual == class, 1, 0) # if the actual value is equal to this class, code it as 1, otherwise 0
# now we have two vectors made up of 1s and 0s that can be plugged back into our f1() function
class_f1 <- f1(class_pred, class_actual)
class_specific_f1[i] <- class_f1 # store the f1 score for this class
}
# return the mean
return(mean(class_specific_f1))
}
In this section, five classification models: logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), Naive Bayes, and K-Nearest Neighbors will be used to predict the cover type of forests.
The first classification model is a multinomial logistic regression which classifies responses of more than two classifications. For this model, all response classes are handled symmetrically.
#create a multinomial logistic regression model
log.model <- train(cover_type ~ elevation + aspect + slope + soil + wilderness + horizontal_distance_to_hydrology + vertical_distance_to_hydrology+ horizontal_distance_to_roadways+ hillshade_9am+ hillshade_noon+ hillshade_3pm+ horizontal_distance_to_fire_points, method= "multinom",
trControl = trainControl(method = "cv", number = 2), data= training)
## # weights: 378 (318 variable)
## initial value 155674.757834
## iter 10 value 120252.870775
## iter 20 value 115403.130735
## iter 30 value 108376.506360
## iter 40 value 105784.968050
## iter 50 value 104779.213035
## iter 60 value 93546.029600
## iter 70 value 79556.518811
## iter 80 value 69636.914430
## iter 90 value 64315.701958
## iter 100 value 59431.677996
## final value 59431.677996
## stopped after 100 iterations
## # weights: 378 (318 variable)
## initial value 155674.757834
## iter 10 value 120252.870775
## iter 20 value 115403.130800
## iter 30 value 108376.507035
## iter 40 value 105784.972059
## iter 50 value 104779.223121
## iter 60 value 93549.359407
## iter 70 value 79617.700125
## iter 80 value 69760.708235
## iter 90 value 64606.926002
## iter 100 value 59848.830865
## final value 59848.830865
## stopped after 100 iterations
## # weights: 378 (318 variable)
## initial value 155674.757834
## iter 10 value 120252.870775
## iter 20 value 115403.130735
## iter 30 value 108376.506359
## iter 40 value 105784.968030
## iter 50 value 104779.212925
## iter 60 value 93546.028326
## iter 70 value 79556.515831
## iter 80 value 69636.957449
## iter 90 value 64315.644149
## iter 100 value 59432.251614
## final value 59432.251614
## stopped after 100 iterations
## # weights: 378 (318 variable)
## initial value 155678.649655
## iter 10 value 120240.610902
## iter 20 value 115182.232508
## iter 30 value 109707.817095
## iter 40 value 107372.455992
## iter 50 value 106603.922331
## iter 60 value 94945.115198
## iter 70 value 80132.611748
## iter 80 value 72164.843189
## iter 90 value 64419.958663
## iter 100 value 60032.434246
## final value 60032.434246
## stopped after 100 iterations
## # weights: 378 (318 variable)
## initial value 155678.649655
## iter 10 value 120240.610902
## iter 20 value 115182.232583
## iter 30 value 109707.817643
## iter 40 value 107372.459988
## iter 50 value 106603.931700
## iter 60 value 94948.557165
## iter 70 value 80200.873406
## iter 80 value 72264.132427
## iter 90 value 64514.679410
## iter 100 value 60343.885316
## final value 60343.885316
## stopped after 100 iterations
## # weights: 378 (318 variable)
## initial value 155678.649655
## iter 10 value 120240.610902
## iter 20 value 115182.232508
## iter 30 value 109707.817095
## iter 40 value 107372.455993
## iter 50 value 106603.922321
## iter 60 value 94945.118057
## iter 70 value 80132.676684
## iter 80 value 72164.918784
## iter 90 value 64419.919165
## iter 100 value 60032.878113
## final value 60032.878113
## stopped after 100 iterations
## # weights: 378 (318 variable)
## initial value 311353.407490
## iter 10 value 200010.103675
## iter 20 value 190926.185950
## iter 30 value 170877.595194
## iter 40 value 164576.270790
## iter 50 value 162611.596934
## iter 60 value 152806.097163
## iter 70 value 143213.786965
## iter 80 value 137075.802686
## iter 90 value 126409.124909
## iter 100 value 121458.359047
## final value 121458.359047
## stopped after 100 iterations
#review model summary
log.model
## Penalized Multinomial Regression
##
## 160004 samples
## 12 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (2 fold)
## Summary of sample sizes: 80001, 80003
## Resampling results across tuning parameters:
##
## decay Accuracy Kappa
## 0e+00 0.7106635 0.5274083
## 1e-04 0.7106572 0.5274017
## 1e-01 0.7108760 0.5274453
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.
actual <- training$cover_type
#predict cover type on test data
log.pred <- predict(log.model, training)
#review model accuracy
confusionMatrix(log.pred, actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7
## 1 39095 13702 1 0 44 3 2358
## 2 17179 61636 683 6 2324 1383 64
## 3 58 1832 7911 527 205 1968 17
## 4 0 30 111 156 0 42 0
## 5 17 6 0 0 5 0 0
## 6 10 581 1205 60 4 1467 0
## 7 1945 160 0 0 0 0 3209
##
## Overall Statistics
##
## Accuracy : 0.7092
## 95% CI : (0.707, 0.7115)
## No Information Rate : 0.4872
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5265
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.6705 0.7907 0.79820 0.208278 1.936e-03 0.301666
## Specificity 0.8416 0.7363 0.96931 0.998851 9.999e-01 0.988011
## Pos Pred Value 0.7082 0.7402 0.63197 0.460177 1.786e-01 0.440938
## Neg Pred Value 0.8167 0.7874 0.98644 0.996286 9.839e-01 0.978325
## Prevalence 0.3644 0.4872 0.06194 0.004681 1.614e-02 0.030393
## Detection Rate 0.2443 0.3852 0.04944 0.000975 3.125e-05 0.009169
## Detection Prevalence 0.3450 0.5205 0.07824 0.002119 1.750e-04 0.020793
## Balanced Accuracy 0.7561 0.7635 0.88375 0.603564 5.009e-01 0.644838
## Class: 7
## Sensitivity 0.56817
## Specificity 0.98636
## Pos Pred Value 0.60388
## Neg Pred Value 0.98423
## Prevalence 0.03530
## Detection Rate 0.02006
## Detection Prevalence 0.03321
## Balanced Accuracy 0.77726
#macro F1 score for logistic regression model
log_macrof1 <- macro_f1(log.pred, actual)
#print score
log_macrof1
## [1] 0.4847439
The multinomial logistic regression model did not perform well on predicting classes: 4-7. There is low sensitivity in classes: 4-6 and a low prevalence of positive predictions among these classifications. The model performed fairly well on classes: 1 or 2 and the overall model accuracy is 71%, but the macro F1 score is 43%.
In the second model, the LDA is for predictors greater than 1. Here the assumption is each predictor follows a normal distribution with some correlation between each pair of the predictors.
#create an LDA model
lda.model <- train(cover_type ~ elevation + aspect + slope + soil + wilderness + horizontal_distance_to_hydrology + vertical_distance_to_hydrology+ horizontal_distance_to_roadways+ hillshade_9am+ hillshade_noon+ hillshade_3pm+ horizontal_distance_to_fire_points, method= "lda",
trControl = trainControl(method = "cv", number = 2), data= training)
#LDA
lda.model
## Linear Discriminant Analysis
##
## 160004 samples
## 12 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (2 fold)
## Summary of sample sizes: 80004, 80000
## Resampling results:
##
## Accuracy Kappa
## 0.6804892 0.4982209
actual <- training$cover_type
#predict cover type on test data
lda.pred <- predict(lda.model, training)
#review model accuracy
confusionMatrix(lda.pred, actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7
## 1 36103 14189 0 0 22 0 1055
## 2 15968 59394 470 0 1824 776 26
## 3 50 1369 5327 221 214 1204 17
## 4 0 344 802 442 0 167 0
## 5 56 958 66 0 518 178 0
## 6 44 1290 3246 86 4 2538 0
## 7 6083 403 0 0 0 0 4550
##
## Overall Statistics
##
## Accuracy : 0.6804
## 95% CI : (0.6781, 0.6827)
## No Information Rate : 0.4872
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4982
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.6192 0.7620 0.53748 0.590120 0.200620 0.52190
## Specificity 0.8499 0.7677 0.97951 0.991755 0.992009 0.96990
## Pos Pred Value 0.7028 0.7570 0.63402 0.251852 0.291667 0.35211
## Neg Pred Value 0.7956 0.7725 0.96976 0.998060 0.986956 0.98478
## Prevalence 0.3644 0.4872 0.06194 0.004681 0.016137 0.03039
## Detection Rate 0.2256 0.3712 0.03329 0.002762 0.003237 0.01586
## Detection Prevalence 0.3210 0.4904 0.05251 0.010968 0.011100 0.04505
## Balanced Accuracy 0.7346 0.7648 0.75850 0.790938 0.596314 0.74590
## Class: 7
## Sensitivity 0.80559
## Specificity 0.95798
## Pos Pred Value 0.41229
## Neg Pred Value 0.99263
## Prevalence 0.03530
## Detection Rate 0.02844
## Detection Prevalence 0.06897
## Balanced Accuracy 0.88179
#macro F1 score for logistic regression model
lda_macrof1 <- macro_f1(lda.pred, actual)
#print score
lda_macrof1
## [1] 0.5080487
Once again the LDA model under performed on correctly predicting classes: 4-7 but this time more errors were found in class: 3. This addition of false predictions is shown in the overall decline in accuracy to 68%. However, there is an increase in macro F1 score to 51% compared to the logistic regression model at 43%.
In the third model, QDA assumes observations in each class are from Gaussian distribution and each class has its own covariance matrix.
#create an QDA model
qda.model <- train(cover_type ~ elevation + hillshade_noon, method= "qda", trControl = trainControl(method = "cv", number = 2), data= training)
#QDA results
qda.model
## Quadratic Discriminant Analysis
##
## 160004 samples
## 2 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (2 fold)
## Summary of sample sizes: 80004, 80000
## Resampling results:
##
## Accuracy Kappa
## 0.6744269 0.4563092
actual <- training$cover_type
#create prediction using QDA model
qda.pred <- predict (qda.model, training)
#review model accuracy
confusionMatrix(qda.pred, actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7
## 1 41049 17288 0 0 0 0 5235
## 2 16628 59224 2689 1 2495 1486 20
## 3 10 1082 7162 748 30 3342 0
## 4 0 0 0 0 0 0 0
## 5 23 209 0 0 57 4 0
## 6 0 137 60 0 0 31 0
## 7 594 7 0 0 0 0 393
##
## Overall Statistics
##
## Accuracy : 0.6745
## 95% CI : (0.6722, 0.6768)
## No Information Rate : 0.4872
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4564
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.7041 0.7598 0.72263 0.000000 0.0220759 0.0063747
## Specificity 0.7785 0.7158 0.96527 1.000000 0.9985008 0.9987302
## Pos Pred Value 0.6457 0.7175 0.57879 NaN 0.1945392 0.1359649
## Neg Pred Value 0.8211 0.7583 0.98138 0.995319 0.9841902 0.9697577
## Prevalence 0.3644 0.4872 0.06194 0.004681 0.0161371 0.0303930
## Detection Rate 0.2565 0.3701 0.04476 0.000000 0.0003562 0.0001937
## Detection Prevalence 0.3973 0.5159 0.07734 0.000000 0.0018312 0.0014250
## Balanced Accuracy 0.7413 0.7378 0.84395 0.500000 0.5102884 0.5025524
## Class: 7
## Sensitivity 0.069582
## Specificity 0.996106
## Pos Pred Value 0.395372
## Neg Pred Value 0.966952
## Prevalence 0.035299
## Detection Rate 0.002456
## Detection Prevalence 0.006212
## Balanced Accuracy 0.532844
#macro F1 score for logistic regression model
qda_macrof1 <- macro_f1(qda.pred, actual)
#print score
qda_macrof1
## [1] 0.3177988
This QDA model includes less predictors than the other models listed. I encountered a rank deficiency error for columns with zero values and removed those predictors from the model. Nonetheless, the overall model accuracy is 68% but the macro F1 score is 32%. We see class: 4 consistently predicted as class: 3 and classes: 5 and 6 predicted as class: 2. This suggests variables in the other predictors as influential in making these distinctions. Low sensitivity in classes: 4-7 was also an issue.
In the fourth model, Naive Bayes assumes predictors are independent with no association. Here some bias is introduced but the overall variance is reduced.
library(naivebayes)
## Warning: package 'naivebayes' was built under R version 4.1.3
## naivebayes 0.9.7 loaded
#create a naive bayes model
nb.model <- train(cover_type ~ elevation + aspect + slope + soil + wilderness + horizontal_distance_to_hydrology + vertical_distance_to_hydrology+ horizontal_distance_to_roadways+ hillshade_9am+ hillshade_noon+ hillshade_3pm+ horizontal_distance_to_fire_points, method= "naive_bayes", trControl = trainControl(method = "cv", number = 2), data= training)
#review model
nb.model
## Naive Bayes
##
## 160004 samples
## 12 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (2 fold)
## Summary of sample sizes: 80002, 80002
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.05722982 9.824005e-03
## TRUE 0.48716282 1.368247e-05
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
## and adjust = 1.
actual <- training$cover_type
#create prediction using Naive Bayes model
nb.pred <- predict (nb.model, training)
#review model accuracy
confusionMatrix(nb.pred, actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7
## 1 1 0 0 0 0 0 0
## 2 58303 77947 9911 749 2582 4863 5648
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.4872
## 95% CI : (0.4847, 0.4896)
## No Information Rate : 0.4872
## P-Value [Acc > NIR] : 0.499
##
## Kappa : 0
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 1.715e-05 1.000e+00 0.00000 0.000000 0.00000 0.00000
## Specificity 1.000e+00 1.219e-05 1.00000 1.000000 1.00000 1.00000
## Pos Pred Value 1.000e+00 4.872e-01 NaN NaN NaN NaN
## Neg Pred Value 6.356e-01 1.000e+00 0.93806 0.995319 0.98386 0.96961
## Prevalence 3.644e-01 4.872e-01 0.06194 0.004681 0.01614 0.03039
## Detection Rate 6.250e-06 4.872e-01 0.00000 0.000000 0.00000 0.00000
## Detection Prevalence 6.250e-06 1.000e+00 0.00000 0.000000 0.00000 0.00000
## Balanced Accuracy 5.000e-01 5.000e-01 0.50000 0.500000 0.50000 0.50000
## Class: 7
## Sensitivity 0.0000
## Specificity 1.0000
## Pos Pred Value NaN
## Neg Pred Value 0.9647
## Prevalence 0.0353
## Detection Rate 0.0000
## Detection Prevalence 0.0000
## Balanced Accuracy 0.5000
#macro F1 score for logistic regression model
nb_macrof1 <- macro_f1(nb.pred, actual)
#print score
nb_macrof1
## [1] 0.09359839
The Naive Bayes model performed the worst so far. The model predicted each observation as class: 2. The overall model accuracy was 49% with a macro F1 score of 9%.
# create knn model
knn.model <- train(cover_type ~ elevation + aspect + slope + soil + wilderness + horizontal_distance_to_hydrology + vertical_distance_to_hydrology+ horizontal_distance_to_roadways+ hillshade_9am+ hillshade_noon+ hillshade_3pm+ horizontal_distance_to_fire_points, method= "knn", trControl = trainControl(method = "cv", number = 2), data= testing)
#view summary from k-nearest neighbors model
knn.model
## k-Nearest Neighbors
##
## 39996 samples
## 12 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (2 fold)
## Summary of sample sizes: 19998, 19998
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8047555 0.6823441
## 7 0.7929043 0.6610797
## 9 0.7844784 0.6453631
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
actual2 <- testing$cover_type
#predict cover type on test data
knn.pred <- predict(knn.model, testing)
#review model accuracy
confusionMatrix(knn.pred, actual2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7
## 1 13266 938 7 0 10 1 155
## 2 1231 18339 118 0 158 87 21
## 3 3 79 2255 31 13 120 0
## 4 0 0 7 138 0 5 0
## 5 8 60 4 0 461 3 1
## 6 4 60 86 18 3 999 0
## 7 63 10 0 0 0 0 1234
##
## Overall Statistics
##
## Accuracy : 0.9174
## 95% CI : (0.9147, 0.9201)
## No Information Rate : 0.4872
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8667
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.9102 0.9411 0.91038 0.737968 0.71473 0.82222
## Specificity 0.9563 0.9213 0.99344 0.999699 0.99807 0.99559
## Pos Pred Value 0.9227 0.9191 0.90164 0.920000 0.85847 0.85385
## Neg Pred Value 0.9489 0.9428 0.99408 0.998770 0.99534 0.99444
## Prevalence 0.3644 0.4872 0.06193 0.004675 0.01613 0.03038
## Detection Rate 0.3317 0.4585 0.05638 0.003450 0.01153 0.02498
## Detection Prevalence 0.3595 0.4989 0.06253 0.003750 0.01343 0.02925
## Balanced Accuracy 0.9332 0.9312 0.95191 0.868833 0.85640 0.90891
## Class: 7
## Sensitivity 0.87456
## Specificity 0.99811
## Pos Pred Value 0.94415
## Neg Pred Value 0.99543
## Prevalence 0.03528
## Detection Rate 0.03085
## Detection Prevalence 0.03268
## Balanced Accuracy 0.93633
#macro F1 score for logistic regression model
knn_macrof1 <- macro_f1(knn.pred, actual2)
#print score
knn_macrof1
## [1] 0.8710215
For the K-Nearest Neighbors model, I reduced the number of observations from 160,000+ in the training set to just under 40,000 in the test set. This significantly reduced the processing time but does not appear to have affected the model’s predictive performance. Overall model accuracy was 92% and the macro F1 score is 85%. The model had the highest sensitivity and specificity of any of the other models fit. It also performed the best at classifying classes: 4-7.
Comparing the models, the logistic regression model under performed on classifications greater than two. The addition of the Bayes classifier in LDA, QDA, and Naive Bayes allows for better distribution of the predictors in each class but the Naive Bayes model assumed there is no correlation between the predictors. This was an incorrect assumption and ultimately had the lowest macro F1 score, 0.09624223, and missing data for the remaining 5 classes.
The LDA model performed better than the QDA model. LDA assumed a common covariance among the classes versus QDA which assumed each class has its own covariance matrix. This results in a lower variance, total error, and higher bias for the LDA model compared to a higher variance in the QDA model. Generally, QDA would be recommended for large training sets like this one as it allows for more flexibility; however, in addressing the rank deficiency I needed to reduce the number of predictors. This could have impacted the performance of the QDA model.
The K-Nearest Neighbors model performed the best in model accuracy, macro F1 score, and overall sensitivity and specificity across the classifications and thus selected as the preferred model to predict the test data.
In the final step, KNN was used to predict one of seven tree classifications for the test data.
#preprocess test data
#reduce the number of columns/variables
#pivot soil variable to one column and filter new rows by the number of 1
test2<- test%>%
pivot_longer(cols="soil_type1":"soil_type40",names_to="soil",values_to="number")%>%
filter(number==1)
#pivot wilderness variable to one column and filter new rows by the value of 1
test2<-test2%>%
pivot_longer("wilderness_area1":"wilderness_area4",names_to="wilderness",values_to="value")%>%
filter(value==1)
#remove the 'number' and 'value' columns from the previous pre-processing step
test2 <-subset(test2, select = -c(number, value))
#remove outlier with soiltype 15, "A310457"
test2<-test2[-c(110457),]
#predict cover type on test data
knn.test.pred <- predict(knn.model, test2)
head(knn.test.pred,10)
## [1] 1 2 1 1 2 2 1 6 2 2
## Levels: 1 2 3 4 5 6 7
#create data frame of predictions and add in the matching object id
predictions<- as.data.frame(cbind(id=test2$id, cover_type=knn.test.pred))
#write data frame to "submission.csv"
write_csv(predictions,"submission.csv")
Now, let’s see how the predictions performed in the competition!