KM_HW2_ForestFrenzy

Data

Step 1: Review the dimensions, variables, and summarize the data

#collect dimensions of the training data
dim(train)

## [1] 200000     56

#collect dimensions of the test data
dim(test)

## [1] 200000     55

#summarize training data
summary(train)

##       id              elevation        aspect          slope      
##  Length:200000      Min.   :1859   Min.   :  0.0   Min.   : 0.00  
##  Class :character   1st Qu.:2808   1st Qu.: 58.0   1st Qu.: 9.00  
##  Mode  :character   Median :2995   Median :127.0   Median :13.00  
##                     Mean   :2959   Mean   :155.4   Mean   :14.11  
##                     3rd Qu.:3163   3rd Qu.:260.0   3rd Qu.:18.00  
##                     Max.   :3858   Max.   :360.0   Max.   :66.00  
##  horizontal_distance_to_hydrology vertical_distance_to_hydrology
##  Min.   :   0.0                   Min.   :-173.00               
##  1st Qu.: 108.0                   1st Qu.:   7.00               
##  Median : 218.0                   Median :  30.00               
##  Mean   : 269.6                   Mean   :  46.52               
##  3rd Qu.: 384.0                   3rd Qu.:  69.00               
##  Max.   :1390.0                   Max.   : 601.00               
##  horizontal_distance_to_roadways hillshade_9am   hillshade_noon 
##  Min.   :   0                    Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:1110                    1st Qu.:198.0   1st Qu.:213.0  
##  Median :1998                    Median :218.0   Median :226.0  
##  Mean   :2351                    Mean   :212.1   Mean   :223.3  
##  3rd Qu.:3329                    3rd Qu.:231.0   3rd Qu.:237.0  
##  Max.   :7087                    Max.   :254.0   Max.   :254.0  
##  hillshade_3pm   horizontal_distance_to_fire_points wilderness_area1
##  Min.   :  0.0   Min.   :   0                       Min.   :0.0000  
##  1st Qu.:119.0   1st Qu.:1024                       1st Qu.:0.0000  
##  Median :143.0   Median :1711                       Median :0.0000  
##  Mean   :142.5   Mean   :1982                       Mean   :0.4489  
##  3rd Qu.:168.0   3rd Qu.:2550                       3rd Qu.:1.0000  
##  Max.   :254.0   Max.   :7168                       Max.   :1.0000  
##  wilderness_area2  wilderness_area3 wilderness_area4    soil_type1      
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.0000   Median :0.00000   Median :0.000000  
##  Mean   :0.05147   Mean   :0.4358   Mean   :0.06375   Mean   :0.005245  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.000000  
##    soil_type2        soil_type3         soil_type4        soil_type5      
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.000000  
##  Mean   :0.01288   Mean   :0.008475   Mean   :0.02156   Mean   :0.002625  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.00000   Max.   :1.000000  
##    soil_type6        soil_type7        soil_type8        soil_type9     
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.01132   Mean   :0.00015   Mean   :0.00036   Mean   :0.00201  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##   soil_type10       soil_type11       soil_type12       soil_type13     
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.05652   Mean   :0.02101   Mean   :0.05169   Mean   :0.03006  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##   soil_type14        soil_type15     soil_type16        soil_type17     
##  Min.   :0.000000   Min.   :0e+00   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0e+00   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.000000   Median :0e+00   Median :0.000000   Median :0.00000  
##  Mean   :0.001005   Mean   :1e-05   Mean   :0.005035   Mean   :0.00598  
##  3rd Qu.:0.000000   3rd Qu.:0e+00   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1e+00   Max.   :1.000000   Max.   :1.00000  
##   soil_type18       soil_type19        soil_type20       soil_type21      
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.000000  
##  Mean   :0.00329   Mean   :0.006905   Mean   :0.01605   Mean   :0.001355  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.00000   Max.   :1.000000  
##   soil_type22       soil_type23      soil_type24       soil_type25     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.0000   Median :0.00000   Median :0.00000  
##  Mean   :0.05767   Mean   :0.0986   Mean   :0.03691   Mean   :0.00083  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
##   soil_type26        soil_type27        soil_type28       soil_type29   
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.000  
##  Median :0.000000   Median :0.000000   Median :0.00000   Median :0.000  
##  Mean   :0.004265   Mean   :0.001975   Mean   :0.00155   Mean   :0.199  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.00000   Max.   :1.000  
##   soil_type30       soil_type31      soil_type32       soil_type33    
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.05142   Mean   :0.0443   Mean   :0.09073   Mean   :0.0773  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##   soil_type34        soil_type35        soil_type36       soil_type37     
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.000000   Median :0.000000   Median :0.00000   Median :0.00000  
##  Mean   :0.002765   Mean   :0.003405   Mean   :0.00024   Mean   :0.00052  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.00000   Max.   :1.00000  
##   soil_type38       soil_type39       soil_type40        cover_type   
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :1.000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :2.000  
##  Mean   :0.02646   Mean   :0.02347   Mean   :0.01503   Mean   :2.053  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :7.000

#plot distribution of 'cover_type' variable
ggplot(data=train, aes(cover_type))+geom_freqpoly() +
  scale_x_discrete(limits=c("1","2","3", "4", "5", "6", "7")) +
  labs(title= "Frequency of Cover Type in Training Data")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#summary of distribution of 'cover_type'
table(train$cover_type)

## 
##     1     2     3     4     5     6     7 
## 72879 97433 12388   936  3227  6078  7059

There are 200,000 observations in both the training and test data set. There are 56 variables in the training set which include: id; elevation; aspect; slope; 40 predictors on soil types; 4 predictors for wilderness areas; 3 hillshade measurements for morning, noon, and afternoon; both the horizontal and vertical distance to hydrology; both the horizontal distance to roadways and distance to fireways; and cover type (the response variable of interest).

At first glance, the soil type appears as a binary variable of 0/1 and limited to one soil type selection per observation. The mean value shows the prevalence of each soil type among total observations. Finally, the plot shows the distribution of cover types in the training data. Cover type #2 (n=97,453) comprises nearly half of all of the observations. Cover type #1 (n=72,879) has the second most frequent occurrence.

Step 2: Pre-processing

#reduce the number of columns/variables

#pivot soil variable to one column and filter new rows by the number of 1
train2<- train%>%
  pivot_longer(cols="soil_type1":"soil_type40",names_to="soil",values_to="number")%>%
  filter(number==1)

#pivot wilderness variable to one column and filter new rows by the value of 1
train2<-train2%>%
  pivot_longer("wilderness_area1":"wilderness_area4",names_to="wilderness",values_to="value")%>%
  filter(value==1)

#remove the 'number' and 'value' columns from the previous pre-processing step
train2 <-subset(train2, select = -c(number, value))

#convert "cover_type" to a factor variable 
train2$cover_type <- as.factor(train2$cover_type)

Step 3: Split the training data

#split data into test/training partition 

#create partition of 50% of the training set based on the outcome, `cover type`
train_index <- createDataPartition(y= train2$cover_type, times=1, p= 0.8, list = FALSE)

#establish a training subset based on the partitioned data
training <- train2[ train_index,]

#establish a testing subset based on observations excluded from the training subset
testing <- train2[-train_index,]

During pre-processing, I reduced the number of columns using pivot_longer to identify the singular soil type and wilderness area per observation. This reduced the variables from 56 to 14. Next, I split the original training data into 80/20 subsets, training (n= ~160,000) and testing (n=~40,000) subsets.

Evaluation Metric: Macro F1 Score

The following function, credited to Omer Yalcin, was used to calculate the Macro F1 score.

# create function that calculates f1 score

f1 <- function(pred, actual) {
  
  # the input vectors should be pred and actual
  # the class of interest should be 1, the other class is 0 
  
  # it's a good idea throw errors when something unexpected happens
  if (length(pred) != length(actual)) stop('pred and actual have different lengths')
  if (any(is.na(pred))) stop('pred has missing values')
  if (any(is.na(actual))) stop('actual has missing values')
  
  # precision and recall are each ratio of two quantities 
  # they have a common numerator, different denominators
  # remember: https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg
  
  # numerator = observations where both the prediction and the actual value is 1
  numerator <- sum((pred + actual == 2))
  
  # denominators
  precision_denominator <- sum(pred) # total number of observations predicted as 1
  recall_denominator <- sum(actual) # total number of observations that are actually 1
  precision <- numerator / precision_denominator
  recall <- numerator / recall_denominator
  f1 <- (2 * precision * recall) / (precision + recall)
  if(is.nan(f1)) f1 <- 0 # is undefined for whatever reason, just assign 0
  return(f1)
}

#  multi-class version

macro_f1 <- function(pred, actual) {
  all_classes <- sort(unique(actual)) # get a vector of all classes to iterate over
  class_specific_f1 <- numeric(length(all_classes)) # an empty vector, size equal to total number of classes to hold the class-specific F1 scores so we can get its mean later
  
  # each iteration of loop calculates F1 score for one class
  for (i in 1:length(all_classes)) {
    class <- all_classes[i] # get the first class
    class_pred <- ifelse(pred == class, 1, 0) # if the prediction is equal to this class, code it as 1, otherwise 0
    class_actual <- ifelse(actual == class, 1, 0) # if the actual value is equal to this class, code it as 1, otherwise 0
    
    # now we have two vectors made up of 1s and 0s that can be plugged back into our f1() function
    class_f1 <- f1(class_pred, class_actual) 
    class_specific_f1[i] <- class_f1 # store the f1 score for this class
  }
 
  # return the mean 
  return(mean(class_specific_f1))
}

Classification Models

In this section, five classification models: logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), Naive Bayes, and K-Nearest Neighbors will be used to predict the cover type of forests.

Multinomial Logistic Regression

The first classification model is a multinomial logistic regression which classifies responses of more than two classifications. For this model, all response classes are handled symmetrically.

#create a multinomial logistic regression model

log.model <- train(cover_type ~ elevation + aspect + slope + soil + wilderness + horizontal_distance_to_hydrology + vertical_distance_to_hydrology+ horizontal_distance_to_roadways+ hillshade_9am+ hillshade_noon+ hillshade_3pm+ horizontal_distance_to_fire_points, method= "multinom", 
  trControl = trainControl(method = "cv", number = 2), data= training)

## # weights:  378 (318 variable)
## initial  value 155674.757834 
## iter  10 value 120252.870775
## iter  20 value 115403.130735
## iter  30 value 108376.506360
## iter  40 value 105784.968050
## iter  50 value 104779.213035
## iter  60 value 93546.029600
## iter  70 value 79556.518811
## iter  80 value 69636.914430
## iter  90 value 64315.701958
## iter 100 value 59431.677996
## final  value 59431.677996 
## stopped after 100 iterations
## # weights:  378 (318 variable)
## initial  value 155674.757834 
## iter  10 value 120252.870775
## iter  20 value 115403.130800
## iter  30 value 108376.507035
## iter  40 value 105784.972059
## iter  50 value 104779.223121
## iter  60 value 93549.359407
## iter  70 value 79617.700125
## iter  80 value 69760.708235
## iter  90 value 64606.926002
## iter 100 value 59848.830865
## final  value 59848.830865 
## stopped after 100 iterations
## # weights:  378 (318 variable)
## initial  value 155674.757834 
## iter  10 value 120252.870775
## iter  20 value 115403.130735
## iter  30 value 108376.506359
## iter  40 value 105784.968030
## iter  50 value 104779.212925
## iter  60 value 93546.028326
## iter  70 value 79556.515831
## iter  80 value 69636.957449
## iter  90 value 64315.644149
## iter 100 value 59432.251614
## final  value 59432.251614 
## stopped after 100 iterations
## # weights:  378 (318 variable)
## initial  value 155678.649655 
## iter  10 value 120240.610902
## iter  20 value 115182.232508
## iter  30 value 109707.817095
## iter  40 value 107372.455992
## iter  50 value 106603.922331
## iter  60 value 94945.115198
## iter  70 value 80132.611748
## iter  80 value 72164.843189
## iter  90 value 64419.958663
## iter 100 value 60032.434246
## final  value 60032.434246 
## stopped after 100 iterations
## # weights:  378 (318 variable)
## initial  value 155678.649655 
## iter  10 value 120240.610902
## iter  20 value 115182.232583
## iter  30 value 109707.817643
## iter  40 value 107372.459988
## iter  50 value 106603.931700
## iter  60 value 94948.557165
## iter  70 value 80200.873406
## iter  80 value 72264.132427
## iter  90 value 64514.679410
## iter 100 value 60343.885316
## final  value 60343.885316 
## stopped after 100 iterations
## # weights:  378 (318 variable)
## initial  value 155678.649655 
## iter  10 value 120240.610902
## iter  20 value 115182.232508
## iter  30 value 109707.817095
## iter  40 value 107372.455993
## iter  50 value 106603.922321
## iter  60 value 94945.118057
## iter  70 value 80132.676684
## iter  80 value 72164.918784
## iter  90 value 64419.919165
## iter 100 value 60032.878113
## final  value 60032.878113 
## stopped after 100 iterations
## # weights:  378 (318 variable)
## initial  value 311353.407490 
## iter  10 value 200010.103675
## iter  20 value 190926.185950
## iter  30 value 170877.595194
## iter  40 value 164576.270790
## iter  50 value 162611.596934
## iter  60 value 152806.097163
## iter  70 value 143213.786965
## iter  80 value 137075.802686
## iter  90 value 126409.124909
## iter 100 value 121458.359047
## final  value 121458.359047 
## stopped after 100 iterations

#review model summary
log.model

## Penalized Multinomial Regression 
## 
## 160004 samples
##     12 predictor
##      7 classes: '1', '2', '3', '4', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold) 
## Summary of sample sizes: 80001, 80003 
## Resampling results across tuning parameters:
## 
##   decay  Accuracy   Kappa    
##   0e+00  0.7106635  0.5274083
##   1e-04  0.7106572  0.5274017
##   1e-01  0.7108760  0.5274453
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.

actual <- training$cover_type

#predict cover type on test data
log.pred <- predict(log.model, training)

#review model accuracy
confusionMatrix(log.pred, actual)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     1     2     3     4     5     6     7
##          1 39095 13702     1     0    44     3  2358
##          2 17179 61636   683     6  2324  1383    64
##          3    58  1832  7911   527   205  1968    17
##          4     0    30   111   156     0    42     0
##          5    17     6     0     0     5     0     0
##          6    10   581  1205    60     4  1467     0
##          7  1945   160     0     0     0     0  3209
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7092         
##                  95% CI : (0.707, 0.7115)
##     No Information Rate : 0.4872         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5265         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4  Class: 5 Class: 6
## Sensitivity            0.6705   0.7907  0.79820 0.208278 1.936e-03 0.301666
## Specificity            0.8416   0.7363  0.96931 0.998851 9.999e-01 0.988011
## Pos Pred Value         0.7082   0.7402  0.63197 0.460177 1.786e-01 0.440938
## Neg Pred Value         0.8167   0.7874  0.98644 0.996286 9.839e-01 0.978325
## Prevalence             0.3644   0.4872  0.06194 0.004681 1.614e-02 0.030393
## Detection Rate         0.2443   0.3852  0.04944 0.000975 3.125e-05 0.009169
## Detection Prevalence   0.3450   0.5205  0.07824 0.002119 1.750e-04 0.020793
## Balanced Accuracy      0.7561   0.7635  0.88375 0.603564 5.009e-01 0.644838
##                      Class: 7
## Sensitivity           0.56817
## Specificity           0.98636
## Pos Pred Value        0.60388
## Neg Pred Value        0.98423
## Prevalence            0.03530
## Detection Rate        0.02006
## Detection Prevalence  0.03321
## Balanced Accuracy     0.77726

#macro F1 score for logistic regression model
log_macrof1 <- macro_f1(log.pred, actual)

#print score
log_macrof1

## [1] 0.4847439

The multinomial logistic regression model did not perform well on predicting classes: 4-7. There is low sensitivity in classes: 4-6 and a low prevalence of positive predictions among these classifications. The model performed fairly well on classes: 1 or 2 and the overall model accuracy is 71%, but the macro F1 score is 43%.

Linear Discriminant Analysis (LDA)

In the second model, the LDA is for predictors greater than 1. Here the assumption is each predictor follows a normal distribution with some correlation between each pair of the predictors.

#create an LDA model 
lda.model <- train(cover_type ~ elevation + aspect + slope + soil + wilderness + horizontal_distance_to_hydrology + vertical_distance_to_hydrology+ horizontal_distance_to_roadways+ hillshade_9am+ hillshade_noon+ hillshade_3pm+ horizontal_distance_to_fire_points, method= "lda", 
  trControl = trainControl(method = "cv", number = 2), data= training)

#LDA
lda.model

## Linear Discriminant Analysis 
## 
## 160004 samples
##     12 predictor
##      7 classes: '1', '2', '3', '4', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold) 
## Summary of sample sizes: 80004, 80000 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6804892  0.4982209

actual <- training$cover_type

#predict cover type on test data
lda.pred <- predict(lda.model, training)

#review model accuracy
confusionMatrix(lda.pred, actual)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     1     2     3     4     5     6     7
##          1 36103 14189     0     0    22     0  1055
##          2 15968 59394   470     0  1824   776    26
##          3    50  1369  5327   221   214  1204    17
##          4     0   344   802   442     0   167     0
##          5    56   958    66     0   518   178     0
##          6    44  1290  3246    86     4  2538     0
##          7  6083   403     0     0     0     0  4550
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6804          
##                  95% CI : (0.6781, 0.6827)
##     No Information Rate : 0.4872          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4982          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity            0.6192   0.7620  0.53748 0.590120 0.200620  0.52190
## Specificity            0.8499   0.7677  0.97951 0.991755 0.992009  0.96990
## Pos Pred Value         0.7028   0.7570  0.63402 0.251852 0.291667  0.35211
## Neg Pred Value         0.7956   0.7725  0.96976 0.998060 0.986956  0.98478
## Prevalence             0.3644   0.4872  0.06194 0.004681 0.016137  0.03039
## Detection Rate         0.2256   0.3712  0.03329 0.002762 0.003237  0.01586
## Detection Prevalence   0.3210   0.4904  0.05251 0.010968 0.011100  0.04505
## Balanced Accuracy      0.7346   0.7648  0.75850 0.790938 0.596314  0.74590
##                      Class: 7
## Sensitivity           0.80559
## Specificity           0.95798
## Pos Pred Value        0.41229
## Neg Pred Value        0.99263
## Prevalence            0.03530
## Detection Rate        0.02844
## Detection Prevalence  0.06897
## Balanced Accuracy     0.88179

#macro F1 score for logistic regression model
lda_macrof1 <- macro_f1(lda.pred, actual)

#print score
lda_macrof1

## [1] 0.5080487

Once again the LDA model under performed on correctly predicting classes: 4-7 but this time more errors were found in class: 3. This addition of false predictions is shown in the overall decline in accuracy to 68%. However, there is an increase in macro F1 score to 51% compared to the logistic regression model at 43%.

Quadratic Discriminant Analysis (QDA)

In the third model, QDA assumes observations in each class are from Gaussian distribution and each class has its own covariance matrix.

#create an QDA model 
qda.model <- train(cover_type ~ elevation + hillshade_noon, method= "qda", trControl = trainControl(method = "cv", number = 2), data= training)

#QDA results
qda.model

## Quadratic Discriminant Analysis 
## 
## 160004 samples
##      2 predictor
##      7 classes: '1', '2', '3', '4', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold) 
## Summary of sample sizes: 80004, 80000 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6744269  0.4563092

actual <- training$cover_type

#create prediction using QDA model
qda.pred <- predict (qda.model, training)

#review model accuracy
confusionMatrix(qda.pred, actual)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     1     2     3     4     5     6     7
##          1 41049 17288     0     0     0     0  5235
##          2 16628 59224  2689     1  2495  1486    20
##          3    10  1082  7162   748    30  3342     0
##          4     0     0     0     0     0     0     0
##          5    23   209     0     0    57     4     0
##          6     0   137    60     0     0    31     0
##          7   594     7     0     0     0     0   393
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6745          
##                  95% CI : (0.6722, 0.6768)
##     No Information Rate : 0.4872          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4564          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4  Class: 5  Class: 6
## Sensitivity            0.7041   0.7598  0.72263 0.000000 0.0220759 0.0063747
## Specificity            0.7785   0.7158  0.96527 1.000000 0.9985008 0.9987302
## Pos Pred Value         0.6457   0.7175  0.57879      NaN 0.1945392 0.1359649
## Neg Pred Value         0.8211   0.7583  0.98138 0.995319 0.9841902 0.9697577
## Prevalence             0.3644   0.4872  0.06194 0.004681 0.0161371 0.0303930
## Detection Rate         0.2565   0.3701  0.04476 0.000000 0.0003562 0.0001937
## Detection Prevalence   0.3973   0.5159  0.07734 0.000000 0.0018312 0.0014250
## Balanced Accuracy      0.7413   0.7378  0.84395 0.500000 0.5102884 0.5025524
##                      Class: 7
## Sensitivity          0.069582
## Specificity          0.996106
## Pos Pred Value       0.395372
## Neg Pred Value       0.966952
## Prevalence           0.035299
## Detection Rate       0.002456
## Detection Prevalence 0.006212
## Balanced Accuracy    0.532844

#macro F1 score for logistic regression model
qda_macrof1 <- macro_f1(qda.pred, actual)

#print score
qda_macrof1

## [1] 0.3177988

This QDA model includes less predictors than the other models listed. I encountered a rank deficiency error for columns with zero values and removed those predictors from the model. Nonetheless, the overall model accuracy is 68% but the macro F1 score is 32%. We see class: 4 consistently predicted as class: 3 and classes: 5 and 6 predicted as class: 2. This suggests variables in the other predictors as influential in making these distinctions. Low sensitivity in classes: 4-7 was also an issue.

Naive Bayes

In the fourth model, Naive Bayes assumes predictors are independent with no association. Here some bias is introduced but the overall variance is reduced.

library(naivebayes)

## Warning: package 'naivebayes' was built under R version 4.1.3

## naivebayes 0.9.7 loaded

#create a naive bayes model
nb.model <- train(cover_type ~ elevation + aspect + slope + soil + wilderness + horizontal_distance_to_hydrology + vertical_distance_to_hydrology+ horizontal_distance_to_roadways+ hillshade_9am+ hillshade_noon+ hillshade_3pm+ horizontal_distance_to_fire_points, method= "naive_bayes", trControl = trainControl(method = "cv", number = 2), data= training)

#review model
nb.model

## Naive Bayes 
## 
## 160004 samples
##     12 predictor
##      7 classes: '1', '2', '3', '4', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold) 
## Summary of sample sizes: 80002, 80002 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy    Kappa       
##   FALSE      0.05722982  9.824005e-03
##    TRUE      0.48716282  1.368247e-05
## 
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
##  and adjust = 1.

actual <- training$cover_type

#create prediction using Naive Bayes model
nb.pred <- predict (nb.model, training)

#review model accuracy
confusionMatrix(nb.pred, actual)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     1     2     3     4     5     6     7
##          1     1     0     0     0     0     0     0
##          2 58303 77947  9911   749  2582  4863  5648
##          3     0     0     0     0     0     0     0
##          4     0     0     0     0     0     0     0
##          5     0     0     0     0     0     0     0
##          6     0     0     0     0     0     0     0
##          7     0     0     0     0     0     0     0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4872          
##                  95% CI : (0.4847, 0.4896)
##     No Information Rate : 0.4872          
##     P-Value [Acc > NIR] : 0.499           
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                       Class: 1  Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity          1.715e-05 1.000e+00  0.00000 0.000000  0.00000  0.00000
## Specificity          1.000e+00 1.219e-05  1.00000 1.000000  1.00000  1.00000
## Pos Pred Value       1.000e+00 4.872e-01      NaN      NaN      NaN      NaN
## Neg Pred Value       6.356e-01 1.000e+00  0.93806 0.995319  0.98386  0.96961
## Prevalence           3.644e-01 4.872e-01  0.06194 0.004681  0.01614  0.03039
## Detection Rate       6.250e-06 4.872e-01  0.00000 0.000000  0.00000  0.00000
## Detection Prevalence 6.250e-06 1.000e+00  0.00000 0.000000  0.00000  0.00000
## Balanced Accuracy    5.000e-01 5.000e-01  0.50000 0.500000  0.50000  0.50000
##                      Class: 7
## Sensitivity            0.0000
## Specificity            1.0000
## Pos Pred Value            NaN
## Neg Pred Value         0.9647
## Prevalence             0.0353
## Detection Rate         0.0000
## Detection Prevalence   0.0000
## Balanced Accuracy      0.5000

#macro F1 score for logistic regression model
nb_macrof1 <- macro_f1(nb.pred, actual)

#print score
nb_macrof1

## [1] 0.09359839

The Naive Bayes model performed the worst so far. The model predicted each observation as class: 2. The overall model accuracy was 49% with a macro F1 score of 9%.

K-Nearest Neighbors (KNN)

# create knn model
knn.model <- train(cover_type ~ elevation + aspect + slope + soil + wilderness + horizontal_distance_to_hydrology + vertical_distance_to_hydrology+ horizontal_distance_to_roadways+ hillshade_9am+ hillshade_noon+ hillshade_3pm+ horizontal_distance_to_fire_points, method= "knn", trControl = trainControl(method = "cv", number = 2), data= testing)

#view summary from k-nearest neighbors model
knn.model

## k-Nearest Neighbors 
## 
## 39996 samples
##    12 predictor
##     7 classes: '1', '2', '3', '4', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold) 
## Summary of sample sizes: 19998, 19998 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.8047555  0.6823441
##   7  0.7929043  0.6610797
##   9  0.7844784  0.6453631
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

actual2 <- testing$cover_type

#predict cover type on test data
knn.pred <- predict(knn.model, testing)

#review model accuracy
confusionMatrix(knn.pred, actual2)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     1     2     3     4     5     6     7
##          1 13266   938     7     0    10     1   155
##          2  1231 18339   118     0   158    87    21
##          3     3    79  2255    31    13   120     0
##          4     0     0     7   138     0     5     0
##          5     8    60     4     0   461     3     1
##          6     4    60    86    18     3   999     0
##          7    63    10     0     0     0     0  1234
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9174          
##                  95% CI : (0.9147, 0.9201)
##     No Information Rate : 0.4872          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8667          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity            0.9102   0.9411  0.91038 0.737968  0.71473  0.82222
## Specificity            0.9563   0.9213  0.99344 0.999699  0.99807  0.99559
## Pos Pred Value         0.9227   0.9191  0.90164 0.920000  0.85847  0.85385
## Neg Pred Value         0.9489   0.9428  0.99408 0.998770  0.99534  0.99444
## Prevalence             0.3644   0.4872  0.06193 0.004675  0.01613  0.03038
## Detection Rate         0.3317   0.4585  0.05638 0.003450  0.01153  0.02498
## Detection Prevalence   0.3595   0.4989  0.06253 0.003750  0.01343  0.02925
## Balanced Accuracy      0.9332   0.9312  0.95191 0.868833  0.85640  0.90891
##                      Class: 7
## Sensitivity           0.87456
## Specificity           0.99811
## Pos Pred Value        0.94415
## Neg Pred Value        0.99543
## Prevalence            0.03528
## Detection Rate        0.03085
## Detection Prevalence  0.03268
## Balanced Accuracy     0.93633

#macro F1 score for logistic regression model
knn_macrof1 <- macro_f1(knn.pred, actual2)

#print score
knn_macrof1

## [1] 0.8710215

For the K-Nearest Neighbors model, I reduced the number of observations from 160,000+ in the training set to just under 40,000 in the test set. This significantly reduced the processing time but does not appear to have affected the model’s predictive performance. Overall model accuracy was 92% and the macro F1 score is 85%. The model had the highest sensitivity and specificity of any of the other models fit. It also performed the best at classifying classes: 4-7.

Model Comparison

Comparing the models, the logistic regression model under performed on classifications greater than two. The addition of the Bayes classifier in LDA, QDA, and Naive Bayes allows for better distribution of the predictors in each class but the Naive Bayes model assumed there is no correlation between the predictors. This was an incorrect assumption and ultimately had the lowest macro F1 score, 0.09624223, and missing data for the remaining 5 classes.

The LDA model performed better than the QDA model. LDA assumed a common covariance among the classes versus QDA which assumed each class has its own covariance matrix. This results in a lower variance, total error, and higher bias for the LDA model compared to a higher variance in the QDA model. Generally, QDA would be recommended for large training sets like this one as it allows for more flexibility; however, in addressing the rank deficiency I needed to reduce the number of predictors. This could have impacted the performance of the QDA model.

The K-Nearest Neighbors model performed the best in model accuracy, macro F1 score, and overall sensitivity and specificity across the classifications and thus selected as the preferred model to predict the test data.

Predictions

In the final step, KNN was used to predict one of seven tree classifications for the test data.

#preprocess test data

#reduce the number of columns/variables

#pivot soil variable to one column and filter new rows by the number of 1
test2<- test%>%
  pivot_longer(cols="soil_type1":"soil_type40",names_to="soil",values_to="number")%>%
  filter(number==1)

#pivot wilderness variable to one column and filter new rows by the value of 1
test2<-test2%>%
  pivot_longer("wilderness_area1":"wilderness_area4",names_to="wilderness",values_to="value")%>%
  filter(value==1)

#remove the 'number' and 'value' columns from the previous pre-processing step
test2 <-subset(test2, select = -c(number, value))

#remove outlier with soiltype 15, "A310457"
test2<-test2[-c(110457),]

#predict cover type on test data
knn.test.pred <- predict(knn.model, test2)

head(knn.test.pred,10)

##  [1] 1 2 1 1 2 2 1 6 2 2
## Levels: 1 2 3 4 5 6 7

#create data frame of predictions and add in the matching object id
predictions<- as.data.frame(cbind(id=test2$id, cover_type=knn.test.pred))

#write data frame to "submission.csv"
write_csv(predictions,"submission.csv")

Now, let’s see how the predictions performed in the competition!