Open Source Spatial Analytics

Sarah Woolard

4/23/2021

Methods:

This study compares four machine learning algorithms: k-nearest neighbors (k-NN), decision trees (DT), random forests (RF), and support vector machines (SVM). Part 1 splits the wine data into training and validation sets, trains the four algorithms, and assesses each one with a confusion matrix and the statistics derived from it. Part 2 applies the same four algorithms to forest type classification using 89 predictor variables and compares their predictions and accuracies. The SVM model is then used to predict forest type across the image. The resulting raster is multiplied by a mask to keep only forested areas and is mapped with the tmap package to show the predicted classes.

Part 1: Classification of Wines

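library(dplyr)   # group_by(), sample_frac(), setdiff()
library(caret)   # trainControl(), train(), confusionMatrix()

# Stratified split: draw 50% of each wine class for training
# (wine is assumed to be loaded already)
set.seed(34)
train <- wine %>% 
  group_by(class) %>% 
  sample_frac(0.5, replace=FALSE)

# Rows not drawn for training form the validation set
val <- setdiff(wine, train)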
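
Note on the split: setdiff() compares entire rows and drops duplicates, so it only returns the exact complement of train when every row of wine is unique. A minimal sketch of an index-based alternative follows, using the built-in iris data as a stand-in for wine. The object names demo, idx_by_class, train_idx, train_demo, and val_demo are illustrative and are not part of the original analysis.

```r
# Stand-in data: iris has 3 classes with 50 rows each (assumption: used here
# only because the wine data is not available in this sketch)
set.seed(34)
demo <- iris

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(demo)), demo$Species)

# Sample half of the row indices within each class (stratified 50% sample)
train_idx <- unname(unlist(lapply(idx_by_class, function(i) {
  sample(i, size = floor(0.5 * length(i)))
})))

# Training rows are the sampled indices; validation rows are everything else
train_demo <- demo[train_idx, ]
val_demo   <- demo[-train_idx, ]

table(train_demo$Species)
table(val_demo$Species)
```

Because the split is made on row indices rather than row contents, duplicated rows cannot be lost from the validation set.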

set.seed(34)
trainctrl <- trainControl(method = "cv", number = 5, verboseIter = FALSE)

K-NN Model

set.seed(34)
knn.model <- train(class~., data=train, method = "knn",
                   tuneLength = 10,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="Kappa")
knn.predict <-predict(knn.model, val)
knn_con <- confusionMatrix(knn.predict, as.factor(val$class))

DT Model

set.seed(34)
dt.model <- train(class~., data=train, method = "rpart", 
                  tuneLength = 10,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="Kappa")
dt.predict <-predict(dt.model, val)
dt_con <- confusionMatrix(dt.predict, as.factor(val$class))

RF Model

set.seed(34)
rf.model <- train(class~., data=train, method = "rf", 
                  tuneLength = 10,
                  ntree=100,
                  importance=TRUE,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="Kappa")
rf.predict <-predict(rf.model, val)
rfcon <- confusionMatrix(rf.predict, as.factor(val$class))

SVM Model

set.seed(34)
svm.model <- train(class~., data=train, method = "svmRadial",
                   tuneLength = 10,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="Kappa")
svm.predict <-predict(svm.model, val)
svmcon <- confusionMatrix(svm.predict, as.factor(val$class))

# Plot the final decision tree to see which variables define the splits
dt.model.final <- dt.model$finalModel
plot(dt.model.final)
text(dt.model.final)


# Variable importance from the final random forest
rf.model.final <- rf.model$finalModel
rfimpo <- randomForest::importance(rf.model.final)

The k-NN model yielded the highest overall accuracy.

The k-NN algorithm yielded the best Kappa statistic.
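
For reference, overall accuracy and the Kappa statistic can both be derived from a confusion matrix. The sketch below computes them by hand in base R on a small set of made-up labels (these are not the wine results). All object names are illustrative.

```r
# Made-up reference and predicted labels for three classes (hypothetical data)
reference <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C"))
predicted <- factor(c("A", "A", "B", "B", "B", "C", "C", "C", "C", "B"),
                    levels = levels(reference))

# Confusion matrix: rows are predictions, columns are reference classes
cm <- table(Predicted = predicted, Reference = reference)
n  <- sum(cm)

# Overall accuracy: share of samples on the diagonal
observed <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa: agreement beyond chance, scaled by the maximum possible
kappa_stat <- (observed - expected) / (1 - expected)

print(cm)
print(c(Accuracy = observed, Kappa = kappa_stat))
```

For the models in this report, the same two values are stored in each confusionMatrix() result, for example knn_con$overall[c("Accuracy", "Kappa")].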

Wines B and C were the classes most often confused with each other.
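
The most confused pair of classes can be read from the off-diagonal cells of a confusion matrix. A minimal base R sketch follows, assuming a square matrix with predictions in rows and reference classes in columns. The counts are invented for illustration and are not the values from this analysis.

```r
# Hypothetical 3-class confusion matrix (invented counts)
cm <- matrix(c(20,  1,  0,
                2, 15,  6,
                0,  5, 18),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = c("A", "B", "C"),
                             Reference = c("A", "B", "C")))

# Add each cell to its mirror image so that "B predicted as C" and
# "C predicted as B" count toward the same pair
errors <- cm + t(cm)

# Ignore correct predictions and count each pair once (upper triangle only)
diag(errors) <- 0
errors[lower.tri(errors)] <- 0

# Locate the pair with the most confusions
pos  <- which(errors == max(errors), arr.ind = TRUE)[1, ]
pair <- c(rownames(errors)[pos[["row"]]], colnames(errors)[pos[["col"]]])
print(pair)
```

With caret, the matrix for a fitted model is available as the table element of the confusionMatrix() result, for example knn_con$table.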

Proline and color intensity were the two variables the decision tree used to split the data.

Alc, Flav, ColorInt, and Proline were the most important variables in the RF model.
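
The ranking above comes from sorting the importance matrix. A minimal sketch of that step follows, assuming the matrix has a MeanDecreaseAccuracy column, which randomForest reports for classification forests fitted with importance=TRUE. The variable names and values in impo_demo are hypothetical placeholders, not the wine results.

```r
# Hypothetical importance matrix shaped like the output of importance()
impo_demo <- matrix(c(12.1, 3.4,
                       4.7, 1.2,
                       9.8, 2.9),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(c("var_a", "var_b", "var_c"),
                                    c("MeanDecreaseAccuracy", "MeanDecreaseGini")))

# Sort rows from most to least important by mean decrease in accuracy
ranked <- impo_demo[order(impo_demo[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]

# Names of the two most important variables
top2 <- rownames(ranked)[1:2]
print(ranked)
print(top2)
```

Applied to this report, the same ordering would be run on rfimpo in place of impo_demo.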

Part 2: Forest Type Classification

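# Draw 100 training samples per forest class
# (training_data is assumed to be loaded already)
set.seed(34)
train_sub <- training_data %>%
  group_by(class) %>%
  sample_n(100, replace=FALSE)

K-NN Model 2

set.seed(34)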
knn.model2 <- train(class~., data=train_sub, method = "knn",
                   tuneLength = 10,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="Kappa")
knn.predict2 <-predict(knn.model2, val_data)
knn2_con <- confusionMatrix(knn.predict2, as.factor(val_data$class))

DT Model 2

set.seed(34)
dt.model2 <- train(class~., data=train_sub, method = "rpart", 
                  tuneLength = 10,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="Kappa")
dt.predict2 <-predict(dt.model2, val_data)
dt2_con <- confusionMatrix(dt.predict2, as.factor(val_data$class))

RF Model 2

set.seed(34)
rf.model2 <- train(class~., data=train_sub, method = "rf", 
                  tuneLength = 10,
                  ntree=100,
                  importance=TRUE,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="Kappa")
rf.predict2 <-predict(rf.model2, val_data)
rf2_con <-confusionMatrix(rf.predict2, as.factor(val_data$class))

SVM Model 2

set.seed(34)
svm.model2 <- train(class~., data=train_sub, method = "svmRadial",
                   tuneLength = 10,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="Kappa")
svm.predict2 <-predict(svm.model2, val_data)
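svm2_con <-confusionMatrix(svm.predict2, as.factor(val_data$class))

# Name the image bands from bo$name so they can be matched to the model's predictors,
# then predict a forest class for every cell and write the result to disk
names(image) <- bo$name
predict(image, svm.model2, overwrite=TRUE, filename="C:/SW/Grad_WVU/693c/A14image_out.img")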
## class      : RasterLayer 
## dimensions : 612, 1507, 922284  (nrow, ncol, ncell)
## resolution : 30, 30  (x, y)
## extent     : 611359.9, 656569.9, 4308286, 4326646  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=17 +datum=WGS84 +units=m +no_defs 
## source     : C:/SW/Grad_WVU/693c/A14image_out.img 
## names      : A14image_out 
## values     : 1, 5  (min, max)
# Reload the prediction and multiply by the mask to keep only forested cells
raster_result <- raster("C:/SW/Grad_WVU/693c/A14image_out.img")
result_masked <- raster_result*mask

# Map the five predicted forest classes
tm_shape(result_masked)+
  tm_raster(style= "cat", labels = c("Evergreen", "Mixed Mesophytic/Cove Hardwood", "Northern Hardwood", "Oak", "Oak-Pine"), 
            palette = c("forestgreen", "darkgreen", "lightgreen", "tan", "brown"), 
            title="Forest Classes")+
  tm_layout(main.title = "Predicted Forest Classifications", main.title.size = 1.5, main.title.fontface = "bold",
            legend.outside=TRUE) + 
  tm_compass(position = c("right", "bottom")) + 
  tm_scale_bar(position = c("right", "bottom"))
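
The masking step works cell by cell. The sketch below illustrates it on small base R matrices instead of rasters, assuming the mask holds 1 for forest cells and NA elsewhere. The actual coding of the mask object is not shown in this report, so that is an assumption, and the names classes, forest_mask, and masked are illustrative.

```r
# Hypothetical 3 x 3 grid of predicted class codes (1 to 5)
classes <- matrix(c(1, 2, 3,
                    4, 5, 1,
                    2, 3, 4),
                  nrow = 3, byrow = TRUE)

# Assumed mask coding: 1 = forest, NA = not forest
forest_mask <- matrix(c( 1, NA,  1,
                         1,  1, NA,
                        NA,  1,  1),
                      nrow = 3, byrow = TRUE)

# Multiplying by 1 keeps the class code; multiplying by NA yields NA
masked <- classes * forest_mask
print(masked)
```

If the mask used 0 instead of NA for non-forest cells, the product would contain 0 in those cells, which a categorical map would draw as an extra class unless it is removed first.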

# Variable importance from the final random forest for the forest type data
rf.model.final2 <- rf.model2$finalModel
rf2_impo <- randomForest::importance(rf.model.final2)

The SVM model returned the highest overall accuracy and greatest Kappa statistic.

Mixed Mesophytic/Cove Hardwood, Northern Hardwood, and Oak proved the most difficult classes to map.

Mixed Mesophytic/Cove Hardwood and Northern Hardwood forest types were most often confused.

The variables all.71 and all.52 were the most important in the RF model.