Alex Moss November 2023
This lab uses a dataset of UN demographic data from 1995 covering 108 countries to build a model that classifies each country's GDP per capita as either high or low with the aid of predictor variables.
knitr::opts_knit$set(root.dir = "E:/NSCC/Semester_1/GDAA1001_Fundamentals_of_GDA/Labs/Lab_6_7")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
## Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(likert)
## Warning: package 'likert' was built under R version 4.3.2
## Loading required package: xtable
## Warning: package 'xtable' was built under R version 4.3.2
##
## Attaching package: 'likert'
##
## The following object is masked from 'package:dplyr':
##
## recode
library(grid)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.3.2
library(dplyr)
world_data <- st_read("world.shp")
## Reading layer `world' from data source
## `E:\NSCC\Semester_1\GDAA1001_Fundamentals_of_GDA\Labs\Lab_6_7\world.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 108 features and 26 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -15679260 ymin: -6224955 xmax: 15351770 ymax: 9252194
## Projected CRS: World_Winkel_II
It doesn't hurt to do a quick check on the data that was just loaded into the session. The following two lines of code check the column names and the class of the data set.
names(world_data)
## [1] "ISO_3DIGIT" "CNTRY_NAME" "POP_CNTRY" "SQKM_CNTRY" "LANDLOCKED"
## [6] "RELIGION" "REGION" "CLIMATE" "POPULATN" "URBAN"
## [11] "LIFEEXPF" "LIFEEXPM" "LITERACY" "POP_INCR" "BABYMORT"
## [16] "GDP_CAP" "CALORIES" "AIDS" "BIRTH_RT" "DEATH_RT"
## [21] "AIDS_RT" "B_TO_D" "FERTILTY" "LIT_MALE" "LIT_FEMA"
## [26] "UNIQUE" "geometry"
class(world_data)
## [1] "sf" "data.frame"
Task 3 involves preparing the data for training and validation. This includes creating our target variable, removing problem variables (the geometry column), selecting predictors, and removing observations with missing data.
world_data_nogeom <- world_data %>%
st_drop_geometry()
First, the geometry column is removed so that it does not cause issues during the modelling steps later in the lab.
HLGDP_world_data_nogeom <- world_data_nogeom %>%
mutate(gdp_high_low = case_when(
GDP_CAP >= (median(GDP_CAP)) ~ "High",
GDP_CAP < (median(GDP_CAP)) ~ "Low"
))
Next, a new column was created for our target variable. We named it gdp_high_low and gave it two possible values: "High" or "Low". Any country with a GDP_CAP value greater than or equal to the median is labelled "High", and any country with a GDP_CAP value below the median is labelled "Low".
The challenging part of this task is selecting good-quality predictors for the model. Categorical variables were avoided to keep things simple, given that this lab introduces new concepts to me: region and climate were considered but were left out for the sake of simplicity. Calories and AIDS were left out because they appeared to be missing values for several countries; both would have made quality predictors, but preserving as many observations as possible in such a small data set made more sense. Birth-to-death ratio was selected instead of the individual birth and death rates, since it captures both while taking up only one predictor slot. Similarly, total literacy (the % of all people who read) was chosen instead of the separate male and female literacy percentages. This was not the ideal choice, since female literacy would likely be a very important predictor of GDP per capita; however, both the male and female literacy columns contain many 0 values, meaning a large number of countries would have had to be omitted from the model. Given the small number of countries to begin with, total literacy made more sense. Country names and the 3-digit country abbreviations were also left out, as they do not bring value to the model, and the UNIQUE column is simply a row identifier, so it was excluded as well.
This left 12 variables to choose from for the 10 predictors. Two of these are population variables, so total population (POP_CNTRY) was chosen for the model and POPULATN (population in thousands) was left out. The final choice was between total country area and the landlocked flag; to keep all of the predictors numeric, the landlocked variable was dropped.
world_data_sub <- HLGDP_world_data_nogeom %>%
select("SQKM_CNTRY","POP_CNTRY", "POP_INCR", "B_TO_D", "LIFEEXPF",
"LIFEEXPM", "URBAN", "LITERACY","FERTILTY", "BABYMORT", "gdp_high_low")
This code creates a new data set containing only the 10 selected predictor variables and the newly created target variable.
To finish off task 3, rows with missing values (recorded as 0) need to be removed.
world_data_filter <- world_data_sub %>%
filter("People Who Read (%)" > 0.0 | "Fertility: Average Number of Kids" > 0.0)
Four rows have a value of 0 for either LITERACY or FERTILTY. The filter above does not remove them: because the descriptive labels are quoted, they are compared as literal text rather than as columns, so every row passes. A filter written against the actual column names would likely have worked (see the sketch below); in the chunk after it, the four rows are removed by index instead.
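A minimal sketch of that alternative, assuming the placeholder zeros are confined to the LITERACY and FERTILTY columns (note that both conditions must hold, so they are combined with "and" rather than "or"):
# Keep only rows where both literacy and fertility have real (non-zero) values;
# this should leave the same 104 countries as the index-based removal below
world_data_filter <- world_data_sub %>%
  filter(LITERACY > 0, FERTILTY > 0)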
rows_to_remove <- c(12, 35, 67, 97)
world_data_filter <- world_data_sub[-rows_to_remove, ]
str(world_data_filter)
## 'data.frame': 104 obs. of 11 variables:
## $ SQKM_CNTRY : num 641869 85808 29872 2781013 7706142 ...
## $ POP_CNTRY : num 17250390 5487866 3377228 33796870 17827520 ...
## $ POP_INCR : num 2.8 1.4 1.4 1.3 1.4 0.2 2.4 0.2 2.7 0.2 ...
## $ B_TO_D : num 2.41 3.29 3.83 2.22 1.88 1.09 7.25 1.9 4 1.09 ...
## $ LIFEEXPF : int 44 75 75 75 80 79 74 78 66 79 ...
## $ LIFEEXPM : int 45 67 68 68 74 73 71 73 60 73 ...
## $ URBAN : int 18 54 68 86 85 58 83 45 25 96 ...
## $ LITERACY : int 29 98 98 95 100 99 77 99 72 99 ...
## $ FERTILTY : num 6.9 2.8 3.2 2.8 1.9 1.5 4 1.8 5.1 1.7 ...
## $ BABYMORT : num 168 35 27 25.6 7.3 6.7 25 20.3 39.3 7.2 ...
## $ gdp_high_low: chr "Low" "High" "High" "High" ...
world_data_filter$gdp_high_low <- as.factor(world_data_filter$gdp_high_low)
Let's run the same code as earlier to verify that gdp_high_low is no longer stored as a character.
str(world_data_filter)
## 'data.frame': 104 obs. of 11 variables:
## $ SQKM_CNTRY : num 641869 85808 29872 2781013 7706142 ...
## $ POP_CNTRY : num 17250390 5487866 3377228 33796870 17827520 ...
## $ POP_INCR : num 2.8 1.4 1.4 1.3 1.4 0.2 2.4 0.2 2.7 0.2 ...
## $ B_TO_D : num 2.41 3.29 3.83 2.22 1.88 1.09 7.25 1.9 4 1.09 ...
## $ LIFEEXPF : int 44 75 75 75 80 79 74 78 66 79 ...
## $ LIFEEXPM : int 45 67 68 68 74 73 71 73 60 73 ...
## $ URBAN : int 18 54 68 86 85 58 83 45 25 96 ...
## $ LITERACY : int 29 98 98 95 100 99 77 99 72 99 ...
## $ FERTILTY : num 6.9 2.8 3.2 2.8 1.9 1.5 4 1.8 5.1 1.7 ...
## $ BABYMORT : num 168 35 27 25.6 7.3 6.7 25 20.3 39.3 7.2 ...
## $ gdp_high_low: Factor w/ 2 levels "High","Low": 2 1 1 1 1 1 1 1 2 1 ...
Finally, we scale the predictor variables using a z-transformation.
world_data_scaled <- world_data_filter %>%
mutate(across(.cols = 1:10, ~ as.vector(scale(.)), .names = "scaled_{.col}"))
world_data_final <- world_data_scaled %>%
select(-c(1:10))
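As a quick sanity check on the scaling, the z-transformation is simply (x − mean) / sd, so a manually scaled column should match the scale() output and have a mean of roughly 0 and a standard deviation of 1. A minimal sketch using the LITERACY column:
# Manual z-transformation of one column: should match the scaled column and
# have mean ~0 and sd ~1
z_literacy <- (world_data_filter$LITERACY - mean(world_data_filter$LITERACY)) /
  sd(world_data_filter$LITERACY)
all.equal(z_literacy, world_data_scaled$scaled_LITERACY)
round(c(mean = mean(z_literacy), sd = sd(z_literacy)), 3)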
Task 3 is finished and the data has been prepped.
Task 4 involves creating a training and a validation data set from our now prepped data set.
set.seed(554)
inTraining <- createDataPartition(world_data_filter$gdp_high_low, p=0.75, list=FALSE)
training <- world_data_filter[inTraining,]
validation <- world_data_filter[-inTraining,]
The first thing noted after running this chunk is that the split is not exactly 75/25: training ends up with 79 observations instead of 78, and validation with 25 instead of 26. This is likely because createDataPartition samples within each class and rounds per class, so the overall total can drift by one from an exact 75% split.
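To confirm what happened, the sizes and class balance of each piece can be checked directly; a minimal sketch:
# Quick check on the split: row counts and class balance in each piece
nrow(training)
nrow(validation)
table(training$gdp_high_low)
table(validation$gdp_high_low)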
In task 5, the training parameters need to be set up so that each model uses 10-fold cross-validation to avoid overfitting. Accuracy will be the metric used to measure how successful each model is.
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
As seen in the chunk above, an object named "control" was created using the trainControl() function from the caret package, specifying that cross-validation (method = "cv") will be used, applied 10-fold (number = 10). The metric was then set to "Accuracy", which is how performance will be measured for each algorithm.
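For intuition, 10-fold cross-validation splits the training rows into 10 roughly equal, class-balanced folds; each fold is held out once while a model is fit on the other nine, and the 10 accuracy values are averaged. A minimal sketch of that fold structure, using caret's createFolds() (the same kind of split that train() builds internally for method = "cv"):
# Sketch: 10 class-balanced folds of the training data; each fold is held out once
set.seed(554)
folds <- createFolds(training$gdp_high_low, k = 10)
sapply(folds, length)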
Task 6 is where we train our model using the five appointed algorithms:
• Linear Discriminant Analysis
• Classification and Regression Trees
• k-Nearest Neighbors
• Support Vector Machines
• Logistic Regression
set.seed(554)
fit.lda <- train(gdp_high_low~., data=training, method="lda", metric=metric,
trControl=control)
predictions1 <- predict(fit.lda, validation)
predictions1
## [1] Low High High Low High Low High Low Low Low High High Low High Low
## [16] Low Low High Low Low High Low High High High
## Levels: High Low
The chunk above, which trains the model with the LDA algorithm and then runs it on the validation data, can be used as a template for the other four algorithms. The only things that change are the name of the fitted model and the method, the method being which algorithm the model is run with.
The set.seed() function is used so that the results can be reproduced if necessary. In the formula gdp_high_low ~ ., the gdp_high_low part says that this is the target variable the model is trying to predict, and the ~. part says that every other variable in the data acts as a predictor, i.e. a variable used to help the model predict either high or low GDP. The training data set is used to fit the model, with the metric object created earlier as the metric and the control object as the training control.
For the validation step, the fitted training model is passed to the predict() function along with the validation data.
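Since only the object name and the method string change between chunks, the same template could equally be run in a loop; a minimal sketch that mirrors the five separate chunks below (using the same caret method names):
# Sketch: fit all five algorithms with one loop over caret method names
methods <- c("lda", "rpart", "knn", "svmRadial", "glm")
fits <- lapply(methods, function(m) {
  set.seed(554)
  train(gdp_high_low ~ ., data = training, method = m,
        metric = metric, trControl = control)
})
names(fits) <- methods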
set.seed(554)
fit.cart <- train(gdp_high_low~., data=training, method="rpart", metric=metric,
trControl=control)
predictions2 <- predict(fit.cart, validation)
predictions2
## [1] Low High High High High Low High Low Low Low High High Low High Low
## [16] Low High Low Low Low High Low High High Low
## Levels: High Low
set.seed(554)
fit.knn <- train(gdp_high_low~., data=training, method="knn", metric=metric,
trControl=control)
predictions3 <- predict(fit.knn, validation)
predictions3
## [1] High Low Low High Low High Low High Low Low Low Low High Low Low
## [16] Low High Low High Low Low Low Low Low Low
## Levels: High Low
set.seed(554)
fit.svm <- train(gdp_high_low~., data=training, method="svmRadial", metric=metric,
trControl=control)
predictions4 <- predict(fit.svm, validation)
predictions4
## [1] Low Low High High High Low High Low Low Low High High Low High Low
## [16] Low Low Low Low Low High Low High High Low
## Levels: High Low
set.seed(554)
fit.glm <- train(gdp_high_low~., data=training, method="glm", metric=metric,
trControl=control)
predictions5 <- predict(fit.glm, validation)
predictions5
## [1] Low High High Low High Low High Low Low Low High High Low High Low
## [16] Low Low High Low Low High Low High High High
## Levels: High Low
For task 7, we want to evaluate the predictions made on the validation data set, using classification accuracy to measure how successful each model was.
cm1 <- confusionMatrix(predictions1, as.factor(validation$gdp_high_low))
cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 10 2
## Low 2 11
##
## Accuracy : 0.84
## 95% CI : (0.6392, 0.9546)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 0.0008956
##
## Kappa : 0.6795
##
## Mcnemar's Test P-Value : 1.0000000
##
## Sensitivity : 0.8333
## Specificity : 0.8462
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.8462
## Prevalence : 0.4800
## Detection Rate : 0.4000
## Detection Prevalence : 0.4800
## Balanced Accuracy : 0.8397
##
## 'Positive' Class : High
##
The confusionMatrix() function is used here with the predictions from the validation step and the observed classes to print out the confusion matrix and some summary statistics, including the accuracy value we are looking for.
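If only the accuracy is needed, it can also be pulled straight out of the confusionMatrix object rather than read off the printout; a minimal sketch:
# The overall element of a confusionMatrix object is a named vector that
# includes the accuracy and its confidence interval
cm1$overall["Accuracy"]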
cm2 <- confusionMatrix(predictions2, as.factor(validation$gdp_high_low))
cm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 12 0
## Low 0 13
##
## Accuracy : 1
## 95% CI : (0.8628, 1)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 7.945e-08
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00
## Specificity : 1.00
## Pos Pred Value : 1.00
## Neg Pred Value : 1.00
## Prevalence : 0.48
## Detection Rate : 0.48
## Detection Prevalence : 0.48
## Balanced Accuracy : 1.00
##
## 'Positive' Class : High
##
cm3 <- confusionMatrix(predictions3, as.factor(validation$gdp_high_low))
cm3
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 2 5
## Low 10 8
##
## Accuracy : 0.4
## 95% CI : (0.2113, 0.6133)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 0.9197
##
## Kappa : -0.2215
##
## Mcnemar's Test P-Value : 0.3017
##
## Sensitivity : 0.1667
## Specificity : 0.6154
## Pos Pred Value : 0.2857
## Neg Pred Value : 0.4444
## Prevalence : 0.4800
## Detection Rate : 0.0800
## Detection Prevalence : 0.2800
## Balanced Accuracy : 0.3910
##
## 'Positive' Class : High
##
cm4 <- confusionMatrix(predictions4, as.factor(validation$gdp_high_low))
cm4
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 10 0
## Low 2 13
##
## Accuracy : 0.92
## 95% CI : (0.7397, 0.9902)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 2.222e-05
##
## Kappa : 0.8387
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.8333
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.8667
## Prevalence : 0.4800
## Detection Rate : 0.4000
## Detection Prevalence : 0.4000
## Balanced Accuracy : 0.9167
##
## 'Positive' Class : High
##
cm5 <- confusionMatrix(predictions5, as.factor(validation$gdp_high_low))
cm5
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 10 2
## Low 2 11
##
## Accuracy : 0.84
## 95% CI : (0.6392, 0.9546)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 0.0008956
##
## Kappa : 0.6795
##
## Mcnemar's Test P-Value : 1.0000000
##
## Sensitivity : 0.8333
## Specificity : 0.8462
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.8462
## Prevalence : 0.4800
## Detection Rate : 0.4000
## Detection Prevalence : 0.4800
## Balanced Accuracy : 0.8397
##
## 'Positive' Class : High
##
Using the confusionMatrix() function shows the accuracy for each model. Accuracy can be used to rank the models and show which ones performed well and which performed poorly.
The model using the Classification and Regression Trees algorithm classified every validation country correctly, giving an accuracy of 1.00 on this split.
Next was the model using the Support Vector Machines algorithm, with an accuracy of 0.92.
The Linear Discriminant Analysis and Logistic Regression models tied for third, each with an accuracy of 0.84.
Finally, the model with k-Nearest Neighbors had an accuracy of only 0.40, which is actually below the no information rate of 0.52.
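To make the ranking easier to read, the five accuracies can also be collected into one sorted vector; a minimal sketch:
# Sketch: gather the validation accuracies from each confusionMatrix and sort them
accuracies <- sapply(
  list(LDA = cm1, CART = cm2, kNN = cm3, SVM = cm4, GLM = cm5),
  function(cm) unname(cm$overall["Accuracy"])
)
sort(accuracies, decreasing = TRUE)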
For task 8, a confusion matrix plot will be produced for a strong and a weak model. Here, the plots are for the model using the Support Vector Machines algorithm, the strongest performer apart from the CART model's perfect score, and the model using the k-Nearest Neighbors algorithm, the worst performer.
cm_svm <- as.data.frame(cm4$table)
cm_svm$diag <- cm_svm$Prediction == cm_svm$Reference
cm_svm$ndiag <- cm_svm$Prediction != cm_svm$Reference
cm_svm[cm_svm == 0] <- NA  # as in the kNN block below: zero counts and FALSE flags become NA so off-diagonal cells are coloured as negative
cm_svm$Reference <- reverse.levels(cm_svm$Reference)
cm_svm$ref_freq <- cm_svm$Freq * ifelse(is.na(cm_svm$diag),-1,1)
plt1 <- ggplot(data = cm_svm, aes(x = Prediction , y = Reference, fill = Freq))+
scale_x_discrete(position = "top") +
geom_tile( data = cm_svm,aes(fill = ref_freq)) +
scale_fill_gradient2(guide = FALSE ,low="red",high="green", midpoint = 0,na.value = 'white') +
geom_text(aes(label = Freq), color = 'black', size = 3)+
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
legend.position = "none",
panel.border = element_blank(),
plot.background = element_blank(),
axis.line = element_blank(),
)
plt1
Above is the confusion matrix plot for the Support Vector Machine algorithm.
cm_kNA <- as.data.frame(cm3$table)
cm_kNA$diag <- cm_kNA$Prediction == cm_kNA$Reference
cm_kNA$ndiag <- cm_kNA$Prediction != cm_kNA$Reference
cm_kNA[cm_kNA == 0] <- NA
cm_kNA$Reference <- reverse.levels(cm_kNA$Reference)
cm_kNA$ref_freq <- cm_kNA$Freq * ifelse(is.na(cm_kNA$diag),-1,1)
plt2 <- ggplot(data = cm_kNA, aes(x = Prediction , y = Reference, fill = Freq))+
scale_x_discrete(position = "top") +
geom_tile( data = cm_kNA,aes(fill = ref_freq)) +
scale_fill_gradient2(guide = FALSE ,low="red",high="green", midpoint = 0,na.value = 'white') +
geom_text(aes(label = Freq), color = 'black', size = 3)+
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
legend.position = "none",
panel.border = element_blank(),
plot.background = element_blank(),
axis.line = element_blank(),
)
plt2
Above is the confusion matrix plot for the k-Nearest Neighbor algorithm.
After reading your email about variable importance potentially not working for all of the models, I ran the code for all five to see what results would be produced. No errors occurred; however, looking closely at the lollipop plots, the three models you did not mention in your email (the ones you didn't think would work) all appear to show exactly the same results. This is likely because caret falls back on the same model-free, ROC-based importance measure for models with no built-in importance method, which would explain the identical plots. For that reason, I am leaving those three out of the code entirely and will only determine variable importance from the fit.cart and fit.glm models.
In task 9, we take a look at variable importance for the model. For task 3, ten predictors were chosen to train and validate the model. Now, the goal is to look a little closer at those predictors and figure out which variables are the most important when making the predictions.
# determining variable importance
importance1 <- varImp(fit.cart)
importance2 <- varImp(fit.glm)
imp1 <- importance1$importance
imp2 <- importance2$importance
p1 <- imp1 %>%
mutate(Predictor = rownames(imp1)) %>%
pivot_longer(names_to = "gdp_high_low", values_to = "Importance", -Predictor) %>%
ggplot(aes(x=Predictor, y=Importance))+
geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="skyblue") +
geom_point(color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank())+
ylab("Classification & Regression Tree")+
xlab("")
p1
p2 <- imp2 %>%
mutate(Predictor = rownames(imp2)) %>%
pivot_longer(names_to = "gdp_high_low", values_to = "Importance", -Predictor) %>%
ggplot(aes(x=Predictor, y=Importance))+
geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="skyblue") +
geom_point(color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank())+
ylab("Logistic Regression")+
xlab("")
p2
Below is the code to save the plots as an external png file, with the plots being stacked.
# create the plot
# note: ggarrange() has a 'widths' argument but no 'width'; because arguments after
# '...' are not partially matched, 'width = 6' is passed on as if it were a plot,
# which is most likely what triggers the warning below
plot_importance <- ggarrange(p1, p2, ncol = 1, heights = c(4, 4), width = 6) +
theme(text = element_text(size = 12)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(plot.margin = margin(2))
## Warning in as_grob.default(plot): Cannot convert object of class numeric into a
## grob.
# save the plot to a file
ggsave("plot_importance.png", plot_importance, height = 10, width = 8)
# open the saved image in a viewer
png_file <- "plot_importance.png"
browseURL(png_file)
In the code below, a bar plot is created showing the overall rank of all ten predictors, using variable importance from the CART and GLM models only. The predictors are given a relative score between 0 and 100 for their importance.
# Calculate the average importance score for each predictor variable across all models
average_importance <- rowMeans(cbind(imp1, imp2))
# Create a data frame for plotting
importance_df <- data.frame(Predictor = names(average_importance), Importance = average_importance)
# Sort the data frame by importance score in descending order
importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]
# Create the plot
library(ggplot2)
ggplot(importance_df, aes(x = reorder(Predictor, -Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "pink") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Predictor", y = "Average Importance Score") +
ggtitle("Overall Predictor Importance Across All Models")
From the plot, the most important variables are population increase per year, baby mortality and male life expectancy. The least important variables were literacy and fertility. Before checking for variable importance, I would have predicted total population and area to be the least important and literacy to be one of the most important. However, as I will mention in the summary/closing thoughts section, the data set was definitely limiting us in this lab. It also does not help that only two out of five models were able to provide data on variable importance.
I spent a good chunk of time tinkering with this lab, looking at the data, etc. I believe that the data set we worked with was quite limiting in a couple of ways. Given that there were only 108 countries within the data to begin with, I based my predictor selection on limiting the number of countries that needed to be removed. A good chunk of the variables for this data set had many null or 0 values. Variables such as AIDS, Calories, male & female literacy all would have made for good predictors in my opinion. If I had selected them, some 30-40 countries would have needed to be removed, which would have been too many. There were also some categorical variables like climate and region that could have made for interesting predictors but given this was my first time tackling a lab like this, and taking into account your warning in the lab instructions about the categorical variables, I thought they needed to be left out.
The next odd hitch I found when going through this lab was dividing up the training and validation data. The split did not come out as an exact 75/25 even though the numbers divide cleanly (104 countries made it past the cut, so 75% should give 78 for training and leave 26 for validation). As noted in task 4, this appears to come down to createDataPartition rounding within each class rather than across the whole data set.
After fixing my errors with the CART algorithm, I did manage to get all five models to run. However, one thing I noticed when printing the confusion matrices to get the accuracy values is that my results varied quite a bit depending on which set.seed was used. For example, with the k-Nearest Neighbors algorithm, I saw accuracy values ranging from 0.44 to 0.68. The LDA algorithm produced accuracy values ranging from 0.74 to 0.96. The other three algorithms weren't quite as varied but still produced accuracy values between 0.78 and 0.94. I believe the size of the data set is the culprit for this large variance in accuracy.
For variable importance, since only two of the five models gave us model-specific data to calculate it, I'm not putting a whole lot of stock into it. I'd be curious to understand more about why only CART and GLM produce their own variable importance values while the others fall back on the same generic measure.
Finally, I'd like to briefly talk about what I would redo if I started from scratch, mostly concerning the predictor selection. I'm not sure that excluding variables because of those false zero values was the right call, especially since I believe some of them would have made excellent predictors: AIDS cases, daily calorie intake and female literacy should probably all be in the model. Setting the false-zero variables aside, I could also have created a population density variable by dividing the total population values by the country area values (sketched below), which would have opened up a predictor slot for the landlocked variable. Thinking back, whether a country is landlocked or not could have a large impact on GDP per capita, since having a coastline can unlock numerous economic options.
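For completeness, a minimal sketch of that population-density idea (a hypothetical derived predictor, not run as part of this lab):
# Hypothetical derived predictor: people per square kilometre
world_data_density <- world_data_nogeom %>%
  mutate(POP_DENSITY = POP_CNTRY / SQKM_CNTRY)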