Lab 6/7 - Building a Machine Learning Model for Classification

Introduction

This lab will demonstrate how to build a model that predicts national GPD per capita values using UN demographic data from 1995 which includes 108 countries. The model will also predict if the country will have a high or low GDP per capita value based on various predictors from the UN dataset.

Setup

Let’s begin by loading the required libraries.

library(tidyverse)
library(sf)
library(plotly)
library(caret)
library(likert)
library(grid)
library(gridExtra)
library(ggpubr)

1. Download and unzip the data

Download the “World” folder from Brightspace and unzip the contents into the working directory. Let’s set our working directory for this project.

2. Load the data in your R session

Use the st_read() function to load the data. We will set “world” as a table that will contain data from the “world.shp” which will be loaded to our environment.

data <- st_read("world.shp")

3. Prepare the data

We will create a new column called GDP_High_Low. Countries with GDP_CAP values greater than or equal to the median of said variable will be assigned as ‘High’ while countries less than the median will be assigned as ‘Low’ variables. Additionally, GDP_High_Low will be the target variable. Whereby the model will try to predict if a country will have high or low GDP_CAP values.

data$GDP_High_Low <- as.factor(ifelse(data$GDP_CAP >= median(data$GDP_CAP), 'High', 'Low'))

Let’s drop the geometry column.

world <- data %>%
  st_drop_geometry()

I want to add a new predictor based on density - the population of the country divided by the total area of the country.

world <- world %>%
  mutate(`Density` = (`POP_CNTRY`/`SQKM_CNTRY`))

The predictors I chose for the model are baby mortality, population increase, country area in square kilometers, density, percentage of population who live in urban areas, female and male life expectancy, literacy, birth to death ration, and fertility. I chose baby mortality because it is likely that high GDP would result in lower infant mortality and I am curious to see if the model would agree with my hypothesis. Birth to death ratio is similar in this regard without specifying the cause of death. If the cause of death is due to old age or natural causes, that would not necessarily relate back to infant mortality. The life expectancy would give us better information with regards to death. I would hypothesize that the longer life expectancy, the relationship with a high GDP is also stronger. Of course this would not take into account ‘Blue Zones’ where centenarians are more common. In a similar vain, I also chose fertility as the number of children per household often has a strong relationship with a country’s economic health. These five predictors affect a country’s population increase in various ways that’s why I chose this as a predictor. I was also curious about the density and the percentage of the population that live in urban areas. a high population density does not necessarily translate to a high urban population and I wanted to be able to capture the nuance between the two statistics. Finally, I chose literacy because hypothetically, a high literacy rate should relate to a high GDP. However, I know generalizing is a dangerous habit. This is why having a good sample size with various predictors are important.

world_sub <- world %>%
  select("BABYMORT", "POP_INCR", "Density", "URBAN", "LIFEEXPF", "LIFEEXPM", "B_TO_D", 
         "FERTILTY", "LIT_FEMA", "LIT_MALE", "GDP_High_Low")

We will now give the column names aliases.

names(world_sub) <- c("BabyMort", "PopIncrease", "Density", "Urban", "LifeExpectancyF", 
                      "LifeExpectancyM", "LiteracyF", "LiteracyM",  "BirthDeathRatio", 
                      "Fertility", "GDP")
world_sub <- world_sub %>% drop_na()

We will then rescale the data in order to mitigate outlier impacts to the overall representation.

world_z <- world_sub %>%
  mutate(across(.cols = 1:10, ~ as.vector(scale(.)), .names = "scaled_{.col}")) %>%
  select(c(11, 12:21))

Let’s investigate if there is an equal distribution between ‘high’ and ‘low’ GDPs.

tapply(world_z$GDP, world_z$GDP, length)

## High  Low 
##   54   54

We will now select a sample size of 50 of each of the ‘high’ and ‘low’ GDP values.

set.seed(321)
df1 <- world_z %>% 
  filter(GDP == "High") %>% 
  sample_n(50)

df2 <- world_z %>% 
  filter(GDP == "Low") %>% 
  sample_n(50)

df <- bind_rows(df1, df2, .id = NULL)

4. Create training and validation datasets

We will now create training and validation datasets using a 75/25 training/validation split. This means the dataframe we are sampling from the previous step will be split into two categories for model testing. The training portion, where the model will analyse 75% of the sample, and attempt to predict(validate) whether or not the remainder of the 25% of the sample has been correctly validated by the given model.

set.seed(321)
inTraining <- createDataPartition(df$GDP, p=0.75, list=FALSE)
training <- df[inTraining,]
validation <- df[-inTraining,]

5. Setup model parameters

We will set the parameter to “Accuracy” and ensure the model uses a 10 fold cross-validation to avoid over fitting.

control <- trainControl(method="cv", number=10)
metric <- "Accuracy"

6. Train your model

This is for the Linear Discriminant Analysis (LDA).

set.seed(321)
fit.lda <- train(GDP~., data=training, method="lda", metric=metric, trControl=control)
predictions <- predict(fit.lda, validation)
predictions

cm <- confusionMatrix(predictions, as.factor(validation$GDP))
cm

This is for the Classification and Regression Trees (cart).

set.seed(321)
fit.cart <- train(GDP~., data=training, method="rpart", metric=metric, trControl=control)
predictions <- predict(fit.cart, validation)

cm1 <- confusionMatrix(predictions, as.factor(validation$GDP))
cm1

This is for the k-Nearest Neighbors (knn).

set.seed(321)
fit.knn <- train(GDP~., data=training, method="knn", metric=metric, trControl=control)
predictions <- predict(fit.knn, validation)

cm2 <- confusionMatrix(predictions, as.factor(validation$GDP))
cm2

This is for the Support Vector Machines (SVM).

set.seed(321)
fit.svm <- train(GDP~., data=training, method="svmRadial", metric=metric, trControl=control)
predictions <- predict(fit.svm, validation)

cm3 <- confusionMatrix(predictions, as.factor(validation$GDP))
cm3

And lastly, this is for the Logistic Regression (glm).

set.seed(321)
fit.glm <- train(GDP~., data=training, method="glm", metric=metric, 
                 trControl=control)
predictions <- predict(fit.glm, validation)


cm4 <- confusionMatrix(predictions, as.factor(validation$GDP))
cm4

7. Evaluate your model

We will now summarize the mean accuracy for each model using classification accuracy to describe the best performing models.

set.seed(321)
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, 
                          glm=fit.glm))
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: lda, cart, knn, svm, glm 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## lda  0.6250000 0.7142857 0.7500000 0.8035714 0.9642857    1    0
## cart 0.7142857 0.7500000 0.8571429 0.8303571 0.8750000    1    0
## knn  0.6250000 0.7232143 0.7500000 0.8178571 0.9687500    1    0
## svm  0.6250000 0.7232143 0.8125000 0.8303571 0.9687500    1    0
## glm  0.6250000 0.7142857 0.8035714 0.8142857 0.9642857    1    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean 3rd Qu. Max. NA's
## lda  0.2500000 0.4166667 0.5000000 0.6053333  0.9300    1    0
## cart 0.4166667 0.5000000 0.7078261 0.6582319  0.7500    1    0
## knn  0.2500000 0.4375000 0.5000000 0.6280303  0.9375    1    0
## svm  0.2500000 0.4375000 0.6250000 0.6530303  0.9375    1    0
## glm  0.2500000 0.4166667 0.5978261 0.6248986  0.9300    1    0

results_df <- as.data.frame(results)

results_tidy <- results_df %>% 
  pivot_longer(names_to = "Model", values_to = "Accuracy", -Resample) %>% 
  group_by(Model) %>% 
  summarise(Mean_Accuracy = mean(Accuracy))

mean_acc <- results_tidy %>% 
  ggplot(aes(x=fct_reorder(Model, Mean_Accuracy), y=Mean_Accuracy))+
  geom_bar(stat = "identity")+
  coord_flip()+
  xlab("Mean Accuracy")+
  ylab("Model")+
  theme(text = element_text(size = 20))

mean_acc

According to the summary, the svm and cart models performed the best with a mean accuracy of 0.8303571, or 83%. This result is interesting because if we look at the individual models and their resulting accuracies, cart and knn shows the best accuracies at 87.5%, albeit different matrix compositions for the resulting accuracy. This may be due the the smaller sample size where I only used 100 out of the 108 occurrences. Additionally, the summary shows lda as the least accurate at 80.4% where in individual model performances, glm performed the worst at 66.7% accuracy. Again, this could be due to the smaller sample size.

8. Confusion matrices

I will now include two confusion matrices, the best performing and the worst performing. Note: I will be displaying the models that were outlined in the summary. I will be using the cart model for the best model in this instance.

cm1_d <- as.data.frame(cm1$table)
cm1_d$diag <- cm1_d$Prediction == cm1_d$Reference 
cm1_d$ndiag <- cm1_d$Prediction != cm1_d$Reference   
cm1_d[cm1_d == 0] <- NA 
cm1_d$Reference <-  reverse.levels(cm1_d$Reference) 
cm1_d$ref_freq <- cm1_d$Freq * ifelse(is.na(cm1_d$diag),-1,1)

plt1 <-  ggplot(data = cm1_d, aes(x = Prediction , y =  Reference, fill = Freq))+
  scale_x_discrete(position = "top") +
  geom_tile( data = cm1_d,aes(fill = ref_freq)) +
  scale_fill_gradient2(guide = FALSE ,low="turquoise4",high="orchid4", midpoint = 0,
                       na.value = 'white') +
  geom_text(aes(label = Freq), color = 'black', size = 3)+
  theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        legend.position = "none",
        panel.border = element_blank(),
        plot.background = element_blank(),
        axis.line = element_blank(),
  )
plt1

I will be using the lda model as the worst.

cm_d <- as.data.frame(cm$table)
cm_d$diag <- cm_d$Prediction == cm_d$Reference 
cm_d$ndiag <- cm_d$Prediction != cm_d$Reference      
cm_d[cm_d == 0] <- NA 
cm_d$Reference <-  reverse.levels(cm_d$Reference) 
cm_d$ref_freq <- cm_d$Freq * ifelse(is.na(cm_d$diag),-1,1)

plt2 <-  ggplot(data = cm_d, aes(x = Prediction , y =  Reference, fill = Freq))+
  scale_x_discrete(position = "top") +
  geom_tile( data = cm_d,aes(fill = ref_freq)) +
  scale_fill_gradient2(guide = FALSE ,low="turquoise4",high="orchid4", midpoint = 0,
                       na.value = 'white') +
  geom_text(aes(label = Freq), color = 'black', size = 3)+
  theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        legend.position = "none",
        panel.border = element_blank(),
        plot.background = element_blank(),
        axis.line = element_blank(),
  )
plt2

9. Predictor importance

I will only plot the lda, knn, svm, and glm model importance as half of the cart model predictors were in the 0% range importance. We will create a lollipop chart showing the predictor importance for each model using the varImp() function.

importance1 <- varImp(fit.lda)
importance2 <- varImp(fit.knn)
importance3 <- varImp(fit.svm)
importance4 <- varImp(fit.glm)

imp1 <- importance1$importance 
imp2 <- importance2$importance
imp3 <- importance3$importance
imp4 <- importance4$importance


p1 <- imp1 %>% 
  mutate(Predictor = rownames(imp1)) %>% 
  pivot_longer(names_to = "GDP", values_to = "Importance", -Predictor) %>%
  ggplot(aes(x=Predictor, y=Importance))+
  geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="lightgreen") +
  geom_point(color="darkgreen", size=4, alpha=0.6) +
  theme_light() +
  coord_flip() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.border = element_blank(),
    axis.ticks.y = element_blank())+
  ylab("Linear Discriminant Analysis")+
  xlab("")

p2 <- imp2 %>% 
  mutate(Predictor = rownames(imp2)) %>% 
  pivot_longer(names_to = "GDP", values_to = "Importance", -Predictor) %>%
  ggplot(aes(x=Predictor, y=Importance))+
  geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="lightgreen") +
  geom_point(color="darkgreen", size=4, alpha=0.6) +
  theme_light() +
  coord_flip() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.border = element_blank(),
    axis.ticks.y = element_blank())+
  ylab("k-Nearest Neighbors")+
  xlab("")

p3 <- imp3 %>% 
  mutate(Predictor = rownames(imp3)) %>% 
  pivot_longer(names_to = "GDP", values_to = "Importance", -Predictor) %>%
  ggplot(aes(x=Predictor, y=Importance))+
  geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="lightgreen") +
  geom_point(color="darkgreen", size=4, alpha=0.6) +
  theme_light() +
  coord_flip() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.border = element_blank(),
    axis.ticks.y = element_blank())+
  ylab("Support Vector Machine")+
  xlab("")

p4 <- imp4 %>% 
  mutate(Predictor = rownames(imp4)) %>% 
  pivot_longer(names_to = "GDP", values_to = "Importance", -Predictor) %>%
  ggplot(aes(x=Predictor, y=Importance))+
  geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="lightgreen") +
  geom_point(color="darkgreen", size=4, alpha=0.6) +
  theme_light() +
  coord_flip() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.border = element_blank(),
    axis.ticks.y = element_blank())+
  ylab("Logistic Regression")+
  xlab("")

plot_importance <- ggarrange(p1, p2, p3, p4, ncol = 1, heights = c(4, 4, 4, 4), width = 6) +
  theme(text = element_text(size = 12)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(plot.margin = margin(2, 2, 2))

ggsave("plot_importance.png", plot_importance, height = 10, width = 8)

Based on the lollipop chart, the most consistent with their degree of predictor importance are the scaled female and male life expectancy. Where female life expectancy has a very high importance across the four models. Urban population and infant mortality also prove to be of high importance. However, these two predictors are not consistent within the four models.

Let’s create a bar plot to show the overall rank of the predictors.

average_importance <- rowMeans(cbind(imp1, imp2, imp3, imp4))

importance_df <- data.frame(Predictor = names(average_importance), Importance = average_importance)

importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]

library(ggplot2)

ggplot(importance_df, aes(x = reorder(Predictor, -Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "pink", color = "red3") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Predictor", y = "Average Importance Score") +
  ggtitle("Overall Predictor Importance Across All Models")

Overall, life expectancy for female and males are the highest predictors for GDP values. This is very interesting because generally speaking, the female presenting population tend to have a longer life expectancy globally. This gives rise to a few assumptions on my end, as someone analyzing this data. For one, this tells me that the general trend of low to high life expectancy in female presenting population is heavily related to the predictor of a country’s economic health and trend. This also tells me that perhaps this is a better indicator than male presenting population due to some sex tied differences - namely child birth. If the average life expectancy for female presenting population is high, I can assume child birth mortality is low. Of course this is a far stretch, however, it is also not improbable. This also tells me that the state of the healthcare system, rate of literacy, and socioeconomic health can be tied back to how a country’s economic health. Density and fertility ranked the lowest for predictor importance. This is an interesting finding for me because as a society, we are often faced with the narrative that sexual education and family planning, or more accurately, the lack there of, is indicative of lack of education and lack of resources, which could arguably be tied to a country’s economic health. However, seeing this graph makes me realize that there are several ‘developed’ countries with high GDP’s that politically restricts or sensors access to information. Once again, these assumptions can be far stretched. This tells me that data is data. Anyone can use data to tell a narrative that is biased, or they can attempt to be as unbiased as possible. At the end of it all, data gives us information, it is up to us to choose the appropriate data to convey the information.