options(kableExtra.auto_format = FALSE)
library(tidyverse)
library(skimr)
library(dlookr)
library(class) #for knn()
library(flextable)
library(fastDummies)
library(factoextra)
library(cluster)
library(gridExtra)



##making theme
mpeach <- "#fbaa82"
mteal <- "#73a2ac"
mdarkteal <- "#0b5d69"
mgray <- "#4c4c4c"

# set plot theme for assignment
my_plot_theme <- list(
  theme_classic() +
  theme(plot.background = element_rect(fill = "#F3F2E8"),
        panel.background = element_rect(fill = "#F3F2E8"),
        panel.grid.major.x = element_line(color = "white"),
        axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        text = element_text(size = 20)))

Assignment Prompt

You get to decide which dataset you want to work on. The data set must be different from the ones used in previous homeworks You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portals. Select one of the methodologies studied in weeks 1-10, and one methodology from weeks 11-15 to apply in the new dataset selected. To complete this task:. - describe the problem you are trying to solve. - describe your datases and what you did to prepare the data for analysis. - methodologies you used for analyzing the data - what’s the purpose of the analysis performed - make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.

Introduction

In this assignment I choose two separate datasets to demonstrate two different modelling techniques from the required sections of this course.

First, I use a poisonous vs not poison mushrooms classification dataset downloaded from Kaggle to practice KNN modeling. The theoretical problem in this scenario is knowing how to identify if a mushrooms is poisonous or not based on the 22 predictive variables. As those predictive variables are all categorical (with one logical) choosing the right model was a challenge but the KNN model did achieve 100% accuracy, which I read on the dataset page is common due to nature’s predictability when it comes to mushrooms! Dataset can be obtained here: https://www.kaggle.com/datasets/uciml/mushroom-classification?resource=download&select=mushrooms.csv

Second, I use a dataset our course textbook uses on mall customers to perform a clustering model. The problem to solve here is how to group customers so you can potential market/target them and increase sales. I walk through the process of determining the correct number of clusters and then visualize the model’s grouping of the data. Dataset can be obtained here: https://www.wiley.com/en-us/Practical+Machine+Learning+in+R-p-9781119591535#downloads-section

KNN Modeling

I read in the data, set the appropriate class types, and look at a summary of the data. As this dataset has already been prepped for classification machine learning projects, there is no missing data. Further, since there is no numeric data I don’t have to worry about extreme outliers or non-normal distributions.

# load in data with default class type of 'factor', except for bruises which is a logical 
mushrooms <- read_csv("mushrooms.csv", col_types = cols(.default = "f", bruises = "l"))

#produce table summary  
skim(mushrooms)
Data summary
Name mushrooms
Number of rows 8124
Number of columns 23
_______________________
Column type frequency:
factor 22
logical 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
class 0 1 FALSE 2 e: 4208, p: 3916
cap-shape 0 1 FALSE 6 x: 3656, f: 3152, k: 828, b: 452
cap-surface 0 1 FALSE 4 y: 3244, s: 2556, f: 2320, g: 4
cap-color 0 1 FALSE 10 n: 2284, g: 1840, e: 1500, y: 1072
odor 0 1 FALSE 9 n: 3528, f: 2160, y: 576, s: 576
gill-attachment 0 1 FALSE 2 f: 7914, a: 210
gill-spacing 0 1 FALSE 2 c: 6812, w: 1312
gill-size 0 1 FALSE 2 b: 5612, n: 2512
gill-color 0 1 FALSE 12 b: 1728, p: 1492, w: 1202, n: 1048
stalk-shape 0 1 FALSE 2 t: 4608, e: 3516
stalk-root 0 1 FALSE 5 b: 3776, ?: 2480, e: 1120, c: 556
stalk-surface-above-ring 0 1 FALSE 4 s: 5176, k: 2372, f: 552, y: 24
stalk-surface-below-ring 0 1 FALSE 4 s: 4936, k: 2304, f: 600, y: 284
stalk-color-above-ring 0 1 FALSE 9 w: 4464, p: 1872, g: 576, n: 448
stalk-color-below-ring 0 1 FALSE 9 w: 4384, p: 1872, g: 576, n: 512
veil-type 0 1 FALSE 1 p: 8124
veil-color 0 1 FALSE 4 w: 7924, n: 96, o: 96, y: 8
ring-number 0 1 FALSE 3 o: 7488, t: 600, n: 36
ring-type 0 1 FALSE 5 p: 3968, e: 2776, l: 1296, f: 48
spore-print-color 0 1 FALSE 9 w: 2388, n: 1968, k: 1872, h: 1632
population 0 1 FALSE 6 v: 4040, y: 1712, s: 1248, n: 400
habitat 0 1 FALSE 7 d: 3148, g: 2148, p: 1144, l: 832

Variable type: logical

skim_variable n_missing complete_rate mean count
bruises 0 1 0.42 FAL: 4748, TRU: 3376

Looking at the histograms for each variable we see the distribution of the levels within each variable. For our target variable class it appears there is a roughly equal amount in each level. A few variables have near-zero variance, such as gill-attachment, ring-number, and viel-number.

mushrooms %>%
  gather() %>% 
  ggplot(aes(value)) +
  geom_histogram(stat = "count", fill = "#73a2ac") +
  facet_wrap(~ key, scales = "free", ncol = 4) +
  labs(title = "Checking Distribution of Predictor Variables") + 
  my_plot_theme

Next I split the data into a train and test set, and prepare my number of cross validation folds at the standard of 10.

mushrooms_dummies <- dummy_cols(mushrooms,  select_columns = c('cap-shape', 'cap-surface', 'cap-color', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat'), 
           remove_selected_columns = TRUE)


set.seed(2911)


sample_index <- sample(nrow(mushrooms_dummies), round(nrow(mushrooms_dummies)*.75), replace = FALSE)

# create train and test sets from split
train <- mushrooms_dummies[sample_index, ]
test <- mushrooms_dummies[-sample_index, ]

Now that the data is prepared I build the model.

#for KNN function later
train_labels <- as.factor(pull(train, class))
test_labels <- as.factor(pull(test, class))

# remove target variable
train <- data.frame(select(train, -class))
test <- data.frame(select(test, -class))



knn_pred <- knn(
  train = train,
  test = test,
  cl = train_labels,
  k = 5)

It appears my model achieved 100% accuracy. In reading discussion on Kaggle about this dataset that is common, as mushrooms do follow a very clear pattern when they are poisonous or not. Normally, such a high accuracy rate would be troubling, but as this is a model of a constant, predictable relationship in the natural world it is expected. It seems safe to conclude if you have information and knowledge on identifying the ‘factor’ level of all 22 predictor variables in the wild, you should be able to determine without a doubt if a mushroom is poisonous or not.

mush_pred_table <- table(test_labels, knn_pred)
mush_pred_table
##            knn_pred
## test_labels    p    e
##           p  970    0
##           e    0 1061

Clustering

Unfortunately the categorical-variable only mushroom dataset I choose doesn’t work well with the models covered in the later half of our semester as required, so I’m switching to the dataset our course textbook used to walk through a Clustering example, mallcustomers, so I can follow along with the code.

First I import, prep, and review the data.

mallcustomers <- read_csv("mallcustomers.csv")

#change class
mallcustomers <- mallcustomers %>%
  mutate(Income = str_replace_all(Income," USD","")) %>%
  mutate(Income = str_replace_all(Income,",","")) %>%
  mutate(Income = as.numeric(Income))

#remove variarbles I don't need
mallcustomers <- mallcustomers %>%
  select(-CustomerID, -Gender, -Age) %>%
  scale()

#produce table summary  
skim(mallcustomers)
Data summary
Name mallcustomers
Number of rows 200
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Income 0 1 0 1 -1.73 -0.73 0.04 0.66 2.91 ▆▇▇▂▁
SpendingScore 0 1 0 1 -1.91 -0.60 -0.01 0.88 1.89 ▃▃▇▃▃

A quick view of the variables we’ll be using shows the distribution of the income and spending score.

mallcustomers_df <- as.data.frame(mallcustomers)

mallcustomers_df %>%
  gather() %>% 
  ggplot(aes(value)) +
  geom_histogram(fill = "#73a2ac") +
  facet_wrap(~ key, scales = "free", ncol = 4) +
  labs(title = "Checking Distribution of Predictor Variables") + 
  my_plot_theme

Next I use the recommended methods to determine what a good k value is for this dataset.

# Elbow Method
p1 <- fviz_nbclust(mallcustomers, kmeans, method = "wss") + geom_point(
  shape = 1,
  x = 6,
  y = 60,
  colour = "red",
  size = 8,
  stroke = 1.5
) + ggtitle("Elbow Method")

# Silhouette Method
p2 <- fviz_nbclust(mallcustomers, kmeans, method = "silhouette") + geom_point(
  shape = 1,
  x = 6,
  y = 0.53,
  colour = "red",
  size = 8,
  stroke = 1.5
) + ggtitle("Silhouette Method")

# Gap Statistic
p3 <- fviz_nbclust(mallcustomers, kmeans, method = "gap_stat") + geom_point(
  shape = 1,
  x = 6,
  y = 0.57,
  colour = "red",
  size = 8,
  stroke = 1.5
) + ggtitle("Gap Statistic")

grid.arrange(p1, p2, p3, nrow = 3)

The graphs above make it clear the best k/number of clusters is going to be 6, so I proceed with building the model. The graphic below nicely shows the 6 clusters and their boundaries, and we can see a tangible way a business could segment and market to different groups based on their spending score and income. For example, those in the bottom-right teal cluster have high income but a low spending score - suggesting there is potential to market to them and increase their spending.

set.seed(2911)
k_clust <- kmeans(mallcustomers, centers = 6, nstart = 25)

fviz_cluster(
  k_clust,
  data = mallcustomers,
  main = "Mall Customers Segmented by Income and Spending Score",
  repel = TRUE
) + theme(text = element_text(size = 14))