Goal

You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homeworks. You can work on a problem from your job or on something you are interested in. You may also obtain a dataset from sites such as Kaggle, data.gov, the Census Bureau, USGS, or other open data portals.

Select one of the methodologies studied in weeks 1-10 and one methodology from weeks 11-15 to apply to the newly selected dataset.

To complete this task:

Describe the problem you are trying to solve.

Describe your dataset and what you did to prepare the data for analysis.

Describe the methodologies you used for analyzing the data.

Explain the purpose of the analysis performed.

Draw your conclusions from the analysis. Please be sure to address the business impact (it could be in any domain) of your solution.

Your final deliverable can be the traditional R or Python file and essay, or it can be an oral presentation with the execution and explanation of your code, recorded on any platform of your choice (YouTube, Free Cam). If you select the presentation, it should be a 5 to 8 minute recording.

Problems to solve

For the first scenario (KNN), I used the Pima Indians Diabetes Dataset from Kaggle. The dataset can be accessed using the link below.

https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

The objective of the dataset is to predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. The dataset consists of 8 predictor variables and one target variable, Outcome.

For the second case (clustering), I decided to use the mall customers dataset. It can be downloaded from the link below.

https://www.wiley.com/en-kr/Practical+Machine+Learning+in+R-p-9781119591511#downloads-section

The goal here is to segment customers based on Income and SpendingScore.

Loading the Datasets and Exploration

The diabetes dataset is loaded into df1 and the mall customers dataset is loaded into df2 using read.csv().

library(skimr)
## Warning: package 'skimr' was built under R version 4.0.5
library(ggplot2)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v tibble  3.1.8      v dplyr   1.0.10
## v tidyr   1.2.0      v stringr 1.4.0 
## v readr   2.1.2      v forcats 0.5.2 
## v purrr   0.3.4
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
df1 <- read.csv("diabetes.csv")
head(df1)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0
skim(df1)
Data summary
Name: df1
Number of rows: 768
Number of columns: 9
Column type frequency: numeric (9)
Group variables: None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Pregnancies 0 1 3.85 3.37 0.00 1.00 3.00 6.00 17.00 ▇▃▂▁▁
Glucose 0 1 120.89 31.97 0.00 99.00 117.00 140.25 199.00 ▁▁▇▆▂
BloodPressure 0 1 69.11 19.36 0.00 62.00 72.00 80.00 122.00 ▁▁▇▇▁
SkinThickness 0 1 20.54 15.95 0.00 0.00 23.00 32.00 99.00 ▇▇▂▁▁
Insulin 0 1 79.80 115.24 0.00 0.00 30.50 127.25 846.00 ▇▁▁▁▁
BMI 0 1 31.99 7.88 0.00 27.30 32.00 36.60 67.10 ▁▃▇▂▁
DiabetesPedigreeFunction 0 1 0.47 0.33 0.08 0.24 0.37 0.63 2.42 ▇▃▁▁▁
Age 0 1 33.24 11.76 21.00 24.00 29.00 41.00 81.00 ▇▃▁▁▁
Outcome 0 1 0.35 0.48 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
df1 %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(~ key, scales = "free", ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

df2 <- read.csv("mallcustomers.csv")
head(df2)
##   CustomerID Gender Age     Income SpendingScore
## 1          1   Male  19 15,000 USD            39
## 2          2   Male  21 15,000 USD            81
## 3          3 Female  20 16,000 USD             6
## 4          4 Female  23 16,000 USD            77
## 5          5 Female  31 17,000 USD            40
## 6          6 Female  22 17,000 USD            76
skim(df2)
Data summary
Name: df2
Number of rows: 200
Number of columns: 5
Column type frequency: character (2), numeric (3)
Group variables: None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Gender 0 1 4 6 0 2 0
Income 0 1 10 11 0 64 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CustomerID 0 1 100.50 57.88 1 50.75 100.5 150.25 200 ▇▇▇▇▇
Age 0 1 38.85 13.97 18 28.75 36.0 49.00 70 ▆▇▅▃▂
SpendingScore 0 1 50.20 25.82 1 34.75 50.0 73.00 99 ▃▃▇▃▃

1. KNN Algorithm

We begin with KNN modeling. From the skim output it is clear that all the variables are numeric and there are 768 records in this dataset. There are no missing values, so we can proceed with splitting the dataset into train and test sets and then removing the target variable from each set.

set.seed(222)

# 75/25 train/test split
i <- sample(nrow(df1), round(nrow(df1) * 0.75), replace = FALSE)

df1_train <- df1[i, ]
df1_test  <- df1[-i, ]

# Pull out the class labels before dropping Outcome from the feature sets
x <- as.factor(pull(df1_train, Outcome))  # training labels
y <- as.factor(pull(df1_test, Outcome))   # test labels

df1_train <- data.frame(select(df1_train, -Outcome))
df1_test <- data.frame(select(df1_test, -Outcome))

Now let's proceed with the KNN modeling.

library(class)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# Fit KNN with k = 5, using the training labels as the class argument
knnmodel <- knn(df1_train, df1_test, cl = x, k = 5)

Below is the confusionMatrix() output for this model. The accuracy is approximately 71%: not perfect, but a reasonable baseline.

confusionMatrix(table(y, knnmodel))
## Confusion Matrix and Statistics
## 
##    knnmodel
## y    0  1
##   0 99 24
##   1 32 37
##                                           
##                Accuracy : 0.7083          
##                  95% CI : (0.6386, 0.7716)
##     No Information Rate : 0.6823          
##     P-Value [Acc > NIR] : 0.2443          
##                                           
##                   Kappa : 0.35            
##                                           
##  Mcnemar's Test P-Value : 0.3496          
##                                           
##             Sensitivity : 0.7557          
##             Specificity : 0.6066          
##          Pos Pred Value : 0.8049          
##          Neg Pred Value : 0.5362          
##              Prevalence : 0.6823          
##          Detection Rate : 0.5156          
##    Detection Prevalence : 0.6406          
##       Balanced Accuracy : 0.6811          
##                                           
##        'Positive' Class : 0               
## 
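
Since k is the key tuning parameter of KNN, and the algorithm is distance-based, two refinements are worth sketching: standardizing the predictors and scanning several candidate values of k. The following is a minimal, hedged sketch that reuses df1_train, df1_test, x, and y from above; the specific k grid is illustrative.

# Sketch: standardize predictors (KNN is distance-based), then scan odd k values.
# Test data are scaled with the training set's centers/scales to avoid leakage.
train_s <- scale(df1_train)
test_s  <- scale(df1_test,
                 center = attr(train_s, "scaled:center"),
                 scale  = attr(train_s, "scaled:scale"))
ks  <- seq(1, 21, by = 2)
acc <- sapply(ks, function(k) mean(knn(train_s, test_s, cl = x, k = k) == y))
data.frame(k = ks, accuracy = round(acc, 3))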

2. Clustering Algorithm

Next, we apply clustering, which comes from the second half of this course. In the raw data, Income is stored as text (for example, "15,000 USD"), so we first strip the " USD" suffix and the thousands separators and convert the column to numeric. From the skim result below it is then observed that there are 4 numeric and 1 character type variables, and there are no missing values.

library(stringr)
# Income is stored as text (e.g., "15,000 USD"): strip the suffix and
# thousands separators, then convert to numeric
df2 <- df2 %>%
  mutate(Income = str_replace_all(Income, " USD", "")) %>%
  mutate(Income = str_replace_all(Income, ",", "")) %>%
  mutate(Income = as.numeric(Income))

skim(df2)
Data summary
Name: df2
Number of rows: 200
Number of columns: 5
Column type frequency: character (1), numeric (4)
Group variables: None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Gender 0 1 4 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CustomerID 0 1 100.50 57.88 1 50.75 100.5 150.25 200 ▇▇▇▇▇
Age 0 1 38.85 13.97 18 28.75 36.0 49.00 70 ▆▇▅▃▂
Income 0 1 60560.00 26264.72 15000 41500.00 61500.0 78000.00 137000 ▆▇▇▂▁
SpendingScore 0 1 50.20 25.82 1 34.75 50.0 73.00 99 ▃▃▇▃▃

Next, remove the columns that are not needed for segmentation (CustomerID, Gender, Age) and standardize the remaining variables with scale() so that Income and SpendingScore contribute equally to the distance calculations.

Now there are 2 numeric variables and no missing values.

# Drop identifier and demographic columns, then standardize
df2 <- df2 %>%
  select(-CustomerID, -Gender, -Age) %>%
  scale()

skim(df2)
Data summary
Name: df2
Number of rows: 200
Number of columns: 2
Column type frequency: numeric (2)
Group variables: None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Income 0 1 0 1 -1.73 -0.73 0.04 0.66 2.91 ▆▇▇▂▁
SpendingScore 0 1 0 1 -1.91 -0.60 -0.01 0.88 1.89 ▃▃▇▃▃

First, we need to identify the optimal k using the elbow, silhouette, and gap statistic methods.

The within-cluster sum of squares (WCSS) of a cluster is the sum of the squared distances between the items in the cluster and the cluster centroid. As k increases, WCSS decreases, but at some point a visible bend occurs in the curve, representing the point at which increasing k no longer yields a significant reduction in WCSS. This point is known as the elbow, and the k value at this point is usually taken as the appropriate number of clusters for the dataset. This technique of using the elbow of the WCSS curve to determine the right number of clusters is known as the elbow method.
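
As a concrete illustration (a minimal sketch; the fviz_nbclust() call below automates this), the WCSS curve can be computed directly from kmeans(), which reports the total within-cluster sum of squares as tot.withinss:

# Sketch: total WCSS for k = 1..10, computed directly from kmeans().
# Assumes df2 is the scaled two-column matrix prepared above.
set.seed(2222)
wss <- sapply(1:10, function(k) kmeans(df2, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total WCSS")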

The average silhouette method computes the average silhouette of all items in the dataset for different values of k. The silhouette of an item measures how closely the item matches the other items within its own cluster and how loosely it matches items in neighboring clusters. A silhouette value close to 1 implies that an item is in the right cluster, while a value close to -1 implies that it is in the wrong cluster. The k value corresponding to the highest average silhouette represents the optimal number of clusters. The gap statistic, in turn, compares the observed within-cluster variation for each k against what would be expected under a null reference distribution with no cluster structure; the optimal number of clusters is the k value that yields the largest gap statistic.
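
For reference, both quantities can also be computed directly with the cluster package; this is a hedged sketch (the k choice and bootstrap count are illustrative), and fviz_nbclust() below wraps the same computations:

library(cluster)

# Sketch: average silhouette width for one candidate k (here k = 6).
# silhouette() returns a matrix whose third column is the per-item width.
km  <- kmeans(df2, centers = 6, nstart = 10)
sil <- silhouette(km$cluster, dist(df2))
mean(sil[, 3])  # closer to 1 indicates better-separated clusters

# Sketch: gap statistic over k = 1..10 with B = 50 bootstrap reference sets.
gap <- clusGap(df2, FUNcluster = kmeans, nstart = 10, K.max = 10, B = 50)
print(gap, method = "firstmax")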

Using fviz_nbclust(), the recommended value for k is obtained from all three methods. In each plot, a red circle is drawn at the selected k.

library(factoextra)
## Warning: package 'factoextra' was built under R version 4.0.5
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# Elbow Method (red circle marks k = 6)
a <- fviz_nbclust(df2, kmeans, method = "wss") +
  geom_point(shape = 1, x = 6, y = 60, colour = "red", size = 9, stroke = 1.6) +
  ggtitle("Elbow Method")

# Silhouette Method
b <- fviz_nbclust(df2, kmeans, method = "silhouette") +
  geom_point(shape = 1, x = 6, y = 0.53, colour = "red", size = 9, stroke = 1.6) +
  ggtitle("Silhouette Method")

# Gap Statistic
c <- fviz_nbclust(df2, kmeans, method = "gap_stat") +
  geom_point(shape = 1, x = 6, y = 0.57, colour = "red", size = 9, stroke = 1.6) +
  ggtitle("Gap Statistic")


grid.arrange(a, b, c)

Based on these plots, we set the value of k to 6 and use nstart = 25, so that kmeans() tries 25 random starting configurations and keeps the best one.

set.seed(2222)
k_clust <- kmeans(df2, centers = 6, nstart = 25)

fviz_cluster(k_clust, data = df2,
             main = "Mall Customers Segmented by Income and Spending Score",
             repel = TRUE, ggtheme = theme_minimal()) +
  theme(text = element_text(size = 15))
## Warning: ggrepel: 89 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
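
To ground the segment descriptions in the conclusion, the clusters can be profiled on the original (unscaled) values. A minimal sketch, assuming the raw CSV is re-read because df2 was overwritten with the scaled matrix above:

# Sketch: average Income and SpendingScore per cluster on the original scale.
# Re-reads the raw file because df2 now holds the scaled matrix.
raw <- read.csv("mallcustomers.csv") %>%
  mutate(Income = as.numeric(str_replace_all(Income, " USD|,", "")))
raw %>%
  mutate(cluster = k_clust$cluster) %>%
  group_by(cluster) %>%
  summarise(n = n(),
            avg_income   = mean(Income),
            avg_spending = mean(SpendingScore))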

Conclusion

The k value plays an important role in the performance of a KNN model and is the algorithm's key tuning parameter. Based on the confusionMatrix output, the accuracy of the KNN model is approximately 71%. This is not perfect, but it is a reasonable baseline that could likely be improved by scaling the predictors and tuning k.

Based on the cluster visualization, it can be seen that the customers in clusters 4 and 5 have above-average spending scores and above-average income, which suggests that they earn more and spend more. The customers in cluster 3 are also high earners, but they have below-average spending scores, which suggests that they earn more but spend less. Cluster 1 represents lower-earning, lower-spending customers, while cluster 2 represents the average customer with average income and average spending score. The customers in cluster 6 have above-average spending but below-average income. From a business standpoint, these segments could guide marketing decisions: for example, clusters 4 and 5 are natural candidates for loyalty programs, while cluster 3 may respond to targeted promotions designed to convert income into spending.