You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homeworks. You can work on a problem from your job or something you are interested in. You may also obtain a dataset from sites such as Kaggle, data.gov, the Census Bureau, USGS, or other open data portals.
Select one of the methodologies studied in weeks 1-10 and one methodology from weeks 11-15 to apply to the newly selected dataset.
To complete this task:

- Describe the problem you are trying to solve.
- Describe your dataset and what you did to prepare the data for analysis.
- Describe the methodologies you used for analyzing the data.
- Explain the purpose of the analysis performed.
- Make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.
Your final presentation could be the traditional R or Python file and essay, or it could be an oral presentation with the execution and explanation of your code, recorded on any platform of your choice (YouTube, Free Cam). If you select the presentation, it should be a 5- to 8-minute recording.
For the first scenario (KNN), I used the Pima Indians Diabetes dataset from Kaggle. It can be accessed using the link below.
https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
The objective is to predict, based on the diagnostic measurements included in the dataset, whether or not a patient has diabetes. The dataset consists of 8 predictor variables and one target variable, Outcome.
For the second case (clustering), I decided to use the mall customers dataset. It can be downloaded from the link below.
https://www.wiley.com/en-kr/Practical+Machine+Learning+in+R-p-9781119591511#downloads-section
The goal here is to segment customers based on Income and SpendingScore.
The diabetes dataset is loaded into df1 and the mall customers dataset is loaded into df2 using read.csv().
library(skimr)
## Warning: package 'skimr' was built under R version 4.0.5
library(ggplot2)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v tibble 3.1.8 v dplyr 1.0.10
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.2
## v purrr 0.3.4
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
df1 <- read.csv("diabetes.csv")
head(df1)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
skim(df1)
| Name | df1 |
|---|---|
| Number of rows | 768 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 0 | 1 | 3.85 | 3.37 | 0.00 | 1.00 | 3.00 | 6.00 | 17.00 | ▇▃▂▁▁ |
| Glucose | 0 | 1 | 120.89 | 31.97 | 0.00 | 99.00 | 117.00 | 140.25 | 199.00 | ▁▁▇▆▂ |
| BloodPressure | 0 | 1 | 69.11 | 19.36 | 0.00 | 62.00 | 72.00 | 80.00 | 122.00 | ▁▁▇▇▁ |
| SkinThickness | 0 | 1 | 20.54 | 15.95 | 0.00 | 0.00 | 23.00 | 32.00 | 99.00 | ▇▇▂▁▁ |
| Insulin | 0 | 1 | 79.80 | 115.24 | 0.00 | 0.00 | 30.50 | 127.25 | 846.00 | ▇▁▁▁▁ |
| BMI | 0 | 1 | 31.99 | 7.88 | 0.00 | 27.30 | 32.00 | 36.60 | 67.10 | ▁▃▇▂▁ |
| DiabetesPedigreeFunction | 0 | 1 | 0.47 | 0.33 | 0.08 | 0.24 | 0.37 | 0.63 | 2.42 | ▇▃▁▁▁ |
| Age | 0 | 1 | 33.24 | 11.76 | 21.00 | 24.00 | 29.00 | 41.00 | 81.00 | ▇▃▁▁▁ |
| Outcome | 0 | 1 | 0.35 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
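One caveat before modeling: several variables have a minimum (p0) of 0 even though zero is not a physiologically plausible reading for them, so these zeros most likely stand in for missing measurements. A quick check of how widespread they are (an added sanity check, not part of the original skim output):
# Count implausible zeros in columns where 0 cannot be a real measurement
colSums(select(df1, Glucose, BloodPressure, SkinThickness, Insulin, BMI) == 0)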
df1 %>% gather() %>% ggplot(aes(value)) + geom_histogram() + facet_wrap(~ key, scales = "free", ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df2 <- read.csv("mallcustomers.csv")
head(df2)
## CustomerID Gender Age Income SpendingScore
## 1 1 Male 19 15,000 USD 39
## 2 2 Male 21 15,000 USD 81
## 3 3 Female 20 16,000 USD 6
## 4 4 Female 23 16,000 USD 77
## 5 5 Female 31 17,000 USD 40
## 6 6 Female 22 17,000 USD 76
skim(df2)
| Name | df2 |
|---|---|
| Number of rows | 200 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| Income | 0 | 1 | 10 | 11 | 0 | 64 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CustomerID | 0 | 1 | 100.50 | 57.88 | 1 | 50.75 | 100.5 | 150.25 | 200 | ▇▇▇▇▇ |
| Age | 0 | 1 | 38.85 | 13.97 | 18 | 28.75 | 36.0 | 49.00 | 70 | ▆▇▅▃▂ |
| SpendingScore | 0 | 1 | 50.20 | 25.82 | 1 | 34.75 | 50.0 | 73.00 | 99 | ▃▃▇▃▃ |
Starting with KNN modeling. From the skim output it is clear that all the variables are numeric and that there are a total of 768 records in this dataset. No values are explicitly missing (though, as flagged above, the zeros in several columns likely encode missing readings), so we can proceed with splitting the dataset into train and test sets and then separating the target variable out of each.
set.seed(222)
# 75/25 train/test split
i <- sample(nrow(df1), round(nrow(df1) * 0.75), replace = FALSE)
df1_train <- df1[i, ]
df1_test <- df1[-i, ]
# Hold the class labels out separately: x = training labels, y = test labels
x <- as.factor(pull(df1_train, Outcome))
y <- as.factor(pull(df1_test, Outcome))
# Drop the target variable from the predictor sets
df1_train <- data.frame(select(df1_train, -Outcome))
df1_test <- data.frame(select(df1_test, -Outcome))
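A caveat worth noting: knn() measures Euclidean distance on the raw features, so a wide-ranged variable such as Insulin can dominate a narrow-ranged one such as DiabetesPedigreeFunction. The model below is fit on the unscaled data as-is; if one wanted to standardize first, a minimal sketch (using the training means and standard deviations for both sets, to avoid test-set leakage) could look like this:
# Optional standardization sketch (not applied in the analysis below)
train_means <- sapply(df1_train, mean)
train_sds <- sapply(df1_train, sd)
df1_train_s <- scale(df1_train, center = train_means, scale = train_sds)
df1_test_s <- scale(df1_test, center = train_means, scale = train_sds)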
Now let's proceed with the KNN modeling.
library(class)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
knnmodel <- knn(df1_train, df1_test, cl = x, k = 5)
Below is the confusionMatrix() output for this model. The accuracy is approximately 71%. Not perfect, but acceptable.
confusionMatrix(table(y, knnmodel))
## Confusion Matrix and Statistics
##
## knnmodel
## y 0 1
## 0 99 24
## 1 32 37
##
## Accuracy : 0.7083
## 95% CI : (0.6386, 0.7716)
## No Information Rate : 0.6823
## P-Value [Acc > NIR] : 0.2443
##
## Kappa : 0.35
##
## Mcnemar's Test P-Value : 0.3496
##
## Sensitivity : 0.7557
## Specificity : 0.6066
## Pos Pred Value : 0.8049
## Neg Pred Value : 0.5362
## Prevalence : 0.6823
## Detection Rate : 0.5156
## Detection Prevalence : 0.6406
## Balanced Accuracy : 0.6811
##
## 'Positive' Class : 0
##
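Because k is the key tuning parameter of KNN, it is worth checking how sensitive the accuracy is to its value rather than committing to k = 5. A small added sketch that re-fits the model over a grid of odd k values (odd values avoid voting ties in a binary problem) and records the test accuracy:
# Added tuning sketch: test accuracy across a grid of odd k values
ks <- seq(1, 21, by = 2)
acc <- sapply(ks, function(k) mean(knn(df1_train, df1_test, cl = x, k = k) == y))
data.frame(k = ks, accuracy = acc)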
Next up is clustering, which is part of the second half of this course. In the initial skim above, Income was read in as a character column (values such as "15,000 USD"), so we first strip the " USD" suffix and the thousands separators and convert it to numeric. The skim result below then shows 4 numeric variables and 1 character variable, with no missing values.
library(stringr)
# Clean Income: drop the " USD" suffix and the commas, then convert to numeric
df2 <- df2 %>%
  mutate(Income = str_replace_all(Income, " USD", "")) %>%
  mutate(Income = str_replace_all(Income, ",", "")) %>%
  mutate(Income = as.numeric(Income))
skim(df2)
| Name | df2 |
|---|---|
| Number of rows | 200 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CustomerID | 0 | 1 | 100.50 | 57.88 | 1 | 50.75 | 100.5 | 150.25 | 200 | ▇▇▇▇▇ |
| Age | 0 | 1 | 38.85 | 13.97 | 18 | 28.75 | 36.0 | 49.00 | 70 | ▆▇▅▃▂ |
| Income | 0 | 1 | 60560.00 | 26264.72 | 15000 | 41500.00 | 61500.0 | 78000.00 | 137000 | ▆▇▇▂▁ |
| SpendingScore | 0 | 1 | 50.20 | 25.82 | 1 | 34.75 | 50.0 | 73.00 | 99 | ▃▃▇▃▃ |
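As an aside, the Income cleanup above could also be done in one step with readr::parse_number() (readr is already loaded via the tidyverse), which strips currency text and thousands separators automatically:
# Equivalent one-step alternative to the str_replace_all() chain
# (not run here, since Income has already been converted):
# df2 <- df2 %>% mutate(Income = parse_number(Income))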
Removing the unnecessary columns (CustomerID, Gender, Age) and standardizing the two that remain with scale(), so that Income (in dollars) and SpendingScore (a 1-100 index) contribute comparably to the Euclidean distances used by k-means.
Now there are 2 numeric variables, each centered at 0 with unit standard deviation, and no missing values.
df2 <- df2 %>%
select(-CustomerID, -Gender, -Age) %>% scale()
skim(df2)
| Name | df2 |
|---|---|
| Number of rows | 200 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Income | 0 | 1 | 0 | 1 | -1.73 | -0.73 | 0.04 | 0.66 | 2.91 | ▆▇▇▂▁ |
| SpendingScore | 0 | 1 | 0 | 1 | -1.91 | -0.60 | -0.01 | 0.88 | 1.89 | ▃▃▇▃▃ |
First, we need to identify the optimal k using the elbow, silhouette, and gap statistic methods.
The within-cluster sum of squares (WCSS) of a cluster is the sum of the squared distances between the items in the cluster and the cluster centroid. At some point in the curve of total WCSS against k, a visible bend occurs, representing the point at which increasing k no longer yields a significant reduction in WCSS. This point is known as the elbow, and the k value at this point is usually taken as the appropriate number of clusters for the dataset. This technique of using the elbow of the WCSS curve to determine the right number of clusters is known as the elbow method.
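For reference, the curve that fviz_nbclust() draws for the elbow method can be reproduced directly from kmeans() output; a minimal sketch, assuming the scaled df2 from above:
# Manual elbow computation: total WCSS for k = 1 through 10
set.seed(2222)
wcss <- sapply(1:10, function(k) kmeans(df2, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wcss, type = "b", xlab = "Number of clusters k", ylab = "Total WCSS")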
The average silhouette method computes the average silhouette of all items in the dataset for different values of k. The silhouette of an item measures how closely the item matches the other items within its own cluster and how loosely it matches items in neighboring clusters. A silhouette value close to 1 implies that an item is in the right cluster, while a value close to -1 implies that it is in the wrong cluster. The k value corresponding to the highest average silhouette represents the optimal number of clusters. Finally, the gap statistic method compares the total within-cluster variation at each k with its expected value under a null reference distribution with no obvious clustering; the optimal number of clusters is the k value that yields the largest gap statistic.
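Likewise, the average silhouette width for a single candidate k can be computed with silhouette() from the cluster package (a recommended package bundled with R); an illustrative sketch for k = 6:
# Average silhouette width for one candidate k (illustration only)
library(cluster)
set.seed(2222)
sil <- silhouette(kmeans(df2, centers = 6, nstart = 25)$cluster, dist(df2))
mean(sil[, "sil_width"])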
Using fviz_nbclust(), a recommended value for k is obtained from each of the three methods.
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.0.5
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Elbow Method (the red circles mark the chosen k = 6 in all three plots)
a <- fviz_nbclust(df2, kmeans, method = "wss") + geom_point(shape = 1, x = 6, y = 60, colour = "red", size = 9, stroke = 1.6) + ggtitle("Elbow Method")
# Silhouette Method
b <- fviz_nbclust(df2, kmeans, method = "silhouette") + geom_point(shape = 1, x = 6, y = 0.53, colour = "red", size = 9, stroke = 1.6) + ggtitle("Silhouette Method")
# Gap Statistic
c <- fviz_nbclust(df2, kmeans, method = "gap_stat") + geom_point(shape = 1, x = 6, y = 0.57, colour = "red", size = 9, stroke = 1.6) + ggtitle("Gap Statistic")
grid.arrange(a, b, c)
Based on these plots, we set k to 6 and use nstart = 25, so that kmeans() tries 25 random starting configurations and keeps the best one (nstart defaults to 1, so it must be set explicitly).
set.seed(2222)
k_clust <- kmeans(df2, centers = 6, nstart = 25)
fviz_cluster( k_clust, data = df2, main = "Mall Customers Segmented by Income and Spending Score", repel = TRUE, ggtheme = theme_minimal()) + theme(text = element_text(size = 15))
## Warning: ggrepel: 89 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
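To back up the visual reading of the clusters, the segments can also be profiled numerically; a short added sketch that averages the scaled features and counts customers per cluster:
# Profile each cluster: mean scaled Income/SpendingScore and cluster size
as.data.frame(df2) %>%
  mutate(Cluster = k_clust$cluster) %>%
  group_by(Cluster) %>%
  summarise(Income = mean(Income), SpendingScore = mean(SpendingScore), n = n())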
The k value plays an important role in the performance of a KNN model; it is the algorithm's key tuning parameter. Based on the confusionMatrix output, the accuracy of the KNN model is approximately 71% (95% CI: 64%-77%). That is serviceable but not significantly better than the no-information rate of 68% (p = 0.24), so tuning k (as sketched above) or standardizing the predictors would be reasonable next steps.
Based on the cluster visualization, it can be seen that the customers in clusters 4 and 5 have above-average spending scores and above-average incomes, which suggests that they earn more and spend more. The customers in cluster 3 are also high earners, but they have below-average spending scores, which suggests that they earn more but spend less. Cluster 1 represents lower-earning and lower-spending customers, while cluster 2 represents the average customer, with average income and an average spending score. The customers in cluster 6 have above-average spending but below-average incomes. From a business standpoint, these segments suggest where a mall operator might focus: for example, cluster 3 customers have untapped spending capacity and are natural targets for promotions, while clusters 4 and 5 are strong candidates for loyalty programs.