Question 3.1a

Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in Question 2.2, use the ksvm or kknn function to find a good classifier: (a) using cross-validation (do this for the k-nearest-neighbors model; SVM is optional)

Below is the code used to answer question 3.1a, along with comments describing each step. In general, the approach was: load the data and appropriate libraries; split the full data set into training (70% of the data) and testing (30% of the data) sets; use leave-one-out cross-validation (LOOCV) on the training set to identify optimal hyperparameters for a k-nearest-neighbors model; fit a model with those hyperparameters on the training data to calculate training accuracy; and then run the same model on the test set to calculate the reported accuracy. I hypothesized that the training accuracy would be overly optimistic due to random effects, and that the testing accuracy would be lower but a truer measure of model accuracy.

# -------------------- Code for Question 3.1.A -----------------------------

# -------------------- Load libraries and split data set -----------------------------
# Clear environment

rm(list = ls())

#First, load the kknn library (which contains the kknn function) and read in the data

library(kknn)

cc_data <- read.table("/Users/walkermeadows/OneDrive/GT_ISYE/Week 2/credit_card_data.txt", stringsAsFactors = FALSE, header = FALSE)

#Set seed so random variables can be replicated
set.seed(1)

#Generate a random sample of 70% of the rows
random_row<- sample(1:nrow(cc_data),as.integer(0.7*nrow(cc_data)))

#Assign the trainData set to 70% of the original data set. 
trainData = cc_data[random_row,]

#Assign the testData set to the remaining 30% of the original set
testData = cc_data[-random_row,]

#check data formats
#head(trainData)
#head(testData)

#check number of rows in each data set to confirm correct split
#nrow(trainData)
#nrow(testData)

# -------------------- Train hyperparameters on training data set -----------------------------

#Use LOOCV to determine ideal kernel and k value
train.kknn(as.factor(V11)~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10, data = trainData, kmax = 50, scale = TRUE)
## 
## Call:
## train.kknn(formula = as.factor(V11) ~ V1 + V2 + V3 + V4 + V5 +     V6 + V7 + V8 + V9 + V10, data = trainData, kmax = 50, scale = TRUE)
## 
## Type of response variable: nominal
## Minimal misclassification: 0.1509847
## Best kernel: optimal
## Best k: 12
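# As a sketch (using the same trainData as above): train.kknn stores the selected
# hyperparameters in its best.parameters component, so they can be pulled out
# programmatically instead of read off the printout.
loocv_fit <- train.kknn(as.factor(V11)~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10, data = trainData, kmax = 50, scale = TRUE)
best_kernel <- loocv_fit$best.parameters$kernel  # "optimal" in the run above
best_k <- loocv_fit$best.parameters$k            # 12 in the run above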

# -------------------- Test trained hyperparameters on training data to calc accuracy -----------------------------

predicted_train <- rep(0,(nrow(trainData))) # predictions: start with a vector of all zeros
train_accuracy<- 0  #initialize variable

for (i in 1:nrow(trainData)){
  model=kknn(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,trainData[-i,],trainData[i,],k=12,kernel="optimal", scale = TRUE) # use scaled data
  predicted_train[i]<- as.integer(fitted(model)+0.5) # round off to 0 or 1 and store predicted values in vector
}

# calculate fraction of correct predictions
train_accuracy<- sum(predicted_train == trainData[,11]) / nrow(trainData)

# -------------------- Use trained hyperparameters on test data to calculate final accuracy -----------------------------

predicted_test <- rep(0,(nrow(testData))) # predictions: start with a vector of all zeros
test_accuracy<- 0 #initialize variable

for (i in 1:nrow(testData)){
  model=kknn(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,testData[-i,],testData[i,],k=12,kernel="optimal", scale = TRUE) # use scaled data
  predicted_test[i]<- as.integer(fitted(model)+0.5) # round off to 0 or 1 and store predicted values in vector
}

# calculate fraction of correct predictions
test_accuracy<- sum(predicted_test == testData[,11]) / nrow(testData)
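# As an optional sketch (using the test predictions computed above), a confusion
# matrix gives a fuller picture than the single accuracy number:
table(Predicted = predicted_test, Actual = testData$V11)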

train_accuracy
## [1] 0.8468271
test_accuracy
## [1] 0.8020305

As expected, the true model accuracy (80%), calculated on the test data set, is lower than the training accuracy (85%), which was calculated on the training data itself and therefore benefits from random effects.

Question 3.1b

Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in Question 2.2, use the ksvm or kknn function to find a good classifier: (b) splitting the data into training, validation, and test data sets (pick either KNN or SVM; the other is optional).

Below is the code used to answer question 3.1b, along with comments describing each step. In general, the approach was: load the data and appropriate libraries; split the full data set into training (70% of the data), validation (15%), and testing (15%) sets; build several different k-nearest-neighbors models on the training data; evaluate the candidate models on the validation data to identify the best one; and evaluate the final model on the test set to calculate the reported accuracy. I hypothesized that the training accuracy would be overly optimistic due to random effects, and that the testing accuracy would be lower but a truer measure of model accuracy.

# -------------------- Code for Question 3.1.B -----------------------------

# -------------------- Load libraries and split data set -----------------------------
# Clear environment

rm(list = ls())

#First, load the kknn library (which contains the kknn function) and read in the data

library(kknn)

cc_data <- read.table("/Users/walkermeadows/OneDrive/GT_ISYE/Week 2/credit_card_data.txt", stringsAsFactors = FALSE, header = FALSE)

#Set seed so random variables can be replicated
set.seed(1)

#Generate a random sample of 70% of the rows
random_row<- sample(1:nrow(cc_data),as.integer(0.7*nrow(cc_data)))

#Assign the trainData set to 70% of the original data set. 
trainData = cc_data[random_row,]

#Assign the remaining 30% of data to a "Remaining Data" data set.
remainingData = cc_data[-random_row,]

#Generate a random sample of 50% of the remaining rows in "Remaining Data"
random_row2<-sample(1:nrow(remainingData),as.integer(0.5*nrow(remainingData)))

#Assign half the remaining data to validation data
validateData=remainingData[random_row2,]

#Assign the remaining data to testData
testData=remainingData[-random_row2,]

#check data formats
#head(trainData)
#head(validateData)
#head(testData)

#Check number of rows in each data set
#nrow(trainData)
#nrow(validateData)
#nrow(testData)

# -------------------- Test/build different kknn models on training data -----------------------------
predicted_train<- rep(0,(nrow(trainData))) # predictions: start with a vector of all zeros
train_accuracy<- 0  #initialize variable
X<- 0 #initialize variable
accuracyTable <-data.frame(matrix(nrow = 25, ncol = 2))
colnames(accuracyTable) <- c("K","Accuracy") # Create blank table for k values and associated accuracies. 


for(X in 1:25){

  for (i in 1:nrow(trainData)){
    model=kknn(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,trainData[-i,],trainData[i,],k=X,kernel="optimal", scale = TRUE) # use scaled data
    predicted_train[i]<- as.integer(fitted(model)+0.5) # round off to 0 or 1 and store predicted values in vector
  }

  # calculate fraction of correct predictions
  train_accuracy<- sum(predicted_train == trainData[,11]) / nrow(trainData)

  accuracyTable[X, 1] <- X
  accuracyTable[X, 2] <- train_accuracy
}

#Output K accuracy table.
accuracyTable
##     K  Accuracy
## 1   1 0.7986871
## 2   2 0.7986871
## 3   3 0.7986871
## 4   4 0.7986871
## 5   5 0.8424508
## 6   6 0.8336980
## 7   7 0.8402626
## 8   8 0.8424508
## 9   9 0.8446389
## 10 10 0.8446389
## 11 11 0.8446389
## 12 12 0.8468271
## 13 13 0.8446389
## 14 14 0.8446389
## 15 15 0.8446389
## 16 16 0.8446389
## 17 17 0.8424508
## 18 18 0.8424508
## 19 19 0.8424508
## 20 20 0.8380744
## 21 21 0.8380744
## 22 22 0.8358862
## 23 23 0.8358862
## 24 24 0.8380744
## 25 25 0.8380744
plot(accuracyTable[,1],accuracyTable[,2])

# A k-value of 12 appears to be the best fit for the training data, but I will
# also check k-values of 13, 14, and 15 against the validation set for verification.
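# As a sketch (using the accuracyTable built above), the best k can also be
# selected programmatically rather than read off the plot:
best_k_train <- accuracyTable$K[which.max(accuracyTable$Accuracy)]  # 12 for the run above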

# -------------------- Test the 4 different models on the validation data set -----------------------------

predicted_validate<- rep(0,(nrow(validateData))) # predictions: start with a vector of all zeros
validate_accuracy<- 0  #initialize variable
X<- 0 #initialize variable
accuracyTable_validate <-data.frame(matrix(nrow = 4, ncol = 2))
colnames(accuracyTable_validate) <- c("K","Validate_Accuracy") # Create blank table for k values and associated accuracies. 
counter<-0

for(X in 12:15){
  counter<- counter + 1
  for (i in 1:nrow(validateData)){
    model=kknn(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,validateData[-i,],validateData[i,],k=X,kernel="optimal", scale = TRUE) # use scaled data
    predicted_validate[i]<- as.integer(fitted(model)+0.5) # round off to 0 or 1 and store predicted values in vector
  }
  
  # calculate fraction of correct predictions
  validate_accuracy<- sum(predicted_validate == validateData[,11]) / nrow(validateData)
  
  accuracyTable_validate[counter, 1] <- X
  accuracyTable_validate[counter, 2] <- validate_accuracy
}

accuracyTable_validate
##    K Validate_Accuracy
## 1 12         0.8380744
## 2 13         0.8380744
## 3 14         0.8380744
## 4 15         0.8380744
#Output shows that all 4 models perform equally well. Therefore I will proceed with the k=12 model
#to run on test data to calculate final accuracy

# -------------------- Use k=12 model to test final accuracy on test data -----------------------------
predicted_test<- rep(0,(nrow(testData))) # predictions: start with a vector of all zeros
test_accuracy<- 0  #initialize variable

for (i in 1:nrow(testData)){
  model=kknn(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,testData[-i,],testData[i,],k=12,kernel="optimal", scale = TRUE) # use scaled data
  predicted_test[i]<- as.integer(fitted(model)+0.5) # round off to 0 or 1 and store predicted values in vector
}

# calculate fraction of correct predictions
test_accuracy<- sum(predicted_test == testData[,11]) / nrow(testData)

train_accuracy
## [1] 0.8380744
test_accuracy
## [1] 0.7676768

As expected, the true model accuracy (77%), calculated on the test data set, is lower than the training accuracy (84%), which was calculated on the training data itself and therefore benefits from random effects.

It’s also interesting to compare the testing accuracies from questions 3.1a and 3.1b. The testing accuracy in 3.1b was roughly 3.5 percentage points lower (76.8% vs. 80.2%), which may be partly attributable to the smaller test set used in 3.1b compared to 3.1a.

Question 4.1

One interesting application of a clustering model would be the segmentation of voters leading up to an election to provide more tailored and influential messaging. For example, a campaign manager for a gubernatorial candidate could hire an incredibly intelligent analytics student from Georgia Tech’s OMS program to help maximize the effectiveness of their limited campaign budget. Demographic data on voters could then be gathered and a clustering model could be created. A few variables that could be used are:

• Median household income
• Geographic location
• Age
• Number of educational degrees
• Family size

Multiple models could be created to test different variable combinations, and the optimal number of clusters could be determined. The campaign manager could then use the voter segmentation to deliver more effective messaging to voters in each segment. For example, voters who are older, have smaller families, and have high incomes may be more influenced by messaging related to lowering taxes, estate planning, and so on, whereas a younger voter with no kids and a low income may be more swayed by messaging related to social issues or career development opportunities. Both of these are simply hypothetical situations – a number of different relationships could be observed.
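As a rough sketch of how this might look in R (the voter_data data frame and its column names below are hypothetical placeholders, not a real data set), the numeric demographic variables could be scaled and passed to kmeans, with the resulting cluster labels attached back to the voter records for targeted messaging:

# Hypothetical sketch: voter_data and its columns are made-up placeholder names
voter_features <- scale(voter_data[, c("median_income", "age", "num_degrees", "family_size")])
voter_clusters <- kmeans(voter_features, centers = 4, nstart = 25)  # number of clusters would be chosen e.g. via the elbow method
voter_data$segment <- voter_clusters$cluster  # segment label used to tailor messaging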

Question 4.2

The iris data set iris.txt contains 150 data points, each with four predictor variables and one categorical response. The predictors are the width and length of the sepal and petal of flowers and the response is the type of flower. The data is available from the R library datasets and can be accessed with iris once the library is loaded. It is also available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Iris). The response values are only given to see how well a specific method performed and should not be used to build the model. Use the R function kmeans to cluster the points as well as possible. Report the best combination of predictors, your suggested value of k, and how well your best clustering predicts flower type.

Below is the code used to answer question 4.2, along with comments describing each step. In general, the approach was: load the applicable libraries and data; use the fviz_nbclust function and the “elbow method” to determine the optimal number of clusters; run the kmeans clustering function with that number of clusters on the predictor data; and compare the cluster assignments to the provided species labels to determine accuracy.

# -------------------- Code for Question 4.2 -----------------------------

# -------------------- Load libraries and data -----------------------------
# Clear environment

rm(list = ls())

#First, load the kknn library and read in the data

library(kknn)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
flower_data <- read.table("/Users/walkermeadows/OneDrive/GT_ISYE/Week 2/iris.txt", stringsAsFactors = FALSE, header = TRUE)

#Convert species names to numbers
mapping<- c("setosa"=1,"versicolor"=2,"virginica"=3)
flower_data$Species<- mapping[flower_data$Species]

#remove species column for clustering
filtered_flower_data<-flower_data[,1:4]

#view summary of loaded data as check
#head(flower_data)

#Set seed so random variables can be replicated
set.seed(1)

# -------------------- Determine best cluster size given data set -----------------------------

# this function outputs plot to evaluate ideal # clusters
fviz_nbclust(filtered_flower_data, kmeans, method ="wss")

# "elbow method" used when evaluating results to determine optimatal # of clusters
# Optimal number of clusters determined to be 3, which represents the "eblow" in the plot
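# As a rough sketch of what fviz_nbclust(method = "wss") computes under the hood:
# the total within-cluster sum of squares for each candidate number of clusters.
wss <- sapply(1:10, function(k) kmeans(filtered_flower_data, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")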


# -------------------- Perform kmeans clustering given optimal # of clusters -----------------------------

#perform clustering
k_means_cluster<- kmeans(filtered_flower_data,centers=3,nstart=25)

#output clustering information
str(k_means_cluster)
## List of 9
##  $ cluster     : Named int [1:150] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
##  $ centers     : num [1:3, 1:4] 5.01 5.9 6.85 3.43 2.75 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##  $ totss       : num 681
##  $ withinss    : num [1:3] 15.2 39.8 23.9
##  $ tot.withinss: num 78.9
##  $ betweenss   : num 603
##  $ size        : int [1:3] 50 62 38
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
#store species predictions in vector for comparison
predicted_species<- k_means_cluster$cluster

#Calculate # of correct species predictions
accuracy<- sum(predicted_species == flower_data[,5]) / nrow(flower_data)
accuracy
## [1] 0.8933333
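# Note: kmeans numbers its clusters arbitrarily, so the direct comparison above
# only works because clusters 1/2/3 happened to line up with the species coding.
# As a labeling-independent sketch of the same check, a contingency table of
# cluster assignment vs. true species shows how the clusters map onto species:
table(Cluster = predicted_species, Species = flower_data$Species)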

The ideal number of clusters was determined to be 3 using the “elbow method”. The clustering model using 3 clusters then performed fairly well, correctly segmenting 89% of the flowers.