Homework 4

##Kevin Kuipers (Completed by self)

##02/05/2019

##1. Question 4.7.3 pg 168

Equation 4.11 assumes that the density of X within class k is normal (Gaussian), here with a class-specific variance \(\sigma_{k}^{2}\):

\(f_{k}(x) = \frac{1}{\sqrt{2\pi}\sigma_{k}}\exp(-\frac{1}{2\sigma_{k}^{2}}(x-\mu_{k})^{2})\)

Equation 4.12 is derived under the assumption of a shared variance term across all K classes. Here each class keeps its own variance \(\sigma_{k}^{2}\), so plugging 4.11 into 4.10 (Bayes' theorem) gives

\(p_{k}(x) = \frac{\pi_{k}\frac{1}{\sqrt{2\pi}\sigma_{k}}\exp(-\frac{1}{2\sigma_{k}^{2}}(x-\mu_{k})^{2})}{\sum_{l=1}^{K}\pi_{l}\frac{1}{\sqrt{2\pi}\sigma_{l}}\exp(-\frac{1}{2\sigma_{l}^{2}}(x-\mu_{l})^{2})}\)

The denominator sums over all classes l = 1, ..., K, so it is the same for every class k and does not affect which class has the largest posterior. We can therefore focus on the numerator:

\(p_{k}(x) \propto \pi_{k}\frac{1}{\sqrt{2\pi}\sigma_{k}}\exp(-\frac{1}{2\sigma_{k}^{2}}(x-\mu_{k})^{2})\)

Now to undo the exp term we can take the natural log, which is increasing and so preserves the ordering of the classes. Writing \(\delta_{k}(x)\) for the log of the numerator:

\(\delta_{k}(x) = \ln(\pi_{k}) + \ln(\frac{1}{\sqrt{2\pi}\sigma_{k}}) + \ln(\exp(-\frac{1}{2\sigma_{k}^{2}}(x-\mu_{k})^{2}))\)

Now combine the first two terms and remove the ln(exp) portion from the third term:

\(\delta_{k}(x) = \ln(\frac{\pi_{k}}{\sqrt{2\pi}\sigma_{k}}) -\frac{1}{2\sigma_{k}^{2}}(x-\mu_{k})^{2}\)

Now expand the squared term in parentheses:

\(\delta_{k}(x) = \ln(\frac{\pi_{k}}{\sqrt{2\pi}\sigma_{k}}) -\frac{1}{2\sigma_{k}^{2}}(x^{2} - 2\mu_{k}x + \mu_{k}^{2})\)

The second term contains \(x^{2}\) with coefficient \(-\frac{1}{2\sigma_{k}^{2}}\), which depends on the class k and so does not cancel when classes are compared. The discriminant is therefore quadratic in x, and the Bayes classifier is not linear.
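As a check on the last step, the expansion can be collected by powers of x. This is only a rearrangement of the expression above, writing \(\delta_{k}(x)\) for the log of the numerator of the posterior for class k:

\(\delta_{k}(x) = -\frac{1}{2\sigma_{k}^{2}}x^{2} + \frac{\mu_{k}}{\sigma_{k}^{2}}x - \frac{\mu_{k}^{2}}{2\sigma_{k}^{2}} + \ln(\frac{\pi_{k}}{\sqrt{2\pi}\sigma_{k}})\)

If every class shared one variance \(\sigma^{2}\), the \(x^{2}\) coefficient would be identical across classes and would drop out of any comparison between two classes, leaving a rule that is linear in x. With a class-specific \(\sigma_{k}\) it does not drop out.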

##2. Question 4.7.5 pg 169

#Problem 5 a) If the Bayes decision boundary is linear, we would expect QDA to perform better on the training data since it has greater flexibility than LDA. However, on the test data QDA would perform worse, since its extra flexibility adds variance without reducing bias, and LDA would produce better results.

#Problem 5 b) If the Bayes decision boundary is non-linear, QDA would most likely perform better on both the training and test data due to its flexibility. Generally speaking, LDA would perform worse than QDA.

#Problem 5 c) In general, as the sample size n increases, the test prediction accuracy of QDA relative to LDA tends to improve. QDA is preferred for very large data sets because the variance of the classifier is not a major concern. In contrast, when n is small, LDA is the preferred method to prevent overfitting, assuming that the variables are normally distributed and the class variances are close to equal.

#Problem 5 d) This is not a simple true-or-false answer because it depends on the size of the data set. If the Bayes decision boundary for a given problem is linear, the data set is very small, the densities are normal, and the class variances are close to equal, LDA will achieve a superior test error rate over QDA. In this case, QDA will produce a model that is overfitted. However, if the data set is very large and/or the class variances are not close to equal, then QDA will achieve a better test error rate than LDA.
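The bias-variance argument in Problem 5 can be explored with a small simulation. The sketch below is not part of the assignment: the data are hypothetical, generated from two normal classes with equal variance so that the Bayes boundary is linear, and the names simulate_data and error_rate are made up for illustration. It fits LDA and QDA on a small training set and reports both training and test error.

```r
library(MASS)  # lda() and qda()

set.seed(1)

# Hypothetical data: two classes with equal variance, so the Bayes
# decision boundary is linear
simulate_data <- function(n) {
  y <- factor(rep(c("A", "B"), each = n / 2))
  x1 <- rnorm(n, mean = ifelse(y == "A", 0, 1))
  x2 <- rnorm(n, mean = ifelse(y == "A", 0, 1))
  data.frame(y = y, x1 = x1, x2 = x2)
}

train <- simulate_data(40)    # small training set
test <- simulate_data(2000)   # large test set

# Fraction of observations whose predicted class differs from the truth
error_rate <- function(fit, data) {
  mean(predict(fit, newdata = data)$class != data$y)
}

fit_lda <- lda(y ~ x1 + x2, data = train)
fit_qda <- qda(y ~ x1 + x2, data = train)

errors <- data.frame(
  method = c("LDA", "QDA"),
  train = c(error_rate(fit_lda, train), error_rate(fit_qda, train)),
  test = c(error_rate(fit_lda, test), error_rate(fit_qda, test))
)
print(errors)
```

A single run is noisy, so repeating it over several seeds and training-set sizes gives a fairer picture of how the two methods compare.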

##3. Continue from Homework 3 Question 4.7.10(e-i) pg 171

Now I will use the Weekly data set from homework 3 and continue from where I left off, which was fitting a model where Direction is explained by Lag2. As in homework 3, I will split the data so that the years 1990 to 2008 are the training data. Then I will fit several models and compute, for each, the confusion matrix and the overall fraction of correct predictions on the held-out data, which is the years 2009 and 2010. After splitting the data I will fit a model using the LDA method, then the QDA method, and lastly the KNN method. The Weekly data set is found in the ISLR library.

Loading and splitting data set

#Loading data set
library(ISLR)
data(Weekly)


#subsetting the data for years 1990 through 2008 - Training Data
weekly_training <- subset(Weekly, Year >= 1990 & Year <= 2008)

#subsetting the data for years 2009 and 2010 - Testing data
weekly_testing <- subset(Weekly, Year > 2008)

#Problem 10 e) LDA Method

#install.packages('MASS')

library(MASS)

LDA_10d <- lda(Direction ~ Lag2, data=weekly_training)

LDA_pred <- predict(LDA_10d, newdata=weekly_testing)

cm_LDA <- table(LDA_pred$class,weekly_testing$Direction)

print(cm_LDA)

##       
##        Down Up
##   Down    9  5
##   Up     34 56

LDA_error_rate <- 1.0 - ((cm_LDA[1] + cm_LDA[2,2])/ sum(cm_LDA))

cat('The error rate for the LDA is: ', LDA_error_rate)

## The error rate for the LDA is:  0.375

Going back to homework 3, this is the same error rate that logistic regression produced: 37.5%. The accuracy is therefore also the same, 62.5%, meaning that 62.5% of the test-set predictions were correct.
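The arithmetic behind the 37.5% figure can be reproduced from the printed LDA confusion matrix alone. The sketch below re-enters that matrix by hand and uses sum(diag(...)) for the correct predictions, which is an equivalent and arguably easier to read alternative to the cm_LDA[1] + cm_LDA[2,2] indexing used above.

```r
# Confusion matrix copied from the LDA output above
# (rows are predictions, columns are the true Direction)
cm_LDA <- matrix(c(9, 34, 5, 56), nrow = 2,
                 dimnames = list(predicted = c("Down", "Up"),
                                 actual = c("Down", "Up")))

correct <- sum(diag(cm_LDA))        # 9 + 56 = 65 correct predictions
total <- sum(cm_LDA)                # 104 test weeks
error_rate <- 1 - correct / total   # 1 - 65/104 = 0.375

cat('The error rate for the LDA is: ', error_rate)
```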

#Problem 10 f) QDA Method

QDA_10d <- qda(Direction ~ Lag2, data=weekly_training)

QDA_pred <- predict(QDA_10d, newdata=weekly_testing)

cm_QDA <- table(QDA_pred$class, weekly_testing$Direction)

print(cm_QDA)

##       
##        Down Up
##   Down    0  0
##   Up     43 61

QDA_error_rate <- 1.0 - ((cm_QDA[1] + cm_QDA[2,2])/ sum(cm_QDA))


cat('The error rate for the QDA is: ', QDA_error_rate)

## The error rate for the QDA is:  0.4134615

The accuracy of the model, the fraction of correct predictions on the test data, is 58.65%, so the error rate is 41.35%. LDA and logistic regression produce better results. The confusion matrix shows that QDA predicts Up for every week in the test set, so it is right 100% of the time when the market goes up but wrong 100% of the time when the market goes down.
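The per-class behaviour of QDA can be read off the printed confusion matrix. The sketch below re-enters that matrix by hand and computes the fraction of each true class predicted correctly, along with the accuracy of a rule that always predicts Up, which is what this QDA fit does on the test weeks.

```r
# Confusion matrix copied from the QDA output above
# (rows are predictions, columns are the true Direction)
cm_QDA <- matrix(c(0, 43, 0, 61), nrow = 2,
                 dimnames = list(predicted = c("Down", "Up"),
                                 actual = c("Down", "Up")))

# Fraction of each true class that was predicted correctly
per_class_accuracy <- diag(cm_QDA) / colSums(cm_QDA)  # Down: 0/43, Up: 61/61

# Accuracy of always predicting Up is the share of Up weeks in the test set
always_up_accuracy <- sum(cm_QDA[, "Up"]) / sum(cm_QDA)  # 61/104

print(per_class_accuracy)
print(always_up_accuracy)
```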

#Problem 10 g) KNN Method

set.seed(123)
library(class)

train_x <- cbind(weekly_training$Lag2)
test_x <- cbind(weekly_testing$Lag2)
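#cbind() turns the Direction factor into integer codes, so the predictions below are labeled 1 = Down and 2 = Up
train_y <- cbind(weekly_training$Direction)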

knn_10d <- knn(train_x, test_x, train_y,k=1)
knn_cm <- table(knn_10d, weekly_testing$Direction)
knn_cm

##        
## knn_10d Down Up
##       1   21 29
##       2   22 32

knn_error_rate <- 1.0 - ((knn_cm[1] + knn_cm[2,2])/ (sum(knn_cm)))


cat('The error rate for the KNN Model is: ', knn_error_rate)

## The error rate for the KNN Model is:  0.4903846

The error rate is almost 50%, so the fraction of correct predictions is also roughly 50%. The KNN model with k = 1 does not seem like a good model to go with, since its predictions are about as accurate as flipping a coin.
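The rows of the KNN confusion matrix above are labeled 1 and 2 because the class labels were passed through cbind(), which converts the factor to integer codes. A minimal sketch on made-up toy data (the values below are hypothetical and not taken from Weekly) shows that passing the factor itself to knn() keeps the Down and Up labels.

```r
library(class)  # knn()

set.seed(123)

# Hypothetical one-predictor training data with a factor response
train_x <- matrix(c(-2.0, -1.5, -1.0, 1.0, 1.5, 2.0), ncol = 1)
train_y <- factor(c("Down", "Down", "Down", "Up", "Up", "Up"))
test_x <- matrix(c(-1.8, 1.7), ncol = 1)

# Passing the factor directly keeps the original level names
pred <- knn(train_x, test_x, cl = train_y, k = 1)
print(pred)

# cbind() would instead produce integer codes: 1 = Down, 2 = Up
print(cbind(train_y))
```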

#Problem 10 h)

Of the models used to predict whether the stock market will go up or down in 2009 and 2010, the best appear to be LDA and logistic regression, as they both produced the lowest error rate (37.5%) and therefore the highest fraction of correct predictions.

#Problem 10 i)

After much exploration, testing models, and transforming the data, the model with the best results was a logistic regression with Direction explained by the full interaction of all the Lag variables, Lag1 * Lag2 * Lag5 * Lag4 * Lag3, which had an accuracy of 75.96%. That is, it correctly predicted 75.96% of the weeks in 2009 and 2010. Note, however, that the glm call below is fit on weekly_testing itself, so this figure is an in-sample accuracy rather than a held-out one and is not directly comparable to the LDA, QDA, and KNN results, which were fit on the training data. The LDA, QDA, and KNN models produced worse results using these interactions. Below are some of the explorations I used.

#GLM method with all lag interactions

glm_mod10 <- glm(Direction ~ Lag1*Lag2*Lag5*Lag4*Lag3, data=weekly_testing, family=binomial)
summary(glm_mod10)

## 
## Call:
## glm(formula = Direction ~ Lag1 * Lag2 * Lag5 * Lag4 * Lag3, family = binomial, 
##     data = weekly_testing)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.00394  -0.65509   0.05225   0.77495   1.96796  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)  
## (Intercept)               0.143509   0.456582   0.314   0.7533  
## Lag1                      0.219942   0.170309   1.291   0.1966  
## Lag2                      0.255900   0.185333   1.381   0.1674  
## Lag5                      0.704801   0.299904   2.350   0.0188 *
## Lag4                     -0.098828   0.197688  -0.500   0.6171  
## Lag3                     -0.270078   0.200605  -1.346   0.1782  
## Lag1:Lag2                 0.065298   0.048827   1.337   0.1811  
## Lag1:Lag5                -0.145534   0.086044  -1.691   0.0908 .
## Lag2:Lag5                -0.124553   0.093979  -1.325   0.1851  
## Lag1:Lag4                 0.015456   0.090154   0.171   0.8639  
## Lag2:Lag4                -0.152158   0.104256  -1.459   0.1444  
## Lag5:Lag4                -0.411801   0.162838  -2.529   0.0114 *
## Lag1:Lag3                 0.054102   0.086080   0.629   0.5297  
## Lag2:Lag3                 0.163220   0.072792   2.242   0.0249 *
## Lag5:Lag3                -0.049439   0.096191  -0.514   0.6073  
## Lag4:Lag3                 0.215375   0.105534   2.041   0.0413 *
## Lag1:Lag2:Lag5           -0.032749   0.025128  -1.303   0.1925  
## Lag1:Lag2:Lag4           -0.059230   0.044283  -1.338   0.1810  
## Lag1:Lag5:Lag4            0.079408   0.047780   1.662   0.0965 .
## Lag2:Lag5:Lag4            0.123927   0.059912   2.068   0.0386 *
## Lag1:Lag2:Lag3            0.027401   0.030308   0.904   0.3659  
## Lag1:Lag5:Lag3           -0.039465   0.038948  -1.013   0.3109  
## Lag2:Lag5:Lag3            0.091980   0.038931   2.363   0.0181 *
## Lag1:Lag4:Lag3           -0.047311   0.038097  -1.242   0.2143  
## Lag2:Lag4:Lag3           -0.128680   0.051941  -2.477   0.0132 *
## Lag5:Lag4:Lag3            0.120552   0.071085   1.696   0.0899 .
## Lag1:Lag2:Lag5:Lag4      -0.037577   0.016753  -2.243   0.0249 *
## Lag1:Lag2:Lag5:Lag3      -0.020950   0.015749  -1.330   0.1834  
## Lag1:Lag2:Lag4:Lag3      -0.033221   0.020572  -1.615   0.1063  
## Lag1:Lag5:Lag4:Lag3      -0.028269   0.021918  -1.290   0.1971  
## Lag2:Lag5:Lag4:Lag3       0.013588   0.013138   1.034   0.3010  
## Lag1:Lag2:Lag5:Lag4:Lag3 -0.016190   0.006344  -2.552   0.0107 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 141.043  on 103  degrees of freedom
## Residual deviance:  87.445  on  72  degrees of freedom
## AIC: 151.45
## 
## Number of Fisher Scoring iterations: 10

glm_10_probs <- predict(glm_mod10, type='response', newdata=weekly_testing)
glm_10_preds <- ifelse(glm_10_probs > 0.50, 'Up', 'Down')
confusion_matrix <- table(weekly_testing$Direction, glm_10_preds)

#Computing recall for the Down class (rows are actual, columns are predicted)
recall <- confusion_matrix['Down','Down'] / sum(confusion_matrix['Down', ])

#Computing precision for the Down class from the confusion matrix
precision <- confusion_matrix['Down','Down'] / sum(confusion_matrix[ ,'Down'])

#Computing accuracy from the confusion matrix
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

#putting recall, accuracy, and precision into a table
rpa <- data.frame(
  recall = c(recall),
  precision = c(precision),
  accuracy = c(accuracy)
)
print(confusion_matrix)

##       glm_10_preds
##        Down Up
##   Down   30 13
##   Up     12 49

rpa
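The glm call above is fit on weekly_testing and then evaluated on the same rows. To show why that matters, the sketch below uses hypothetical pure-noise data (the helper names make_noise and accuracy are made up) and compares in-sample accuracy with held-out accuracy for the same five-way interaction formula. It is an illustration under those assumptions, not a re-analysis of the Weekly data.

```r
set.seed(123)

# Hypothetical pure-noise data: the response is unrelated to the predictors
make_noise <- function(n) {
  d <- as.data.frame(matrix(rnorm(n * 5), ncol = 5))
  names(d) <- paste0("Lag", 1:5)
  d$Direction <- factor(sample(c("Down", "Up"), n, replace = TRUE))
  d
}

train <- make_noise(1000)
test <- make_noise(104)

# Fraction of rows whose thresholded prediction matches the true Direction
accuracy <- function(fit, data) {
  probs <- predict(fit, newdata = data, type = "response")
  preds <- ifelse(probs > 0.5, "Up", "Down")
  mean(preds == data$Direction)
}

form <- Direction ~ Lag1 * Lag2 * Lag3 * Lag4 * Lag5

# Fit and evaluate on the same 104 rows (in-sample)
fit_in <- suppressWarnings(glm(form, data = test, family = binomial))
acc_in_sample <- accuracy(fit_in, test)

# Fit on separate training rows, evaluate on the held-out rows
fit_out <- glm(form, data = train, family = binomial)
acc_held_out <- accuracy(fit_out, test)

cat('In-sample accuracy: ', acc_in_sample, '\n')
cat('Held-out accuracy: ', acc_held_out, '\n')
```

With 32 coefficients and only 104 rows, a model can fit noise in the rows it was trained on, so only the held-out number estimates how it would do on new weeks.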

#LDA Method with all Lag interactions

LDA_10d <- lda(Direction ~ Lag1*Lag2*Lag5*Lag4*Lag3, data=weekly_training)

LDA_pred <- predict(LDA_10d, newdata=weekly_testing)

cm_LDA <- table(LDA_pred$class,weekly_testing$Direction)

print(cm_LDA)

##       
##        Down Up
##   Down   15 22
##   Up     28 39

LDA_error_rate <- 1.0 - ((cm_LDA[1] + cm_LDA[2,2])/ sum(cm_LDA))

cat('The error rate for the LDA is: ', LDA_error_rate)

## The error rate for the LDA is:  0.4807692

#QDA Method with all Lag interactions

QDA_10d <- qda(Direction ~ Lag1*Lag2*Lag5*Lag4*Lag3, data=weekly_training)

QDA_pred <- predict(QDA_10d, newdata=weekly_testing)

cm_QDA <- table(QDA_pred$class, weekly_testing$Direction)

print(cm_QDA)

##       
##        Down Up
##   Down   17 33
##   Up     26 28

QDA_error_rate <- 1.0 - ((cm_QDA[1] + cm_QDA[2,2])/ sum(cm_QDA))


cat('The error rate for the QDA is: ', QDA_error_rate)

## The error rate for the QDA is:  0.5673077

#KNN Method with all Lag interactions

set.seed(123)
library(class)

train_x <- cbind(weekly_training$Lag2*weekly_training$Lag5*weekly_training$Lag3*weekly_training$Lag4*weekly_training$Lag1)
test_x <- cbind(weekly_testing$Lag2*weekly_testing$Lag5*weekly_testing$Lag3*weekly_testing$Lag4*weekly_testing$Lag1)
train_y <- cbind(weekly_training$Direction)

knn_10d <- knn(train_x, test_x, train_y,k=120)
knn_cm <- table(knn_10d, weekly_testing$Direction)
knn_cm

##        
## knn_10d Down Up
##       1    1  5
##       2   42 56

knn_error_rate <- 1.0 - ((knn_cm[1] + knn_cm[2,2])/ (sum(knn_cm)))


cat('The error rate for the KNN Model is: ', knn_error_rate)

## The error rate for the KNN Model is:  0.4519231

#KNN Method with different K values

I will go back to the original model with Direction explained only by Lag2, but look at different K values for the KNN classifier. I will create a plot to see which K value from 1 through 100 works best.

set.seed(123)

end <- data.frame(k=1:100, correct=NA)
for(i in 1:100){
  knn.pred = knn(train=data.frame(weekly_training$Lag2), test=data.frame(weekly_testing$Lag2), cl=weekly_training$Direction, k=i)
  cm_knn <-table(weekly_testing$Direction, knn.pred)
  correct <- (cm_knn['Down','Down'] + cm_knn['Up','Up'])/sum(cm_knn)
  end$correct[i] <- correct
}

library(tidyverse)
ggplot(data=end, aes(k,correct)) + geom_line() +labs(title='Plot of K for KNN classifiers vs Accuracy of Model', x='No. K', y='Accuracy')

max_accuracy <- end[order(-end$correct),]

head(max_accuracy,2)

It appears that k = 47 and k = 60 give the same, highest accuracy for the KNN method. However, the GLM method with all the lag terms interacting with each other produced the highest accuracy overall, though that model was fit and evaluated on the same 2009 and 2010 data.
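When several K values tie for the best accuracy, sorting and taking the first rows can hide some of them. A small sketch with made-up accuracy values (hypothetical, not the Weekly results) shows one way to list every tied K from a data frame shaped like end.

```r
# Hypothetical accuracy results in the same layout as the `end` data frame
end <- data.frame(k = 1:6,
                  correct = c(0.50, 0.55, 0.60, 0.58, 0.60, 0.52))

# All K values that reach the maximum accuracy, including ties
best_k <- end$k[end$correct == max(end$correct)]
print(best_k)
```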

##4. Continue from Homework 3 Question 4.7.11(d,e,g) pg 172

I will load the Auto data set from the ISLR library. First, I calculate the median value of mpg and then create a binary variable in the data set named mpg01. If mpg is higher than the median, mpg01 is assigned 1; otherwise it is assigned 0. Then I will continue from where I left off in homework 3.

set.seed(123)
#loading data set
data(Auto)

#calculating and assigning the median value of mpg from the auto data set
med_value <- median(Auto$mpg)

#converting Auto data set into a data.frame
auto_dat <- as.data.frame(Auto)

#assigning the binary response 
auto_dat$mpg01 <- ifelse(auto_dat$mpg > med_value, 1, 0)

auto_dat <- subset(auto_dat, select = -mpg)

Now I will split the data into training and testing sets. The training data will contain 70% of the observations and the testing data the other 30%. I set a seed of 123 for reproducibility. Before splitting, I also remove any variable that has a correlation coefficient above 0.95 with another predictor, excluding mpg01. To do this I set mpg01 aside, remove the highly correlated variables, and then add mpg01 back in. I will output the variable names that remain.

set.seed(123)

#dropping the name column
auto_dat <- subset(auto_dat, select = -name)

#remove the mpg01 variable
mpg01 <- auto_dat$mpg01
auto <- auto_dat[ ,!names(auto_dat) %in% 'mpg01']

#correlation matrix of the data set
cor_mat <- cor(auto)
cor_mat[upper.tri(cor_mat)] <- 0
diag(cor_mat) <- 0

#Remove any variables with an absolute correlation above 0.95
auto_dat <- auto[ ,!apply(cor_mat,2,function(x) any(abs(x) > 0.95))]
auto_dat <- cbind(auto_dat, mpg01)


#testing and train data split
index <- sample(x=nrow(auto_dat), size=0.70*nrow(auto_dat))
training <- auto_dat[index,]
testing <- auto_dat[-index,]

names(auto_dat)

## [1] "displacement" "horsepower"   "weight"       "acceleration"
## [5] "year"         "origin"       "mpg01"
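The filter above keeps the lower triangle of the correlation matrix and drops a column when any remaining entry in it is too large. Taking the absolute value of the correlations, rather than of the threshold, matters when two predictors are strongly negatively related. The sketch below uses hypothetical toy data (x1 to x4 are made up) to show the difference between the two checks.

```r
set.seed(123)

# Hypothetical predictors: x2 is nearly x1, x4 is nearly -x1, x3 is unrelated
n <- 100
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.01),
  x3 = rnorm(n),
  x4 = -x1 + rnorm(n, sd = 0.01)
)

# Keep only the lower triangle so each pair is checked once
cor_mat <- cor(toy)
cor_mat[upper.tri(cor_mat)] <- 0
diag(cor_mat) <- 0

# Signed check misses strong negative correlations; absolute check catches them
drop_signed <- apply(cor_mat, 2, function(x) any(x > 0.95))
drop_abs <- apply(cor_mat, 2, function(x) any(abs(x) > 0.95))

kept_signed <- names(toy)[!drop_signed]
kept_abs <- names(toy)[!drop_abs]

print(kept_signed)
print(kept_abs)
```

In this toy case the signed check keeps x2 even though it is almost exactly the negative of x4, while the absolute check removes it.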

#Problem 11 d) LDA Method

When I fit the GLM model in homework 3, I dropped origin because it did not appear statistically significant according to its p-value. Acceleration also did not appear statistically significant, but I left it in. Therefore, the model here has mpg01 explained by weight, acceleration, and year.

#Fitting Model
lda_auto <- lda(mpg01 ~ weight + acceleration + year , data=training)

#producing confusion matrix
lda_auto_pred <- predict(lda_auto, newdata=testing)

cm_lda_auto <- table(lda_auto_pred$class, testing$mpg01)

print(cm_lda_auto)

##    
##      0  1
##   0 53  4
##   1  7 54

LDA_auto_error_rate <- 1.0 - ((cm_lda_auto[1] + cm_lda_auto[2,2]) /sum(cm_lda_auto))

cat('The error rate for the LDA is: ', LDA_auto_error_rate)

## The error rate for the LDA is:  0.09322034

The output shows that LDA has an error rate of roughly 9.32%, so the accuracy of the model is roughly 90.68%, meaning that it correctly predicted about 91% of the testing data. The GLM in homework 3 had a slightly lower accuracy of roughly 88.98%.

#Problem 11 e) QDA Method

#Fitting Model
qda_auto <- qda(mpg01 ~ weight + acceleration + year , data=training)

#producing confusion matrix
qda_auto_pred <- predict(qda_auto, newdata=testing)

cm_qda_auto <- table(qda_auto_pred$class, testing$mpg01)

print(cm_qda_auto)

##    
##      0  1
##   0 55  6
##   1  5 52

qda_auto_error_rate <- 1.0 - ((cm_qda_auto[1] + cm_qda_auto[2,2]) /sum(cm_qda_auto))

cat('The error rate for the qda is: ', qda_auto_error_rate)

## The error rate for the qda is:  0.09322034

It appears that the QDA method has the same error rate as LDA, 9.32%, so the accuracy of the model on the testing data is 90.68%: 107 of the 118 test observations were predicted correctly.
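The two Auto confusion matrices printed above can be compared directly. The sketch below re-enters both by hand and uses a made-up helper, summarise_cm, to report the overall error rate together with the two kinds of mistakes, since two models can share an error rate while splitting their errors differently.

```r
# Confusion matrices copied from the LDA and QDA output above
# (rows are predictions, columns are the true mpg01)
labels <- list(predicted = c("0", "1"), actual = c("0", "1"))
cm_lda <- matrix(c(53, 7, 4, 54), nrow = 2, dimnames = labels)
cm_qda <- matrix(c(55, 5, 6, 52), nrow = 2, dimnames = labels)

# Overall error rate plus the count of each type of misclassification
summarise_cm <- function(cm) {
  c(error_rate = 1 - sum(diag(cm)) / sum(cm),
    predicted_1_actual_0 = cm["1", "0"],
    predicted_0_actual_1 = cm["0", "1"])
}

comparison <- rbind(LDA = summarise_cm(cm_lda), QDA = summarise_cm(cm_qda))
print(comparison)
```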

#Problem 11 g) KNN Method

Now I will fit a model using the KNN method and test it against the testing data. I will again produce a plot to see which K value obtains the highest accuracy, in order to fit the model with the best K value.

set.seed(123)

end <- data.frame(k=1:100, correct=NA)
for(i in 1:100){
  knn.pred = knn(train=data.frame(training$weight, training$acceleration, training$year), test=data.frame(testing$weight, testing$acceleration, testing$year), cl=training$mpg01, k=i)
  cm_knn <-table(testing$mpg01, knn.pred)
  correct <- (cm_knn['0','0'] + cm_knn['1','1'])/sum(cm_knn)
  end$correct[i] <- correct
}

library(tidyverse)
ggplot(data=end, aes(k,correct)) + geom_line() +labs(title='Plot of K for KNN classifiers vs Accuracy of Model', x='No. K', y='Accuracy')

max_accuracy <- end[order(-end$correct),]

head(max_accuracy,10)

It appears that the highest accuracy using the KNN method is roughly 88.98%, for k values 3, 5, 8, 9, 21, 23, and 26. The error rate is therefore 11.02%, which is the same error rate that I had in homework 3 using the GLM method.

##5. Read the paper “Statistical Classification Methods in Consumer Credit Scoring: A Review” posted on D2L. Write a one page (no more, no less) summary.

The summary below is copied from a Word document, which I will also attach as a PDF. It came out to just over one page.

“Statistical Classification Methods in Consumer Credit Scoring: a Review” focuses on using statistical methods to classify each applicant's credit risk as “good” or “bad”. When this research was conducted, human judgement based on past experience was used to assess each applicant and decide whether his or her credit was “good” or “bad”. Using statistical methods removes the potential human bias when categorizing an applicant. It is possible that some organizations preferred to classify an applicant as having “bad” credit, raising the applicant's risk rating. This, in turn, can cause the applicant to pay higher interest for being rated high risk.

Hand and Henley explore credit scoring databases, which are rather large: 100,000+ applicants and 100+ variables. Some of the variables used in the classification methods are common ones that we think of when we apply for a mortgage or loan: age, time of employment, credit card (yes, no), annual income, time at present address, and many more. Some variables, such as race and gender, could not be used because of laws that prohibit discrimination.

Using this data, Hand and Henley explored various modeling methods and assessed the models' performance. The methods included discriminant analysis, regression, logistic regression, recursive partitioning, neural networks, nonparametric smoothing methods, and more. It appeared that the methods had similar performance. However, machine learning methods like neural networks are generally not preferred since they act like a black box and cannot easily explain the end result. For the others, like regression, nearest neighbor, and decision trees, the process and outcome are easier to explain.

All in all, the paper closes with the view that the greatest advances in credit scoring classification could come from future models that are more sophisticated.

##6. Explore this website that contains open datasets that are used in machine learning. Find one dataset with a classification problem and write a description of the dataset and problem. I don’t expect you to do the analysis for this homework, but feel free to if you want!

Pen-Based Recognition of Handwritten Digits Data Set:

The purpose of the data set is machine learning recognition of handwritten digits written with a pen. Since handwriting can vary drastically from person to person, the authors collected 250 samples from each of 44 different writers. The data set has 10992 instances and 16 attributes. Each writer had to write 250 digits in random order inside boxes of 500 by 500 pixel resolution. The digits were written on a tablet with a stylus, which reports pressure levels at fixed time intervals. During this process, the first 10 samples are ignored because of the learning curve of writing digits on the tablet. The data was divided so that samples from 30 writers were used for training and samples from the other 14 for testing. When training and testing the classifiers, the authors represented the digits as fixed-length feature vectors. What is very interesting is that, after splitting the data set, a KNN model using Euclidean distance as the metric reached an accuracy of a little over 97% on the test data, depending on the k value. With k = 3 the accuracy was 97.80%. There were no missing values in the data.

Modern Applied Statistics II