Description of the Project

The following project is work for the Machine Learning course (DAT 315). It solves a series of problems. All exercises were completed by Angel Sosa.

Rmarkdown Cache

knitr::opts_chunk$set(cache = TRUE)
In order to avoid some RStudio errors, I'm loading all the required libraries first.

Libraries

library(mlbench)
## Warning: package 'mlbench' was built under R version 3.3.3
library(e1071)
## Warning: package 'e1071' was built under R version 3.3.3
library(klaR)
## Warning: package 'klaR' was built under R version 3.3.3
## Loading required package: MASS
library(ISLR)
library(data.table)
library(kernlab)
library(fmsb)
library(car)
library(stats)
library(leaps)
library(earth)
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
## 
## Attaching package: 'TeachingDemos'
## The following object is masked from 'package:klaR':
## 
##     triplot
library(class)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:kernlab':
## 
##     alpha
library(MASS)
library(DMwR)
## Loading required package: grid
library(quadprog)
library(AppliedPredictiveModeling)

Problem 1

I’m using the Hitters data, with which I will build models and evaluate their predictive capacity using training and testing data.

data(Hitters)
# Drop the factor predictors League, Division, and NewLeague (columns 14, 15, 20)
hitters <- Hitters[, -c(14,15,20)]
# Keep only players with a recorded Salary
hitters <- hitters[!is.na(hitters$Salary),]
# Model log(Salary) and drop the raw Salary column (column 17 after the removals)
hitters$logSalary <- log(hitters$Salary)
hitters <- hitters[,-c(17)]
set.seed(12345)
hitterstrainindex <- createDataPartition(hitters$logSalary,p = 0.7, list = FALSE, times = 1)
hitterstrain <- hitters[hitterstrainindex,]
hitterstest <- hitters[-hitterstrainindex,]

(a)

I’m building a linear model using the training data and computing R^2 on the testing data.

LM1 <- lm(logSalary ~ ., data = hitterstrain)
summary(LM1)
## 
## Call:
## lm(formula = logSalary ~ ., data = hitterstrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27147 -0.44874  0.01607  0.40936  2.79909 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.677e+00  1.959e-01  23.872   <2e-16 ***
## AtBat       -3.931e-03  1.523e-03  -2.582   0.0107 *  
## Hits         1.445e-02  5.813e-03   2.485   0.0139 *  
## HmRun        9.406e-03  1.480e-02   0.636   0.5258    
## Runs        -3.868e-03  6.987e-03  -0.554   0.5805    
## RBI          3.493e-03  6.463e-03   0.540   0.5896    
## Walks        1.066e-02  4.492e-03   2.373   0.0188 *  
## Years        4.177e-02  3.128e-02   1.335   0.1836    
## CAtBat      -2.807e-05  3.299e-04  -0.085   0.9323    
## CHits        1.097e-03  1.745e-03   0.629   0.5302    
## CHmRun       5.071e-04  3.912e-03   0.130   0.8970    
## CRuns        8.409e-04  1.764e-03   0.477   0.6343    
## CRBI        -1.265e-03  1.822e-03  -0.694   0.4886    
## CWalks      -9.993e-04  8.293e-04  -1.205   0.2299    
## PutOuts      4.154e-04  1.765e-04   2.355   0.0197 *  
## Assists      1.052e-03  5.131e-04   2.051   0.0418 *  
## Errors      -1.647e-02  1.017e-02  -1.619   0.1074    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6188 on 168 degrees of freedom
## Multiple R-squared:  0.5713, Adjusted R-squared:  0.5305 
## F-statistic: 13.99 on 16 and 168 DF,  p-value: < 2.2e-16
lm1predicted <- predict(LM1, newdata=hitterstest)
LM1_R2_test <- cor(lm1predicted, hitterstest$logSalary)^2
LM1_R2_test
## [1] 0.4518025
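
As a quick sanity check (a sketch, not part of the required output), the test-set RMSE can be computed alongside R^2, since R^2 alone can hide systematic over- or under-prediction:

sqrt(mean((lm1predicted - hitterstest$logSalary)^2))  # test RMSE on the log scale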

(b)

I’m building a ridge regression model using the training data and computing R^2 on the testing data. As required, I use 10-fold cross-validation and center and scale the predictors.

RR1 <- train(logSalary~., data=hitterstrain, method="ridge",trControl=trainControl(method = "cv", number = 10), preProcess=c("center","scale"))
## Loading required package: elasticnet
## Loading required package: lars
## Loaded lars 1.2
rr1predicted <- predict(RR1, newdata=hitterstest)
RR1_R2_test <- cor(rr1predicted, hitterstest$logSalary)^2
RR1_R2_test
## [1] 0.4812084
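
To see which ridge penalty the 10-fold cross-validation selected, the tuning result can be inspected on the fitted caret object (a quick sketch):

RR1$bestTune  # penalty value chosen by cross-validation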

(c)

I’m building a lasso model using the training data and computing R^2 on the testing data. As required, I use 10-fold cross-validation and center and scale the predictors.

LASSO1 <- train(logSalary~., data=hitterstrain, method="lasso",trControl=trainControl(method = "cv", number = 10), preProcess=c("center","scale"))
lasso1predicted <- predict(LASSO1, newdata=hitterstest)
LASSO1_R2_test <- cor(lasso1predicted, hitterstest$logSalary)^2
LASSO1_R2_test
## [1] 0.4686052

(d)

The variables that the LASSO removes from the model are the ones whose coefficients are shrunk exactly to zero; they are computed with the predict.enet function.

predict.enet(LASSO1$finalModel, type="coefficients", s=LASSO1$bestTune$fraction, mode='fraction')$coefficients
##       AtBat        Hits       HmRun        Runs         RBI       Walks 
## -0.26718587  0.40050874  0.04954137  0.00000000  0.03072973  0.12541714 
##       Years      CAtBat       CHits      CHmRun       CRuns        CRBI 
##  0.13511795  0.00000000  0.45915555 -0.09788284  0.04494156  0.00000000 
##      CWalks     PutOuts     Assists      Errors 
## -0.07672909  0.11115853  0.09988299 -0.09130279
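
From the output above, the coefficients shrunk exactly to zero belong to Runs, CAtBat, and CRBI, so those are the variables the LASSO removed. A small sketch to list them programmatically:

lassocoefs <- predict.enet(LASSO1$finalModel, type="coefficients", s=LASSO1$bestTune$fraction, mode='fraction')$coefficients
names(lassocoefs)[lassocoefs == 0]  # variables dropped by the LASSO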

Problem 2

For this problem I’m using data for predicting the permeability of compounds; many of the predictor columns have near-zero variance. I have to manage the data carefully to identify those columns and to set up the response and predictors.

data("permeability")

(a)

By combining the predictors and the response into a single data frame and removing all predictors that have near-zero variance, I will be able to build my models.

# Combine fingerprints (predictors) and permeability (response), then drop
# near-zero-variance columns
perm <- as.data.frame(fingerprints)
perm$response <- permeability
perm <- perm[,-nearZeroVar(perm)]
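
As a sketch, comparing dimensions before and after the filter shows how many near-zero-variance fingerprint columns were dropped:

dim(fingerprints)  # original predictor matrix
dim(perm)          # after the filter (plus the response column)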

(b)

I’m creating the training and testing data.

set.seed(12345)
permtrainindex <- createDataPartition(perm$response,p = 0.7, list = FALSE, times = 1)
permtrain <- perm[permtrainindex,]
permtest <- perm[-permtrainindex,]

(c)

A linear model is built. The rank-deficient-fit warnings below occur because the training data have more predictors than observations.

LM2 <- train(response~., data=permtrain, method="lm", preProcess=c("center","scale"))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
## (this warning was repeated for every resample and is shown once here)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut
## = 10, : These variables have zero variances: X316, X317, X318
#summary(LM2)
lm2predicted <- predict(LM2, newdata=permtest)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
LM2_R2_test <- cor(lm2predicted, permtest$response)^2
LM2_R2_test
##      permeability
## [1,]    0.2323023
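
The rank-deficiency warnings are expected here: the training data have more predictors than rows, so ordinary least squares cannot estimate all coefficients. A quick check (sketch):

ncol(permtrain) - 1  # number of predictors
nrow(permtrain)      # number of training observations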

(d)

The same linear model is built, but this time with “pca” included in the preProcess step.

LM2pca <- train(response~., data=permtrain, method="lm", preProcess=c("center","scale", "pca"))
#summary(LM2pca)
lm2pcapredicted <- predict(LM2pca, newdata=permtest)
LM2pca_R2_test <- cor(lm2pcapredicted, permtest$response)^2
LM2pca_R2_test
##      permeability
## [1,]    0.3688938
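
To see how much PCA compressed the predictors, the number of retained components can be inspected (a sketch; by default caret keeps enough components to explain 95% of the variance, and the numComp field is assumed to be stored on the preProcess object):

LM2pca$preProcess$numComp  # principal components kept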

Did PCA increase R^2?

Yes, it increased R^2 on the test set from about 0.23 to about 0.37, so adding PCA to the preprocessing made the LM2 model noticeably better.

Problem 3

I’m using a microarray data set that contains expression data for 14 different types of cancer.

(a)

I’m getting the training and testing data from this microarray, including the responses and predictors. The data are very high-dimensional.

cancertrain <- transpose(fread("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.xtrain"))
cancertrainresponse <- fread("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.ytrain")
cancertrainresponse <- transpose(cancertrainresponse)
cancertrainresponse <- as.factor(cancertrainresponse$V1)
cancertrain$response <- cancertrainresponse

cancertest <- transpose(fread("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.xtest"))
cancertestresponse <- fread("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.ytest")
cancertestresponse <- transpose(cancertestresponse)
cancertestresponse <- as.factor(cancertestresponse$V1)
cancertest$response <- cancertestresponse

set.seed(12345)
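
A quick dimensionality check (sketch) confirms how wide these data are relative to the number of samples, and shows the class counts:

dim(cancertrain)             # samples x (genes + response)
table(cancertrain$response)  # samples per cancer type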

(b)

I’m building a linear discriminant analysis model to predict the type of cancer. I’m going to evaluate its accuracy with a confusion matrix.

#LDA3 <- lda(response~., data=cancertrain)
#predictedlda3 <- predict(LDA3, newdata=cancertest)
## predict.lda returns a list; the predicted labels are in $class
#CMLDA3 <- confusionMatrix(predictedlda3$class, cancertest$response)
#CMLDA3

(c)

I’m building a k-nearest neighbors model to predict the type of cancer. I’m going to evaluate its accuracy with a confusion matrix.

#KNN3 <- train(response~., data=cancertrain, method="knn", trControl=trainControl(method = "cv", number = 10))
#predictedknn3 <- predict(KNN3, newdata=cancertest)
#CMKNN3 <- confusionMatrix(predictedknn3, cancertest$response)
#CMKNN3

(d)

In order to determine the effectiveness of each classifier, I’m going to show the confusion matrices again.

#CMLDA3
#CMKNN3

Is either classifier very effective?

Problem 4

Making use of digitized handwritten numerals, I’m going to build models that will allow me to identify handwritten digits.

library(parallel)
library(doParallel)
## Warning: package 'doParallel' was built under R version 3.3.3
## Loading required package: foreach
## Loading required package: iterators
library(plyr)
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:DMwR':
## 
##     join
mnist_test = read.csv("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/mnist_test.csv")
mnist_train = read.csv("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/mnist_train.csv")

(a)

I’m removing the predictors with near-zero variance from the training and testing data, and relabeling the digit classes.

## revalue requires a factor or character vector, and its result must be assigned back
#mnist_train$X5 <- revalue(as.factor(mnist_train$X5), c("0"="c0", "1"="c1","2"="c2","3"="c3","4"="c4","5"="c5","6"="c6","7"="c7","8"="c8","9"="c9"))
#mnist_test$X7 <- revalue(as.factor(mnist_test$X7), c("0"="c0", "1"="c1","2"="c2","3"="c3","4"="c4","5"="c5","6"="c6","7"="c7","8"="c8","9"="c9"))
mnist_test <- mnist_test[,-nearZeroVar(mnist_test)]
mnist_train <- mnist_train[,-nearZeroVar(mnist_train)]
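
Since the revalue lines above are commented out, the label columns are still integers, and caret’s train would treat the problem as regression. A minimal sketch to make them factors (assuming the X5/X7 column names that read.csv produced from the first data row):

mnist_train$X5 <- as.factor(mnist_train$X5)  # labels as factors for classification
mnist_test$X7 <- as.factor(mnist_test$X7)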

(b)

I’m building a k-nearest neighbors model to identify handwritten digits. I’m going to evaluate its accuracy with a confusion matrix, using the testing data.

#KNN4 <- train(X5~., data=mnist_train, method="knn", trControl=trainControl(method = "cv", number = 10))
#predictedknn4 <- predict(KNN4, newdata=mnist_test)
#CMKNN4 <- confusionMatrix(predictedknn4, mnist_test$X7)

(c)

I’m building a linear discriminant analysis model to identify handwritten digits. I’m going to evaluate its accuracy with a confusion matrix, using the testing data.

#LDA4 <- train(X5~., data=mnist_train, method="lda", trControl=trainControl(method = "cv", number = 10))
#predictedlda4 <- predict(LDA4, newdata=mnist_test)
#CMLDA4 <- confusionMatrix(predictedlda4, mnist_test$X7)

(d)

#CMKNN4
#CMLDA4

Problem 5

For this final problem, I’m using the spam data in the kernlab package.

data("spam")

(a)

Before building my models, I have to create the partition in order to get my training and testing data.

set.seed(12345)
spamtrainindex <- createDataPartition(spam$type,p = 0.7, list = FALSE, times = 1)
spamtrain <- spam[spamtrainindex,]
spamtest <- spam[-spamtrainindex,]
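
A quick sketch to verify that createDataPartition preserved the spam/nonspam ratio in both splits:

prop.table(table(spamtrain$type))  # class proportions in training data
prop.table(table(spamtest$type))   # class proportions in testing data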

(b)

I’m building a naive Bayes classifier to predict the type of message.

NB5 <- train(type~., data=spamtrain, method="nb", trControl=trainControl(method = "cv", number = 10))
predicted <- predict(NB5, newdata=spamtest)
CM <- confusionMatrix(predicted, spamtest$type)
CM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     476   31
##    spam        360  512
##                                           
##                Accuracy : 0.7165          
##                  95% CI : (0.6919, 0.7401)
##     No Information Rate : 0.6062          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4631          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5694          
##             Specificity : 0.9429          
##          Pos Pred Value : 0.9389          
##          Neg Pred Value : 0.5872          
##              Prevalence : 0.6062          
##          Detection Rate : 0.3452          
##    Detection Prevalence : 0.3677          
##       Balanced Accuracy : 0.7561          
##                                           
##        'Positive' Class : nonspam         
## 

Here we can see that this “filter” is a good choice if you would rather be sure that a message classified as nonspam really is nonspam (positive predictive value of about 0.94 for the nonspam class), although the overall accuracy is only 71.65%.

(c)

I’m building the same naive Bayes classifier on the type of message, but this time with “pca” included in the preProcess step.

NB5pca <- train(type~., data=spamtrain, method="nb", trControl=trainControl(method = "cv", number = 10),preProcess=c("center","scale", "pca"))
predictedpca <- predict(NB5pca, newdata=spamtest)
CMpca <- confusionMatrix(predictedpca,spamtest$type)
CMpca
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     734   62
##    spam        102  481
##                                           
##                Accuracy : 0.8811          
##                  95% CI : (0.8628, 0.8977)
##     No Information Rate : 0.6062          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7541          
##  Mcnemar's Test P-Value : 0.002324        
##                                           
##             Sensitivity : 0.8780          
##             Specificity : 0.8858          
##          Pos Pred Value : 0.9221          
##          Neg Pred Value : 0.8250          
##              Prevalence : 0.6062          
##          Detection Rate : 0.5323          
##    Detection Prevalence : 0.5772          
##       Balanced Accuracy : 0.8819          
##                                           
##        'Positive' Class : nonspam         
## 

Here we can observe that this “filter” performs better, with an accuracy of about 88%.

(d)

The NB5pca accuracy is about 16 percentage points higher than that of NB5 (0.8811 vs. 0.7165). Therefore, as the confusion matrices below show, the better option of the two models is NB5pca.

CM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     476   31
##    spam        360  512
##                                           
##                Accuracy : 0.7165          
##                  95% CI : (0.6919, 0.7401)
##     No Information Rate : 0.6062          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4631          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5694          
##             Specificity : 0.9429          
##          Pos Pred Value : 0.9389          
##          Neg Pred Value : 0.5872          
##              Prevalence : 0.6062          
##          Detection Rate : 0.3452          
##    Detection Prevalence : 0.3677          
##       Balanced Accuracy : 0.7561          
##                                           
##        'Positive' Class : nonspam         
## 
CMpca
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     734   62
##    spam        102  481
##                                           
##                Accuracy : 0.8811          
##                  95% CI : (0.8628, 0.8977)
##     No Information Rate : 0.6062          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7541          
##  Mcnemar's Test P-Value : 0.002324        
##                                           
##             Sensitivity : 0.8780          
##             Specificity : 0.8858          
##          Pos Pred Value : 0.9221          
##          Neg Pred Value : 0.8250          
##              Prevalence : 0.6062          
##          Detection Rate : 0.5323          
##    Detection Prevalence : 0.5772          
##       Balanced Accuracy : 0.8819          
##                                           
##        'Positive' Class : nonspam         
##
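
As a final sketch, the two accuracies can be pulled directly from the confusionMatrix objects for a side-by-side comparison:

c(NB5 = CM$overall["Accuracy"], NB5pca = CMpca$overall["Accuracy"])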