The following project is work for the Machine Learning course (DAT 315); it solves a series of problems. All exercises were written by Angel Sosa.
# Rmarkdown Cache
knitr::opts_chunk$set(cache = TRUE)
library(mlbench)
## Warning: package 'mlbench' was built under R version 3.3.3
library(e1071)
## Warning: package 'e1071' was built under R version 3.3.3
library(klaR)
## Warning: package 'klaR' was built under R version 3.3.3
## Loading required package: MASS
library(ISLR)
library(data.table)
library(kernlab)
library(fmsb)
library(car)
library(stats)
library(leaps)
library(earth)
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
##
## Attaching package: 'TeachingDemos'
## The following object is masked from 'package:klaR':
##
## triplot
library(class)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:kernlab':
##
## alpha
library(MASS)
library(DMwR)
## Loading required package: grid
library(quadprog)
library(AppliedPredictiveModeling)
I’m using the Hitters data, from which I will build training and testing sets and fit models to evaluate their predictive capacity.
# Load the Hitters data from the ISLR package
data(Hitters)
# Drop the factor columns League (14), Division (15) and NewLeague (20)
hitters <- Hitters[, -c(14,15,20)]
# Keep only players with a recorded salary
hitters <- hitters[!is.na(hitters$Salary),]
# Model log(salary) and drop the raw Salary column (column 17 after the drops above)
hitters$logSalary <- log(hitters$Salary)
hitters <- hitters[,-c(17)]
set.seed(12345)
hitterstrainindex <- createDataPartition(hitters$logSalary,p = 0.7, list = FALSE, times = 1)
hitterstrain <- hitters[hitterstrainindex,]
hitterstest <- hitters[-hitterstrainindex,]
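A quick sanity check of the split sizes (a sketch; the 263 players with a recorded salary should divide roughly 70/30):
nrow(hitterstrain)  # about 185 rows (70%)
nrow(hitterstest)   # about 78 rows (30%)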
I’m building a linear model using the training data and computing R^2 on the testing data.
LM1 <- lm(logSalary ~ ., data = hitterstrain)
summary(LM1)
##
## Call:
## lm(formula = logSalary ~ ., data = hitterstrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27147 -0.44874 0.01607 0.40936 2.79909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.677e+00 1.959e-01 23.872 <2e-16 ***
## AtBat -3.931e-03 1.523e-03 -2.582 0.0107 *
## Hits 1.445e-02 5.813e-03 2.485 0.0139 *
## HmRun 9.406e-03 1.480e-02 0.636 0.5258
## Runs -3.868e-03 6.987e-03 -0.554 0.5805
## RBI 3.493e-03 6.463e-03 0.540 0.5896
## Walks 1.066e-02 4.492e-03 2.373 0.0188 *
## Years 4.177e-02 3.128e-02 1.335 0.1836
## CAtBat -2.807e-05 3.299e-04 -0.085 0.9323
## CHits 1.097e-03 1.745e-03 0.629 0.5302
## CHmRun 5.071e-04 3.912e-03 0.130 0.8970
## CRuns 8.409e-04 1.764e-03 0.477 0.6343
## CRBI -1.265e-03 1.822e-03 -0.694 0.4886
## CWalks -9.993e-04 8.293e-04 -1.205 0.2299
## PutOuts 4.154e-04 1.765e-04 2.355 0.0197 *
## Assists 1.052e-03 5.131e-04 2.051 0.0418 *
## Errors -1.647e-02 1.017e-02 -1.619 0.1074
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6188 on 168 degrees of freedom
## Multiple R-squared: 0.5713, Adjusted R-squared: 0.5305
## F-statistic: 13.99 on 16 and 168 DF, p-value: < 2.2e-16
lm1predicted <- predict(LM1, newdata=hitterstest)
LM1_R2_test <- cor(lm1predicted, hitterstest$logSalary)^2
LM1_R2_test
## [1] 0.4518025
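Test-set R^2 is one way to summarize predictive capacity; a complementary sketch, using only objects defined above, computes the test RMSE on the log-salary scale:
# Root mean squared error of the test-set predictions
LM1_RMSE_test <- sqrt(mean((lm1predicted - hitterstest$logSalary)^2))
LM1_RMSE_test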
I’m building a ridge regression model using the training data and computing R^2 on the testing data. Also, 10-fold cross-validation and centered and scaled predictors are required.
RR1 <- train(logSalary~., data=hitterstrain, method="ridge",trControl=trainControl(method = "cv", number = 10), preProcess=c("center","scale"))
## Loading required package: elasticnet
## Loading required package: lars
## Loaded lars 1.2
rr1predicted <- predict(RR1, newdata=hitterstest)
RR1_R2_test <- cor(rr1predicted, hitterstest$logSalary)^2
RR1_R2_test
## [1] 0.4812084
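caret tunes the ridge penalty by cross-validation; a short sketch using the standard accessors of the train object shows which penalty was selected and the resampling profile:
# Penalty chosen by 10-fold CV, and performance at each candidate value
RR1$bestTune
RR1$results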
I’m building a lasso model using the training data and computing R^2 on the testing data. Again, 10-fold cross-validation and centered and scaled predictors are required.
LASSO1 <- train(logSalary~., data=hitterstrain, method="lasso",trControl=trainControl(method = "cv", number = 10), preProcess=c("center","scale"))
lasso1predicted <- predict(LASSO1, newdata=hitterstest)
LASSO1_R2_test <- cor(lasso1predicted, hitterstest$logSalary)^2
LASSO1_R2_test
## [1] 0.4686052
The variables that the lasso removes from the model are found with the predict.enet function; a coefficient of exactly zero marks a removed variable.
predict.enet(LASSO1$finalModel, type="coefficients", s=LASSO1$bestTune$fraction, mode='fraction')$coefficients
## AtBat Hits HmRun Runs RBI Walks
## -0.26718587 0.40050874 0.04954137 0.00000000 0.03072973 0.12541714
## Years CAtBat CHits CHmRun CRuns CRBI
## 0.13511795 0.00000000 0.45915555 -0.09788284 0.04494156 0.00000000
## CWalks PutOuts Assists Errors
## -0.07672909 0.11115853 0.09988299 -0.09130279
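Rather than scanning the printout by eye, the removed variables can be extracted programmatically (a sketch reusing the same predict.enet call as above):
# Names of the predictors whose lasso coefficients are exactly zero
lassocoefs <- predict.enet(LASSO1$finalModel, type = "coefficients",
                           s = LASSO1$bestTune$fraction, mode = "fraction")$coefficients
names(lassocoefs)[lassocoefs == 0]  # Runs, CAtBat and CRBI in the run above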
For this problem I’m using data on predicting the permeability of compounds. Many of the predictors have near-zero variance, so I have to manage the data carefully to identify those predictors and to separate the response from the predictors.
data("permeability")
By combining the predictors and the response into a single data frame and removing all predictors that have near-zero variance, I will be able to build my models.
perm <- as.data.frame(fingerprints)
perm$response <- permeability
perm <- perm[,-nearZeroVar(perm)]
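It is worth checking how aggressive the near-zero-variance filter is; a quick sketch compares the predictor counts before and after:
# Predictors before and after the near-zero-variance filter
ncol(fingerprints)  # original number of binary fingerprints
ncol(perm) - 1      # predictors that survive (minus the response column)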
I’m creating the training and testing data.
set.seed(12345)
permtrainindex <- createDataPartition(perm$response,p = 0.7, list = FALSE, times = 1)
permtrain <- perm[permtrainindex,]
permtest <- perm[-permtrainindex,]
A linear model is built on the training data.
LM2 <- train(response~., data=permtrain, method="lm", preProcess=c("center","scale"))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut
## = 10, : These variables have zero variances: X316, X317, X318
#summary(LM2)
lm2predicted <- predict(LM2, newdata=permtest)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
LM2_R2_test <- cor(lm2predicted, permtest$response)^2
LM2_R2_test
## permeability
## [1,] 0.2323023
The same linear model is built, but this time with “pca” included in the preProcess step.
LM2pca <- train(response~., data=permtrain, method="lm", preProcess=c("center","scale", "pca"))
#summary(LM2pca)
lm2pcapredicted <- predict(LM2pca, newdata=permtest)
LM2pca_R2_test <- cor(lm2pcapredicted, permtest$response)^2
LM2pca_R2_test
## permeability
## [1,] 0.3688938
Yes, adding PCA increases R^2 on the test set, from about 0.23 to 0.37, so PCA makes the linear model noticeably better.
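To see why PCA helps, I can check how many components caret retained; preProcess keeps enough components to capture 95% of the variance by default, and, assuming caret’s usual internals, the retained components are the columns of the rotation matrix stored in the preProcess object:
# Number of principal components kept at the default 95% variance threshold
ncol(LM2pca$preProcess$rotation)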
I’m using a microarray data set which contains data for 14 different types of cancer.
## (a)
I’m getting the training and testing data from this microarray, with the responses and predictors included. The data are very high-dimensional.
cancertrain <- transpose(fread("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.xtrain"))
cancertrainresponse <- fread("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.ytrain")
cancertrainresponse <- transpose(cancertrainresponse)
cancertrainresponse <- as.factor(cancertrainresponse$V1)
cancertrain$response <- cancertrainresponse
cancertest <- transpose(fread("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.xtest"))
cancertestresponse <- fread("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.ytest")
cancertestresponse <- transpose(cancertestresponse)
cancertestresponse <- as.factor(cancertestresponse$V1)
cancertest$response <- cancertestresponse
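Before modeling, a quick sketch confirms just how high-dimensional this data set is (the 14-cancer microarray from The Elements of Statistical Learning has roughly 16,000 genes but only 144 training samples):
# Far more predictors (genes) than samples, and 14 response classes
dim(cancertrain)             # samples x (genes + response)
table(cancertrain$response)  # number of samples per cancer type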
set.seed(12345)
I’m building a linear discriminant analysis model in order to predict the type of cancer. I’m going to evaluate its accuracy with a confusion matrix.
#LDA3 <- lda(response ~ ., data = cancertrain)
#predictedlda3 <- predict(LDA3, newdata = cancertest)
#CMLDA3 <- confusionMatrix(predictedlda3$class, cancertest$response)
#CMLDA3
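The LDA fit above is commented out because plain lda() breaks down when the predictors vastly outnumber the samples: the within-class covariance matrix is singular. One workaround is to compress the predictors with PCA before the discriminant step; a sketch (kept commented as well, since it is slow on roughly 16,000 genes):
#LDA3pca <- train(response ~ ., data = cancertrain, method = "lda",
#                 preProcess = c("center", "scale", "pca"),
#                 trControl = trainControl(method = "cv", number = 10))
#predictedlda3pca <- predict(LDA3pca, newdata = cancertest)
#confusionMatrix(predictedlda3pca, cancertest$response)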
I’m building a k-nearest neighbors model in order to predict the type of cancer. I’m going to evaluate its accuracy with a confusion matrix.
#KNN3 <- train(response ~ ., data = cancertrain, method = "knn",
#              trControl = trainControl(method = "cv", number = 10))
#predictedknn3 <- predict(KNN3, newdata=cancertest)
#CMKNN3 <- confusionMatrix(predictedknn3, cancertest$response)
#CMKNN3
In order to determine the effectiveness of each classifier, I’m going to show the confusion matrices again.
#CMLDA3
#CMKNN3
Making use of digitized handwritten numerals, I’m going to build models that will allow me to identify handwritten digits.
library(parallel)
library(doParallel)
## Warning: package 'doParallel' was built under R version 3.3.3
## Loading required package: foreach
## Loading required package: iterators
library(plyr)
##
## Attaching package: 'plyr'
## The following object is masked from 'package:DMwR':
##
## join
mnist_test = read.csv("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/mnist_test.csv")
mnist_train = read.csv("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/mnist_train.csv")
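parallel and doParallel are loaded above but no backend has been registered yet; a minimal sketch of wiring one up so that caret’s cross-validation folds run concurrently (the core count is machine-dependent):
# Register a parallel backend; train() will then use it automatically
cl <- makeCluster(max(1, detectCores() - 1))  # leave one core free
registerDoParallel(cl)
# ... fit the models ...
#stopCluster(cl)  # release the workers when finished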
I’m removing the predictors with near-zero variance from the training and testing data. The relabeling of the digit classes is left commented out. (read.csv apparently treated the first image in each file as a header row, which is presumably why the label columns are named X5 and X7.)
#mnist_train$X5 <- revalue(as.factor(mnist_train$X5), c("0"="c0", "1"="c1", "2"="c2", "3"="c3", "4"="c4", "5"="c5", "6"="c6", "7"="c7", "8"="c8", "9"="c9"))
#mnist_test$X7 <- revalue(as.factor(mnist_test$X7), c("0"="c0", "1"="c1", "2"="c2", "3"="c3", "4"="c4", "5"="c5", "6"="c6", "7"="c7", "8"="c8", "9"="c9"))
# Drop the same near-zero-variance columns from both sets so they stay aligned
nzvcols <- nearZeroVar(mnist_train)
mnist_train <- mnist_train[,-nzvcols]
mnist_test <- mnist_test[,-nzvcols]
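A quick check (a sketch) that the two sets still line up column-for-column, which predict() will need:
# After dropping the same column indices, both sets keep the same width
ncol(mnist_train) == ncol(mnist_test)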
I’m building a k-nearest neighbors model in order to identify handwritten digits. I’m going to evaluate its accuracy with a confusion matrix, using the testing data.
#KNN4 <- train(X5~., data=mnist_train, method="knn", trControl=trainControl(method = "cv", number = 10))
#predictedknn4 <- predict(KNN4, newdata=mnist_test)
#CMKNN4 <- confusionMatrix(predictedknn4, mnist_test$X7)
I’m building a linear discriminant analysis model in order to identify handwritten digits. I’m going to evaluate its accuracy with a confusion matrix, using the testing data.
#LDA4 <- train(X5~., data=mnist_train, method="lda", trControl=trainControl(method = "cv", number = 10))
#predictedlda4 <- predict(LDA4, newdata=mnist_test)
#CMLDA4 <- confusionMatrix(predictedlda4, mnist_test$X7)
#CMKNN4
#CMLDA4
For this final problem, I’m using the spam data in the kernlab package.
data("spam")
Before building my models, I have to create a partition in order to get my training and testing data.
set.seed(12345)
spamtrainindex <- createDataPartition(spam$type,p = 0.7, list = FALSE, times = 1)
spamtrain <- spam[spamtrainindex,]
spamtest <- spam[-spamtrainindex,]
I’m building a naive Bayes classifier to predict the type of message (spam or nonspam).
NB5 <- train(type~., data=spamtrain, method="nb", trControl=trainControl(method = "cv", number = 10))
predicted <- predict(NB5, newdata=spamtest)
CM <- confusionMatrix(predicted, spamtest$type)
CM
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 476 31
## spam 360 512
##
## Accuracy : 0.7165
## 95% CI : (0.6919, 0.7401)
## No Information Rate : 0.6062
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4631
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5694
## Specificity : 0.9429
## Pos Pred Value : 0.9389
## Neg Pred Value : 0.5872
## Prevalence : 0.6062
## Detection Rate : 0.3452
## Detection Prevalence : 0.3677
## Balanced Accuracy : 0.7561
##
## 'Positive' Class : nonspam
##
Here we can see that this “filter” catches most of the spam (specificity 0.94; only 31 spam messages reach the inbox) but misroutes many legitimate messages (sensitivity 0.57, with nonspam as the positive class), for an overall accuracy of about 72%. It is a reasonable choice only if you would rather lose legitimate e-mail than let spam into your inbox.
## (c)
I’m building the same naive Bayes classifier on the type of message, but this time with “pca” included in the preProcess step.
NB5pca <- train(type~., data=spamtrain, method="nb", trControl=trainControl(method = "cv", number = 10),preProcess=c("center","scale", "pca"))
predictedpca <- predict(NB5pca, newdata=spamtest)
CMpca <- confusionMatrix(predictedpca,spamtest$type)
CMpca
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 734 62
## spam 102 481
##
## Accuracy : 0.8811
## 95% CI : (0.8628, 0.8977)
## No Information Rate : 0.6062
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7541
## Mcnemar's Test P-Value : 0.002324
##
## Sensitivity : 0.8780
## Specificity : 0.8858
## Pos Pred Value : 0.9221
## Neg Pred Value : 0.8250
## Prevalence : 0.6062
## Detection Rate : 0.5323
## Detection Prevalence : 0.5772
## Balanced Accuracy : 0.8819
##
## 'Positive' Class : nonspam
##
With PCA preprocessing, we can see that this “filter” is better, reaching about 88% accuracy.
## (d)
NB5pca’s accuracy is about 16 percentage points higher than NB5’s (0.881 vs. 0.717), so, as the confusion matrices below show, NB5pca is the better of the two models.
CM
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 476 31
## spam 360 512
##
## Accuracy : 0.7165
## 95% CI : (0.6919, 0.7401)
## No Information Rate : 0.6062
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4631
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5694
## Specificity : 0.9429
## Pos Pred Value : 0.9389
## Neg Pred Value : 0.5872
## Prevalence : 0.6062
## Detection Rate : 0.3452
## Detection Prevalence : 0.3677
## Balanced Accuracy : 0.7561
##
## 'Positive' Class : nonspam
##
CMpca
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 734 62
## spam 102 481
##
## Accuracy : 0.8811
## 95% CI : (0.8628, 0.8977)
## No Information Rate : 0.6062
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7541
## Mcnemar's Test P-Value : 0.002324
##
## Sensitivity : 0.8780
## Specificity : 0.8858
## Pos Pred Value : 0.9221
## Neg Pred Value : 0.8250
## Prevalence : 0.6062
## Detection Rate : 0.5323
## Detection Prevalence : 0.5772
## Balanced Accuracy : 0.8819
##
## 'Positive' Class : nonspam
##
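Rather than eyeballing the two printouts, the headline numbers can be pulled out directly (a sketch using the confusionMatrix objects computed above):
# Side-by-side test accuracy of the two naive Bayes classifiers
data.frame(model    = c("NB5", "NB5pca"),
           accuracy = c(CM$overall["Accuracy"], CMpca$overall["Accuracy"]))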