Project 3

Description of the Project

This project was completed for the Machine Learning course (DAT 315). It works through a series of problems. All exercises were completed by Angel Sosa.

To avoid some RStudio errors, I load all the required libraries first.

library(ISLR)
library(fmsb)
library(car)
library(stats)
library(leaps)
library(earth)

## Loading required package: plotmo

## Loading required package: plotrix

## Loading required package: TeachingDemos

library(class)
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(MASS)
library(DMwR)

## Loading required package: grid

library(quadprog)
library(AppliedPredictiveModeling)

Problem 1

We are using the “Breast Cancer Wisconsin Diagnostic” dataset in this problem.

(a)

The patient identification number is not useful as a predictor, so it is removed.

bcwd <- read.csv("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/bcwd.txt")
bcwd <- bcwd[, -c(1)]

(b)

Setting the seed to 12345, we create the data partition based on “Diagnosis” from the dataset.

set.seed(12345)
trainindexbcwd <- createDataPartition(bcwd$Diagnosis,p = 0.7, list = FALSE, times = 1)
bcwdtrain <- bcwd[trainindexbcwd,]
bcwdtest <- bcwd[-trainindexbcwd,]

(c)

We build a kNN model with “train”, using the optional arguments “trControl”, “preProcess” and “tuneGrid” to set the resampling scheme, the preprocessing and the candidate values of k.

kNN1 <- train(Diagnosis~., data=bcwdtrain, method="knn",trControl=trainControl(method = "cv", number = 10), preProcess=c("center","scale"), tuneGrid=data.frame(.k=1:20))
kNN1

## k-Nearest Neighbors 
## 
## 399 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## Pre-processing: centered (30), scaled (30) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 359, 360, 359, 359, 359, 359, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9474359  0.8873717
##    2  0.9599359  0.9129673
##    3  0.9599359  0.9133774
##    4  0.9523718  0.8967667
##    5  0.9650000  0.9239854
##    6  0.9650000  0.9245889
##    7  0.9675000  0.9291044
##    8  0.9575000  0.9079077
##    9  0.9600000  0.9131708
##   10  0.9625000  0.9182936
##   11  0.9625000  0.9182936
##   12  0.9549359  0.9015341
##   13  0.9600000  0.9123037
##   14  0.9625000  0.9184645
##   15  0.9574359  0.9066434
##   16  0.9574359  0.9066434
##   17  0.9599359  0.9121969
##   18  0.9549359  0.9016783
##   19  0.9549359  0.9016783
##   20  0.9625000  0.9184685
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 7.
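The search that “train” performs here can be approximated by hand. The following is a minimal sketch, not caret's implementation: it uses simulated two-class data (an assumption, standing in for the real predictors), knn() from the class package, and a hypothetical fold assignment. For each fold it centers and scales with the training-fold statistics, then records the held-out accuracy for k = 1 to 20.

```r
library(class)  # knn()

set.seed(12345)

# Simulated two-class data (an assumption; stands in for the real predictors)
n <- 200
x <- matrix(rnorm(n * 4), ncol = 4)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, "M", "B"))

k_grid <- 1:20
folds  <- sample(rep(1:10, length.out = n))  # random fold label for each row

# Rows = folds, columns = candidate k values
acc <- matrix(NA_real_, nrow = 10, ncol = length(k_grid))

for (f in 1:10) {
  train_rows <- folds != f
  # Center and scale with statistics from the training folds only
  mu    <- colMeans(x[train_rows, ])
  sigma <- apply(x[train_rows, ], 2, sd)
  x_tr  <- scale(x[train_rows, ],  center = mu, scale = sigma)
  x_te  <- scale(x[!train_rows, ], center = mu, scale = sigma)

  for (j in seq_along(k_grid)) {
    pred <- knn(x_tr, x_te, cl = y[train_rows], k = k_grid[j])
    acc[f, j] <- mean(as.character(pred) == as.character(y[!train_rows]))
  }
}

cv_acc <- colMeans(acc)               # mean held-out accuracy per k
best_k <- k_grid[which.max(cv_acc)]   # k with the largest mean accuracy
best_k
```

Differences between neighbouring values of k are often small, so the selected k can change with the seed or the fold assignment.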

(d)

Graphing accuracy as a function of k.

plot(kNN1, xlab="k", ylab="Accuracy (k)")

What is the optimal value of k?

According to the graph, the optimal value of k is 7. At k = 7 the cross-validated accuracy (0.9675) is closer to 1 than at any other value of k.

(e)

predicted = predict(kNN1, newdata = bcwdtest)
confusionMatrix(predicted, bcwdtest$Diagnosis)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 105   5
##          M   2  58
##                                          
##                Accuracy : 0.9588         
##                  95% CI : (0.917, 0.9833)
##     No Information Rate : 0.6294         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9109         
##  Mcnemar's Test P-Value : 0.4497         
##                                          
##             Sensitivity : 0.9813         
##             Specificity : 0.9206         
##          Pos Pred Value : 0.9545         
##          Neg Pred Value : 0.9667         
##              Prevalence : 0.6294         
##          Detection Rate : 0.6176         
##    Detection Prevalence : 0.6471         
##       Balanced Accuracy : 0.9510         
##                                          
##        'Positive' Class : B              
##

What does the confusion matrix predict for alpha and beta?

For alpha, the Type I error rate, the confusion matrix gives 0.0794 (1 - Specificity). For beta, the Type II error rate, it gives 0.0187 (1 - Sensitivity).
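As a check, both rates can be recomputed from the four counts in the confusion matrix above. This sketch assumes, as the output states, that “B” is the positive class, and follows the convention used here of taking alpha as 1 - Specificity and beta as 1 - Sensitivity.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(105, 2, 5, 58), nrow = 2,
             dimnames = list(Prediction = c("B", "M"), Reference = c("B", "M")))

sensitivity <- cm["B", "B"] / sum(cm[, "B"])  # true B predicted as B
specificity <- cm["M", "M"] / sum(cm[, "M"])  # true M predicted as M

alpha <- 1 - specificity  # 5 / 63
beta  <- 1 - sensitivity  # 2 / 107

round(c(alpha = alpha, beta = beta), 4)
```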

(f)

In the cases where the model fails on the testing data, what are the associated probabilities? Any surprises?

In this case the false positive rate (Type I error), 7.94%, is higher than the false negative rate (Type II error), 1.87%. With “B” as the positive class, the false positives are the 5 malignant cases predicted as benign and the false negatives are the 2 benign cases predicted as malignant. I think this is normal given my data: with more data to analyze, the predictions should improve. The difference between the two error rates is surprising, but I think it is normal in a case where a false positive is more likely than a false negative.
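The probabilities attached to the individual failures can also be inspected. With caret, predict(kNN1, newdata = bcwdtest, type = "prob") returns class probabilities for each test row. The sketch below shows the same idea in a self-contained form with knn() from the class package on simulated data (an assumption, not the breast cancer data): it extracts the neighbour vote share for each prediction and summarizes it for the misclassified rows only.

```r
library(class)  # knn()

set.seed(12345)

# Simulated two-class data (an assumption, not the breast cancer data)
n <- 300
x <- matrix(rnorm(n * 2), ncol = 2)
y <- factor(ifelse(x[, 1] + rnorm(n, sd = 0.7) > 0, "M", "B"))

train_rows <- sample(n, 210)
pred <- knn(x[train_rows, ], x[-train_rows, ], cl = y[train_rows],
            k = 7, prob = TRUE)

# attr(, "prob") holds the share of neighbour votes won by the predicted class
vote_share <- attr(pred, "prob")
wrong <- as.character(pred) != as.character(y[-train_rows])

# Vote share for the misclassified test rows only
summary(vote_share[wrong])
```

A misclassified row with a vote share near 4/7 is a close call, while a share near 1 means the model was confidently wrong, which would be the more surprising case.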

Problem 2

The “Boston” dataset is used for this problem.

(a)

Setting the seed to 12345, we create the data partition based on “medv” from the dataset.

data(Boston)
boston <- Boston
set.seed(12345)
trainindexboston <- createDataPartition(boston$medv,p = 0.7, list = FALSE, times = 1)
bostontrain <- boston[trainindexboston,]
bostontest <- boston[-trainindexboston,]

(b)

We build a kNN model with “train”, using the optional argument “trControl”. To compute the R-squared we then need the test data (in this case “bostontest”), together with the functions “predict” and “cor”.

kNN2 <- train(medv~., data=bostontrain, method="knn",trControl=trainControl(method = "cv", number = 10))
predicted <- predict(kNN2, newdata=bostontest)
kNN2_R2 <- cor(predicted, bostontest$medv)^2
kNN2

## k-Nearest Neighbors 
## 
## 356 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 321, 322, 320, 321, 320, 319, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared 
##   5  6.321588  0.5619085
##   7  6.614972  0.5070256
##   9  6.613283  0.5052519
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was k = 5.

kNN2_R2

## [1] 0.489145
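The R-squared above is the squared correlation between predictions and observed values. A minimal sketch with toy numbers (illustrative only, not taken from the Boston data) compares it with the other common definition, 1 - SSE/SST, and with the RMSE. The squared correlation is never smaller than 1 - SSE/SST, because it ignores any constant offset or scaling error in the predictions.

```r
# Toy observed and predicted values (illustrative only)
obs  <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 41)

r2_cor  <- cor(pred, obs)^2                                    # squared correlation
r2_trad <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)  # 1 - SSE/SST
rmse    <- sqrt(mean((obs - pred)^2))                          # root mean squared error

c(r2_cor = r2_cor, r2_trad = r2_trad, rmse = rmse)
```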

(c)

We build a kNN model with “train”, using the optional arguments “trControl” and “preProcess”. To compute the R-squared we then need the test data (in this case “bostontest”), together with the functions “predict” and “cor”.

kNN3 <- train(medv~., data=bostontrain, method="knn",trControl=trainControl(method = "cv", number = 10), preProcess=c("center","scale"))
predicted <- predict(kNN3, newdata=bostontest)
kNN3_R2 <- cor(predicted, bostontest$medv)^2
kNN3

## k-Nearest Neighbors 
## 
## 356 samples
##  13 predictor
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 321, 320, 321, 320, 321, 319, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared 
##   5  4.694714  0.7605271
##   7  4.860577  0.7440205
##   9  4.859645  0.7434985
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was k = 5.

kNN3_R2

## [1] 0.784935

(d)

Scaling and centering the predictors.

plot(kNN2)

plot(kNN3)

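Scaling and centering the predictors is effective here: the cross-validated RMSE at k = 5 drops from 6.32 (kNN2) to 4.69 (kNN3), and the test-set R-squared rises from 0.489 to 0.785. Because kNN relies on distances between observations, unscaled predictors with large ranges dominate the neighbor search. The plots also make it easier to compare the candidate values of k and their RMSE for each model.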
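One way to see why the predictor scales matter for kNN is to look at how much each predictor contributes to a Euclidean distance before and after standardizing. This is a small sketch on the Boston data from MASS; the choice of rows 1 and 2 is arbitrary and only for illustration.

```r
library(MASS)  # provides the Boston data frame

x <- Boston[, names(Boston) != "medv"]   # the 13 predictors

# Raw standard deviations differ by orders of magnitude
raw_sd <- sapply(x, sd)
round(sort(raw_sd, decreasing = TRUE), 2)

# Share of the squared distance between rows 1 and 2 due to each predictor
d_raw <- (unlist(x[1, ]) - unlist(x[2, ]))^2
round(d_raw / sum(d_raw), 3)

# After centering and scaling, every predictor has unit standard deviation
xs <- scale(x)
d_scaled <- (xs[1, ] - xs[2, ])^2
round(d_scaled / sum(d_scaled), 3)
```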

(e)

What value of k is optimal?

For the kNN2 model, the optimal value of k is 5, because its RMSE (6.32) is lower than for the other two values of k. For the kNN3 model, the optimal value of k is also 5, because its RMSE (4.69) is the lowest of the three.

Problem 3

We are using the ChemicalManufacturingProcess dataset for this problem. (Package: AppliedPredictiveModeling)

(a)

With is.na I determine the number of NA values. This will be needed in the next step.

data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess
sum(is.na(chem))

## [1] 106
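The total count does not show where the missing values sit. A short sketch on a toy data frame (illustrative only) shows how the same is.na call can be broken down by column and by row.

```r
# Toy data frame with missing values (illustrative only)
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))

total_na   <- sum(is.na(df))            # total number of missing cells
na_per_col <- colSums(is.na(df))        # missing cells per column
bad_rows   <- sum(!complete.cases(df))  # rows with at least one missing value

total_na
na_per_col
bad_rows
```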

(b)

To replace the missing data we use the function knnImputation, so that the data are complete before we fit a kNN model.

chem <- knnImputation(chem, k = 5)
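The idea behind kNN imputation can be sketched by hand on a toy example. This is a simplified illustration, not the DMwR implementation: it uses a single observed column to find neighbours and an unweighted mean, whereas knnImputation has its own defaults for scaling and averaging that may differ.

```r
# Toy data: row 3 is missing x2 (illustrative only)
df <- data.frame(x1 = c(1, 2, 3, 10, 11),
                 x2 = c(1, 2, NA, 10, 11))

k <- 2
target   <- which(is.na(df$x2))    # row to impute
complete <- which(!is.na(df$x2))   # candidate neighbours

# Distance computed on the observed column only
d <- abs(df$x1[complete] - df$x1[target])
nearest <- complete[order(d)[1:k]]

# Unweighted mean of the neighbours' values (a simplification)
df$x2[target] <- mean(df$x2[nearest])
df
```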

(c)

Setting the seed to 12345, we create the data partition based on “Yield” from the dataset.

set.seed(12345)
trainindexchem <- createDataPartition(chem$Yield,p = 0.7, list = FALSE, times = 1)
chemtrain <- chem[trainindexchem,]
chemtest <- chem[-trainindexchem,]

(d)

We build a kNN model with “train”, using the optional arguments “trControl”, “preProcess” and “tuneGrid” to set the resampling scheme, the preprocessing and the candidate values of k.

kNN4 <- train(Yield~., data=chemtrain, method="knn",trControl=trainControl(method = "cv", number = 10), preProcess=c("center","scale"), tuneGrid=data.frame(.k=1:20))

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut
## = 10, : These variables have zero variances: BiologicalMaterial07

predicted <- predict(kNN4, newdata=chemtest)
kNN4_R2 <- cor(predicted, chemtest$Yield)^2
kNN4

## k-Nearest Neighbors 
## 
## 124 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 110, 112, 112, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared 
##    1  1.433272  0.4660500
##    2  1.251046  0.5727127
##    3  1.284574  0.5545142
##    4  1.228839  0.6032811
##    5  1.253296  0.5783980
##    6  1.253527  0.5726865
##    7  1.288125  0.5406860
##    8  1.311436  0.5171141
##    9  1.329873  0.4993553
##   10  1.328004  0.5011248
##   11  1.356679  0.4722567
##   12  1.372249  0.4561680
##   13  1.380990  0.4481100
##   14  1.384885  0.4442949
##   15  1.393764  0.4446986
##   16  1.401031  0.4376488
##   17  1.410588  0.4301057
##   18  1.420762  0.4198853
##   19  1.424922  0.4239067
##   20  1.431547  0.4212337
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was k = 4.

kNN4_R2

## [1] 0.4344609
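The warnings about BiologicalMaterial07 indicate that this predictor has zero variance in the data being preprocessed, so it cannot be scaled. One possible remedy is to drop such columns before centering and scaling. The sketch below does this in base R on a toy data frame (illustrative only); caret also offers nearZeroVar() and the "zv" preProcess method for the same purpose.

```r
# Toy predictors: 'const' takes a single value (illustrative only)
df <- data.frame(u = c(1.2, 3.4, 2.2, 5.1),
                 const = c(100, 100, 100, 100),
                 v = c(0, 1, 0, 1))

# A column has zero variance when it holds one distinct value
zero_var <- sapply(df, function(col) length(unique(col)) == 1)
names(df)[zero_var]

# Drop such columns before centering and scaling
df_clean <- df[, !zero_var, drop = FALSE]
names(df_clean)
```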

(e)

What value of k is optimal?

plot(kNN4)

The best value of k is 4, which gives the lowest cross-validated RMSE (1.2288) and also the highest R-squared (0.6033).

Problem 4

Using the Techstocks dataset, we compute the average return of the portfolio with minimal risk by solving a quadratic program.

(a)

I compute the log returns, and then the means and covariance, both annualized with 252 trading days. I define a function for the returns so that it can be applied to all the stocks at once, rather than one by one.

tech <- read.csv("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/TechStocks.csv")
tech <- tech[, -1]
# Log return: log(P_t / P_(t-1)); parentheses make the index range 1:(n-1) explicit
logratio <- function(x){log(x[2:length(x)]/x[1:(length(x)-1)])}
techreturns <- apply(tech, 2, logratio)
techmeans <- apply(techreturns, 2, mean)*252
techcov <- cov(techreturns)*252

techreturns, techmeans and techcov will help us in the next step.
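A quick check of the return calculation on toy prices (illustrative only): the ratio form used in logratio gives the same values as diff(log(x)), and the annualized mean is the daily mean times 252.

```r
# Toy daily closing prices for one stock (illustrative only)
prices <- c(100, 102, 101, 105)

# Log return for day t is log(P_t / P_(t-1))
r_ratio <- log(prices[-1] / prices[-length(prices)])
r_diff  <- diff(log(prices))   # equivalent shortcut

all.equal(r_ratio, r_diff)

# Annualize with 252 trading days per year
annual_mean <- mean(r_ratio) * 252
annual_mean
```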

(b)

I create a function for the minimal-risk mean return, so that it can be reused on other data in the next step. solve.QP takes the covariance matrix, the linear term, the constraint matrix and the constraint right-hand side, in that order.

minimalrisk <- function(techmeans, techcov){
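n <- length(techmeans)   # number of stocks, instead of a hard-coded 9
dvec <- rep(0, n)        # no linear term: minimize variance only
Amat <- matrix(1, n, 1)  # single constraint: weights sum to 1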
bvec <- c(1)
techquads <- solve.QP(techcov, dvec, Amat, bvec, meq=1)
sum(techquads$solution*techmeans) 
}
minimalrisk(techmeans, techcov)

## [1] 0.1837864
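With only the constraint that the weights sum to 1, the minimum-variance problem also has a closed-form solution, w = Σ⁻¹1 / (1'Σ⁻¹1), which can serve as a cross-check on the quadratic program. The sketch below uses a toy covariance matrix and toy mean returns (both illustrative assumptions, not the TechStocks values).

```r
# Toy annualized covariance matrix for three assets (illustrative only)
sigma <- matrix(c(0.04, 0.01, 0.00,
                  0.01, 0.09, 0.02,
                  0.00, 0.02, 0.16), nrow = 3, byrow = TRUE)
mu <- c(0.08, 0.12, 0.15)   # toy annualized mean returns

ones <- rep(1, nrow(sigma))

# Minimum-variance weights: w = Sigma^{-1} 1 / (1' Sigma^{-1} 1)
w <- solve(sigma, ones)
w <- w / sum(w)

port_mean <- sum(w * mu)                  # mean return of the portfolio
port_var  <- drop(t(w) %*% sigma %*% w)   # its variance

w
port_mean
port_var
```

Because there is no non-negativity constraint, the weights may be negative in general, which corresponds to short positions.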

(c)

Building on the previous function, we resample the returns with replacement to generate another minimal-risk mean return.

resam2 <- function(){
resam <- apply(techreturns, 2, function(x) sample(x, length(x), replace = TRUE))
techmeans <- apply(resam, 2, mean)*252
techcov <- cov(resam)*252
minimalrisk(techmeans, techcov)
}
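Note that resam2 resamples each stock's returns separately, so returns from different days end up in the same row. A common alternative is to resample whole rows (trading days), which keeps the correlation between stocks. The sketch below compares the two on simulated returns for two correlated assets (an assumption, for illustration only); which scheme the exercise intends is not stated.

```r
set.seed(12345)

# Simulated returns for two strongly correlated assets (illustrative only)
n <- 1000
common  <- rnorm(n)
returns <- cbind(a = common + 0.3 * rnorm(n),
                 b = common + 0.3 * rnorm(n))

# Column-wise: each asset resampled on its own, as in resam2()
by_col <- apply(returns, 2, function(x) sample(x, length(x), replace = TRUE))

# Row-wise: whole trading days resampled together
by_row <- returns[sample(nrow(returns), replace = TRUE), ]

c(original = cor(returns)[1, 2],
  by_col   = cor(by_col)[1, 2],
  by_row   = cor(by_row)[1, 2])
```

In this simulation the row-wise sample keeps a correlation close to the original, while the column-wise sample has a correlation near zero, so the two schemes would give different covariance matrices to the quadratic program.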

(d)

Finally we replicate the resampling 1000 times and draw a histogram of the minimal-risk mean returns. I also compute a 95% confidence interval from the sorted values.

resam3 <- sort(replicate(1000, resam2()))
hist(resam3)

c(resam3[25],resam3[975])

## [1] 0.1097394 0.3202161
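The same percentile interval can be obtained without sorting by using quantile(). This sketch uses simulated values as a stand-in for the 1000 bootstrap estimates (an assumption, for illustration only); quantile() interpolates between order statistics, so its result can differ slightly from picking the 25th and 975th sorted values.

```r
set.seed(12345)

# Stand-in for 1000 bootstrap estimates (simulated, illustrative only)
boot_est <- rnorm(1000, mean = 0.18, sd = 0.05)

# Percentile interval: the 2.5% and 97.5% quantiles
ci <- quantile(boot_est, probs = c(0.025, 0.975))
ci
```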

Angel Sosa

March 3rd, 2017
