The following project is coursework for the Machine Learning course (DAT 315). It works through a series of problems. All exercises were completed by Angel Sosa.
library(ISLR)
library(fmsb)
library(car)
library(stats)
library(leaps)
library(earth)
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
library(class)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(MASS)
library(DMwR)
## Loading required package: grid
library(quadprog)
library(AppliedPredictiveModeling)
We are using the “Breast Cancer Wisconsin Diagnostic” dataset in this problem.
The patient identification number carries no predictive information, so it is removed.
bcwd <- read.csv("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/bcwd.txt")
bcwd <- bcwd[, -c(1)]
With the seed set to 12345, we create the data partition, stratified on “Diagnosis”: 70% of the rows go to the training set and the rest to the test set.
set.seed(12345)
trainindexbcwd <- createDataPartition(bcwd$Diagnosis,p = 0.7, list = FALSE, times = 1)
bcwdtrain <- bcwd[trainindexbcwd,]
bcwdtest <- bcwd[-trainindexbcwd,]
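As a rough illustration of what a stratified 70/30 split does, the base-R sketch below samples 70% of the rows within each class so that the class proportions carry over to the training set. The `labels` vector is made up for the example, and this is a simplified approximation, not caret's actual implementation of `createDataPartition`.

```r
set.seed(12345)

# Hypothetical class labels: 60 benign, 40 malignant
labels <- factor(rep(c("B", "M"), times = c(60, 40)))

# Row indices grouped by class
idx_by_class <- split(seq_along(labels), labels)

# Sample 70% of the rows inside each class
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.7 * length(i)))
}))
train_idx <- sort(unname(train_idx))
test_idx <- setdiff(seq_along(labels), train_idx)

table(labels[train_idx])  # class counts in the training set
table(labels[test_idx])   # class counts in the test set
```

Both tables keep the original 60/40 class ratio, which is the point of stratifying on the outcome.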
We build a kNN model with “train”, using three optional arguments: “trControl” (10-fold cross-validation), “preProcess” (centering and scaling) and “tuneGrid” (k from 1 to 20).
kNN1 <- train(Diagnosis~., data=bcwdtrain, method="knn",trControl=trainControl(method = "cv", number = 10), preProcess=c("center","scale"), tuneGrid=data.frame(.k=1:20))
kNN1
## k-Nearest Neighbors
##
## 399 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## Pre-processing: centered (30), scaled (30)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 359, 360, 359, 359, 359, 359, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9474359 0.8873717
## 2 0.9599359 0.9129673
## 3 0.9599359 0.9133774
## 4 0.9523718 0.8967667
## 5 0.9650000 0.9239854
## 6 0.9650000 0.9245889
## 7 0.9675000 0.9291044
## 8 0.9575000 0.9079077
## 9 0.9600000 0.9131708
## 10 0.9625000 0.9182936
## 11 0.9625000 0.9182936
## 12 0.9549359 0.9015341
## 13 0.9600000 0.9123037
## 14 0.9625000 0.9184645
## 15 0.9574359 0.9066434
## 16 0.9574359 0.9066434
## 17 0.9599359 0.9121969
## 18 0.9549359 0.9016783
## 19 0.9549359 0.9016783
## 20 0.9625000 0.9184685
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
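To make the tuning step more concrete, here is a minimal sketch of choosing k by cross-validated accuracy without caret. It makes several assumptions that differ from the model above: it uses the built-in `iris` data as a stand-in, leave-one-out cross-validation (`knn.cv` from the `class` package) instead of 10 folds, and it scales the predictors once on the full data rather than inside each fold.

```r
library(class)  # ships with R as a recommended package

set.seed(12345)           # knn.cv breaks ties at random
x <- scale(iris[, 1:4])   # center and scale the predictors
y <- iris$Species

ks <- 1:20
# Leave-one-out accuracy for each candidate k
acc <- sapply(ks, function(k) mean(knn.cv(x, y, k = k) == y))

best_k <- ks[which.max(acc)]  # first k with the highest accuracy
data.frame(k = ks, Accuracy = acc)
best_k
```

The table has the same shape as the one printed by `train`: one accuracy per k, with the largest value selecting the final model.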
Graphing accuracy as a function of k.
plot(kNN1, xlab="k", ylab="Accuracy (k)")
According to the graph, the optimal value of k is 7: its cross-validated accuracy (0.9675) is the highest of the 20 values tried, that is, the closest to 1. We then evaluate the model on the test set.
predicted = predict(kNN1, newdata = bcwdtest)
confusionMatrix(predicted, bcwdtest$Diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 105 5
## M 2 58
##
## Accuracy : 0.9588
## 95% CI : (0.917, 0.9833)
## No Information Rate : 0.6294
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9109
## Mcnemar's Test P-Value : 0.4497
##
## Sensitivity : 0.9813
## Specificity : 0.9206
## Pos Pred Value : 0.9545
## Neg Pred Value : 0.9667
## Prevalence : 0.6294
## Detection Rate : 0.6176
## Detection Prevalence : 0.6471
## Balanced Accuracy : 0.9510
##
## 'Positive' Class : B
##
For alpha, the Type I error rate (given by 1 - Specificity), the model gives 0.0794. For beta, the Type II error rate (given by 1 - Sensitivity), it gives 0.0187.
In this case the false positive rate (Type I error, 7.94%) is higher than the false negative rate (Type II error, 1.87%). Because the 'Positive' class is B, a false positive here is a malignant tumor predicted as benign (5 of 63 malignant cases), and a false negative is a benign tumor predicted as malignant (2 of 107 benign cases). The difference between the two errors is noticeable, but I think it is to be expected with this amount of data: the test set has only 170 observations, so the comparison rests on a handful of cases (McNemar's test p-value is 0.4497), and more data would give a more reliable prediction.
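The two error rates can be recomputed directly from the four counts in the confusion matrix. The sketch below retypes those counts by hand (it does not use the fitted model); the variable names are chosen for the example.

```r
# Counts copied from the confusion matrix above (rows: prediction, columns: reference)
cm <- matrix(c(105, 2, 5, 58), nrow = 2,
             dimnames = list(Prediction = c("B", "M"), Reference = c("B", "M")))

# 'Positive' class is B
sensitivity <- cm["B", "B"] / sum(cm[, "B"])  # true positives / all actual B
specificity <- cm["M", "M"] / sum(cm[, "M"])  # true negatives / all actual M

alpha_err <- 1 - specificity  # Type I error rate (false positive rate)
beta_err  <- 1 - sensitivity  # Type II error rate (false negative rate)

round(c(alpha = alpha_err, beta = beta_err), 4)
```

Expressed as percentages, these are roughly 7.9% and 1.9%.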
The “Boston” dataset is used for this problem.
With the seed set to 12345, we create the data partition based on “medv”, with 70% of the rows in the training set.
data(Boston)
boston <- Boston
set.seed(12345)
trainindexboston <- createDataPartition(boston$medv,p = 0.7, list = FALSE, times = 1)
bostontrain <- boston[trainindexboston,]
bostontest <- boston[-trainindexboston,]
We build a kNN model with “train”, using one optional argument, “trControl”. Then, to compute the R-squared, we apply the model to the test data (“bostontest”) with “predict” and square the correlation, from “cor”, between the predicted and observed values.
kNN2 <- train(medv~., data=bostontrain, method="knn",trControl=trainControl(method = "cv", number = 10))
predicted <- predict(kNN2, newdata=bostontest)
kNN2_R2 <- cor(predicted, bostontest$medv)^2
kNN2
## k-Nearest Neighbors
##
## 356 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 321, 322, 320, 321, 320, 319, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared
## 5 6.321588 0.5619085
## 7 6.614972 0.5070256
## 9 6.613283 0.5052519
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
kNN2_R2
## [1] 0.489145
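The squared correlation used above measures how closely the predictions track the observed values linearly, which is not quite the same quantity as 1 - SSE/SST. A small sketch with made-up numbers (not the Boston data) shows the difference: predictions that are off by a constant still get a squared correlation of 1.

```r
# Hypothetical observed values and predictions that are all 5 units too high
obs  <- c(10, 20, 30, 40)
pred <- obs + 5

# Squared correlation, as computed in this report
r2_cor <- cor(pred, obs)^2

# Alternative definition: 1 - SSE / SST
sse <- sum((obs - pred)^2)
sst <- sum((obs - mean(obs))^2)
r2_sse <- 1 - sse / sst

c(r2_cor = r2_cor, r2_sse = r2_sse)
```

Either definition is a reasonable summary; they simply answer slightly different questions, so it helps to state which one is reported.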
We build a second kNN model with “train”, now using two optional arguments: “trControl” and “preProcess” (centering and scaling). As before, to compute the R-squared we apply the model to the test data (“bostontest”) with “predict” and square the correlation, from “cor”, between the predicted and observed values.
kNN3 <- train(medv~., data=bostontrain, method="knn",trControl=trainControl(method = "cv", number = 10), preProcess=c("center","scale"))
predicted <- predict(kNN3, newdata=bostontest)
kNN3_R2 <- cor(predicted, bostontest$medv)^2
kNN3
## k-Nearest Neighbors
##
## 356 samples
## 13 predictor
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 321, 320, 321, 320, 321, 319, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared
## 5 4.694714 0.7605271
## 7 4.860577 0.7440205
## 9 4.859645 0.7434985
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
kNN3_R2
## [1] 0.784935
Scaling and centering the predictors.
plot(kNN2)
plot(kNN3)
Scaling and centering matter for kNN because the neighbors are found by distance, so predictors measured on large scales would otherwise dominate. With centered and scaled predictors (kNN3), the cross-validated RMSE at k = 5 drops from 6.32 to 4.69 and the test-set R-squared rises from 0.489 to 0.785 compared with the unscaled model (kNN2).
For the kNN2 model, the optimal value of k is 5, because its RMSE is lower than for the other two values of k. For the kNN3 model, the optimal value of k is also 5, for the same reason.
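A toy example may help show why scaling changes a distance-based model. In the sketch below, the data frame `toy` and its column names are invented: one predictor lives on a scale of hundreds and the other on a scale of fractions. The nearest neighbor of point A changes once both columns are standardized.

```r
# Hypothetical points: 'big' is on a large scale, 'small' on a small one
toy <- data.frame(big   = c(300, 320, 400, 700),
                  small = c(0.40, 0.90, 0.41, 0.85),
                  row.names = c("A", "B", "C", "D"))

# Name of the point closest to A under Euclidean distance
nearest_to_A <- function(m) {
  d <- as.matrix(dist(m))["A", ]
  names(which.min(d[names(d) != "A"]))
}

raw_nn <- nearest_to_A(toy)         # distance driven almost entirely by 'big'
std_nn <- nearest_to_A(scale(toy))  # both predictors contribute

c(raw = raw_nn, standardized = std_nn)
```

On the raw data the neighbor is B, which is close in `big` but very different in `small`; after standardizing it is C, which is close in both.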
We use the ChemicalManufacturingProcess dataset for this problem (package: AppliedPredictiveModeling).
With is.na I count the number of missing values (NA), which we will need shortly.
data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess
sum(is.na(chem))
## [1] 106
To replace the missing data we use the function knnImputation, so that the data can be used in a kNN model.
chem <- knnImputation(chem, k = 5)
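The idea behind nearest-neighbor imputation can be sketched in a few lines of base R. This is a deliberately simplified version, not the algorithm `knnImputation` actually runs: the helper `impute_knn_mean` and the data frame `toy` are hypothetical, only one column has missing values, and the fill-in value is the plain mean of the k nearest complete rows.

```r
# Hypothetical data with one missing value in y
toy <- data.frame(x1 = c(1, 2, 3, 10),
                  x2 = c(1.0, 1.1, 0.9, 5.0),
                  y  = c(5, NA, 7, 20))

impute_knn_mean <- function(df, target, k) {
  preds  <- setdiff(names(df), target)
  z      <- scale(df[preds])               # standardize the predictors
  miss   <- which(is.na(df[[target]]))     # rows to fill in
  donors <- which(!is.na(df[[target]]))    # rows with an observed target
  for (i in miss) {
    # Euclidean distance from row i to every donor row
    d  <- sqrt(rowSums(sweep(z[donors, , drop = FALSE], 2, z[i, ])^2))
    nn <- donors[order(d)[seq_len(k)]]     # the k closest donors
    df[[target]][i] <- mean(df[[target]][nn])
  }
  df
}

filled <- impute_knn_mean(toy, "y", k = 2)
filled
```

Row 2 is closest to rows 1 and 3, so its missing `y` is filled with the mean of 5 and 7.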
With the seed set to 12345, we create the data partition based on “Yield”, with 70% of the rows in the training set.
set.seed(12345)
trainindexchem <- createDataPartition(chem$Yield,p = 0.7, list = FALSE, times = 1)
chemtrain <- chem[trainindexchem,]
chemtest <- chem[-trainindexchem,]
We build a kNN model with “train”, using three optional arguments: “trControl” (10-fold cross-validation), “preProcess” (centering and scaling) and “tuneGrid” (k from 1 to 20).
kNN4 <- train(Yield~., data=chemtrain, method="knn",trControl=trainControl(method = "cv", number = 10), preProcess=c("center","scale"), tuneGrid=data.frame(.k=1:20))
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut
## = 10, : These variables have zero variances: BiologicalMaterial07
predicted <- predict(kNN4, newdata=chemtest)
kNN4_R2 <- cor(predicted, chemtest$Yield)^2
kNN4
## k-Nearest Neighbors
##
## 124 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 110, 112, 112, 112, 112, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared
## 1 1.433272 0.4660500
## 2 1.251046 0.5727127
## 3 1.284574 0.5545142
## 4 1.228839 0.6032811
## 5 1.253296 0.5783980
## 6 1.253527 0.5726865
## 7 1.288125 0.5406860
## 8 1.311436 0.5171141
## 9 1.329873 0.4993553
## 10 1.328004 0.5011248
## 11 1.356679 0.4722567
## 12 1.372249 0.4561680
## 13 1.380990 0.4481100
## 14 1.384885 0.4442949
## 15 1.393764 0.4446986
## 16 1.401031 0.4376488
## 17 1.410588 0.4301057
## 18 1.420762 0.4198853
## 19 1.424922 0.4239067
## 20 1.431547 0.4212337
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 4.
kNN4_R2
## [1] 0.4344609
What value of k is optimal?
plot(kNN4)
The best value of k is 4, which gives the lowest cross-validated RMSE (1.2288) and also the highest cross-validated R-squared (0.6033).
Using the TechStocks dataset, we compute the average return of the minimal-risk portfolio by solving a quadratic program.
I compute the log returns, then their means and covariance matrix, the latter two annualized by a factor of 252. I write a function so that the calculation is applied to all the stocks at once rather than one by one.
tech <- read.csv("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/TechStocks.csv")
tech <- tech[, -1]
logratio <- function(x){log(x[2:length(x)]/x[1:(length(x)-1)])}
techreturns <- apply(tech, 2, logratio)
techmeans <- apply(techreturns, 2, mean)*252
techcov <- cov(techreturns)*252
techreturns, techmeans and techcov will help us in the next step.
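As a quick check of the return calculation, the sketch below applies the same log-ratio idea to three made-up prices and confirms that it matches `diff(log(x))`. The factor 252 is the annualization factor used in this report, commonly taken as the number of trading days in a year.

```r
# Hypothetical daily closing prices: +10% on each of two days
prices <- c(100, 110, 121)

# Log return between consecutive prices
logratio <- function(x) log(x[2:length(x)] / x[1:(length(x) - 1)])

r <- logratio(prices)
all.equal(r, diff(log(prices)))  # same result via base R helpers

annual_mean <- mean(r) * 252     # annualized mean log return
annual_mean
```

Both daily returns equal log(1.1), so the annualized mean is simply 252 times that value.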
Next I create a function for the minimal risk, which will let us reuse the calculation on other data in the next step. The function relies on solve.QP, whose arguments must be passed in the right order.
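minimalrisk <- function(techmeans, techcov){
  n <- length(techmeans)              # number of assets (9 for this dataset)
  dvec <- rep(0, n)                   # no linear term: minimize variance only
  Amat <- matrix(1, n, 1)             # single constraint: weights sum to 1
  bvec <- c(1)
  techquads <- solve.QP(techcov, dvec, Amat, bvec, meq=1)
  sum(techquads$solution*techmeans)   # mean return of the minimal-risk portfolio
}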
minimalrisk(techmeans, techcov)
## [1] 0.1837864
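When the only constraint is that the weights sum to 1 (so short positions are allowed), the minimum-variance weights also have a closed form, the inverse covariance matrix applied to a vector of ones and then rescaled to sum to 1. That gives an independent check on the quadratic program. The sketch below uses an invented two-asset covariance matrix `Sigma` and mean vector `mu`, not the TechStocks data.

```r
# Hypothetical annualized covariance matrix and mean returns for two assets
Sigma <- matrix(c(0.04, 0.01,
                  0.01, 0.09), nrow = 2, byrow = TRUE)
mu <- c(0.10, 0.15)

ones <- rep(1, nrow(Sigma))
w <- solve(Sigma, ones)  # Sigma^{-1} %*% 1, without forming the inverse
w <- w / sum(w)          # rescale so the weights sum to 1

port_return <- sum(w * mu)                  # mean return of that portfolio
port_var <- drop(t(w) %*% Sigma %*% w)      # its variance

list(weights = w, mean_return = port_return, variance = port_var)
```

For a real covariance matrix the weights from this formula should agree, up to numerical tolerance, with the solution of an equality-constrained quadratic program set up the same way.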
Reusing the previous function inside a new one, we resample the returns to generate another minimal-risk mean return. Each stock's return series is resampled with replacement, independently of the others.
resam2 <- function(){
resam <- apply(techreturns, 2, function(x) sample(x, length(x), replace = TRUE))
techmeans <- apply(resam, 2, mean)*252
techcov <- cov(resam)*252
minimalrisk(techmeans, techcov)
}
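One property of resampling each column separately is that returns from the same day are no longer paired, so the correlations between stocks are not carried into the resampled covariance matrix. A possible variation is to resample whole rows (days) instead. The sketch below compares the two on simulated, strongly correlated data; it is an illustration under that assumption, not a rerun of the analysis, and switching schemes would change the numbers reported here.

```r
set.seed(12345)

# Simulated returns for two strongly correlated assets (not the TechStocks data)
n <- 500
a <- rnorm(n)
returns <- cbind(a = a, b = a + rnorm(n, sd = 0.1))

# Column-wise: each series is resampled independently
col_wise <- apply(returns, 2, function(x) sample(x, length(x), replace = TRUE))

# Row-wise: whole days are resampled, keeping the pairing between assets
row_wise <- returns[sample(nrow(returns), replace = TRUE), ]

c(original = cor(returns)[1, 2],
  col_wise = cor(col_wise)[1, 2],
  row_wise = cor(row_wise)[1, 2])
```

The row-wise sample keeps a correlation close to the original, while the column-wise sample has a correlation near zero.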
Finally, we replicate the resampling 1000 times and draw a histogram of the resulting minimal-risk mean returns. I also compute a 95% confidence interval from the 25th and 975th of the sorted values.
resam3 <- sort(replicate(1000, resam2()))
hist(resam3)
c(resam3[25],resam3[975])
## [1] 0.1097394 0.3202161
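The interval above is a percentile interval read off the sorted replicates. An equivalent and slightly more general way to get it is `quantile`, sketched here on simulated numbers that stand in for the 1000 replicates (the mean and standard deviation passed to `rnorm` are arbitrary).

```r
set.seed(12345)

# Stand-in for 1000 bootstrap replicates of the minimal-risk mean return
boot <- rnorm(1000, mean = 0.2, sd = 0.05)

# Index-based interval, as in this report
sorted <- sort(boot)
ci_index <- c(sorted[25], sorted[975])

# Quantile-based interval; interpolates between neighboring sorted values
ci_quant <- quantile(boot, probs = c(0.025, 0.975), names = FALSE)

rbind(index = ci_index, quantile = ci_quant)
```

The two versions differ only by interpolation between adjacent order statistics, so with 1000 replicates they are nearly identical; `quantile` has the advantage of working unchanged for any number of replicates.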