#Loading the Wisconsin Diagnostic Breast Cancer (WDBC) dataset and naming the variables
#stringsAsFactors = TRUE makes Diagnosis a factor, as caret's classification models require (the default is FALSE in R >= 4.0)
BCancer <- read.csv("C:/Users/joshr/Documents/Machine Learning R/wdbc.txt", header = FALSE, stringsAsFactors = TRUE)
names(BCancer) <- c("ID_number", "Diagnosis", "Mean radius", "Mean texture", "Mean perimeter", "Mean area", "Mean smoothness", "Mean compactness", "Mean concavity", "Mean concave points", "Mean symmetry", "Mean fractal dimension", "Radius SE", "Texture SE", "Perimeter SE", "Area SE", "Smoothness SE", "Compactness SE", "Concavity SE", "Concave points SE", "Symmetry SE", "Fractal dimension SE", "Worst radius", "Worst texture", "Worst perimeter", "Worst area", "Worst smoothness", "Worst compactness", "Worst concavity", "Worst concave points", "Worst symmetry", "Worst fractal dimension")
#Removing the first column
BCancer$ID_number <- NULL
#Setting the seed and partitioning the data into training and testing sets
library(caret)
set.seed(12345)
trainingIndices <- createDataPartition(BCancer$Diagnosis, p=0.7, list=FALSE)
training <- BCancer[trainingIndices, ]
testing <- BCancer[-trainingIndices, ]
#Building kNN1, a kNN classification model
kNN1 <- train(Diagnosis~., data=training, method='knn', trControl=trainControl(method="cv", number=10), preProcess=c("center","scale"), tuneGrid=data.frame(k=1:20))
#Plotting accuracy as a function of k
plot(kNN1$results$k, kNN1$results$Accuracy, main = "Accuracy for kNN1", xlab = "k", ylab = "Accuracy", type = "o", pch = 20, col = "purple")
#Finding the maximum accuracy
max(kNN1$results$Accuracy)
## [1] 0.9549359
From the output, we can tell that the maximum accuracy for kNN1 is 0.9549359. This maximum occurs at k = 6, k = 7, and k = 10, so there are three optimal values of k (6, 7, and 10) in this case.
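One way to confirm which values of k attain this maximum is to query caret's results table directly (the same table used for the plot above):
#Extracting the k values at which the cross-validated accuracy is maximal
kNN1$results$k[kNN1$results$Accuracy == max(kNN1$results$Accuracy)]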
#Generating a confusion matrix on the testing data
predicted <- predict(kNN1, newdata=testing)
confusionMatrix(predicted, testing$Diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 104 2
## M 3 61
##
## Accuracy : 0.9706
## 95% CI : (0.9327, 0.9904)
## No Information Rate : 0.6294
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9372
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9720
## Specificity : 0.9683
## Pos Pred Value : 0.9811
## Neg Pred Value : 0.9531
## Prevalence : 0.6294
## Detection Rate : 0.6118
## Detection Prevalence : 0.6235
## Balanced Accuracy : 0.9701
##
## 'Positive' Class : B
##
With a Type 1 error, we reject that the tissue is benign (i.e., classify it as malignant) when it actually is benign. The estimated probability of a Type 1 error in this case is 3/(104+3) = 0.028. With a Type 2 error, we fail to reject that the tissue is benign (i.e., classify it as benign) when it actually is malignant. The estimated probability of a Type 2 error in this case is 2/(61+2) = 0.032.
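These rates can also be computed directly from the confusion matrix table (the $table component of caret's confusionMatrix object), reusing the predictions above:
cm <- confusionMatrix(predicted, testing$Diagnosis)$table
cm["M", "B"]/sum(cm[, "B"]) #Type 1 error rate: predicted malignant, actually benign
cm["B", "M"]/sum(cm[, "M"]) #Type 2 error rate: predicted benign, actually malignant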
#Finding the probabilities for the cases where the model fails on the testing data
newprediction <- predict(kNN1, newdata = testing, type = 'prob')
newprediction[predicted!=testing$Diagnosis, ]
## B M
## 25 0.5 0.5
## 28 0.2 0.8
## 68 0.5 0.5
## 75 0.6 0.4
## 148 0.6 0.4
In the cases shown above where the model fails, it is unsurprising that in two of them the predicted probability was 50% benign and 50% malignant: the classification was essentially a coin flip. In two other cases, the model gave a 60% probability of benign, but the tissue turned out to be malignant; that probability is not very high, so the misclassification is not too surprising either. For case 28, however, the model gave a high probability of malignant (80%), yet the tissue was actually benign, which is surprising.
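A small sketch, reusing the objects above, shows the true class alongside each failed prediction:
misclassified <- predicted != testing$Diagnosis
data.frame(newprediction[misclassified, ], Actual = testing$Diagnosis[misclassified])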
#Setting the seed and partitioning the data into training and testing sets
library(MASS)
library(caret)
set.seed(12345)
trainingIndices <- createDataPartition(Boston$medv, p = 0.7, list = FALSE, times = 1)
Training <- Boston[trainingIndices, ]
Testing <- Boston[-trainingIndices, ]
#Building knn2, a kNN regression model built without standardized predictors
knn2 <- train(medv~., data = Training, method = "knn", trControl = trainControl(method = "cv", number = 10))
#Computing r-squared for knn2 on the testing data
predictedknn2 <- predict(knn2, newdata = Testing)
rsquareknn2 <- cor(predictedknn2, Testing$medv)^2
rsquareknn2
## [1] 0.5261986
When we run knn2 on the testing data, we get an r-squared value of 0.5261986.
#Building knn3, a kNN regression model built with standardized predictors
knn3 <- train(medv~., data = Training, method = "knn", trControl = trainControl(method = "cv", number = 10), preProcess = c("center", "scale"))
#Computing r-squared for knn3 on the testing data
predictedknn3 <- predict(knn3, newdata = Testing)
rsquareknn3 <- cor(predictedknn3, Testing$medv)^2
rsquareknn3
## [1] 0.8391416
When we run knn3 on the testing data, we get an r-squared value of 0.8391416.
#Computing the difference between the r-squareds of knn2 and knn3
rsquareknn3 - rsquareknn2
## [1] 0.312943
R-squared increased greatly (by 0.312943, as shown) when the predictors were centered and scaled. This is because kNN measures similarity with Euclidean distance, so without standardization the predictors with the largest scales dominate the distance calculation. Centering and scaling therefore improves predictive performance substantially and should be done.
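A quick check of the raw predictor scales illustrates the point:
#Standard deviations of the unstandardized Boston predictors span several orders of magnitude
round(apply(Boston[, names(Boston) != "medv"], 2, sd), 2)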
#Looking at knn3's data
knn3
## k-Nearest Neighbors
##
## 356 samples
## 13 predictor
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 320, 320, 319, 321, 320, 321, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 5.062430 0.7072873 3.280971
## 7 5.107461 0.7094686 3.302478
## 9 5.165949 0.7105568 3.350741
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
#Plotting RMSE as a function of k for knn3
plot(knn3$results$k, knn3$results$RMSE, main = "RMSE for knn3", xlab = "k", ylab = "RMSE", type = "o", pch = 20, col = "indianred2")
#Plotting r-squared as a function of k for knn3
plot(knn3$results$k, knn3$results$Rsquared, main = "R-squared for knn3", xlab = "k", ylab = "R-squared", type = "o", pch = 20, col = "indianred2")
If we use RMSE as our measure to determine the optimal model, then k = 5 is the best number of nearest neighbors, since it has the lowest RMSE (as shown by the first graph). If we instead use R-squared, then k = 9 would be optimal, since it has the highest R-squared value (as shown by the second graph).
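Both choices can be read off programmatically from the results table:
knn3$results$k[which.min(knn3$results$RMSE)] #k with the lowest RMSE (5)
knn3$results$k[which.max(knn3$results$Rsquared)] #k with the highest R-squared (9)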
library(DMwR)
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
This shows that there are 106 missing values (NAs) in the ChemicalManufacturingProcess data.
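To see where those NAs are concentrated, we can count them per column (showing only the columns with at least one missing value):
NAcounts <- colSums(is.na(ChemicalManufacturingProcess))
NAcounts[NAcounts > 0]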
NewChemProcess <- knnImputation(ChemicalManufacturingProcess, 5)
sum(is.na(NewChemProcess))
## [1] 0
This shows that there are now zero NAs in NewChemProcess, the imputed copy of the data; they have all been replaced with kNN-imputed values (k = 5).
#Setting the seed and partitioning the data into training and testing sets
set.seed(12345)
trainingIndices <- createDataPartition(NewChemProcess$Yield, p = 0.7, list = FALSE, times = 1)
Training <- NewChemProcess[trainingIndices, ]
Testing <- NewChemProcess[-trainingIndices, ]
#Building knn4, a kNN regression model
knn4 <- train(Yield~., data = Training, method = "knn", trControl = trainControl(method = "cv", number = 10), preProcess = c("center", "scale"), tuneGrid = data.frame(k = 1:20))
#Computing r-squared for knn4 on the testing data
predictedknn4 <- predict(knn4, newdata = Testing)
rsquareknn4 <- cor(predictedknn4, Testing$Yield)^2
rsquareknn4
## [1] 0.2318765
When we run knn4 on the testing data, we get an r-squared value of 0.2318765.
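The test-set RMSE can be computed alongside r-squared, for comparison with the cross-validated RMSE shown below:
#Test-set RMSE for knn4
sqrt(mean((predictedknn4 - Testing$Yield)^2))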
#Looking at knn4's data
knn4
## k-Nearest Neighbors
##
## 124 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 112, 112, 112, 110, 110, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 1.329002 0.5612418 0.9937976
## 2 1.204573 0.6107339 0.9574643
## 3 1.179339 0.6225532 0.9317937
## 4 1.242031 0.5903446 0.9555625
## 5 1.257073 0.5745020 0.9815190
## 6 1.262658 0.5748186 0.9845496
## 7 1.253990 0.5820202 0.9926125
## 8 1.289082 0.5572460 1.0106860
## 9 1.295272 0.5620830 1.0201786
## 10 1.290987 0.5652431 1.0112990
## 11 1.293758 0.5625803 1.0169033
## 12 1.290825 0.5673890 1.0221247
## 13 1.310573 0.5531906 1.0448777
## 14 1.314760 0.5570245 1.0626735
## 15 1.322895 0.5509255 1.0701576
## 16 1.323631 0.5595409 1.0668292
## 17 1.335317 0.5550728 1.0754575
## 18 1.343720 0.5503510 1.0802472
## 19 1.358229 0.5393939 1.0902423
## 20 1.361988 0.5368745 1.0898083
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
#Plotting r-squared as a function of k for knn4
plot(knn4$results$k, knn4$results$Rsquared, main = "R-squared for knn4", xlab = "k", ylab = "R-squared", type = "o", pch = 20, col = "khaki4")
#Plotting RMSE as a function of k for knn4
plot(knn4$results$k, knn4$results$RMSE, main = "RMSE for knn4", xlab = "k", ylab = "RMSE", type = "o", pch = 20, col = "khaki4")
As can be seen from the data and the graphs, k = 3 produces the highest R-squared and the lowest RMSE. Thus, 3 is the optimal value for k.
#Reading the TechStocks file
setwd("C:/Users/joshr/Documents/Machine Learning R")
TechStocks <- read.csv(file = 'TechStocks.csv')
#Removing the date column
TechStocks1 <- TechStocks[, -c(1)]
#Creating a function called Logarithm to be used to find the returns
Logarithm <- function(x)
{log(x[2:length(x)]/x[1:(length(x)-1)])}
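As a sanity check, Logarithm(x) is equivalent to diff(log(x)); a quick test on hypothetical prices:
#Quick check on made-up prices: both give the same daily log returns
all.equal(Logarithm(c(100, 101, 99.5)), diff(log(c(100, 101, 99.5))))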
#Creating a function to annualize mean daily returns (252 trading days per year)
Means <- function(x) {252*mean(x)}
#Computing the returns using the Logarithm function
Returns <- apply(TechStocks1, 2, Logarithm)
#Computing the annualized means
AnnualizedMeans <- apply(Returns, 2, Means)
AnnualizedMeans
## DOX CAN VRTU CTSH INFY WIT GIB EPAM
## 0.1674081 0.1535161 0.4611042 0.1625373 0.2116694 0.1886095 0.2458692 0.4622137
## HCKT
## 0.3730890
#Computing the annualized covariance matrix
AnnualizedCovarianceMatrix <- 252*cov(Returns)
AnnualizedCovarianceMatrix
## DOX CAN VRTU CTSH INFY WIT
## DOX 0.017354764 0.01023309 0.01062900 0.01328624 0.011191268 0.01344444
## CAN 0.010233089 0.03414610 0.01788330 0.02023627 0.015555104 0.01798685
## VRTU 0.010629003 0.01788330 0.08682392 0.01987732 0.018613612 0.02052116
## CTSH 0.013286235 0.02023627 0.01987732 0.06573609 0.031553790 0.03257487
## INFY 0.011191268 0.01555510 0.01861361 0.03155379 0.099148515 0.04464843
## WIT 0.013444442 0.01798685 0.02052116 0.03257487 0.044648430 0.07313797
## GIB 0.010191088 0.01605604 0.01584490 0.01552407 0.018726367 0.02427504
## EPAM 0.015116009 0.02229656 0.03909279 0.02889489 0.026542878 0.02806391
## HCKT 0.008001817 0.01130643 0.02469106 0.01856779 0.008879767 0.01307661
## GIB EPAM HCKT
## DOX 0.01019109 0.01511601 0.008001817
## CAN 0.01605604 0.02229656 0.011306425
## VRTU 0.01584490 0.03909279 0.024691059
## CTSH 0.01552407 0.02889489 0.018567791
## INFY 0.01872637 0.02654288 0.008879767
## WIT 0.02427504 0.02806391 0.013076610
## GIB 0.08065217 0.01618000 0.011985526
## EPAM 0.01618000 0.16418374 0.028492805
## HCKT 0.01198553 0.02849280 0.131535962
#Writing a function to compute the annualized mean return for the minimal-risk portfolio
library(quadprog)
AnnualizedMeanfunction <- function(ReturnMatrix, MeanVector)
{
n <- ncol(ReturnMatrix)
dmat <- cov(ReturnMatrix) #covariance of the returns (rescaling by 252 would not change the optimal weights)
dvec <- rep(0, n) #no linear term: we minimize portfolio variance only
amat <- matrix(1, n, 1) #one equality constraint: the weights must sum to 1
bvec <- c(1)
sol <- solve.QP(dmat, dvec, amat, bvec, meq = 1, factorized = FALSE)
sum(sol$solution*MeanVector) #annualized mean return at the minimal-risk weights
}
#Computing the corresponding mean return
AnnualizedMeanfunction(Returns, AnnualizedMeans)
## [1] 0.1837864
The annualized mean return for the minimal-risk portfolio is 0.1837864.
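AnnualizedMeanfunction returns only the mean return, not the weights themselves. A hypothetical helper (our own name, not part of quadprog) that solves the same quadratic program but returns sol$solution would expose them:
MinRiskWeights <- function(ReturnMatrix)
{
n <- ncol(ReturnMatrix)
sol <- solve.QP(cov(ReturnMatrix), rep(0, n), matrix(1, n, 1), c(1), meq = 1)
setNames(sol$solution, colnames(ReturnMatrix))
}
round(MinRiskWeights(Returns), 4)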
#Writing a function to resample the returns and recompute the minimal-risk mean return
ResampleFunction <- function()
{
#Resampling each stock's returns with replacement, independently per column
ResampleReturns <- apply(Returns, 2, function(x) sample(x, length(x), replace = TRUE))
ResampleMeans <- 252*apply(ResampleReturns, 2, mean)
AnnualizedMeanfunction(ResampleReturns, ResampleMeans)
}
ResampleFunction()
## [1] 0.3069331
Using the resampled returns, we now get a minimal-risk mean return of 0.3069331.
#Setting the seed and generating a histogram of the minimal-risk mean return
set.seed(12345)
SortedReturns <- sort(replicate(1000, ResampleFunction()))
hist(SortedReturns, xlab = "Minimal-Risk Mean Return", main = "Histogram of Minimal-Risk Mean Return")
#Finding a 95% confidence interval for the minimal-risk mean return
#The 26th and 975th order statistics bracket the central 950 of the 1000 sorted values
CI <- c(SortedReturns[26], SortedReturns[975])
CI
## [1] 0.09912929 0.32593648
The 95% confidence interval for the minimal-risk mean return that we got is (0.09912929, 0.32593648).
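Essentially the same interval can be obtained with quantile(), which differs only in how it interpolates between order statistics:
quantile(SortedReturns, c(0.025, 0.975))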