DAT 315 Machine Learning Project 3

Problem 1

Load the “Breast Cancer Wisconsin Diagnostic” dataset from http://archive.ics.uci.edu/ml. into R. Thereare 569 observations on 32 variables. Be sure to include appropriate variable names.

#Loads packages
library(caret)
library(ISLR)
library(car)
library(leaps)
library(earth)
library(MASS)
library(AppliedPredictiveModeling)
library(DMwR)
library(quadprog)
library(gdata)

(a)

The first column is a patient identification number. Remove it from the data.

cols <- c("id","diagnosis","radiusMean","textureMean","perimeterMean","areaMean","smoothnessMean","compactnessMean","concavityMean","concave_ptsMean","symmetryMean","fractal_dimMean","radiusSE","textureSE","perimeterSE","areaSE","smoothnessSE","compactnessSE","concavitySE","concave_ptsSE","symmetrySE","fractal_dimSE","radiusWorst","textureWorst","perimeterWorst","areaWorst","smoothnessWorst","compactnessWorst","concavityWorst","concave_ptsWorst","symmetryWorst","fractal_dimWorst")
bcwd <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', col.names = cols)
bcwd$id <- NULL

(b)

Set the seed to 12345 and use createDataPartition with \(p=0.7\) to partition the data into training and testing sets. The response variable is diagnosis.

set.seed(12345)
trainingIndices <- createDataPartition(bcwd$diagnosis, p=0.7, list=FALSE)
bcwdTraining <- bcwd[trainingIndices, ]
bcwdTesting <- bcwd[-trainingIndices, ]

(c)

Build a kNN classification model (response =diagnosis) called kNN1 using train with optional arguments trControl=trainControl( method=“cv”, number=10)(for cross-validation), preProcess=c( “center”,“scale”) (to resize the predictors), andtuneGrid=data.frame(.k=1:20) (to try values of k from 1 to 20).

kNN1 <- train(diagnosis~., data=bcwdTraining, method ="knn", trControl=trainControl(method = "cv", number = 10), preProcess=c( "center","scale"), tuneGrid=data.frame(.k=1:20))
kNN1

## k-Nearest Neighbors 
## 
## 398 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## Pre-processing: centered (30), scaled (30) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 358, 359, 359, 358, 358, 358, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9450000  0.8818822
##    2  0.9399359  0.8698237
##    3  0.9599359  0.9125390
##    4  0.9548718  0.9019636
##    5  0.9598718  0.9128724
##    6  0.9598718  0.9127744
##    7  0.9523077  0.8964513
##    8  0.9548077  0.9015605
##    9  0.9523077  0.8958527
##   10  0.9573718  0.9065824
##   11  0.9548077  0.9010345
##   12  0.9548077  0.9010345
##   13  0.9598077  0.9116972
##   14  0.9573077  0.9059895
##   15  0.9523077  0.8950306
##   16  0.9573718  0.9061258
##   17  0.9548077  0.9002817
##   18  0.9548077  0.9002817
##   19  0.9548077  0.9002817
##   20  0.9548077  0.9002817
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.

(d)

Graph the accuracy as a function of k and determine the optimal value of k.

plot(kNN1,
     main = "kNN1",
     xlab="k",
     ylab="Accuracy (k)",
     col = "blue"
     )

(e)

Evaluate the model by generating a confusion matrix for the model applied to the testing data. If H0: Tissue is benign (B), then what does the confusion matrix predict for α=P(Type I error) and β=P(Type II error)?

predicted = predict(kNN1, newdata = bcwdTesting)
confusionMatrix(predicted, bcwdTesting$diagnosis)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 104   5
##          M   3  58
##                                           
##                Accuracy : 0.9529          
##                  95% CI : (0.9094, 0.9795)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8985          
##  Mcnemar's Test P-Value : 0.7237          
##                                           
##             Sensitivity : 0.9720          
##             Specificity : 0.9206          
##          Pos Pred Value : 0.9541          
##          Neg Pred Value : 0.9508          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6118          
##    Detection Prevalence : 0.6412          
##       Balanced Accuracy : 0.9463          
##                                           
##        'Positive' Class : B               
##

According to the confusion matrix, \(α=\dfrac{3}{107}≈0.0280\) and \(β=\dfrac{5}{63}≈0.0794\).

(f)

In the cases where the model fails on the testing data, what are the associated probabilities? Any surprises?

predictedProb = predict(kNN1, newdata = bcwdTesting, type = 'prob')
predictedProb[predicted!=bcwdTesting$diagnosis, ]

##             B         M
## 28  0.0000000 1.0000000
## 29  0.6666667 0.3333333
## 34  0.6666667 0.3333333
## 63  0.3333333 0.6666667
## 81  1.0000000 0.0000000
## 84  0.6666667 0.3333333
## 108 0.3333333 0.6666667
## 117 0.6666667 0.3333333

Values \(28\) and \(81\) are both particularly surprising because the associated predicted probabilities are in complete logical opposition to the actual diagnoses.

(2)

The Boston data set in the MASS package contains 506 observations on 14 variables.

(a)

Set the seed to 12345 and usecreateDataPartition with \(p=0.7\) to partition the data into training and testing sets. The response variable is medv.

set.seed(12345)
boston <- Boston
trainingIndices <- createDataPartition(boston$medv, p=0.7, list=FALSE, times = 1)
bostonTraining <- boston[trainingIndices, ]
bostonTesting <- boston[-trainingIndices, ]

(b)

Build a kNN regression model called kNN2 formedvusing all other variables as predictors. Use 10-fold crossvalidation but do not standardizethe predictors. Compute \(R^2\) for the model on the testing data.

kNN2 <- train(medv~., data=bostonTraining, method ="knn", trControl=trainControl(method = "cv", number = 10))
predicted_kNN2 <- predict(kNN2, newdata = bostonTesting)
rSquare_kNN2 <- cor(predicted_kNN2, bostonTesting$medv)^2
kNN2

## k-Nearest Neighbors 
## 
## 356 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 322, 320, 320, 320, 320, 320, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE     
##   5  6.501370  0.5328995  4.613295
##   7  6.621806  0.5028579  4.729864
##   9  6.629041  0.4998204  4.758269
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.

rSquare_kNN2

## [1] 0.489145

# Graphing to make sure that the predicted model follow the abline
plot(bostonTesting$medv, predicted_kNN2,
     main = "Predicted vs. Observed: kNN2 Model",
     xlab="medv",
     ylab="Predicted", 
     col = "Blue"
     )
abline(0,1, col = "red")

(c)

Build a kNN regression model called kNN3 for medv using all other variables as predictors. Use 10-fold crossvalidation and standardizethe predictors. Compute \(R^2\) for the model on the testing data.

kNN3 <- train(medv~., data=bostonTraining, method ="knn", trControl=trainControl(method = "cv", number = 10), preProcess=c( "center","scale"))
predicted_kNN3 <- predict(kNN3, newdata = bostonTesting)
rSquare_kNN3 <- cor(predicted_kNN3, bostonTesting$medv)^2
kNN3

## k-Nearest Neighbors 
## 
## 356 samples
##  13 predictor
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 321, 321, 321, 320, 320, 321, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE     
##   5  4.848538  0.7364466  3.162348
##   7  4.818407  0.7450562  3.180713
##   9  4.776094  0.7500352  3.106382
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.

rSquare_kNN3

## [1] 0.7969769

(d)

Comment on the effectiveness of standardizing the predictors.

Standardizing the predictors made the \(R^2\) of kNN3 higher than the \(R^2\) of kNN2, which means that kNN3 has a higher predicitive ability than kNN2.

(e)

What value of \(k\) is optimal for kNN3?

kNN3$finalModel$k

## [1] 9

(3)

The ChemicalManufacturingProcess data in the AppliedPredictiveModeling package contains \(176\) observations on \(58\) variables.

(a)

Determine the number of NAs in theChemicalManufacturingProcessdata.

data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess
sum(is.na(chem))

## [1] 106

(b)

Use k-nearest neighbors imputation with \(k = 5\) to replace the NAs.

chem <-knnImputation(chem, 5)

(c)

Set the seed to 12345 and use createDataPartition with \(p=0.7\) to partition the data into training and testing sets. The response variable isYield.

set.seed(12345)
trainingIndices <- createDataPartition(chem$Yield, p=0.7, list=FALSE, times = 1)
chemTraining <- chem[trainingIndices, ]
chemTesting <- chem[-trainingIndices, ]

(d)

Build a kNN regression model called kNN4 for Yield using all other variables as predictors. Use 10-fold crossvalidation, standardize the predictors, and have \(k\) run from 1 to 20. Compute \(R^2\) for the model on the testingdata.

kNN4 <- train(Yield~., data=chemTraining, method ="knn", trControl=trainControl(method = "cv", number = 10), preProcess=c( "center","scale"), tuneGrid=data.frame(.k=1:20))
predicted_kNN4 <- predict(kNN4, newdata = chemTesting)
rSquare_kNN4 <- cor(predicted_kNN4, chemTesting$Yield)^2
kNN4

## k-Nearest Neighbors 
## 
## 124 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 111, 111, 112, 112, 111, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE      
##    1  1.427711  0.4644080  1.1057308
##    2  1.365624  0.4725430  1.0805609
##    3  1.334300  0.4907510  1.0533397
##    4  1.225791  0.5761424  0.9866593
##    5  1.278859  0.5495754  1.0169051
##    6  1.308214  0.5229591  1.0356905
##    7  1.296904  0.5371365  1.0484542
##    8  1.309094  0.5141480  1.0572115
##    9  1.342208  0.4932970  1.1037261
##   10  1.376904  0.4587657  1.1308800
##   11  1.373152  0.4681376  1.1202925
##   12  1.397733  0.4467969  1.1362206
##   13  1.385955  0.4640259  1.1253542
##   14  1.377975  0.4761592  1.1187266
##   15  1.404477  0.4508466  1.1482355
##   16  1.414462  0.4372621  1.1581106
##   17  1.412914  0.4450490  1.1545351
##   18  1.435392  0.4250024  1.1703251
##   19  1.440372  0.4235065  1.1728085
##   20  1.444242  0.4274910  1.1778401
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 4.

rSquare_kNN4

## [1] 0.4344609

(e)

Graph \(R^2\) as a function of \(k\) and determine the optimal value of \(k\).

plot(kNN4$results$k, kNN4$results$Rsquared,
     main = "kNN4",
     xlab = "k",
     ylab = "rSquare",
     col = "Blue"
     )
lines(kNN4$results$k, kNN4$results$Rsquared,
      col = "Blue"
      )

kNN4$finalModel$k

## [1] 4

(4)

T:/Faculty & Staff Alphabetical/M/McDevittt/Public/MA463/TechStocks.xlsx contains closing prices for 9 different technology stocks for the years 2013-2014. Let \(f_{j}\) be the fraction of money invested in each stock j , j= 1,2,…,9. Note that a negative fraction indicates a short sale. Our goal is to generate the average return forthe portfolio with minimal risk by solving the quadratic program minimize fTΣf subject to f1+f2+…f9 = 1,where f = [f1f2… f9]T and Σ is the (annualized) covariance matrix of the returns. The mean return on the portfolio isfTμ, where μ is a vector containing the (annualized) mean returns for all of the stocks.

(a)

Let the closing price for a given stock on day t be St. For each stock, compute the returns ln((St+1)/St), their (annualized) means μ, and the (annualized) covariance matrix Σ.

TechStocks <- read.xls("TechStocks.xlsx", sheet = 1, header=TRUE)
TechStocks$Date <- NULL

x <- TechStocks[c(1:9)]
logRatio = function(x) { log(x[2:length(x)]/x[1:length(x)-1]) }
returns = as.data.frame(apply(TechStocks, 2, logRatio))
annualizedMeans = colMeans(returns)*252
annualizedCovMatrix = cov(returns)*252

(b)

Write a function that computes the (annualized) mean return for the minimal-risk portfolio and compute thecorresponding mean return for the given data. You may find that thequadprogpackage is helpful.

dvec <- matrix(0,9)
amat <- matrix(1,9,1)
b <- c(1)
solution <- solve.QP(annualizedCovMatrix, dvec, amat, b, meq=1)
meanReturn = sum(solution$solution * annualizedMeans)
meanReturn

## [1] 0.1837864

(c)

Write a function that resamples the returns and recomputes the minimal-risk mean return.

SampleCol = function(col) { sample(col, length(col), replace=TRUE) }

ResampleReturn = function(returns) {
  reSamp = as.data.frame(apply(returns, 2, SampleCol))
  annualizedMeans = colMeans(reSamp)*252
  annualizedCovMatrix = cov(reSamp)*252
  dvec = rep(0,9)
  amat = matrix(1, 9, 1)
  b = c(1)
  solution = solve.QP(annualizedCovMatrix, dvec, amat, b, meq = 1 )
  meanReturn = sum(solution$solution*annualizedMeans)
}

(d)

Set the seed to 12345. Use 1000 resamples to generate a histogram for the minimal-risk mean return and computea 95% confidence interval for the minimal-risk mean return.

set.seed(12345)
MinRiskMeanReturn = sort(replicate(1000, ResampleReturn(returns)))
confidence.level = 0.95
lowerlevel = (1-confidence.level)/2
upperlevel = confidence.level + (1-confidence.level)/2
ci <- quantile(MinRiskMeanReturn, probs = c(lowerlevel, upperlevel))
ci

##      2.5%     97.5% 
## 0.1086617 0.3265208

hist(MinRiskMeanReturn,
     main = "Minimum Risk Mean Return",
     col = rgb(0,1,1,1/4))

DAT 315 Machine Learning Project 3

Eden Biskin

Dannielle Kerstetter

Tuyen Le

2/14/2019

Problem 1

(a)

(b)

(c)

(d)

(e)

(f)

(2)

(a)

(b)

(c)

(d)

(e)

(3)

(a)

(b)

(c)

(d)

(e)

(4)

(a)

(b)

(c)

(d)