R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ISLR)
library(car)
## Loading required package: carData
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
library(leaps)

Question 1

Load the “Breast Cancer Wisconsin Diagnostic” dataset from http://archive.ics.uci.edu/ml into R. There are 569 observations on 32 variables. Be sure to include appropriate variable names.

download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",destfile = "./Project3Q1")

renamecol <- c("id","diagnosis","radiusMean","textureMean","perimeterMean","areaMean","smoothnessMean","compactnessMean","concavityMean","concave_ptsMean",
               "symmetryMean","fractal_dimMean","radiusSE","textureSE","perimeterSE","areaSE","smoothnessSE","compactnessSE","concavitySE","concave_ptsSE",
               "symmetrySE","fractal_dimSE","radiusWorst","textureWorst","perimeterWorst","areaWorst","smoothnessWorst","compactnessWorst","concavityWorst",
               "concave_ptsWorst","symmetryWorst","fractal_dimWorst")
# wdbc.data has no header row; header = FALSE keeps all 569 observations
BCWDdata <- read.csv("./Project3Q1", header = FALSE, col.names = renamecol)
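
A quick sanity check (not part of the assignment) that all of the observations and variables came through:

dim(BCWDdata)   # expect 569 rows and 32 columns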

Q1 a

The first column is a patient identification number. Remove it from the data.

BCWDdata$id <- NULL

Q1 b

Set the seed to 12345 and use createDataPartition with p=0.7 to partition the data into training and testing sets. The response variable is diagnosis.

set.seed(12345)
trainingIndex <- createDataPartition(BCWDdata$diagnosis, p = 0.7, list = FALSE)
training <- BCWDdata[trainingIndex, ]
testing <- BCWDdata[-trainingIndex, ]
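
createDataPartition samples within each level of the outcome, so as an illustrative check the class proportions should be nearly identical across the two sets:

prop.table(table(training$diagnosis))   # B/M proportions in the training set
prop.table(table(testing$diagnosis))    # should closely match the line above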

Q1 c

Build a kNN classification model (response = diagnosis) called kNN1 using train with optional arguments trControl = trainControl(method = "cv", number = 10) (for cross-validation), preProcess = c("center", "scale") (to rescale the predictors), and tuneGrid = data.frame(.k = 1:20) (to try values of k from 1 to 20).

knn1 <- train(diagnosis~., data = training, method = "knn", trControl = trainControl(method = "cv", number = 10), preProcess = c("center","scale"), tuneGrid=data.frame(.k=1:20))
knn1
## k-Nearest Neighbors 
## 
## 398 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## Pre-processing: centered (30), scaled (30) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 358, 358, 359, 358, 358, 359, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9724359  0.9404060
##    2  0.9650000  0.9248715
##    3  0.9624359  0.9191981
##    4  0.9574359  0.9073848
##    5  0.9549359  0.9016087
##    6  0.9649359  0.9229776
##    7  0.9549359  0.9007671
##    8  0.9649359  0.9233086
##    9  0.9573718  0.9064154
##   10  0.9549359  0.9011837
##   11  0.9673718  0.9283293
##   12  0.9648718  0.9230720
##   13  0.9598718  0.9113476
##   14  0.9548718  0.9003926
##   15  0.9548718  0.8996137
##   16  0.9548718  0.8996137
##   17  0.9548718  0.8996137
##   18  0.9548718  0.8996137
##   19  0.9548718  0.8996137
##   20  0.9548718  0.8996137
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.

Q1 d

Graph the accuracy as a function of k and determine the optimal value of k.

plot(knn1, main = "knn1", xlab = "k", ylab = "accuracy")

Cross-validated accuracy peaks at k = 1 (0.9724), so k = 1 is the optimal value.

Q1 e

Evaluate the model by generating a confusion matrix for the model applied to the testing data. If H0: tissue is benign (B), then what does the confusion matrix predict for α = P(Type I error) and β = P(Type II error)?

predicted <- predict(knn1, newdata = testing)
confusionMatrix(predicted, testing$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 100   9
##          M   7  54
##                                           
##                Accuracy : 0.9059          
##                  95% CI : (0.8517, 0.9452)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7969          
##                                           
##  Mcnemar's Test P-Value : 0.8026          
##                                           
##             Sensitivity : 0.9346          
##             Specificity : 0.8571          
##          Pos Pred Value : 0.9174          
##          Neg Pred Value : 0.8852          
##              Prevalence : 0.6294          
##          Detection Rate : 0.5882          
##    Detection Prevalence : 0.6412          
##       Balanced Accuracy : 0.8959          
##                                           
##        'Positive' Class : B               
## 

With H0: tissue is benign, a Type I error is predicting M for a truly benign tissue, and a Type II error is predicting B for a truly malignant one. From the confusion matrix, α = 7/(100+7) ≈ 0.0654 and β = 9/(54+9) ≈ 0.1429.
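
As an illustrative cross-check (using the objects already defined above), the same rates can be pulled straight out of the table:

cm <- table(Predicted = predicted, Actual = testing$diagnosis)
alpha <- cm["M", "B"] / sum(cm[, "B"])   # predicted M when truly B: 7/107
beta  <- cm["B", "M"] / sum(cm[, "M"])   # predicted B when truly M: 9/63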

Q1 f

In the cases where the model fails on the testing data, what are the associated probabilities? Any surprises?

probabilities <- predict(knn1, newdata = testing, type = 'prob')
probabilities[predicted!=testing$diagnosis,]
##     B M
## 16  1 0
## 17  1 0
## 18  1 0
## 28  0 1
## 30  0 1
## 33  1 0
## 45  1 0
## 46  0 1
## 63  1 0
## 70  0 1
## 96  1 0
## 129 0 1
## 146 0 1
## 147 1 0
## 149 0 1
## 160 1 0

Nothing here is surprising once you recall that the selected model uses k = 1: with a single nearest neighbor, every predicted probability is exactly 0 or 1, so each misclassified case was predicted with complete (but wrong) confidence.
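
A one-line check (illustrative only) that every predicted probability is degenerate under k = 1:

all(as.matrix(probabilities) %in% c(0, 1))   # TRUE: k = 1 yields only 0/1 votes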

Question 2

The Boston data set in the MASS package contains 506 observations on 14 variables.

library(MASS)

Q2 a

Set the seed to 12345 and use createDataPartition with p=0.7 to partition the data into training and testing sets. The response variable is medv.

set.seed(12345)
boston <- Boston
BtrainingIndex <- createDataPartition(boston$medv, p = 0.7, list = FALSE)
Btraining <- boston[BtrainingIndex, ]
Btesting <- boston[-BtrainingIndex, ]

Q2 b

Build a kNN regression model called kNN2 for medv using all other variables as predictors. Use 10-fold cross validation but do not standardize the predictors. Compute R2 for the model on the testing data.

knn2 <- train(medv~., data = Btraining, method = 'knn', trControl = trainControl(method = "cv", number = 10))
predicted_knn2 <- predict(knn2, newdata = Btesting)
Rsquare_knn2 <- cor(predicted_knn2, Btesting$medv)^2

knn2
## k-Nearest Neighbors 
## 
## 356 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 321, 320, 321, 320, 321, 321, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE     
##   5  6.680476  0.4944616  4.754010
##   7  6.858160  0.4690622  4.875427
##   9  7.114447  0.4277455  5.065073
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
Rsquare_knn2
## [1] 0.5261986
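
Equivalently (a hedged cross-check, not required by the question), caret's postResample reports RMSE, R², and MAE on the holdout in one call:

postResample(pred = predicted_knn2, obs = Btesting$medv)   # Rsquared ≈ 0.526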

Q2 c

Build a kNN regression model called kNN3 for medv using all other variables as predictors. Use 10-fold cross validation and standardize the predictors. Compute R2 for the model on the testing data.

KNN3 <- train(medv~., data = Btraining, method = 'knn', trControl = trainControl(method = "cv", number = 10), preProcess = c("center","scale"))
predicted_KNN3 <- predict(KNN3, newdata = Btesting)
Rsquare_KNN3 <- cor(predicted_KNN3, Btesting$medv)^2

KNN3
## k-Nearest Neighbors 
## 
## 356 samples
##  13 predictor
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 320, 320, 319, 321, 320, 321, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE     
##   5  5.062430  0.7072873  3.280971
##   7  5.107461  0.7094686  3.302478
##   9  5.165949  0.7105568  3.350741
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
Rsquare_KNN3
## [1] 0.8391416

Q2 d

Comment on the effectiveness of standardizing the predictors.

The R² on the testing data for KNN3 (0.839) is much larger than for knn2 (0.526). Standardizing the predictors is very effective here: kNN relies on Euclidean distance, and without standardization the predictors with the largest numeric ranges dominate the distance calculation.
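
To see why (an illustrative check), compare the raw standard deviations of the Boston predictors; variables on large scales, such as tax, swamp the distance when the data are left unscaled:

round(apply(Btraining[, names(Btraining) != "medv"], 2, sd), 2)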

Q2 e

What value of k is optimal for kNN3?

According to the output in part (c), the final value used for the model was k = 5.
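
The same information can be read off programmatically from the fitted train object:

KNN3$bestTune   # the k selected by cross-validation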

Question 3

The ChemicalManufacturingProcess data in the AppliedPredictiveModeling package contains 176 observations on 58 variables.

library(AppliedPredictiveModeling)
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
data("ChemicalManufacturingProcess")
CMP <- ChemicalManufacturingProcess

Q3 a

Determine the number of NAs in the ChemicalManufacturingProcess data.

sum(is.na(CMP))
## [1] 106

Q3 b

Use k-nearest neighbors imputation with k = 5 to replace the NAs.

CMP <- knnImputation(CMP, k = 5)
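
A quick confirmation (illustrative) that the imputation removed every NA:

sum(is.na(CMP))   # should now be 0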

Q3 c

Set the seed to 12345 and use createDataPartition with p=0.7 to partition the data into training and testing sets. The response variable is Yield.

set.seed(12345)
CtrainingIndex <- createDataPartition(CMP$Yield, p = 0.7, list = FALSE)
Ctraining <- CMP[CtrainingIndex, ]
Ctesting <- CMP[-CtrainingIndex, ]

Q3 d

Build a kNN regression model called kNN4 for Yield using all other variables as predictors. Use 10-fold cross validation, standardize the predictors, and have k run from 1 to 20. Compute R2 for the model on the testing data.

knn4 <- train(Yield~., data = Ctraining, method = 'knn', trControl = trainControl(method = "cv", number = 10), preProcess = c("center","scale"), tuneGrid=data.frame(.k=1:20))
predicted_knn4 <- predict(knn4, newdata = Ctesting)
Rsquare_knn4 <- cor(predicted_knn4, Ctesting$Yield)^2

knn4
## k-Nearest Neighbors 
## 
## 124 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 112, 112, 112, 110, 110, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE      
##    1  1.329002  0.5612418  0.9937976
##    2  1.204573  0.6107339  0.9574643
##    3  1.179339  0.6225532  0.9317937
##    4  1.242031  0.5903446  0.9555625
##    5  1.257073  0.5745020  0.9815190
##    6  1.262658  0.5748186  0.9845496
##    7  1.253990  0.5820202  0.9926125
##    8  1.289082  0.5572460  1.0106860
##    9  1.295272  0.5620830  1.0201786
##   10  1.290987  0.5652431  1.0112990
##   11  1.293758  0.5625803  1.0169033
##   12  1.290825  0.5673890  1.0221247
##   13  1.310573  0.5531906  1.0448777
##   14  1.314760  0.5570245  1.0626735
##   15  1.322895  0.5509255  1.0701576
##   16  1.323631  0.5595409  1.0668292
##   17  1.335317  0.5550728  1.0754575
##   18  1.343720  0.5503510  1.0802472
##   19  1.358229  0.5393939  1.0902423
##   20  1.361988  0.5368745  1.0898083
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
Rsquare_knn4
## [1] 0.2318765

Q3 e

Graph R2 as a function of k and determine the optimal value of k.

plot(knn4$results$k, knn4$results$Rsquared, type = "b", main = "knn4", xlab = "k", ylab = "R^2", col = "blue")

Cross-validated R² is largest at k = 3 (0.623), which agrees with the RMSE-based selection above, so k = 3 is optimal.

Question 4

T:/Faculty & Staff Alphabetical/M/McDevittt/Public/MA463/TechStocks.xlsx contains closing prices for 9 different technology stocks for the years 2013-2014. Let fj be the fraction of money invested in each stock j, j = 1, 2, . . . , 9. Note that a negative fraction indicates a short sale. Our goal is to generate the average return for the portfolio with minimal risk by solving the quadratic program

minimize f^T Σ f subject to f1 + f2 + . . . + f9 = 1,

where f = (f1, f2, . . . , f9)^T and Σ is the (annualized) covariance matrix of the returns. The mean return on the portfolio is f^T μ, where μ is a vector containing the (annualized) mean returns for all of the stocks.

library(readxl)
library(quadprog)

Input the data

tstock <- read_xlsx("/Users/anmeicao/Desktop/TechStocks.xlsx", sheet = 1, col_names = TRUE)
tstock$Date <- NULL

Q4 a

Let the closing price for a given stock on day t be St. For each stock, compute the returns ln(St/St−1), the (annualized) means μ, and the (annualized) covariance matrix Σ.

# Daily log returns: ln(S_t / S_{t-1}), computed column by column
logRatio <- function(x) {
  log(x[2:length(x)] / x[1:(length(x) - 1)])
}
returns <- as.data.frame(apply(tstock, 2, logRatio))
mu <- colMeans(returns) * 252     # annualize using 252 trading days
covmatrix <- cov(returns) * 252
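
As a purely illustrative cross-check, the same log returns fall out of diff() applied to the log prices:

returns_alt <- diff(as.matrix(log(tstock)))   # ln(S_t) - ln(S_{t-1})
max(abs(as.matrix(returns) - returns_alt))    # should be essentially 0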

Q4 b

Write a function that computes the (annualized) mean return for the minimal-risk portfolio and compute the corresponding mean return for the given data. You may find that the quadprog package is helpful.

dvec <- matrix(0, 9)     # no linear term: the objective is just f^T Σ f
amat <- matrix(1, 9, 1)  # one constraint row: the weights...
b <- c(1)                # ...must sum to 1

sol <- solve.QP(covmatrix, dvec, amat, b, meq = 1)  # meq = 1 makes it an equality
returnmean <- sum(sol$solution * mu)                # portfolio mean return f^T μ

returnmean
## [1] 0.1837864
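
The prompt asks for a function; a minimal sketch wrapping the code above (the name minRiskReturn is just illustrative) could be:

# Illustrative wrapper: minimal-risk mean return for a given mu and Sigma
minRiskReturn <- function(mu, covmatrix) {
  n <- length(mu)
  sol <- solve.QP(Dmat = covmatrix, dvec = rep(0, n),
                  Amat = matrix(1, n, 1), bvec = 1, meq = 1)
  sum(sol$solution * mu)   # f^T mu at the minimum-variance weights
}
minRiskReturn(mu, covmatrix)   # should reproduce returnmean above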

Q4 c

Write a function that resamples the returns and recomputes the minimal-risk mean return.

# Resample one stock's daily returns with replacement
samplecol <- function(col) sample(col, length(col), replace = TRUE)

# Resample the returns and recompute the minimal-risk mean return
resamplereturn <- function(returns)
{
  resamp <- as.data.frame(apply(returns, 2, samplecol))
  mu <- colMeans(resamp) * 252
  covmatrix <- cov(resamp) * 252

  dvec <- matrix(0, 9)
  amat <- matrix(1, 9, 1)
  b <- c(1)

  sol <- solve.QP(covmatrix, dvec, amat, b, meq = 1)
  sum(sol$solution * mu)   # return the resampled minimal-risk mean return
}

Q4 d

Set the seed to 12345. Use 1000 resamples to generate a histogram for the minimal-risk mean return and compute a 95% confidence interval for the minimal-risk mean return.

set.seed(12345)
meanreturn <- sort(replicate(1000, resamplereturn(returns)))
c_level = 0.95
lower = (1-c_level)/2
upper = c_level + (1-c_level)/2
CI <- quantile(meanreturn, probs = c(lower, upper))

CI
##       2.5%      97.5% 
## 0.09912642 0.32594109
hist(meanreturn, main = 'minimal-risk mean return')
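
To mark the interval on the histogram (optional, illustrative; run in the same chunk so the plot device is still active):

abline(v = CI, lty = 2)   # dashed lines at the 2.5% and 97.5% quantiles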