This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ISLR)
library(car)
## Loading required package: carData
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
library(leaps)
#question 1 Load the “Breast Cancer Wisconsin Diagnostic” dataset from http://archive.ics.uci.edu/ml. into R. There are 569 observations on 32 variables. Be sure to include appropriate variable names.
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",destfile = "./Project3Q1")
renamecol <- c("id","diagnosis","radiusMean","textureMean","perimeterMean","areaMean","smoothnessMean","compactnessMean","concavityMean","concave_ptsMean",
"symmetryMean","fractal_dimMean","radiusSE","textureSE","perimeterSE","areaSE","smoothnessSE","compactnessSE","concavitySE","concave_ptsSE",
"symmetrySE","fractal_dimSE","radiusWorst","textureWorst","perimeterWorst","areaWorst","smoothnessWorst","compactnessWorst","concavityWorst",
"concave_ptsWorst","symmetryWorst","fractal_dimWorst")
BCWDdata <- read.csv("./Project3Q1", col.names = renamecol)
The first column is a patient identification number. Remove it from the data
BCWDdata$id <- NULL
Set the seed to 12345 and use createDataPartition with p=0.7 to partition the data into training and testing sets. The response variable is diagnosis.
set.seed(12345)
trainingIndex <- createDataPartition(BCWDdata$diagnosis, p = 0.7, list = FALSE)
training <- BCWDdata[trainingIndex, ]
testing <- BCWDdata[-trainingIndex, ]
Build a kNN classification model (response = diagnosis) called kNN1 using train with optional arguments trControl=trainControl(method=“cv”,number=10)(forcross-validation),preProcess=c(“center”,“scale”) (to resize the predictors), and tuneGrid=data.frame(.k=1:20) (to try values of k from 1 to 20).
knn1 <- train(diagnosis~., data = training, method = "knn", trControl = trainControl(method = "cv", number = 10), preProcess = c("center","scale"), tuneGrid=data.frame(.k=1:20))
knn1
## k-Nearest Neighbors
##
## 398 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## Pre-processing: centered (30), scaled (30)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 358, 358, 359, 358, 358, 359, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9724359 0.9404060
## 2 0.9650000 0.9248715
## 3 0.9624359 0.9191981
## 4 0.9574359 0.9073848
## 5 0.9549359 0.9016087
## 6 0.9649359 0.9229776
## 7 0.9549359 0.9007671
## 8 0.9649359 0.9233086
## 9 0.9573718 0.9064154
## 10 0.9549359 0.9011837
## 11 0.9673718 0.9283293
## 12 0.9648718 0.9230720
## 13 0.9598718 0.9113476
## 14 0.9548718 0.9003926
## 15 0.9548718 0.8996137
## 16 0.9548718 0.8996137
## 17 0.9548718 0.8996137
## 18 0.9548718 0.8996137
## 19 0.9548718 0.8996137
## 20 0.9548718 0.8996137
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.
Graph the accuracy as a function of k and determine the optimal value of k.
plot(knn1, main = "knn1", xlab = "k", ylab = "accuracy k")
# Q1 e Evaluate the model by generating a confusion matrix for the model applied to the testing data. If H0 : Tissue is benign (B),then what does the confusion matrix predict for α = P (Type I error) and β = P (Type II error)?
predicted <- predict(knn1, newdata = testing)
confusionMatrix(predicted, testing$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 100 9
## M 7 54
##
## Accuracy : 0.9059
## 95% CI : (0.8517, 0.9452)
## No Information Rate : 0.6294
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7969
##
## Mcnemar's Test P-Value : 0.8026
##
## Sensitivity : 0.9346
## Specificity : 0.8571
## Pos Pred Value : 0.9174
## Neg Pred Value : 0.8852
## Prevalence : 0.6294
## Detection Rate : 0.5882
## Detection Prevalence : 0.6412
## Balanced Accuracy : 0.8959
##
## 'Positive' Class : B
##
for the typeI error α = 7/(100+7) ≈0.0654 for the typeII error β = 9/(54+9) ≈0.1429
In the cases where the model fails on the testing data, what are the associated probabilities? Any surprises?
probabilities <- predict(knn1, newdata = testing, type = 'prob')
probabilities[predicted!=testing$diagnosis,]
## B M
## 16 1 0
## 17 1 0
## 18 1 0
## 28 0 1
## 30 0 1
## 33 1 0
## 45 1 0
## 46 0 1
## 63 1 0
## 70 0 1
## 96 1 0
## 129 0 1
## 146 0 1
## 147 1 0
## 149 0 1
## 160 1 0
there is no any surprising in this testing data
The Boston data set in the MASS package contains 506 observations on 14 variables.
library(MASS)
Set the seed to 12345 and use createDataPartition with p=0.7 to partition the data into training and testing sets. The response variable is medv.
set.seed(12345)
boston <- Boston
BtrainingIndex <- createDataPartition(boston$medv, p = 0.7, list = FALSE)
Btraining <- boston[BtrainingIndex, ]
Btesting <- boston[-BtrainingIndex, ]
Build a kNN regression model called kNN2 for medv using all other variables as predictors. Use 10-fold cross validation but do not standardize the predictors. Compute R2 for the model on the testing data.
knn2 <- train(medv~., data = Btraining, method = 'knn', trControl = trainControl(method = "cv", number = 10))
predicted_knn2 <- predict(knn2, newdata = Btesting)
Rsquare_knn2 <- cor(predicted_knn2, Btesting$medv)^2
knn2
## k-Nearest Neighbors
##
## 356 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 321, 320, 321, 320, 321, 321, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 6.680476 0.4944616 4.754010
## 7 6.858160 0.4690622 4.875427
## 9 7.114447 0.4277455 5.065073
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
Rsquare_knn2
## [1] 0.5261986
Build a kNN regression model called kNN3 for medv using all other variables as predictors. Use 10-fold cross validation and standardize the predictors. Compute R2 for the model on the testing data.
KNN3 <- train(medv~., data = Btraining, method = 'knn', trControl = trainControl(method = "cv", number = 10), preProcess = c("center","scale"))
predicted_KNN3 <- predict(KNN3, newdata = Btesting)
Rsquare_KNN3 <- cor(predicted_KNN3, Btesting$medv)^2
KNN3
## k-Nearest Neighbors
##
## 356 samples
## 13 predictor
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 320, 320, 319, 321, 320, 321, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 5.062430 0.7072873 3.280971
## 7 5.107461 0.7094686 3.302478
## 9 5.165949 0.7105568 3.350741
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
Rsquare_KNN3
## [1] 0.8391416
comment on the effectiveness of standardizing the predictors.
For the Rsquare I get from KNN3 that is greater than the Rsquare I get from knn2, which means KNN3 will be easier to predict than knn2.
What value of k is optimal for kNN3?
According to question c, the final value used for the model was k = 9.
The ChemicalManufacturingProcess data in the AppliedPredictiveModeling package contains 176 observations on 58 variables.
library(AppliedPredictiveModeling)
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
data("ChemicalManufacturingProcess")
CMP <- ChemicalManufacturingProcess
Determine the number of NAs in the ChemicalManufacturingProcess data.
sum(is.na(CMP))
## [1] 106
Use k-nearest neighbors imputation with k = 5 to replace the NAs.
CMP <- knnImputation(CMP, k = 5)
Set the seed to 12345 and use createDataPartition with p=0.7 to partition the data into training and testing sets. The response variable is Yield.
set.seed(12345)
CtrainingIndex <- createDataPartition(CMP$Yield, p = 0.7, list = FALSE)
Ctraining <- CMP[CtrainingIndex, ]
Ctesting <- CMP[-CtrainingIndex, ]
Build a kNN regression model called kNN4 for Yield using all other variables as predictors. Use 10-fold cross validation, standardize the predictors, and have k run from 1 to 20. Compute R2 for the model on the testing data.
knn4 <- train(Yield~., data = Ctraining, method = 'knn', trControl = trainControl(method = "cv", number = 10), preProcess = c("center","scale"), tuneGrid=data.frame(.k=1:20))
predicted_knn4 <- predict(knn4, newdata = Ctesting)
Rsquare_knn4 <- cor(predicted_knn4, Ctesting$Yield)^2
knn4
## k-Nearest Neighbors
##
## 124 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 112, 112, 112, 112, 110, 110, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 1.329002 0.5612418 0.9937976
## 2 1.204573 0.6107339 0.9574643
## 3 1.179339 0.6225532 0.9317937
## 4 1.242031 0.5903446 0.9555625
## 5 1.257073 0.5745020 0.9815190
## 6 1.262658 0.5748186 0.9845496
## 7 1.253990 0.5820202 0.9926125
## 8 1.289082 0.5572460 1.0106860
## 9 1.295272 0.5620830 1.0201786
## 10 1.290987 0.5652431 1.0112990
## 11 1.293758 0.5625803 1.0169033
## 12 1.290825 0.5673890 1.0221247
## 13 1.310573 0.5531906 1.0448777
## 14 1.314760 0.5570245 1.0626735
## 15 1.322895 0.5509255 1.0701576
## 16 1.323631 0.5595409 1.0668292
## 17 1.335317 0.5550728 1.0754575
## 18 1.343720 0.5503510 1.0802472
## 19 1.358229 0.5393939 1.0902423
## 20 1.361988 0.5368745 1.0898083
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
Rsquare_knn4
## [1] 0.2318765
Graph R2 as a function of k and determine the optimal value of k.
plot(knn4$results$k,knn4$results$Rsquared, main = 'knn4', xlab = 'k', ylab = 'R^2', col = 'blue')
lines(knn4$results$k,knn4$results$Rsquared, col = 'blue')
# question 4 T:/Faculty & Staff Alphabetical/M/McDevittt/Public/MA463/TechStocks.xlsx contains closing prices for 9 different technology stocks for the years 2013-2014. Let fj be the fraction of money invested in each stock j, j = 1, 2, . . . , 9. Note that a negative fraction indicates a short sale. Our goal is to generate the average return for the portfolio with minimal risk by solving the quadratic program minimize f T Σf subject to f1 + f2 + . . . f9 = 1, where f = f1 f2 . . . f9T and Σ is the (annualized) covariance matrix of the returns. The mean return on the T1 portfolio is f μ, where μ is a vector containing the (annualized) mean returns for all of the stocks.
library('readxl')
library('quadprog')
Input the data
tstock <- read_xlsx("/Users/anmeicao/Desktop/TechStocks.xlsx", sheet = 1, col_names = TRUE)
tstock$Date <- NULL
Let the closing price for a given stock on day t be St. For each stock, compute the returns ln S , the (annualized) means μ, and the (annualized) covariance matrix Σ.
x <- tstock[c(1:9)]
logRatio = function(x) {
log(x[2:length(x)]/x[1:length(x)-1])
}
returns = as.data.frame(apply(tstock, 2, logRatio))
mu = colMeans(returns)*252
covmatrix = cov(returns)*252
Write a function that computes the (annualized) mean return for the minimal-risk portfolio and compute the corresponding mean return for the given data. You may find that the quadprog package is helpful.
dvec <- matrix(0,9)
amat <- matrix(1,9,1)
b <- c(1)
sol <- solve.QP(covmatrix, dvec, amat, b, meq = 1)
returnmean = sum(sol$solution * mu)
returnmean
## [1] 0.1837864
Write a function that resamples the returns and recomputes the minimal-risk mean return.
samplecol <- function(col) {sample(col, length(col), replace = TRUE)}
resamplereturn <- function(returns)
{
resamp = as.data.frame(apply(returns, 2, samplecol))
mu = colMeans(resamp)*252
covmatrix = cov(resamp)*252
dvec <- matrix(0,9)
amat <- matrix(1,9,1)
b <- c(1)
sol <- solve.QP(covmatrix, dvec, amat, b, meq = 1)
returnmean = sum(sol$solution * mu)
}
Set the seed to 12345. Use 1000 resamples to generate a histogram for the minimal-risk mean return and compute a 95% confidence interval for the minimal-risk mean return.
set.seed(12345)
meanreturn <- sort(replicate(1000, resamplereturn(returns)))
c_level = 0.95
lower = (1-c_level)/2
upper = c_level + (1-c_level)/2
CI <- quantile(meanreturn, probs = c(lower, upper))
CI
## 2.5% 97.5%
## 0.09912642 0.32594109
hist(meanreturn, main = 'minimal-risk mean return')