Project 3

Problem 1

A.

wdbc <- fread('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')
head(wdbc)

##          V1 V2    V3    V4     V5     V6      V7      V8     V9     V10
## 1:   842302  M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710
## 2:   842517  M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017
## 3: 84300903  M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790
## 4: 84348301  M 11.42 20.38  77.58  386.1 0.14250 0.28390 0.2414 0.10520
## 5: 84358402  M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430
## 6:   843786  M 12.45 15.70  82.57  477.1 0.12780 0.17000 0.1578 0.08089
##       V11     V12    V13    V14   V15    V16      V17     V18     V19
## 1: 0.2419 0.07871 1.0950 0.9053 8.589 153.40 0.006399 0.04904 0.05373
## 2: 0.1812 0.05667 0.5435 0.7339 3.398  74.08 0.005225 0.01308 0.01860
## 3: 0.2069 0.05999 0.7456 0.7869 4.585  94.03 0.006150 0.04006 0.03832
## 4: 0.2597 0.09744 0.4956 1.1560 3.445  27.23 0.009110 0.07458 0.05661
## 5: 0.1809 0.05883 0.7572 0.7813 5.438  94.44 0.011490 0.02461 0.05688
## 6: 0.2087 0.07613 0.3345 0.8902 2.217  27.19 0.007510 0.03345 0.03672
##        V20     V21      V22   V23   V24    V25    V26    V27    V28    V29
## 1: 0.01587 0.03003 0.006193 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119
## 2: 0.01340 0.01389 0.003532 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416
## 3: 0.02058 0.02250 0.004571 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504
## 4: 0.01867 0.05963 0.009208 14.91 26.50  98.87  567.7 0.2098 0.8663 0.6869
## 5: 0.01885 0.01756 0.005115 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000
## 6: 0.01137 0.02165 0.005082 15.47 23.75 103.40  741.6 0.1791 0.5249 0.5355
##       V30    V31     V32
## 1: 0.2654 0.4601 0.11890
## 2: 0.1860 0.2750 0.08902
## 3: 0.2430 0.3613 0.08758
## 4: 0.2575 0.6638 0.17300
## 5: 0.1625 0.2364 0.07678
## 6: 0.1741 0.3985 0.12440

wdbc$V1 <- NULL

Removed the patient ID column from the data set.

B.

set.seed(12345)
dp <- createDataPartition(wdbc$V2, p=0.7, list=FALSE)
training <- wdbc[dp,]
testing <- wdbc[-dp,]

Set the seed for ease of reproduction, created training and testing data partitions.

C.

kNN1 <- train(V2~., method = "knn", data = training, trControl = trainControl(method = "cv", number = 10), preProcess = c("center","scale"), tuneGrid = data.frame(.k=1:20))

Made a k-nearest neighbors model testing values from 1-20 for k. Centered and scaled the predictors.

D.

plot(kNN1$results$k, kNN1$results$Accuracy, xlab = "k", ylab = "Accuracy")

As seen in the plot, when k=6 it has the highest accuracy, making 6 the optimal value for k.

E.

kNN1.predicted <- predict(kNN1, newdata = testing)
confusionMatrix(kNN1.predicted, testing$V2)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 103   5
##          M   4  58
##                                           
##                Accuracy : 0.9471          
##                  95% CI : (0.9019, 0.9755)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8861          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9626          
##             Specificity : 0.9206          
##          Pos Pred Value : 0.9537          
##          Neg Pred Value : 0.9355          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6059          
##    Detection Prevalence : 0.6353          
##       Balanced Accuracy : 0.9416          
##                                           
##        'Positive' Class : B               
##

Made a predicted model based off of kNN1 using the testing data. Then generated a confusion matrix comparing the predicted values from KNN1.predicted and the actual values from the testing data.

Based on the confusion matrix: Alpha (Type I Error) = 5/63 Beta (Type II Error) = 4/107

F.

5/63

## [1] 0.07936508

4/107

## [1] 0.03738318

The failures of the testing data are shown above, Type I Error probability = 7.94% and Type II Error probability = 3.74%. Both are less than 8% which is not too surprising as the predicted model seems to be rather accurate.

Problem 2

A.

set.seed(12345)
dp2 <- createDataPartition(Boston$medv, p=0.7, list=FALSE)
training <- Boston[dp2,]
testing <- Boston[-dp2,]

Set seed and created testing and training partitions of the Boston data.

B.

kNN2 <- train(medv~., method = "knn", data = training, trControl = trainControl(method = "cv", number = 10))
cor(testing$medv, predict(kNN2, newdata = testing))^2

## [1] 0.489145

Created a k-nearest neighbors model for the medv variable using all other variables as predictors. kNN2 did not center and scale the predictors.

As seen above, R² using the testing data with kNN2 is rather low at 0.489145.

C.

kNN3 <- train(medv~., method = "knn", data = training, trControl = trainControl(method = "cv", number = 10), preProcess = c("center","scale"))
cor(testing$medv, predict(kNN3, newdata = testing))^2

## [1] 0.7969769

Created another k-nearest neighbors model for the medv variable using all other variables as predictors. This time the predictors were centered and scaled.

As seen above, R² using the testing data with kNN3 is much higher 0.7969769.

D.

Scaling and centering the predictors made a huge difference as seen by the R² values produced from the respective models above. In kNN3, where the predictors were scaled and centered, the predictive ability of the model was much higher than kNN2.

E.

kNN3$results

##   k     RMSE  Rsquared      MAE   RMSESD RsquaredSD     MAESD
## 1 5 4.848538 0.7364466 3.162348 1.152061  0.1465797 0.4490481
## 2 7 4.818407 0.7450562 3.180713 1.124558  0.1199281 0.3722370
## 3 9 4.776094 0.7500352 3.106382 1.336867  0.1266857 0.4753765

Based off the information above, 9 is the optimal value as the RMSE for k=9 is the lowest of the three given.

Problem 3

A.

data("ChemicalManufacturingProcess")
CMP <- ChemicalManufacturingProcess
sum(is.na(CMP))

## [1] 106

Calculated the number of “NA” entries in CMP. There is 106 such entries.

B.

CMP <- knnImputation(CMP, k=5)

Replaced the NA entries using the knnImputation function with k=5.

C.

set.seed(12345)
dp3 <- createDataPartition(CMP$Yield, p=0.7, list=FALSE)
training <- CMP[dp3,]
testing <- CMP[-dp3,]

Set the seed and created testing and training data partitions using the newly “NA”-free dataset.

D.

kNN4 <- train(Yield~., method = "knn", data = training, trControl = trainControl(method = "cv", number = 10), preProcess = c("center","scale"), tuneGrid = data.frame(.k=1:20))
cor(testing$Yield, predict(kNN4, newdata = testing))^2

## [1] 0.4344609

Created a k-nearest neighbors model for the Yield variable, centering and scaling the predictors and trying values from 1-20 for k.

As seen above, R² using the testing data with kNN3 is 0.4344609.

E.

kNN4$results

##     k     RMSE  Rsquared       MAE    RMSESD RsquaredSD     MAESD
## 1   1 1.427711 0.4644080 1.1057308 0.4483978  0.2368294 0.3181594
## 2   2 1.365624 0.4725430 1.0805609 0.3448858  0.2119629 0.2827429
## 3   3 1.334300 0.4907510 1.0533397 0.2961471  0.2236947 0.2208984
## 4   4 1.225791 0.5761424 0.9866593 0.2787682  0.1851034 0.2200482
## 5   5 1.278859 0.5495754 1.0169051 0.2830547  0.1710121 0.2216175
## 6   6 1.308214 0.5229591 1.0356905 0.2339732  0.1288105 0.1584882
## 7   7 1.296904 0.5371365 1.0484542 0.2381766  0.1406646 0.1611279
## 8   8 1.309094 0.5141480 1.0572115 0.2350217  0.1535222 0.1677642
## 9   9 1.342208 0.4932970 1.1037261 0.2429742  0.1672224 0.1883707
## 10 10 1.376904 0.4587657 1.1308800 0.2207559  0.1499499 0.1732258
## 11 11 1.373152 0.4681376 1.1202925 0.1983151  0.1275934 0.1583883
## 12 12 1.397733 0.4467969 1.1362206 0.1960747  0.1134860 0.1529012
## 13 13 1.385955 0.4640259 1.1253542 0.2100404  0.1026455 0.1673762
## 14 14 1.377975 0.4761592 1.1187266 0.2131420  0.1180236 0.1741279
## 15 15 1.404477 0.4508466 1.1482355 0.2170641  0.1259262 0.1786018
## 16 16 1.414462 0.4372621 1.1581106 0.2146812  0.1185752 0.1844132
## 17 17 1.412914 0.4450490 1.1545351 0.2086262  0.1098979 0.1833349
## 18 18 1.435392 0.4250024 1.1703251 0.2184785  0.1177511 0.1934116
## 19 19 1.440372 0.4235065 1.1728085 0.2064806  0.1163990 0.1802061
## 20 20 1.444242 0.4274910 1.1778401 0.2154125  0.1144566 0.1823835

Based off the information above, 4 is the optimal value as the RMSE for k=4 is the lowest of the values 1-20.

Problem 4

A.

TechStocks <- read.csv(file="/Users/Alex/Dropbox/College/4-Senior/Machine Learning/Project3/TechStocks.csv", header=TRUE, sep=",")
TechStocks$Date <- NULL

logRatio = function(x) { log(x[2:length(x)]/x[1:length(x)-1]) }
returns = as.data.frame(apply(TechStocks, 2, logRatio))
annualizedMeans = colMeans(returns)*252
annualizedCovMatrix = cov(returns)*252

Imported the data and removed the date column. Computed the returns, annualized means of the returns, and the annualized covariance matrix.

B.

dvec <- matrix(0,9)
amat <- matrix(1,9,1)
b <- c(1)
solution <- solve.QP(annualizedCovMatrix, dvec, amat, b, meq=1)
meanReturn = sum(solution$solution * annualizedMeans)
meanReturn

## [1] 0.1837864

Computes the mean return for the data from TechStocks.

C.

SampleCol = function(col) { sample(col, length(col), replace=TRUE) }

ResampleReturn = function(returns) {
  reSamp = as.data.frame(apply(returns, 2, SampleCol))
  annualizedMeans = colMeans(reSamp)*252
  annualizedCovMatrix = cov(reSamp)*252
  dvec = rep(0,9)
  amat = matrix(1, 9, 1)
  b = c(1)
  solution = solve.QP(annualizedCovMatrix, dvec, amat, b, meq = 1 )
  meanReturn = sum(solution$solution*annualizedMeans)
}

The resampling functions to use in part D.

D.

MinRiskMeanReturn = sort(replicate(1000, ResampleReturn(returns)))
confidence.level = 0.95
lowerlevel = (1-confidence.level)/2
upperlevel = confidence.level + (1-confidence.level)/2
ci <- quantile(MinRiskMeanReturn, probs = c(lowerlevel, upperlevel))
ci

##      2.5%     97.5% 
## 0.1032236 0.3212093

hist(MinRiskMeanReturn)

Did 1000 resamples and created a histogram using that data of the mean returns. Then computed a 95% confidence interval for the resampled mean returns.

Project 3

Alex Good

3/20/2018

Problem 1

A.

B.

C.

D.

E.

F.

Problem 2

A.

B.

C.

D.

E.

Problem 3

A.

B.

C.

D.

E.

Problem 4

A.

B.

C.

D.