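# Load the lab data.
dat <- read.csv("lab3.csv", header = TRUE)
# Split `data` into k folds and return a list of k train/test pairs.
k_fold_cross_validation <- function(data, k) {
  # Randomly assign each row a fold label 1..k, with fold sizes as equal as possible.
  id <- sample(rep(seq.int(k), length.out = nrow(data)))
  matrices <- lapply(seq.int(k), function(x) list(
    train_matrix = data[x != id, ],
    test_matrix = data[x == id, ]
  ))
  return(matrices)
}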
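# For each fold, fit both models on the training rows and compute the
# sum of squared prediction errors on the held-out rows.
# `data` is unused (the folds already carry the rows) but is kept so that
# existing calls still work.
sse_cross_validation <- function(data, matrices) {
  SSE_J <- lapply(matrices, function(x) {
    model1 <- lm(y ~ ., data = x$train_matrix)        # all predictors
    model2 <- lm(y ~ x1 + x2, data = x$train_matrix)  # x1 and x2 only
    SSE1 <- sum((predict(model1, x$test_matrix) - x$test_matrix$y)^2)
    SSE2 <- sum((predict(model2, x$test_matrix) - x$test_matrix$y)^2)
    list(SSE1 = SSE1, SSE2 = SSE2)
  })
  # Add the per-fold SSEs to get the cross-validated SSE of each model.
  SSE_CV <- Reduce(function(x, y) {
    list(SSE1 = x$SSE1 + y$SSE1, SSE2 = x$SSE2 + y$SSE2)
  }, SSE_J)
  return(SSE_CV)
}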
set.seed(1)
matrices <- k_fold_cross_validation(dat, 10)
Why is it important to randomly assign the observations? Why is it important that the size of the parts is as balanced as possible?
We use cross validation to estimate how well a predictive model will perform in practice on an independent data set. We keep the same partition for every round so that each data point is used for testing exactly once and for training in all the remaining rounds. If we redrew the partition each round, a data point could be tested more than once or not at all. Keeping the partition fixed across rounds also reduces the variability and bias of the estimate.
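The statement that a fixed partition tests every data point exactly once can be checked directly. The sketch below rebuilds the fold labels the same way k_fold_cross_validation does, but on a hypothetical toy data frame (`toy`, 25 rows, 5 folds) rather than the lab data, so the names and sizes here are assumptions made for illustration.

```r
# Hypothetical toy data; the lab data set is not needed for this check.
set.seed(1)
toy <- data.frame(row = 1:25, y = rnorm(25))
k <- 5

# Same fold-label construction as in k_fold_cross_validation().
id <- sample(rep(seq.int(k), length.out = nrow(toy)))

# Row numbers that land in each of the k test folds.
test_rows <- lapply(seq.int(k), function(j) toy$row[id == j])

# Count how often each row is used for testing across the k folds.
times_tested <- table(factor(unlist(test_rows), levels = toy$row))
all(times_tested == 1)
```

If the fold labels were redrawn before every round instead, the counts in `times_tested` would no longer be guaranteed to equal one.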
Why is it important to use the same partition to compare the two models?
Cross validation only works when the training and testing sets are drawn from the same population, so we build both by partitioning the same sample. Using the same partition for both models means that any difference in SSE reflects the models rather than the random split. We want the parts to be approximately the same size so that the per-fold results are comparable to each other.
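How balanced the parts are can be inspected by tabulating the fold labels. This is a minimal sketch with an assumed sample size (`n = 103`) and `k = 10`, chosen so that n is not a multiple of k; the lab data may have a different number of rows.

```r
# Assumed sample size and number of folds (not taken from the lab data).
n <- 103
k <- 10

set.seed(1)
# Same fold-label construction as in k_fold_cross_validation().
id <- sample(rep(seq.int(k), length.out = n))

# Number of observations assigned to each fold.
fold_sizes <- table(id)
fold_sizes
range(fold_sizes)
```

Because `rep(..., length.out = n)` cycles through the labels 1 to k before shuffling, the fold sizes can differ by at most one observation.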
SSE_CV <- sse_cross_validation(dat, matrices)
c(SSE_CV$SSE1, SSE_CV$SSE2)
## [1] 29680.51 24748.01
The better model is the one with the smaller SSE. In this case, the second model (x1+x2) is better: its SSE of 24748.01 is noticeably smaller than the first model's 29680.51.
seeds <- c(23423, 834, 2342, 456, 3467, 5778, 983, 22, 13, 6757, 6356, 2323, 8578, 8872, 3223, 43432, 3444, 45575, 2213, 9989)
# Repeat 10-fold cross validation once per seed; each seed gives a new partition.
SSE_CV_20 <- as.data.frame(do.call(rbind, lapply(1:20, function(i) {
  set.seed(seeds[i])
  matrices <- k_fold_cross_validation(dat, 10)
  sse_cross_validation(dat, matrices)
})))
SSE_CV_20
##        SSE1     SSE2
## 1  30749.48 25021.58
## 2  30305.24  25973.3
## 3  30726.71 25391.23
## 4  27847.01 25473.37
## 5  31464.08 26267.26
## 6  29825.34 26390.75
## 7  31040.09 25300.47
## 8  28545.69 24892.85
## 9  31669.89 25428.94
## 10 31053.13 25716.91
## 11 31145.08 25217.04
## 12 30108.45 24802.32
## 13 30504.68 26334.47
## 14 29330.52 25635.28
## 15 29615.82 25813.29
## 16 30717.07 25686.82
## 17 32286.29 25620.87
## 18 30989.85 25787.39
## 19 29851.01 25498.87
## 20 31675.23 25427.28
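Because sse_cross_validation returns a list, binding the twenty results with rbind gives a data frame whose columns are lists rather than numeric vectors, which is why as.numeric is needed when the means are taken. One possible way to get plain numeric columns is sketched below. The object `results` is a hypothetical stand-in that holds only the first two replicates from the table above.

```r
# Stand-in for the per-replicate results: the first two rows of the table.
results <- list(
  list(SSE1 = 30749.48, SSE2 = 25021.58),
  list(SSE1 = 30305.24, SSE2 = 25973.30)
)

# unlist() turns each result into a named numeric vector of length 2;
# vapply() stacks them as columns, and t() puts the replicates in rows.
sse_numeric <- as.data.frame(t(vapply(results, unlist, numeric(2))))

str(sse_numeric)
colMeans(sse_numeric)
```

With numeric columns, `colMeans` and `mean` can be applied without any conversion.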
Discuss whether or not the results are consistent from replicate-to-replicate. Explain how you might reconcile this inconsistency by combining the results across all twenty replicates.
In Problem 2, we used K-fold cross validation. The advantage of this method is that every data point is included in either the training or the testing set in each round, and each data point is tested only once.
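In Problem 3, we repeat the 10-fold cross validation twenty times, each time with a different seed and therefore a different random partition. The advantage of this approach is that the number of partitions does not depend on the number of replicates. Within a replicate each data point is still tested exactly once, but across replicates it lands in different folds, so SSEcv varies from run to run (SSE1 ranges from 27847.01 to 32286.29 and SSE2 from 24802.32 to 26390.75). Averaging SSEcv over the twenty replicates reconciles this inconsistency.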
The table of twenty runs shows that the SSEcv value for the two-predictor model is lower than that of the other model in every replicate. To reconcile the replicate-to-replicate variation, we can average SSEcv over all the runs and compare the averages to see which model is better. Additionally, we could construct a confidence interval for SSEcv to check whether one value is statistically significantly higher or lower than the other.
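The comparison across replicates can be sketched as a paired analysis of the twenty differences SSE1 - SSE2. The values below are copied from the table of twenty runs. The interval from `t.test` is only a rough guide, since it assumes the replicate differences behave like an independent sample, which holds only approximately because every replicate reuses the same data.

```r
# SSEcv values copied from the twenty-replicate table.
sse1 <- c(30749.48, 30305.24, 30726.71, 27847.01, 31464.08,
          29825.34, 31040.09, 28545.69, 31669.89, 31053.13,
          31145.08, 30108.45, 30504.68, 29330.52, 29615.82,
          30717.07, 32286.29, 30989.85, 29851.01, 31675.23)
sse2 <- c(25021.58, 25973.30, 25391.23, 25473.37, 26267.26,
          26390.75, 25300.47, 24892.85, 25428.94, 25716.91,
          25217.04, 24802.32, 26334.47, 25635.28, 25813.29,
          25686.82, 25620.87, 25787.39, 25498.87, 25427.28)

# Per-replicate difference; positive means the full model did worse.
diffs <- sse1 - sse2
mean(diffs)
sum(diffs > 0)  # number of replicates in which the 2-predictor model wins

# Rough 95% confidence interval for the mean difference (paired t-test).
ci <- t.test(sse1, sse2, paired = TRUE)$conf.int
ci
```

Pairing by replicate is a design choice: both models are scored on the same partition within each replicate, so the difference removes the variation that comes from the split itself.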
mean(as.numeric(SSE_CV_20$SSE1))
## [1] 30472.53
mean(as.numeric(SSE_CV_20$SSE2))
## [1] 25584.01
After combining the results from Problem 3 across all twenty replicates, what is your conclusion regarding whether the 8-predictor model or the 2-predictor model is preferable?
The results from each of our twenty cross-validation trials indicated that the full model with all eight predictors had a higher SSEcv than the two-predictor model. From these results, we can confidently conclude that the two-predictor model is the better model for this data set.
Discuss in some detail whether this agrees with your conclusion from Lab 2.
This concurs with what we found in Lab 2 from the partial F-test on the significance of predictors x3-x8. We found that none of those six predictors was significant, and the higher SSEcv of the full model is consistent with this.
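For reference, a partial F-test of this kind can be run in R by passing the reduced and full models to `anova`. The sketch below uses simulated data, not the lab data: the data frame `sim`, its sample size, and the coefficients are all assumptions, set up so that only x1 and x2 affect y.

```r
# Simulated stand-in for the lab data: only x1 and x2 affect y (an assumption).
set.seed(1)
n <- 100
sim <- as.data.frame(matrix(rnorm(n * 8), nrow = n))
names(sim) <- paste0("x", 1:8)
sim$y <- 2 + 3 * sim$x1 - 1.5 * sim$x2 + rnorm(n)

# Reduced (2-predictor) and full (8-predictor) models.
reduced <- lm(y ~ x1 + x2, data = sim)
full <- lm(y ~ ., data = sim)

# Partial F-test of H0: the coefficients of x3, ..., x8 are all zero.
ftest <- anova(reduced, full)
ftest
```

A large p-value in the second row of `ftest` would indicate that x3 through x8 add little beyond x1 and x2, which is the same direction of evidence as a higher SSEcv for the full model.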