I have provided two data sets: a training data set and a prediction data set. The training data set (full description below) contains ozone measurements along with other factors. The prediction data set contains all of the same factors but is missing the ozone measurement.
1. Use the training data set to build a regression tree with ozone (O3) as the outcome.
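set.seed(1)
# Load the training data (contains the O3 outcome)
train = read.csv("SmogData.csv", header = TRUE)
# Randomly select 180 rows for fitting; the remaining rows form a hold-out set
indices = sample(255, 180)
T1 = train[indices, ]
T2 = train[-indices, ]
# Load the prediction data (same factors, O3 missing)
predict = read.csv("smogDataPredict.csv", header = TRUE)
library(tree)
# Fit a regression tree on the fitting subset, excluding the ID column
tree <- tree(O3 ~ . - ID, data = T1, control = tree.control(nrow(T1), mincut = 10, minsize = 30, mindev = 0.01))
# Hold-out mean squared error
p = predict(tree, T2)
mean((p - T2$O3)^2)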
## [1] 19.89734
# Refit the tree on the full training data, then predict the missing ozone values
tree <- tree(O3 ~ . - ID, data = train, control = tree.control(nrow(train), mincut = 10, minsize = 30, mindev = 0.01))
p.tree = predict(tree, predict)
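The hold-out check above computes the mean squared error inline. As a minimal sketch, the same calculation can be wrapped in a small helper so it can be reused for other models. The function name `mse` and the toy numbers are hypothetical and are not taken from the smog data.

```r
# Hypothetical helper: mean squared error between observed and predicted values
mse <- function(observed, predicted) {
  mean((observed - predicted)^2)
}

# Toy check with made-up numbers (not from the smog data)
obs  <- c(3, 5, 7)
pred <- c(2, 5, 9)
mse(obs, pred)
```

With such a helper, the hold-out error above could be written as `mse(T2$O3, p)`.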
2. Use CART and random forest modeling to predict the missing ozone values. Whoever’s predictions minimize the mean squared error (MSE) wins a prize or something.
library(randomForest)
## randomForest 4.6-14
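# Fit a random forest on the full training data, excluding the ID column
rf = randomForest(O3 ~ . - ID, data = train)
p.rf = predict(rf, predict)
# Mean squared difference between the tree and random forest predictions
# (a measure of how much the two models disagree, not an MSE against true ozone values)
mean((p.tree - p.rf)^2)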
## [1] 4.73952
# Combine the IDs with both sets of predictions and write them out
df.rf = cbind(ID = predict$ID, p.tree, p.rf)
write.csv(df.rf, file = "BlohmPredictions.csv")
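A single 180-row split gives only one estimate of out-of-sample MSE, and that estimate depends on which rows land in the hold-out set. One possible way to get a steadier estimate is k-fold cross-validation. The sketch below is an illustration under stated assumptions, not the method used above: it uses the built-in `airquality` data as a stand-in for the smog data, the `rpart` package (a CART implementation) in place of `tree`, and k = 5, all chosen for illustration only.

```r
library(rpart)  # CART implementation; assumed installed (ships with standard R distributions)

set.seed(1)
# Built-in airquality data as a stand-in for the smog data (an assumption for illustration)
aq <- na.omit(airquality)

# Assign each row to one of k folds at random
k <- 5
folds <- sample(rep(1:k, length.out = nrow(aq)))

# For each fold: fit on the other folds, predict the held-out fold, record its MSE
cv_mse <- sapply(1:k, function(i) {
  fit  <- rpart(Ozone ~ ., data = aq[folds != i, ])
  pred <- predict(fit, newdata = aq[folds == i, ])
  mean((pred - aq$Ozone[folds == i])^2)
})

# Average MSE across the k held-out folds
mean(cv_mse)
```

The same loop could be run with a random forest in place of `rpart` to compare the two model types on identical folds before choosing which predictions to submit.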