I have provided two data sets: a training data set and a prediction data set. The training data set (full description below) contains ozone measurements as well as other factors. The prediction data set contains all of the factors but is missing the ozone measurement.

1. Use the training data set to build a regression tree with ozone (O3) as the outcome.

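set.seed(1)
train = read.csv("SmogData.csv", header = TRUE)
# Sample 180 row indices for fitting; the remaining rows form the hold-out set
indices = sample(255, 180)
T1 = train[indices,]   # fitting subset
T2 = train[-indices,]  # hold-out subset
# This data frame shares its name with predict(); function calls still resolve to the function
predict = read.csv("smogDataPredict.csv", header = TRUE)
library(tree)
# Regression tree for O3 on every factor except ID
tree<-tree(O3~.-ID,data=T1, control = tree.control(nrow(T1), mincut = 10, minsize = 30, mindev = 0.01))
# Predict O3 for the hold-out rows
p = predict(tree, T2)

# Hold-out mean squared error
mean((p-T2$O3)^2)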
## [1] 19.89734
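
A hold-out MSE such as the one above is easier to judge next to a trivial baseline. The following is a minimal sketch in base R only, run on a small synthetic data frame rather than SmogData.csv. The helper `mse`, the data frame `toy`, and its `temp` column are illustrative assumptions. It splits rows by `nrow()` instead of a fixed count and scores a predictor that always returns the mean O3 of the fitting rows.

```r
# Hypothetical helper: mean squared error between predictions and observations.
mse <- function(pred, obs) mean((pred - obs)^2)

set.seed(1)
# Synthetic stand-in for the training data (assumed columns, not the real file).
n <- 100
toy <- data.frame(temp = runif(n, 40, 90))
toy$O3 <- 0.5 * toy$temp + rnorm(n, sd = 3)

# Split by row count so the code does not depend on a fixed sample size.
idx <- sample(nrow(toy), size = floor(0.7 * nrow(toy)))
fit_rows <- toy[idx, ]
holdout <- toy[-idx, ]

# Baseline: predict the fitting-set mean of O3 for every hold-out row.
baseline_pred <- rep(mean(fit_rows$O3), nrow(holdout))
baseline_mse <- mse(baseline_pred, holdout$O3)
baseline_mse
```

A tree whose hold-out MSE is not clearly below this kind of baseline is adding little beyond the average ozone level.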
2. Use the tree you built to predict the missing values of ozone based on the observed factors.
# Refit on the full training data before predicting the unlabeled set
tree<-tree(O3~.-ID,data=train, control = tree.control(nrow(train), mincut = 10, minsize = 30, mindev = 0.01))
p.tree = predict(tree, predict)

Use CART and random forest modeling to predict the missing ozone values. Whoever's predictions minimize the mean squared error (MSE) wins a prize or something.

library(randomForest)
## randomForest 4.6-14
rf = randomForest(O3~.-ID, data = train)
p.rf = predict(rf, predict)

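# Mean squared difference between the tree and random forest predictions.
# This measures how much the two models disagree, not their error,
# because the true O3 values of the prediction set are unknown.
mean((p.tree-p.rf)^2)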
## [1] 4.73952
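
Since the prediction set has no observed O3, one way to estimate a model's error is k-fold cross-validation on labeled data. The sketch below is an illustration under stated assumptions, not the analysis above: it uses `rpart` (a CART implementation shipped with standard R installations) in place of the `tree` package, the built-in `airquality` data in place of SmogData.csv, and a hypothetical helper `cv_mse`.

```r
library(rpart)  # CART implementation; swapped in here for the tree package

# Built-in data with an Ozone column; rows with missing values are dropped.
aq <- na.omit(airquality)

# Hypothetical helper: k-fold cross-validated MSE for any fitting function
# whose result works with predict(model, newdata = ...).
cv_mse <- function(fit_fun, data, outcome, k = 5) {
  # Assign every row to one of k folds at random.
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  sq_err <- numeric(nrow(data))
  for (i in seq_len(k)) {
    test_rows <- folds == i
    model <- fit_fun(data[!test_rows, ])                  # fit on k - 1 folds
    pred <- predict(model, newdata = data[test_rows, ])   # score the held-out fold
    sq_err[test_rows] <- (pred - data[[outcome]][test_rows])^2
  }
  mean(sq_err)
}

set.seed(1)
cart_cv <- cv_mse(function(d) rpart(Ozone ~ ., data = d), aq, "Ozone")
cart_cv
```

Passing a different fitting function, for example `function(d) randomForest::randomForest(Ozone ~ ., data = d)`, would give a cross-validated MSE for the random forest on the same footing, assuming that package is installed.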
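# Collect the IDs and both sets of predictions in named columns
df.rf = data.frame(ID = predict$ID, p.tree = p.tree, p.rf = p.rf)
# row.names = FALSE keeps the row index out of the output file
write.csv(df.rf, file = "BlohmPredictions.csv", row.names = FALSE)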