Outcomes

Instructions

generating simulated data

Choose n between 30 and 200, and sample n values for x from a random uniform (0, 1) distribution. Define y corresponding to x from the following quadratic function:

\[ y = 5x^2 - 4x - 10 + \epsilon \]

Here ε is normally distributed with mean 0 and standard deviation 0.1. Plot your data.

set.seed(138)

n = 100
x = sort(runif(n))          # n draws from Uniform(0, 1), sorted for plotting
noise = rnorm(n, sd = 0.1)  # epsilon ~ Normal(0, 0.1)
y = 5*x^2 - 4*x - 10 + noise
d = data.frame(x, y)        # keep the simulated pairs together
plot(x, y)

linear model

Use fit1 = lm(y ~ x) to fit a linear model to the data. What mathematical function of x does the fitted model represent? Implement the fitted model as a function in R, and verify that it matches the values predicted by the model.

Hint: once fit1 has been fit, you can predict on a grid of new x values (using a separate variable so the training x is not overwritten):

x_grid = seq(from = 0, to = 1, by = 0.1)
predict(fit1, data.frame(x = x_grid))

fit1 = lm(y ~ x)
pred = predict(fit1, data.frame(x = x_grid))  # predictions along the grid
plot(x, y)
lines(x_grid, pred)
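
The fitted model represents a straight line, y = b0 + b1*x, where b0 and b1 are the estimated intercept and slope. A minimal sketch of implementing that function by hand and checking it against predict() (the names b and f1 are my own):

b = coef(fit1)                  # b[1] = intercept, b[2] = slope
f1 = function(x) b[1] + b[2]*x  # the line the fitted model represents
all.equal(unname(f1(x_grid)), unname(pred))  # should be TRUE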

quadratic model

Create a new linear model that includes a quadratic x^2 term, for example, using lm(y ~ x + I(x^2)). What mathematical function of x does the fitted model represent?

fit2 = lm(y ~ x + I(x^2))                      # quadratic term inside the formula
pred2 = predict(fit2, data.frame(x = x_grid))  # predict then needs only x
plot(x, y)
lines(x_grid, pred2)
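
The fitted model now represents a parabola, y = b0 + b1*x + b2*x^2. The same hand-check works with three coefficients (cf and f2 are names I introduce here):

cf = coef(fit2)                               # intercept, x, and I(x^2) terms
f2 = function(x) cf[1] + cf[2]*x + cf[3]*x^2  # the parabola the model represents
all.equal(unname(f2(x_grid)), unname(pred2))  # should be TRUE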

comparing models

Plot lines for the linear and quadratic model together with the data points. Which appears to do a better job fitting the data? Explain.

The quadratic model appears to do the better job. The data were generated from a quadratic function, so the quadratic fit can follow the curvature, while the straight line can only capture the overall trend and systematically misses it near the ends and in the middle.

For this particular data, I would therefore prefer the quadratic prediction.

plot(x, y)
lines(x_grid, pred, col = "red")    # linear fit
lines(x_grid, pred2, col = "blue")  # quadratic fit
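
To back the visual impression with a number, compare the residual sums of squares on the training data; the quadratic model's is necessarily no larger, since the linear model is nested inside it:

sum(residuals(fit1)^2)  # linear fit
sum(residuals(fit2)^2)  # quadratic fit: smaller, tracking the true curve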

recursive partitioning

library(rpart)

# tree fit to the original data (noise sd = 0.1)
fit_p = rpart(y ~ x, data = d)

# the same underlying curve with noisier errors (sd = 0.3)
noise2 = rnorm(n, sd = 0.3)
y2 = 5*x^2 - 4*x - 10 + noise2
d2 = data.frame(x, y2)
fit_p2 = rpart(y2 ~ x, data = d2)

plot(x, y)
lines(x, predict(fit_p), col = "red")    # tree fit to low-noise data
lines(x, predict(fit_p2), col = "blue")  # tree fit to high-noise data

For this section I fit trees to two versions of the same curve: the original data with noise sd = 0.1, and a noisier copy with sd = 0.3. The tree fit to the low-noise data follows the underlying function more closely, since with less noise each leaf's average is a more accurate estimate of the true curve over its interval.
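
To put a number on the difference, one option (assuming rpart's default cost-complexity pruning) is to count each tree's leaves; the fitted tree sizes can differ because the pruning step reacts to noise:

# leaves are the rows of the tree frame whose split variable is "<leaf>"
sum(fit_p$frame$var == "<leaf>")   # low-noise tree
sum(fit_p2$frame$var == "<leaf>")  # high-noise tree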

test data performance

Simulate more values from the true model

\[ y = 5x^2 - 4x - 10 + \epsilon \] where x is between 0 and 1.

Compare the performance of three different models (linear, quadratic, and recursive partitioning) on this test set. Which model does the best job minimizing the sum of squared error?

mse = function(model, testdata)
{
  yhat = predict(model, testdata)  # model predictions on the test set
  y = testdata[, "y"]              # observed responses
  d2 = (yhat - y)^2                # squared errors
  mean(d2)                         # mean squared error (same ranking as SSE)
}

# simulate a fresh test set from the true model
n_test = 500
x_test = runif(n_test)
d_test = data.frame(x = x_test,
                    y = 5*x_test^2 - 4*x_test - 10 + rnorm(n_test, sd = 0.1))

# evaluate the models fit earlier on the training data
mse(fit1, d_test)   # linear
mse(fit2, d_test)   # quadratic
mse(fit_p, d_test)  # recursive partitioning

The quadratic model should do the best job minimizing the squared error on the test set: it matches the form of the true function, while the linear model misses the curvature and the tree can only approximate the smooth curve with a step function.

a data set to suit the model

Simulate a slightly noisy data set where the recursive partitioning model should perform much better than the simple linear model. What characteristics of the data make the recursive partitioning model work well? Fit and plot both a linear model and a recursive partitioning model on the same plot for this data to demonstrate that recursive partitioning performs better.

library(rpart)

# nearly noise-free version of the quadratic curve
noise3 = rnorm(n, sd = 0.3) / 1e9
y3 = 5*x^2 - 4*x - 10 + noise3
d3 = data.frame(x, y3)

fit_p3 = rpart(y3 ~ x, data = d3)  # recursive partitioning
fit_l3 = lm(y3 ~ x)                # simple linear model on the same data

plot(x, y3)
lines(x, predict(fit_p3), col = "green")
lines(x, predict(fit_l3), col = "orange")

Recursive partitioning performs better here because the noise is nearly zero and the relationship is strongly nonlinear: the tree can trace the curve with many small steps, while a straight line cannot bend to follow it. I made the noise tiny by dividing ε by a very large constant.
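
An even cleaner case for the tree, using data I choose here purely for illustration (the names x_s and y_s and the jump at 0.5 are my own), is a piecewise-constant signal: the tree's step-shaped predictions match the jump exactly, while a line must compromise across it. A minimal sketch:

# hypothetical step data: two constant levels with a jump at x = 0.5
x_s = sort(runif(200))
y_s = ifelse(x_s < 0.5, -10, -8) + rnorm(200, sd = 0.05)
d_s = data.frame(x = x_s, y = y_s)

fit_tree = rpart(y ~ x, data = d_s)
fit_line = lm(y ~ x, data = d_s)

plot(x_s, y_s)
lines(x_s, predict(fit_tree, d_s), col = "green")   # captures the jump
lines(x_s, predict(fit_line, d_s), col = "orange")  # smooths over it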