Outcomes

Instructions

generating simulated data

Choose n between 30 and 200, and sample n values for x from a random uniform (0, 1) distribution. Define y corresponding to x from the following quadratic function:

\[ y = 5x^2 - 4x - 10 + \epsilon \]

Here ε is normally distributed with mean 0 and standard deviation 0.1. Plot your data.

set.seed(138)

n = 100
x = sort(runif(n))          # n draws from Uniform(0, 1), sorted for plotting
noise = rnorm(n, sd = 0.1)  # epsilon ~ Normal(0, 0.1)
y = 5*x^2 - 4*x - 10 + noise
d = data.frame(x, y)        # keep the simulated pairs together
plot(x, y)

linear model

Use fit1 = lm(y ~ x) to fit a linear model to the data. What mathematical function of x does the fitted model represent? Implement the fitted model as a function in R, and verify that it matches the values predicted by the model.

Hint: once fit1 has been fit, you can predict on a grid of new x values (using a separate variable so the training x is not overwritten):

x_grid = seq(from = 0, to = 1, by = 0.1)
predict(fit1, data.frame(x = x_grid))

fit1 = lm(y ~ x)
pred = predict(fit1, data.frame(x = x_grid))  # predictions along the grid
plot(x, y)
lines(x_grid, pred)
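
The fitted model represents a straight line, y = b0 + b1*x, where b0 and b1 are the estimated intercept and slope. A minimal sketch of implementing that function by hand and checking it against predict() (the names b and f1 are my own):

b = coef(fit1)                  # b[1] = intercept, b[2] = slope
f1 = function(x) b[1] + b[2]*x  # the line the fitted model represents
all.equal(unname(f1(x_grid)), unname(pred))  # should be TRUE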

quadratic model

Create a new linear model that includes a quadratic x^2 term, for example, using lm(y ~ x + I(x^2)). What mathematical function of x does the fitted model represent?

fit2 = lm(y ~ x + I(x^2))                      # quadratic term inside the formula
pred2 = predict(fit2, data.frame(x = x_grid))  # predict then needs only x
plot(x, y)
lines(x_grid, pred2)
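
The fitted model now represents a parabola, y = b0 + b1*x + b2*x^2. The same hand-check works with three coefficients (cf and f2 are names I introduce here):

cf = coef(fit2)                               # intercept, x, and I(x^2) terms
f2 = function(x) cf[1] + cf[2]*x + cf[3]*x^2  # the parabola the model represents
all.equal(unname(f2(x_grid)), unname(pred2))  # should be TRUE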

comparing models

Plot lines for the linear and quadratic model together with the data points. Which appears to do a better job fitting the data? Explain.

The quadratic model appears to do the better job. The data were generated from a quadratic function, so the quadratic fit can follow the curvature, while the straight line can only capture the overall trend and systematically misses it near the ends and in the middle.

For this particular data, I would therefore prefer the quadratic prediction.

plot(x, y)
lines(x_grid, pred, col = "red")    # linear fit
lines(x_grid, pred2, col = "blue")  # quadratic fit
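
To back the visual impression with a number, compare the residual sums of squares on the training data; the quadratic model's is necessarily no larger, since the linear model is nested inside it:

sum(residuals(fit1)^2)  # linear fit
sum(residuals(fit2)^2)  # quadratic fit: smaller, tracking the true curve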

recursive partitioning

library(rpart)

# tree fit to the original data (noise sd = 0.1)
fit_p = rpart(y ~ x, data = d)

# the same underlying curve with noisier errors (sd = 0.3)
noise2 = rnorm(n, sd = 0.3)
y2 = 5*x^2 - 4*x - 10 + noise2
d2 = data.frame(x, y2)
fit_p2 = rpart(y2 ~ x, data = d2)

plot(x, y)
lines(x, predict(fit_p), col = "red")    # tree fit to low-noise data
lines(x, predict(fit_p2), col = "blue")  # tree fit to high-noise data

For this section I fit trees to two versions of the same curve: the original data with noise sd = 0.1, and a noisier copy with sd = 0.3. The tree fit to the low-noise data follows the underlying function more closely, since with less noise each leaf's average is a more accurate estimate of the true curve over its interval.
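
To put a number on the difference, one option (assuming rpart's default cost-complexity pruning) is to count each tree's leaves; the fitted tree sizes can differ because the pruning step reacts to noise:

# leaves are the rows of the tree frame whose split variable is "<leaf>"
sum(fit_p$frame$var == "<leaf>")   # low-noise tree
sum(fit_p2$frame$var == "<leaf>")  # high-noise tree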

test data performance

Simulate more values from the true model

\[ y = 5x^2 - 4x - 10 + \epsilon \] where x is between 0 and 1.

Compare the performance of three different models (linear, quadratic, and recursive partitioning) on this test set. Which model does the best job minimizing the sum of squared error?

mse = function(model, testdata)
{
  yhat = predict(model, testdata)  # model predictions on the test set
  y = testdata[, "y"]              # observed responses
  d2 = (yhat - y)^2                # squared errors
  mean(d2)                         # mean squared error (same ranking as SSE)
}

# simulate a fresh test set from the true model
n_test = 500
x_test = runif(n_test)
d_test = data.frame(x = x_test,
                    y = 5*x_test^2 - 4*x_test - 10 + rnorm(n_test, sd = 0.1))

# evaluate the models fit earlier on the training data
mse(fit1, d_test)   # linear
mse(fit2, d_test)   # quadratic
mse(fit_p, d_test)  # recursive partitioning

The quadratic model should do the best job minimizing the squared error on the test set: it matches the form of the true function, while the linear model misses the curvature and the tree can only approximate the smooth curve with a step function.

a data set to suit the model

Simulate a slightly noisy data set where the recursive partitioning model should perform much better than the simple linear model. What characteristics of the data make the recursive partitioning model work well? Fit and plot both a linear model and a recursive partitioning model on the same plot for this data to demonstrate that recursive partitioning performs better.

library(rpart)

# nearly noise-free version of the quadratic curve
noise3 = rnorm(n, sd = 0.3) / 1e9
y3 = 5*x^2 - 4*x - 10 + noise3
d3 = data.frame(x, y3)

fit_p3 = rpart(y3 ~ x, data = d3)  # recursive partitioning
fit_l3 = lm(y3 ~ x)                # simple linear model on the same data

plot(x, y3)
lines(x, predict(fit_p3), col = "green")
lines(x, predict(fit_l3), col = "orange")

Recursive partitioning performs better here because the noise is nearly zero and the relationship is strongly nonlinear: the tree can trace the curve with many small steps, while a straight line cannot bend to follow it. I made the noise tiny by dividing ε by a very large constant.
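
An even cleaner case for the tree, using data I choose here purely for illustration (the names x_s and y_s and the jump at 0.5 are my own), is a piecewise-constant signal: the tree's step-shaped predictions match the jump exactly, while a line must compromise across it. A minimal sketch:

# hypothetical step data: two constant levels with a jump at x = 0.5
x_s = sort(runif(200))
y_s = ifelse(x_s < 0.5, -10, -8) + rnorm(200, sd = 0.05)
d_s = data.frame(x = x_s, y = y_s)

fit_tree = rpart(y ~ x, data = d_s)
fit_line = lm(y ~ x, data = d_s)

plot(x_s, y_s)
lines(x_s, predict(fit_tree, d_s), col = "green")   # captures the jump
lines(x_s, predict(fit_line, d_s), col = "orange")  # smooths over it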