This homework is due on October 23, the day we get back from Fall Break. It is slightly longer than usual because we missed 20 minutes of class on Wednesday.

Reading

Read the t-test refresher (a PDF file) that has been placed on the Resources page of Sakai. Our brief discussion of t-tests will mostly consist of a couple of examples. If you remember t-tests well from STOR 155 or STOR 455, you can skip this reading.

Exercises - Please consider this assignment to be seriously graded

Attempt to do as much of this assignment as you can by next Monday with no help from others. Then we will talk about it briefly at the end of class to see how everyone is doing.

1. One downside of the least-squares model is that it is sensitive to unusual values because the distance incorporates a squared term. Fit a linear model to the simulated data below, and visualise the results. Rerun a few times to generate different simulated datasets. What do you notice about the model?
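library(tibble)  # provides tibble()

# x takes each value 1..10 three times; the noise is t-distributed (df = 2), so it has heavy tails
sim1a <- tibble(
  x = rep(1:10, each = 3),
  y = x * 2 + 8 + rt(length(x), df = 2)
)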
  1. Now use purrr or a for-loop to regenerate sim1a 100 times and make a scatterplot of all of the fitted models. Comment on the wide range of models that results. Why such a large range when the ‘true’ model is y=2x+8+noise?
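One possible way to organise the repeated simulation in Exercise 1 is sketched below. It is only a sketch under stated assumptions: it uses base R (`data.frame`, `replicate`) instead of tibble and purrr so that it runs without extra packages, and the helper names `simulate_sim1a()` and `fit_once()` and the fixed seed are illustrative choices, not something given in the assignment.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Simulate one dataset with the same recipe as sim1a (base R data frame).
simulate_sim1a <- function() {
  x <- rep(1:10, each = 3)
  data.frame(x = x, y = x * 2 + 8 + rt(length(x), df = 2))
}

# Fit y ~ x by least squares on a fresh dataset; return c(intercept, slope).
fit_once <- function() {
  unname(coef(lm(y ~ x, data = simulate_sim1a())))
}

# 100 simulated datasets, one fitted line each, stored as a 100 x 2 matrix.
coefs <- t(replicate(100, fit_once()))
colnames(coefs) <- c("intercept", "slope")

# Spread of the fitted coefficients; the true values are 8 and 2.
print(summary(coefs))

# Scatterplot of the fitted models: one point per (intercept, slope) pair,
# with the true model marked in red. Only drawn in an interactive session.
if (interactive()) {
  plot(coefs[, "intercept"], coefs[, "slope"],
       xlab = "intercept", ylab = "slope")
  points(8, 2, col = "red", pch = 19)
}
```

A purrr version would follow the same shape, with the `replicate()` call swapped for a map over `1:100`.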



2. One way to make linear models more robust is to use a different distance measure. For example, instead of root-mean-squared distance, you could use mean-absolute distance:

measure_distance <- function(mod, data) {
  diff <- data$y - make_prediction(mod, data)
  mean(abs(diff))
}
  1. Redo Exercise 1 with this metric for how good a fit our line is (100 regressions on 100 datasets). You will have to use `optim()` to find the best-fitting model.

  2. What do you notice about the range of models compared to the range in Exercise 1? How do you explain the results?
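As a rough illustration of how `optim()` can be driven by the mean-absolute distance, here is a minimal sketch. It rests on several assumptions: `make_prediction()` is not defined in this handout, so the version below (a straight line with `mod = c(intercept, slope)`) is a guess at what it does; the starting point `c(0, 0)`, the seed, and the helper names `simulate_sim1a()` and `fit_mad()` are illustrative only.

```r
set.seed(2)  # assumption: fixed seed so the sketch is reproducible

# Assumption: make_prediction() is a straight-line helper with
# mod = c(intercept, slope). It is not defined in this handout.
make_prediction <- function(mod, data) {
  mod[1] + data$x * mod[2]
}

# Mean-absolute distance, as given in Exercise 2.
measure_distance <- function(mod, data) {
  diff <- data$y - make_prediction(mod, data)
  mean(abs(diff))
}

# Same simulation recipe as sim1a, as a base R data frame.
simulate_sim1a <- function() {
  x <- rep(1:10, each = 3)
  data.frame(x = x, y = x * 2 + 8 + rt(length(x), df = 2))
}

# Minimise the distance over (intercept, slope), starting from c(0, 0).
# optim() passes `data` through to measure_distance().
fit_mad <- function(data) {
  best <- optim(c(0, 0), measure_distance, data = data)
  best$par
}

# 100 datasets, one mean-absolute fit each, stored as a 100 x 2 matrix.
mad_coefs <- t(replicate(100, fit_mad(simulate_sim1a())))
colnames(mad_coefs) <- c("intercept", "slope")

# Smallest and largest fitted intercept and slope across the 100 fits.
print(apply(mad_coefs, 2, range))
```

The printed ranges can be set side by side with the least-squares ranges from Exercise 1. Note that `optim()` defaults to the Nelder-Mead method, which only approximates the minimum of this non-smooth objective.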



3. Get the time series data set salaries.csv and put it in your usual working directory.

  1. Load the data set and have a look. It shows an employee’s salary as a function of number of years worked.

  2. Fit a linear model to the data set and look at summary(your_model) to see various summary statistics. All indications are that the model is a pretty good fit, but just to be sure, look at plot(your_model, which = 1:2) to see two important plots you may remember from a past statistics course.

  3. Based on your answer to part 2, what evidence do we have that the linear model is not appropriate?

  4. You may be aware that salaries, like most financial data, tend to grow exponentially over time. Use optim() and the sum-of-squares of residuals to fit a model of the form y = a*b^x to these data points. This will basically involve re-doing what we did in class for linear models but changing `model1()` to be a new model that is exponential. In case it is helpful, here are the regression lecture commands in an R-script.

  5. This model has the same number of parameters as the linear model in part 2, so both models are equally parsimonious. Were you able to improve on the sum-of-squares fit from part 2?

  6. Plot the residuals for the model in part 4. Which model is superior, the linear model or the exponential model? Explain.
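The mechanics of fitting y = a*b^x with optim() can be sketched on made-up data. Everything specific here is an assumption: the column names of salaries.csv are not given in this handout, so the sketch simulates a stand-in data frame with hypothetical columns `years` and `salary`; the helper names `exp_model()` and `sum_sq()`, the choice of starting values from a straight-line fit to log(salary), and the use of `parscale` are illustrative choices, not the method from lecture.

```r
set.seed(3)  # assumption: fixed seed so the sketch is reproducible

# Hypothetical stand-in for salaries.csv: the real column names are not
# given in the handout, so `years` and `salary` are assumptions.
salaries <- data.frame(years = 0:30)
salaries$salary <- 40000 * 1.05^salaries$years * exp(rnorm(31, sd = 0.03))

# Exponential model: mod = c(a, b), prediction = a * b^x.
exp_model <- function(mod, data) {
  mod[1] * mod[2]^data$years
}

# Sum of squared residuals for a given parameter vector.
sum_sq <- function(mod, data) {
  sum((data$salary - exp_model(mod, data))^2)
}

# Starting values from a straight-line fit to log(salary):
# log(y) = log(a) + x * log(b), so exponentiating gives a and b.
log_fit <- lm(log(salary) ~ years, data = salaries)
start <- unname(exp(coef(log_fit)))

# parscale tells optim() that a and b live on very different scales.
exp_fit <- optim(start, sum_sq, data = salaries,
                 control = list(parscale = start))

# Compare against the least-squares line on the same data.
lin_fit <- lm(salary ~ years, data = salaries)
sse_linear <- sum(resid(lin_fit)^2)
sse_exp <- exp_fit$value

print(c(a = exp_fit$par[1], b = exp_fit$par[2]))
print(c(linear = sse_linear, exponential = sse_exp))

# Residuals of the exponential fit, ready for a residual plot.
exp_resid <- salaries$salary - exp_model(exp_fit$par, salaries)
```

With the real data, the simulated data frame would be replaced by the loaded file and the column names adjusted to match. Plotting `exp_resid` against `years` gives a residual plot comparable to the first panel of plot(your_model, which = 1:2).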