Assignment 3 Max Zalewski

load("assignment 3.RData")
library(gt)
library(dplyr)

Question 1

  1. What is the predicted value and 95% prediction interval for the muscle mass of women of age 60? Interpret your prediction interval.
ci_60 = predict(object = mm_lm, newdata = data.frame(age = c(60)), interval = "confidence", level = .95)
ci_60 = as.data.frame(ci_60)
ci_60 %>% select(lwr,fit,upr) %>% gt()
lwr fit upr
82.83471 84.94683 87.05895

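Because predict() was called with interval = "confidence", the interval above is a 95% confidence interval for the mean response: we are 95% confident that the mean muscle mass of women aged 60 lies between 82.83 and 87.06. It is not a prediction interval for an individual woman; that requires interval = "prediction" and gives a wider interval, because it also accounts for the variation of a single observation around the mean.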
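The question asks for a prediction interval, whereas the call above uses interval = "confidence". The sketch below is a rough reconstruction of the prediction interval from numbers printed in this document. It assumes that the mean square for SSE in the ANOVA table further down (66.80082 on 58 degrees of freedom) comes from the same model, mm_lm. The variable names are illustrative only.

```r
# Values copied from the printed output in this document
fit <- 84.94683   # fitted mean muscle mass at age 60
upr <- 87.05895   # upper 95% confidence limit for the mean
mse <- 66.80082   # mean square for SSE from the ANOVA table (assumed to be from mm_lm)
df  <- 58         # error degrees of freedom

t_crit  <- qt(0.975, df)          # two-sided 95% critical value
se_fit  <- (upr - fit) / t_crit   # standard error of the fitted mean, backed out of the CI
se_pred <- sqrt(mse + se_fit^2)   # standard error for a single new observation

pred_int <- c(lwr = fit - t_crit * se_pred,
              fit = fit,
              upr = fit + t_crit * se_pred)
print(round(pred_int, 2))
```

With the model object available, predict(mm_lm, newdata = data.frame(age = 60), interval = "prediction", level = 0.95) should give the same interval directly.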

  1. Plot the residuals y_i − y_hat_i against x_i on one graph.
residual_plot_0

  1. Plot the values y_hat_i − y_bar against x_i on one graph, using the same scales as in the graph in part (a).
y_hat_less_y_bar_plot

  1. From the two graphs in parts (b) and (c), does SSE or SSR appear to be the larger component of SSTO? What does this imply about the magnitude of R2?

SSR appears to be the larger component of SSTO. Since SSTO = SSR + SSE and R2 = SSR/SSTO, this implies that R2 is greater than 0.5.
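The decomposition behind this answer can be checked numerically. The sketch below uses simulated data, not the assignment data; age, mass, and the coefficients used to generate them are hypothetical. It computes the three sums of squares from a fitted model and confirms that SSR + SSE = SSTO and that SSR/SSTO matches the R2 reported by summary().

```r
set.seed(1)
age  <- runif(60, 40, 80)                     # hypothetical predictor values
mass <- 150 - 1.2 * age + rnorm(60, sd = 8)   # hypothetical response values
fit  <- lm(mass ~ age)

sse  <- sum(residuals(fit)^2)                 # unexplained variation: sum of (y_i - y_hat_i)^2
ssr  <- sum((fitted(fit) - mean(mass))^2)     # explained variation: sum of (y_hat_i - y_bar)^2
ssto <- sum((mass - mean(mass))^2)            # total variation: sum of (y_i - y_bar)^2

print(all.equal(ssr + sse, ssto))             # the two components add up to the total
print(c(from_ss = ssr / ssto, from_summary = summary(fit)$r.squared))
```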

  1. Provide the ANOVA table.
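anova_mm
TERM   DF   SS          MS            F-stat    p-value
SSR    1    11627.486   11627.48584   174.062   4.123987e-19
SSE    58   3874.447    66.80082      NA        NA
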
  1. What proportion of the total variation in muscle mass remains unexplained when age is added into the model? Is this proportion relatively small or large?
sse_prop
Proportion of Unexplained Variance
0.2499332

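About 25% of the total variation in muscle mass remains unexplained after age is added to the model. This is a sizable proportion, although it is small relative to the roughly 75% that the model explains (SSR is about three times SSE).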

  1. Conduct a hypothesis test H_0 : β1 = 0 using an F test with significance level α = 0.05. Clearly state the alternatives, test statistics and conclusion.

We can obtain the F-statistic and p-value in R with anova().

anova_mm
TERM   DF   SS          MS            F-stat    p-value
SSR    1    11627.486   11627.48584   174.062   4.123987e-19
SSE    58   3874.447    66.80082      NA        NA

We test H_0: β1 = 0 against the alternative H_a: β1 ≠ 0. The test statistic is F* = MSR/MSE = 174.062 on 1 and 58 degrees of freedom, with a p-value of 4.12e-19. Because the p-value is far below our significance level of α = 0.05, we reject the null hypothesis and conclude that there is sufficient evidence that β1 ≠ 0, i.e. that muscle mass is linearly associated with age.
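The test can be reproduced from the mean squares in the ANOVA table alone. The sketch below copies those values by hand (the variable names are illustrative) and applies the usual decision rule: reject H_0 when F* exceeds the 0.95 quantile of the F distribution with 1 and 58 degrees of freedom.

```r
# Values copied from the ANOVA table above
msr <- 11627.48584   # regression mean square (SSR / 1)
mse <- 66.80082      # error mean square (SSE / 58)
df1 <- 1
df2 <- 58
alpha <- 0.05

f_stat  <- msr / mse                                   # test statistic F* = MSR / MSE
f_crit  <- qf(1 - alpha, df1, df2)                     # critical value F(0.95; 1, 58)
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)    # upper-tail probability

reject_h0 <- f_stat > f_crit
print(c(f_stat = f_stat, f_crit = f_crit, p_value = p_value))
print(reject_h0)
```

Comparing F* with the critical value and comparing the p-value with α are equivalent ways of stating the same decision.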

  1. Obtain R2 and r.
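# anova_table returned no values in this session, so R^2 is taken from the fitted model mm_lm instead
R2 = summary(mm_lm)$r.squared
# r carries the sign of the estimated slope, so sqrt(R2) alone is not enough
r = sign(coef(mm_lm)[2]) * sqrt(R2)
print(paste("R^2 : ", R2))
print(paste("r : ", r))

From the ANOVA table above, R2 = SSR/SSTO = 11627.486/15501.933 ≈ 0.7501, which agrees with the unexplained proportion of 0.2499 reported earlier. The magnitude of r is therefore about 0.866, and its sign matches the sign of the estimated slope.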
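Taking r = sqrt(R2) always returns a non-negative value, so it reports the wrong sign whenever the fitted slope is negative. The sketch below uses simulated data with a deliberately negative slope (x, y, and the generating coefficients are hypothetical, not the assignment data) to show that attaching the sign of the slope recovers the sample correlation.

```r
set.seed(2)
x <- runif(50, 40, 80)                    # hypothetical predictor values
y <- 150 - 1.2 * x + rnorm(50, sd = 8)    # hypothetical response with a negative trend
fit <- lm(y ~ x)

r2 <- summary(fit)$r.squared
r  <- unname(sign(coef(fit)[2]) * sqrt(r2))   # attach the sign of the slope to sqrt(R^2)

print(c(r = r, cor_xy = cor(x, y)))           # the two values should agree
```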

Question 2

  1. Plot a scatter plot of the data. Is a simple linear regression appropriate for modeling this data?
prod_scatter_plot

No. The scatter plot shows a curved rather than straight-line relationship, so a simple linear regression is not appropriate for the untransformed data.

  1. Obtain the estimated linear regression function for the data.
print(paste0("E[y] = ", lm_prod_0$coef[1], " + ", lm_prod_0$coef[2], "x"))
[1] "E[y] = 6.86348691334742 + 0.533274922835527x"
  1. Do you consider any transformation on X or Y? Explain your reasoning.

Yes. The scatter plot suggests that the response grows roughly like the square root of the explanatory variable, so the relationship is curved rather than linear. Two transformations could straighten it: taking the square root of the explanatory variable (x′ = √x) or squaring the response (y′ = y^2). Transforming x has the advantage of leaving the distribution of the error terms unchanged.

  1. Use the transformation x′ = √x and obtain the estimated linear regression function for the transformed data.
print(paste0("E[y] = ", lm_prod_1$coef[1], " + ", lm_prod_1$coef[2], " * sqrt(x)"))
[1] "E[y] = 1.25469659535303 + 3.62352027735283 * sqrt(x)"
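How lm_prod_1 was fitted is not shown in this document. One common way to fit such a model, sketched here on simulated data, is to apply sqrt() directly inside the model formula. The variables x and y and the generating coefficients are hypothetical, chosen only so that the response is linear in √x.

```r
set.seed(3)
x <- seq(1, 25, length.out = 40)               # hypothetical predictor values
y <- 1 + 3.5 * sqrt(x) + rnorm(40, sd = 0.5)   # hypothetical response, linear in sqrt(x)

fit_raw  <- lm(y ~ x)         # straight line in the original x
fit_sqrt <- lm(y ~ sqrt(x))   # straight line in the transformed x' = sqrt(x)

print(coef(fit_sqrt))         # intercept and slope on the sqrt(x) scale
print(c(raw  = summary(fit_raw)$r.squared,
        sqrt = summary(fit_sqrt)$r.squared))
```

The slope of fit_sqrt is the change in the mean response per unit of √x, not per unit of x, which is why the printed regression function should show sqrt(x).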
  1. Plot a scatter plot of the transformed data then add the estimated regression line on a graph. Is a simple linear regression appropriate for modeling this transformed data?
lm_prod_1_plot

Yes, a simple linear regression now appears appropriate for the transformed data.

  1. Plot the residuals against the fitted values. What does this plot show?
lm_prod_1_residuals_plot

The residuals are scattered randomly around zero with no clear pattern. This supports the linearity and constant-variance assumptions and indicates that the linear model fits the transformed data well.
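How lm_prod_1_residuals_plot was built is not shown here. A minimal base-R sketch of a residuals-versus-fitted plot follows, using simulated data. The variables x, y, and fit are hypothetical stand-ins for the production data and lm_prod_1.

```r
set.seed(4)
x <- seq(1, 25, length.out = 40)               # hypothetical predictor values
y <- 1 + 3.5 * sqrt(x) + rnorm(40, sd = 0.5)   # hypothetical response values
fit <- lm(y ~ sqrt(x))

# Residuals against fitted values; a patternless band around 0 is the desired picture
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)                         # reference line at zero
```

A curve in this plot would point to a remaining nonlinearity, and a funnel shape would point to non-constant variance.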

  1. Provide Normal Q-Q plot. What does this plot show?
qq_prod_1_plot

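The Normal Q-Q plot supports the normality assumption for the error terms. The regression model assumes that the errors are normally distributed with mean 0 and constant variance, and a Q-Q plot whose points lie close to the reference line is consistent with that assumption.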
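For reference, a Normal Q-Q plot of standardized residuals can be drawn in base R as sketched below. The data are simulated and the names x, y, fit, and qq are hypothetical; qq_prod_1_plot itself may have been produced differently.

```r
set.seed(5)
x <- seq(1, 25, length.out = 40)               # hypothetical predictor values
y <- 1 + 3.5 * sqrt(x) + rnorm(40, sd = 0.5)   # hypothetical response with normal errors
fit <- lm(y ~ sqrt(x))

std_res <- rstandard(fit)                      # standardized residuals
qq <- qqnorm(std_res, main = "Normal Q-Q plot of standardized residuals")
qqline(std_res, lty = 2)                       # reference line through the quartiles
```

Points that bend away from the line at both ends suggest heavy or light tails, and a consistent curve to one side suggests skewness.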