Assignment 3 Max Zalewski

load("assignment 3.RData")
library(gt)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Question 1

What is the predicted value and 95% prediction interval for the muscle mass for women of age 60? Interpret your prediction interval.

ci_60 = predict(object = mm_lm, newdata = data.frame(age = c(60)), interval = "confidence", level = .95)
ci_60 = as.data.frame(ci_60)
ci_60 %>% select(lwr,fit,upr) %>% gt()

lwr	fit	upr
82.83471	84.94683	87.05895

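Because the interval above was computed with interval = "confidence", it is a 95% confidence interval for the mean response rather than a prediction interval: we are 95% confident that the mean muscle mass of women aged 60 lies between 82.83 and 87.06. A 95% prediction interval for an individual woman of age 60 is centered at the same predicted value, 84.95, but is wider, and is obtained by setting interval = "prediction".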
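The difference between the two interval types can be sketched on simulated data. The data frame `sim`, its coefficients, and the object names below are hypothetical stand-ins, not the assignment's `mm_lm` fit.

```r
# Simulated stand-in for the muscle mass data (hypothetical values, not the real data)
set.seed(1)
sim <- data.frame(age = runif(60, 40, 80))
sim$mass <- 150 - 1.2 * sim$age + rnorm(60, sd = 8)
sim_lm <- lm(mass ~ age, data = sim)
new_obs <- data.frame(age = 60)

# Confidence interval: uncertainty about the MEAN response at age 60
conf_int <- predict(sim_lm, newdata = new_obs, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about ONE new individual's response at age 60
pred_int <- predict(sim_lm, newdata = new_obs, interval = "prediction", level = 0.95)

print(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
```

Both intervals share the same fitted value; the prediction interval is wider because it also accounts for the error variance of a single new observation.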

Plot the residuals y_i − y_hat_i against x_i on one graph.

residual_plot_0

Plot the values y_hat_i − y_bar against x_i on one graph, using the same scales as in the graph in part (a).

y_hat_less_y_bar_plot

From the two graphs in parts (b) and (c), does SSE or SSR appear to be the larger component of SSTO? What does this imply about the magnitude of R^2?

SSR appears to be the larger component of SSTO. Since SSTO = SSR + SSE and R^2 = SSR/SSTO, this implies that R^2 is greater than 0.5.
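The decomposition behind this answer can be checked numerically. The sketch below uses simulated data (the variables `x`, `y`, and `fit` are hypothetical, not the assignment's objects) to confirm that SSTO = SSR + SSE and that SSR exceeds SSE exactly when R^2 exceeds 0.5.

```r
# Simulated data with a linear trend (hypothetical values)
set.seed(2)
x <- runif(60, 40, 80)
y <- 150 - 1.2 * x + rnorm(60, sd = 8)
fit <- lm(y ~ x)

sse  <- sum(residuals(fit)^2)          # unexplained: sum of (y_i - y_hat_i)^2
ssr  <- sum((fitted(fit) - mean(y))^2) # explained:   sum of (y_hat_i - y_bar)^2
ssto <- sum((y - mean(y))^2)           # total:       sum of (y_i - y_bar)^2

r2 <- ssr / ssto
print(c(SSE = sse, SSR = ssr, SSTO = ssto, R2 = r2))
```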

Provide the ANOVA table.

anova_gt

NULL

What proportion of the total variation in muscle mass remains unexplained when age is added into the model? Is this proportion relatively small or large?

sse_prop

Proportion of Unexplained Variance
0.2499332

About 25% of the total variation in muscle mass remains unexplained once age is in the model. This proportion is relatively small: SSE is roughly one third of SSR, so age explains about 75% of the total variation.
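As a quick arithmetic check, the proportion can be recomputed by hand as SSE/SSTO. The sums of squares below are copied from the ANOVA output reported in this assignment; the variable names are only illustrative.

```r
# Sums of squares as reported in the assignment's ANOVA output
ssr <- 11627.486
sse <- 3874.447

ssto <- ssr + sse          # total sum of squares
unexplained <- sse / ssto  # proportion of variation left unexplained (1 - R^2)
print(unexplained)
```

Because the inputs are rounded to three decimals, the result agrees with the reported 0.2499332 only up to rounding.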

Conduct a hypothesis test H_0 : β1 = 0 using an F test with significance level α = 0.05. Clearly state the alternatives, test statistics and conclusion.

We can obtain our F-statistic and p_value through R by using anova.

anova_mm

TERM	DF	SS	MS	F-stat	p-value
SSR	1	11627.486	11627.48584	174.062	4.123987e-19
SSE	58	3874.447	66.80082	NA	NA

The hypotheses are H_0: β1 = 0 versus H_a: β1 ≠ 0. The test statistic is F = MSR/MSE = 174.062 on 1 and 58 degrees of freedom, with a p-value of 4.12e-19, which is far below our significance level of α = 0.05. We therefore reject the null hypothesis and conclude that there is significant evidence that β1 ≠ 0, i.e. of a linear relationship between age and muscle mass.
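The F test can also be reproduced directly from the mean squares in the table above. This is a minimal sketch; the variable names are illustrative, and the critical-value comparison is an equivalent way of reaching the same decision.

```r
# Mean squares and degrees of freedom copied from the ANOVA table
msr <- 11627.48584
mse <- 66.80082
df1 <- 1
df2 <- 58

f_stat  <- msr / mse                                    # F = MSR / MSE
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)     # upper-tail probability
f_crit  <- qf(0.95, df1, df2)                           # critical value at alpha = 0.05

print(c(F = f_stat, critical = f_crit, p = p_value))
reject_h0 <- f_stat > f_crit
```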

Obtain R^2 and r.

R2 = anova_table[1,"Sum Sq"] / (anova_table[1,"Sum Sq"] + anova_table[2,"Sum Sq"])
# r carries the sign of the estimated slope; sqrt(R2) alone gives only |r|
r = sign(coef(mm_lm)[2]) * sqrt(R2)
print(paste("R^2 : ",R2))

[1] "R^2 :  "

print(paste("r : ", r))

[1] "r :  "
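The printed values above came out blank. From the proportion of unexplained variance reported earlier, R^2 = 1 − 0.2499332 = 0.7500668, so |r| is about 0.866. The sign of r matches the sign of the slope, which the sketch below illustrates on simulated data with a deliberately negative slope (all names and values here are hypothetical).

```r
# Simulated data with a negative slope by construction (hypothetical values)
set.seed(3)
x <- runif(60, 40, 80)
y <- 150 - 1.2 * x + rnorm(60, sd = 8)
fit <- lm(y ~ x)

r2 <- summary(fit)$r.squared
r  <- sign(coef(fit)[["x"]]) * sqrt(r2)  # attach the slope's sign to sqrt(R^2)

print(c(R2 = r2, r = r, cor = cor(x, y)))
```

In simple linear regression this signed value equals the sample correlation between x and y.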

Question 2

Plot a scatter plot of the data. Is a simple linear regression appropriate for modeling this data?

prod_scatter_plot

No. The points follow a curved pattern rather than a straight line, so a simple linear regression on the untransformed data does not fit well.

Obtain the estimated linear regression function for the data.

print(paste0("E[y] = ", lm_prod_0$coef[1], " + ", lm_prod_0$coef[2], "x"))

[1] "E[y] = 6.86348691334742 + 0.533274922835527x"

Do you consider any transformation on X or Y? Explain your reasoning.

Yes. The scatter plot suggests that the response grows roughly like the square root of the explanatory variable. I would therefore consider either taking the square root of the explanatory variable (x′ = √x) or squaring the response variable (y′ = y^2); either transformation should make the relationship closer to linear.

Use the transformation x′ = √x and obtain the estimated linear regression function for the transformed data.

print(paste0("E[y] = ", lm_prod_1$coef[1], " + ", lm_prod_1$coef[2], " * sqrt(x)"))

[1] "E[y] = 1.25469659535303 + 3.62352027735283 * sqrt(x)"
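How `lm_prod_1` was fit is not shown here; one common way to fit this kind of model is to apply the transformation inside the formula. The sketch below does that on simulated data generated from a square-root relationship (the data and object names are hypothetical) and compares it with the untransformed fit.

```r
# Simulated data following a square-root trend (hypothetical values)
set.seed(4)
x <- runif(50, 1, 100)
y <- 1 + 3.5 * sqrt(x) + rnorm(50, sd = 1)

fit_raw  <- lm(y ~ x)        # straight line in x
fit_sqrt <- lm(y ~ sqrt(x))  # straight line in x' = sqrt(x)

print(coef(fit_sqrt))
print(c(raw = summary(fit_raw)$r.squared, sqrt = summary(fit_sqrt)$r.squared))
```

The slope of `fit_sqrt` is the expected change in y per unit change in √x, not per unit change in x.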

Plot a scatter plot of the transformed data then add the estimated regression line on a graph. Is a simple linear regression appropriate for modeling this transformed data?

lm_prod_1_plot

`geom_smooth()` using formula = 'y ~ x'

Yes, I would say this data is now appropriate for linear regression.

Plot the residuals against the fitted values. What does this plot show?

lm_prod_1_residuals_plot

The residuals appear randomly scattered around zero with no clear pattern, which suggests that the linear model on the transformed data is a good fit.
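A residuals-versus-fitted plot of this kind can be produced with base R. The sketch below uses simulated data and hypothetical object names; it is not the code behind `lm_prod_1_residuals_plot`.

```r
# Simulated square-root data and fit (hypothetical values)
set.seed(5)
x <- runif(50, 1, 100)
y <- 1 + 3.5 * sqrt(x) + rnorm(50, sd = 1)
fit <- lm(y ~ sqrt(x))

res <- residuals(fit)
fv  <- fitted(fit)

# Look for curvature or a funnel shape; a structureless band around 0 is the goal
plot(fv, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Least-squares residuals always sum to zero and are uncorrelated with the fitted values, so the plot is read for shape (curvature, changing spread), not for an overall trend.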

Provide Normal Q-Q plot. What does this plot show?

qq_prod_1_plot

The Normal Q-Q plot checks our assumption that the residuals follow a normal distribution (with mean 0 and constant variance): points lying close to the reference line indicate that normality is reasonable. This plot supports that assumption.
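A Normal Q-Q plot of residuals can be drawn with base R's `qqnorm()` and `qqline()`. The sketch below again uses simulated data with normal errors (hypothetical values and names), and adds the correlation between theoretical and sample quantiles as a rough numeric summary of how straight the plot is.

```r
# Simulated square-root data with normal errors (hypothetical values)
set.seed(6)
x <- runif(50, 1, 100)
y <- 1 + 3.5 * sqrt(x) + rnorm(50, sd = 1)
res <- residuals(lm(y ~ sqrt(x)))

# qqnorm() draws the plot and returns the plotted quantiles invisibly
qq <- qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)  # reference line through the first and third quartiles

# Values near 1 mean the points lie close to a straight line
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```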