Consider the following data with x as the predictor and y as the outcome.
x <- c(0.61, 0.93, 0.83, 0.35, 0.54, 0.16, 0.91, 0.62, 0.62)
y <- c(0.67, 0.84, 0.6, 0.18, 0.85, 0.47, 1.1, 0.65, 0.36)
Give a P-value for the two-sided hypothesis test of whether \(\beta_1\) from a linear regression model is 0 or not.
fit <- lm(y ~ x)
summary(fit)$coeff[2, 4] # row 2 = slope, column 4 = Pr(>|t|)
## [1] 0.05296439
#summary(lm(y~x))$coefficients[2,4] # gives the same value
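As a check, the same p-value can be reproduced by hand from the slope's t-statistic and the residual degrees of freedom; this sketch assumes the slope is the second row of the coefficient table.
est <- summary(fit)$coeff[2, 1] # slope estimate
se <- summary(fit)$coeff[2, 2] # standard error of the slope
2 * pt(abs(est / se), df = fit$df.residual, lower.tail = FALSE)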
Consider the previous problem. Give the estimate of the residual standard deviation.
summary(fit)$sigma
## [1] 0.2229981
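The same value can be computed directly from the residuals, since the residual variance divides by n - 2 (a small sketch using the fit above):
n <- length(y)
sqrt(sum(resid(fit)^2) / (n - 2)) # residual standard deviation by hand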
In the mtcars data set, fit a linear regression model of weight (predictor) on mpg (outcome). Get a 95% confidence interval for the expected mpg at the average weight. What is the lower endpoint?
y <- mtcars$mpg
x <- mtcars$wt
fit <- lm(y ~ x)
# lower endpoint of the 95% CI for the expected mpg at the average weight
predict(fit, data.frame(x = mean(x)), interval = "confidence")[2]
## [1] 18.99098
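Because the fitted line passes through \((\bar x, \bar y)\), the interval can also be sketched by hand as \(\bar y \pm t_{0.975, n-2}\, \hat\sigma / \sqrt{n}\) (using the fit above):
n <- length(y)
mean(y) + c(-1, 1) * qt(0.975, df = n - 2) * summary(fit)$sigma / sqrt(n)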
Refer to the previous question. Read the help file for mtcars. What is the weight coefficient interpreted as?
Answer:
mpg: Miles/(US) gallon
wt: Weight (1000 lbs)
So, the weight coefficient is the estimated expected change in mpg per 1,000 lb increase in weight.
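In code, that coefficient is the second element of coef(fit) from the model above; it is negative (about -5.34), so heavier cars are predicted to get fewer miles per gallon.
coef(fit)[2] # estimated change in mpg per 1,000 lb increase in weight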
Consider again the mtcars data set and a linear regression model with mpg as predicted by weight (1,000 lbs). A new car weighing 3,000 lbs is coming. Construct a 95% prediction interval for its mpg. What is the upper endpoint?
predict(fit, data.frame(x = 3), interval = "prediction")[3] # upper endpoint
## [1] 27.57355
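For contrast, the prediction interval can be compared with the confidence interval at the same weight; the prediction interval is wider because it also accounts for the variability of a single new observation, not just the uncertainty in the estimated mean.
predict(fit, data.frame(x = 3), interval = "confidence") # interval for the mean mpg at 3,000 lbs
predict(fit, data.frame(x = 3), interval = "prediction") # interval for one new car's mpg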
Consider again the mtcars data set and a linear regression model with mpg as predicted by weight (in 1,000 lbs). A “short” ton is defined as 2,000 lbs. Construct a 95% confidence interval for the expected change in mpg per 1 short ton increase in weight. Give the lower endpoint.
y2 <- mtcars$mpg
x2 <- mtcars$wt / 2
fit2 <- lm(y2 ~ x2)
mn <- summary(fit2)$coeff[2, 1] # slope estimate: change in mpg per short ton
se <- summary(fit2)$coeff[2, 2] # standard error of the slope
df <- fit2$df.residual
ci <- mn + c(-1, 1) * se * qt(0.975, df)
ci
## [1] -12.97262 -8.40527
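The same interval comes directly from confint(), which is a handy cross-check on fit2:
confint(fit2)[2, ] # 95% CI for the slope (mpg per short ton)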
If my X from a linear regression is measured in centimeters and I convert it to meters, what would happen to the slope coefficient?
Answer: The slope is the expected change in Y per one-unit increase in X. A one-unit (1 m) increase in the rescaled X corresponds to a 100-unit increase in the original X, so the slope will be multiplied by 100.
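A quick numerical sketch (reusing mtcars purely for illustration and pretending wt were measured in centimeters) confirms that dividing X by 100 multiplies the slope by 100:
x_cm <- mtcars$wt # pretend these are centimeters (illustration only)
x_m <- x_cm / 100 # convert to "meters"
coef(lm(mtcars$mpg ~ x_cm))[2]
coef(lm(mtcars$mpg ~ x_m))[2] # 100 times the previous slope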
I have an outcome, \(Y\), and a predictor, \(X\), and fit a linear regression model \(Y = \beta_0 + \beta_1 X + \epsilon\) to obtain \(\hat \beta_0\) and \(\hat \beta_1\). What would be the consequence for the subsequent slope and intercept if I were to refit the model with a new regressor, X + c, for some constant c?
Answer:
The slope will be the same, but the intercept will change. Since \(\hat\beta_0 = \overline Y - \hat\beta_1 \overline X\), the new intercept \(\hat\beta_0'\) is \(\hat\beta_0' = \overline Y - \hat\beta_1 (\overline X + c) = \hat\beta_0 - \hat\beta_1 c\).
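A short numerical check with mtcars and an arbitrary constant shows the slope unchanged and the intercept shifted to \(\hat\beta_0 - \hat\beta_1 c\) (the variable c0 below is just an illustrative choice):
c0 <- 2 # arbitrary shift
coef(lm(mpg ~ wt, data = mtcars))
coef(lm(mpg ~ I(wt + c0), data = mtcars)) # same slope; intercept equals the old intercept minus slope * c0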
Refer back to the mtcars data set with mpg as an outcome and weight (wt) as the predictor. About what is the ratio of the sum of the squared errors, \(\sum_{i=1}^n (Y_i - \hat Y_i)^2\), when comparing a model with just an intercept (denominator) to the model with the intercept and slope (numerator)?
y <- mtcars$mpg
x <- mtcars$wt
var1 <- sum((y - mean(y))^2) # SSE for the intercept-only model (the total sum of squares)
fit <- lm(y ~ x)
var2 <- sum(fit$residuals^2) # SSE for the model with intercept and slope
var2 / var1
## [1] 0.2471672
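Equivalently, this ratio is \(1 - R^2\) for the model with the slope, since \(R^2 = 1 - \sum_{i=1}^n (Y_i - \hat Y_i)^2 / \sum_{i=1}^n (Y_i - \bar Y)^2\); a one-line check:
1 - summary(fit)$r.squared # same ratio as above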
Do the residuals always have to sum to 0 in linear regression?
Answer:
Consider the simple one-predictor case. Least squares chooses the intercept \(a\) and slope \(b\) to minimize
\(\sum_{i = 1}^n (y_i - a - b x_i)^2.\)
To minimize this sum, take the partial derivatives with respect to \(a\) and \(b\) and set them equal to \(0\). The partial derivative with respect to \(a\) gives
\(-2 \sum_{i = 1}^n (y_i - a - b x_i) = 0.\)
But \(\sum_{i = 1}^n (y_i - a - b x_i)\) is exactly the sum of the residuals, so it must equal \(0\). Setting \(b = 0\) shows that the same equation holds for an intercept-only model. Hence, whenever the intercept is included (with or without a slope), the residuals sum to \(0\).
Now consider linear regression without the intercept (regression through the origin). Least squares minimizes
\(\sum_{i = 1}^n (y_i - b x_i)^2.\)
Taking the derivative with respect to \(b\) and setting it equal to \(0\) gives \(-2 \sum_{i = 1}^n (y_i - b x_i) x_i = 0\), or
\(\sum_{i = 1}^n r_i x_i = 0,\) where \(r_i\) is the \(i\)-th residual.
So when the intercept is not included, the residuals are constrained to be orthogonal to \(x\), but their sum will generally not equal \(0\).
So, the correct answer is:
If an intercept is included, then they will sum to 0.
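A quick numerical illustration with mtcars (the no-intercept fit uses the y ~ x - 1 formula):
with_int <- lm(mpg ~ wt, data = mtcars)
without_int <- lm(mpg ~ wt - 1, data = mtcars)
sum(resid(with_int)) # essentially 0 (up to floating-point error)
sum(resid(without_int)) # generally not 0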