1. Consider the data set given below
x<-c(0.22, -2.54, 0.52, 0.75, 25)
# and weights given by
w<-c(2, 1, 3, 1, 2)
# Give the value of mu that minimizes the least squares equation
Answer: Note that the weighted least squares criterion sum(w*(x - mu)^2) is minimized by the weighted mean of x, that is, the empirical mean with each point counted according to its weight. Hence, we have mu=
sum(x*w)/sum(w)
[1] 5.578889
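As a quick numerical check, the criterion can be minimized directly and compared with the closed form. This is only a sketch: the helper name wls and the use of optimize() are illustrative choices, not part of the original exercise.

```r
x <- c(0.22, -2.54, 0.52, 0.75, 25)
w <- c(2, 1, 3, 1, 2)

# Weighted least squares criterion as a function of mu (hypothetical helper)
wls <- function(mu) sum(w * (x - mu)^2)

# One-dimensional numerical minimization over the range of the data
opt <- optimize(wls, interval = range(x), tol = 1e-8)

mu_numeric <- opt$minimum
mu_closed  <- sum(x * w) / sum(w)   # weighted mean
c(numeric = mu_numeric, closed_form = mu_closed)
```

The two values should agree up to the tolerance of the optimizer.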
2. Consider the following data set
x<-c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)
y<-c(2.39, 1.72, 2.55, 1.48, 2.19, 0.59, 2.23, 1.65, 2.49, 1.05)
# Fit the regression through the origin and get the slope, treating y as the outcome and x as the regressor. (Hint: do not center the data, since we want regression through the origin, not through the means of the data.)
Answer:
fit<- lm(y ~ x + 0)
coef(fit)
x
1.151408
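For regression through the origin, the least squares slope has the closed form sum(x*y)/sum(x^2). The sketch below compares that formula with the lm fit; the variable names slope_closed and slope_lm are introduced here only for illustration.

```r
x <- c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)
y <- c(2.39, 1.72, 2.55, 1.48, 2.19, 0.59, 2.23, 1.65, 2.49, 1.05)

# Closed-form slope for the no-intercept model y = beta * x + error
slope_closed <- sum(x * y) / sum(x^2)

# Same slope from lm with the intercept removed
slope_lm <- unname(coef(lm(y ~ x + 0)))

c(closed_form = slope_closed, lm = slope_lm)
```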
3. Do data(mtcars) from the datasets package and fit the regression model with mpg as the outcome and drat (Rear axle ratio) as the predictor. Give the slope coefficient.
Answer:
data("mtcars")
fit<- lm(mpg ~ drat, mtcars)
fit
Call:
lm(formula = mpg ~ drat, data = mtcars)
Coefficients:
(Intercept) drat
-7.525 7.678
Thus, we have a slope coefficient of 7.678.
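The same coefficients can be recovered without lm, since the slope equals the correlation times sd(outcome)/sd(predictor) and the fitted line passes through the point of means. A minimal sketch, with slope and intercept as illustrative names:

```r
data("mtcars")

# Slope: correlation times the ratio of standard deviations (outcome over predictor)
slope <- cor(mtcars$mpg, mtcars$drat) * sd(mtcars$mpg) / sd(mtcars$drat)

# Intercept: the fitted line passes through (mean(drat), mean(mpg))
intercept <- mean(mtcars$mpg) - slope * mean(mtcars$drat)

c(intercept = intercept, slope = slope)
```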
4. Refer to question 3. Test the hypothesis of no linear relationship between rear axle ratio and miles per gallon.
Answer: We have our hypotheses:
H_0: The coefficient of drat is 0, which implies that there is no linear relationship.
H_a: The coefficient of drat is nonzero, which implies that there is a linear relationship.
summary(lm(mpg ~ drat, mtcars))$coef
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -7.524618   5.476663 -1.373942 0.1796390847
drat         7.678233   1.506705  5.096042 0.0000177624
Using alpha=0.05, the p-value for drat is 0.0000177624, which is less than 0.05. Hence we reject the null hypothesis that the coefficient of drat is zero and conclude that there is evidence of a linear relationship between rear axle ratio and mpg.
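The t value and p-value in the coefficient table can be reproduced by hand: the t statistic is the estimate divided by its standard error, and the two-sided p-value comes from a t distribution with n-2 degrees of freedom. A sketch under those standard assumptions, with est, se, tstat and pval as illustrative names:

```r
fit <- lm(mpg ~ drat, mtcars)

est <- unname(coef(fit)["drat"])            # estimated slope
se  <- sqrt(vcov(fit)["drat", "drat"])      # its standard error
df  <- df.residual(fit)                     # n - 2 for simple linear regression

tstat <- est / se
pval  <- 2 * pt(-abs(tstat), df = df)       # two-sided p-value

c(t = tstat, df = df, p = pval)
```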
5. Consider data with an outcome (Y) and a predictor (X). The standard deviation of the predictor is one third that of the outcome. The correlation between the two variables is 0.7. What value would the slope coefficient take for the regression model with Y as the outcome and X as the predictor?
Answer: The slope of the regression line is the correlation between the outcome and the predictor multiplied by the ratio of their standard deviations, sd(Y)/sd(X). Since sd(X) is one third of sd(Y), this ratio is 3.
slope <- 0.7 * 1/(1/3)
slope
[1] 2.1
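Only the ratio of the standard deviations matters here, not their actual sizes. The sketch below tries a few arbitrary values for sd(Y); these values are hypothetical and chosen only for illustration.

```r
r    <- 0.7
sd_y <- c(1, 3, 10)     # arbitrary, hypothetical outcome standard deviations
sd_x <- sd_y / 3        # predictor SD is one third of the outcome SD

# Slope of Y on X: correlation times sd(Y)/sd(X)
slopes <- r * sd_y / sd_x
slopes
```

Each entry of slopes should equal 2.1, whatever scale is chosen for Y.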
6. You ask a collection of husbands and wives to guess how many jellybeans are in a jar. The correlation is 0.6. The standard deviation for the husbands is 14 beans while the standard deviation for wives is 10 beans. Assume that the data were centered so that 0 is the mean for each. The centered guess for a husband was 40 beans (above the mean). What would be your best estimate of the wife’s guess?
Answer: Since we are predicting the wife’s guess from the husband’s, the wife’s guess is the outcome and the husband’s guess is the predictor. The slope is the correlation multiplied by the ratio of the standard deviations, sd(wife)/sd(husband). Hence, we have
slope<-0.6*(10/14)
slope
[1] 0.4285714
Now, to estimate the wife’s guess, we multiply the slope by the husband’s centered guess (no intercept is needed because the data are centered). Then we have
wg<-slope*40
wg
[1] 17.14286
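The direction of the regression matters, because the two slopes are not reciprocals of each other. The sketch below computes both; the variable names are illustrative, and the reverse-direction slope is shown only for comparison and is not needed for the answer.

```r
r      <- 0.6
sd_husband <- 14
sd_wife    <- 10

# Slope for predicting the wife's guess from the husband's guess
slope_wife_on_husband <- r * sd_wife / sd_husband

# Slope for the reverse direction (husband's guess from the wife's guess)
slope_husband_on_wife <- r * sd_husband / sd_wife

wife_guess <- slope_wife_on_husband * 40
c(wife_on_husband = slope_wife_on_husband,
  husband_on_wife = slope_husband_on_wife,
  wife_guess      = wife_guess)
```

The product of the two slopes equals the squared correlation, which is why predictions in both directions shrink toward the mean.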
7. Consider the data given by the following
x <- c(10.45, 9.45, 12.41, 14.46, 15.26)
# What is the value of the first measurement if x were normalized (to have mean 0 and variance 1)?
Answer: Normalizing x means subtracting the mean from each observation and dividing the result by the standard deviation. Thus, the value of the first measurement is
xn<-(x-mean(x))/sd(x)
xn[1]
[1] -0.7835272
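A short check that the normalized data really have mean 0 and standard deviation 1, and that base R's scale() gives the same values. The name xs is introduced here only for the comparison.

```r
x  <- c(10.45, 9.45, 12.41, 14.46, 15.26)
xn <- (x - mean(x)) / sd(x)

# scale() centers and scales by default; it returns a matrix, so flatten it
xs <- as.numeric(scale(x))

c(mean = mean(xn), sd = sd(xn))
all.equal(xn, xs)
```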
8. Consider the following data set (used above as well). What is the intercept for fitting the model with x as the predictor and y as the outcome?
x<-c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)
y<-c(2.39, 1.72, 2.55, 1.48, 2.19, 0.59, 2.23, 1.65, 2.49, 1.05)
Answer:
fit<-lm(y ~ x)
fit
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
2.2472 -0.2627
The intercept is 2.2472.
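The intercept can also be computed by hand, because the fitted line passes through the point of means: intercept = mean(y) - slope * mean(x). A minimal sketch, with b0 and b1 as illustrative names:

```r
x <- c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)
y <- c(2.39, 1.72, 2.55, 1.48, 2.19, 0.59, 2.23, 1.65, 2.49, 1.05)

# Slope: covariance of x and y divided by the variance of x
b1 <- cov(x, y) / var(x)

# Intercept: chosen so the line passes through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)
```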
9. Consider the data given by
x <- c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)
What value minimizes the sum of the squared distances between these points and itself?
Answer: The value that minimizes the sum of squared distances between these points and itself is the mean of x. Thus,
xbar<-mean(x)
xbar
[1] 1.573
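To see this numerically, the sum of squared distances can be evaluated over a grid of candidate values. This is a sketch only: the helper ss, the grid spacing of 0.001 and the comparison point 2 are arbitrary choices made for illustration.

```r
x <- c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)

# Sum of squared distances from the data to a candidate value c0 (hypothetical helper)
ss <- function(c0) sum((x - c0)^2)

# Evaluate on a fine grid and pick the grid point with the smallest value
grid <- seq(min(x), max(x), by = 0.001)
best <- grid[which.min(sapply(grid, ss))]

# Any other value c0 costs an extra n * (c0 - mean(x))^2
extra <- ss(2) - ss(mean(x))
c(grid_minimizer = best, mean = mean(x), extra_at_2 = extra)
```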
10. Fit a linear regression model to the mtcars
dataset with the variable drat as the predictor and the variable mpg as
the outcome. Plot the drat (horizontal axis) versus the residuals
(vertical axis).
Answer:
library(ggplot2)
fit<-lm(mpg~drat, mtcars)
temp<-mtcars;temp$resid<-resid(fit)
g<-ggplot(temp, aes(x=drat, y=resid))+geom_hline(yintercept=0, col="red")+geom_point(alpha=0.5, cex=5)
g
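When reading this plot, it helps to recall two properties of least squares residuals in a model with an intercept: they sum to zero, and they are uncorrelated with the predictor. That is why the points scatter around the horizontal line at 0 with no linear trend. A brief check, using e as an illustrative name:

```r
fit <- lm(mpg ~ drat, mtcars)
e   <- resid(fit)

# Both quantities are zero up to floating point error
c(sum_resid = sum(e), sum_resid_times_drat = sum(e * mtcars$drat))
```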
11. Refer to question 10. Directly estimate the residual variance and compare this estimate to the output of lm.
Answer: The residual variance estimates the variation that is left unexplained by the linear model. We have the residual variance equal to
fit<-lm(mpg~drat, mtcars)
sum(resid(fit)^2)/(nrow(mtcars)-2)
[1] 20.11889
summary(fit)$sigma^2
[1] 20.11889
Thus the residual variance is the sum of the squared residuals divided by n-2, and the direct estimate matches the value reported by lm.
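The divisor n-2 reflects the two estimated coefficients (intercept and slope). The sketch below compares it with dividing by n, which gives a smaller value; the names s2 and s2_n are illustrative.

```r
fit <- lm(mpg ~ drat, mtcars)
n   <- nrow(mtcars)

# Residual sum of squares divided by the residual degrees of freedom (n - 2)
s2   <- deviance(fit) / df.residual(fit)

# Dividing by n instead gives a smaller value than s2
s2_n <- deviance(fit) / n

c(divide_by_n_minus_2 = s2, divide_by_n = s2_n)
```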
12. Refer to question 10. Give the R squared for
this model.
Answer: We get the R squared for this model by getting
the summary of fit. We have,
summary(fit)
Call:
lm(formula = mpg ~ drat, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-9.0775 -2.6803 -0.2095 2.2976 9.0225
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.525 5.477 -1.374 0.18
drat 7.678 1.507 5.096 1.78e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.485 on 30 degrees of freedom
Multiple R-squared: 0.464, Adjusted R-squared: 0.4461
F-statistic: 25.97 on 1 and 30 DF, p-value: 1.776e-05
Hence, we have the R squared equal to 0.464. Therefore, about 46.4 percent of the variation in miles per gallon is explained by the linear relationship with drat.
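R squared can also be computed directly, either as 1 minus the ratio of the residual sum of squares to the total sum of squares, or, in simple linear regression, as the squared correlation between outcome and predictor. A sketch, with illustrative names:

```r
fit <- lm(mpg ~ drat, mtcars)

sse <- sum(resid(fit)^2)                          # unexplained variation
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)     # total variation

r2_from_ss  <- 1 - sse / sst
r2_from_cor <- cor(mtcars$mpg, mtcars$drat)^2

c(from_sums_of_squares = r2_from_ss, from_correlation = r2_from_cor)
```

Either value can be extracted from the fit without printing the whole summary via summary(fit)$r.squared.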
13. Load the mtcars dataset. Fit a linear regression
with miles per gallon as the outcome and drat as the predictor. Plot
drat versus the residuals.
Answer:
library(ggplot2)
fit<-lm(mpg~drat, mtcars)
temp<-mtcars;temp$resid<-resid(fit)
g<-ggplot(temp, aes(x=drat, y=resid))+geom_hline(yintercept=0, col="red")+geom_point(alpha=0.5, cex=5)
g
14. Refer to question 13. Directly estimate the
residual variance and compare this estimate to the output of lm.
Answer: The residual variance estimates the variation that is left unexplained by the linear model. We have the residual variance equal to
fit<-lm(mpg~drat, mtcars)
sum(resid(fit)^2)/(nrow(mtcars)-2)
[1] 20.11889
summary(fit)$sigma^2
[1] 20.11889
Thus the residual variance is the sum of the squared residuals divided by n-2, and the direct estimate matches the value reported by lm.
15. Refer to question 13. Give the R squared for
this model.
Answer: We get the R squared for this model by getting
the summary of fit. We have,
summary(fit)
Call:
lm(formula = mpg ~ drat, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-9.0775 -2.6803 -0.2095 2.2976 9.0225
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.525 5.477 -1.374 0.18
drat 7.678 1.507 5.096 1.78e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.485 on 30 degrees of freedom
Multiple R-squared: 0.464, Adjusted R-squared: 0.4461
F-statistic: 25.97 on 1 and 30 DF, p-value: 1.776e-05
Hence, we have the R squared equal to 0.464. Therefore, about 46.4 percent of the variation in miles per gallon is explained by the linear relationship with drat.