Regression Analysis Midterm Exam

1. Consider the data set given below

x<-c(0.22, -2.54, 0.52, 0.75, 25)
# and weights given by
w<-c(2, 1, 3, 1, 2)
# Give the value of mu that minimizes the least squares equation

Answer: The weighted least squares criterion sum(w*(x - mu)^2) is minimized by the weighted mean of x, that is, sum(x*w)/sum(w). Hence, we have mu =

sum(x*w)/sum(w)

  [1] 5.578889
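
As a quick numerical check of this claim, the sketch below minimizes the weighted sum of squares directly with optimize() and compares the result with the closed-form weighted mean. The names wss, opt, and mu_hat are introduced here for illustration only and are not part of the exam.

```r
x <- c(0.22, -2.54, 0.52, 0.75, 25)
w <- c(2, 1, 3, 1, 2)

# Weighted least squares criterion as a function of mu
wss <- function(mu) sum(w * (x - mu)^2)

# Numerical minimizer, searched over the range of the data
opt <- optimize(wss, interval = range(x))

# Closed-form minimizer: the weighted mean
mu_hat <- sum(x * w) / sum(w)

c(numerical = opt$minimum, closed_form = mu_hat)
```

The two values should agree up to the tolerance of the numerical search.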

2. Consider the following data set

x<-c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)
y<-c(2.39, 1.72, 2.55, 1.48, 2.19, 0.59, 2.23, 1.65, 2.49, 1.05)

# Fit the regression through the origin and give the slope, treating y as the outcome and x as the regressor. (Hint: do not center the data, since we want regression through the origin, not through the means of the data.)

Answer:

fit<- lm(y ~ x + 0)
coef(fit)

         x 
  1.151408
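
For regression through the origin, the least squares slope has the closed form sum(x*y)/sum(x^2). The sketch below, using the illustrative names slope_origin and fit0, checks that this formula agrees with lm.

```r
x <- c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)
y <- c(2.39, 1.72, 2.55, 1.48, 2.19, 0.59, 2.23, 1.65, 2.49, 1.05)

# Closed-form slope for a no-intercept model
slope_origin <- sum(x * y) / sum(x^2)

# Same model fitted with lm; "+ 0" removes the intercept
fit0 <- lm(y ~ x + 0)

all.equal(slope_origin, unname(coef(fit0)))
```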

3. Do data(mtcars) from the datasets package and fit the regression model with mpg as the outcome and drat (Rear axle ratio) as the predictor. Give the slope coefficient.

Answer:

data("mtcars")
fit<- lm(mpg ~ drat, mtcars)
fit

  
  Call:
  lm(formula = mpg ~ drat, data = mtcars)
  
  Coefficients:
  (Intercept)         drat  
       -7.525        7.678

Thus, the slope coefficient is 7.678.
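
The same coefficients can be recovered without lm, since the simple regression slope is cov(x, y)/var(x) and the fitted line passes through the point of means. A minimal sketch, with slope and intercept as illustrative names:

```r
data("mtcars")

# Slope: covariance of outcome and predictor over the predictor variance
slope <- cov(mtcars$mpg, mtcars$drat) / var(mtcars$drat)

# Intercept: the fitted line passes through (mean(drat), mean(mpg))
intercept <- mean(mtcars$mpg) - slope * mean(mtcars$drat)

c(intercept = intercept, slope = slope)
```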

4. Refer to question 3. Test the hypothesis of no linear relationship between rear axle ratio and miles per gallon.

Answer: The hypotheses are:
H_0: The slope coefficient of drat is zero, which implies that there is no linear relationship.
H_a: The slope coefficient of drat is nonzero, which implies that there is a linear relationship.

summary(lm(mpg ~ drat, mtcars))$coef

               Estimate Std. Error   t value     Pr(>|t|)
  (Intercept) -7.524618   5.476663 -1.373942 0.1796390847
  drat         7.678233   1.506705  5.096042 0.0000177624

Using alpha = 0.05, the p-value for drat is 0.0000177624, which is less than 0.05. Hence, we reject the null hypothesis that the slope coefficient of drat is zero. There is therefore statistically significant evidence of a linear relationship between rear axle ratio and mpg.
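
To make the test explicit, the t statistic and two-sided p-value can be computed by hand from the estimate and its standard error. This is a sketch; est, se, t_stat, and p_val are names introduced here for illustration.

```r
data("mtcars")
fit <- lm(mpg ~ drat, mtcars)

# Estimate and standard error of the drat slope
est <- coef(summary(fit))["drat", "Estimate"]
se  <- coef(summary(fit))["drat", "Std. Error"]

# t statistic under H_0: slope = 0, with n - 2 degrees of freedom
t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df = df.residual(fit))

c(t = t_stat, p = p_val)
```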

5. Consider data with an outcome (Y) and a predictor (X). The standard deviation of the predictor is one third that of the outcome. The correlation between the two variables is 0.7. What value would the slope coefficient take for the regression model with Y as the outcome and X as the predictor?

Answer: The slope of the regression line is the correlation between the outcome and the predictor multiplied by the ratio of their standard deviations, sd(Y)/sd(X). Since sd(X) is one third of sd(Y), this ratio is 1/(1/3) = 3.

slope <- 0.7 * 1/(1/3)
slope

  [1] 2.1
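
The formula can also be checked on data. The sketch below builds a hypothetical simulated sample (not part of the exam) whose correlation is exactly 0.7 and whose predictor standard deviation is exactly one third that of the outcome, then fits the model with lm.

```r
set.seed(1)  # simulated, purely illustrative data
n <- 100

# Standardized predictor: mean 0, sd 1
zx <- as.numeric(scale(rnorm(n)))

# Standardized noise made exactly uncorrelated with zx
ze <- as.numeric(scale(resid(lm(rnorm(n) ~ zx))))

x <- zx
y <- 3 * (0.7 * zx + sqrt(1 - 0.7^2) * ze)  # sd(y) = 3, cor(x, y) = 0.7

c(cor = cor(x, y),
  sd_ratio = sd(x) / sd(y),
  slope = unname(coef(lm(y ~ x))[2]))
```

The fitted slope should match 0.7 * 3 = 2.1 up to floating-point error.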

6. You ask a collection of husbands and wives to guess how many jellybeans are in a jar. The correlation is 0.6. The standard deviation for the husbands is 14 beans while the standard deviation for wives is 10 beans. Assume that the data were centered so that 0 is the mean for each. The centered guess for a husband was 40 beans (above the mean). What would be your best estimate of the wife’s guess?

Answer: Here the wife's guess is the outcome and the husband's guess is the predictor. The slope is therefore the correlation multiplied by the ratio of the standard deviations, sd(wife)/sd(husband). Hence, we have

slope<-0.6*(10/14)
slope

  [1] 0.4285714

Because the data are centered, the intercept is zero, so we estimate the wife’s guess by multiplying the slope by the husband’s centered guess. Then we have

wg<-slope*40
wg

  [1] 17.14286
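
Note that the direction of the regression matters: the slope for predicting the husband's guess from the wife's is not the reciprocal of the slope used above. A short sketch, with illustrative variable names:

```r
r    <- 0.6  # correlation between husband and wife guesses
sd_h <- 14   # standard deviation of husbands' guesses
sd_w <- 10   # standard deviation of wives' guesses

# Wife as outcome, husband as predictor (the case in this question)
slope_w_on_h <- r * sd_w / sd_h

# Husband as outcome, wife as predictor (the reverse regression)
slope_h_on_w <- r * sd_h / sd_w

c(wife_on_husband = slope_w_on_h,
  husband_on_wife = slope_h_on_w,
  product = slope_w_on_h * slope_h_on_w)
```

The product of the two slopes equals the squared correlation, 0.36, rather than 1.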

7. Consider the data given by the following

x <- c(10.45, 9.45, 12.41, 14.46, 15.26)
# What is the value of the first measurement if x were normalized (to have mean 0 and variance 1)?

Answer: To normalize x, subtract the mean from each observation and divide the result by the standard deviation. The value of the first normalized measurement is then

xn<-(x-mean(x))/sd(x)
xn[1]

  [1] -0.7835272
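
Base R's scale() performs the same centering and scaling by default. The sketch below compares it with the manual computation and confirms that the result has mean 0 and standard deviation 1; xs is an illustrative name.

```r
x <- c(10.45, 9.45, 12.41, 14.46, 15.26)

# Manual normalization
xn <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so convert it to a plain vector
xs <- as.numeric(scale(x))

all.equal(xn, xs)
c(mean = mean(xs), sd = sd(xs))
```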

8. Consider the following data set (used above as well). What is the intercept for fitting the model with x as the predictor and y as the outcome?

x<-c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)

y<-c(2.39, 1.72, 2.55, 1.48, 2.19, 0.59, 2.23, 1.65, 2.49, 1.05)

Answer:

fit<-lm(y ~ x)
fit

  
  Call:
  lm(formula = y ~ x)
  
  Coefficients:
  (Intercept)            x  
       2.2472      -0.2627

The intercept is 2.2472.
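
The intercept can also be obtained from the fact that the fitted line passes through the point of means, so that intercept = mean(y) - slope * mean(x). A minimal check, with b1 and b0 as illustrative names:

```r
x <- c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)
y <- c(2.39, 1.72, 2.55, 1.48, 2.19, 0.59, 2.23, 1.65, 2.49, 1.05)

# Slope from the correlation and standard deviations
b1 <- cor(x, y) * sd(y) / sd(x)

# Intercept from the point of means
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)
```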

9. Consider the data given by

x <- c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)

What value minimizes the sum of the squared distances between these points and itself?

Answer: The value that minimizes the sum of squared distances between these points and itself is the mean of x. Thus,

xbar <- mean(x)
xbar

  [1] 1.573
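
One way to see why the mean is the minimizer is the identity sum((x - m)^2) = sum((x - mean(x))^2) + n*(m - mean(x))^2, whose second term is zero only when m equals the mean. The sketch below checks this identity at a few arbitrary candidate values; ss, m_try, and excess are illustrative names.

```r
x <- c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)
xbar <- mean(x)

# Sum of squared distances from the data to a candidate value m
ss <- function(m) sum((x - m)^2)

# Arbitrary candidate values, chosen only for illustration
m_try <- c(1.2, 1.5, 1.573, 1.7, 2)

# Excess over the minimum should equal n * (m - xbar)^2
excess <- sapply(m_try, ss) - ss(xbar)
all.equal(excess, length(x) * (m_try - xbar)^2)
```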

10. Fit a linear regression model to the mtcars dataset with the variable drat as the predictor and the variable mpg as the outcome. Plot drat (horizontal axis) against the residuals (vertical axis).
Answer:

library(ggplot2)
fit <- lm(mpg ~ drat, mtcars)
temp <- mtcars
temp$resid <- resid(fit)
g <- ggplot(temp, aes(x = drat, y = resid)) +
  geom_hline(yintercept = 0, colour = "red") +
  geom_point(alpha = 0.5, size = 5)
g
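
When reading such a plot, it helps to recall that for a model with an intercept the least squares residuals sum to zero and are uncorrelated with the predictor, which is why the points scatter around the horizontal zero line with no linear trend. A small sketch of this check, with e as an illustrative name:

```r
data("mtcars")
fit <- lm(mpg ~ drat, mtcars)
e <- resid(fit)

# Both quantities should be zero up to floating-point error
c(sum_resid = sum(e), cor_with_drat = cor(e, mtcars$drat))
```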

11. Refer to question 10. Directly estimate the residual variance and compare this estimate to the output of lm.

Answer: The residual variance estimates the variation that is left unexplained by the linear model. We have the residual variance equal to

fit<-lm(mpg~drat, mtcars)
sum(resid(fit)^2)/(nrow(mtcars)-2)

  [1] 20.11889

summary(fit)$sigma^2

  [1] 20.11889

Thus, the residual variance is the sum of the squared residuals divided by n-2, and the direct estimate agrees with the value reported by lm.
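
An equivalent route uses the extractor functions deviance() and df.residual(), which for an lm fit return the residual sum of squares and the residual degrees of freedom. A sketch, with rss and dof as illustrative names:

```r
data("mtcars")
fit <- lm(mpg ~ drat, mtcars)

rss <- deviance(fit)     # residual sum of squares
dof <- df.residual(fit)  # n - 2 for a simple linear regression

c(rss = rss, dof = dof, resid_var = rss / dof)
```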

12. Refer to question 10. Give the R squared for this model.
Answer: The R squared for this model is reported in the summary of fit. We have,

summary(fit)

  
  Call:
  lm(formula = mpg ~ drat, data = mtcars)
  
  Residuals:
      Min      1Q  Median      3Q     Max 
  -9.0775 -2.6803 -0.2095  2.2976  9.0225 
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
  (Intercept)   -7.525      5.477  -1.374     0.18    
  drat           7.678      1.507   5.096 1.78e-05 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 4.485 on 30 degrees of freedom
  Multiple R-squared:  0.464,   Adjusted R-squared:  0.4461 
  F-statistic: 25.97 on 1 and 30 DF,  p-value: 1.776e-05

Hence, we have the R squared equal to 0.464. Therefore, about 46.4 percent of the variation in miles per gallon is explained by the linear relationship with drat.
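
The R squared can also be computed directly as one minus the ratio of the residual sum of squares to the total sum of squares. In simple linear regression it equals the squared correlation between outcome and predictor. A sketch with illustrative names:

```r
data("mtcars")
fit <- lm(mpg ~ drat, mtcars)

sse <- sum(resid(fit)^2)                       # unexplained variation
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variation

r2_manual <- 1 - sse / sst
r2_cor    <- cor(mtcars$mpg, mtcars$drat)^2

c(manual = r2_manual, cor_squared = r2_cor, lm = summary(fit)$r.squared)
```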

13. Load the mtcars dataset. Fit a linear regression with miles per gallon as the outcome and drat as the predictor. Plot drat versus the residuals.
Answer:

library(ggplot2)
fit <- lm(mpg ~ drat, mtcars)
temp <- mtcars
temp$resid <- resid(fit)
g <- ggplot(temp, aes(x = drat, y = resid)) +
  geom_hline(yintercept = 0, colour = "red") +
  geom_point(alpha = 0.5, size = 5)
g

14. Refer to question 13. Directly estimate the residual variance and compare this estimate to the output of lm.
Answer: The residual variance estimates the variation that is left unexplained by the linear model. We have the residual variance equal to

fit<-lm(mpg~drat, mtcars)
sum(resid(fit)^2)/(nrow(mtcars)-2)

  [1] 20.11889

summary(fit)$sigma^2

  [1] 20.11889

Thus, the residual variance is the sum of the squared residuals divided by n-2, and the direct estimate agrees with the value reported by lm.

15. Refer to question 13. Give the R squared for this model.
Answer: The R squared for this model is reported in the summary of fit. We have,

summary(fit)

  
  Call:
  lm(formula = mpg ~ drat, data = mtcars)
  
  Residuals:
      Min      1Q  Median      3Q     Max 
  -9.0775 -2.6803 -0.2095  2.2976  9.0225 
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
  (Intercept)   -7.525      5.477  -1.374     0.18    
  drat           7.678      1.507   5.096 1.78e-05 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 4.485 on 30 degrees of freedom
  Multiple R-squared:  0.464,   Adjusted R-squared:  0.4461 
  F-statistic: 25.97 on 1 and 30 DF,  p-value: 1.776e-05

Hence, we have the R squared equal to 0.464. Therefore, about 46.4 percent of the variation in miles per gallon is explained by the linear relationship with drat.

Math 53, SY:23-24, First Semester

November 14, 2023