Question 1.

Consider the data set given below

x<-c(0.22, -2.54, 0.52, 0.75, 25)
# and weights given by
w<-c(2, 1, 3, 1, 2)
# Give the value of mu that minimizes the least squares equation

Answer:

The criterion to minimize is the weighted least squares equation sum(w * (x - mu)^2), whose minimizer is the weighted mean mu = sum(w * x) / sum(w).

x = c(0.22, -2.54, 0.52, 0.75, 25)
w = c(2, 1, 3, 1, 2)
mu = sum(x * w) / sum(w)

cat("The value of mu that minimizes the above least squares equation:", mu, "\n")
The value of mu that minimizes the above least squares equation: 5.578889 
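
As a sanity check, we can minimize the criterion numerically with R's optimize() function over an arbitrary interval containing the data and confirm that it agrees with the weighted mean:

lsq <- function(mu) sum(w * (x - mu)^2)  # weighted least squares criterion
optimize(lsq, interval = c(-10, 30))$minimum  # approximately 5.578889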

Question 2.

Consider the following data set

x<-c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)
y<-c(2.39, 1.72, 2.55, 1.48, 2.19, 0.59, 2.23, 1.65, 2.49, 1.05)

# Fit the regression through the origin and get the slope treating y as the outcome and x is the regressor. (Hint, do not center the data since we want regression through the origin, not through the means of the data.)

Answer:

To fit a regression through the origin, we remove the intercept term by specifying the formula as "y ~ x + 0", "y ~ 0 + x", or "y ~ x - 1".

x = c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)
y = c(2.39, 1.72, 2.55, 1.48, 2.19, 0.59, 2.23, 1.65, 2.49, 1.05)
fit = lm(y ~ x + 0)
coef(fit)
       x 
1.151408 
reg = coef(fit)
cat("The slope treating y as the outcome and x is the regressor:", reg, "\n")
The slope treating y as the outcome and x is the regressor: 1.151408 
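
Equivalently, the through-the-origin slope has the closed form sum(x * y) / sum(x^2), which gives a quick way to double-check lm():

slope_manual = sum(x * y) / sum(x^2)  # closed-form slope with no intercept
slope_manual  # matches coef(fit) above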

Question 3.

Do data(mtcars) from the datasets package and fit the regression model with mpg as the outcome and drat (Rear axle ratio) as the predictor. Give the slope coefficient.

Answer:

data("mtcars")
fit = lm(mpg ~ drat, data = mtcars)
coef(fit)
(Intercept)        drat 
  -7.524618    7.678233 
x = mtcars$drat
y = mtcars$mpg
slope = cor(x, y) * sd(y) / sd(x)
cat("The slope coefficient for the regression model with mpg as the outcome and drat as the predictor:", slope, "\n")
The slope coefficient for the regression model with mpg as the outcome and drat as the predictor: 7.678233 

Question 4.

Refer to question 3. Test the hypothesis of no linear relationship between rear axle ratio and miles per gallon.

Answer:

Null hypothesis: the slope for drat is zero, i.e., there is no linear relationship between rear axle ratio and miles per gallon.

Alternative hypothesis: the slope for drat is nonzero, i.e., there is a linear relationship between rear axle ratio and miles per gallon.

We will use α = 0.05.

fit = lm(mpg ~ drat, data = mtcars)
summary(fit)

Call:
lm(formula = mpg ~ drat, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.0775 -2.6803 -0.2095  2.2976  9.0225 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -7.525      5.477  -1.374     0.18    
drat           7.678      1.507   5.096 1.78e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.485 on 30 degrees of freedom
Multiple R-squared:  0.464, Adjusted R-squared:  0.4461 
F-statistic: 25.97 on 1 and 30 DF,  p-value: 1.776e-05

In this case, the p-value for drat is 1.78e-05, which is less than 0.05. Therefore, we would reject the null hypothesis and conclude that there is a significant linear relationship between drat and mpg.
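
The same decision can be read off programmatically from the coefficient table that summary() returns (the row and column labels below are the standard lm names):

p_value = summary(fit)$coefficients["drat", "Pr(>|t|)"]  # p-value for the drat slope
p_value < 0.05  # TRUE, so we reject the null hypothesis at the 5% level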

Question 5.

Consider data with an outcome (Y) and a predictor (X). The standard deviation of the predictor is one third that of the outcome. The correlation between the two variables is 0.7. What value would the slope coefficient be for the regression model with Y as the outcome and X as the predictor?

Answer:

The slope coefficient (β) in a simple linear regression of Y on X is given by the formula β = ρ * SD(Y)/SD(X), where ρ is the correlation between X and Y, SD(Y) is the standard deviation of Y, and SD(X) is the standard deviation of X.

sd_x = 1/3
sd_y = 1
cor = 0.7

slope = cor * sd_y/sd_x
cat("The slope coefficient for the regression model with Y as the outcome and X as the predictor:", slope, "\n")
The slope coefficient for the regression model with Y as the outcome and X as the predictor: 2.1 
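
As an optional check, we can simulate bivariate normal data with these standard deviations and correlation and verify that the fitted slope is close to 2.1. This sketch assumes the MASS package (shipped with R) for mvrnorm(); the sample size and seed are arbitrary:

library(MASS)
set.seed(42)  # arbitrary seed for reproducibility
# Covariance matrix built from SD(X) = 1/3, SD(Y) = 1, and cor = 0.7
Sigma = matrix(c((1/3)^2, 0.7 * (1/3), 0.7 * (1/3), 1), nrow = 2)
d = as.data.frame(mvrnorm(n = 1e5, mu = c(0, 0), Sigma = Sigma))
names(d) = c("X", "Y")
coef(lm(Y ~ X, data = d))["X"]  # should be close to 2.1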

Question 6.

You ask a collection of husbands and wives to guess how many jellybeans are in a jar. The correlation is 0.6. The standard deviation for the husbands is 14 beans while the standard deviation for wives is 10 beans. Assume that the data were centered so that 0 is the mean for each. The centered guess for a husband was 40 beans (above the mean). What would be your best estimate of the wife’s guess?

Answer:

With centered data the regression line passes through the origin, so the best estimate of the wife's guess is slope * husband's guess, where slope = ρ * SD(wives) / SD(husbands).

cor = 0.6
sd_x = 14
sd_y = 10
husband_guess = 40

slope = cor * sd_y/sd_x
wife_guess = slope * husband_guess
cat("The estimated wife's guess:", wife_guess, "beans \n")
The estimated wife's guess: 17.14286 beans 

Question 7.

Consider the data given by the following

x <- c(10.45, 9.45, 12.41, 14.46, 15.26)
# What is the value of the first measurement if x were normalized (to have mean 0 and variance 1)?

Answer:

To normalize the data, you can subtract the mean and divide by the standard deviation. The formula for normalization is: normalized value = (x - mean(x)) / sd(x).

x = c(10.45, 9.45, 12.41, 14.46, 15.26)
nor_x = (x - mean(x)) / sd(x)
nor_x[1]
[1] -0.7835272
cat("The value of the first measurement if x were normalized:", nor_x[1], "\n")
The value of the first measurement if x were normalized: -0.7835272 
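
R's built-in scale() function performs exactly this centering and scaling, so it offers a one-line check:

scale(x)[1]  # first normalized value; matches -0.7835272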

Question 8.

Consider the following data set (used above as well). What is the intercept for fitting the model with x as the predictor and y as the outcome?

x <- c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)
y <- c(2.39, 1.72, 2.55, 1.48, 2.19, 0.59, 2.23, 1.65, 2.49, 1.05)

Answer:

x = c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)
y = c(2.39, 1.72, 2.55, 1.48, 2.19, 0.59, 2.23, 1.65, 2.49, 1.05)
fit = lm(y ~ x)
coef(fit)
(Intercept)           x 
  2.2471749  -0.2626668 
slope = cor(x, y) * sd(y) / sd(x)
intercept = mean(y) - slope * mean(x)
cat("The intercept for fitting the model with x as the predictor and y as the outcome:", intercept, "\n")
The intercept for fitting the model with x as the predictor and y as the outcome: 2.247175 

Question 9.

Consider the data given by

x <- c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)

What value minimizes the sum of the squared distances between these points and itself?

Answer:

Setting the derivative of Σ(x_i − μ)^2 with respect to μ to zero gives −2Σ(x_i − μ) = 0, so the minimizer is μ = mean(x).

x = c(1.8, 1.47, 1.51, 1.73, 1.36, 1.58, 1.57, 1.85, 1.44, 1.42)
mean_value = mean(x)
cat("The value that minimizes the sum of the squared distances between these points and itself:", mean_value, "\n")
The value that minimizes the sum of the squared distances between these points and itself: 1.573 

Question 10.

Fit a linear regression model to the mtcars dataset with the variable drat as the predictor and the variable mpg as the outcome. Plot the drat (horizontal axis) versus the residuals (vertical axis).

Answer:

data("mtcars")
fit = lm(mpg ~ drat, data = mtcars)
residuals <- resid(fit)

plot(mtcars$drat, residuals, xlab = "drat", ylab = "Residuals",
     main = "drat vs. Residuals")
abline(h = 0, col = "red", lty = 2)

Question 11.

Refer to question 10. Directly estimate the residual variance and compare this estimate to the output of lm.

Answer:

The residual variance estimate is Σe_i^2 / (n − p), where the e_i are the residuals, n is the number of observations, and p is the number of regression parameters. In simple linear regression p = 2 (intercept and slope), and this quantity estimates σ^2. Then, we have

fit = lm(mpg ~ drat, data = mtcars)
residuals = resid(fit)

residual_variance_est <- sum(residuals^2)/(nrow(mtcars)-2)
cat("Estimated Residual Variance:", residual_variance_est, "\n")
Estimated Residual Variance: 20.11889 
residual_variance_lm <- summary(fit)$sigma^2
cat("Residual Variance from lm output:", residual_variance_lm, "\n")
Residual Variance from lm output: 20.11889 
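
The same quantity can also be obtained from lm's accessor functions: deviance() returns the residual sum of squares of a linear model and df.residual() returns n − p, so their ratio reproduces the estimate above.

deviance(fit) / df.residual(fit)  # equals 20.11889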

Question 12.

Refer to question 10. Give the R squared for this model.

Answer:

R squared is the proportion of the total variation in the response that is explained by the linear model with our predictor.

summary_fit <- summary(fit)
rsquared <- summary_fit$r.squared
cat("R-squared for the model:", rsquared, "\n")
R-squared for the model: 0.4639952 

So in this case about 46% of the variation in mpg is explained by its linear relationship with drat.
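
R squared can also be computed directly from its definition, as one minus the ratio of residual variation to total variation:

y = mtcars$mpg
1 - sum(resid(fit)^2) / sum((y - mean(y))^2)  # matches summary(fit)$r.squared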

Question 13.

Load the mtcars dataset. Fit a linear regression with miles per gallon as the outcome and drat as the predictor. Plot drat versus the residuals.

Answer:

fit = lm(mpg ~ drat, data = mtcars)
residuals <- resid(fit)

plot(mtcars$drat, residuals, xlab = "drat", ylab = "Residuals",
     main = "Drat vs. Residuals")
abline(h = 0, col = "red", lty = 2)

Question 14.

Refer to question 13. Directly estimate the residual variance and compare this estimate to the output of lm.

Answer:

The residual variance estimate is Σe_i^2 / (n − p), where the e_i are the residuals, n is the number of observations, and p is the number of regression parameters. In simple linear regression p = 2 (intercept and slope), and this quantity estimates σ^2. Then, we have

fit = lm(mpg ~ drat, data = mtcars)
residuals = resid(fit)

residual_variance_est <- sum(residuals^2)/(nrow(mtcars)-2)
cat("Estimated Residual Variance:", residual_variance_est, "\n")
Estimated Residual Variance: 20.11889 
residual_variance_lm <- summary(fit)$sigma^2
cat("Residual Variance from lm output:", residual_variance_lm, "\n")
Residual Variance from lm output: 20.11889 

Question 15.

Refer to question 13. Give the R squared for this model.

Answer:

R squared is the proportion of the total variation in the response that is explained by the linear model with our predictor.

summary_fit <- summary(fit)
rsquared <- summary_fit$r.squared
cat("R-squared for the model:", rsquared, "\n")
R-squared for the model: 0.4639952 

So in this case about 46% of the variation in mpg is explained by its linear relationship with drat.