Homework 3

Instructions
(Question 1): Covariance and Correlation (26 points + 5 Extra Credit)
(Question 2): Regression Model (44 points)
(Question 3) Prediction and Residuals (20 points)
(Question 4): Hypothesis testing (35 points)
(Question 5): Leverage and Influence (30 points)
(Question 6): Read about residual analysis! (20 points)

Instructions

Total Credit: Maximum 175. There is 5 extra credit.
Make sure you include author information (your name).
Follow the format and approach outlined in the practice set we reviewed in class.
For the final question, refer to the associated reading materials provided.

Let’s use the Sleep in Mammals data from Dr. Gordon K Smyth’s OzDASL Data and Story Library. Here is the link for the metadata. Read the data descriptions first.

My following script is downloading the data and loading it in \(\texttt{R}\) in a dataframe called \(\texttt{sleep_data}\). For this exercise, I am excluding three observations corresponding to the species: African elephant, Asian elephant, and human. Please do not modify the code. Ensure you are connected to the internet.

sleep_data = read.table(url("https://gksmyth.github.io/ozdasl/general/sleep.txt"), header = T)
sleep_data = sleep_data[!sleep_data$Species %in% c("Africanelephant", "Asianelephant", "Man"  ), ]

(Question 1): Covariance and Correlation (26 points + 5 Extra Credit)

A. Suppose that we want to construct a regression model with body weight as the independent variable (covariate) and brain weight as the dependent variable (response). Based on the metadata, can you identify which columns should be used? Additionally, using the typical class notation, which variable is \(x\) and which is \(y\) in the regression model? (5 points)

Answer: : Independent (x) - Body weight;
Dependent (y) - Brain weight

B. Draw a scatterplot between the two variables (i.e., body weight and brain weight). Based on the scatterplot, what type of relationship do you infer between these variables?

You can earn an extra 5 points for correctly labeling the axes as “Body Weight” and “Brain Weight” and adding a title “Scatterplot between Body Weight and Brain Weight” to the plot (Hint: use an AI tool to help with the code after you finish the basic scatterplot code). (5+2 points)

Answer: It appears that there is a weak positive relationship, meaning that as 𝑥increases, y also increases.

BodyWeight=sleep_data$BodyWt
BrainWeight=sleep_data$BrainWt
plot(BodyWeight,BrainWeight,main="Scatterplot between Body Weight and Brain Weight")

C. What is the covariance and correlation between the two variables? What kind of relationship do you guess here between the variables? (6 + 2 points)

Answer: Both covariance and correlation indicate a positive trend; however, the correlation value suggests that this relationship is weak.

cov(BodyWeight,BrainWeight)

## [1] 17833.93

cor(BodyWeight,BrainWeight)

## [1] 0.8884084

D. What is the Spearman’s rank correlation coefficient between the two variables? (3 points)

Answer:

cor(BodyWeight,BrainWeight,method="spearman")

## [1] 0.950011

E. Using the two variables above, can you demonstrate that the Spearman’s correlation between \(\mathbf{x}\) and \(\mathbf{y}\) is the same as the Spearman’s correlation between \(\mathbf{y}\) and \(\mathbf{x}\)? (3 points)

Answer:

cor(BodyWeight,BrainWeight,method="spearman")

## [1] 0.950011

cor(BrainWeight,BodyWeight,method="spearman")

## [1] 0.950011

(Question 2): Regression Model (44 points)

A. Build the above simple linear regression model in \(\texttt{R}\) using body weight as the independent variable and brain weight as response/dependent variable. Display the model output and write down the estimated regression equation. (8 + 2 points)

Answer: My regression equation is:\(\bar{y}\) = 36.572 + 1.228x

model = lm(BrainWeight   ~ BodyWeight, data = sleep_data )
model

## 
## Call:
## lm(formula = BrainWeight ~ BodyWeight, data = sleep_data)
## 
## Coefficients:
## (Intercept)   BodyWeight  
##      36.572        1.228

B. What is the interpretation of your estimated \(\widehat{\beta}_0\)? (5 points)

Answer: \(\widehat{\beta}_0\) = 36.572 is the estimated value of Brain Weight (y) when Body Weight (x) is equal to 0.

C. What is the interpretation of your estimated \(\widehat{\beta}_1\)? (5 points)

Answer: \(\widehat{\beta}_1\) = 1.228 is the estimated change in Brain Weight (𝑦) when Body Weight (x) changes by 1 unit.

D. Display the model summary. From the summary, find and interpret the coefficient of determination, \(R^2\). (3 + 3 + 5 points)

Answer: From the output table, \(R^2\) = 0.7893. Only 79% variability in Brain Weight (𝑦) is being explained by the model.

summary(model)

## 
## Call:
## lm(formula = BrainWeight ~ BodyWeight, data = sleep_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -184.81  -34.52  -27.16    0.67  339.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.57231   10.95089    3.34  0.00148 ** 
## BodyWeight   1.22847    0.08408   14.61  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.15 on 57 degrees of freedom
## Multiple R-squared:  0.7893, Adjusted R-squared:  0.7856 
## F-statistic: 213.5 on 1 and 57 DF,  p-value: < 2.2e-16

E. Verify that \(R^2\) is square of the corrleation coefficient between the two variables. (3 points)

Answer:

cor(BodyWeight,BrainWeight)^2

## [1] 0.7892696

F. Plot the regression line on top of the scatter plot of the data. Give the regression line a different color. (10 points)

Answer:

plot(BodyWeight, BrainWeight)
abline(model, col=("purple"))

(Question 3) Prediction and Residuals (20 points)

A. Use the model to predict the brain weight when the body weight is 50. (4 points)

Answer:

predict(model, newdata = data.frame(BodyWeight = 50) )

##        1 
## 97.99588

B. What is the estimated error standard deviation? What is the estimated error variance? (3 + 3 points)

Answer: \(\hat[\sigma]\) = 77.15; \(\hat[\sigma]^2\) = 5952.123

77.15^2

## [1] 5952.123

C. Plot the residuals from the model and plot them against the fitted values. Is there any pattern? (8 + 2 points)

Answer: It seems that most of the data follows the trend line and it seems to be decreasing.

plot(model, which = 1)

(Question 4): Hypothesis testing (35 points)

A. Test whether the slope (\(\beta_1\)) of the regression line is significantly different from zero. Use a significance level of 0.05. State the null and alternative hypotheses. From the summary table, and report the p-value and your conclusion. (3 + 7 points)

Answer: \(H_0\): \(\beta_1\) = 0. \(H_A\): \(\beta_1\) ≠ 0.
p-value = 2e-16 < 𝛼 ( =0.05). So, we have enough evidence to reject the \(H_0\)).

summary(model)

## 
## Call:
## lm(formula = BrainWeight ~ BodyWeight, data = sleep_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -184.81  -34.52  -27.16    0.67  339.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.57231   10.95089    3.34  0.00148 ** 
## BodyWeight   1.22847    0.08408   14.61  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.15 on 57 degrees of freedom
## Multiple R-squared:  0.7893, Adjusted R-squared:  0.7856 
## F-statistic: 213.5 on 1 and 57 DF,  p-value: < 2.2e-16

B. In the above test, what is the value of the test statistic? Show that the test statistic is equal to \(\frac{\widehat{\beta}_1 \text{ - } 0}{\text{sd}(\widehat{\beta}_1)}\). (2 + 4 points)

Answer: From the table our t-value = 14.61. From our calculations we end up getting the same number.

summary(model)

## 
## Call:
## lm(formula = BrainWeight ~ BodyWeight, data = sleep_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -184.81  -34.52  -27.16    0.67  339.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.57231   10.95089    3.34  0.00148 ** 
## BodyWeight   1.22847    0.08408   14.61  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.15 on 57 degrees of freedom
## Multiple R-squared:  0.7893, Adjusted R-squared:  0.7856 
## F-statistic: 213.5 on 1 and 57 DF,  p-value: < 2.2e-16

#Estimate/Std. Error
1.22847/0.08408

## [1] 14.61073

C. Find the 95% confidence interval for the slope \(\beta_1\). From the confidence interval, how would you determine whether \(\beta_1 = 1.4\) at \(\alpha = 0.05\)? (3 + 3 points)

Answer: So, the 95% CI for 𝛽1 is: (1.06011, 1.396833). The corresponding confidence interval contains 1.4. 1.4 is inside; hence, we can not reject 𝐻0.

confint(model, level = 0.95)

##                2.5 %    97.5 %
## (Intercept) 14.64354 58.501081
## BodyWeight   1.06011  1.396833

D. Find the 99% confidence interval for the slope \(\beta_1\). From the confidence interval, how would you determine whether \(\beta_1 = 1.4\) at \(\alpha = 0.01\)? (3 + 3 points)

Answer: So, the 99% CI for 𝛽1 is: (1.004416, 1.452526). The corresponding confidence interval contains 1.4. 1.4 is inside; hence, we can not reject 𝐻0.

confint(model, level = 0.99)

##                0.5 %    99.5 %
## (Intercept) 7.389616 65.755003
## BodyWeight  1.004416  1.452526

E. Test if the model is significant at \(\alpha=0.05\). State the null and alternative hypotheses. Write down the F statistic and it’s degrees of freedom. report the p-value and your conclusion. (3 + 7 points)

Answer The hypothesis test has the form: \(H_0\): 𝛽1=0. \(H_A\): 𝛽1≠0. p-value = 2.2e-16 < 𝛼(=0.05). Reject \(H_0\).

summary(model)

## 
## Call:
## lm(formula = BrainWeight ~ BodyWeight, data = sleep_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -184.81  -34.52  -27.16    0.67  339.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.57231   10.95089    3.34  0.00148 ** 
## BodyWeight   1.22847    0.08408   14.61  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.15 on 57 degrees of freedom
## Multiple R-squared:  0.7893, Adjusted R-squared:  0.7856 
## F-statistic: 213.5 on 1 and 57 DF,  p-value: < 2.2e-16

(Question 5): Leverage and Influence (30 points)

A. Calculate and plot the leverage score values from this regression model. Using the \(\frac{2}{n}\) cutoff as in your lecture, how many influential points do you have? (2 + 3 + 5 points)

Answer: There are 3 influential points.

plot(hatvalues(model), type = 'h') >2/nrow(model)

## logical(0)

sum(hatvalues(model) >2/nrow(model))

## [1] 0

B. How would you identify influential points using Cook’s Distance? Using the cutoff value \(1\), how many influential points do you have? (2 + 3 + 5 points)

Answer: 0

plot(cooks.distance(model), type = 'h')

sum(cooks.distance(model)>1)

## [1] 1

C. How would you interpret the highly influencial points? (10 points)

Answer: Leverage scores range between 0 and 1, i.e., 0≤ℎ𝑖≤1. A high leverage point is usually defined as one where: ℎ𝑖>2/n

(Question 6): Read about residual analysis! (20 points)

To evaluate whether the residuals from your regression model are normally distributed, we check the QQplot of the residuals. The QQ plot compares the theoretical quantiles of a normal distribution (i.e., the ideal quantile values) to the observed quantiles (i.e., the sample quantiles from the residuals).

To help you generate a QQ plot in \(\texttt{R}\), refer to this open resource: Residual Analysis Resource. Focus only on the code under 2. Normal Q–Q (quantile-quantile) Plot. You are also welcome to use other resources on Google or AI tools to find the answer.

A. Please replicate the process shown in the resource for your own regression model and produce the QQ plot of the residuals. (10 points)

Answer:

data("mtcars")
fit=lm(formula = mpg ~ wt, data = mtcars)
plot(fit, which=2, col=c("red"))

B. Based on the QQ plot you generated, do you believe that the residuals (fitted errors) from your model follow a normal distribution? Clearly write your justification. (10 points)

Answer:In our model, the Q-Q plot demonstrates a strong alignment with the line, although a few points at the top are slightly offset. This is likely not significant and indicates a reasonable fit.

Homework 3

Due Date: 10-08-2024

Azzi Parries. Please don’t try editing the date field below.

2024-10-07

Instructions

Let’s use the Sleep in Mammals data from Dr. Gordon K Smyth’s OzDASL Data and Story Library. Here is the link for the metadata. Read the data descriptions first.

(Question 1): Covariance and Correlation (26 points + 5 Extra Credit)

B. Draw a scatterplot between the two variables (i.e., body weight and brain weight). Based on the scatterplot, what type of relationship do you infer between these variables?

You can earn an extra 5 points for correctly labeling the axes as “Body Weight” and “Brain Weight” and adding a title “Scatterplot between Body Weight and Brain Weight” to the plot (Hint: use an AI tool to help with the code after you finish the basic scatterplot code). (5+2 points)

C. What is the covariance and correlation between the two variables? What kind of relationship do you guess here between the variables? (6 + 2 points)

D. What is the Spearman’s rank correlation coefficient between the two variables? (3 points)

E. Using the two variables above, can you demonstrate that the Spearman’s correlation between \(\mathbf{x}\) and \(\mathbf{y}\) is the same as the Spearman’s correlation between \(\mathbf{y}\) and \(\mathbf{x}\)? (3 points)

(Question 2): Regression Model (44 points)

A. Build the above simple linear regression model in \(\texttt{R}\) using body weight as the independent variable and brain weight as response/dependent variable. Display the model output and write down the estimated regression equation. (8 + 2 points)

B. What is the interpretation of your estimated \(\widehat{\beta}_0\)? (5 points)

C. What is the interpretation of your estimated \(\widehat{\beta}_1\)? (5 points)

D. Display the model summary. From the summary, find and interpret the coefficient of determination, \(R^2\). (3 + 3 + 5 points)

E. Verify that \(R^2\) is square of the corrleation coefficient between the two variables. (3 points)

F. Plot the regression line on top of the scatter plot of the data. Give the regression line a different color. (10 points)

(Question 3) Prediction and Residuals (20 points)

A. Use the model to predict the brain weight when the body weight is 50. (4 points)

B. What is the estimated error standard deviation? What is the estimated error variance? (3 + 3 points)

C. Plot the residuals from the model and plot them against the fitted values. Is there any pattern? (8 + 2 points)

(Question 4): Hypothesis testing (35 points)

A. Test whether the slope (\(\beta_1\)) of the regression line is significantly different from zero. Use a significance level of 0.05. State the null and alternative hypotheses. From the summary table, and report the p-value and your conclusion. (3 + 7 points)

B. In the above test, what is the value of the test statistic? Show that the test statistic is equal to \(\frac{\widehat{\beta}_1 \text{ - } 0}{\text{sd}(\widehat{\beta}_1)}\). (2 + 4 points)

C. Find the 95% confidence interval for the slope \(\beta_1\). From the confidence interval, how would you determine whether \(\beta_1 = 1.4\) at \(\alpha = 0.05\)? (3 + 3 points)

D. Find the 99% confidence interval for the slope \(\beta_1\). From the confidence interval, how would you determine whether \(\beta_1 = 1.4\) at \(\alpha = 0.01\)? (3 + 3 points)

E. Test if the model is significant at \(\alpha=0.05\). State the null and alternative hypotheses. Write down the F statistic and it’s degrees of freedom. report the p-value and your conclusion. (3 + 7 points)

(Question 5): Leverage and Influence (30 points)

A. Calculate and plot the leverage score values from this regression model. Using the \(\frac{2}{n}\) cutoff as in your lecture, how many influential points do you have? (2 + 3 + 5 points)

B. How would you identify influential points using Cook’s Distance? Using the cutoff value \(1\), how many influential points do you have? (2 + 3 + 5 points)

C. How would you interpret the highly influencial points? (10 points)

(Question 6): Read about residual analysis! (20 points)

To help you generate a QQ plot in \(\texttt{R}\), refer to this open resource: Residual Analysis Resource. Focus only on the code under 2. Normal Q–Q (quantile-quantile) Plot. You are also welcome to use other resources on Google or AI tools to find the answer.

A. Please replicate the process shown in the resource for your own regression model and produce the QQ plot of the residuals. (10 points)

B. Based on the QQ plot you generated, do you believe that the residuals (fitted errors) from your model follow a normal distribution? Clearly write your justification. (10 points)