Instructions

  1. Total Credit: Maximum 225.
  2. Make sure you include author information (your name).



(Question 1): 105 Points

Let’s use the Mass and Physical Measurements for Male Subjects data from Dr. Gordon K Smyth’s OzDASL Data and Story Library. Here is the link for the metadata. The following script loads the data in \(\texttt{R}\) in a dataframe called \(\texttt{physical_data}\), cleans it, and shows it a snapshot of it.

physical_data = read.table(url("https://gksmyth.github.io/ozdasl/oz/physical.txt"), header = T)
physical_data = physical_data[c("Height", "Chest", "Waist", "Fore", "Mass")]
head(physical_data)
##   Height Chest Waist Fore Mass
## 1    178   100  85.0 28.5 77.0
## 2    187   107  90.5 29.5 85.5
## 3    175    94  80.5 25.0 63.0
## 4    183   104  91.5 28.5 80.5
## 5    174   107  92.0 28.5 79.5
## 6    180   112 101.0 30.5 94.0



A. Suppose that we want to construct a regression model with Height, Chest, Waist, Fore as the independent variables (covariate) and Mass as the dependent variable (response). Build the regression model. (10 points)

Answer:

pd_model = lm( Mass ~ Height+ Chest+ Waist+ Fore, data = physical_data )
summary(pd_model)
## 
## Call:
## lm(formula = Mass ~ Height + Chest + Waist + Fore, data = physical_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8991 -1.2602  0.5069  1.4681  4.0036 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -112.75051   17.08728  -6.599 4.52e-06 ***
## Height         0.27917    0.09668   2.887  0.01023 *  
## Chest          0.12336    0.13574   0.909  0.37616    
## Waist          0.69605    0.11557   6.023 1.37e-05 ***
## Fore           2.34065    0.51371   4.556  0.00028 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.508 on 17 degrees of freedom
## Multiple R-squared:  0.9576, Adjusted R-squared:  0.9477 
## F-statistic: 96.05 on 4 and 17 DF,  p-value: 1.955e-11


B. How would you interpret the coefficient value corresponding to waist that you got in Part A? (5 points)

Answer: The coefficient for Waist in your model indicates the expected change in Mass for a one-unit change in Waist, while holding other factors constant, reflecting its potential impact on body weight.


C. How would you interpret the slope coefficient that you got in Part A? (5 points)

Answer: The slope coefficient in your regression model indicates how much the Mass of a male subject is expected to change with a one-unit increase in the corresponding physical measurement, holding other measurements constant. This helps to understand the relative importance of each measurement in predicting body mass.


D. What are the estimated error standard deviation and error variance? (3 + 3 points)

Answer:

# Load the data
physical_data <- read.table(url("https://gksmyth.github.io/ozdasl/oz/physical.txt"), header = TRUE)
physical_data <- physical_data[c("Height", "Chest", "Waist", "Fore", "Mass")]

# Fit the linear regression model
model <- lm(Mass ~ Height + Chest + Waist + Fore, data = physical_data)

# Display the summary of the model
summary(model)
## 
## Call:
## lm(formula = Mass ~ Height + Chest + Waist + Fore, data = physical_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8991 -1.2602  0.5069  1.4681  4.0036 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -112.75051   17.08728  -6.599 4.52e-06 ***
## Height         0.27917    0.09668   2.887  0.01023 *  
## Chest          0.12336    0.13574   0.909  0.37616    
## Waist          0.69605    0.11557   6.023 1.37e-05 ***
## Fore           2.34065    0.51371   4.556  0.00028 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.508 on 17 degrees of freedom
## Multiple R-squared:  0.9576, Adjusted R-squared:  0.9477 
## F-statistic: 96.05 on 4 and 17 DF,  p-value: 1.955e-11
# Get the residual standard error
residual_standard_error <- summary(model)$sigma

# Calculate the error variance
error_variance <- residual_standard_error^2

# Print the results
cat("Estimated Error Standard Deviation:", residual_standard_error, "\n")
## Estimated Error Standard Deviation: 2.508251
cat("Estimated Error Variance:", error_variance, "\n")
## Estimated Error Variance: 6.291321


E. What are the \(R^2\) and adjusted \(R^2\) of the model? How would you interpret \(R^2\)? (4+5 points)

Answer:

# Load the data
physical_data <- read.table(url("https://gksmyth.github.io/ozdasl/oz/physical.txt"), header = TRUE)
physical_data <- physical_data[c("Height", "Chest", "Waist", "Fore", "Mass")]

# Fit the linear regression model
model <- lm(Mass ~ Height + Chest + Waist + Fore, data = physical_data)

# Display the summary of the model
model_summary <- summary(model)
print(model_summary)
## 
## Call:
## lm(formula = Mass ~ Height + Chest + Waist + Fore, data = physical_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8991 -1.2602  0.5069  1.4681  4.0036 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -112.75051   17.08728  -6.599 4.52e-06 ***
## Height         0.27917    0.09668   2.887  0.01023 *  
## Chest          0.12336    0.13574   0.909  0.37616    
## Waist          0.69605    0.11557   6.023 1.37e-05 ***
## Fore           2.34065    0.51371   4.556  0.00028 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.508 on 17 degrees of freedom
## Multiple R-squared:  0.9576, Adjusted R-squared:  0.9477 
## F-statistic: 96.05 on 4 and 17 DF,  p-value: 1.955e-11
# Get R-squared and Adjusted R-squared
r_squared <- model_summary$r.squared
adjusted_r_squared <- model_summary$adj.r.squared

# Print the results
cat("R-squared:", r_squared, "\n")
## R-squared: 0.9576283
cat("Adjusted R-squared:", adjusted_r_squared, "\n")
## Adjusted R-squared: 0.9476585


F. Test if the model is significant or not at level \(\alpha = 0.05\). Write down the hypothesis and your conclusion clearly. (5+5 points)

Answer: \(H_0\): β 1​=β 2​=β 3​=β 4​ =0 \(H_A\): At least one βi not equal 0

Fail to reject the null hypothesis.

# Load the data
physical_data <- read.table(url("https://gksmyth.github.io/ozdasl/oz/physical.txt"), header = TRUE)
physical_data <- physical_data[c("Height", "Chest", "Waist", "Fore", "Mass")]

# Fit the linear regression model
model <- lm(Mass ~ Height + Chest + Waist + Fore, data = physical_data)

# Display the summary of the model
model_summary <- summary(model)
print(model_summary)
## 
## Call:
## lm(formula = Mass ~ Height + Chest + Waist + Fore, data = physical_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8991 -1.2602  0.5069  1.4681  4.0036 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -112.75051   17.08728  -6.599 4.52e-06 ***
## Height         0.27917    0.09668   2.887  0.01023 *  
## Chest          0.12336    0.13574   0.909  0.37616    
## Waist          0.69605    0.11557   6.023 1.37e-05 ***
## Fore           2.34065    0.51371   4.556  0.00028 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.508 on 17 degrees of freedom
## Multiple R-squared:  0.9576, Adjusted R-squared:  0.9477 
## F-statistic: 96.05 on 4 and 17 DF,  p-value: 1.955e-11


G. Test if the variable Chest is significant or not in the above model at level \(\alpha = 0.05\). Write down the hypothesis, test statistic, and your conclusion clearly. (5+5 points)

Answer: \(H_0\): βChest​ =0 \(H_A\): βChest​not =0. p-value: 1.955e-11; Fail to reject hypothesis.

# Load the data
physical_data <- read.table(url("https://gksmyth.github.io/ozdasl/oz/physical.txt"), header = TRUE)
physical_data <- physical_data[c("Height", "Chest", "Waist", "Fore", "Mass")]

# Fit the linear regression model
model <- lm(Mass ~ Height + Chest + Waist + Fore, data = physical_data)

# Display the summary of the model
model_summary <- summary(model)
print(model_summary)
## 
## Call:
## lm(formula = Mass ~ Height + Chest + Waist + Fore, data = physical_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8991 -1.2602  0.5069  1.4681  4.0036 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -112.75051   17.08728  -6.599 4.52e-06 ***
## Height         0.27917    0.09668   2.887  0.01023 *  
## Chest          0.12336    0.13574   0.909  0.37616    
## Waist          0.69605    0.11557   6.023 1.37e-05 ***
## Fore           2.34065    0.51371   4.556  0.00028 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.508 on 17 degrees of freedom
## Multiple R-squared:  0.9576, Adjusted R-squared:  0.9477 
## F-statistic: 96.05 on 4 and 17 DF,  p-value: 1.955e-11


H. Calculate the \(95\%\) confidence interval of the coefficients. How would you conclude the answer in part G here using only the confidence interval? (5+5 points)

Answer:

# Load the data
physical_data <- read.table(url("https://gksmyth.github.io/ozdasl/oz/physical.txt"), header = TRUE)
physical_data <- physical_data[c("Height", "Chest", "Waist", "Fore", "Mass")]

# Fit the linear regression model
model <- lm(Mass ~ Height + Chest + Waist + Fore, data = physical_data)

# Display the summary of the model
model_summary <- summary(model)
print(model_summary)
## 
## Call:
## lm(formula = Mass ~ Height + Chest + Waist + Fore, data = physical_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8991 -1.2602  0.5069  1.4681  4.0036 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -112.75051   17.08728  -6.599 4.52e-06 ***
## Height         0.27917    0.09668   2.887  0.01023 *  
## Chest          0.12336    0.13574   0.909  0.37616    
## Waist          0.69605    0.11557   6.023 1.37e-05 ***
## Fore           2.34065    0.51371   4.556  0.00028 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.508 on 17 degrees of freedom
## Multiple R-squared:  0.9576, Adjusted R-squared:  0.9477 
## F-statistic: 96.05 on 4 and 17 DF,  p-value: 1.955e-11
# Calculate the 95% confidence intervals
conf_intervals <- confint(model, level = 0.95)
print(conf_intervals)
##                     2.5 %      97.5 %
## (Intercept) -148.80152106 -76.6994969
## Height         0.07518816   0.4831559
## Chest         -0.16302096   0.4097352
## Waist          0.45221337   0.9398799
## Fore           1.25682394   3.4244712


I. Calculate the correlation matrix between the predictor variables Height, Chest, Waist, and Fore. From this correlation matrix, do you think there is multicollinearity in the data? Explain why you think yes or no. (5 + 5 points)

Answer: Yes, multicollinearity in the following data.

# Load the data
physical_data <- read.table(url("https://gksmyth.github.io/ozdasl/oz/physical.txt"), header = TRUE)
physical_data <- physical_data[c("Height", "Chest", "Waist", "Fore", "Mass")]
# Calculate the correlation matrix for the predictor variables
correlation_matrix <- cor(physical_data[c("Height", "Chest", "Waist", "Fore")])
print(correlation_matrix)
##           Height     Chest     Waist      Fore
## Height 1.0000000 0.2016911 0.3512769 0.3536844
## Chest  0.2016911 1.0000000 0.7153804 0.7670703
## Waist  0.3512769 0.7153804 1.0000000 0.7577464
## Fore   0.3536844 0.7670703 0.7577464 1.0000000


J. Calculate the variance inflation factor of the above model. Do you think there is multicollinearity in the data? (5 + 5 points)

Answer: No multicollinearlity in the following data.

# Load the data
physical_data <- read.table(url("https://gksmyth.github.io/ozdasl/oz/physical.txt"), header = TRUE)
physical_data <- physical_data[c("Height", "Chest", "Waist", "Fore", "Mass")]
# Fit the linear regression model
model <- lm(Mass ~ Height + Chest + Waist + Fore, data = physical_data)
# Calculate VIF
vif_values <-(model)
print(vif_values)
## 
## Call:
## lm(formula = Mass ~ Height + Chest + Waist + Fore, data = physical_data)
## 
## Coefficients:
## (Intercept)       Height        Chest        Waist         Fore  
##   -112.7505       0.2792       0.1234       0.6960       2.3406


K. Draw the QQ-plot of the model residuals. Do you think residuals follow normality? (7 + 3 points)

Answer: Yes the residuals follow normality. The points on the QQ-plot lie approximately along the reference line, it indicates that the residuals follow a normal distribution.

# Load the data
physical_data <- read.table(url("https://gksmyth.github.io/ozdasl/oz/physical.txt"), header = TRUE)
physical_data <- physical_data[c("Height", "Chest", "Waist", "Fore", "Mass")]
# Fit the linear regression model
model <- lm(Mass ~ Height + Chest + Waist + Fore, data = physical_data)
# QQ-plot of residuals
qqnorm(residuals(model))
qqline(residuals(model), col = "red")  # Add a reference line


L. Do the Shapiro-Wilk test to check if your residuals follow normal distribution or not. What would be your conclusion about residuals following normality? (7 + 3 points)

Answer: p-value = 0.5661

# Fit a linear regression model
model <- lm(Mass ~ Height + Chest + Waist + Fore, data = physical_data)

# Extract residuals
residuals <- resid(model)

# Perform the Shapiro-Wilk test
shapiro_test <- shapiro.test(residuals)

# Print the result of the Shapiro-Wilk test
print(shapiro_test)
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.96364, p-value = 0.5661




(Question 2): 45 points

Using the above data, we want to build a model with Age of rabbit in days as the independent variable and Dry weight of eye lens in milligrams as the dependent variable.


A. Draw the scatterplot between the two variables. Argue why we might prefer to fit a nonlinear model here. (4 + 4 points)

Answer:

# Load the data
rabbit_data <- read.table(url("http://www.statsci.org/data/oz/rabbit.txt"), header = TRUE)

# Create the scatterplot
plot(rabbit_data$Age, rabbit_data$Lens,
     main = "Scatterplot of Age vs. Lens Weight",
     xlab = "Age (days)",
     ylab = "Lens Weight (mg)",
     pch = 19,
     col = "blue")


B. Build the linear regression model in \(\texttt{R}\): Lens = \(\beta_0\) + \(\beta_1\) * Age + \(\epsilon\) and plot it’s residual-against-fitted plot. From this plot, argue why the linear model is not sufficient. (5 + 5 + 5 points)

Answer:

# Load the data
rabbit_data <- read.table(url("http://www.statsci.org/data/oz/rabbit.txt"), header = TRUE)

# Fit the linear regression model
linear_model <- lm(Lens ~ Age, data = rabbit_data)

# Display the summary of the model
summary(linear_model)
## 
## Call:
## lm(formula = Lens ~ Age, data = rabbit_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.122 -26.517   5.146  27.214  51.219 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 80.17337    5.83405   13.74   <2e-16 ***
## Age          0.26820    0.01812   14.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.19 on 69 degrees of freedom
## Multiple R-squared:  0.7605, Adjusted R-squared:  0.757 
## F-statistic: 219.1 on 1 and 69 DF,  p-value: < 2.2e-16
# Create the residuals vs. fitted plot
plot(linear_model$fitted.values, linear_model$residuals,
     main = "Residuals vs. Fitted Values",
     xlab = "Fitted Values",
     ylab = "Residuals",
     pch = 19,
     col = "blue")

# Add a horizontal line at y = 0
abline(h = 0, col = "red", lty = 2)


C. Build the following polynomial regression model in \(\texttt{R}\) and show the summary output. Also, write the the fitted model equation. Lens = \(\beta_0\) + \(\beta_1\) * Age + \(\beta_2\) * Age\(^2\) + \(\epsilon\). (6 + 3 + 3 points)

Answer:

# Load the data
rabbit_data <- read.table(url("http://www.statsci.org/data/oz/rabbit.txt"), header = TRUE)

# Fit the polynomial regression model (2nd degree)
polynomial_model <- lm(Lens ~ Age + I(Age^2), data = rabbit_data)

# Display the summary of the polynomial model
summary(polynomial_model)
## 
## Call:
## lm(formula = Lens ~ Age + I(Age^2), data = rabbit_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.090 -10.583   1.431   9.929  38.668 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.558e+01  3.841e+00   9.264 1.13e-13 ***
## Age          6.978e-01  2.786e-02  25.050  < 2e-16 ***
## I(Age^2)    -5.783e-04  3.579e-05 -16.156  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.74 on 68 degrees of freedom
## Multiple R-squared:  0.9505, Adjusted R-squared:  0.949 
## F-statistic: 652.9 on 2 and 68 DF,  p-value: < 2.2e-16


D. Which model (between part B and C) would you prefer here by using Adjusted \(R^2\)? (10 points)

Answer: Linear = 0.757 Polynomial = 0.949. here our first model is the preferred second by following adjusted \(R^2\).

summary(polynomial_model)
## 
## Call:
## lm(formula = Lens ~ Age + I(Age^2), data = rabbit_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.090 -10.583   1.431   9.929  38.668 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.558e+01  3.841e+00   9.264 1.13e-13 ***
## Age          6.978e-01  2.786e-02  25.050  < 2e-16 ***
## I(Age^2)    -5.783e-04  3.579e-05 -16.156  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.74 on 68 degrees of freedom
## Multiple R-squared:  0.9505, Adjusted R-squared:  0.949 
## F-statistic: 652.9 on 2 and 68 DF,  p-value: < 2.2e-16




(Question 3): 65 points


For the following data set1, we will try to fit a logistic regression model.. The code below pulls the Cowles dataset on volunteering (metadata: Here) data in a dataframe called \(\texttt{Cowles_data}\), cleans it for our use, and then shows you a snapshot of the data.

Cowles_data <- read.csv("https://raw.githubusercontent.com/ejvanholm/DataProjects/master/Cowles.csv")
Cowles_data = Cowles_data[c("neuroticism", "extraversion", "volunteer")]
Cowles_data$volunteer = as.numeric(Cowles_data$volunteer == "yes")
Cowles_data[431:437, ]
##     neuroticism extraversion volunteer
## 431          10           18         0
## 432          10           14         0
## 433           5           14         0
## 434           8           14         0
## 435          15           16         1
## 436          12           13         1
## 437           9           10         1


A. We want to use neuroticism and extroversion as predictor variables and volunteer as response variable. Note: volunteer value = 1 shows that the subject volunteered, where as 0 means otherwise. Can you argue here why you have to use a logisgic regression model here instead of a linear regression model? (10 points)

Answer: Logistic regression is the appropriate choice for this analysis because it handles the binary nature of the response variable appropriately, adheres to the assumptions required for valid inference, and provides meaningful interpretations in the context of predicting probabilities.


B. Fit a logistic regression with volunteering as the dependent variable and neuroticism and extroversion as predictors. Show the model summary table. (12 + 3 points)

Answer:

logistic_model = glm(volunteer ~ neuroticism + extraversion, data = Cowles_data, family = "binomial")

summary(logistic_model)
## 
## Call:
## glm(formula = volunteer ~ neuroticism + extraversion, family = "binomial", 
##     data = Cowles_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.28106    0.23682  -5.409 6.32e-08 ***
## neuroticism   0.01078    0.01115   0.967    0.333    
## extraversion  0.06705    0.01423   4.712 2.45e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1933.5  on 1420  degrees of freedom
## Residual deviance: 1910.5  on 1418  degrees of freedom
## AIC: 1916.5
## 
## Number of Fisher Scoring iterations: 4


C. Write down the fitted regression equation. (10 points)

Answer: logit(P)=−1.5+0.03×neuroticism+0.02×extraversion


D. From the summary table, how would you do a backward elimination just by using the \(p\) value of coefficients? Build your new model. (15 points)

Answer:

# Load the data
Cowles_data <- read.csv("https://raw.githubusercontent.com/ejvanholm/DataProjects/master/Cowles.csv")

# Clean the data for our use
Cowles_data = Cowles_data[c("neuroticism", "extraversion", "volunteer")]
Cowles_data$volunteer = as.numeric(Cowles_data$volunteer == "yes")

# Fit the initial logistic regression model
full_model = glm(volunteer ~ neuroticism + extraversion, data = Cowles_data, family = "binomial")

# Show the summary of the full model
summary(full_model)
## 
## Call:
## glm(formula = volunteer ~ neuroticism + extraversion, family = "binomial", 
##     data = Cowles_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.28106    0.23682  -5.409 6.32e-08 ***
## neuroticism   0.01078    0.01115   0.967    0.333    
## extraversion  0.06705    0.01423   4.712 2.45e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1933.5  on 1420  degrees of freedom
## Residual deviance: 1910.5  on 1418  degrees of freedom
## AIC: 1916.5
## 
## Number of Fisher Scoring iterations: 4
# Perform backward elimination
final_model <- step(full_model, direction = "backward")
## Start:  AIC=1916.55
## volunteer ~ neuroticism + extraversion
## 
##                Df Deviance    AIC
## - neuroticism   1   1911.5 1915.5
## <none>              1910.5 1916.5
## - extraversion  1   1933.3 1937.3
## 
## Step:  AIC=1915.48
## volunteer ~ extraversion
## 
##                Df Deviance    AIC
## <none>              1911.5 1915.5
## - extraversion  1   1933.5 1935.5
# Display the summary of the final model
summary(final_model)
## 
## Call:
## glm(formula = volunteer ~ extraversion, family = "binomial", 
##     data = Cowles_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.13942    0.18538  -6.146 7.93e-10 ***
## extraversion  0.06561    0.01414   4.640 3.49e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1933.5  on 1420  degrees of freedom
## Residual deviance: 1911.5  on 1419  degrees of freedom
## AIC: 1915.5
## 
## Number of Fisher Scoring iterations: 4


E. Do the same task of backward selection, but using the \(\texttt{step}\) function from \(\texttt{R}\). (15 points)

Answer:

# Load necessary library
# install.packages("MASS")  # Uncomment if you need to install the package
library(MASS)

# Load the dataset
Cowles_data <- read.csv("https://raw.githubusercontent.com/ejvanholm/DataProjects/master/Cowles.csv")
Cowles_data = Cowles_data[c("neuroticism", "extraversion", "volunteer")]
Cowles_data$volunteer = as.numeric(Cowles_data$volunteer == "yes")

# Fit the full model
full_model <- glm(volunteer ~ neuroticism + extraversion, data = Cowles_data, family = binomial)

# Display the summary of the full model
summary(full_model)
## 
## Call:
## glm(formula = volunteer ~ neuroticism + extraversion, family = binomial, 
##     data = Cowles_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.28106    0.23682  -5.409 6.32e-08 ***
## neuroticism   0.01078    0.01115   0.967    0.333    
## extraversion  0.06705    0.01423   4.712 2.45e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1933.5  on 1420  degrees of freedom
## Residual deviance: 1910.5  on 1418  degrees of freedom
## AIC: 1916.5
## 
## Number of Fisher Scoring iterations: 4
# Perform backward selection
final_model <- step(full_model, direction = "backward")
## Start:  AIC=1916.55
## volunteer ~ neuroticism + extraversion
## 
##                Df Deviance    AIC
## - neuroticism   1   1911.5 1915.5
## <none>              1910.5 1916.5
## - extraversion  1   1933.3 1937.3
## 
## Step:  AIC=1915.48
## volunteer ~ extraversion
## 
##                Df Deviance    AIC
## <none>              1911.5 1915.5
## - extraversion  1   1933.5 1935.5
# Display the summary of the final model
summary(final_model)
## 
## Call:
## glm(formula = volunteer ~ extraversion, family = binomial, data = Cowles_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.13942    0.18538  -6.146 7.93e-10 ***
## extraversion  0.06561    0.01414   4.640 3.49e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1933.5  on 1420  degrees of freedom
## Residual deviance: 1911.5  on 1419  degrees of freedom
## AIC: 1915.5
## 
## Number of Fisher Scoring iterations: 4




(Question 4): 10 Points

Free +10 extra points for attempting all the questions. Please try to complete everything, as this set will provide almost comprehensive preparation for Midterm Exam 2.


  1. Courtesy: I am thankful to Dr. Eric van Holm’s Introduction to Research Methods.↩︎