Homework 2

Author

Darsh Wagh, Alex Bechakas, Kobe George


Important

Please read the instructions carefully before submitting your assignment.

  1. This assignment requires you to only upload a PDF file on Canvas
  2. Don’t collapse any code cells before submitting.
  3. Remember to make sure all your code output is rendered properly before uploading your submission.

⚠️ Please add your names to the author information in the frontmatter before submitting your assignment ⚠️

For this assignment, we will be using the following packages:

library(tidyverse)
library(car)
Warning: package 'car' was built under R version 4.3.3
Warning: package 'carData' was built under R version 4.3.3






Question 1

20 points

Simulation

Suppose you are given a sample of 100 observations on two variables, \(X,Y\). You want to estimate the relationship: \(Y=\beta_0+\beta_1 X+\epsilon\). Your main interest is in estimating the slope coefficient, \(\beta_1\).

A simple linear regression model on the 100 observations gives the ordinary least squares (OLS) estimator for \(\beta_1\) as \(\hat{\beta}_1\). It has been suggested that it might be better to split the sample into two groups (denoted Group A and Group B) of 50 observations, and estimate \(\beta_1\) using linear regression on each group. The new estimator for \(\beta_1\) is obtained as \(\tilde\beta_1=\frac{\hat\beta_1^A+\hat\beta_1^B}{2}\).

Which estimator will be better, the OLS estimator \(\hat\beta_1\), or the new estimator \(\tilde\beta_1=\frac{\hat\beta_1^A+\hat\beta_1^B}{2}\)? Make your choice by doing simulation with the following steps.

1.1 (5 points)

Generate 100 independent observations of \((X,Y)\) under the following setup:

\[ \begin{aligned} X&\sim\operatorname{Uniform}(0,80),\\ \epsilon&\sim\mathcal{N}(0,100^2),\\ Y&=180+40X+\epsilon. \end{aligned} \]

You can use runif() and rnorm() to generate random numbers from uniform distribution and normal distribution, respectively.

b_estimater<- function(){
x = runif(100,min = 0,max = 80)
normal = rnorm(100,mean = 0,sd = 100)
y = 180 + 40*x + normal

model<- lm(y~x)

b_estimate <- coef(model)[2]
return(b_estimate)
}
b_estimater()
       x 
40.48417 

1.2 (5 points)

Based on the dataset you generated in the previous question, find the OLS estimator \(\hat\beta_1\) and the proposed estimator \(\tilde\beta_1\).

btilde<- function(){
x = runif(100,min = 0,max = 80)
normal = rnorm(100,mean = 0,sd = 100)
y = 180 + 40*x + normal

model1<-lm(y[1:50]~x[1:50])
model2<- lm(y[51:100]~x[51:100])

b_A_Estimate = coef(model1)[2]

b_B_Estimate = coef(model2)[2]

btilde_estimate = (b_A_Estimate+b_B_Estimate)/2
return(btilde_estimate)
}

btilde()
 x[1:50] 
40.37868 
b_estimater()
       x 
40.71562 

1.3 (5 points)

Repeat this procedure 1000 times to obtain 1000 realizations of \(\hat\beta_1\) and \(\tilde\beta_1\). Each time, you will first generate a new sample of \((X,Y)\), and then calculate \(\hat\beta_1\) and \(\tilde\beta_1\). You may find replicate() useful.

#Created functions on previous parts to calculate it each time
bhatdata<- replicate(1000, b_estimater())
btildedata<- replicate(1000,btilde())

1.4 (5 points)

Visualize your 1000 realizations of \(\hat\beta_1\) and \(\tilde\beta_1\) in a single plot which compares the two estimators. What will be your answer to the question “Which estimator will be better, the OLS estimator \(\hat\beta_1\), or the new estimator \(\tilde\beta_1=\frac{\tilde\beta_1^A+\tilde\beta_1^B}{2}\)”?

library(ggplot2)
data<- data.frame(c(bhatdata,btildedata),rep(c("OSLEstimator", "NewEstimator"), each = 1000))

ggplot(data) +
  aes(
    x = c.bhatdata..btildedata.,
    fill = rep.c..OSLEstimator....NewEstimator....each...1000.
  ) +
  
  geom_density(adjust = 1L,alpha = .5) +
  scale_fill_hue(direction = 1) +
  labs(
    x = "Value",
    title = "OSL vs New estimator density Graph",
    fill = "Estimator"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(size = 17L))

The new created estimator is better because of shape of the graph representing a more normal curve than the OSL estimator. The estimator should represent a normal curve, and the new estimator better represents a normal curve






Question 2

30 points

Regression with categorical covariate and \(t\)-Test

2.1 (5 points)

For this problem, we will be using the Wine Quality dataset from the UCI Machine Learning Repository. The dataset consists of red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests.

Read the wine quality datasets from the specified URLs and store them in data frames df1 and df2.

url1 <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"

url2 <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"


df1 <- read.csv(url1,sep = ";") 
df2 <- read.csv(url2,sep = ";")

2.2 (5 points)

Perform the following tasks to prepare the data frame df for analysis:

  1. Combine the two data frames into a single data frame df, adding a new column called type to indicate whether each row corresponds to white or red wine.
  2. Rename the columns of df to replace spaces and dots with underscores
  3. Remove the columns fixed_acidity and free_sulfur_dioxide
  4. Convert the type column to a factor
  5. Remove rows (if any) with missing values.
df1<- df1%>%mutate("Type" = factor("white"))
df2<- df2%>%mutate("Type" = factor("red"))
df <- rbind(df1,df2) %>%rename_with(~ gsub(".","_",colnames(df1),fixed = TRUE))%>%
  select(-c(fixed_acidity,free_sulfur_dioxide))
df<-na.omit(df)
dim(df)
[1] 6497   11

Your output to dim(df) should be

[1] 6497   11

2.3 (5 points)

Recall from STAT 200, the method to compute the \(t\) statistic for the difference in means (with the equal variance assumption)

  1. Using df compute the mean of quality for red and white wine separately, and then store the difference in means as a variable called diff_mean.

  2. Compute the pooled sample variance and store the value as a variable called sp_squared.

  3. Using sp_squared and diff_mean, compute the \(t\) Statistic, and store its value in a variable called t1.

diff_mean <- mean(df1$quality)-mean(df2$quality) # Insert your code here
sp_squared <-  var(df$quality)
t1 <- diff_mean/((sp_squared)*(1/nrow(df1)+1/nrow(df2)))^.5
t1
[1] 9.61719

2.4 (5 points)

Equivalently, R has a function called t.test() which enables you to perform a two-sample \(t\)-Test without having to compute the pooled variance and difference in means.

Perform a two-sample t-test to compare the quality of white and red wines using the t.test() function with the setting var.equal=TRUE. Store the t-statistic in t2.

t_test <- t.test(quality ~ Type, data = df,var.equal = TRUE)
t2 <-  9.6856 #Found t stat
t2
[1] 9.6856

2.5 (5 points)

Fit a linear regression model to predict quality from type using the lm() function, and extract the \(t\)-statistic for the type coefficient from the model summary. Store this \(t\)-statistic in t3.

fit <- lm(quality~Type, data=df) # Insert your here
summary(fit)

Call:
lm(formula = quality ~ Type, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8779 -0.8779  0.1221  0.3640  3.1221 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.87791    0.01239 474.429   <2e-16 ***
Typered     -0.24189    0.02497  -9.686   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8671 on 6495 degrees of freedom
Multiple R-squared:  0.01424,   Adjusted R-squared:  0.01409 
F-statistic: 93.81 on 1 and 6495 DF,  p-value: < 2.2e-16
t3 <- 9.686
fit

Call:
lm(formula = quality ~ Type, data = df)

Coefficients:
(Intercept)      Typered  
     5.8779      -0.2419  

2.6 (5 points)

Print a vector containing the values of t1, t2, and t3. What can you conclude from this? Why?

print(c(t1,t2,t3)) # Insert your code here
[1] 9.61719 9.68560 9.68600










Question 3

20 points

Collinearity

We will continue working on the wine quality dataset. Specifically, the df data frame you created in Question 2.2.


3.1 (5 points)

Fit a linear regression model with all predictors against the response variable quality. Use the broom::tidy() function to print a summary of the fitted model. Round all numbers to three decimal places. What can we conclude from the model summary?

library(dplyr)
library(broom)


model <- lm(quality ~ ., data = df)

model_summary <- tidy(model) %>%
  mutate(across(where(is.numeric), round, 3))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `across(where(is.numeric), round, 3)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))
print(model_summary)
# A tibble: 11 × 5
   term                 estimate std.error statistic p.value
   <chr>                   <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)            57.1       9.30      6.15    0    
 2 volatile_acidity       -1.61      0.081   -20.0     0    
 3 citric_acid             0.027     0.078     0.347   0.728
 4 residual_sugar          0.045     0.004    10.8     0    
 5 chlorides              -0.964     0.333    -2.90    0.004
 6 total_sulfur_dioxide    0         0        -1.25    0.21 
 7 density               -55.2       9.32     -5.92    0    
 8 pH                      0.188     0.066     2.85    0.004
 9 sulphates               0.662     0.076     8.73    0    
10 alcohol                 0.277     0.014    19.5     0    
11 Typered                 0.386     0.055     7.02    0    

3.2 (10 points)

Fit two simple linear regression models using lm(): one with only citric_acid as the predictor, and another with only total_sulfur_dioxide as the predictor. In both models, use quality as the response variable. How does your model summary compare to the summary from the previous question?

model_citric <- lm(quality ~ citric_acid, data = df)

summary_citric <- tidy(model_citric) %>%
  mutate(across(where(is.numeric), round, 3))

print(summary_citric)
# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    5.66      0.026    217.         0
2 citric_acid    0.514     0.074      6.92       0
model_sulfur <- lm(quality ~ total_sulfur_dioxide, data = df)

summary_sulfur <- tidy(model_sulfur) %>%
  mutate(across(where(is.numeric), round, 3))

print(summary_sulfur)
# A tibble: 2 × 5
  term                 estimate std.error statistic p.value
  <chr>                   <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)             5.89      0.025    239.     0    
2 total_sulfur_dioxide   -0.001     0         -3.34   0.001

3.3 (5 points)

Visualize the correlation matrix of all numeric columns in df using corrplot() from ggcorrplot package.

library(ggcorrplot)
Warning: package 'ggcorrplot' was built under R version 4.3.3
cor_matrix <- df %>% 
  select(where(is.numeric)) %>% 
  cor()

ggcorrplot(cor_matrix, 
           method = "circle", 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           title = "Correlation Matrix of Numeric Variables",
           ggtheme = theme_minimal())


3.4 (5 points)

Compute the variance inflation factor (VIF) for each predictor in the full model using vif() function. What can we conclude from this?

library(car)

full_model <- lm(quality ~ ., data = df)

vif_values <- vif(full_model)

print(round(vif_values, 3))
    volatile_acidity          citric_acid       residual_sugar 
               2.104                1.549                4.680 
           chlorides total_sulfur_dioxide              density 
               1.625                2.629                9.339 
                  pH            sulphates              alcohol 
               1.352                1.523                3.420 
                Type 
               6.695 










Question 4

30 points

Variable selection, modeling interactions, variance-bias tradeoff

We will continue working on the wine quality dataset. Specifically, the df data frame you created in Question 2.2.


4.1 (5 points)

Excluding quality from df we have \(10\) possible predictors as the covariates. How many different interactions between two predictors would be possible?

There are 45 different two way interactions between 10 predictors


4.2 (5 points)

Fit a linear regression model which includes all possible predictors and their two-way interactions, and store the model object as full_model. You can use the following function to create the formula you need as the first argument of lm().

library(tidyverse)


full_model_formula <- function(df) {
  predictors <- setdiff(colnames(df), 'quality')  
  

  interactions <- apply(combn(predictors, 2), 2, function(x) paste(x, collapse = ":"))
  
  
  formula_str <- paste("quality ~", paste(c(predictors, interactions), collapse = " + "))
  
  return(as.formula(formula_str)) 
}


full_model <- lm(full_model_formula(df), data = df)


summary(full_model) 

Call:
lm(formula = full_model_formula(df), data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5964 -0.4637 -0.0182  0.4443  3.2444 

Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                           -1.621e+01  2.094e+02  -0.077 0.938325
volatile_acidity                       7.179e+00  6.275e+01   0.114 0.908920
citric_acid                            2.354e+01  6.225e+01   0.378 0.705298
residual_sugar                        -1.033e+00  6.276e-01  -1.646 0.099849
chlorides                              4.470e+01  3.668e+02   0.122 0.903013
total_sulfur_dioxide                  -3.906e-01  2.552e-01  -1.530 0.125960
density                                2.176e+01  2.092e+02   0.104 0.917183
pH                                    -1.359e+01  6.091e+01  -0.223 0.823520
sulphates                             -2.531e+01  7.484e+01  -0.338 0.735211
alcohol                                2.283e+01  5.404e+00   4.225 2.42e-05
Typered                               -1.544e+02  3.842e+01  -4.019 5.92e-05
volatile_acidity:citric_acid           9.006e-01  5.519e-01   1.632 0.102798
volatile_acidity:residual_sugar        7.869e-03  3.343e-02   0.235 0.813904
volatile_acidity:chlorides             2.482e+00  2.487e+00   0.998 0.318305
volatile_acidity:total_sulfur_dioxide  6.039e-03  2.181e-03   2.769 0.005637
volatile_acidity:density              -1.796e+01  6.275e+01  -0.286 0.774731
volatile_acidity:pH                    1.173e+00  5.703e-01   2.056 0.039829
volatile_acidity:sulphates            -1.029e+00  6.310e-01  -1.630 0.103120
volatile_acidity:alcohol               4.015e-01  1.048e-01   3.831 0.000129
volatile_acidity:Typered               1.597e+00  3.663e-01   4.360 1.32e-05
citric_acid:residual_sugar            -6.939e-06  2.910e-02   0.000 0.999810
citric_acid:chlorides                  2.719e+00  2.276e+00   1.194 0.232372
citric_acid:total_sulfur_dioxide      -2.427e-03  2.020e-03  -1.201 0.229704
citric_acid:density                   -2.962e+01  6.214e+01  -0.477 0.633654
citric_acid:pH                         1.555e+00  4.977e-01   3.124 0.001791
citric_acid:sulphates                 -2.837e-01  6.709e-01  -0.423 0.672363
citric_acid:alcohol                    1.080e-01  9.240e-02   1.169 0.242458
citric_acid:Typered                   -9.269e-01  4.026e-01  -2.302 0.021348
residual_sugar:chlorides              -2.166e-01  2.043e-01  -1.060 0.289140
residual_sugar:total_sulfur_dioxide   -2.629e-04  1.123e-04  -2.341 0.019237
residual_sugar:density                 1.303e+00  6.103e-01   2.135 0.032790
residual_sugar:pH                     -7.219e-02  2.608e-02  -2.768 0.005649
residual_sugar:sulphates               1.885e-03  3.522e-02   0.054 0.957328
residual_sugar:alcohol                 6.951e-03  3.589e-03   1.937 0.052782
residual_sugar:Typered                -6.045e-02  2.226e-02  -2.716 0.006635
chlorides:total_sulfur_dioxide        -4.587e-03  1.258e-02  -0.365 0.715493
chlorides:density                     -1.372e+01  3.672e+02  -0.037 0.970184
chlorides:pH                          -6.311e+00  3.392e+00  -1.861 0.062802
chlorides:sulphates                   -4.559e+00  1.728e+00  -2.639 0.008337
chlorides:alcohol                     -1.013e+00  5.779e-01  -1.753 0.079717
chlorides:Typered                      1.744e-01  2.410e+00   0.072 0.942302
total_sulfur_dioxide:density           4.110e-01  2.561e-01   1.605 0.108558
total_sulfur_dioxide:pH               -6.871e-03  1.890e-03  -3.636 0.000280
total_sulfur_dioxide:sulphates        -1.271e-02  2.018e-03  -6.300 3.16e-10
total_sulfur_dioxide:alcohol           1.152e-03  4.025e-04   2.861 0.004232
total_sulfur_dioxide:Typered          -4.231e-03  1.132e-03  -3.738 0.000187
density:pH                             1.373e+01  6.078e+01   0.226 0.821320
density:sulphates                      2.247e+01  7.495e+01   0.300 0.764360
density:alcohol                       -2.316e+01  5.467e+00  -4.236 2.31e-05
density:Typered                        1.612e+02  3.836e+01   4.203 2.67e-05
pH:sulphates                           1.886e+00  4.941e-01   3.817 0.000136
pH:alcohol                             2.551e-02  9.754e-02   0.262 0.793698
pH:Typered                            -2.396e+00  3.840e-01  -6.240 4.65e-10
sulphates:alcohol                     -1.844e-02  1.115e-01  -0.165 0.868623
sulphates:Typered                     -2.344e-01  4.170e-01  -0.562 0.574134
alcohol:Typered                        2.404e-01  6.852e-02   3.509 0.000453
                                         
(Intercept)                              
volatile_acidity                         
citric_acid                              
residual_sugar                        .  
chlorides                                
total_sulfur_dioxide                     
density                                  
pH                                       
sulphates                                
alcohol                               ***
Typered                               ***
volatile_acidity:citric_acid             
volatile_acidity:residual_sugar          
volatile_acidity:chlorides               
volatile_acidity:total_sulfur_dioxide ** 
volatile_acidity:density                 
volatile_acidity:pH                   *  
volatile_acidity:sulphates               
volatile_acidity:alcohol              ***
volatile_acidity:Typered              ***
citric_acid:residual_sugar               
citric_acid:chlorides                    
citric_acid:total_sulfur_dioxide         
citric_acid:density                      
citric_acid:pH                        ** 
citric_acid:sulphates                    
citric_acid:alcohol                      
citric_acid:Typered                   *  
residual_sugar:chlorides                 
residual_sugar:total_sulfur_dioxide   *  
residual_sugar:density                *  
residual_sugar:pH                     ** 
residual_sugar:sulphates                 
residual_sugar:alcohol                .  
residual_sugar:Typered                ** 
chlorides:total_sulfur_dioxide           
chlorides:density                        
chlorides:pH                          .  
chlorides:sulphates                   ** 
chlorides:alcohol                     .  
chlorides:Typered                        
total_sulfur_dioxide:density             
total_sulfur_dioxide:pH               ***
total_sulfur_dioxide:sulphates        ***
total_sulfur_dioxide:alcohol          ** 
total_sulfur_dioxide:Typered          ***
density:pH                               
density:sulphates                        
density:alcohol                       ***
density:Typered                       ***
pH:sulphates                          ***
pH:alcohol                               
pH:Typered                            ***
sulphates:alcohol                        
sulphates:Typered                        
alcohol:Typered                       ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7158 on 6441 degrees of freedom
Multiple R-squared:  0.3337,    Adjusted R-squared:  0.328 
F-statistic: 58.66 on 55 and 6441 DF,  p-value: < 2.2e-16

4.3 (5 points)

There are many insignificant predictors and interaction terms in the full model. Choose a reduced model by performing a stepwise regression in both directions (at each iteration, a predictor can be either added or removed). You can use step() and set trace=0 to avoid printing the very lengthy information during the running of step. Store the new model in an object named reduced_model. Use the broom::tidy() function to print a summary of the reduced model. Round all numbers to three decimal places.

reduced_model <- step(full_model, trace = 0)


library(broom)
tidy_reduced_model <- broom::tidy(reduced_model)


tidy_reduced_model_rounded <- tidy_reduced_model %>%
  mutate(across(where(is.numeric), round, 3))


print(tidy_reduced_model_rounded)
# A tibble: 40 × 5
   term                 estimate std.error statistic p.value
   <chr>                   <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)           -62.3      67.6      -0.922   0.357
 2 volatile_acidity      -10.5       1.86     -5.63    0    
 3 citric_acid            -5.85      1.49     -3.92    0    
 4 residual_sugar         -0.997     0.317    -3.14    0.002
 5 chlorides              33.6       8.31      4.04    0    
 6 total_sulfur_dioxide   -0.349     0.245    -1.43    0.154
 7 density                67.6      68.2       0.991   0.322
 8 pH                      0.27      0.362     0.747   0.455
 9 sulphates              -3.96      1.50     -2.63    0.009
10 alcohol                23.0       5.14      4.47    0    
# ℹ 30 more rows

4.4 (5 points)

Compare the \(R^2\) and adjusted \(R^2\) of the full model and the reduced model.

full_model_summary <- summary(full_model)
reduced_model_summary <- summary(reduced_model)


full_model_r2 <- full_model_summary$r.squared
full_model_adj_r2 <- full_model_summary$adj.r.squared


reduced_model_r2 <- reduced_model_summary$r.squared
reduced_model_adj_r2 <- reduced_model_summary$adj.r.squared


cat("Full Model: R² =", round(full_model_r2, 3), ", Adjusted R² =", round(full_model_adj_r2, 3), "\n")
Full Model: R² = 0.334 , Adjusted R² = 0.328 
cat("Reduced Model: R² =", round(reduced_model_r2, 3), ", Adjusted R² =", round(reduced_model_adj_r2, 3), "\n")
Reduced Model: R² = 0.333 , Adjusted R² = 0.329 

4.6 (5 points)

Since the full model includes more predictors than the reduced model, do you expect the full model to achieve a larger or smaller variance than the reduced model, when it comes to prediction? What about bias?