library(tidyverse)
library(car)Warning: package 'car' was built under R version 4.3.3
Warning: package 'carData' was built under R version 4.3.3
Please read the instructions carefully before submitting your assignment.
PDF file on Canvas⚠️ Please add your names to the author information in the frontmatter before submitting your assignment ⚠️
For this assignment, we will be using the following packages:
library(tidyverse)
library(car)Warning: package 'car' was built under R version 4.3.3
Warning: package 'carData' was built under R version 4.3.3
Simulation
Suppose you are given a sample of 100 observations on two variables, \(X,Y\). You want to estimate the relationship: \(Y=\beta_0+\beta_1 X+\epsilon\). Your main interest is in estimating the slope coefficient, \(\beta_1\).
A simple linear regression model on the 100 observations gives the ordinary least squares (OLS) estimator for \(\beta_1\) as \(\hat{\beta}_1\). It has been suggested that it might be better to split the sample into two groups (denoted Group A and Group B) of 50 observations, and estimate \(\beta_1\) using linear regression on each group. The new estimator for \(\beta_1\) is obtained as \(\tilde\beta_1=\frac{\hat\beta_1^A+\hat\beta_1^B}{2}\).
Which estimator will be better, the OLS estimator \(\hat\beta_1\), or the new estimator \(\tilde\beta_1=\frac{\hat\beta_1^A+\hat\beta_1^B}{2}\)? Make your choice by doing simulation with the following steps.
Generate 100 independent observations of \((X,Y)\) under the following setup:
\[ \begin{aligned} X&\sim\operatorname{Uniform}(0,80),\\ \epsilon&\sim\mathcal{N}(0,100^2),\\ Y&=180+40X+\epsilon. \end{aligned} \]
You can use runif() and rnorm() to generate random numbers from uniform distribution and normal distribution, respectively.
b_estimater<- function(){
x = runif(100,min = 0,max = 80)
normal = rnorm(100,mean = 0,sd = 100)
y = 180 + 40*x + normal
model<- lm(y~x)
b_estimate <- coef(model)[2]
return(b_estimate)
}
b_estimater() x
40.48417
Based on the dataset you generated in the previous question, find the OLS estimator \(\hat\beta_1\) and the proposed estimator \(\tilde\beta_1\).
btilde<- function(){
x = runif(100,min = 0,max = 80)
normal = rnorm(100,mean = 0,sd = 100)
y = 180 + 40*x + normal
model1<-lm(y[1:50]~x[1:50])
model2<- lm(y[51:100]~x[51:100])
b_A_Estimate = coef(model1)[2]
b_B_Estimate = coef(model2)[2]
btilde_estimate = (b_A_Estimate+b_B_Estimate)/2
return(btilde_estimate)
}
btilde() x[1:50]
40.37868
b_estimater() x
40.71562
Repeat this procedure 1000 times to obtain 1000 realizations of \(\hat\beta_1\) and \(\tilde\beta_1\). Each time, you will first generate a new sample of \((X,Y)\), and then calculate \(\hat\beta_1\) and \(\tilde\beta_1\). You may find replicate() useful.
#Created functions on previous parts to calculate it each time
bhatdata<- replicate(1000, b_estimater())
btildedata<- replicate(1000,btilde())Visualize your 1000 realizations of \(\hat\beta_1\) and \(\tilde\beta_1\) in a single plot which compares the two estimators. What will be your answer to the question “Which estimator will be better, the OLS estimator \(\hat\beta_1\), or the new estimator \(\tilde\beta_1=\frac{\tilde\beta_1^A+\tilde\beta_1^B}{2}\)”?
library(ggplot2)
data<- data.frame(c(bhatdata,btildedata),rep(c("OSLEstimator", "NewEstimator"), each = 1000))
ggplot(data) +
aes(
x = c.bhatdata..btildedata.,
fill = rep.c..OSLEstimator....NewEstimator....each...1000.
) +
geom_density(adjust = 1L,alpha = .5) +
scale_fill_hue(direction = 1) +
labs(
x = "Value",
title = "OSL vs New estimator density Graph",
fill = "Estimator"
) +
theme_minimal() +
theme(plot.title = element_text(size = 17L))The new created estimator is better because of shape of the graph representing a more normal curve than the OSL estimator. The estimator should represent a normal curve, and the new estimator better represents a normal curve
Regression with categorical covariate and \(t\)-Test
For this problem, we will be using the Wine Quality dataset from the UCI Machine Learning Repository. The dataset consists of red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests.
Read the wine quality datasets from the specified URLs and store them in data frames df1 and df2.
url1 <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
url2 <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df1 <- read.csv(url1,sep = ";")
df2 <- read.csv(url2,sep = ";")Perform the following tasks to prepare the data frame df for analysis:
df, adding a new column called type to indicate whether each row corresponds to white or red wine.df to replace spaces and dots with underscoresfixed_acidity and free_sulfur_dioxidetype column to a factordf1<- df1%>%mutate("Type" = factor("white"))
df2<- df2%>%mutate("Type" = factor("red"))
df <- rbind(df1,df2) %>%rename_with(~ gsub(".","_",colnames(df1),fixed = TRUE))%>%
select(-c(fixed_acidity,free_sulfur_dioxide))
df<-na.omit(df)
dim(df)[1] 6497 11
Your output to dim(df) should be
[1] 6497 11
Recall from STAT 200, the method to compute the \(t\) statistic for the difference in means (with the equal variance assumption)
Using df compute the mean of quality for red and white wine separately, and then store the difference in means as a variable called diff_mean.
Compute the pooled sample variance and store the value as a variable called sp_squared.
Using sp_squared and diff_mean, compute the \(t\) Statistic, and store its value in a variable called t1.
diff_mean <- mean(df1$quality)-mean(df2$quality) # Insert your code here
sp_squared <- var(df$quality)
t1 <- diff_mean/((sp_squared)*(1/nrow(df1)+1/nrow(df2)))^.5
t1[1] 9.61719
Equivalently, R has a function called t.test() which enables you to perform a two-sample \(t\)-Test without having to compute the pooled variance and difference in means.
Perform a two-sample t-test to compare the quality of white and red wines using the t.test() function with the setting var.equal=TRUE. Store the t-statistic in t2.
t_test <- t.test(quality ~ Type, data = df,var.equal = TRUE)
t2 <- 9.6856 #Found t stat
t2[1] 9.6856
Fit a linear regression model to predict quality from type using the lm() function, and extract the \(t\)-statistic for the type coefficient from the model summary. Store this \(t\)-statistic in t3.
fit <- lm(quality~Type, data=df) # Insert your here
summary(fit)
Call:
lm(formula = quality ~ Type, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.8779 -0.8779 0.1221 0.3640 3.1221
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.87791 0.01239 474.429 <2e-16 ***
Typered -0.24189 0.02497 -9.686 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8671 on 6495 degrees of freedom
Multiple R-squared: 0.01424, Adjusted R-squared: 0.01409
F-statistic: 93.81 on 1 and 6495 DF, p-value: < 2.2e-16
t3 <- 9.686
fit
Call:
lm(formula = quality ~ Type, data = df)
Coefficients:
(Intercept) Typered
5.8779 -0.2419
Print a vector containing the values of t1, t2, and t3. What can you conclude from this? Why?
print(c(t1,t2,t3)) # Insert your code here[1] 9.61719 9.68560 9.68600
Collinearity
We will continue working on the wine quality dataset. Specifically, the df data frame you created in Question 2.2.
Fit a linear regression model with all predictors against the response variable quality. Use the broom::tidy() function to print a summary of the fitted model. Round all numbers to three decimal places. What can we conclude from the model summary?
library(dplyr)
library(broom)
model <- lm(quality ~ ., data = df)
model_summary <- tidy(model) %>%
mutate(across(where(is.numeric), round, 3))Warning: There was 1 warning in `mutate()`.
ℹ In argument: `across(where(is.numeric), round, 3)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
print(model_summary)# A tibble: 11 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 57.1 9.30 6.15 0
2 volatile_acidity -1.61 0.081 -20.0 0
3 citric_acid 0.027 0.078 0.347 0.728
4 residual_sugar 0.045 0.004 10.8 0
5 chlorides -0.964 0.333 -2.90 0.004
6 total_sulfur_dioxide 0 0 -1.25 0.21
7 density -55.2 9.32 -5.92 0
8 pH 0.188 0.066 2.85 0.004
9 sulphates 0.662 0.076 8.73 0
10 alcohol 0.277 0.014 19.5 0
11 Typered 0.386 0.055 7.02 0
Fit two simple linear regression models using lm(): one with only citric_acid as the predictor, and another with only total_sulfur_dioxide as the predictor. In both models, use quality as the response variable. How does your model summary compare to the summary from the previous question?
model_citric <- lm(quality ~ citric_acid, data = df)
summary_citric <- tidy(model_citric) %>%
mutate(across(where(is.numeric), round, 3))
print(summary_citric)# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 5.66 0.026 217. 0
2 citric_acid 0.514 0.074 6.92 0
model_sulfur <- lm(quality ~ total_sulfur_dioxide, data = df)
summary_sulfur <- tidy(model_sulfur) %>%
mutate(across(where(is.numeric), round, 3))
print(summary_sulfur)# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 5.89 0.025 239. 0
2 total_sulfur_dioxide -0.001 0 -3.34 0.001
Visualize the correlation matrix of all numeric columns in df using corrplot() from ggcorrplot package.
library(ggcorrplot)Warning: package 'ggcorrplot' was built under R version 4.3.3
cor_matrix <- df %>%
select(where(is.numeric)) %>%
cor()
ggcorrplot(cor_matrix,
method = "circle",
type = "lower",
lab = TRUE,
lab_size = 3,
title = "Correlation Matrix of Numeric Variables",
ggtheme = theme_minimal())Compute the variance inflation factor (VIF) for each predictor in the full model using vif() function. What can we conclude from this?
library(car)
full_model <- lm(quality ~ ., data = df)
vif_values <- vif(full_model)
print(round(vif_values, 3)) volatile_acidity citric_acid residual_sugar
2.104 1.549 4.680
chlorides total_sulfur_dioxide density
1.625 2.629 9.339
pH sulphates alcohol
1.352 1.523 3.420
Type
6.695
Variable selection, modeling interactions, variance-bias tradeoff
We will continue working on the wine quality dataset. Specifically, the df data frame you created in Question 2.2.
Excluding quality from df we have \(10\) possible predictors as the covariates. How many different interactions between two predictors would be possible?
There are 45 different two way interactions between 10 predictors
Fit a linear regression model which includes all possible predictors and their two-way interactions, and store the model object as full_model. You can use the following function to create the formula you need as the first argument of lm().
library(tidyverse)
full_model_formula <- function(df) {
predictors <- setdiff(colnames(df), 'quality')
interactions <- apply(combn(predictors, 2), 2, function(x) paste(x, collapse = ":"))
formula_str <- paste("quality ~", paste(c(predictors, interactions), collapse = " + "))
return(as.formula(formula_str))
}
full_model <- lm(full_model_formula(df), data = df)
summary(full_model)
Call:
lm(formula = full_model_formula(df), data = df)
Residuals:
Min 1Q Median 3Q Max
-3.5964 -0.4637 -0.0182 0.4443 3.2444
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.621e+01 2.094e+02 -0.077 0.938325
volatile_acidity 7.179e+00 6.275e+01 0.114 0.908920
citric_acid 2.354e+01 6.225e+01 0.378 0.705298
residual_sugar -1.033e+00 6.276e-01 -1.646 0.099849
chlorides 4.470e+01 3.668e+02 0.122 0.903013
total_sulfur_dioxide -3.906e-01 2.552e-01 -1.530 0.125960
density 2.176e+01 2.092e+02 0.104 0.917183
pH -1.359e+01 6.091e+01 -0.223 0.823520
sulphates -2.531e+01 7.484e+01 -0.338 0.735211
alcohol 2.283e+01 5.404e+00 4.225 2.42e-05
Typered -1.544e+02 3.842e+01 -4.019 5.92e-05
volatile_acidity:citric_acid 9.006e-01 5.519e-01 1.632 0.102798
volatile_acidity:residual_sugar 7.869e-03 3.343e-02 0.235 0.813904
volatile_acidity:chlorides 2.482e+00 2.487e+00 0.998 0.318305
volatile_acidity:total_sulfur_dioxide 6.039e-03 2.181e-03 2.769 0.005637
volatile_acidity:density -1.796e+01 6.275e+01 -0.286 0.774731
volatile_acidity:pH 1.173e+00 5.703e-01 2.056 0.039829
volatile_acidity:sulphates -1.029e+00 6.310e-01 -1.630 0.103120
volatile_acidity:alcohol 4.015e-01 1.048e-01 3.831 0.000129
volatile_acidity:Typered 1.597e+00 3.663e-01 4.360 1.32e-05
citric_acid:residual_sugar -6.939e-06 2.910e-02 0.000 0.999810
citric_acid:chlorides 2.719e+00 2.276e+00 1.194 0.232372
citric_acid:total_sulfur_dioxide -2.427e-03 2.020e-03 -1.201 0.229704
citric_acid:density -2.962e+01 6.214e+01 -0.477 0.633654
citric_acid:pH 1.555e+00 4.977e-01 3.124 0.001791
citric_acid:sulphates -2.837e-01 6.709e-01 -0.423 0.672363
citric_acid:alcohol 1.080e-01 9.240e-02 1.169 0.242458
citric_acid:Typered -9.269e-01 4.026e-01 -2.302 0.021348
residual_sugar:chlorides -2.166e-01 2.043e-01 -1.060 0.289140
residual_sugar:total_sulfur_dioxide -2.629e-04 1.123e-04 -2.341 0.019237
residual_sugar:density 1.303e+00 6.103e-01 2.135 0.032790
residual_sugar:pH -7.219e-02 2.608e-02 -2.768 0.005649
residual_sugar:sulphates 1.885e-03 3.522e-02 0.054 0.957328
residual_sugar:alcohol 6.951e-03 3.589e-03 1.937 0.052782
residual_sugar:Typered -6.045e-02 2.226e-02 -2.716 0.006635
chlorides:total_sulfur_dioxide -4.587e-03 1.258e-02 -0.365 0.715493
chlorides:density -1.372e+01 3.672e+02 -0.037 0.970184
chlorides:pH -6.311e+00 3.392e+00 -1.861 0.062802
chlorides:sulphates -4.559e+00 1.728e+00 -2.639 0.008337
chlorides:alcohol -1.013e+00 5.779e-01 -1.753 0.079717
chlorides:Typered 1.744e-01 2.410e+00 0.072 0.942302
total_sulfur_dioxide:density 4.110e-01 2.561e-01 1.605 0.108558
total_sulfur_dioxide:pH -6.871e-03 1.890e-03 -3.636 0.000280
total_sulfur_dioxide:sulphates -1.271e-02 2.018e-03 -6.300 3.16e-10
total_sulfur_dioxide:alcohol 1.152e-03 4.025e-04 2.861 0.004232
total_sulfur_dioxide:Typered -4.231e-03 1.132e-03 -3.738 0.000187
density:pH 1.373e+01 6.078e+01 0.226 0.821320
density:sulphates 2.247e+01 7.495e+01 0.300 0.764360
density:alcohol -2.316e+01 5.467e+00 -4.236 2.31e-05
density:Typered 1.612e+02 3.836e+01 4.203 2.67e-05
pH:sulphates 1.886e+00 4.941e-01 3.817 0.000136
pH:alcohol 2.551e-02 9.754e-02 0.262 0.793698
pH:Typered -2.396e+00 3.840e-01 -6.240 4.65e-10
sulphates:alcohol -1.844e-02 1.115e-01 -0.165 0.868623
sulphates:Typered -2.344e-01 4.170e-01 -0.562 0.574134
alcohol:Typered 2.404e-01 6.852e-02 3.509 0.000453
(Intercept)
volatile_acidity
citric_acid
residual_sugar .
chlorides
total_sulfur_dioxide
density
pH
sulphates
alcohol ***
Typered ***
volatile_acidity:citric_acid
volatile_acidity:residual_sugar
volatile_acidity:chlorides
volatile_acidity:total_sulfur_dioxide **
volatile_acidity:density
volatile_acidity:pH *
volatile_acidity:sulphates
volatile_acidity:alcohol ***
volatile_acidity:Typered ***
citric_acid:residual_sugar
citric_acid:chlorides
citric_acid:total_sulfur_dioxide
citric_acid:density
citric_acid:pH **
citric_acid:sulphates
citric_acid:alcohol
citric_acid:Typered *
residual_sugar:chlorides
residual_sugar:total_sulfur_dioxide *
residual_sugar:density *
residual_sugar:pH **
residual_sugar:sulphates
residual_sugar:alcohol .
residual_sugar:Typered **
chlorides:total_sulfur_dioxide
chlorides:density
chlorides:pH .
chlorides:sulphates **
chlorides:alcohol .
chlorides:Typered
total_sulfur_dioxide:density
total_sulfur_dioxide:pH ***
total_sulfur_dioxide:sulphates ***
total_sulfur_dioxide:alcohol **
total_sulfur_dioxide:Typered ***
density:pH
density:sulphates
density:alcohol ***
density:Typered ***
pH:sulphates ***
pH:alcohol
pH:Typered ***
sulphates:alcohol
sulphates:Typered
alcohol:Typered ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7158 on 6441 degrees of freedom
Multiple R-squared: 0.3337, Adjusted R-squared: 0.328
F-statistic: 58.66 on 55 and 6441 DF, p-value: < 2.2e-16
There are many insignificant predictors and interaction terms in the full model. Choose a reduced model by performing a stepwise regression in both directions (at each iteration, a predictor can be either added or removed). You can use step() and set trace=0 to avoid printing the very lengthy information during the running of step. Store the new model in an object named reduced_model. Use the broom::tidy() function to print a summary of the reduced model. Round all numbers to three decimal places.
reduced_model <- step(full_model, trace = 0)
library(broom)
tidy_reduced_model <- broom::tidy(reduced_model)
tidy_reduced_model_rounded <- tidy_reduced_model %>%
mutate(across(where(is.numeric), round, 3))
print(tidy_reduced_model_rounded)# A tibble: 40 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -62.3 67.6 -0.922 0.357
2 volatile_acidity -10.5 1.86 -5.63 0
3 citric_acid -5.85 1.49 -3.92 0
4 residual_sugar -0.997 0.317 -3.14 0.002
5 chlorides 33.6 8.31 4.04 0
6 total_sulfur_dioxide -0.349 0.245 -1.43 0.154
7 density 67.6 68.2 0.991 0.322
8 pH 0.27 0.362 0.747 0.455
9 sulphates -3.96 1.50 -2.63 0.009
10 alcohol 23.0 5.14 4.47 0
# ℹ 30 more rows
Compare the \(R^2\) and adjusted \(R^2\) of the full model and the reduced model.
full_model_summary <- summary(full_model)
reduced_model_summary <- summary(reduced_model)
full_model_r2 <- full_model_summary$r.squared
full_model_adj_r2 <- full_model_summary$adj.r.squared
reduced_model_r2 <- reduced_model_summary$r.squared
reduced_model_adj_r2 <- reduced_model_summary$adj.r.squared
cat("Full Model: R² =", round(full_model_r2, 3), ", Adjusted R² =", round(full_model_adj_r2, 3), "\n")Full Model: R² = 0.334 , Adjusted R² = 0.328
cat("Reduced Model: R² =", round(reduced_model_r2, 3), ", Adjusted R² =", round(reduced_model_adj_r2, 3), "\n")Reduced Model: R² = 0.333 , Adjusted R² = 0.329
Since the full model includes more predictors than the reduced model, do you expect the full model to achieve a larger or smaller variance than the reduced model, when it comes to prediction? What about bias?