The R code is hidden by default; please click Show to view all of the code.
Import the data from https://bgreenwell.github.io/uc-bana7052/data/alumni.csv and load the required libraries.
# Read the alumni data from the course website
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
str(alumni)
attach(alumni)  # make the columns available by name

# Load packages used throughout
library(tidyverse)
library(investr)  # for plotFit()
library(here)
a) What is the estimated slope? Is it significant at the \(\alpha = 0.05\) level? Clearly write out the null and alternative hypotheses, observed t-statistic, p-value, and interpret the estimated and test results.
# Fit an SLR model to the data
fit <- lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
summary(fit)
Call:
lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
data = alumni)
Residuals:
Min 1Q Median 3Q Max
-21.053 -7.158 -1.660 6.734 29.658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.3861 6.5655 -1.125 0.266
percent_of_classes_under_20 0.6578 0.1147 5.734 7.23e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.38 on 46 degrees of freedom
Multiple R-squared: 0.4169, Adjusted R-squared: 0.4042
F-statistic: 32.88 on 1 and 46 DF, p-value: 7.228e-07
alpha <- 0.05
n <- 48
quantile <- qt(1 - alpha/2, df = n-2)
print(paste("The cutoff value is :", quantile))
[1] "The cutoff value is : 2.01289559891943"
From the above result, the estimated slope is 0.6578. It is statistically significant, with a p-value less than 0.05.
The null hypothesis \(H_0\) is that there is no linear relationship between percent_of_classes_under_20 and alumni_giving_rate, \(H_0: \beta_1 = 0\). The alternative hypothesis \(H_1\) is that there is a linear relationship between percent_of_classes_under_20 and alumni_giving_rate, \(H_1: \beta_1 \neq 0\).
The estimated slope \(\hat{\beta}_1\) is 0.6578, and its p-value is \(7.23 \times 10^{-7}\), which is less than \(\alpha = 0.05\), so we reject the null hypothesis at the \(\alpha = 0.05\) level. From the summary output, the observed t-statistic is 5.734, which exceeds the critical value 2.01 (\(t_{1-\alpha/2,\, n-2}\)), again leading us to reject the null hypothesis. We estimate that the mean alumni_giving_rate increases by about 0.66 percentage points for every one-percentage-point increase in percent_of_classes_under_20.
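As a sanity check, the reported t-statistic and p-value can be reproduced by hand from the coefficient table; this is a minimal sketch assuming the fit object created above.
# Reproduce the slope t-test by hand from the fitted model
b1 <- coef(fit)["percent_of_classes_under_20"]
se_b1 <- coef(summary(fit))["percent_of_classes_under_20", "Std. Error"]
t_obs <- unname(b1 / se_b1)                            # observed t-statistic (about 5.73)
p_val <- 2 * pt(abs(t_obs), df = df.residual(fit), lower.tail = FALSE)
c(t = t_obs, p = p_val)                                # should match the summary() output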
b) Repeat part a. above using the equivalent F-test.
# Compute the ANOVA table for the fitted model
anova(fit)
Analysis of Variance Table
Response: alumni_giving_rate
Df Sum Sq Mean Sq F value Pr(>F)
percent_of_classes_under_20 1 3539.8 3539.8 32.884 7.228e-07 ***
Residuals 46 4951.7 107.6
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the ANOVA table above, the F-statistic is 32.884 with p-value \(7.228 \times 10^{-7}\), identical to the p-value from the t-test, so we again reject \(H_0: \beta_1 = 0\). For simple linear regression, the F-test and the t-test for the slope are equivalent: the F-statistic equals the square of the t-statistic (\(5.734^2 \approx 32.88\)).
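The equivalence can be verified directly in R; a minimal sketch assuming the fit object from above:
# The F-statistic from the ANOVA table equals the squared slope t-statistic
t_val <- coef(summary(fit))["percent_of_classes_under_20", "t value"]
F_val <- anova(fit)["percent_of_classes_under_20", "F value"]
c(t_squared = t_val^2, F = F_val)   # both approximately 32.88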
c) What is the value of \(R^2\)? Please interpret.
R2 <- summary(fit)$r.squared
print(paste("R Squared value is:", R2))
[1] "R Squared value is: 0.416864463660243"
From the output above, the \(R^2\) of the fitted model is 0.4169, meaning that about 41.69% of the variation in alumni_giving_rate is explained by the linear regression on percent_of_classes_under_20.
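Equivalently, \(R^2\) can be computed from the ANOVA sums of squares as SSR/SST; a minimal sketch using the anova(fit) table above:
# R-squared as the proportion of total variation explained: SSR / (SSR + SSE)
ss <- anova(fit)[["Sum Sq"]]   # sums of squares: regression (SSR), then residual (SSE)
ss[1] / sum(ss)                # about 0.417, matching summary(fit)$r.squared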
d) What is the correlation coefficient r between
percent_of_classes_under_20 and
alumni_giving_rate? What is the relationship between \(r\) and \(R^2\)?
r <- sqrt(R2)  # correlation coefficient; it takes the sign of the slope (positive here)
print(paste("The correlation coefficient r is", r))
[1] "The correlation coefficient r is 0.645650419081598"
From the result above, \(R^2 = 0.4169\) and \(r = \pm\sqrt{R^2} = 0.6457\); in simple linear regression, \(r^2 = R^2\). Because the estimated slope (0.6578) is positive, \(r\) takes the positive sign. \(r\) conveys both the direction and the strength of the linear relationship, whereas \(R^2\) measures only the strength (the proportion of variation explained). The result indicates a fairly strong positive linear relationship between the two variables: as percent_of_classes_under_20 increases, alumni_giving_rate tends to increase as well.
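As a check, the correlation can also be computed directly; a minimal sketch assuming the alumni data frame loaded above:
# The sample correlation has the sign of the slope and squares to R-squared
r_check <- with(alumni, cor(percent_of_classes_under_20, alumni_giving_rate))
c(r = r_check, r_squared = r_check^2)   # about 0.646 and 0.417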
e) Plot the training data with the fitted regression line and include a 95% confidence band for the mean responses.
# Fitted regression line with a 95% confidence band for the mean response
plotFit(fit, interval = "confidence", shade = TRUE, xlim = c(0, 80),
        xlab = "Percent of Classes Under 20", ylab = "Alumni Giving Rate")

# Mark the point of means (x-bar, y-bar), where the band is narrowest
xbar <- mean(alumni$percent_of_classes_under_20)
ybar <- mean(alumni$alumni_giving_rate)
points(xbar, ybar, pch = 19, col = "blue", cex = 1.5)
text(xbar, ybar, labels = expression(paste("(", bar(x), ", ", bar(y), ")")),
     pos = 2, col = "blue")
I used the mean() function to get \(\bar{X}\) and \(\bar{Y}\), which gives the point (55.7, 29.3). From the plot, the confidence band is narrowest around \((\bar{X}, \bar{Y})\) and widens as we move away from \(\bar{X}\) in either direction.
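The widening follows from the standard error of the fitted mean, \(\sqrt{MSE\,(1/n + (x_0 - \bar{x})^2 / S_{xx})}\), which is smallest at \(x_0 = \bar{x}\). A minimal sketch comparing band widths (the x values 30 and 80 are arbitrary illustration points):
# Width of the 95% confidence band at a few x values; narrowest near x-bar (about 55.7)
new_x <- data.frame(percent_of_classes_under_20 = c(30, xbar, 80))
ci <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
cbind(new_x, width = ci[, "upr"] - ci[, "lwr"])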
a) Generate data with \(X \sim N(\mu = 2, \sigma = 0.1)\), sample size \(n = 100\), and error term \(\epsilon \sim N(\mu = 0, \sigma = 0.5)\).
set.seed(7052)                            # for reproducibility
x <- rnorm(100, mean = 2, sd = 0.1)       # predictor X ~ N(2, 0.1)
error <- rnorm(100, mean = 0, sd = 0.5)   # error term
y <- 10 + 5 * x + error                   # true model: beta0 = 10, beta1 = 5
data <- data.frame(x, y)
head(data, n = 3)
b) Fit a simple linear regression to the simulated data from part a. What is the estimated prediction equation? Report the estimated coefficients and their standard errors. Are they significant? Clearly write out the null and alternative hypotheses, observed t-statistic(s), p-value(s), and interpret the estimates and test results. What is fitted model’s MSE?
data_fit <- lm(y ~ x, data = data)
summary(data_fit)
Call:
lm(formula = y ~ x, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.2073 -0.3029 0.0093 0.3033 1.3545
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.0218 0.8336 10.82 <2e-16 ***
x 5.5652 0.4155 13.39 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4509 on 98 degrees of freedom
Multiple R-squared: 0.6468, Adjusted R-squared: 0.6432
F-statistic: 179.4 on 1 and 98 DF, p-value: < 2.2e-16
mse <- mean(data_fit$residuals^2)
print(paste("MSE:", mse))
[1] "MSE: 0.19922755610719"
Fitting the model with lm() and calling summary() on the result, the estimated intercept is 9.02 and the estimated slope is 5.57; their standard errors are 0.83 and 0.42, respectively.
Given the intercept and slope, we can write the estimated prediction equation to be \(\hat{Y} = 9.02 + 5.57X\).
The null hypothesis \(H_0\) is that there is no linear relationship between x and y, \(H_0 : \beta_1 = 0\). The alternative hypothesis \(H_1\) is that there is a linear relationship between x and y, \(H_1: \beta_1 \neq 0\).
For the intercept \(\beta_0\), the t-statistic is 10.82; for the slope \(\beta_1\), the t-statistic is 13.39; both p-values are less than \(2 \times 10^{-16}\).
Since the p-value for the slope is far below the \(\alpha = 0.05\) threshold, and the estimated slope of 5.57 is well away from zero (and close to the true value of 5 used in the simulation), we reject the null hypothesis: there is a statistically significant linear relationship between X and Y.
The MSE is about 0.2, indicating a reasonably good fit of the model to the data.
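Note that mean(data_fit$residuals^2) divides the SSE by n; the textbook MSE divides by n - 2 and equals the squared residual standard error reported by summary(). A minimal sketch comparing the two for data_fit:
# Two estimates of the error variance from the same fit
sse <- sum(residuals(data_fit)^2)
c(sse_over_n = sse / nobs(data_fit),     # what mean(residuals^2) computes (about 0.199)
  mse = sse / df.residual(data_fit),     # SSE / (n - 2), about 0.203
  sigma_squared = sigma(data_fit)^2)     # squared residual standard error; matches the MSE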
c) Repeat part b, but re-simulate the data and change the error term to \(\epsilon \sim N(0, \sigma = 1)\).
Code to re-simulate the data is below:
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
error <- rnorm(100, mean = 0, sd = 1)
y <- 10 + 5 * x + error
data2 <- data.frame(x, y)
head(data2, n=3)
Fit the linear regression model and get the summary statistics and MSE.
data_fit2 <- lm(y ~ x, data = data2)
summary(data_fit2)
Call:
lm(formula = y ~ x, data = data2)
Residuals:
Min 1Q Median 3Q Max
-2.4146 -0.6058 0.0186 0.6066 2.7090
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.0436 1.6673 4.824 5.16e-06 ***
x 6.1303 0.8309 7.378 5.25e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9018 on 98 degrees of freedom
Multiple R-squared: 0.3571, Adjusted R-squared: 0.3505
F-statistic: 54.43 on 1 and 98 DF, p-value: 5.253e-11
mse2 <- mean(data_fit2$residuals^2)
print(paste("MSE of data_fit2:", mse2))
[1] "MSE of data_fit2: 0.796910224428767"
From the re-simulation data and the linear regression model summary statistics, we can learn:
Given the intercept 8.04 and slope 6.13, we can write the estimated prediction equation to be \(\hat{Y} = 8.04 + 6.13X\).
The null hypothesis \(H_0\) is that there is no linear relationship between x and y, \(H_0 : \beta_1 = 0\). The alternative hypothesis \(H_1\) is that there is a linear relationship between x and y, \(H_1: \beta_1 \neq 0\).
For the intercept \(\beta_0\), the t-statistic is 4.82 with p-value \(5.16 \times 10^{-6}\); for the slope \(\beta_1\), the t-statistic is 7.38 with p-value \(5.25 \times 10^{-11}\).
Since the p-value for the slope is well below the \(\alpha = 0.05\) threshold, and the estimated slope of 6.13 is well away from zero, we reject the null hypothesis: there is still a statistically significant linear relationship between X and Y.
The MSE is about 0.8; it increased because the error variance increased, which indicates a worse fit compared with the previous model.
d) Repeat parts a)–c) using n=400. What do you conclude? What is the effect on the model parameter estimates when error variance gets smaller? What is the effect when sample size gets bigger?
I'll call the data set with n = 400 and error standard deviation \(\sigma = 0.5\) data3, and the data set with n = 400 and \(\sigma = 1\) data4.
set.seed(7052)
x <- rnorm(400, mean = 2, sd = 0.1)
error <- rnorm(400, mean = 0, sd = 0.5)
y <- 10 + 5 * x + error
data3 <- data.frame(x, y)
head(data3, n=3)
set.seed(7052)
x <- rnorm(400, mean = 2, sd = 0.1)
error <- rnorm(400, mean = 0, sd = 1)
y <- 10 + 5 * x + error
data4 <- data.frame(x, y)
head(data4, n=3)
Perform the linear regression model for data3 and data4:
data_fit3 <- lm(y ~ x, data = data3)
summary(data_fit3)
Call:
lm(formula = y ~ x, data = data3)
Residuals:
Min 1Q Median 3Q Max
-1.76214 -0.33740 0.03615 0.32077 1.63021
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.7466 0.5015 19.44 <2e-16 ***
x 5.1177 0.2490 20.55 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4887 on 398 degrees of freedom
Multiple R-squared: 0.5149, Adjusted R-squared: 0.5137
F-statistic: 422.5 on 1 and 398 DF, p-value: < 2.2e-16
mse3 <- mean(data_fit3$residuals^2)
print(paste("MSE of data_fit3:", mse3))
[1] "MSE of data_fit3: 0.237632790738204"
data_fit4 <- lm(y ~ x, data = data4)
summary(data_fit4)
Call:
lm(formula = y ~ x, data = data4)
Residuals:
Min 1Q Median 3Q Max
-3.5243 -0.6748 0.0723 0.6415 3.2604
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.4933 1.0029 9.466 <2e-16 ***
x 5.2355 0.4979 10.514 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9774 on 398 degrees of freedom
Multiple R-squared: 0.2174, Adjusted R-squared: 0.2154
F-statistic: 110.5 on 1 and 398 DF, p-value: < 2.2e-16
mse4 <- mean(data_fit4$residuals^2)
print(paste("MSE of data_fit4:", mse4))
[1] "MSE of data_fit4: 0.950531162952817"
From the above summary statistics for data3 and
data4, we can learn:
For data3, given the intercept 9.75 and slope 5.12, we
can write the estimated prediction equation to be \(\hat{Y} = 9.75 + 5.12X\). For
data4, given the intercept 9.49 and slope 5.23, we can
write the estimated prediction equation to be \(\hat{Y} = 9.49 + 5.23X\).
The null hypothesis \(H_0\) is that there is no linear relationship between x and y, \(H_0: \beta_1 = 0\). The alternative hypothesis \(H_1\) is that there is a linear relationship between x and y, \(H_1: \beta_1 \neq 0\). The hypotheses are the same for both data3 and data4.
For data3, the t-statistic is 19.44 for the intercept \(\beta_0\) and 20.55 for the slope \(\beta_1\), with both p-values less than \(2 \times 10^{-16}\). For data4, the t-statistic is 9.47 for the intercept and 10.51 for the slope, again with both p-values less than \(2 \times 10^{-16}\).
From the summary statistics, the p-value for the slope in both data sets is far below the \(\alpha = 0.05\) threshold, and the estimated slopes (5.12 for data3 and 5.23 for data4) are well away from zero. Therefore we reject the null hypothesis for both data sets: there is a statistically significant linear relationship between X and Y.
Also, from this simulation, we can see some clear patterns as the sample size and the error variance change. The key quantities from the four fits are summarized in the table below:

Model       n     sigma   Slope    Slope SE   t value   MSE
data_fit    100   0.5     5.5652   0.4155     13.39     0.199
data_fit2   100   1.0     6.1303   0.8309      7.38     0.797
data_fit3   400   0.5     5.1177   0.2490     20.55     0.238
data_fit4   400   1.0     5.2355   0.4979     10.51     0.951
When the sample size increases, the coefficient estimates become more precise: the standard errors shrink, the t-statistics grow, and the p-values get smaller. Larger t-statistics mean the coefficients are more clearly statistically significant, and smaller p-values make us more likely to reject the null hypothesis. Conversely, when the error variance increases, the variability in the response increases and the standard errors grow; the t-statistics shrink and the p-values increase, so the coefficients are less likely to be significant.
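These patterns can also be pulled programmatically from the four fitted models; a minimal sketch assuming data_fit, data_fit2, data_fit3, and data_fit4 from above:
# Collect the slope estimate, its standard error, the t value, and the MSE from each fit
fits <- list(n100_sd0.5 = data_fit, n100_sd1 = data_fit2,
             n400_sd0.5 = data_fit3, n400_sd1 = data_fit4)
t(sapply(fits, function(m) {
  c(slope = unname(coef(m)[2]),
    slope_se = coef(summary(m))[2, "Std. Error"],
    t_value = coef(summary(m))[2, "t value"],
    mse = mean(residuals(m)^2))
}))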
e) What about the MSE from each model?
Again, from the table above, we can see that:
When the sample size increases while the error variance stays the same, the MSE does not change much, since it estimates the same \(\sigma^2\). When the error variance increases, however, the residuals become larger and the MSE increases as well, so a larger error variance corresponds to a worse fit to the data.
a) What are the bias and variance of the OLS estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\)?
The OLS estimators are unbiased: their expected values equal the true parameters, so the bias of each is zero.
That is, \(E(\hat{\beta}_0) = \beta_0\) and \(E(\hat{\beta}_1) = \beta_1\), so \(Bias(\hat{\beta}_0) = Bias(\hat{\beta}_1) = 0\).
Both estimators can be written as linear combinations of the responses: \(\hat{\beta}_0 = \sum_{i=1}^n a_i Y_i\) and \(\hat{\beta}_1 = \sum_{i=1}^n w_i Y_i\). Their variances are \(Var(\hat{\beta}_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)\) and \(Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}\), where \(S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2\) is the total sum of squared deviations of x from its mean.
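These formulas can be checked numerically against the estimated covariance matrix of any fitted model; a minimal sketch using the simulated data_fit from above, with \(\sigma^2\) replaced by its MSE estimate:
# Compare the closed-form variances with vcov() for the simulated fit
x_sim <- model.frame(data_fit)$x
s2 <- sigma(data_fit)^2                    # MSE, estimating sigma^2
Sxx <- sum((x_sim - mean(x_sim))^2)        # total sum of squared deviations of x
c(var_b0_formula = s2 * (1 / length(x_sim) + mean(x_sim)^2 / Sxx),
  var_b0_vcov = vcov(data_fit)[1, 1],
  var_b1_formula = s2 / Sxx,
  var_b1_vcov = vcov(data_fit)[2, 2])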
b) What do you expect to happen to the variances of the OLS estimates of \(\beta_0\) and \(\beta_1\) when the sample size n increases? What do you expect when the error variance \(\sigma^2\) increases?
When the sample size increases, the coefficient estimates become more precise and less influenced by random variability, so the variances of both \(\hat{\beta}_0\) and \(\hat{\beta}_1\) decrease (in the formulas above, \(S_{xx}\) grows as n grows).
When the error variance \(\sigma^2\) increases, the variances of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) increase, because a larger \(\sigma^2\) means more noise and random variability in the responses, making the estimates less precise.
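A small Monte Carlo sketch (hypothetical settings mirroring the simulations above, not part of the assignment output) illustrates both effects on the variability of \(\hat{\beta}_1\):
# Empirical standard deviation of the slope estimate for different n and sigma
sim_slope_sd <- function(n, sigma, reps = 500) {
  slopes <- replicate(reps, {
    x <- rnorm(n, mean = 2, sd = 0.1)
    y <- 10 + 5 * x + rnorm(n, mean = 0, sd = sigma)
    coef(lm(y ~ x))[2]
  })
  sd(slopes)   # variability of the slope estimate across repeated samples
}
set.seed(7052)
c(n100_sd0.5 = sim_slope_sd(100, 0.5), n400_sd0.5 = sim_slope_sd(400, 0.5),
  n100_sd1 = sim_slope_sd(100, 1), n400_sd1 = sim_slope_sd(400, 1))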
c) What is the bias of the model’s MSE? What about the ML estimate of σ^2? What is the difference between these two estimates of σ^2? Why do we use MSE instead of the ML estimate?
The ML estimate of the error variance is \(\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n e_i^2\), which is biased: it underestimates \(\sigma^2\), since \(E(\hat{\sigma}^2_{MLE}) = \frac{n-2}{n}\sigma^2\). The model's MSE, \(\hat{\sigma}^2 = MSE = \frac{1}{n-2}\sum_{i=1}^n e_i^2\), is an unbiased estimator of \(\sigma^2\) (its bias is zero); the 2 is subtracted because simple linear regression estimates two parameters, \(\beta_0\) and \(\beta_1\).
The difference is that the ML estimate divides the SSE by n, whereas the MSE divides by n - p, where p is the number of estimated parameters (p = 2 in simple linear regression). The MSE accounts for the degrees of freedom lost to estimation, whereas the ML estimate does not.
We use the MSE because it is unbiased: by accounting for the degrees of freedom, it does not systematically underestimate the true error variance the way the ML estimate does.
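For a concrete comparison, both estimates can be computed from the alumni fit above; a minimal sketch assuming the fit object is still available:
# Biased ML estimate (divide by n) versus the unbiased MSE (divide by n - 2)
e <- residuals(fit)
n <- length(e)
c(sigma2_mle = sum(e^2) / n,       # underestimates sigma^2 by a factor of (n - 2) / n
  mse = sum(e^2) / (n - 2))        # equals sigma(fit)^2, the squared residual standard error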