Assignment 1
Question 1.
Import the data from the URL https://bgreenwell.github.io/uc-bana7052/data/alumni.csv and load the required libraries.
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
str(alumni)
attach(alumni)
library(tidyverse)
a) Start with a basic exploratory data analysis. Show summary
statistics of the response variable and predictor variable.
The summary statistics of the predictor variable
(X, percent_of_classes_under_20) are displayed below. It has a median of
59.5, a mean of 55.73, and a standard deviation of 13.19. The minimum
value is 29 and the maximum value is 77.
summary(percent_of_classes_under_20)
Min. 1st Qu. Median Mean 3rd Qu. Max.
29.00 44.75 59.50 55.73 66.25 77.00
sd(percent_of_classes_under_20)
[1] 13.19371
The summary statistics of the response variable
(Y, alumni_giving_rate) are displayed below. It has a median of 29, a
mean of 29.27, and a standard deviation of 13.44. The minimum value is 7
and the maximum value is 67.
summary(alumni_giving_rate)
Min. 1st Qu. Median Mean 3rd Qu. Max.
7.00 18.75 29.00 29.27 38.50 67.00
sd(alumni_giving_rate)
[1] 13.44135
b) What is the nature of the variables X and Y? Are there
“outliers” in the data; how might you define an outlier in this case?
What is the correlation coefficient? Draw a scatter plot. Any major
comments about the data?
cor.test(percent_of_classes_under_20, alumni_giving_rate)
Pearson's product-moment correlation
data: percent_of_classes_under_20 and alumni_giving_rate
t = 5.7344, df = 46, p-value = 7.228e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4427365 0.7856553
sample estimates:
cor
0.6456504
Based on Pearson’s correlation test, X and Y have a positive linear
correlation of about 0.6457, which is moderately strong. The p-value
(7.228e-07) is far below the 0.05 significance level, so we reject the
null hypothesis that the true correlation between X and Y is zero.
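As a quick cross-check, the t statistic and p-value reported by cor.test() can be reproduced by hand from the sample correlation. The sketch below assumes only the values printed above (r = 0.6456504 and df = 46, which implies n = 48); it does not reload the data.

```r
# Values copied from the cor.test() output above
r <- 0.6456504   # sample correlation
n <- 48          # df = n - 2 = 46, so n = 48

# t statistic for H0: true correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

t_stat
p_value
r^2   # in simple linear regression this equals the Multiple R-squared
```

The squared correlation is also the share of the variation in Y explained by the fitted line, which is why it should agree with the R-squared from lm().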
ggplot(alumni, aes(percent_of_classes_under_20, alumni_giving_rate)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Correlation between Percent of Class Under 20 and Giving Rate",
x = "Percent of Class Under 20",
y = "Alumni Giving Rate")

In the graphic above, I used a scatter plot with a fitted linear model
to check for outliers. There are a few points that deviate strongly from
the trend line, and I would define those points as the outliers in this
case.
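One way to make "strongly deviated from the trend line" concrete is to flag points whose standardized residual exceeds a cutoff. The sketch below is only an illustration under my own assumptions: flag_outliers() is a hypothetical helper, the cutoff of 2 is a common rule of thumb rather than something from the assignment, and the toy data are made up with one planted outlier.

```r
# Hypothetical helper: indices of points whose standardized residual
# is larger than `cutoff` in absolute value
flag_outliers <- function(fit, cutoff = 2) {
  which(abs(rstandard(fit)) > cutoff)
}

# Made-up toy data: a clean linear trend with small alternating noise
x_toy <- 1:20
y_toy <- 2 + 3 * x_toy + rep(c(-0.5, 0.5), 10)
y_toy[10] <- y_toy[10] + 15   # plant one obvious outlier at observation 10

fit_toy <- lm(y_toy ~ x_toy)
flagged <- flag_outliers(fit_toy)
flagged
```

The same helper could be applied to a model fitted on the alumni data, for example flag_outliers(lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)).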
c) Fit a simple linear regression to the data. What is your
estimated regression equation?
To fit a simple linear regression, we need estimates of \(\beta_0\) and \(\beta_1\), since the simple linear regression model
is \(Y\) = \(\beta_0\) + \(\beta_1\)\(X\) + \(\epsilon\).
To estimate \(\beta_0\) and \(\beta_1\), I will use lm()
in R.
fit <- lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
summary(fit)
Call:
lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
data = alumni)
Residuals:
Min 1Q Median 3Q Max
-21.053 -7.158 -1.660 6.734 29.658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.3861 6.5655 -1.125 0.266
percent_of_classes_under_20 0.6578 0.1147 5.734 7.23e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.38 on 46 degrees of freedom
Multiple R-squared: 0.4169, Adjusted R-squared: 0.4042
F-statistic: 32.88 on 1 and 46 DF, p-value: 7.228e-07
We can see that the estimated intercept \(\hat{\beta_0}\) is -7.3861 and the estimated slope \(\hat{\beta_1}\) is 0.6578. So the estimated
regression equation is \(\hat{Y}\) = -7.3861 + 0.6578\(X\).
d) Interpret your results (e.g., how would you interpret the
slope in this application?).
There is a positive linear relationship between the two variables.
The average giving rate increases by an estimated 0.6578 percentage
points for every one-percentage-point increase in the percent of classes
under 20.
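To illustrate the slope interpretation, the short sketch below plugs two predictor values into the estimated equation. The coefficients are copied from the summary(fit) output above; predict_giving() is a hypothetical helper name, and the values 50 and 60 are arbitrary choices inside the observed range of X (29 to 77).

```r
# Coefficients copied from the summary(fit) output above
b0 <- -7.3861
b1 <- 0.6578

# Hypothetical helper: estimated mean giving rate at a given X
predict_giving <- function(x) b0 + b1 * x

at_50 <- predict_giving(50)
at_60 <- predict_giving(60)

at_50
at_60
at_60 - at_50   # a 10-point increase in X changes the estimate by 10 * b1
```

Note that the intercept is not meaningful on its own here, because X = 0 lies far outside the observed data.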
Question 2: A Simulation Study (Simple Linear Regression). Assuming
the mean response is E(Y|X)=10+5X:
a) Generate data with X∼N(μ=2,σ=0.1), sample size n=100, and error
term ϵ∼N(μ=0,σ=0.5).
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
error <- rnorm(100, mean = 0, sd = 0.5)
y <- 10 + 5 * x + error
data <- data.frame(x, y)
head(data)
b) Show summary statistics of the response variable and
predictor variable. Are there outliers? What is the correlation
coefficient? Draw a scatter plot.
Below are the summary statistics of response variable y and predictor
variable x.
summary(data)
x y
Min. :1.725 Min. :18.09
1st Qu.:1.923 1st Qu.:19.67
Median :2.001 Median :20.11
Mean :2.004 Mean :20.17
3rd Qu.:2.070 3rd Qu.:20.70
Max. :2.243 Max. :21.80
ggplot(data, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm")

From the graph above, we can see a few mild outliers: some points lie
far from the trend line.
cor(x, y, method = "pearson")
[1] 0.8042198
The correlation coefficient is 0.8042.
c) Fit a simple linear regression. What is the estimated
model? Report the estimated coefficients. What is the model mean squared
error (MSE)?
The fitted SLR model and coefficients are below. The estimated
intercept is 9.022 and the estimated slope is 5.565, so the estimated
model is \(\hat{Y}\) = 9.022 + 5.565\(X\).
fit2 <- lm(y ~ x, data = data)
print(fit2)
Call:
lm(formula = y ~ x, data = data)
Coefficients:
(Intercept) x
9.022 5.565
coef(fit2)
(Intercept) x
9.021796 5.565160
sigma(fit2)^2
[1] 0.2032934
The MSE is 0.203.
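For reference, the MSE can also be computed by hand as SSE / (n - 2), since the simple linear regression model estimates two coefficients. The sketch below regenerates the data with the same seed as part a) so that it is self-contained; it is only a check on sigma(fit2)^2, not a different method.

```r
# Regenerate the simulated data from part a)
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
error <- rnorm(100, mean = 0, sd = 0.5)
y <- 10 + 5 * x + error
fit2 <- lm(y ~ x)

n <- length(y)
sse <- sum(residuals(fit2)^2)   # sum of squared residuals
mse <- sse / (n - 2)            # divide by the residual degrees of freedom

mse
all.equal(mse, sigma(fit2)^2)   # should agree with the value used above
```

The MSE estimates the error variance, which is 0.5^2 = 0.25 in this simulation.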
d) What is the sample mean of both X and Y? Plot the fitted
regression line and the point \((\bar{X},
\bar{Y})\). What do you find?
summary(data)
x y
Min. :1.725 Min. :18.09
1st Qu.:1.923 1st Qu.:19.67
Median :2.001 Median :20.11
Mean :2.004 Mean :20.17
3rd Qu.:2.070 3rd Qu.:20.70
Max. :2.243 Max. :21.80
From the summary statistics, we can see the sample mean of X is
2.004, and the sample mean for Y is 20.17.
plot(x, y, pch = 19)
abline(fit2, col = "darkorange", lwd = 2)
xbar <- mean(data$x)
ybar <- mean(data$y)
points(xbar, ybar, pch = 19, col = "blue", cex = 1.5)
# bar(X) and bar(Y) are plotmath expressions that draw the symbols with a bar on top
text(xbar, ybar, labels = expression(paste("(", bar(X), ", ", bar(Y), ")")), pos = 4, col = "blue")

From the graph above, we can see that the least squares regression
line passes through the point \((\bar{X}, \bar{Y})\).
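The plot suggests this visually; a numerical check is to predict at the sample mean of X and compare with the sample mean of Y. The sketch below regenerates the simulated data with the same seed so that it runs on its own.

```r
# Regenerate the simulated data from part a)
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
error <- rnorm(100, mean = 0, sd = 0.5)
y <- 10 + 5 * x + error
fit2 <- lm(y ~ x)

# Fitted value at the sample mean of x
y_at_xbar <- predict(fit2, newdata = data.frame(x = mean(x)))

unname(y_at_xbar)
mean(y)
all.equal(unname(y_at_xbar), mean(y))   # equal up to floating-point error
```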
Question 3. Ordinary least squares (OLS) is typically used to
estimate the regression coefficients \(\beta_0\) and \(\beta_1\) in the simple linear
regression model by minimizing the residual sum of squares (RSS).
a) How about minimizing \(\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i) =
\sum_{i=1}^n \epsilon_i\)?
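Minimizing the sum of residuals doesn’t work for regression because
positive and negative residuals cancel each other out. For example, when
one residual is positive and another one is negative, they can offset
each other, so a line that fits the data poorly can still have a sum of
zero. The sum also has no minimum at all: raising the intercept makes it
arbitrarily negative.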
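Both problems can be seen numerically. The sketch below uses made-up simulated data (the seed, sample size, and the deliberately wrong slope of -20 are my own assumptions) and a hypothetical helper sum_resid().

```r
# Made-up data for illustration only
set.seed(1)
x <- rnorm(50, mean = 2, sd = 0.1)
y <- 10 + 5 * x + rnorm(50, mean = 0, sd = 0.5)

# Hypothetical helper: the plain (unsquared) sum of residuals
sum_resid <- function(b0, b1) sum(y - b0 - b1 * x)

# Any line through (mean(x), mean(y)) has a residual sum of zero,
# even with a badly wrong slope
b1_bad <- -20
b0_bad <- mean(y) - b1_bad * mean(x)
sum_bad <- sum_resid(b0_bad, b1_bad)
sum_bad   # zero up to floating-point error

# Raising the intercept drives the sum toward minus infinity
sum_resid(1e3, 0)
sum_resid(1e6, 0)
```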
b) How about minimizing \(\sum_{i=1}^n |Y_i - \beta_0 - \beta_1 X_i| =
\sum_{i=1}^n |\epsilon_i|\)?
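This is least absolute deviations (LAD) regression, a valid
alternative to OLS that is less sensitive to outliers. It is used less
often because the objective function is not differentiable wherever a
residual is zero, so there is no closed-form solution, the optimization
process is more complex, and the minimizer may not be unique.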
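As a rough sketch of what "more complex optimization" means in practice, the absolute-value objective can be minimized numerically with a derivative-free search. The data below are made up with one planted outlier, sad() is a hypothetical helper, and Nelder-Mead via optim() is just one convenient choice in base R; dedicated LAD routines use linear programming instead and would be preferable for real work.

```r
# Made-up data with one planted outlier
set.seed(1)
x <- seq(1, 10, length.out = 30)
y <- 10 + 5 * x + rnorm(30, mean = 0, sd = 0.5)
y[1] <- y[1] + 20

# Hypothetical helper: sum of absolute deviations for beta = (b0, b1)
sad <- function(beta) sum(abs(y - beta[1] - beta[2] * x))

# Start from the OLS estimates and search without derivatives
start <- coef(lm(y ~ x))
lad <- optim(start, sad)   # default method is Nelder-Mead

start     # OLS estimates, pulled toward the outlier
lad$par   # approximate LAD estimates
lad$value <= sad(start)   # the search never ends worse than its start
```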
c) Why is OLS a popular choice for estimating \(\beta_0\) and \(\beta_1\)?
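OLS is popular for estimating \(\beta_0\) and \(\beta_1\) because its objective function is differentiable and has a simple
closed-form solution. In addition, by the Gauss-Markov theorem, the OLS
estimators are the best linear unbiased estimators (BLUE): among all
linear unbiased estimators they have the smallest variance, provided the
errors have mean zero, constant variance, and are uncorrelated.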
Question 4: Establish the following relationships for the simple
linear regression model.
a) The fitted line passes through the point \((\bar{X},\bar{Y})\)
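Substitute \(X = \bar{X}\) into the
fitted regression line \(\hat{Y} =
\hat{\beta_0} + \hat{\beta_1} X\), using \(\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X}\):
\(\hat{Y} = \hat{\beta_0} + \hat{\beta_1} \bar{X} = \bar{Y} - \hat{\beta_1} \bar{X} + \hat{\beta_1} \bar{X}\)
\(\hat{Y} = \bar{Y}\)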
b) \(\sum_{i=1}^n e_i = 0\)
Fitted regression line: \(\hat{Y_i} =
\hat{\beta_0} + \hat{\beta_1} X_i\)
The residual at each point i is: \(e_i = Y_i - \hat{Y_i} =
Y_i - (\hat{\beta_0} + \hat{\beta_1} X_i)\)
To show that the sum of \(e_i\) is
zero, sum this expression over i and use \(\sum_{i=1}^n X_i = n \bar{X}\):
\(\sum_{i=1}^n e_i = \sum_{i=1}^n Y_i - n
\hat{\beta_0} - \hat{\beta_1} n \bar{X}\)
Then substitute \(\sum_{i=1}^n Y_i = n \bar{Y}\) and \(\hat{\beta_0} = \bar{Y} - \hat{\beta_1}
\bar{X}\)
so: \(\sum_{i=1}^n e_i =n \bar{Y} - n
\bar{Y} + \hat{\beta_1} n \bar{X} - \hat{\beta_1} n \bar{X} =
0\)
c) \(\sum_{i=1}^n Y_i =
\sum_{i=1}^n \hat{Y_i}\)
Fitted regression line: \(\hat{Y_i} =
\hat{\beta_0} + \hat{\beta_1} X_i\)
\(\sum_{i=1}^n \hat{Y_i} = \sum_{i=1}^n
(\hat{\beta_0} + \hat{\beta_1} X_i) = \hat{\beta_0} \sum_{i=1}^n 1 +
\hat{\beta_1} \sum_{i=1}^n X_i\)
\(\sum_{i=1}^n \hat{Y_i} = n \hat{\beta_0} +
\hat{\beta_1} n \bar{X}\)
Substitute \(\hat{\beta_0} = \bar{Y} - \hat{\beta_1}
\bar{X}\):
\(\sum_{i=1}^n \hat{Y_i} = n \bar{Y} -
\hat{\beta_1} n \bar{X} + \hat{\beta_1} n \bar{X} = n
\bar{Y}\)
Since \(\bar{Y}\) is the mean of the
observed \(Y_i\) values, \(n \bar{Y} = \sum_{i=1}^n Y_i\), so:
\(\sum_{i=1}^n Y_i = n \bar{Y} =
\sum_{i=1}^n \hat{Y_i}\)
d) \(\sum_{i=1}^n X_i e_i =
0\)
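We already proved that \(\sum_{i=1}^n e_i
= 0\). Because \(X_i\) changes with i, it cannot be factored out of the sum. Instead, use \(\sum_{i=1}^n e_i = 0\) to center \(X_i\), and write \(e_i = (Y_i - \bar{Y}) - \hat{\beta_1} (X_i - \bar{X})\), which follows from \(\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X}\):
\(\sum_{i=1}^n X_i e_i = \sum_{i=1}^n (X_i - \bar{X}) e_i = \sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y}) - \hat{\beta_1} \sum_{i=1}^n (X_i - \bar{X})^2\)
Since \(\hat{\beta_1} = \sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y}) / \sum_{i=1}^n (X_i - \bar{X})^2\), the right-hand side is 0, so \(\sum_{i=1}^n X_i e_i = 0\).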
e) \(\sum_{i=1}^n \hat{Y_i} e_i =
0\)
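We already proved that \(\sum_{i=1}^n e_i
= 0\), and part d) states that \(\sum_{i=1}^n X_i e_i = 0\). Because \(\hat{Y_i}\) changes with i, it cannot be factored out of the sum. Instead, substitute \(\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1} X_i\):
\(\sum_{i=1}^n \hat{Y_i} e_i = \hat{\beta_0} \sum_{i=1}^n e_i + \hat{\beta_1} \sum_{i=1}^n X_i e_i = \hat{\beta_0} * 0 + \hat{\beta_1} * 0 = 0\)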
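The identities in parts b) to e) can also be checked numerically on any fitted simple linear regression. The sketch below reuses the simulation settings from Question 2 as an assumption; because of floating-point arithmetic the sums come out as tiny numbers rather than exact zeros.

```r
# Simulated data with the Question 2 settings
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- 10 + 5 * x + rnorm(100, mean = 0, sd = 0.5)
fit <- lm(y ~ x)

e <- residuals(fit)   # e_i
yhat <- fitted(fit)   # fitted values

checks <- c(
  sum_e = sum(e),                      # b) sum of residuals
  sum_y_minus_yhat = sum(y) - sum(yhat), # c) sum of Y equals sum of fitted values
  sum_xe = sum(x * e),                 # d) residuals are orthogonal to X
  sum_yhat_e = sum(yhat * e)           # e) residuals are orthogonal to fitted values
)
checks
all(abs(checks) < 1e-8)
```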
f) \(\sum_{i=1}^n e_i^2\) is
minimized
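This holds by construction: the OLS estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are defined as the values of \(\beta_0\) and \(\beta_1\) that minimize
\(Q(\beta_0, \beta_1) = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2\).
Setting the partial derivatives of \(Q\) to zero gives the normal equations \(\sum_{i=1}^n e_i = 0\) and \(\sum_{i=1}^n X_i e_i = 0\).
Because \(Q\) is a convex quadratic function of \(\beta_0\) and \(\beta_1\), the solution of the normal equations is a minimum, and it is unique
as long as the \(X_i\) are not all equal.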
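A simple numerical illustration of part f) is to move the coefficients slightly away from the OLS estimates and confirm that the residual sum of squares never decreases. The sketch below reuses the Question 2 simulation settings as an assumption; rss() is a hypothetical helper and the perturbation size of 0.1 is arbitrary.

```r
# Simulated data with the Question 2 settings
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- 10 + 5 * x + rnorm(100, mean = 0, sd = 0.5)
fit <- lm(y ~ x)

# Hypothetical helper: residual sum of squares for a candidate line
rss <- function(b0, b1) sum((y - b0 - b1 * x)^2)

b <- unname(coef(fit))
rss_hat <- rss(b[1], b[2])   # RSS at the OLS estimates

# Perturb the intercept and slope on a small grid around the estimates
deltas <- expand.grid(d0 = c(-0.1, 0, 0.1), d1 = c(-0.1, 0, 0.1))
rss_perturbed <- mapply(function(d0, d1) rss(b[1] + d0, b[2] + d1),
                        deltas$d0, deltas$d1)

rss_hat
min(rss_perturbed)
all(rss_perturbed >= rss_hat - 1e-9)   # no perturbed line beats OLS
```

This is only a spot check on a grid; the general result follows from the convexity of the RSS.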
