url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
DT::datatable(alumni)
summary(alumni)
##     school          percent_of_classes_under_20 student_faculty_ratio
##  Length:48          Min.   :29.00               Min.   : 3.00
##  Class :character   1st Qu.:44.75               1st Qu.: 8.00
##  Mode  :character   Median :59.50               Median :10.50
##                     Mean   :55.73               Mean   :11.54
##                     3rd Qu.:66.25               3rd Qu.:13.50
##                     Max.   :77.00               Max.   :23.00
##  alumni_giving_rate    private
##  Min.   : 7.00      Min.   :0.0000
##  1st Qu.:18.75      1st Qu.:0.0000
##  Median :29.00      Median :1.0000
##  Mean   :29.27      Mean   :0.6875
##  3rd Qu.:38.50      3rd Qu.:1.0000
##  Max.   :67.00      Max.   :1.0000
summary(alumni$percent_of_classes_under_20)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 44.75 59.50 55.73 66.25 77.00
summary(alumni$alumni_giving_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 18.75 29.00 29.27 38.50 67.00
Response Variable (Y: Alumni Giving Rate):
Min: 7
1st Quartile: 18.75
Median: 29
Mean: 29.27
3rd Quartile: 38.50
Max: 67
Predictor Variable (X: % of Classes Under 20):
Min: 29
1st Quartile: 44.75
Median: 59.50
Mean: 55.73
3rd Quartile: 66.25
Max: 77
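The same five-number summaries can also be pulled side by side in one call (a small base R sketch, equivalent to the two summary() calls above):

# Side-by-side summaries of the predictor and the response
sapply(alumni[c("percent_of_classes_under_20", "alumni_giving_rate")], summary)
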
library(ggplot2)

x <- alumni$percent_of_classes_under_20
y <- alumni$alumni_giving_rate
ggplot(alumni, aes(x = x, y = y)) +
  geom_point(size = 3, alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, lwd = 1.5) +
  labs(
    x = "Percentage of Classes with Fewer Than 20 Students",
    y = "Alumni Giving Rate (%)"
  )
## `geom_smooth()` using formula = 'y ~ x'
Correlation coefficient
cor.test(alumni$alumni_giving_rate, y = alumni$percent_of_classes_under_20)
##
## Pearson's product-moment correlation
##
## data: alumni$alumni_giving_rate and alumni$percent_of_classes_under_20
## t = 5.7344, df = 46, p-value = 7.228e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4427365 0.7856553
## sample estimates:
## cor
## 0.6456504
X (Percent of Classes Under 20): a continuous variable, measured as a percentage.
Y (Alumni Giving Rate): a continuous variable, also measured as a percentage.
Outliers: potential outliers are universities with extreme values on either variable; for example, schools with the lowest shares of small classes (near the minimum of 29%) or the highest alumni giving rates (up to the maximum of 67%) would be worth a closer look. One way to screen for them is sketched below.
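A concrete screen (a sketch only, not part of the assignment output) is the 1.5 × IQR rule; the hypothetical flag_outliers() helper below lists any schools flagged on either variable:

# Flag values outside 1.5 * IQR of the quartiles (a common rule of thumb)
flag_outliers <- function(v) {
  q <- quantile(v, c(0.25, 0.75))
  v < q[1] - 1.5 * IQR(v) | v > q[2] + 1.5 * IQR(v)
}
alumni[flag_outliers(alumni$percent_of_classes_under_20) |
         flag_outliers(alumni$alumni_giving_rate), ]
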
Correlation Coefficient: 0.65. This indicates a moderate positive correlation between the percentage of classes with fewer than 20 students and the alumni giving rate: higher percentages of small classes are associated with higher alumni giving rates.
My estimated regression equation is Ŷ = −7.386 + 0.6578X.
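As a quick hand check of this equation (using a hypothetical value of X = 60, chosen only for illustration):

# Predicted alumni giving rate at a hypothetical 60% of classes under 20 students
-7.386 + 0.6578 * 60   # roughly 32.1
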
ggplot(alumni, aes(x = x, y = y)) +
  geom_point(size = 3, alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Percentage of Classes with Fewer Than 20 Students",
    y = "Alumni Giving Rate (%)"
  )
## `geom_smooth()` using formula = 'y ~ x'
Linearmodel <- lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
summary(Linearmodel)
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
## data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.053 -7.158 -1.660 6.734 29.658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.3861 6.5655 -1.125 0.266
## percent_of_classes_under_20 0.6578 0.1147 5.734 7.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared: 0.4169, Adjusted R-squared: 0.4042
## F-statistic: 32.88 on 1 and 46 DF, p-value: 7.228e-07
Regression coefficient
coef(Linearmodel)
## (Intercept) percent_of_classes_under_20
## -7.3860676 0.6577687
Result observations:
The regression output indicates that each one-percentage-point increase in the share of classes with fewer than 20 students is associated with an estimated increase of about 0.658 percentage points in the alumni giving rate.
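The predict() function makes this concrete (a sketch; the 40% and 41% values are hypothetical, chosen only to show a one-unit difference): the two predictions differ by exactly the estimated slope, about 0.658.

# Predicted giving rates at hypothetical 40% and 41% small-class shares;
# the difference between the two predictions equals the estimated slope
predict(Linearmodel, newdata = data.frame(percent_of_classes_under_20 = c(40, 41)))
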
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)
lmline <- cbind(x, y)
summary(lmline)
## x y
## Min. :1.725 Min. :18.09
## 1st Qu.:1.923 1st Qu.:19.67
## Median :2.001 Median :20.11
## Mean :2.004 Mean :20.17
## 3rd Qu.:2.070 3rd Qu.:20.70
## Max. :2.243 Max. :21.80
cor.test(x,y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 13.395, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7218233 0.8641361
## sample estimates:
## cor
## 0.8042198
plot(x, y, pch = 20)
abline(lm(y ~ x), lwd = 1)
The estimated coefficients are shown below; the estimated error variance (MSE) is about 0.2033.
fit <- lm(y ~ x)
df <- data.frame(cbind(x, y))
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2073 -0.3029 0.0093 0.3033 1.3545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.0218 0.8336 10.82 <2e-16 ***
## x 5.5652 0.4155 13.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared: 0.6468, Adjusted R-squared: 0.6432
## F-statistic: 179.4 on 1 and 98 DF, p-value: < 2.2e-16
sigma(fit)
## [1] 0.4508807
sigma(fit)^2
## [1] 0.2032934
I found that the point of averages, (mean(x), mean(y)), sits in the middle of the scatterplot and lies on the fitted regression line; it is marked in red in the plot below, and a quick numeric check follows the plot.
averagex <- mean(x)
averagey <- mean(y)
df2 <- data.frame(cbind(averagex, averagey))
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(data = df2, aes(x = averagex, y = averagey), color = "red", size = 3)
## `geom_smooth()` using formula = 'y ~ x'
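A quick numeric check of this property (a sketch): the fitted value at x = mean(x) should equal mean(y).

# The OLS line passes through (mean(x), mean(y)): these two numbers should match
predict(fit, newdata = data.frame(x = mean(x)))
mean(y)
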
Minimizing the sum of residuals would not be effective because positive and negative residuals cancel each other out, so the sum can be close to zero even when the fit is poor. OLS minimizes the sum of squared residuals instead, ensuring every residual contributes to the total error regardless of its sign, which gives a more faithful measure of overall error.
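The simulated fit above demonstrates this directly (a sketch): the plain sum of the OLS residuals is already essentially zero, so it cannot distinguish good fits from bad ones, while the absolute and squared sums do measure overall error.

# The plain sum of OLS residuals is ~0 by construction; the other two are not
sum(residuals(fit))        # essentially zero
sum(abs(residuals(fit)))   # total absolute error
sum(residuals(fit)^2)      # total squared error (what OLS minimizes)
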
Minimizing the sum of absolute residuals, also known as the least absolute deviations (LAD) method, is another approach but is less commonly used because it lacks some of the convenient mathematical properties of OLS. In particular, the squared-error criterion has a simple closed-form solution, while minimizing absolute residuals requires more complex optimization (it is typically solved as a linear program) and its solution need not be unique, which can make the fit less stable.
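For comparison, a LAD (median-regression) fit could be obtained with the quantreg package, assuming it is installed; this is only a sketch to contrast the two criteria, not part of the assignment output.

# LAD / median regression via quantreg, compared with the OLS coefficients
library(quantreg)
lad_fit <- rq(y ~ x, tau = 0.5)
coef(lad_fit)
coef(fit)   # OLS coefficients for comparison
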
OLS is popular for several reasons. It provides the best linear unbiased estimator (BLUE) under the Gauss-Markov theorem, meaning it has the smallest variance among all linear unbiased estimators. The resulting estimators for β0 and β1 have nice statistical properties, such as being efficient and, when the errors are normally distributed, normally distributed themselves. Finally, OLS is computationally efficient and has a closed-form solution, making it easier and faster to compute than many alternatives.
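The closed-form property is easy to verify directly by solving the normal equations by hand (a sketch) and comparing with the lm() estimates.

# Closed-form OLS: solve the normal equations (X'X) b = X'y
X <- cbind(1, x)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
drop(beta_hat)
coef(fit)   # same estimates from lm()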