url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
DT::datatable(alumni)
summary(alumni)
##     school          percent_of_classes_under_20 student_faculty_ratio
##  Length:48          Min.   :29.00               Min.   : 3.00
##  Class :character   1st Qu.:44.75               1st Qu.: 8.00
##  Mode  :character   Median :59.50               Median :10.50
##                     Mean   :55.73               Mean   :11.54
##                     3rd Qu.:66.25               3rd Qu.:13.50
##                     Max.   :77.00               Max.   :23.00
##  alumni_giving_rate    private
##  Min.   : 7.00      Min.   :0.0000
##  1st Qu.:18.75      1st Qu.:0.0000
##  Median :29.00      Median :1.0000
##  Mean   :29.27      Mean   :0.6875
##  3rd Qu.:38.50      3rd Qu.:1.0000
##  Max.   :67.00      Max.   :1.0000
summary(alumni$percent_of_classes_under_20)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 44.75 59.50 55.73 66.25 77.00
summary(alumni$alumni_giving_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 18.75 29.00 29.27 38.50 67.00
Response Variable (Y: Alumni Giving Rate):
Min: 7
1st Quartile: 18.75
Median: 29
Mean: 29.27
3rd Quartile: 38.50
Max: 67
Predictor Variable (X: % of Classes Under 20):
Min: 29
1st Quartile: 44.75
Median: 59.50
Mean: 55.73
3rd Quartile: 66.25
Max: 77
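The same five-number summaries can also be pulled side by side in one call (a small base R sketch, equivalent to the two summary() calls above):

# Side-by-side summaries of the predictor and the response
sapply(alumni[c("percent_of_classes_under_20", "alumni_giving_rate")], summary)
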
library(ggplot2)

x <- alumni$percent_of_classes_under_20
y <- alumni$alumni_giving_rate
ggplot(alumni, aes(x = x, y = y)) +
  geom_point(size = 3, alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, lwd = 1.5) +
  labs(
    x = "Percentage of Classes with Fewer Than 20 Students",
    y = "Alumni Giving Rate (%)"
  )
## `geom_smooth()` using formula = 'y ~ x'
Correlation coefficient
cor.test(alumni$alumni_giving_rate, y = alumni$percent_of_classes_under_20)
##
## Pearson's product-moment correlation
##
## data: alumni$alumni_giving_rate and alumni$percent_of_classes_under_20
## t = 5.7344, df = 46, p-value = 7.228e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4427365 0.7856553
## sample estimates:
## cor
## 0.6456504
X (Percent of Classes Under 20): a continuous variable, measured as a percentage.
Y (Alumni Giving Rate): a continuous variable, also measured as a percentage.
Outliers: potential outliers are universities with extreme values on either variable; for example, schools with the lowest shares of small classes (near the minimum of 29%) or the highest alumni giving rates (up to the maximum of 67%) would be worth a closer look. One way to screen for them is sketched below.
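A concrete screen (a sketch only, not part of the assignment output) is the 1.5 × IQR rule; the hypothetical flag_outliers() helper below lists any schools flagged on either variable:

# Flag values outside 1.5 * IQR of the quartiles (a common rule of thumb)
flag_outliers <- function(v) {
  q <- quantile(v, c(0.25, 0.75))
  v < q[1] - 1.5 * IQR(v) | v > q[2] + 1.5 * IQR(v)
}
alumni[flag_outliers(alumni$percent_of_classes_under_20) |
         flag_outliers(alumni$alumni_giving_rate), ]
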
Correlation Coefficient: 0.65. This indicates a moderate positive correlation between the percentage of classes with fewer than 20 students and the alumni giving rate: higher percentages of small classes are associated with higher alumni giving rates.
My estimated regression equation is Ŷ = −7.386 + 0.6578X.
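As a quick hand check of this equation (using a hypothetical value of X = 60, chosen only for illustration):

# Predicted alumni giving rate at a hypothetical 60% of classes under 20 students
-7.386 + 0.6578 * 60   # roughly 32.1
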
ggplot(alumni, aes(x = x, y = y)) +
  geom_point(size = 3, alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Percentage of Classes with Fewer Than 20 Students",
    y = "Alumni Giving Rate (%)"
  )
## `geom_smooth()` using formula = 'y ~ x'
Linearmodel <- lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
summary(Linearmodel)
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
## data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.053 -7.158 -1.660 6.734 29.658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.3861 6.5655 -1.125 0.266
## percent_of_classes_under_20 0.6578 0.1147 5.734 7.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared: 0.4169, Adjusted R-squared: 0.4042
## F-statistic: 32.88 on 1 and 46 DF, p-value: 7.228e-07
Regression coefficient
coef(Linearmodel)
## (Intercept) percent_of_classes_under_20
## -7.3860676 0.6577687
Result observations:
The regression output indicates that each one-percentage-point increase in the share of classes with fewer than 20 students is associated with an estimated increase of about 0.658 percentage points in the alumni giving rate.
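The predict() function makes this concrete (a sketch; the 40% and 41% values are hypothetical, chosen only to show a one-unit difference): the two predictions differ by exactly the estimated slope, about 0.658.

# Predicted giving rates at hypothetical 40% and 41% small-class shares;
# the difference between the two predictions equals the estimated slope
predict(Linearmodel, newdata = data.frame(percent_of_classes_under_20 = c(40, 41)))
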
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)
lmline <- cbind(x, y)
summary(lmline)
## x y
## Min. :1.725 Min. :18.09
## 1st Qu.:1.923 1st Qu.:19.67
## Median :2.001 Median :20.11
## Mean :2.004 Mean :20.17
## 3rd Qu.:2.070 3rd Qu.:20.70
## Max. :2.243 Max. :21.80
cor.test(x,y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 13.395, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7218233 0.8641361
## sample estimates:
## cor
## 0.8042198
plot(x, y, pch = 20)
abline(lm(y ~ x), lwd = 1)
The estimated coefficients are shown below; the estimated error variance (MSE) is about 0.2033.
fit <- lm(y ~ x)
df <- data.frame(cbind(x, y))
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2073 -0.3029 0.0093 0.3033 1.3545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.0218 0.8336 10.82 <2e-16 ***
## x 5.5652 0.4155 13.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared: 0.6468, Adjusted R-squared: 0.6432
## F-statistic: 179.4 on 1 and 98 DF, p-value: < 2.2e-16
sigma(fit)
## [1] 0.4508807
sigma(fit)^2
## [1] 0.2032934
I found that the point of averages, (mean(x), mean(y)), sits in the middle of the scatterplot and lies on the fitted regression line; it is marked in red in the plot below, and a quick numeric check follows the plot.
averagex <- mean(x)
averagey <- mean(y)
df2 <- data.frame(cbind(averagex, averagey))
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(data = df2, aes(x = averagex, y = averagey), color = "red", size = 3)
## `geom_smooth()` using formula = 'y ~ x'
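A quick numeric check of this property (a sketch): the fitted value at x = mean(x) should equal mean(y).

# The OLS line passes through (mean(x), mean(y)): these two numbers should match
predict(fit, newdata = data.frame(x = mean(x)))
mean(y)
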
Minimizing the sum of residuals would not be effective because positive and negative residuals cancel each other out, so the sum can be close to zero even when the fit is poor. OLS minimizes the sum of squared residuals instead, ensuring every residual contributes to the total error regardless of its sign, which gives a more faithful measure of overall error.
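The simulated fit above demonstrates this directly (a sketch): the plain sum of the OLS residuals is already essentially zero, so it cannot distinguish good fits from bad ones, while the absolute and squared sums do measure overall error.

# The plain sum of OLS residuals is ~0 by construction; the other two are not
sum(residuals(fit))        # essentially zero
sum(abs(residuals(fit)))   # total absolute error
sum(residuals(fit)^2)      # total squared error (what OLS minimizes)
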
Minimizing the sum of absolute residuals, also known as the least absolute deviations (LAD) method, is another approach but is less commonly used because it lacks some of the convenient mathematical properties of OLS. In particular, the squared-error criterion has a simple closed-form solution, while minimizing absolute residuals requires more complex optimization (it is typically solved as a linear program) and its solution need not be unique, which can make the fit less stable.
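For comparison, a LAD (median-regression) fit could be obtained with the quantreg package, assuming it is installed; this is only a sketch to contrast the two criteria, not part of the assignment output.

# LAD / median regression via quantreg, compared with the OLS coefficients
library(quantreg)
lad_fit <- rq(y ~ x, tau = 0.5)
coef(lad_fit)
coef(fit)   # OLS coefficients for comparison
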
OLS is popular for several reasons. It provides the best linear unbiased estimator (BLUE) under the Gauss-Markov theorem, meaning it has the smallest variance among all linear unbiased estimators. The resulting estimators for β0 and β1 have nice statistical properties, such as being efficient and, when the errors are normally distributed, normally distributed themselves. Finally, OLS is computationally efficient and has a closed-form solution, making it easier and faster to compute than many alternatives.
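The closed-form property is easy to verify directly by solving the normal equations by hand (a sketch) and comparing with the lm() estimates.

# Closed-form OLS: solve the normal equations (X'X) b = X'y
X <- cbind(1, x)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
drop(beta_hat)
coef(fit)   # same estimates from lm()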