Homework Module 1 BANA 7052

Question 1

1a Start with a basic exploratory data analysis. Show summary statistics of the response variable and predictor variable

summary(alumni$alumni_giving_rate) # response variable

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   18.75   29.00   29.27   38.50   67.00

summary(alumni$percent_of_classes_under_20) # predictor variable

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   44.75   59.50   55.73   66.25   77.00

1b What is the nature of the variables X and Y? Are there outliers in the data? How might you define an outlier in this case?

Both X (percent of classes under 20) and Y (alumni giving rate) are continuous variables. Judging from the scatter plot, the only potential outliers are the points with X below 35 or above 70, which sit apart from the bulk of the data.
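One way to make the outlier definition concrete is the 1.5 x IQR rule. This is a common convention assumed here for illustration, not something the assignment specifies. The sketch below applies it to the quartiles printed by summary() in 1a; the helper name iqr_fences is made up for this example.

```r
# Hypothetical helper: lower and upper fences under the 1.5 x IQR rule
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles copied from the summary() output in 1a
x_fences <- iqr_fences(44.75, 66.25)  # percent_of_classes_under_20
y_fences <- iqr_fences(18.75, 38.50)  # alumni_giving_rate

x_fences
y_fences
```

Under this rule the fences are 12.5 and 98.5 for X and -10.875 and 68.125 for Y, so the observed ranges (29 to 77 for X, 7 to 67 for Y) fall inside them. Any points treated as outliers would therefore have to be judged from the scatter plot rather than flagged by this rule.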

What is the correlation coefficient? Draw a scatter plot. Any major comments about the data?

The correlation coefficient is 0.65. There is a moderately strong positive linear association between the percentage of classes with fewer than 20 students and the alumni giving rate, and it is statistically significant (p-value = 7.228e-07).

cor.test(alumni$alumni_giving_rate, y = alumni$percent_of_classes_under_20)

## 
##  Pearson's product-moment correlation
## 
## data:  alumni$alumni_giving_rate and alumni$percent_of_classes_under_20
## t = 5.7344, df = 46, p-value = 7.228e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4427365 0.7856553
## sample estimates:
##       cor 
## 0.6456504
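As a quick check on the output above, the t statistic of the Pearson test can be recovered from the correlation itself using t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below only plugs in the numbers printed by cor.test(); the variable names are chosen for this example.

```r
# Values copied from the cor.test() output above
r <- 0.6456504
n <- 46 + 2  # the test reports df = n - 2 = 46

# t statistic for H0: true correlation equals 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Squared correlation, which equals R-squared in simple linear regression
r_squared <- r^2

t_stat
r_squared
```

The result agrees with the reported t = 5.7344 up to rounding, and r^2 is about 0.4169.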

1c Fit a simple linear regression to the data. What is your estimated regression equation?

The estimated regression equation is Y = -7.386 + 0.6578x

linear <- lm(formula = alumni_giving_rate ~ percent_of_classes_under_20, 
   data = alumni)
summary(linear)

## 
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20, 
##     data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.053  -7.158  -1.660   6.734  29.658 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -7.3861     6.5655  -1.125    0.266    
## percent_of_classes_under_20   0.6578     0.1147   5.734 7.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared:  0.4169, Adjusted R-squared:  0.4042 
## F-statistic: 32.88 on 1 and 46 DF,  p-value: 7.228e-07

coef(linear)

##                 (Intercept) percent_of_classes_under_20 
##                  -7.3860676                   0.6577687

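1d Interpret your results (e.g., how would you interpret the slope in this application?)

The estimated slope is 0.6578: for each one-percentage-point increase in the percent of classes with fewer than 20 students, the estimated mean alumni giving rate increases by about 0.66 percentage points. The slope is positive and statistically significant (p-value = 7.23e-07), and the model explains about 42% of the variation in alumni giving rate (R-squared = 0.4169). The intercept (-7.386) has no practical interpretation because no school in the data has a value of X near 0.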
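To make the slope concrete, here is a small sketch that uses the coefficients printed by coef(linear). The helper predict_giving and the example values of 60 and 61 percent are assumptions chosen for illustration.

```r
# Coefficients copied from the coef(linear) output above
b0 <- -7.3860676
b1 <- 0.6577687

# Hypothetical helper: predicted alumni giving rate for a given
# percent of classes under 20
predict_giving <- function(pct_under_20) b0 + b1 * pct_under_20

pred_60 <- predict_giving(60)
pred_61 <- predict_giving(61)

# A one-point increase in X changes the prediction by exactly the slope
pred_61 - pred_60
```

The difference between the two predictions equals the slope, which is what "change in the estimated mean of Y per one-unit change in X" means in practice.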

2 A Simulation Study (Simple Linear Regression). Assuming the mean response is E(Y|X) = 10 + 5x

set.seed(7052)
x <- rnorm(100, mean = 2, sd = .1)
y <- rnorm(100, mean = 10 + 5*x, sd = 0.5)
lmline <- cbind(x,y)
summary(lmline)

##        x               y        
##  Min.   :1.725   Min.   :18.09  
##  1st Qu.:1.923   1st Qu.:19.67  
##  Median :2.001   Median :20.11  
##  Mean   :2.004   Mean   :20.17  
##  3rd Qu.:2.070   3rd Qu.:20.70  
##  Max.   :2.243   Max.   :21.80

Show summary statistics of the response variable and predictor variable. Are there outliers? What is the correlation coefficient? Draw a scatter plot

There are only a few potential outliers, with y values around 21. The correlation coefficient is 0.8042.

cor.test(x, y)

## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 13.395, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7218233 0.8641361
## sample estimates:
##       cor 
## 0.8042198

plot(x,y,pch=20)
abline(lm(y ~ x), lwd = 1)

Fit a simple linear regression. What is the estimated model? Report the estimated coefficients. What is the model mean squared error (MSE)?

The estimated model is y = 9.0218 + 5.5652x, with estimated intercept 9.0218 and estimated slope 5.5652. The MSE is 0.2033.

library(ggplot2)  # needed for ggplot(); assumed to be loaded in a setup chunk not shown

fit <- lm(y ~ x)
df <- data.frame(x, y)
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

summary(fit)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2073 -0.3029  0.0093  0.3033  1.3545 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.0218     0.8336   10.82   <2e-16 ***
## x             5.5652     0.4155   13.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared:  0.6468, Adjusted R-squared:  0.6432 
## F-statistic: 179.4 on 1 and 98 DF,  p-value: < 2.2e-16

sigma(fit)

## [1] 0.4508807

sigma(fit)^2

## [1] 0.2032934
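The MSE reported by sigma(fit)^2 is the residual sum of squares divided by the residual degrees of freedom, n - 2. The sketch below repeats the simulation from this question and computes it by hand; the names sse and mse_manual are chosen for this example.

```r
# Same simulation setup as above
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)
fit <- lm(y ~ x)

# MSE = SSE / (n - 2) in simple linear regression
sse <- sum(residuals(fit)^2)
mse_manual <- sse / df.residual(fit)

# Should agree with the squared residual standard error
all.equal(mse_manual, sigma(fit)^2)
```

The MSE estimates the error variance, which is 0.5^2 = 0.25 in this simulation, so an estimate of about 0.20 is in the right range.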

What is the sample mean of both X and Y? Plot the fitted regression line and the point (X̄, Ȳ). What do you find?

The sample means are x̄ = 2.004 and ȳ = 20.17. The point (x̄, ȳ) lies on the fitted regression line, which is a general property of the least squares fit.

meanx <- mean(x)
meany <- mean(y)

# One-row data frame holding the point of sample means
dataframe.mean2 <- data.frame(meanx, meany)

ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Give this layer its own one-row data, and set color outside aes() so the point is drawn in red
  geom_point(data = dataframe.mean2, aes(x = meanx, y = meany), color = "red", size = 3)

## `geom_smooth()` using formula = 'y ~ x'
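The plot can be backed up numerically: evaluating the fitted line at the sample mean of x should return the sample mean of y. The sketch below repeats the simulation and checks this with predict(); the name y_at_meanx is chosen for this example.

```r
# Same simulation setup as above
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)
fit <- lm(y ~ x)

# Fitted value at the sample mean of x
y_at_meanx <- unname(predict(fit, newdata = data.frame(x = mean(x))))

# The fitted line passes through (mean(x), mean(y))
all.equal(y_at_meanx, mean(y))
```

This follows from the OLS intercept formula b0 = ȳ - b1 * x̄, so it holds for any simple linear regression fit with an intercept.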

3 Ordinary least squares (OLS) is typically used to estimate the regression coefficients β0 and β1 in the simple linear regression model by minimizing the residual sum of squares (RSS)

Minimizing the plain sum of residuals does not work because positive and negative residuals cancel each other out. Any line through the point of sample means has residuals that sum to zero, however badly it fits, so this criterion cannot pick out a best-fitting regression line.
Minimizing the sum of absolute residuals is the least absolute deviations (LAD) method. It does produce a fitted regression line, but the coefficients are hard to find: there is no simple closed-form formula because the absolute value is not differentiable at zero, so an iterative numerical method is required.
OLS is popular because, under the Gauss-Markov assumptions, it gives the best linear unbiased estimators, meaning the smallest variance among linear unbiased estimators, which makes the estimates reliable and accurate. The coefficients also have simple closed-form expressions, so OLS is computationally much easier than the other methods.
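The three points above can be illustrated on simulated data. This is a minimal sketch under assumed settings: the seed, sample size, and helper names (resid_sum, lad_loss) are made up for the example, and optim() is used as one generic way to minimize the absolute-deviation loss, not as the standard LAD algorithm.

```r
set.seed(1)  # arbitrary seed, for this illustration only
x <- rnorm(50, mean = 2, sd = 0.1)
y <- rnorm(50, mean = 10 + 5 * x, sd = 0.5)

# Sum of residuals for a line with the given slope that passes
# through (mean(x), mean(y))
resid_sum <- function(slope) {
  intercept <- mean(y) - slope * mean(x)
  sum(y - (intercept + slope * x))
}
resid_sum(0)     # flat line: residuals still cancel to about 0
resid_sum(-100)  # clearly wrong slope: still about 0

# LAD: no closed form, so search numerically for the minimizer
lad_loss <- function(b) sum(abs(y - b[1] - b[2] * x))
lad_fit <- optim(c(mean(y), 0), lad_loss)
lad_fit$par

# OLS: closed-form slope and intercept
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
```

The closed-form values match coef(lm(y ~ x)), while the LAD estimates depend on the optimizer and its starting point.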

4 Establish the following relationships for the simple linear regression model. a. b. c.

knitr::include_graphics("C:/Users/maens/OneDrive/Desktop/BANA 7052/4 answers a, b, c.jpg")

knitr::include_graphics("C:/Users/maens/OneDrive/Desktop/BANA 7052/answer d.jpg")

knitr::include_graphics("C:/Users/maens/OneDrive/Desktop/BANA 7052/answer e.jpg")

f. By the definition of SSE, the quantity is already minimized. SSE is the sum of the squared differences between the observed values yi and the fitted values yhati, and the least squares estimates are chosen precisely to make this sum as small as possible. The fitted line therefore already meets the criterion for the best-fitting line.
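A numerical sketch of this point, reusing the simulated data from Question 2: moving either coefficient away from its least squares estimate makes the sum of squared errors larger. The helper sse and the shift of 0.1 are assumptions chosen for illustration.

```r
# Same simulation setup as in Question 2
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)
fit <- lm(y ~ x)

# Hypothetical helper: sum of squared errors for any candidate line
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

b <- unname(coef(fit))
sse_ols <- sse(b[1], b[2])

# Shift one coefficient at a time away from the OLS estimate
sse_shift_intercept <- sse(b[1] + 0.1, b[2])
sse_shift_slope <- sse(b[1], b[2] + 0.1)

c(ols = sse_ols, intercept_shifted = sse_shift_intercept,
  slope_shifted = sse_shift_slope)
```

Both shifted lines give a larger SSE than the fitted line, which is consistent with the least squares estimates being the minimizers.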

Matthew Ensor

2025-10-20