Student Name: Grace Bare

Course Name: BANA 7052

Assignment: Module 1 Assignment

library(readr)
library(tidyverse)
library(ggplot2)

Question 1. (10 points) Alumni donations are an important source of revenue for colleges and universities. If administrators could determine the factors that influence increases in the percentage of alumni who make a donation, they might be able to implement policies that could lead to increased revenues. Research shows that students who are more satisfied with their contact with teachers are more likely to graduate. As a result, one might suspect that smaller class sizes and lower student-faculty ratios might lead to a higher percentage of satisfied graduates, which in turn might lead to increases in the percentage of alumni who make a donation. The attached data alumni.xls shows data for 48 national universities (America’s Best Colleges, Year 2000 Edition). The column labeled % of Classes Under 20 shows the percentage of classes offered with fewer than 20 students. The column labeled Student/Faculty Ratio is the number of students enrolled divided by the total number of faculty. Finally, the column labeled Alumni Giving Rate is the percentage of alumni that made a donation to the university. Use R to analyze the given data and answer the following questions. Consider the alumni giving rate as the response variable Y and the percentage of classes with fewer than 20 students as the predictor variable X.

a) Start with a basic exploratory data analysis. Show summary statistics of the response variable and predictor variable.

Url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(Url)
str(alumni)  # print structure of the alumni data frame
'data.frame':   48 obs. of  5 variables:
 $ school                     : chr  "Boston College" "Brandeis University " "Brown University" "California Institute of Technology" ...
 $ percent_of_classes_under_20: int  39 68 60 65 67 52 45 69 72 61 ...
 $ student_faculty_ratio      : int  13 8 8 3 10 8 12 7 13 10 ...
 $ alumni_giving_rate         : int  25 33 40 46 28 31 27 31 35 53 ...
 $ private                    : int  1 1 1 1 1 1 1 1 1 1 ...
fit <- lm(alumni_giving_rate ~ percent_of_classes_under_20 , data = alumni)
print(fit)

Call:
lm(formula = alumni_giving_rate ~ percent_of_classes_under_20, 
    data = alumni)

Coefficients:
                (Intercept)  percent_of_classes_under_20  
                    -7.3861                       0.6578  
summary(fit)

Call:
lm(formula = alumni_giving_rate ~ percent_of_classes_under_20, 
    data = alumni)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.053  -7.158  -1.660   6.734  29.658 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -7.3861     6.5655  -1.125    0.266    
percent_of_classes_under_20   0.6578     0.1147   5.734 7.23e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.38 on 46 degrees of freedom
Multiple R-squared:  0.4169,    Adjusted R-squared:  0.4042 
F-statistic: 32.88 on 1 and 46 DF,  p-value: 7.228e-07
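
Part (a) asks for summary statistics of the response and the predictor. The following is a minimal sketch of how they could be computed. As an assumption made only to keep the example self-contained, it uses the first ten rows printed by str() above (the hypothetical name alumni_head) rather than the full 48-row data frame, so its numbers are not the answer for the full data.

```r
# First ten rows only, copied from the str() output above.
# This is a stand-in: the full alumni data frame has 48 rows.
alumni_head <- data.frame(
  percent_of_classes_under_20 = c(39, 68, 60, 65, 67, 52, 45, 69, 72, 61),
  alumni_giving_rate          = c(25, 33, 40, 46, 28, 31, 27, 31, 35, 53)
)

# Min, quartiles, median, and mean for the predictor (X) and the response (Y)
xy_summary <- summary(alumni_head)
print(xy_summary)

# Standard deviations, which summary() does not report
xy_sd <- sapply(alumni_head, sd)
print(xy_sd)
```

On the full data, the same two calls would be applied to alumni[, c("percent_of_classes_under_20", "alumni_giving_rate")].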

b) What is the nature of the variables X and Y? Are there “outliers” in the data; how might you define an outlier in this case? What is the correlation coefficient? Draw a scatter plot. Any major comments about the data?

Both X and Y are quantitative variables measured as percentages. The points are fairly scattered, but there is a clear positive linear association: the fitted slope is 0.6578 and the correlation coefficient is 0.646. The points are not tightly clustered around a line, which is consistent with a correlation that is about 0.35 short of a perfect correlation of 1. Still, universities with a higher percentage of classes under 20 students tend to have a higher alumni giving rate. I would identify outliers by fitting a linear regression line and flagging the observations that deviate substantially from it.

x <- alumni$percent_of_classes_under_20
y <- alumni$alumni_giving_rate
cor.test(alumni$alumni_giving_rate, y = alumni$percent_of_classes_under_20)

    Pearson's product-moment correlation

data:  alumni$alumni_giving_rate and alumni$percent_of_classes_under_20
t = 5.7344, df = 46, p-value = 7.228e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4427365 0.7856553
sample estimates:
      cor 
0.6456504 
ggplot(alumni, aes(x = percent_of_classes_under_20, y = alumni_giving_rate)) +
  geom_point(size = 3, alpha = .3) + 
  labs(
    x = "Percentage of Classes with Fewer Than 20",
    y = "Alumni Giving Rate"
  )
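
The idea of defining an outlier as a point that deviates substantially from the regression line can be sketched with standardized residuals. This is only an illustration under stated assumptions: the data are simulated stand-ins (not the alumni data), the helper flag_outliers() is hypothetical, and the cutoff of 2 is a common rule of thumb rather than a fixed definition.

```r
# Simulated stand-in data (NOT the alumni data) with one planted unusual point
set.seed(1)
sim <- data.frame(x = runif(40, min = 30, max = 80))
sim$y <- -7 + 0.65 * sim$x + rnorm(40, sd = 5)
sim$y[40] <- sim$y[40] + 40  # shift the last observation far above the line

fit_sim <- lm(y ~ x, data = sim)

# Hypothetical helper: indices of observations whose standardized residual
# exceeds the cutoff in absolute value
flag_outliers <- function(model, cutoff = 2) {
  which(abs(rstandard(model)) > cutoff)
}

flagged <- flag_outliers(fit_sim)
print(flagged)
```

Applied to the alumni fit, the call would be flag_outliers(fit), and the flagged rows could then be looked up by school name.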

c) Fit a simple linear regression to the data. What is your estimated regression equation?

My estimated regression equation is \(\hat{Y} = -7.3861 + 0.6578X\).

ggplot(alumni, aes(x = percent_of_classes_under_20, y = alumni_giving_rate)) +
  geom_point(size = 3, alpha = .3) + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Percentage of Classes with Fewer Than 20",
    y = "Alumni Giving Rate"
  )

lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)

Call:
lm(formula = alumni_giving_rate ~ percent_of_classes_under_20, 
    data = alumni)

Coefficients:
                (Intercept)  percent_of_classes_under_20  
                    -7.3861                       0.6578  
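
As a consistency check between the correlation output in part (b) and the regression output in part (a): in simple linear regression the squared sample correlation equals the multiple R-squared. Using the reported values:

```latex
\[
r^2 = (0.6456504)^2 \approx 0.4169 = R^2
\]
```

The two outputs also report the same test statistic (t = 5.734 on 46 degrees of freedom) and the same p-value, because testing a zero correlation and testing a zero slope are equivalent in this model.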

d) Interpret your results (e.g., how would you interpret the slope in this application?).

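The slope of 0.6578 means that each one-percentage-point increase in the share of classes with fewer than 20 students is associated with an estimated increase of about 0.66 percentage points in the mean alumni giving rate. The relationship is positive and statistically significant (p = 7.23e-07), but the giving rate rises by less than one point for every point of increase in small classes.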
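
To make the slope concrete, the fitted coefficients from part (c) can be plugged in for a hypothetical university. The value X = 50 is chosen purely for illustration and is not taken from a row of the data.

```latex
\[
\hat{Y} = -7.3861 + 0.6578 \times 50 = -7.3861 + 32.89 = 25.5039
\]
\[
\Delta \hat{Y} = 0.6578 \times 10 = 6.578 \quad \text{for a 10-point increase in } X
\]
```

Under this fit, a university with half of its classes under 20 students would have a predicted giving rate of about 25.5%, and raising that share by 10 points would raise the prediction by about 6.6 points.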

Question 2. (10 points) A Simulation Study (Simple Linear Regression). Assuming the mean response is E(Y|X)=10+5X:

a) Generate data with X∼N(μ=2,σ=0.1), sample size n=100, and error term ϵ∼N(μ=0,σ=0.5). Hint: You can use rnorm(n = 50, mean = 5, sd = 3) to simulate n=50 observations from a N(μ=5,σ=3) distribution, but note that rnorm() specifies the standard deviation (σ), rather than the variance (σ^2), of the normal distribution. It is also good practice to specify the random seed via set.seed() whenever generating random data (otherwise, your results will differ from run to run!). For this exercise, use set.seed(7052) to ensure reproducibility.

set.seed(7052)
x <- rnorm(100, mean = 2, sd = .1)
y <- rnorm(100, mean = 10 + 5*x, sd = 0.5)
lmline <- cbind(x,y)

b) Show summary statistics of the response variable and predictor variable. Are there outliers? What is the correlation coefficient? Draw a scatter plot

Yes, a few points stand apart from the rest: some in the lower-left corner of the scatter plot, and one near x = 2.02 where y rises above about 21. The correlation coefficient is 0.8042.

summary(lmline)
       x               y        
 Min.   :1.725   Min.   :18.09  
 1st Qu.:1.923   1st Qu.:19.67  
 Median :2.001   Median :20.11  
 Mean   :2.004   Mean   :20.17  
 3rd Qu.:2.070   3rd Qu.:20.70  
 Max.   :2.243   Max.   :21.80  
cor.test(x,y)

    Pearson's product-moment correlation

data:  x and y
t = 13.395, df = 98, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7218233 0.8641361
sample estimates:
      cor 
0.8042198 
plot(x,y)

c) Fit a simple linear regression. What is the estimated model? Report the estimated coefficients. What is the model mean squared error (MSE)?

The estimated coefficients are shown in the output below; the estimated model is \(\hat{Y} = 9.0218 + 5.5652X\). The MSE is 0.2032.

fit <- lm(y ~ x)
df <- data.frame(cbind(x, y))
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)


summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2073 -0.3029  0.0093  0.3033  1.3545 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.0218     0.8336   10.82   <2e-16 ***
x             5.5652     0.4155   13.39   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4509 on 98 degrees of freedom
Multiple R-squared:  0.6468,    Adjusted R-squared:  0.6432 
F-statistic: 179.4 on 1 and 98 DF,  p-value: < 2.2e-16
sigma(fit)
[1] 0.4508807
sigma(fit)^2
[1] 0.2032934
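
The MSE reported by sigma(fit)^2 is the residual sum of squares divided by the residual degrees of freedom, n - 2. A minimal sketch of that calculation, repeating the simulation from part (a) so the block runs on its own (the names rss and mse are introduced here only for illustration):

```r
# Re-create the simulated data from part (a)
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)   # residual sum of squares
mse <- rss / df.residual(fit)  # divide by n - 2 = 98 degrees of freedom

print(mse)
print(isTRUE(all.equal(mse, sigma(fit)^2)))  # should agree with sigma(fit)^2
```

For comparison, the true error variance in the simulation is 0.5^2 = 0.25, so the MSE of 0.2033 reported above is an estimate of that quantity.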

d) What is the sample mean of both X and Y? Plot the fitted regression line and the point \((\overline{X}, \overline{Y})\). What do you find?

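The sample means are \(\overline{X} = 2.004\) and \(\overline{Y} = 20.17\). The point \((\overline{X}, \overline{Y})\) sits in the middle of the scatter and lies exactly on the fitted regression line, as it must for any least squares line with an intercept.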

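averagex <- mean(x)
averagey <- mean(y)

# One-row data frame holding the point of sample means
df2 <- data.frame(averagex, averagey)

ggplot(df, aes(x = x, y = y)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # A constant color goes outside aes() so the point is drawn in red with no legend
  geom_point(data = df2, aes(x = averagex, y = averagey), color = "red", size = 3)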
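
The plot suggests that the point of sample means lies on the fitted line; this can also be checked numerically. The sketch below repeats the simulation from part (a) so that it runs on its own, and the name fitted_at_mean is introduced only for illustration.

```r
# Re-create the simulated data and the fit
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)
fit <- lm(y ~ x)

# Fitted value of the regression line at the sample mean of x
fitted_at_mean <- unname(predict(fit, newdata = data.frame(x = mean(x))))

print(c(fitted_at_mean = fitted_at_mean, mean_y = mean(y)))
print(isTRUE(all.equal(fitted_at_mean, mean(y))))  # equal up to rounding error
```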

Question 3. (8 points) Ordinary least squares (OLS) is typically used to estimate the regression coefficients β_0 and β_1 in the simple linear regression model by minimizing the residual sum of squares (RSS)

a) How about minimizing this equation?

This can be minimized with straightforward calculus and Least Squares estimation.

b) How about minimizing this equation?

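Minimizing a sum of absolute values requires numerical techniques and a computer, because the absolute value function is not differentiable at zero and there is no closed-form solution in general. With OLS and the RSS, the minimization is straightforward calculus and is easier to compute.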
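
The contrast between the two criteria can be sketched in R. This assumes that the criterion in part (b) is the sum of absolute residuals, which is inferred from the answer above rather than stated in the text. The helper sum_abs_resid() is hypothetical, the data are the simulated data from Question 2, and optim() with its default Nelder-Mead search is just one numerical option.

```r
# Simulated data from Question 2
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)

# Hypothetical helper: sum of absolute residuals for a candidate (intercept, slope)
sum_abs_resid <- function(beta) sum(abs(y - beta[1] - beta[2] * x))

# OLS has a closed-form solution, computed directly by lm()
ols <- coef(lm(y ~ x))

# The absolute-value criterion is minimized by a numerical search,
# started here from the OLS estimates
lad <- optim(par = ols, fn = sum_abs_resid)

print(rbind(ols = ols, lad = lad$par))
print(c(at_ols = sum_abs_resid(ols), at_lad = lad$value))
```

The search can only match or improve on the criterion value at its starting point, so at_lad is never larger than at_ols; how much the two sets of coefficients differ depends on the data.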

c) Why is OLS a popular choice for estimating \(β_0\) and \(β_1\)?

OLS is popular because the estimates have a closed-form solution that follows from straightforward calculus, so they are easy to compute and easy to understand. Other criteria may require numerical tools and can be harder to interpret.
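
The calculus behind that closed form can be written out briefly. The notation below (\(Q\) for the criterion, \(b_0\) and \(b_1\) for candidate values, hats for the resulting estimates) is introduced here for illustration and may differ from the course notes.

```latex
\[
Q(b_0, b_1) = \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right)^2
\]
\[
\frac{\partial Q}{\partial b_0} = -2 \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right) = 0,
\qquad
\frac{\partial Q}{\partial b_1} = -2 \sum_{i=1}^{n} X_i \left( Y_i - b_0 - b_1 X_i \right) = 0
\]
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^{n} (X_i - \overline{X})^2},
\qquad
\hat{\beta}_0 = \overline{Y} - \hat{\beta}_1 \overline{X}
\]
```

Setting the two partial derivatives to zero gives two linear equations in two unknowns, which is why the estimates can be written down directly instead of being found by search.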

Question 4. (12 points) Establish the following relationships for the simple linear regression model. (Some are trivial to show.)

a)

\(\overline{Y} = β_0 + β_1 \overline{X}\)
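
One way to establish this, assuming the coefficients here stand for the least squares estimates (written with hats below), is to start from the fact that the least squares residuals sum to zero and divide by \(n\):

```latex
\[
\sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right) = 0
\]
\[
\frac{1}{n} \sum_{i=1}^{n} Y_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot \frac{1}{n} \sum_{i=1}^{n} X_i
\quad \Longrightarrow \quad
\overline{Y} = \hat{\beta}_0 + \hat{\beta}_1 \overline{X}
\]
```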

b)

knitr::include_graphics("4b answer.jpg")

c)

knitr::include_graphics("4c answer.jpg")

d)

knitr::include_graphics("4d answer.jpg")

e)

knitr::include_graphics("4e answer.jpg")

f)

By definition, SSE is the minimized value of the sum of squared deviations: the least squares estimates are chosen precisely to make that sum as small as possible. The equation is therefore the end result of that minimization, so it is already at its minimum.

