tuition = read.csv("tuition_final.csv")

Creating an acceptance rate variable (as a percentage) and extracting the UNC-CH row from the dataset.

tuition$Acc.Rate = (tuition$Accepted/tuition$Applied)*100

tuition[tuition$Name == "University of North Carolina at Chapel Hill",]
##       ID                                        Name State Public Avg.SAT
## 682 2974 University of North Carolina at Chapel Hill    NC      1    1121
##     Avg.ACT Applied Accepted  Size Out.Tuition Spending Acc.Rate
## 682      NA   14596     5985 14609        8400    15893 41.00438

Plotting tuition price against average SAT score, the two variables for our simple linear regression.

plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "College Tuition Based on SAT Score", xlab = "SAT Score", ylab = "Tuition Price", pch = 20, cex = 1, col = "blue")

After some research, we learn that a simple linear regression line is determined by two formulas, one for the slope (B1) and one for the intercept (B0). We can then write functions in R to compute them.
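For reference, the two estimators can be written out as follows. This is the standard least-squares form that matches the functions defined next, where r is the sample correlation between x and y, S_x and S_y are the sample standard deviations, and the barred symbols are the sample means:

```latex
\hat{\beta}_1 = r \, \frac{S_y}{S_x},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \, \bar{x}
```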

beta1 <- function(r, Sy, Sx){
  B1 = r * (Sy/Sx)
  return(B1)
}

beta0 <- function(r, Sy, Sx, y_bar, x_bar){
  if (Sx > 0){
    B1 = r * (Sy/Sx)
    B0 = y_bar - B1*x_bar
  } else {
    B0 = NA
    print("Sx = 0. Can't calculate B0!")
  }
  return(B0)
}

Now we can feed these functions inputs from our dataset, using the respective column for each variable:

# Removing every row with an NA in either column so the summary statistics can be computed. Otherwise mean(), sd(), and cor() return NA when the columns contain missing values.
TuitionKeep = subset(tuition, select = c("Avg.SAT", "Out.Tuition"))
EntireTuitionNoNA = na.omit(TuitionKeep)
xBar = mean(EntireTuitionNoNA$Avg.SAT)
yBar = mean(EntireTuitionNoNA$Out.Tuition)
Sx = sd(EntireTuitionNoNA$Avg.SAT)
Sy = sd(EntireTuitionNoNA$Out.Tuition)
Rxy = cor(EntireTuitionNoNA$Avg.SAT, EntireTuitionNoNA$Out.Tuition)

b0 = beta0(Rxy, Sy, Sx, yBar, xBar)
b0
## [1] -9173.796
b1 = beta1(Rxy,Sy, Sx)
b1
## [1] 19.7977
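A quick sanity check is to compare the hand-computed coefficients with the ones lm() returns. The sketch below uses synthetic data, since the point is only that the two routes agree; the variables x and y and the numbers used to generate them are made up for illustration and are not taken from the tuition file.

```r
# Synthetic stand-ins for Avg.SAT (x) and Out.Tuition (y); values are invented.
set.seed(1)
x <- rnorm(200, mean = 1000, sd = 120)
y <- -9000 + 20 * x + rnorm(200, sd = 2500)

# Manual estimates from the summary statistics
b1_manual <- cor(x, y) * sd(y) / sd(x)
b0_manual <- mean(y) - b1_manual * mean(x)

# Estimates from R's built-in least-squares fit
fit <- lm(y ~ x)

# all.equal() compares within a numerical tolerance;
# unname() drops the coefficient labels before comparing.
coef_match <- isTRUE(all.equal(unname(coef(fit)), c(b0_manual, b1_manual)))
coef_match
```

The same comparison can be run on the real data by fitting lm(Out.Tuition ~ Avg.SAT, data = EntireTuitionNoNA) and comparing its coefficients with b0 and b1.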

Now that we have a y-intercept and a slope, we can add our regression line to the graph from before by using the abline() function in R. This line minimizes the sum of squared residuals for the data points.

plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "College Tuition Based on SAT Score", xlab = "SAT Score", ylab = "Tuition Price", pch = 20, cex = 1, col = "blue")

abline(b0, b1)
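To see what "minimizes the sum of squared residuals" means in practice, we can compare the fitted line with a couple of nearby lines. This is a small sketch on synthetic data; the helper sse() and the perturbation sizes are hypothetical choices made only for illustration.

```r
# Synthetic data; the generating numbers are invented.
set.seed(2)
x <- runif(100, min = 800, max = 1400)
y <- -9000 + 20 * x + rnorm(100, sd = 2500)

# Sum of squared residuals for the line y = b0 + b1 * x
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Least-squares slope and intercept from the summary statistics
b1 <- cor(x, y) * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)

sse_fit     <- sse(b0, b1)        # the fitted line
sse_steeper <- sse(b0, b1 + 0.5)  # same intercept, slightly steeper slope
sse_shifted <- sse(b0 + 500, b1)  # same slope, line shifted upward

c(fit = sse_fit, steeper = sse_steeper, shifted = sse_shifted)
```

Any line other than the least-squares line gives a larger sum, so the first entry is the smallest of the three.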

Writing another function that takes an explanatory variable X and a response variable Y, as well as a new value, x_new, at which we want to predict Y.

predict_yval <- function(X, Y, x_new){
  
  xBar = mean(X)
  yBar = mean(Y)
  Sx = sd(X)
  Sy = sd(Y)
  Rxy = cor(X,Y)
  
  if (Sx > 0){
    b0 = beta0(Rxy, Sy, Sx, yBar, xBar)
    b1 = beta1(Rxy, Sy, Sx)
    y_predict = b0 + b1 * x_new
  } else {
    y_predict = NA
    print("Sx is <= 0, can't calculate.")
  }
  
  
  return( y_predict )
}

Using the function just created, we can check whether UNC's tuition is low for a school with its average SAT score, a rough proxy for how much education it provides for the price:

CH = tuition[tuition$Name == "University of North Carolina at Chapel Hill",]
CHTuition = CH$Out.Tuition

CHTuitionPredict = predict_yval(EntireTuitionNoNA$Avg.SAT, EntireTuitionNoNA$Out.Tuition, CH$Avg.SAT)

#Is the actual UNC tuition cheaper than the predicted? 
CHTuitionPredict
## [1] 13019.43
CHTuition
## [1] 8400
CHTuition < CHTuitionPredict
## [1] TRUE
#UNC's actual tuition is roughly 4,600 below the model's prediction for its average SAT score, so by this measure it offers good value.
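A single predicted value hides how much scatter there is around the line. One way to gauge that, sketched here on synthetic data, is to ask predict() for a prediction interval from an lm() fit. The data frame schools and its columns sat and tuition_usd are hypothetical names, and the generating numbers are invented.

```r
# Synthetic data standing in for the cleaned SAT / tuition columns.
set.seed(3)
schools <- data.frame(sat = runif(150, min = 800, max = 1400))
schools$tuition_usd <- -9000 + 20 * schools$sat + rnorm(150, sd = 2500)

fit <- lm(tuition_usd ~ sat, data = schools)

# Predict at one new SAT value and request a 95% prediction interval.
# newdata must use the same column name as the model formula.
new_school <- data.frame(sat = 1121)
pred <- predict(fit, newdata = new_school, interval = "prediction", level = 0.95)
pred  # one row with columns fit, lwr, upr
```

If an observed tuition falls inside the lwr to upr range, the gap from the predicted value is within what the model's scatter would lead us to expect.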

We’ve seen how to manually create a linear regression. R also has the lm() function, which speeds up this process and lets us include other variables. Here, we fit a multiple linear regression model in which the Public variable is converted into a categorical variable using the factor() function.

Including these variables can help adjust our expectations: a college’s student population, spending per student, and average SAT score say something about the kinds of students who attend, and may hint at the kinds of people currently donating to the institution.

Mult <- lm(Out.Tuition ~ Size + Avg.SAT + Avg.ACT + Spending + factor(Public), data = tuition)
summary(Mult)
## 
## Call:
## lm(formula = Out.Tuition ~ Size + Avg.SAT + Avg.ACT + Spending + 
##     factor(Public), data = tuition)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9998.4 -1398.7    50.9  1355.7 10607.1 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -5.936e+03  1.002e+03  -5.921 6.21e-09 ***
## Size            -3.002e-03  2.912e-02  -0.103 0.917941    
## Avg.SAT          8.122e+00  2.210e+00   3.675 0.000265 ***
## Avg.ACT          1.193e+02  9.639e+01   1.238 0.216389    
## Spending         3.075e-01  2.747e-02  11.194  < 2e-16 ***
## factor(Public)2  3.380e+03  3.351e+02  10.087  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2288 on 466 degrees of freedom
##   (830 observations deleted due to missingness)
## Multiple R-squared:  0.6672, Adjusted R-squared:  0.6637 
## F-statistic: 186.9 on 5 and 466 DF,  p-value: < 2.2e-16
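Two details of this output are easier to read with a toy example: where the factor(Public)2 row comes from, and why observations are "deleted due to missingness". The sketch below uses a small made-up data frame, toy, whose values are invented; which code (1 or 2) means a public school is an assumption to check against the dataset's documentation.

```r
# Toy data: Public is coded 1/2, and one row has a missing SAT score.
toy <- data.frame(
  Out.Tuition = c(5000, 6000, 12000, 14000, 5500, 13000),
  Avg.SAT     = c(950, 1000, 1100, 1200, NA, 1150),
  Public      = c(1, 1, 2, 2, 1, 2)
)

# factor() treats the numeric code as categories. The first level (1) is the
# baseline, and level 2 gets its own 0/1 indicator column in the design matrix.
design <- model.matrix(~ Avg.SAT + factor(Public), data = toy)
colnames(design)

# lm() drops any row with an NA in a model variable by default,
# so only the complete rows are used in the fit.
fit <- lm(Out.Tuition ~ Avg.SAT + factor(Public), data = toy)
nobs(fit)
```

The coefficient on factor(Public)2 is therefore the estimated difference in tuition between schools coded 2 and schools coded 1, holding the other variables fixed.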