Get the data

We will try to fit a linear regression that models college tuition, using a dataset from US News and World Report. The data are from 1995, so they are quite outdated; still, we will see what we can learn from them.


Question 1:

  1. The dataset is located at http://kbodwin.web.unc.edu/files/2016/09/tuition_final.csv; figure out how to use the code you were given last time for read.csv( ) and read.table( ) to read the data into R and call it tuition. Use the functions we learned last time to familiarize yourself with the data in tuition.
# Read Data
tuition = read.csv('http://kbodwin.web.unc.edu/files/2016/09/tuition_final.csv')
summary(tuition)
##        ID                       Name          State         Public     
##  Min.   : 1002   Bethel College   :   4   NY     :101   Min.   :1.000  
##  1st Qu.: 1874   Concordia College:   4   PA     : 83   1st Qu.:1.000  
##  Median : 2650   Trinity College  :   4   CA     : 70   Median :2.000  
##  Mean   : 3126   Columbia College :   3   TX     : 60   Mean   :1.639  
##  3rd Qu.: 3431   Union College    :   3   MA     : 56   3rd Qu.:2.000  
##  Max.   :30431   Augustana College:   2   OH     : 52   Max.   :2.000  
##                  (Other)          :1282   (Other):880                  
##     Avg.SAT          Avg.ACT         Applied           Accepted      
##  Min.   : 600.0   Min.   :11.00   Min.   :   35.0   Min.   :   35.0  
##  1st Qu.: 884.5   1st Qu.:20.25   1st Qu.:  695.8   1st Qu.:  554.5  
##  Median : 957.0   Median :22.00   Median : 1470.0   Median : 1095.0  
##  Mean   : 968.0   Mean   :22.12   Mean   : 2752.1   Mean   : 1870.7  
##  3rd Qu.:1038.0   3rd Qu.:24.00   3rd Qu.: 3314.2   3rd Qu.: 2303.0  
##  Max.   :1410.0   Max.   :31.00   Max.   :48094.0   Max.   :26330.0  
##  NA's   :523      NA's   :588     NA's   :10        NA's   :11       
##       Size        Out.Tuition       Spending    
##  Min.   :   59   Min.   : 1044   Min.   : 1834  
##  1st Qu.:  966   1st Qu.: 6111   1st Qu.: 6116  
##  Median : 1812   Median : 8670   Median : 7729  
##  Mean   : 3693   Mean   : 9277   Mean   : 8988  
##  3rd Qu.: 4540   3rd Qu.:11659   3rd Qu.:10054  
##  Max.   :31643   Max.   :25750   Max.   :62469  
##  NA's   :3       NA's   :20      NA's   :39
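
Besides summary( ), a few other base R functions give a quick look at a data frame. The sketch below uses a tiny made-up CSV (the toy_csv text and its values are invented for illustration and are not taken from the tuition file), so it runs without a network connection.

```r
# Toy CSV text standing in for the real file (all values are made up)
toy_csv <- "Name,State,Applied,Accepted
College A,NC,1000,400
College B,NY,500,
College C,CA,2000,1500"

# read.csv( ) can read from a file path, a URL, or inline text
toy <- read.csv(text = toy_csv)

dim(toy)              # number of rows and columns
str(toy)              # type of each column
head(toy, 2)          # first two rows
colSums(is.na(toy))   # count of missing values per column
```

The blank Accepted field for College B is read as NA, which is the same kind of missing value that summary( ) reports for the tuition data.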
  1. Make a new variable in tuition called Acc.Rate that contains the acceptance rate for each university. You may find the variables “Accepted” and “Applied” useful.
# Acceptance Rate
tuition$Acc.Rate = tuition$Accepted/tuition$Applied
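
Because R arithmetic is vectorized, the division above is carried out row by row, and any school with a missing Applied or Accepted value gets NA. A small sketch with invented numbers (applied, accepted, and acc_rate are illustrative names, not columns of the real data):

```r
# Made-up application counts; NA marks a missing value
applied  <- c(1000, 500, NA)
accepted <- c(400, NA, 150)

# Element-wise division: a missing value in either vector gives NA
acc_rate <- accepted / applied
acc_rate

# summary( ) reports the NA count alongside the usual statistics
summary(acc_rate)
```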
  1. Find which line corresponds to UNC (“University of North Carolina at Chapel Hill”).
tuition[tuition$Name == "University of North Carolina at Chapel Hill", ]
##       ID                                        Name State Public Avg.SAT
## 682 2974 University of North Carolina at Chapel Hill    NC      1    1121
##     Avg.ACT Applied Accepted  Size Out.Tuition Spending  Acc.Rate
## 682      NA   14596     5985 14609        8400    15893 0.4100438
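
If only the row number is needed, which( ) returns the positions where a condition is TRUE. A minimal sketch on a made-up data frame (schools and its values are hypothetical):

```r
# Toy data frame with invented schools and tuitions
schools <- data.frame(
  Name = c("College A", "College B", "College C"),
  Out.Tuition = c(7000, 8400, 12000)
)

# which( ) gives the index of the matching row
row_b <- which(schools$Name == "College B")
row_b

# The index can then be used to pull out the full row
schools[row_b, ]
```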

Writing functions

We have seen many examples of using functions in R, like summary( ) or t.test( ). Now you will learn how to write your own functions. Defining a function means writing code that looks something like this:

my_function <- function(VAR_1, VAR_2){
  
  # do some stuff with VAR_1 and VAR_2
  return(result)
  
}

Then you run the code in R to “teach” it how your function works, and after that, you can use it like you would any other pre-existing function. For example, try out the following:

add1 <- function(a, b){
  
  # add the variables
  c = a + b
  return(c)
  
}

add2 <- function(a, b = 3){
  
  # add the variables
  c = a + b
  return(c)
  
}

# Try adding 5 and 7
add1(5, 7)
## [1] 12
add2(5, 7)
## [1] 12
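# Try passing only one argument: add1( ) fails because b has no default,
# and try( ) lets the script keep running after the error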
try(add1(5))
add2(5)
## [1] 8

Question 2:

What was the effect of b = 3 in the definition of add2( )?

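Setting b = 3 gives b a default value: if b is not specified, it defaults to 3. That is why add2(5) returns 8, while add1(5) fails because its b has no default.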
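
A short sketch of how default and named arguments interact, reusing the add2( ) definition from above in a compact form:

```r
# Same behavior as add2( ) above: b defaults to 3
add2 <- function(a, b = 3) {
  a + b
}

add2(5)             # b falls back to its default
add2(5, b = 10)     # an explicit value overrides the default
add2(b = 1, a = 2)  # named arguments can be given in any order
```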

Question 3:

  1. Recall that the equations for simple linear regression are: \[\beta_1 = r \frac{S_Y}{S_X} \hspace{0.5cm} \beta_0 = \bar{Y} - \beta_1 \bar{X}\]

Write your own functions, called beta1( ) and beta0( ) that take as input some combination of Sx, Sy, r, y_bar, and x_bar, and use that to calculate \(\beta_1\) and \(\beta_0\).
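
For reference, a short derivation of the slope formula, assuming the usual least-squares estimate and writing \(S_{XY}\) for the sample covariance (a symbol not used elsewhere in this lab):

\[\beta_1 = \frac{\sum_i (x_i - \bar{X})(y_i - \bar{Y})}{\sum_i (x_i - \bar{X})^2} = \frac{S_{XY}}{S_X^2} = \frac{r \, S_X S_Y}{S_X^2} = r \frac{S_Y}{S_X}\]

The third step uses the definition \(r = S_{XY} / (S_X S_Y)\).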

beta1 <- function(r, Sx, Sy){
  
  # The slope is undefined when X has no spread (Sx = 0),
  # so return NA instead of dividing by zero
  if(Sx > 0){
    b1 = r*Sy/Sx
  }else{
    b1 = NA
  }

  return(b1)

}

beta0 <- function(x_bar, y_bar, r, Sx, Sy){
  
  b1 = beta1(r, Sx, Sy)
  b0 = y_bar - b1*x_bar
  
  return(b0)
  
}
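
One way to check the two functions is to compare them with R's built-in lm( ) on data with no missing values. The sketch below repeats the functions in compact form so it runs on its own; the x and y values are made up.

```r
# Compact copies of the functions above so this block is self-contained
beta1 <- function(r, Sx, Sy) if (Sx > 0) r * Sy / Sx else NA
beta0 <- function(x_bar, y_bar, r, Sx, Sy) y_bar - beta1(r, Sx, Sy) * x_bar

# Made-up data with no missing values
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Slope and intercept from the summary statistics
b1 <- beta1(cor(x, y), sd(x), sd(y))
b0 <- beta0(mean(x), mean(y), cor(x, y), sd(x), sd(y))

# Least-squares fit from lm( ) for comparison
fit <- lm(y ~ x)

c(b0, b1)
coef(fit)
```

The two printed pairs should agree up to rounding, since both compute the same least-squares line.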
  1. Try your function with Sx = 0. Did it work? If not, fix your function code. Explain why it would be a problem to do linear regression with \(S_X = 0\).

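    It runs only because of the Sx > 0 check. Without it the formula divides by zero, which in R does not raise an error but returns Inf (or NaN), so the function returns NA instead. Sx = 0 means that all X-values are the same, so the data contain no information about how Y changes with X and no slope can be estimated.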
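
A small sketch of what R does in this situation. The function unguarded_beta1( ) is a hypothetical version without the Sx > 0 check, and the data values are invented.

```r
# Hypothetical slope function with no guard against Sx = 0
unguarded_beta1 <- function(r, Sx, Sy) r * Sy / Sx

# Division by zero is not an error in R; the result is infinite
slope_bad <- unguarded_beta1(0.5, 0, 2)
slope_bad

# A constant X has zero standard deviation
x_const <- rep(1000, 4)
y <- c(5000, 7000, 9000, 11000)
sd(x_const)

# cor( ) warns that the standard deviation is zero and returns NA
r_const <- suppressWarnings(cor(x_const, y))
r_const
```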

Linear Regression by hand

Use the code below to make a scatterplot of college tuition versus average SAT score.

plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "title", xlab = "label", ylab = "label", pch = 7, cex = 2, col = "blue")


Question 4:

  1. Make your own scatterplot, but change the input of plot( ) so that it looks nice.
plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "Tuition versus average SAT score for U.S. Colleges (1995)", xlab = "Average SAT score of students", ylab = "Out of state tuition", pch = 19, cex = 1)

  1. What do pch and cex do?

    pch = "plotting character"; it sets the symbol used to draw each point
    cex = "character expansion"; it scales the size of the points relative to the default of 1
  2. We have used the function abline( ) to add a vertical line or a horizontal line to a graph. However, it can also add lines by slope and intercept. Read the documentation of abline( ) until you understand how to do this. Then add a line with slope 10 and intercept 0 to your plot.

plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "Tuition versus average SAT score for U.S. Colleges (1995)", xlab = "Average SAT score of students", ylab = "Out of state tuition", pch = 19, cex = 1)

abline(0, 10, lwd = 2, col = "blue")

  1. Does this line seem to fit the data well?

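    Close, but not a good fit: the line is too shallow to follow the trend in the points (the fitted slope found in Question 5 is about 20, twice as steep).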

Question 5:

  1. Use the functions you already know in R and the ones you created, beta1( ) and beta0( ), to find the slope and intercept for a regression line of Out.Tuition on Avg.SAT (that is, predicting tuition from average SAT score). Remake your scatterplot, and add the regression line.

(Hint: You may have some trouble finding the mean and sd because there is some missing data. Look at the documentation for the functions you use. What could we add to the function arguments to ignore values of NA?)

head(tuition)
##      ID                              Name State Public Avg.SAT Avg.ACT
## 1  1061         Alaska Pacific University    AK      2     972      20
## 2  1063 University of Alaska at Fairbanks    AK      1     961      22
## 3  1065    University of Alaska Southeast    AK      1      NA      NA
## 4 11462 University of Alaska at Anchorage    AK      1     881      20
## 5  1002       Alabama Agri. & Mech. Univ.    AL      1      NA      17
## 6  1003               Faulkner University    AL      2      NA      20
##   Applied Accepted Size Out.Tuition Spending  Acc.Rate
## 1     193      146  249        7560    10922 0.7564767
## 2    1852     1427 3885        5226    11935 0.7705184
## 3     146      117  492        5226     9584 0.8013699
## 4    2065     1598 6209        5226     8046 0.7738499
## 5    2817     1920 3958        3400     7043 0.6815761
## 6     345      320 1367        5600     3971 0.9275362
ybar = mean(tuition$Out.Tuition, na.rm = TRUE)
xbar = mean(tuition$Avg.SAT, na.rm = TRUE)

Sy = sd(tuition$Out.Tuition, na.rm = TRUE)
Sx = sd(tuition$Avg.SAT, na.rm = TRUE)

r = cor(tuition$Out.Tuition, tuition$Avg.SAT, use = "complete.obs")

b1 = beta1(r, Sx, Sy)
b0 = beta0(xbar, ybar, r, Sx, Sy)

plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "Tuition versus average SAT score for U.S. Colleges (1995)", xlab = "Average SAT score of students", ylab = "Out of state tuition", pch = 19, cex = 1)

abline(b0, b1, lwd = 2, col = "blue")
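
One caveat with the by-hand approach: calling mean( ) and sd( ) with na.rm = TRUE drops missing values from each variable separately, while cor( ) with use = "complete.obs" keeps only rows where both are observed, so the pieces can be computed from different sets of schools. The sketch below, on made-up data, restricts everything to complete rows and compares the result with lm( ), which drops incomplete rows under R's default settings.

```r
# Made-up data with missing values in different rows
x <- c(900, 1000, 1100, NA, 1200, 950)
y <- c(6000, 9000, 11000, 8000, NA, 7000)

# Separate na.rm: each mean is based on a different set of rows
mean(x, na.rm = TRUE)
mean(y, na.rm = TRUE)

# Complete cases: keep only rows where both x and y are observed
ok <- complete.cases(x, y)
xc <- x[ok]
yc <- y[ok]

# Slope and intercept from the complete rows only
b1 <- cor(xc, yc) * sd(yc) / sd(xc)
b0 <- mean(yc) - b1 * mean(xc)

# lm( ) drops incomplete rows under the default na.action
fit <- lm(y ~ x)

c(b0, b1)
coef(fit)
```

With the complete-case version the two printed pairs should agree; with separately computed means and standard deviations they can differ slightly.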

  1. What do you conclude about the relationship between average SAT score and a college’s tuition?

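    The relationship is positive: on average, out-of-state tuition is about $20 higher for each additional point of average SAT score. This describes an association in the 1995 data, not a causal effect.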

Question 6:

  1. Write a new function called predict_yval(X, Y, x_new) that takes as input a vector of explanatory values (X), a vector of response values (Y), and a new x-value at which we want to predict (x_new). The output of the function should be the predicted y-value for x_new from a regression line. (Hint: You can use functions inside functions.)
predict_yval <- function(X, Y, x_new){
  
  ybar = mean(Y, na.rm = TRUE)
  xbar = mean(X, na.rm = TRUE)

  Sy = sd(Y, na.rm = TRUE)
  Sx = sd(X, na.rm = TRUE)

  r = cor(X, Y, use = "complete.obs")

  b1 = beta1(r, Sx, Sy)
  b0 = beta0(xbar, ybar, r, Sx, Sy)

  
  pred_y = b0 + b1*x_new
  
  return(pred_y)
  
}
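
For comparison, base R offers the same kind of prediction through lm( ) and predict( ). This is a sketch on made-up data, not part of the assignment; the names fit, pred_lm, and pred_hand are illustrative.

```r
# Made-up data with no missing values
x <- c(900, 1000, 1100, 1200)
y <- c(6000, 8500, 10000, 12500)

# Fit the regression line and predict at a new x-value
fit <- lm(y ~ x)
pred_lm <- predict(fit, newdata = data.frame(x = 1050))

# The same prediction from the fitted intercept and slope
b <- coef(fit)
pred_hand <- b[1] + b[2] * 1050

pred_lm
pred_hand
```

Note that newdata must be a data frame whose column name matches the explanatory variable used in the formula.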
  1. Now find the average SAT score and tuition of UNC and of Duke, and compare their predicted values to the truth:
# Find UNC values
x_unc = tuition$Avg.SAT[tuition$Name == "University of North Carolina at Chapel Hill"]
y_unc = tuition$Out.Tuition[tuition$Name == "University of North Carolina at Chapel Hill"]

# Find Duke values
x_duke = tuition$Avg.SAT[tuition$Name == "Duke University"]
y_duke = tuition$Out.Tuition[tuition$Name == "Duke University"]


# Predict tuitions vs real
predict_yval(tuition$Avg.SAT, tuition$Out.Tuition, x_unc)
## [1] 12410.22
y_unc
## [1] 8400
predict_yval(tuition$Avg.SAT, tuition$Out.Tuition, x_duke)
## [1] 16116.43
y_duke
## [1] 18590
  1. Would you say you are getting a deal at UNC? How about at Duke?

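    By this model, yes for UNC: its actual tuition ($8,400) is about $4,010 below the predicted $12,410. Duke is not a deal: its actual tuition ($18,590) is about $2,474 above the predicted $16,116.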
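
The comparison can be expressed as residuals (observed minus predicted). The sketch below simply reuses the four numbers printed above.

```r
# Observed and predicted tuition, copied from the output above
observed  <- c(UNC = 8400, Duke = 18590)
predicted <- c(UNC = 12410.22, Duke = 16116.43)

# Negative residual: cheaper than predicted; positive: more expensive
residual <- observed - predicted
residual
```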