We will be attempting to find a linear regression that models college tuition rates, based on a dataset from US News and World Report. Alas, this data is from 1995, so it is very outdated; still, we will see what we can learn from it.
http://kbodwin.web.unc.edu/files/2016/09/tuition_final.csv; figure out how to use the code you were given last time for read.csv( ) and read.table( ) to read the data into R and call it tuition. Use the functions we learned last time to familiarize yourself with the data in tuition.# Read Data
tuition = read.csv('http://kbodwin.web.unc.edu/files/2016/09/tuition_final.csv')
summary(tuition)
## ID Name State Public
## Min. : 1002 Bethel College : 4 NY :101 Min. :1.000
## 1st Qu.: 1874 Concordia College: 4 PA : 83 1st Qu.:1.000
## Median : 2650 Trinity College : 4 CA : 70 Median :2.000
## Mean : 3126 Columbia College : 3 TX : 60 Mean :1.639
## 3rd Qu.: 3431 Union College : 3 MA : 56 3rd Qu.:2.000
## Max. :30431 Augustana College: 2 OH : 52 Max. :2.000
## (Other) :1282 (Other):880
## Avg.SAT Avg.ACT Applied Accepted
## Min. : 600.0 Min. :11.00 Min. : 35.0 Min. : 35.0
## 1st Qu.: 884.5 1st Qu.:20.25 1st Qu.: 695.8 1st Qu.: 554.5
## Median : 957.0 Median :22.00 Median : 1470.0 Median : 1095.0
## Mean : 968.0 Mean :22.12 Mean : 2752.1 Mean : 1870.7
## 3rd Qu.:1038.0 3rd Qu.:24.00 3rd Qu.: 3314.2 3rd Qu.: 2303.0
## Max. :1410.0 Max. :31.00 Max. :48094.0 Max. :26330.0
## NA's :523 NA's :588 NA's :10 NA's :11
## Size Out.Tuition Spending
## Min. : 59 Min. : 1044 Min. : 1834
## 1st Qu.: 966 1st Qu.: 6111 1st Qu.: 6116
## Median : 1812 Median : 8670 Median : 7729
## Mean : 3693 Mean : 9277 Mean : 8988
## 3rd Qu.: 4540 3rd Qu.:11659 3rd Qu.:10054
## Max. :31643 Max. :25750 Max. :62469
## NA's :3 NA's :20 NA's :39
tuition called Acc.Rate that contains the acceptance rate for each university.You may find the variables “Accepted” and “Applied” useful.# Acceptance Rate
tuition$Acc.Rate = tuition$Accepted/tuition$Applied
tuition[tuition$Name == "University of North Carolina at Chapel Hill", ]
## ID Name State Public Avg.SAT
## 682 2974 University of North Carolina at Chapel Hill NC 1 1121
## Avg.ACT Applied Accepted Size Out.Tuition Spending Acc.Rate
## 682 NA 14596 5985 14609 8400 15893 0.4100438
We have seen many examples of using functions in R, like summary( ) or t.test( ). Now you will learn how to write your own functions. Defining a function means writing code that looks something like this:
my_function <- function(VAR_1, VAR_2){
# do some stuff with VAR_1 and VAR_2
return(result)
}
Then you run the code in R to “teach” it how your function works, and after that, you can use it like you would any other pre-existing function. For example, try out the following:
add1 <- function(a, b){
# add the variables
c = a + b
return(c)
}
add2 <- function(a, b = 3){
# add the variables
c = a + b
return(c)
}
# Try adding 5 and 7
add1(5, 7)
## [1] 12
add2(5, 7)
## [1] 12
# Try adding one variable
try(add1(5))
add2(5)
## [1] 8
What was the effect of b = 3 in the definition of add2( )?
the value for 'b' defaults to 3 if not specified
Write your own functions, called beta1( ) and beta0( ) that take as input some combination of Sx, Sy, r, y_bar, and x_bar, and use that to calculate \(\beta_1\) and \(\beta_0\).
beta1 <- function(r, Sx, Sy){
if(Sx > 0){
b1 = r*Sy/Sx
}else{
b1 = NA
}
return(b1)
}
beta0 <- function(x_bar, y_bar, r, Sx, Sy){
b1 = beta1(r, Sx, Sy)
b0 = y_bar - b1*x_bar
return(b0)
}
Try your function with Sx = 0. Did it work? If not, fix your function code. Explain why it would be a problem to do linear regression with \(S_X = 0\).
Divide by 0 error. Sx = 0 suggests that all X-values are the same, so how can we possibly try to predict anything? We have no information.Use the code below to make a scatterplot of college tuition versus average SAT score.
plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "title", xlab = "label", ylab = "label", pch = 7, cex = 2, col = "blue")
plot( ) so that it looks nice.plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "Tuition versus average SAT score for U.S. Colleges (1995)", xlab = "Average SAT score of students", ylab = "Out of state tuition", pch = 19, cex = 1)
What do pch and cex do?
pch = "plotting character", changes shape of points on plot
cex = relative size of points on plotWe have used the function abline( ) to add a vertical line or a horizontal line to a graph. However, it can also add lines by slope and intercept. Read the documentation of abline( ) until you understand how to do this. Then add a line with slope 10 and intercept 0 to your plot.
plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "Tuition versus average SAT score for U.S. Colleges (1995)", xlab = "Average SAT score of students", ylab = "Out of state tuition", pch = 19, cex = 1)
abline(0, 10, lwd = 2, col = "blue")
Does this line seem to fit the data well?
Close - but not really a fitbeta1( ) and beta0( ), to find the slope and intercept for a regression line of Avg.SAT on Out.Tuition. Remake your scatterplot, and add the regression line.(Hint: You may have some trouble finding the mean and sd because there is some missing data. Look at the documentation for the functions you use. What could we add to the function arguments to ignore values of NA?)
head(tuition)
## ID Name State Public Avg.SAT Avg.ACT
## 1 1061 Alaska Pacific University AK 2 972 20
## 2 1063 University of Alaska at Fairbanks AK 1 961 22
## 3 1065 University of Alaska Southeast AK 1 NA NA
## 4 11462 University of Alaska at Anchorage AK 1 881 20
## 5 1002 Alabama Agri. & Mech. Univ. AL 1 NA 17
## 6 1003 Faulkner University AL 2 NA 20
## Applied Accepted Size Out.Tuition Spending Acc.Rate
## 1 193 146 249 7560 10922 0.7564767
## 2 1852 1427 3885 5226 11935 0.7705184
## 3 146 117 492 5226 9584 0.8013699
## 4 2065 1598 6209 5226 8046 0.7738499
## 5 2817 1920 3958 3400 7043 0.6815761
## 6 345 320 1367 5600 3971 0.9275362
ybar = mean(tuition$Out.Tuition, na.rm = TRUE)
xbar = mean(tuition$Avg.SAT, na.rm = TRUE)
Sy = sd(tuition$Out.Tuition, na.rm = TRUE)
Sx = sd(tuition$Avg.SAT, na.rm = TRUE)
r = cor(tuition$Out.Tuition, tuition$Avg.SAT, use = "complete.obs")
b1 = beta1(r, Sx, Sy)
b0 = beta0(xbar, ybar, r, Sx, Sy)
plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "Tuition versus average SAT score for U.S. Colleges (1995)", xlab = "Average SAT score of students", ylab = "Out of state tuition", pch = 19, cex = 1)
abline(b0, b1, lwd = 2, col = "blue")
What do you conclude about the relationship between average SAT score and a college’s tuition?
It seems like Tuition increases by $20 for every point higher on the SAT.
predict_yval(X, Y, x_new) that takes as input a vector of explanatory variables (X), a vector of y-variables (Y), and a new x-value that we want to predict (x_new). The output of the function should be the predicted y-value for x_new from a regression line. (Hint: You can use functions inside functions.)predict_yval <- function(X, Y, x_new){
ybar = mean(Y, na.rm = TRUE)
xbar = mean(X, na.rm = TRUE)
Sy = sd(Y, na.rm = TRUE)
Sx = sd(X, na.rm = TRUE)
r = cor(X, Y, use = "complete.obs")
b1 = beta1(r, Sx, Sy)
b0 = beta0(xbar, ybar, r, Sx, Sy)
pred_y = b0 + b1*x_new
return(pred_y)
}
X = c(0,1,2,3)
Y = c(1,2,3,4)
predict_yval(X,Y, 3)
## [1] 4
# Find UNC values
x_unc = tuition$Avg.SAT[tuition$Name == "University of North Carolina at Chapel Hill"]
y_unc = tuition$Out.Tuition[tuition$Name == "University of North Carolina at Chapel Hill"]
# Find Duke values
x_duke = tuition$Avg.SAT[tuition$Name == "Duke University"]
y_duke = tuition$Out.Tuition[tuition$Name == "Duke University"]
# Predict tuitions vs real
predict_yval(tuition$Avg.SAT, tuition$Out.Tuition, x_unc)
## [1] 12410.22
y_unc
## [1] 8400
predict_yval(tuition$Avg.SAT, tuition$Out.Tuition, x_duke)
## [1] 16116.43
y_duke
## [1] 18590
Would you say you are getting a deal at UNC? How about at Duke?
UNC is less expensive than the model predicts, while Duke is more expensive.
lm() and diagnosticsYou now have functions to calculate the slope and intercept of a linear regression, and to predict values. As you might expect, R was already able to do this, using the function lm( ). In class, you saw how to read the output of lm( ). Run the following regression of Avg.SAT on Out.Tuition, and refamiliarize yourself with the output.
# Make linear model
my_lm = lm(Out.Tuition ~ Avg.SAT, data = tuition)
summary(my_lm)
##
## Call:
## lm(formula = Out.Tuition ~ Avg.SAT, data = tuition)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9809.9 -2234.6 204.6 2309.3 12336.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9173.7964 914.3482 -10.03 <2e-16 ***
## Avg.SAT 19.7977 0.9379 21.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3183 on 765 degrees of freedom
## (535 observations deleted due to missingness)
## Multiple R-squared: 0.3681, Adjusted R-squared: 0.3673
## F-statistic: 445.6 on 1 and 765 DF, p-value: < 2.2e-16
Check out names(my_lm). This will give you a list of information we can access using $. For example, compare my_lm$coefficents to your beta1 and beta0 outputs.
The output of lm( ) is made to play nicely with other functions in R. For example, try adding abline(my_lm) to your scatterplot. We can also use lm( ) to check some common diagnostics, to see if the linear model is a good fit for the data. Try plot(my_lm), and familiarize yourself with the first three plots that are automatically generated. (The fourth is not covered in this course, so you do not need to worry about it for now.)
Spending contains the expenditure of the school per student. Suppose we want to make a regression that predicts tuition cost from the expenditure per student. Make a linear model for Spending versus Out.Tuition. Comment on the summary of the model, and plot it on an appropriate scatter plot. Does the model seem to be a good fit for this data? my_lm_2 = lm(Out.Tuition ~ Spending, data = tuition)
summary(my_lm_2)
##
## Call:
## lm(formula = Out.Tuition ~ Spending, data = tuition)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22023.7 -2205.4 -99.9 2020.6 11804.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.477e+03 1.792e+02 24.98 <2e-16 ***
## Spending 5.481e-01 1.752e-02 31.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3126 on 1242 degrees of freedom
## (58 observations deleted due to missingness)
## Multiple R-squared: 0.4409, Adjusted R-squared: 0.4404
## F-statistic: 979.3 on 1 and 1242 DF, p-value: < 2.2e-16
plot(tuition$Spending, tuition$Out.Tuition, main = "Tuition versus per-student spending for U.S. Colleges (1995)", xlab = "Spending per Student", ylab = "Out of state tuition", pch = 19, cex = 1)
abline(my_lm_2, lwd = 2, col = "blue")
(Interpret R-squared and intercept) Doesn't quite fit perfectly.
Spending. Do you notice any issues? Hint: Use your own function predict_yval() or the built-in function predict(my_lm). You will need to think about the problem of missing data (NAs).# Find nas
bad = is.na(tuition$Spending) | is.na(tuition$Out.Tuition)
# Make predicted values
y_pred = predict(my_lm_2)
# Plot residuals
plot(tuition$Spending[!bad], tuition$Out.Tuition[!bad] - y_pred)
There are clearly some major outliers and a discinct pattern.
plot() to look at the automatic diagnostics. What is each plot showing? What seems to be going wrong here? Which schools are marked as outliers?# Plot diagnostics
plot(my_lm_2)
Influential outliers. The biggest outlier schools are: CalTech, Johns Hopkins, Wake Forest.
my_lm_3 = lm(Out.Tuition ~ Spending, data = tuition[tuition$Spending < 20000,])
summary(my_lm_3)
##
## Call:
## lm(formula = Out.Tuition ~ Spending, data = tuition[tuition$Spending <
## 20000, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -12213.5 -1662.0 63.2 1774.8 8803.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1419.7058 212.6406 6.677 3.73e-11 ***
## Spending 0.9316 0.0241 38.663 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2634 on 1203 degrees of freedom
## (53 observations deleted due to missingness)
## Multiple R-squared: 0.5541, Adjusted R-squared: 0.5537
## F-statistic: 1495 on 1 and 1203 DF, p-value: < 2.2e-16
plot(tuition$Spending, tuition$Out.Tuition, main = "Tuition versus Spending for U.S. Colleges (1995)", xlab = "Per-Student Spending", ylab = "Out of state tuition", pch = 19, cex = 1)
abline(my_lm_3, lwd = 2, col = "blue")