College Enrollment and Median Family Income

Introduction

It’s no surprise that going to college is expensive. I live in South Carolina, where the median family income is about $44,929 (2014 Census). If I were to send my child to the in-state university Clemson, it would cost me $14,708 a year. That’s 33% percent of my income. If I lived in North Carolina, where median income is $46,784, Clemson’s out-of-state costs would be $34,590 a year. That’s 74% percent of my income. This can easily explain the increasing amount of student loan debt that many students have. The question I want to answer is, “Does the percentage of the family income that has to be used towards tuition determine if a high school graduate directly enrolls to college after graduation?” (This is assuming no scholarships, grants, or government aid was received.)

United States Census Bureau

Clemson

Data

Data from the median family income was collected by the United States Census Bureau. Data about the college-going rates of high school graduates were collected from The National Center for Higher Education Management Systems(NCHEMS). They collected from the sources of the National Center for Education Statistics, the Integrated Postsecondary Education Data System Enrollment Survey, and the Western Interstate Commission for Higher Education. Data about the average tuition fees by state was collected from CollegeBoard. They collected from sources College Board, Annual Survey of Colleges; NCES, IPEDS Fall 2015 Enrollment data.

Each source was downloaded as a .csv file or tab-delimited text file.

Each case represents a state for that year. These data sets contain all 50 states over 6 different years, totaling 300 observations. The response variable is the college-going rate of high school graduates and is numerical. The explanatory variable is the percentage of tuition that goes towards tuition and is numerical.

This type of study was observational based on the fact that none of the variables were changed and compared to a control variable to observe the response.

The scope of inference involves high school graduates that are directly enrolling in college in the United States. Since each year for each state includes many thousands of students, this sample can be a good generalization of the population. It needs to be noted that the data from NCES about college going rates doesn’t include where the student went to college. It could be in-state or out-of-state. This data should establish casaul links between the number of HS graduates enrolling in college and the average median family income. We could expect that the more money that is made by the family the more likely a HS grad would be encourage to enroll in college.

A problem that may affect the outcome of the analysis was that the data that was collected for college enrollment percentage didn’t specify whether the student enrolled to a college in-state or out-of-state. Since the percentage of income is based on the in-state tuition, this hidden figure might have an unknown effect.

Data Import and Tidy

There was a separate text file for each year of the college enrollment rates. These files were tab-delimited. The median family income by state throughout the years was in a single csv file as well as the average tuition cost per state. These files are stored on GitHub where they can be imported.

#Remove scientific notation for the plots
options(scipen=1000000)

#Data files are stored on my GitHub for the following URLs. 
#I had to pull them by separate years for every year they had available for analysis

#-----------------Read in data---------------------------------
college_rates_2004 <- read.csv("https://raw.githubusercontent.com/smithchad17/Class606/master/Project/HS%20to%20College%20rates/2004.txt", header = T, stringsAsFactors = F, sep = "\t", row.names = NULL)

college_rates_2006 <- read.csv("https://raw.githubusercontent.com/smithchad17/Class606/master/Project/HS%20to%20College%20rates/2006.txt", header = T, stringsAsFactors = F, sep = "\t", row.names = NULL)

college_rates_2008 <- read.csv("https://raw.githubusercontent.com/smithchad17/Class606/master/Project/HS%20to%20College%20rates/2008.txt", header = T, stringsAsFactors = F, sep = "\t", row.names = NULL)

college_rates_2010 <- read.csv("https://raw.githubusercontent.com/smithchad17/Class606/master/Project/HS%20to%20College%20rates/2010.txt", header = T, stringsAsFactors = F, sep = "\t", row.names = NULL)

college_rates_2012 <- read.csv("https://raw.githubusercontent.com/smithchad17/Class606/master/Project/HS%20to%20College%20rates/2012.txt", header = T, stringsAsFactors = F, sep = "\t", row.names = NULL)

college_rates_2014 <- read.csv("https://raw.githubusercontent.com/smithchad17/Class606/master/Project/HS%20to%20College%20rates/2014.txt", header = T, stringsAsFactors = F, sep = "\t", row.names = NULL)

median_income_by_state <- read.csv("https://raw.githubusercontent.com/smithchad17/Class606/master/Project/h08.xls%20-%20h08.csv", stringsAsFactors = F)

avg_tuition_cost <- read.csv("https://raw.githubusercontent.com/smithchad17/Class606/master/Project/Average%20Tuition%20fees.csv", header = T, stringsAsFactors = F)

head(college_rates_2014)

##    row.names
## 1    Alabama
## 2     Alaska
## 3    Arizona
## 4   Arkansas
## 5 California
## 6   Colorado
##   Percent.of.High.School.Graduates.Going.Directly.to.College....
## 1                                                           62.1
## 2                                                           44.0
## 3                                                           52.3
## 4                                                           63.5
## 5                                                           60.9
## 6                                                           58.2
##   High.School.Graduates...2014
## 1                        49242
## 2                         7964
## 3                        68060
## 4                        30947
## 5                       449202
## 6                        54882
##   First.Time.Freshmen.Directly.from.High.School.Enrolled.Anywhere.in.the.US...Fall.2014
## 1                                                                                 30583
## 2                                                                                  3507
## 3                                                                                 35598
## 4                                                                                 19644
## 5                                                                                273635
## 6                                                                                 31939
##    X
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA

Data for the college enrollment rates were only collected on even years between 2004 and 2014. To make it easier to merge and group later, the odd numbered years were removed from the other data frames. These college enrollment rate data frames were put into a list where they could be looped through to simplify the column names and remove columns without any data. Once they were renamed the data frames were joined together into one data frame.

The data was transformed from a wide format to a long format where the years were gathered into a single column called ‘year’.

Commas were removed from the columns with currency and column types with number values were converted from string to numeric.

#-------TIDY AVG_TUITION_COST DATAFRAME----------------------------

#Clean up column names
colnames(avg_tuition_cost) <- c("State", "2004", "2006", "2008", "2010", "2012", "2014")

#Convert from wide to long format
avg_tuition_cost <- gather(avg_tuition_cost, year, avg_tuition, "2004":"2014")


#-------TIDY COLLEGE_RATES DATAFRAME----------------------------


#Put data frames in a list so we can loop through them and tidy the data sets.
college_rates <- list(college_rates_2004, college_rates_2006, college_rates_2008, college_rates_2010, college_rates_2012, college_rates_2014)

#Since it was tab delimited, there were some left-over headers that started with 'X' to remove.
#Some data sets have a column with changes from the previous year that started with 'C'.
#Added a 'year' column so we could merge with other data sets later.
clean_columns <- function(x){
  j <- 1
  year <- 2004
  mylist <- list()
  for(i in x){
      i <- i %>%
        select(-starts_with("X")) %>%
        select(-starts_with("C")) 
      i["year"] <- year
      colnames(i) <- c("State", "grads_to_college_percentage", "hs_grads", "grads_enrolled_in_college", "year")
      mylist[[j]] <- i
      j <- j + 1
      year <- year + 2
  }
  return(mylist)
}

college_rates <- clean_columns(college_rates)

#Join data frames vertically for a long format
num <- length(college_rates)
cr_bind <- rbind(college_rates[[1]], college_rates[[2]])
i <- 3
while(i <= num){
  cr_bind <- rbind(cr_bind, college_rates[[i]])
  i <- i + 1
}

#Each data set had a 'Nation' row with the averages. I removed that because I was going to analyze it myself.
cr_bind <- filter(cr_bind, State != "Nation")



#-------TIDY MEDIAN_INCOME_BY_STATE DATAFRAME----------------------------

#Remove the rows we don't need
#Keep the rows that use Current Dollars and remove the District of Columbia since
#it's not in the college_rate data frames
median_income_by_state <- median_income_by_state[c(4:57), ]
median_income_by_state <- median_income_by_state[-c(2,3), ]
median_income_by_state <- median_income_by_state[-c(10), ]

#Find empty column headers and remove those columns
#Those columns held the standard deviation and mean that we didn't need.
i <- 1
cv <- c()
while(i <= dim(median_income_by_state)[2]){
  if(median_income_by_state[1,i] == ""){cv <- c(cv, i)}
  i <- i + 1
}
median_income_by_state <- median_income_by_state[,-c(cv)]

#Clean up column names and remove columns we don't need (Odd numbered years & years before 2004).
#We needed the years to match the other data sets for comparison

median_income_by_state <- median_income_by_state[,-c(16:35)]

v <- c(2,3,5,6,8,10,12,14)
median_income_by_state <- median_income_by_state[,-c(v)]
median_income_by_state <- median_income_by_state[-c(1),]

#Cleaned up column names
colnames(median_income_by_state) <- c("State", "2014", "2012", "2010", "2008", 
                                      "2006", "2004")

#Convert from wide to long format
median_income_by_state <- gather(median_income_by_state, year, median_income, "2014":"2004")

#Converted to numeric type
median_income_by_state$year <- as.numeric(median_income_by_state$year)

#removed commas before converting the column to numeric
median_income_by_state$median_income <- gsub(",", "", median_income_by_state$median_income)
median_income_by_state$median_income <- as.numeric(median_income_by_state$median_income)

colnames(cr_bind)

## [1] "State"                       "grads_to_college_percentage"
## [3] "hs_grads"                    "grads_enrolled_in_college"  
## [5] "year"

College enrollment rates, median family income and average tuition costs data frames are merged together by ‘State’ and ‘year’. A column was added that calculated the percentage of income required to pay for tuition.

#-------MERGE DATAFRAMES----------------------------

temp <- merge(cr_bind, median_income_by_state, by = c("State", "year"))


full <- merge(temp, avg_tuition_cost, by = c("State", "year"))

full <- full %>% mutate(perc_of_income = round(avg_tuition/median_income, 2))

head(full)

##     State year grads_to_college_percentage hs_grads
## 1 Alabama 2004                        60.6    41135
## 2 Alabama 2006                        62.7    43109
## 3 Alabama 2008                        66.7    45922
## 4 Alabama 2010                        63.2    48443
## 5 Alabama 2012                        59.1    50263
## 6 Alabama 2014                        62.1    49242
##   grads_enrolled_in_college median_income avg_tuition perc_of_income
## 1                     24913         36629    4510.000           0.12
## 2                     27019         37952    4906.000           0.13
## 3                     30616         44476    5968.000           0.13
## 4                     30616         40933    7373.000           0.18
## 5                     29704         43464    8734.000           0.20
## 6                     30583         42278    9480.009           0.22

#Remove excess data frames
remove(temp)
remove(college_rates)
remove(college_rates_2004)
remove(college_rates_2006)
remove(college_rates_2008)
remove(college_rates_2010)
remove(college_rates_2012)
remove(college_rates_2014)
remove(median_income_by_state)
remove(cr_bind)
remove(avg_tuition_cost)

Analysis

Is There A Trend?

The samples had a size of 50 and a total of 6 samples.

Compare College Enrollment Rates vs Percentage of Income for Tuition

For the nation as a whole, I found the mean of the average percent of income used for tuition and the average percent of direct college enrollees. The plot shows the percentage difference from year to year. You can see that while the difference in the college enrollment rate is changing relatively drastically, the different in the percentage of income is barely moving. There is not a noticeable linear coorelation between the two variables.

#Create a subset of the full data frame only and create two new columns 
#that have the nations average of the percent of income and college enrollment
#rate. Grouped by year. 
nation_avg <- subset(full, select = c(year, grads_to_college_percentage, perc_of_income))
nation_avg <- nation_avg %>%
  group_by(year) %>%
  mutate(avg_percentage_income = round(mean(perc_of_income),2)) %>%
  mutate(avg_percentage_grad = round(mean(grads_to_college_percentage),2)) 

#Break up into individual years
nation_2004 <- subset(nation_avg) %>%
  filter(year == 2004) %>%
  mutate(difference = grads_to_college_percentage - perc_of_income)

nation_2006 <- subset(nation_avg) %>%
  filter(year == 2006) %>%
  mutate(difference = grads_to_college_percentage - perc_of_income)

nation_2008 <- subset(nation_avg) %>%
  filter(year == 2008) %>%
  mutate(difference = grads_to_college_percentage - perc_of_income)

nation_2010 <- subset(nation_avg) %>%
  filter(year == 2010) %>%
  mutate(difference = grads_to_college_percentage - perc_of_income)

nation_2012 <- subset(nation_avg) %>%
  filter(year == 2012) %>%
  mutate(difference = grads_to_college_percentage - perc_of_income)

nation_2014 <- subset(nation_avg) %>%
  filter(year == 2014) %>%
  mutate(difference = grads_to_college_percentage - perc_of_income)



#Average across total years 
mean_income <- mean(full$perc_of_income)
mean_grad <- mean(nation_avg$grads_to_college_percentage)

#Create two columns that show the difference from the mean values for
#tuition income percentage and college enrollment rate
nation_avg <- nation_avg %>%
  mutate(diff_income = avg_percentage_income - mean_income) %>%
  mutate(diff_grad = avg_percentage_grad - mean_grad) %>%
  mutate(diff_gi = grads_to_college_percentage - perc_of_income) %>%
  gather(diff, perc_diff, select = c(diff_income, diff_grad))

#Plots
ggplot(nation_avg, aes(x = year, y = perc_diff, group = diff, color = diff), show.legend = F) +
  geom_line()+
  geom_point() +
  ggtitle("Difference in Enrollment Rates vs Income for Tuition as a Nation") +
  xlab("Year") +
  ylab("Percentage Difference") + 
  scale_y_continuous()

#Create another subset and transform the year column from numeric to character
#This stops the year values on the x-axis from being calculated and displayed differently
plot_full <- nation_avg %>%
  gather(type, percentage, select = c(avg_percentage_grad, avg_percentage_income)) 

plot_full$year <- as.character(plot_full$year)

ggplot(plot_full, aes(x = year, y = percentage, color = type, fill = type), show.legend = F) +
  geom_bar(stat = "identity", position = position_dodge()) +
  ggtitle("College Enrollment Rates vs Income for Tuition") +
  geom_text(aes(label=percentage,hjust=0.4, vjust=-0.1), color = "black",position=position_dodge(width = 1))

Histogram

Distribution of the graduation enrollment rates is normal without any noticeable skew. No outliers seem to be present. The red line shows the mean.

m_rate <- round(mean(nation_avg$diff_gi), 2)


hist(nation_avg$diff_gi, breaks = 25, 
     xlab = "College Enrollment % - Income Tuition %", 
     main = "All States from 2004 - 2014 (mean = 60.77)",
     xlim = c(40,80))
abline(v = m_rate, col = "red")

l <- lm(grads_to_college_percentage ~ perc_of_income, data = full)
summary(l)

## 
## Call:
## lm(formula = grads_to_college_percentage ~ perc_of_income, data = full)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.2790  -4.2764   0.6077   4.9033  17.9396 
## 
## Coefficients:
##                Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)      55.769      1.494  37.340 < 0.0000000000000002 ***
## perc_of_income   36.371     10.129   3.591             0.000385 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.276 on 298 degrees of freedom
## Multiple R-squared:  0.04147,    Adjusted R-squared:  0.03826 
## F-statistic: 12.89 on 1 and 298 DF,  p-value: 0.0003855

Confidence Interval

$H_0:$ Percentage of income for tuition does not influence college enrollment

$H_a:$ Percentage of income for tuition does influence college enrollment

#Prep data for plotting
#Separate by year and create columns for the mean and standard deviations for both the percentage of income for tuition and college enrollment.
ci_2004 <- data.frame(nation_2004$year, mean(nation_2004$perc_of_income), sd(nation_2004$perc_of_income),
                      mean(nation_2004$grads_to_college_percentage), sd(nation_2004$grads_to_college_percentage), 
                      nation_2004$difference)
ci_2006 <- data.frame(nation_2006$year, mean(nation_2006$perc_of_income), sd(nation_2006$perc_of_income),
                      mean(nation_2006$grads_to_college_percentage), sd(nation_2006$grads_to_college_percentage), 
                      nation_2006$difference)
ci_2008 <- data.frame(nation_2008$year, mean(nation_2008$perc_of_income), sd(nation_2008$perc_of_income),
                      mean(nation_2008$grads_to_college_percentage), sd(nation_2008$grads_to_college_percentage),
                      nation_2008$difference)
ci_2010 <- data.frame(nation_2010$year, mean(nation_2010$perc_of_income), sd(nation_2010$perc_of_income),
                      mean(nation_2010$grads_to_college_percentage), sd(nation_2010$grads_to_college_percentage), 
                      nation_2010$difference)
ci_2012 <- data.frame(nation_2012$year, mean(nation_2012$perc_of_income), sd(nation_2012$perc_of_income),
                      mean(nation_2012$grads_to_college_percentage), sd(nation_2012$grads_to_college_percentage),
                      nation_2012$difference)
ci_2014 <- data.frame(nation_2014$year, mean(nation_2014$perc_of_income), sd(nation_2014$perc_of_income),
                      mean(nation_2014$grads_to_college_percentage), sd(nation_2014$grads_to_college_percentage),
                      nation_2014$difference)



#Since all the rows are the same. Only one is needed
ci_2004 <- ci_2004[1,]
ci_2006 <- ci_2006[1,]
ci_2008 <- ci_2008[1,]
ci_2010 <- ci_2010[1,]
ci_2012 <- ci_2012[1,]
ci_2014 <- ci_2014[1,]

#Rename columns
colnames(ci_2004) <- c('year', 'income_mean', 'income_sd', 'grad_mean', 'grad_sd', 'difference')
colnames(ci_2006) <- c('year', 'income_mean', 'income_sd', 'grad_mean', 'grad_sd', 'difference')
colnames(ci_2008) <- c('year', 'income_mean', 'income_sd', 'grad_mean', 'grad_sd', 'difference')
colnames(ci_2010) <- c('year', 'income_mean', 'income_sd', 'grad_mean', 'grad_sd', 'difference')
colnames(ci_2012) <- c('year', 'income_mean', 'income_sd', 'grad_mean', 'grad_sd', 'difference')
colnames(ci_2014) <- c('year', 'income_mean', 'income_sd', 'grad_mean', 'grad_sd', 'difference')
  
#function to calculate the confidence interval and display the results for each year.
ci_print <- function(x){
   n <- 50
   sd1 <- x$income_sd
   sd2 <- x$grad_sd
   se <- round(sqrt((sd1^2/n)+(sd2^2/n)),2)
   m <- round(x$grad_mean - x$income_mean,2)
   hi <- m + (se * 1.96)
   lo <- m - (se * 1.96)
   text <- paste0("Confidence Interval for ", x$year, " is (", lo,",", hi,")")
   print(text)
   print(paste0("SD is ", se))
   print(paste0("Mean is ",m))
}

95% confidence interval years 2004 - 2014

#Put the data frames in a list and loop through them.
#Store the confidence interval results in a single dataframe to be plotted
list_ci <- list(ci_2004, ci_2006, ci_2008, ci_2010, ci_2012, ci_2014)

for(i in list_ci){
  ci_print(i)
}

## [1] "Confidence Interval for 2004 is (54.5284,58.2916)"
## [1] "SD is 0.96"
## [1] "Mean is 56.41"
## [1] "Confidence Interval for 2006 is (59.1036,63.3764)"
## [1] "SD is 1.09"
## [1] "Mean is 61.24"
## [1] "Confidence Interval for 2008 is (60.1912,64.2288)"
## [1] "SD is 1.03"
## [1] "Mean is 62.21"
## [1] "Confidence Interval for 2010 is (59.9828,64.1772)"
## [1] "SD is 1.07"
## [1] "Mean is 62.08"
## [1] "Confidence Interval for 2012 is (59.6376,63.3224)"
## [1] "SD is 0.94"
## [1] "Mean is 61.48"
## [1] "Confidence Interval for 2014 is (59.2504,63.2096)"
## [1] "SD is 1.01"
## [1] "Mean is 61.23"

df_ci <- data.frame()
df_list <- list()
j <- 1

for(i in list_ci){
   n <- 50
   
   sd1 <- i$income_sd
   sd2 <- i$grad_sd
   se <- round(sqrt((sd1^2/n)+(sd2^2/n)),2)
   m <- round(i$grad_mean - i$income_mean,2)
   hi <- m + (se * 1.96)
   lo <- m - (se * 1.96)
   
   df_ci <- data.frame(i$year, m, lo, hi, i$difference)
   df_list[[j]] <- df_ci
   j <- j + 1
}

#Bind the data frames in the list into a single data frame
df_ci <- bind_rows(df_list)
colnames(df_ci) <- c('year', 'mean', 'low', 'high', 'difference')
df_ci$year <- as.character(df_ci$year)

ggplot(df_ci, aes(year, y = mean, group = year)) +
  geom_boxplot(aes(ymin = min(difference), lower = low, middle = mean, upper = high, ymax = max(difference)), stat = 'identity') +
  ggtitle("Confidence Intervals for 2004 - 2014") +
  geom_hline(yintercept = mean(df_ci$mean), col = "red")

Red line is the mean across all the years. This plot shows the difference in percentage between the income for tuition and college enrollment. The ‘whiskers’ are the max and min for that year, the middle line in the box is the mean and the two halfs are 1.96 standard deviations away from the mean.

Sampling

Since I only had 6 samples (years). Here is a simulation that grabs a sample size of 50 random observations from the initial population of 300 and runs it 50 times.

Credit for this plotting function belongs to Dr. Jason Bryer

#CREDIT: Dr. Jason Bryer for plot_ci()


samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 50
for(i in 1:50){
samp <- sample(nation_avg$diff_gi, n) # obtain a sample of size n = 50 from the population
samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(nation_avg$diff_gi))

For a 95% confidence interval the simulation is hovering on 5% failure, where the difference in the income-tuition percentage and college enrollment percentage fall outside of 1.96 standard deviations of the mean. The low r-squared value in the regression model helps justify that this model doesn’t help explain the data so we can fail to reject the null hypothesis. The median family income doesn’t affect the college enrollment rates.

Linear Regression

The response variable is the college enrollment rate and the predictor variable is the percentage of income towards tuition. The R-square value is very small which means the linear model doesn’t explain much of the data. Even though the low p-value suggests the predictor variable is relevant, more variables need to be analyzed for a better fit.

The histogram of residuals shows a unimodal, normal distrabution without any skew. This suggests there are not any outliers to be concerned about.

sum_l <- summary(l)
sum_l

## 
## Call:
## lm(formula = grads_to_college_percentage ~ perc_of_income, data = full)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.2790  -4.2764   0.6077   4.9033  17.9396 
## 
## Coefficients:
##                Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)      55.769      1.494  37.340 < 0.0000000000000002 ***
## perc_of_income   36.371     10.129   3.591             0.000385 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.276 on 298 degrees of freedom
## Multiple R-squared:  0.04147,    Adjusted R-squared:  0.03826 
## F-statistic: 12.89 on 1 and 298 DF,  p-value: 0.0003855

hist(sum_l$residuals, breaks = 25,
     main = "Residuals of College Enrollment Rates",
     xlab = "Displacement",
     xlim = c(-20,20))

Though the average family income didn’t influence the college enrollment rates for the nation as a whole, individuals states may provide a different result.

A regression model was processed for each state. The top and bottom 3 were pulled for analysis.

#Create a loop that makes a data frame subset for each state. Find the linear model for each state and add it to a list.

#Calculate the linear models for each state. Put the state, r.square and adjusted r.square value in a data frame. Sort the data frame by r.square and print the top and bottom 3.


#Make of vector of the state's names
s <- c(unique(full$State))
list_lm <- list()

#Create a subset for each state
#Print the linear model for each subset
for(i in s){
  lm_state <- subset(full) %>%
  filter(i == full$State)

  lm_state <- lm(grads_to_college_percentage ~ avg_tuition + median_income, data = lm_state)
  lm <- summary(lm_state)
  
  list_lm[[i]] <- lm
}

n <- names(list_lm)

r_squared <- c()
for(i in 1:50){r_squared <- c(r_squared, list_lm[[i]]$r.squared)}

adj_r_squared <- c()
for(i in 1:50){adj_r_squared <- c(adj_r_squared, list_lm[[i]]$adj.r.squared)}

adj_r_squared <- c()
for(i in 1:50){adj_r_squared <- c(adj_r_squared, list_lm[[i]]$adj.r.squared)}

df_r_squared <- data.frame(n, r_squared, adj_r_squared)

df_r_squared <- df_r_squared[order(r_squared, decreasing = T),]

top3 <- head(df_r_squared, 3)
top3

##           n r_squared adj_r_squared
## 25 Missouri 0.9668930     0.9448217
## 36 Oklahoma 0.9661470     0.9435783
## 15     Iowa 0.9279984     0.8799974

bottom3 <- tail(df_r_squared, 3)
bottom3

##                 n  r_squared adj_r_squared
## 28         Nevada 0.08354191    -0.5274302
## 40 South Carolina 0.07447976    -0.5425337
## 31     New Mexico 0.07326048    -0.5445659

Compare top 3 vs bottom 3

The top 3 shows a surprisingly well-fitted model. Though the there are only 6 observation and do not meet the conditions, when more data is available in the future, deeper analysis should be encouraged.

Top 3

print("Top 3")

## [1] "Top 3"

lm_mo <- list_lm$Missouri
lm_ok <- list_lm$Oklahoma
lm_ia <- list_lm$Iowa

lm_mo

## 
## Call:
## lm(formula = grads_to_college_percentage ~ avg_tuition + median_income, 
##     data = lm_state)
## 
## Residuals:
##        1        2        3        4        5        6 
## -0.73302  0.88583 -0.27812  0.68186 -0.65230  0.09574 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   37.5309726  3.6934254  10.162  0.00203 **
## avg_tuition    0.0070687  0.0011223   6.298  0.00809 **
## median_income -0.0006157  0.0001974  -3.119  0.05249 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8754 on 3 degrees of freedom
## Multiple R-squared:  0.9669, Adjusted R-squared:  0.9448 
## F-statistic: 43.81 on 2 and 3 DF,  p-value: 0.006024

lm_ok

## 
## Call:
## lm(formula = grads_to_college_percentage ~ avg_tuition + median_income, 
##     data = lm_state)
## 
## Residuals:
##       1       2       3       4       5       6 
## -0.1114  0.5924 -0.7602 -0.5292  0.9216 -0.1131 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   79.1242455  6.1162721  12.937 0.000997 ***
## avg_tuition    0.0052623  0.0006365   8.267 0.003707 ** 
## median_income -0.0011288  0.0002053  -5.498 0.011842 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8333 on 3 degrees of freedom
## Multiple R-squared:  0.9661, Adjusted R-squared:  0.9436 
## F-statistic: 42.81 on 2 and 3 DF,  p-value: 0.006229

lm_ia

## 
## Call:
## lm(formula = grads_to_college_percentage ~ avg_tuition + median_income, 
##     data = lm_state)
## 
## Residuals:
##       1       2       3       4       5       6 
##  0.3064 -1.1395  0.8821  0.3440 -0.7262  0.3332 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   51.1641367  5.1937694   9.851  0.00222 **
## avg_tuition    0.0035930  0.0009731   3.692  0.03446 * 
## median_income -0.0002166  0.0002018  -1.073  0.36186   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9878 on 3 degrees of freedom
## Multiple R-squared:  0.928,  Adjusted R-squared:   0.88 
## F-statistic: 19.33 on 2 and 3 DF,  p-value: 0.01932

Bottom 3

The Adjusted R-squared is negative for the bottom 3. This means the fit is actually worse than just fitting a horizontal line.

print("Bottom 3")

## [1] "Bottom 3"

lm_nv <- list_lm$Nevada
lm_sc <- list_lm$`South Carolina`
lm_nm <- list_lm$`New Mexico`

lm_nv

## 
## Call:
## lm(formula = grads_to_college_percentage ~ avg_tuition + median_income, 
##     data = lm_state)
## 
## Residuals:
##       1       2       3       4       5       6 
##  0.5674 -1.6638  2.1730 -1.6591  1.1259 -0.5434 
## 
## Coefficients:
##                  Estimate  Std. Error t value Pr(>|t|)  
## (Intercept)   59.47054144 16.37353500   3.632   0.0359 *
## avg_tuition   -0.00029787  0.00062302  -0.478   0.6652  
## median_income -0.00008647  0.00031010  -0.279   0.7985  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.011 on 3 degrees of freedom
## Multiple R-squared:  0.08354,    Adjusted R-squared:  -0.5274 
## F-statistic: 0.1367 on 2 and 3 DF,  p-value: 0.8773

lm_sc

## 
## Call:
## lm(formula = grads_to_college_percentage ~ avg_tuition + median_income, 
##     data = lm_state)
## 
## Residuals:
##       1       2       3       4       5       6 
##  0.5698 -2.7020  2.8491  1.2532 -2.1847  0.2146 
## 
## Coefficients:
##                  Estimate  Std. Error t value Pr(>|t|)
## (Intercept)   55.63113589 52.33432418   1.063    0.366
## avg_tuition   -0.00007579  0.00240725  -0.031    0.977
## median_income  0.00029206  0.00175350   0.167    0.878
## 
## Residual standard error: 2.716 on 3 degrees of freedom
## Multiple R-squared:  0.07448,    Adjusted R-squared:  -0.5425 
## F-statistic: 0.1207 on 2 and 3 DF,  p-value: 0.8904

lm_nm

## 
## Call:
## lm(formula = grads_to_college_percentage ~ avg_tuition + median_income, 
##     data = lm_state)
## 
## Residuals:
##        1        2        3        4        5        6 
## -5.52225  3.61480 -0.02095  3.25188  3.16439 -4.48788 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   36.447883  63.888502   0.570    0.608
## avg_tuition   -0.002920   0.006793  -0.430    0.696
## median_income  0.001056   0.002178   0.485    0.661
## 
## Residual standard error: 5.301 on 3 degrees of freedom
## Multiple R-squared:  0.07326,    Adjusted R-squared:  -0.5446 
## F-statistic: 0.1186 on 2 and 3 DF,  p-value: 0.8921

#Create vectors to loop through
t3 <- c('Missouri', 'Oklahoma', 'Iowa')
b3 <- c('Nevada', 'South Carolina', 'New Mexico')
f6 <- c('Missouri', 'Oklahoma', 'Iowa', 'Nevada', 'South Carolina', 'New Mexico')

#Create a subset of the top3 and bottom3 states
pick6 <- subset(full) %>%
  filter(State == 'Missouri' |
           State == 'Oklahoma' |
           State == 'Iowa' | 
           State == 'Nevada' |
           State == 'South Carolina' | 
           State == 'New Mexico')

#Create a vector of state abbreviations for the plot labels
abbr <- c('MO', 'OK', 'IA', 'NV', 'SC','NM')

#Loop through and add a column of state abbr. that correspond to the correct state
j <- 1 
for(i in pick6$State){
  if(i == 'Missouri'){pick6$abbr[j] = 'MO'}
  if(i == 'Oklahoma'){pick6$abbr[j] = 'OK'}
  if(i == 'Iowa'){pick6$abbr[j] = 'IA'}
  if(i == 'Nevada'){pick6$abbr[j] = 'NV'}
  if(i == 'South Carolina'){pick6$abbr[j] = 'SC'}
  if(i == 'New Mexico'){pick6$abbr[j] = 'NM'}
  j <- j + 1
}

#Loop through and add another column labeling whether the state is in the bottom or top 3
#This is used primarily for the plot 
j <- 1 
for(i in pick6$State){
  if(i %in% t3){
    pick6$rank[j] <- "Top3"}
  if(i %in% b3){pick6$rank[j] <- "Bottom3"}
  j <- j + 1
}

#Plot 
ggplot(pick6, aes(x = year, y = perc_of_income, group = State, color = rank)) +
  geom_line(lwd = .7) +
  xlab("Year") +
  ylab("% Of Income For Tuition") +
  geom_dl(aes(label = abbr), method = "first.points") +
  ggtitle("Percentage of Income for Tuition")

Conclusion

Family income isn’t enough to justify whether a high school graduate directly enrolls after college. With the presence of grants, scholarships, financial aid, and other government assitance available, the amount of money a family makes shouldn’t have any influence. We are also only analyzing 10 years and the linear models for some states can make a case where family income can influence college enrollment rates. This will depend on the future when more data is available for the coming years.