Question 1 - For college.csv, how many variables and how many observations in the data?

Section 1.1

To begin this project, I will read the relevant file into my R environment with read.csv() and assign it to the name “college”. I will also need ggplot(), which comes from the ggplot2 package bundled in the tidyverse, to create some of the graphs later in this project, so I will load the tidyverse with library().

college <- read.csv("college.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")
library(tidyverse)

Now I’ll be able to work with the college dataset, and use ggplot() to create visualizations and explore this data.

Section 1.2

To figure out the number of variables in the college data, I will use the ncol() function. This will return the number of columns, with each column being a variable.

ncol(college)
## [1] 17

This output shows that the college data set has 17 columns, meaning it has 17 variables.

To get a bit more information about what those variables are, I’ll use the colnames() function, which will return a vector with each of the column header names.

colnames(college)
##  [1] "id"                 "name"               "city"              
##  [4] "state"              "region"             "highest_degree"    
##  [7] "control"            "gender"             "admission_rate"    
## [10] "sat_avg"            "undergrads"         "tuition"           
## [13] "faculty_salary_avg" "loan_default_rate"  "median_debt"       
## [16] "lon"                "lat"

This output shows that the college data set has two variables that identify colleges: an id variable and the name of the college. It also has variables describing each college’s location: city, state, region, longitude, and latitude. Several variables describe admissions and the student body, such as gender, admission rate, average SAT score, and number of undergrads, and others describe cost and finances, such as tuition, loan default rate, median debt, and average faculty salary. The remaining variables, highest degree and control, describe the type of institution. Knowing the number and types of variables gives a good sense of what I will be dealing with as I explore relationships and trends in this data.

Section 1.3

To figure out the number of observations in the college data set, I will use the nrow() function. This will display the number of rows in the data, with each row corresponding to an observation.

nrow(college)
## [1] 1269

This output shows that there are 1269 observations in this data set.
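
As a quick cross-check, dim() returns both counts in a single call. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the college data) so that it runs on its own.

```r
# Hypothetical 3-row, 2-column data frame standing in for the college data
toy <- data.frame(id = c(1, 2, 3), name = c("A", "B", "C"))

# dim() returns c(number of rows, number of columns)
toy_dim <- dim(toy)
toy_dim
```

Applied to the real data, the same call should agree with the nrow() and ncol() results above.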

Question 2 - Are there missing values in the data? If so, can you show/visualize how data is missing?

Section 2.1

When I read the data set into R using read.csv(), I added the specification na.strings = "" within the function. This specification means that any blank values in the data are converted to NA. Thanks to this step, I can wrap is.na() in sum() to count the number of NAs in the data.

sum(is.na(college))
## [1] 0

This output shows that there are no missing values in the form of an “NA” in the data (and therefore no blanks due to how we read the data in). This may mean that there are no missing values, but it’s possible that there may still be some that have values other than “NA” or a blank.

Section 2.2

To check if there are any, I will use a str() function to see if there are any odd structures to the data which we can further explore to find potential missing values.

str(college)
## 'data.frame':    1269 obs. of  17 variables:
##  $ id                : int  102669 101648 100830 101879 100858 100663 101480 102049 101709 100751 ...
##  $ name              : Factor w/ 1260 levels "Abilene Christian University",..: 9 522 44 1079 43 980 430 801 1065 938 ...
##  $ city              : Factor w/ 834 levels "Aberdeen","Abilene",..: 25 448 491 247 42 71 349 71 490 760 ...
##  $ state             : Factor w/ 51 levels "AK","AL","AR",..: 1 2 2 2 2 2 2 2 2 2 ...
##  $ region            : Factor w/ 4 levels "Midwest","Northeast",..: 4 3 3 3 3 3 3 3 3 3 ...
##  $ highest_degree    : Factor w/ 3 levels "Associate","Bachelor",..: 3 1 3 3 3 3 3 3 3 3 ...
##  $ control           : Factor w/ 2 levels "Private","Public": 1 2 2 2 2 2 2 1 2 2 ...
##  $ gender            : Factor w/ 3 levels "CoEd","Men","Women": 1 1 1 1 1 1 1 1 1 1 ...
##  $ admission_rate    : num  0.421 0.614 0.802 0.679 0.835 ...
##  $ sat_avg           : int  1054 1055 1009 1029 1215 1107 1041 1165 1070 1185 ...
##  $ undergrads        : int  275 433 4304 5485 20514 11383 7060 3033 2644 29851 ...
##  $ tuition           : int  19610 8778 9080 7412 10200 7510 7092 27324 10660 9826 ...
##  $ faculty_salary_avg: int  5804 5916 7255 7424 9487 9957 6801 8367 7437 9667 ...
##  $ loan_default_rate : Factor w/ 198 levels "0","0.002","0.003",..: 77 135 106 111 45 62 96 7 103 63 ...
##  $ median_debt       : num  23250 11500 21335 21500 21831 ...
##  $ lon               : num  -149.9 -87.3 -86.3 -87.7 -85.5 ...
##  $ lat               : num  61.2 32.6 32.4 34.8 32.6 ...

There are a couple things that stand out in this output. First, the “state” variable has 51 levels, which is odd because there are only 50 states. This could be the result of one or more records not having data on the state, and inputting some text (such as “none”) as a placeholder. I’ll use the levels() function to see what values there are in this variable.

levels(college$state)
##  [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL"
## [16] "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE"
## [31] "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT"
## [46] "VA" "VT" "WA" "WI" "WV" "WY"

This output shows that there is a level called “DC”, which is not a state but is a valid U.S. location. There are no missing values in this variable.

The other thing that stands out about the structure of the data is that the loan_default_rate variable is listed as a factor with 198 levels, but this should be a numeric variable. I’ll use the levels() function again to see if there are any values causing this to list as a factor.

levels(college$loan_default_rate)
##   [1] "0"     "0.002" "0.003" "0.004" "0.005" "0.006" "0.007" "0.008" "0.009"
##  [10] "0.01"  "0.011" "0.012" "0.013" "0.014" "0.015" "0.016" "0.017" "0.018"
##  [19] "0.019" "0.02"  "0.021" "0.022" "0.023" "0.024" "0.025" "0.026" "0.027"
##  [28] "0.028" "0.029" "0.03"  "0.031" "0.032" "0.033" "0.034" "0.035" "0.036"
##  [37] "0.037" "0.038" "0.039" "0.04"  "0.041" "0.042" "0.043" "0.044" "0.045"
##  [46] "0.046" "0.047" "0.048" "0.049" "0.05"  "0.051" "0.052" "0.053" "0.054"
##  [55] "0.055" "0.056" "0.057" "0.058" "0.059" "0.06"  "0.061" "0.062" "0.063"
##  [64] "0.064" "0.065" "0.066" "0.067" "0.068" "0.069" "0.07"  "0.071" "0.072"
##  [73] "0.073" "0.074" "0.075" "0.076" "0.077" "0.078" "0.079" "0.08"  "0.081"
##  [82] "0.082" "0.083" "0.084" "0.085" "0.086" "0.087" "0.088" "0.089" "0.09" 
##  [91] "0.091" "0.092" "0.093" "0.094" "0.095" "0.096" "0.097" "0.098" "0.099"
## [100] "0.1"   "0.101" "0.102" "0.103" "0.104" "0.105" "0.106" "0.107" "0.108"
## [109] "0.109" "0.11"  "0.111" "0.113" "0.114" "0.115" "0.116" "0.117" "0.118"
## [118] "0.119" "0.12"  "0.121" "0.122" "0.123" "0.124" "0.125" "0.126" "0.127"
## [127] "0.128" "0.129" "0.13"  "0.131" "0.132" "0.133" "0.134" "0.135" "0.136"
## [136] "0.137" "0.138" "0.139" "0.14"  "0.142" "0.143" "0.144" "0.147" "0.148"
## [145] "0.149" "0.151" "0.152" "0.153" "0.154" "0.155" "0.156" "0.157" "0.158"
## [154] "0.159" "0.16"  "0.164" "0.166" "0.167" "0.169" "0.171" "0.172" "0.174"
## [163] "0.175" "0.176" "0.179" "0.18"  "0.182" "0.184" "0.186" "0.187" "0.188"
## [172] "0.19"  "0.192" "0.193" "0.196" "0.197" "0.2"   "0.202" "0.204" "0.213"
## [181] "0.215" "0.217" "0.218" "0.22"  "0.222" "0.23"  "0.233" "0.236" "0.237"
## [190] "0.247" "0.259" "0.284" "0.298" "0.306" "0.311" "0.315" "0.334" "NULL"

This output shows that there is a level called “NULL”. This is not a numeric value, so it is most likely a placeholder for missing values. To deal with this, I’ll re-read the data into R using read.csv(), this time with na.strings = c("", "NULL") so that both blanks and the “NULL” placeholder are treated as missing. This should change these values to NAs and allow this variable to be read as numeric.

college <- read.csv("college.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = c("", "NULL"))
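
An alternative that avoids re-reading the file would be to convert the factor in place. This is only a sketch on made-up values (`rate_factor` is hypothetical). The key point is that the conversion should go through as.character() first, since calling as.numeric() directly on a factor returns the internal level codes rather than the labels.

```r
# Made-up factor mimicking loan_default_rate with a "NULL" placeholder
rate_factor <- factor(c("0.077", "NULL", "0.106", "0.045"))

# Mark the placeholder as missing, then convert the labels (not the codes)
rate_chr <- as.character(rate_factor)
rate_chr[rate_chr == "NULL"] <- NA
rate_num <- as.numeric(rate_chr)
rate_num
```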

Now, I’ll re-run the str() function for this variable to confirm that it is now numeric, and then use the is.na() function within a sum() function to count the number of missing values.

str(college$loan_default_rate)
##  num [1:1269] 0.077 0.136 0.106 0.111 0.045 0.062 0.096 0.007 0.103 0.063 ...
sum(is.na(college))
## [1] 2

These outputs show that the loan_default_rate is now numeric, as it should be, and that there are two missing values in this data set.
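
The total count does not show where the missing values sit. One way to show how data is missing is to count NAs per column and list the incomplete rows. The sketch below does this on a made-up data frame (`toy` and its values are hypothetical).

```r
# Made-up data frame with two missing values in one column
toy <- data.frame(
  tuition = c(19610, 8778, 9080, 7412),
  loan_default_rate = c(0.077, NA, 0.106, NA)
)

# Count the NAs in each column; columns with a nonzero count hold the gaps
na_per_column <- colSums(is.na(toy))
na_per_column

# Row positions of the records that have at least one NA
incomplete_rows <- which(!complete.cases(toy))
incomplete_rows
```

Running colSums(is.na(college)) on the real data would give the same per-variable breakdown.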

Question 3 - For each variable in the data, please describe what you observe, such as some summary statistics, their distributions, and etc.

The first thing I’ll do to get an overview of the variables is use a generic summary() function to see if there are any variables that stick out and warrant a deeper look.

summary(college)
##        id                          name                city          state    
##  Min.   :100654   Westminster College:   3   New York    :  15   PA     :101  
##  1st Qu.:153250   Anderson University:   2   Boston      :  11   NY     : 84  
##  Median :186283   Aquinas College    :   2   Chicago     :  10   CA     : 71  
##  Mean   :186988   Bethany College    :   2   Philadelphia:   9   TX     : 63  
##  3rd Qu.:215284   Bethel University  :   2   Cleveland   :   8   OH     : 52  
##  Max.   :484905   Emmanuel College   :   2   Los Angeles :   8   IL     : 47  
##                   (Other)            :1256   (Other)     :1208   (Other):851  
##        region      highest_degree    control      gender     admission_rate  
##  Midwest  :353   Associate:  20   Private:763   CoEd :1237   Min.   :0.0509  
##  Northeast:299   Bachelor : 200   Public :506   Men  :   4   1st Qu.:0.5339  
##  South    :459   Graduate :1049                 Women:  28   Median :0.6687  
##  West     :158                                               Mean   :0.6501  
##                                                              3rd Qu.:0.7859  
##                                                              Max.   :1.0000  
##                                                                              
##     sat_avg       undergrads       tuition      faculty_salary_avg
##  Min.   : 720   Min.   :   47   Min.   : 2732   Min.   : 1451     
##  1st Qu.: 973   1st Qu.: 1296   1st Qu.: 8970   1st Qu.: 6191     
##  Median :1040   Median : 2556   Median :20000   Median : 7272     
##  Mean   :1060   Mean   : 5629   Mean   :21025   Mean   : 7656     
##  3rd Qu.:1120   3rd Qu.: 6715   3rd Qu.:30364   3rd Qu.: 8671     
##  Max.   :1545   Max.   :52280   Max.   :51008   Max.   :20650     
##                                                                   
##  loan_default_rate  median_debt         lon               lat       
##  Min.   :0.00000   Min.   : 6056   Min.   :-157.92   Min.   :19.71  
##  1st Qu.:0.03500   1st Qu.:21250   1st Qu.: -94.17   1st Qu.:35.22  
##  Median :0.05500   Median :24589   Median : -84.89   Median :39.74  
##  Mean   :0.06558   Mean   :23483   Mean   : -88.29   Mean   :38.61  
##  3rd Qu.:0.08300   3rd Qu.:27000   3rd Qu.: -78.63   3rd Qu.:41.81  
##  Max.   :0.33400   Max.   :41000   Max.   : -68.59   Max.   :61.22  
##  NA's   :2

One thing that can be noticed from this output is that there are multiple college names that have more than one entry. This could mean that there are either multiple colleges with the same names, or that there are duplicate entries of the entire row for these colleges. To look into this, I’ll take a sum of the duplicated() function, which will return the number of rows that are duplicates.

sum(duplicated(college))
## [1] 0

This output shows that there are not any duplicated rows, meaning that there are different colleges with the same names.
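
To look at the colleges that share a name side by side, duplicated() can be applied to the name column alone. This is a sketch on made-up records (`toy` and its name and state values are hypothetical).

```r
# Made-up records: two different colleges share one name
toy <- data.frame(
  name  = c("Bethany College", "Aquinas College", "Bethany College"),
  state = c("KS", "MI", "WV"),
  stringsAsFactors = FALSE
)

# Names that occur more than once
repeated_names <- unique(toy$name[duplicated(toy$name)])

# All rows carrying one of those names, for side-by-side comparison
shared <- toy[toy$name %in% repeated_names, ]
shared
```

If the rows differ in city or state, that supports the reading that they are separate institutions rather than duplicate entries.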

Next, I’ll take a closer look at the summary statistics for the undergrads variable.

summary(college$undergrads)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      47    1296    2556    5629    6715   52280

These statistics show that the number of undergraduates is strongly right-skewed (positively skewed). The mean (5629) is more than double the median (2556), and the maximum (52280) is far above the 3rd quartile (6715), indicating that a few colleges have huge numbers of undergrads relative to the others.
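
The skew can also be put into a single number. The sketch below computes a moment-based sample skewness by hand on made-up enrollment counts (`undergrads_toy` is hypothetical). Base R has no built-in skewness function, so the formula is written out.

```r
# Made-up enrollment counts with one very large college
undergrads_toy <- c(500, 1200, 2500, 3000, 6000, 50000)

# A mean far above the median is a first sign of right skew
mean_minus_median <- mean(undergrads_toy) - median(undergrads_toy)

# Moment-based sample skewness: positive values indicate a long right tail
centered <- undergrads_toy - mean(undergrads_toy)
skewness <- mean(centered^3) / mean(centered^2)^1.5
skewness
```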

To get a better visualization of the distribution of the number of undergraduates, I’ll create a histogram.

ggplot(data = college, aes(x = undergrads))+
  geom_histogram(binwidth = 1000, fill = "#bcbddc", colour = "#756bb1")+
  labs(title="Distribution of Undergraduate Students")+
  scale_x_continuous(name = "Number of Undergraduates", breaks = seq(0, 55000, by = 5000))

This histogram confirms that the number of undergraduates at colleges is very positively skewed due to a small number of colleges having significantly larger undergraduate populations. Most colleges have somewhere between 0 and 10,000 undergraduates, with a smaller number of colleges having over 10,000 undergraduates.

Question 4 - Visualize the association between some variable pairs of your choice. For any two variables, you can check to see if there is an interesting relationship worth mentioning. If so, you can explore further and visualize what you have found.

Now I will be looking into the relationship between the SAT Averages of colleges and their admission rates. A potential correlation between the two variables could provide information about some of the factors that go into colleges being more or less selective in the admission process. To visualize this relationship, I will be creating a scatter plot, and adding a smooth line to better distinguish the trend between the two variables.

ggplot(data = college, aes(x = sat_avg, y = admission_rate)) +
  geom_point()+
  geom_smooth()+
  labs(title = "SAT Scores vs. Admission Rate", x = "Average SAT Score", y = "Admission Rate")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

This scatter plot with the smooth line shows a general negative relationship between average SAT score and admission rate. The smooth line suggests that the relationship is weak for colleges with lower average SAT scores, but for colleges with average SAT scores above about 900 to 1000, the admission rate falls noticeably as the average SAT score rises. Average SAT score therefore looks like a fairly good indicator of how selective a college is in its admission process, and I will use this finding shortly to look into the selectivity of colleges.
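
A correlation coefficient can back up what the plot shows. The sketch below uses made-up values (`sat_toy` and `admit_toy` are hypothetical) and compares Pearson with Spearman. Spearman may suit a curved trend better, since it only assumes the relationship is monotone.

```r
# Made-up values: higher average SAT paired with lower admission rate
sat_toy   <- c(900, 1000, 1100, 1250, 1400, 1500)
admit_toy <- c(0.80, 0.75, 0.70, 0.50, 0.25, 0.10)

# Pearson measures linear association; Spearman works on ranks
pearson  <- cor(sat_toy, admit_toy, method = "pearson")
spearman <- cor(sat_toy, admit_toy, method = "spearman")
c(pearson = pearson, spearman = spearman)
```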

Question 5 - Visualize the association between some variable pairs (of your choice) conditional on some other variables (of your choice). This is similar to the previous questions but your visualization involves more than two variables.

To explore the relationship between a variable pair conditional on another variable, I will create a scatter plot of the relationship between SAT average and the number of undergraduates at colleges, with the control variable as a color layer. This will help to see the general trend between SAT scores and the size of the student population, while also comparing these variables for public vs. private colleges.

ggplot(data = college, aes(x = sat_avg, y = undergrads, color = control))+
  geom_point()+
  geom_smooth()+
  labs(title = "SAT Score vs. Undergraduates", subtitle = "Public vs. Private Colleges", x = "Average SAT Score", y = "Number of Undergraduates")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

From this scatter plot, there doesn’t seem to be a strong relationship between SAT scores and the number of undergraduate students for private colleges. In fact, the number of undergraduates at private colleges seems to be about the same no matter what the average SAT score is. For public colleges, there is more of a positive relationship between SAT scores and undergraduates. The visualization also shows that public colleges typically have more undergraduates, although the spread of SAT scores is broadly similar for the two college types, with private colleges having a somewhat higher ceiling for average SAT score.

Question 6 - Propose five questions you are interested in about this data set, and then answer these questions using summary statistics and visualization.

Question One - Where in the country do tuition rates tend to be highest and lowest?

Something that would be very useful to know for students applying to colleges is whether certain parts of the country tend to have different tuition rates. To explore this and visualize whether there are differences, I will be creating boxplots of the tuition rates for each region of the country. These boxplots will display some of the key summary statistics for each region, which will help to identify differences in tuition by region.

ggplot(data = college, aes(x = region, y = tuition))+
  geom_boxplot()+
  stat_summary(fun = "mean", geom = "point")+
  labs(title = "Distribution of Tuition by Region", x = "Region", y = "Tuition")

This boxplot shows that the Northeast region tends to have the highest tuition rates, because it has both the highest median and the highest mean tuition. The South and West regions have the lowest median tuition rates, with both medians almost the same as the first quartiles of the Midwest and Northeast regions. This suggests that the South and West have more colleges with tuition rates of $15,000 or less. In both of these regions, however, the mean is higher than the median, which shows that some colleges in the South and West charge much higher tuition and pull the distributions to the right.
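
The medians and means read off the boxplots can also be computed directly. This is a minimal sketch with tapply() on made-up tuition values (`toy` is hypothetical); a mean well above the median within a region points to the same right skew.

```r
# Made-up tuition values for two regions
toy <- data.frame(
  region  = c("South", "South", "South", "Northeast", "Northeast", "Northeast"),
  tuition = c(6000, 8000, 31000, 28000, 30000, 35000)
)

# One value per region for each statistic
region_median <- tapply(toy$tuition, toy$region, median)
region_mean   <- tapply(toy$tuition, toy$region, mean)
cbind(median = region_median, mean = region_mean)
```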

Since the West region’s mean is so much higher than its median, I would like to look deeper into which states within this region have the highest tuition rates and might be pulling the regional average up. To do this, I will first make a new data set with only the entries for the West region, and then create a barplot to visualize the average tuition rates for every state in this region.

West_region <- college %>% filter(region == "West")
summary(West_region$tuition)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3950    7522   13880   21431   34197   48594
ggplot(data = West_region, aes(x = state, y = tuition))+
  geom_bar(stat = "summary", fun = "mean")+
  labs(title = "Tuition by State", subtitle = "For the West Region", x = "State", y = "Average Tuition")

The summary statistics for tuition in the West region show that the median tuition rate in the West is $13,880, while the mean is $21,431. Looking at the barplot of the average tuition rates for each state, Alaska, California, Oregon, and Washington are the four states with noticeably higher average tuition than the rest, meaning that colleges in these states are pulling up the average tuition for the West region as a whole.
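
A bar of state averages hides how many colleges each average rests on. The sketch below pairs each state mean with its college count on made-up values (`toy_west` is hypothetical), so that an average based on a single college is easy to spot.

```r
# Made-up West-region records
toy_west <- data.frame(
  state   = c("AK", "CA", "CA", "CA", "UT", "UT"),
  tuition = c(20000, 40000, 10000, 34000, 6000, 8000),
  stringsAsFactors = FALSE
)

# Mean tuition and number of colleges per state
state_mean <- tapply(toy_west$tuition, toy_west$state, mean)
state_n    <- table(toy_west$state)

state_summary <- data.frame(
  state        = names(state_mean),
  mean_tuition = as.vector(state_mean),
  n_colleges   = as.vector(state_n[names(state_mean)])
)

# Highest average first
state_summary[order(-state_summary$mean_tuition), ]
```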

The visualizations of tuition rates by region suggest that the Midwest and Northeast regions tend to have more colleges with higher tuition rates, while the South and West typically have lower tuition. There are some states within the West region, however, which have significantly higher tuition rates than the rest of the region, indicating that someone who is looking for colleges with lower tuition rates might want to consider looking in the South and West regions, but would want to avoid certain states that pull the average up in these regions.

Question Two - How do the services differ based on how selective colleges are?

A valuable question is how the experience differs at colleges depending on how selective they are in their admission process. To explore, I’ll first look into the difference in admission rates based on the highest degree offered at the colleges with a boxplot. This will give insight into whether colleges offering higher degrees tend to be more exclusive.

ggplot(data = college, aes(x = highest_degree, y = admission_rate))+
  geom_boxplot()+
  stat_summary(fun = "mean", geom = "point")+
  labs(title = "Admission Rates for Highest Degree Offered", x = "Highest Degree", y = "Admission Rate")

This visualization shows that there isn’t much of a difference in admission rates based on the highest degree offered. Surprisingly, colleges that offer graduate degrees are, if anything, slightly less selective than colleges whose highest degree is a Bachelor’s or an Associate degree. There are quite a few outliers among colleges offering graduate degrees, however: schools that are very selective and have very low admission rates.

Another relationship that can indicate whether more selective colleges offer improved services is the one between admission rate and average faculty salary. Higher faculty salaries may suggest that a college offers a higher quality of education, or at least is more willing to invest in its staff. To visualize this relationship, I’ll create a scatter plot of admission rate against average faculty salary.

ggplot(data = college, aes(x = faculty_salary_avg, y = admission_rate))+
  geom_point(alpha = 0.2)+
  geom_smooth()+
  labs(title = "Faculty Salaries for Admission Rates", x = "Average Faculty Salary", y = "Admission Rate")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Based on this scatter plot, there does not appear to be a strong correlation between admission rate and faculty salary in the most densely packed salary range (between about $5,000 and $10,000). For average salaries above roughly $10,000, however, a moderately strong relationship emerges between higher faculty salaries and lower admission rates. This indicates that the colleges that invest the most in faculty salaries tend to be more selective in admissions.

Question Three - Do larger colleges have higher expenses?

Another question that would provide some useful information about colleges is whether larger colleges tend to have higher expenses. I will use the “undergrads” variable, which is the number of undergraduates at the school, as the basis for the size of the colleges.

The first indication of expenses that I will use is average faculty salary, and I will create a scatter plot comparing the number of undergraduates and the faculty salary to see whether larger colleges pay their faculty higher salaries. I will be adding a smooth line to better identify the relationship, and will be scaling the x axis (undergrads) to a log10 scale because a small number of colleges have far more undergraduates than the rest.

ggplot(data = college, aes(x = undergrads, y = faculty_salary_avg))+
  geom_point()+
  geom_smooth()+
  scale_x_log10()+
  labs(title = "Number of Undergraduates v. Faculty Salary", x = "Number of Undergraduates", y = "Average Faculty Salary")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

This scatter plot shows that there does seem to be a relationship between the number of undergraduates and the average faculty salary. Colleges with more undergraduates don’t pay dramatically more than smaller colleges, but they do tend to pay their faculty slightly more.

The next indication of expenses that I’ll use is the tuition rates of colleges. To see whether larger colleges charge more in tuition, I’ll create another scatter plot comparing undergraduates to tuition, and also add the control variable as a color aesthetic to see if private v. public colleges have different sizes and tuition rates. I’ll also scale the x axis to a log10 scale again.

ggplot(data = college, aes(x = undergrads, y = tuition, color = control))+
  geom_point()+
  geom_smooth()+
  scale_x_log10()+
  labs(title = "Number of Undergraduates v. Tuition", subtitle = "For Public v. Private Colleges", x = "Number of Undergraduates", y = "Tuition")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This scatter plot shows that there are distinct differences in the tuition rates for private and public colleges. For public colleges, there doesn’t seem to be any relationship between the number of undergraduates and tuition, as tuition varies around roughly the same value regardless of the number of undergraduate students. Private colleges do seem to show a relationship, as those with more undergraduates generally have higher tuition rates.
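
The difference between the two groups can be summarized by computing the correlation separately within each level of control, using log10(undergrads) to match the plot’s x axis. This is a sketch on made-up values (`toy` is hypothetical), not a result from the college data.

```r
# Made-up records: tuition rises with size for private colleges only
toy <- data.frame(
  control    = rep(c("Private", "Public"), each = 4),
  undergrads = c(500, 1000, 5000, 10000, 2000, 8000, 20000, 40000),
  tuition    = c(20000, 25000, 35000, 40000, 9000, 8000, 9500, 8500)
)

# Split by control, then correlate log10(undergrads) with tuition in each group
cor_by_control <- sapply(
  split(toy, toy$control),
  function(d) cor(log10(d$undergrads), d$tuition)
)
cor_by_control
```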

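The visualizations of expenses versus the number of undergraduates suggest that larger colleges generally have somewhat higher expenses: faculty salaries rise slightly with size, and tuition rises with size at private colleges, though not at public ones.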

Question Four - What factors seem to have the greatest effect on the debt that students accumulate?

Now I’ll be looking at levels of debt accumulated by students at colleges to see if there are any patterns or factors that contribute to higher levels of debt. First, I’ll be comparing the median debt to the highest degree offered by colleges, as well as “control” to see if there’s a difference in debt levels for both the highest degree level of a college and for public vs. private colleges. To do this, I’ll create a barplot of the average debt levels for each degree level, and separated by private v. public colleges.

ggplot(data = college, aes(x = highest_degree, y = median_debt, fill = control))+
  geom_bar(position = "dodge", stat = "summary", fun = "mean")+
  labs(title = "Debt v. Highest Degree Offered", subtitle = "For Public v. Private Colleges", x = "Highest Degree Offered", y = "Median Debt")

This barplot shows that students attending colleges whose highest degree is an Associate degree typically accumulate lower median debt than students in the other two categories. There is not a large difference in median debt between colleges whose highest degree is a Bachelor’s and those offering graduate degrees, indicating that the main effect of highest degree on debt is whether the college offers only an Associate degree or something higher. At all three degree levels, students at private colleges typically have slightly higher debt.

Other factors that could contribute to higher levels of debt are the tuition rate of a college, as well as the region the college is located in. To visualize whether these factors contribute to debt levels, I’ll create a scatter plot showing the relationship between median debt and tuition, with the region of the college added in as a shape aesthetic.

ggplot(data = college, aes(x = median_debt, y = tuition, shape = region))+
  geom_point()+
  geom_smooth(aes(color = region))+
  labs(title = "Debt v. Tuition Rates", subtitle = "For Each Region", x = "Median Debt", y = "Tuition")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This scatter plot does not show much relationship between the median debt level and tuition, for any of the regions. It doesn’t appear that these factors are very helpful in predicting the debt level of a college.

The final variable I’ll be exploring in relation to median debt level is the loan default rate. I would expect this to play a role in the level of debt for students at a college, so I will be visualizing this relationship with another scatter plot. Since many points may overlap where default rates are common, I’ll set the alpha to 0.1 to better show where points are concentrated.

ggplot(data = college, aes(x = loan_default_rate, y = median_debt))+
  geom_point(alpha = 0.1)+
  geom_smooth()+
  labs(title = "Default Loan Rate v. Median Debt", x = "Default Loan Rate", y = "Median Debt")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

Although I expected the loan default rate to play a role in the debt levels of students, there doesn’t seem to be much correlation between the two variables. Debt levels get slightly higher as the default rate goes up, but there is a lot of variation and random noise, meaning the loan default rate likely isn’t a strong predictor of debt levels.
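
To attach a number to this, cor() can be used, but the missing default rates need handling: with its default settings cor() returns NA as soon as either vector contains an NA. The sketch below shows the `use = "complete.obs"` option on made-up values (`rate_toy` and `debt_toy` are hypothetical).

```r
# Made-up values with two missing default rates
rate_toy <- c(0.02, 0.05, NA, 0.08, 0.12, NA, 0.20)
debt_toy <- c(21000, 25000, 23000, 22000, 26000, 24000, 25500)

# Without `use`, any NA makes the result NA
cor_default <- cor(rate_toy, debt_toy)

# "complete.obs" drops the pairs that contain an NA before computing
cor_complete <- cor(rate_toy, debt_toy, use = "complete.obs")
cor_complete
```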

Question Five - What are the biggest differences between public and private colleges?

So far, visualizations have highlighted a couple differences in public and private colleges, such as higher tuition levels and slightly higher median debt levels at private colleges. Now I’ll explore some other variables to discover some key ways private colleges differ from public ones. To start, I’ll create boxplots to compare the distributions of average faculty salaries for private and public colleges.

ggplot(data = college, aes(x = control, y = faculty_salary_avg))+
  geom_boxplot()+
  labs(title = "Faculty Salary for Public v. Private Colleges", x = "College Type", y = "Average Faculty Salary")

These boxplots show that public colleges tend to pay their faculty higher average salaries and are more consistent in their salary range, with a smaller spread. Private colleges, however, have many more high outliers, so faculty at some private colleges earn much higher salaries.
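
The center and spread described above correspond to the median and interquartile range of each group, which can be computed directly. This is a minimal sketch on made-up salaries (`toy` is hypothetical).

```r
# Made-up average faculty salaries for the two college types
toy <- data.frame(
  control = rep(c("Private", "Public"), each = 4),
  faculty_salary_avg = c(5000, 6000, 7000, 14000, 7000, 7500, 8000, 8500)
)

# Median (the line in each box) and IQR (the height of each box) per group
salary_median <- tapply(toy$faculty_salary_avg, toy$control, median)
salary_iqr    <- tapply(toy$faculty_salary_avg, toy$control, IQR)
cbind(median = salary_median, iqr = salary_iqr)
```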

Next, I will expand upon the earlier visualization from Question 4, where I compared the admission rates and average SAT scores of colleges. To look into whether private or public colleges are more selective and harder to get into, I will add the “control” variable as a color aesthetic.

ggplot(data = college, aes(x = sat_avg, y = admission_rate, color = control))+
  geom_point()+
  geom_smooth()+
  labs(title = "SAT Average v. Admission Rate", subtitle = "For Public v. Private Colleges", x = "Average SAT Score", y = "Admission Rate")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Based on this scatter plot, private and public colleges tend to follow similar trends regarding SAT scores and admission rates, with colleges with higher average SAT scores tending to have lower admission rates. Another takeaway from this visualization is that there are more private colleges that have higher SAT scores and lower admission rates, while public colleges are a bit less spread out and typically more consistent in their SAT scores and admission rates.