Question 1. How many variables and how many observations in the data?

To begin, I want to read the relevant file into R, and load the tidyverse.

college <- read.csv("college.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.5     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

I will now be able to work with the data set and explore with data analysis.

I next want to find out how many columns there are in the dataset.

ncol(college)
## [1] 17

This shows me that there are 17 columns in the college data set. These columns represent the number of variables.

To discover the variable names for analysis, I will use the colnames function.

colnames(college)
##  [1] "id"                 "name"               "city"              
##  [4] "state"              "region"             "highest_degree"    
##  [7] "control"            "gender"             "admission_rate"    
## [10] "sat_avg"            "undergrads"         "tuition"           
## [13] "faculty_salary_avg" "loan_default_rate"  "median_debt"       
## [16] "lon"                "lat"

This output tells me the names of all the variables. These variables will help me make observations about the data set, and help me with my future analysis.

Finally, I want to find the number of observations (rows) in the data set. Here is the function I will use:

nrow(college)
## [1] 1269

This output tells me there are 1269 total observations in the data set.

Question 2. Are there any missing values in the data set? If so, can you show/visualize how data is missing?

First, I want to enter some code to see if there are any missing values in the data.

sum(is.na(college))
## [1] 0

This output tells me that there are no missing values in the data set. Using the previous code na.strings = "“, this makes any blank in the data become”NA". This allows me to search for any NA values, for which this data set has none.

I next want to look at the structure of the data.

str(college)
## 'data.frame':    1269 obs. of  17 variables:
##  $ id                : int  102669 101648 100830 101879 100858 100663 101480 102049 101709 100751 ...
##  $ name              : Factor w/ 1260 levels "Abilene Christian University",..: 9 522 44 1079 43 980 430 801 1065 938 ...
##  $ city              : Factor w/ 834 levels "Aberdeen","Abilene",..: 25 448 491 247 42 71 349 71 490 760 ...
##  $ state             : Factor w/ 51 levels "AK","AL","AR",..: 1 2 2 2 2 2 2 2 2 2 ...
##  $ region            : Factor w/ 4 levels "Midwest","Northeast",..: 4 3 3 3 3 3 3 3 3 3 ...
##  $ highest_degree    : Factor w/ 3 levels "Associate","Bachelor",..: 3 1 3 3 3 3 3 3 3 3 ...
##  $ control           : Factor w/ 2 levels "Private","Public": 1 2 2 2 2 2 2 1 2 2 ...
##  $ gender            : Factor w/ 3 levels "CoEd","Men","Women": 1 1 1 1 1 1 1 1 1 1 ...
##  $ admission_rate    : num  0.421 0.614 0.802 0.679 0.835 ...
##  $ sat_avg           : int  1054 1055 1009 1029 1215 1107 1041 1165 1070 1185 ...
##  $ undergrads        : int  275 433 4304 5485 20514 11383 7060 3033 2644 29851 ...
##  $ tuition           : int  19610 8778 9080 7412 10200 7510 7092 27324 10660 9826 ...
##  $ faculty_salary_avg: int  5804 5916 7255 7424 9487 9957 6801 8367 7437 9667 ...
##  $ loan_default_rate : Factor w/ 198 levels "0","0.002","0.003",..: 77 135 106 111 45 62 96 7 103 63 ...
##  $ median_debt       : num  23250 11500 21335 21500 21831 ...
##  $ lon               : num  -149.9 -87.3 -86.3 -87.7 -85.5 ...
##  $ lat               : num  61.2 32.6 32.4 34.8 32.6 ...

One variable that stands out is the college_default_rate. It is shown as a factor, however, all of the observations seem to be numbers, which would make it a numeric variable.

To investigate this further, I want to see all the values of the variable.

levels(college$loan_default_rate)
##   [1] "0"     "0.002" "0.003" "0.004" "0.005" "0.006" "0.007" "0.008" "0.009"
##  [10] "0.01"  "0.011" "0.012" "0.013" "0.014" "0.015" "0.016" "0.017" "0.018"
##  [19] "0.019" "0.02"  "0.021" "0.022" "0.023" "0.024" "0.025" "0.026" "0.027"
##  [28] "0.028" "0.029" "0.03"  "0.031" "0.032" "0.033" "0.034" "0.035" "0.036"
##  [37] "0.037" "0.038" "0.039" "0.04"  "0.041" "0.042" "0.043" "0.044" "0.045"
##  [46] "0.046" "0.047" "0.048" "0.049" "0.05"  "0.051" "0.052" "0.053" "0.054"
##  [55] "0.055" "0.056" "0.057" "0.058" "0.059" "0.06"  "0.061" "0.062" "0.063"
##  [64] "0.064" "0.065" "0.066" "0.067" "0.068" "0.069" "0.07"  "0.071" "0.072"
##  [73] "0.073" "0.074" "0.075" "0.076" "0.077" "0.078" "0.079" "0.08"  "0.081"
##  [82] "0.082" "0.083" "0.084" "0.085" "0.086" "0.087" "0.088" "0.089" "0.09" 
##  [91] "0.091" "0.092" "0.093" "0.094" "0.095" "0.096" "0.097" "0.098" "0.099"
## [100] "0.1"   "0.101" "0.102" "0.103" "0.104" "0.105" "0.106" "0.107" "0.108"
## [109] "0.109" "0.11"  "0.111" "0.113" "0.114" "0.115" "0.116" "0.117" "0.118"
## [118] "0.119" "0.12"  "0.121" "0.122" "0.123" "0.124" "0.125" "0.126" "0.127"
## [127] "0.128" "0.129" "0.13"  "0.131" "0.132" "0.133" "0.134" "0.135" "0.136"
## [136] "0.137" "0.138" "0.139" "0.14"  "0.142" "0.143" "0.144" "0.147" "0.148"
## [145] "0.149" "0.151" "0.152" "0.153" "0.154" "0.155" "0.156" "0.157" "0.158"
## [154] "0.159" "0.16"  "0.164" "0.166" "0.167" "0.169" "0.171" "0.172" "0.174"
## [163] "0.175" "0.176" "0.179" "0.18"  "0.182" "0.184" "0.186" "0.187" "0.188"
## [172] "0.19"  "0.192" "0.193" "0.196" "0.197" "0.2"   "0.202" "0.204" "0.213"
## [181] "0.215" "0.217" "0.218" "0.22"  "0.222" "0.23"  "0.233" "0.236" "0.237"
## [190] "0.247" "0.259" "0.284" "0.298" "0.306" "0.311" "0.315" "0.334" "NULL"

This output shows that there is a null value for the variable, and not NA. This is what caused the variable to be a factor and not numeric. To solve this, I will make the null values to appear as NA. I’ll do this by slightly modifying the read code.

college <- read.csv("college.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "NULL")

This code has changed the college_default_rate variable from a factor into a numeric variable. Now I will check how many null values there are, as they are now listed as NA.

sum(is.na(college))
## [1] 2

This output tells me there are 2 missing values in th college_default_rate, making the data set total to 2.

Question 3. For each variable in the data set, please describe what you observe, such as some summary statistics, their distributions, and etc.

The first step I want to take in answering this question is to get a summary of the data set.

summary(college)
##        id                          name                city          state    
##  Min.   :100654   Westminster College:   3   New York    :  15   PA     :101  
##  1st Qu.:153250   Anderson University:   2   Boston      :  11   NY     : 84  
##  Median :186283   Aquinas College    :   2   Chicago     :  10   CA     : 71  
##  Mean   :186988   Bethany College    :   2   Philadelphia:   9   TX     : 63  
##  3rd Qu.:215284   Bethel University  :   2   Cleveland   :   8   OH     : 52  
##  Max.   :484905   Emmanuel College   :   2   Los Angeles :   8   IL     : 47  
##                   (Other)            :1256   (Other)     :1208   (Other):851  
##        region      highest_degree    control      gender     admission_rate  
##  Midwest  :353   Associate:  20   Private:763   CoEd :1237   Min.   :0.0509  
##  Northeast:299   Bachelor : 200   Public :506   Men  :   4   1st Qu.:0.5339  
##  South    :459   Graduate :1049                 Women:  28   Median :0.6687  
##  West     :158                                               Mean   :0.6501  
##                                                              3rd Qu.:0.7859  
##                                                              Max.   :1.0000  
##                                                                              
##     sat_avg       undergrads       tuition      faculty_salary_avg
##  Min.   : 720   Min.   :   47   Min.   : 2732   Min.   : 1451     
##  1st Qu.: 973   1st Qu.: 1296   1st Qu.: 8970   1st Qu.: 6191     
##  Median :1040   Median : 2556   Median :20000   Median : 7272     
##  Mean   :1060   Mean   : 5629   Mean   :21025   Mean   : 7656     
##  3rd Qu.:1120   3rd Qu.: 6715   3rd Qu.:30364   3rd Qu.: 8671     
##  Max.   :1545   Max.   :52280   Max.   :51008   Max.   :20650     
##                                                                   
##  loan_default_rate  median_debt         lon               lat       
##  Min.   :0.00000   Min.   : 6056   Min.   :-157.92   Min.   :19.71  
##  1st Qu.:0.03500   1st Qu.:21250   1st Qu.: -94.17   1st Qu.:35.22  
##  Median :0.05500   Median :24589   Median : -84.89   Median :39.74  
##  Mean   :0.06558   Mean   :23483   Mean   : -88.29   Mean   :38.61  
##  3rd Qu.:0.08300   3rd Qu.:27000   3rd Qu.: -78.63   3rd Qu.:41.81  
##  Max.   :0.33400   Max.   :41000   Max.   : -68.59   Max.   :61.22  
##  NA's   :2

The first thing that stands out to me for the summary is the undergrad variable. The 3rd quartile has a value of 6715, while the max has a value of 52280. This means that the variable is most likely very positively skewed. To get a better visualization, I would need to graph it. I believe the best method would be to use a histogram.

ggplot(data = college, aes(x = undergrads))+
  geom_histogram(binwidth = 1200, fill = "green", colour = "black")+
  labs(title="Undergraduate Distribution")+
  scale_x_continuous(name = "number of Undergraduates", breaks = c(0,5000,10000,15000,20000,25000,30000,35000,40000,45000,50000,55000))

This histogram confirms my suspicions of this variable being positively skewed. The histogram is showing that a small number of colleges have a much higher undergraduate populations than the rest of the schools. It also shows that most have an undergraduate population of 0-10,000 students.

I will next look at another variable to see if there is something that sticks out for me to investigate. One variable that caught my eye is the name variable. There are some names that have multiple entries. I’m not sure whether these colleges are double counted, or if they are different colleges with the same name. To find out the answer, I will run a code that shows if there are any duplicates.

sum(duplicated(college))
## [1] 0

This output tells me that there are no duplicate colleges in this data set, and they are indeed separate colleges with the same name.

Question 4. Visualize the association between some variable pairs of your choice. For any two variables, you can check to see if there is an interesting relationship worth mentioning. If so, you can explore further and visualize what you have found

A relationship I wanted to look into was faculty_salary_avg and admission_rate. I chose these as I wanted to see if the most exclusive colleges pay their professors the most. I predict this will most likely be the case, but I want to see if maybe there is more connections to be made.

I decided to make a scatterplot to visualize the data.

ggplot(data = college, aes(x = faculty_salary_avg, y = admission_rate)) +
  geom_point()+
  geom_smooth()+
  labs(title = "Faculty Average Salary vs. Admission Rate", x = "Average Salary", y = "Admission Rate" )
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

This scatter plot with the smoothing line tells me that there is a visible relationship between faculty salary and admission rate for the universities. The scatterplot suggests that the salary of a professor generally increases when the admission rate of a school is lower. It also shows me that admission rates matter less in the salary of faculty when the rates are near the median value.

Question 5. Visualize the association between some variable pairs (of your choice) conditional on some other variables (of your choice). This is similar to the previous questions but your visualization involves more than two variables.

I now want to see if SAT scores effect the admission rates of schools. In order to involve more than two variables, I will be comparing this relationship with Public and Private schools.

I will create another scatterplot to visualize this relationship.

ggplot(data = college, aes(x = sat_avg, y = admission_rate, color = control))+
  geom_point()+
  geom_smooth()+
  labs(title = "SAT Scores vs. Admission Rate", subtitle = "Public vs. Private Universities", x = "Average SAT Score", y = "Admission Rate")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

After reviewing the results, it seems public and private universities are very similar in admissions. For both, SAT scores matter less in admission for colleges with lower averages of SAT scores. Similarly, SAT scores above 800 seem to directly correlate with admission rates for both private and public universities, as the SAT scores become more important in influencing admission rates. This visualization overall shows that SAT scores can be a good predictor on how selective a university is.

Question 6. Propose three questions you are interested in about this data set, and then answer these questions using summary statistics and visualization.

Question 6a. Are public universities harder to get into?

For this question, I want to gauge whether admission rates on average are higher or lower when comparing private and public universities. To see this, I want to create a boxplot comparing these two variables.

ggplot(data = college, aes(x = control, y = admission_rate))+
  geom_boxplot()+
  stat_summary(fun = "mean")+
  labs(title = "Admission Rates of Public vs Private", x = "School Style", y = "Admission Rate")
## Warning: Removed 2 rows containing missing values (geom_segment).

This output shows that in general, Private institutions have lower admission rates than its public counterparts. This is shown by private universities having lower mean and medium admission rates than public universities. Even though its close, one can make the argument that private colleges are more selective in admissions.

Question 6b. Do larger universities have higher expenses?

I want to see if the larger a university is, the higher the expenses are for the said university. To analyze this, I will compare the number of undergraduates and the faculty salary avg of the school. To visualize this relationship, I will use a scatter plot.

ggplot(data = college, aes(x = undergrads, y = faculty_salary_avg))+
  geom_point()+
  geom_smooth()+
  scale_x_log10()+
  labs(title = "Number of Undergraduates v. Average Faculty Salary", x = "Number of Undergraduates", y = "Average Faculty Salary")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

This output tells me that there is a relationship between the number of undergraduates and average faculty salary. The smooth line is not that steep, however, which means that universities with a higher number of undergraduates only pay their faculty slightly more than schools with a medium amount of undergraduates.

Question 6c. How much does highest degree choice offered effect student debt accumulation?

For my final question, I wanted to see if highest degree choice offered has an impact on debt accumulation. I also want to compare this between private and public universities to see if the type of school has an impact as well. To visualize this, I decided to make a barchart.

ggplot(data = college, aes(x = highest_degree, y = median_debt, fill = control))+
  geom_bar(position = "dodge", stat = "summary", fun = "mean")+
  labs(title = "Debt v. Highest Degree", subtitle = "Public v. Private Colleges", x = "Highest Degree", y = "Median Debt")

The barplot shows that students attending schools that offer an associates degree as its highest degree accumulate much less debt than students who attend schools that offer a bachelors or graduate degree as its highest degree. The barchart also shows that students who attend schools that offer a bachelors degree as its highest degree accumulate about the same debt as students who attend schools which offer a graduate degree as its highest degree. This means that the only real effect highest degree offered has on debt accumulation is whether or not the university only offers an associates degree as its highest degree. As for private vs. public, the barchart also shows that students who attend private schools tend to incur more debt in all three categories than students who attend its public equivalent.