To begin, I want to read the relevant file into R, and load the tidyverse.
college <- read.csv("college.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.5 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
I will now be able to work with the data set and explore with data analysis.
I next want to find out how many columns there are in the dataset.
ncol(college)
## [1] 17
This shows me that there are 17 columns in the college data set. These columns represent the number of variables.
To discover the variable names for analysis, I will use the colnames function.
colnames(college)
## [1] "id" "name" "city"
## [4] "state" "region" "highest_degree"
## [7] "control" "gender" "admission_rate"
## [10] "sat_avg" "undergrads" "tuition"
## [13] "faculty_salary_avg" "loan_default_rate" "median_debt"
## [16] "lon" "lat"
This output tells me the names of all the variables. These variables will help me make observations about the data set, and help me with my future analysis.
Finally, I want to find the number of observations (rows) in the data set. Here is the function I will use:
nrow(college)
## [1] 1269
This output tells me there are 1269 total observations in the data set.
First, I want to enter some code to see if there are any missing values in the data.
sum(is.na(college))
## [1] 0
This output tells me that there are no missing values in the data set. Using the previous code na.strings = "“, this makes any blank in the data become”NA". This allows me to search for any NA values, for which this data set has none.
I next want to look at the structure of the data.
str(college)
## 'data.frame': 1269 obs. of 17 variables:
## $ id : int 102669 101648 100830 101879 100858 100663 101480 102049 101709 100751 ...
## $ name : Factor w/ 1260 levels "Abilene Christian University",..: 9 522 44 1079 43 980 430 801 1065 938 ...
## $ city : Factor w/ 834 levels "Aberdeen","Abilene",..: 25 448 491 247 42 71 349 71 490 760 ...
## $ state : Factor w/ 51 levels "AK","AL","AR",..: 1 2 2 2 2 2 2 2 2 2 ...
## $ region : Factor w/ 4 levels "Midwest","Northeast",..: 4 3 3 3 3 3 3 3 3 3 ...
## $ highest_degree : Factor w/ 3 levels "Associate","Bachelor",..: 3 1 3 3 3 3 3 3 3 3 ...
## $ control : Factor w/ 2 levels "Private","Public": 1 2 2 2 2 2 2 1 2 2 ...
## $ gender : Factor w/ 3 levels "CoEd","Men","Women": 1 1 1 1 1 1 1 1 1 1 ...
## $ admission_rate : num 0.421 0.614 0.802 0.679 0.835 ...
## $ sat_avg : int 1054 1055 1009 1029 1215 1107 1041 1165 1070 1185 ...
## $ undergrads : int 275 433 4304 5485 20514 11383 7060 3033 2644 29851 ...
## $ tuition : int 19610 8778 9080 7412 10200 7510 7092 27324 10660 9826 ...
## $ faculty_salary_avg: int 5804 5916 7255 7424 9487 9957 6801 8367 7437 9667 ...
## $ loan_default_rate : Factor w/ 198 levels "0","0.002","0.003",..: 77 135 106 111 45 62 96 7 103 63 ...
## $ median_debt : num 23250 11500 21335 21500 21831 ...
## $ lon : num -149.9 -87.3 -86.3 -87.7 -85.5 ...
## $ lat : num 61.2 32.6 32.4 34.8 32.6 ...
One variable that stands out is the college_default_rate. It is shown as a factor, however, all of the observations seem to be numbers, which would make it a numeric variable.
To investigate this further, I want to see all the values of the variable.
levels(college$loan_default_rate)
## [1] "0" "0.002" "0.003" "0.004" "0.005" "0.006" "0.007" "0.008" "0.009"
## [10] "0.01" "0.011" "0.012" "0.013" "0.014" "0.015" "0.016" "0.017" "0.018"
## [19] "0.019" "0.02" "0.021" "0.022" "0.023" "0.024" "0.025" "0.026" "0.027"
## [28] "0.028" "0.029" "0.03" "0.031" "0.032" "0.033" "0.034" "0.035" "0.036"
## [37] "0.037" "0.038" "0.039" "0.04" "0.041" "0.042" "0.043" "0.044" "0.045"
## [46] "0.046" "0.047" "0.048" "0.049" "0.05" "0.051" "0.052" "0.053" "0.054"
## [55] "0.055" "0.056" "0.057" "0.058" "0.059" "0.06" "0.061" "0.062" "0.063"
## [64] "0.064" "0.065" "0.066" "0.067" "0.068" "0.069" "0.07" "0.071" "0.072"
## [73] "0.073" "0.074" "0.075" "0.076" "0.077" "0.078" "0.079" "0.08" "0.081"
## [82] "0.082" "0.083" "0.084" "0.085" "0.086" "0.087" "0.088" "0.089" "0.09"
## [91] "0.091" "0.092" "0.093" "0.094" "0.095" "0.096" "0.097" "0.098" "0.099"
## [100] "0.1" "0.101" "0.102" "0.103" "0.104" "0.105" "0.106" "0.107" "0.108"
## [109] "0.109" "0.11" "0.111" "0.113" "0.114" "0.115" "0.116" "0.117" "0.118"
## [118] "0.119" "0.12" "0.121" "0.122" "0.123" "0.124" "0.125" "0.126" "0.127"
## [127] "0.128" "0.129" "0.13" "0.131" "0.132" "0.133" "0.134" "0.135" "0.136"
## [136] "0.137" "0.138" "0.139" "0.14" "0.142" "0.143" "0.144" "0.147" "0.148"
## [145] "0.149" "0.151" "0.152" "0.153" "0.154" "0.155" "0.156" "0.157" "0.158"
## [154] "0.159" "0.16" "0.164" "0.166" "0.167" "0.169" "0.171" "0.172" "0.174"
## [163] "0.175" "0.176" "0.179" "0.18" "0.182" "0.184" "0.186" "0.187" "0.188"
## [172] "0.19" "0.192" "0.193" "0.196" "0.197" "0.2" "0.202" "0.204" "0.213"
## [181] "0.215" "0.217" "0.218" "0.22" "0.222" "0.23" "0.233" "0.236" "0.237"
## [190] "0.247" "0.259" "0.284" "0.298" "0.306" "0.311" "0.315" "0.334" "NULL"
This output shows that there is a null value for the variable, and not NA. This is what caused the variable to be a factor and not numeric. To solve this, I will make the null values to appear as NA. I’ll do this by slightly modifying the read code.
college <- read.csv("college.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "NULL")
This code has changed the college_default_rate variable from a factor into a numeric variable. Now I will check how many null values there are, as they are now listed as NA.
sum(is.na(college))
## [1] 2
This output tells me there are 2 missing values in th college_default_rate, making the data set total to 2.
The first step I want to take in answering this question is to get a summary of the data set.
summary(college)
## id name city state
## Min. :100654 Westminster College: 3 New York : 15 PA :101
## 1st Qu.:153250 Anderson University: 2 Boston : 11 NY : 84
## Median :186283 Aquinas College : 2 Chicago : 10 CA : 71
## Mean :186988 Bethany College : 2 Philadelphia: 9 TX : 63
## 3rd Qu.:215284 Bethel University : 2 Cleveland : 8 OH : 52
## Max. :484905 Emmanuel College : 2 Los Angeles : 8 IL : 47
## (Other) :1256 (Other) :1208 (Other):851
## region highest_degree control gender admission_rate
## Midwest :353 Associate: 20 Private:763 CoEd :1237 Min. :0.0509
## Northeast:299 Bachelor : 200 Public :506 Men : 4 1st Qu.:0.5339
## South :459 Graduate :1049 Women: 28 Median :0.6687
## West :158 Mean :0.6501
## 3rd Qu.:0.7859
## Max. :1.0000
##
## sat_avg undergrads tuition faculty_salary_avg
## Min. : 720 Min. : 47 Min. : 2732 Min. : 1451
## 1st Qu.: 973 1st Qu.: 1296 1st Qu.: 8970 1st Qu.: 6191
## Median :1040 Median : 2556 Median :20000 Median : 7272
## Mean :1060 Mean : 5629 Mean :21025 Mean : 7656
## 3rd Qu.:1120 3rd Qu.: 6715 3rd Qu.:30364 3rd Qu.: 8671
## Max. :1545 Max. :52280 Max. :51008 Max. :20650
##
## loan_default_rate median_debt lon lat
## Min. :0.00000 Min. : 6056 Min. :-157.92 Min. :19.71
## 1st Qu.:0.03500 1st Qu.:21250 1st Qu.: -94.17 1st Qu.:35.22
## Median :0.05500 Median :24589 Median : -84.89 Median :39.74
## Mean :0.06558 Mean :23483 Mean : -88.29 Mean :38.61
## 3rd Qu.:0.08300 3rd Qu.:27000 3rd Qu.: -78.63 3rd Qu.:41.81
## Max. :0.33400 Max. :41000 Max. : -68.59 Max. :61.22
## NA's :2
The first thing that stands out to me for the summary is the undergrad variable. The 3rd quartile has a value of 6715, while the max has a value of 52280. This means that the variable is most likely very positively skewed. To get a better visualization, I would need to graph it. I believe the best method would be to use a histogram.
ggplot(data = college, aes(x = undergrads))+
geom_histogram(binwidth = 1200, fill = "green", colour = "black")+
labs(title="Undergraduate Distribution")+
scale_x_continuous(name = "number of Undergraduates", breaks = c(0,5000,10000,15000,20000,25000,30000,35000,40000,45000,50000,55000))
This histogram confirms my suspicions of this variable being positively skewed. The histogram is showing that a small number of colleges have a much higher undergraduate populations than the rest of the schools. It also shows that most have an undergraduate population of 0-10,000 students.
I will next look at another variable to see if there is something that sticks out for me to investigate. One variable that caught my eye is the name variable. There are some names that have multiple entries. I’m not sure whether these colleges are double counted, or if they are different colleges with the same name. To find out the answer, I will run a code that shows if there are any duplicates.
sum(duplicated(college))
## [1] 0
This output tells me that there are no duplicate colleges in this data set, and they are indeed separate colleges with the same name.
A relationship I wanted to look into was faculty_salary_avg and admission_rate. I chose these as I wanted to see if the most exclusive colleges pay their professors the most. I predict this will most likely be the case, but I want to see if maybe there is more connections to be made.
I decided to make a scatterplot to visualize the data.
ggplot(data = college, aes(x = faculty_salary_avg, y = admission_rate)) +
geom_point()+
geom_smooth()+
labs(title = "Faculty Average Salary vs. Admission Rate", x = "Average Salary", y = "Admission Rate" )
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
This scatter plot with the smoothing line tells me that there is a visible relationship between faculty salary and admission rate for the universities. The scatterplot suggests that the salary of a professor generally increases when the admission rate of a school is lower. It also shows me that admission rates matter less in the salary of faculty when the rates are near the median value.
I now want to see if SAT scores effect the admission rates of schools. In order to involve more than two variables, I will be comparing this relationship with Public and Private schools.
I will create another scatterplot to visualize this relationship.
ggplot(data = college, aes(x = sat_avg, y = admission_rate, color = control))+
geom_point()+
geom_smooth()+
labs(title = "SAT Scores vs. Admission Rate", subtitle = "Public vs. Private Universities", x = "Average SAT Score", y = "Admission Rate")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
After reviewing the results, it seems public and private universities are very similar in admissions. For both, SAT scores matter less in admission for colleges with lower averages of SAT scores. Similarly, SAT scores above 800 seem to directly correlate with admission rates for both private and public universities, as the SAT scores become more important in influencing admission rates. This visualization overall shows that SAT scores can be a good predictor on how selective a university is.
For this question, I want to gauge whether admission rates on average are higher or lower when comparing private and public universities. To see this, I want to create a boxplot comparing these two variables.
ggplot(data = college, aes(x = control, y = admission_rate))+
geom_boxplot()+
stat_summary(fun = "mean")+
labs(title = "Admission Rates of Public vs Private", x = "School Style", y = "Admission Rate")
## Warning: Removed 2 rows containing missing values (geom_segment).
This output shows that in general, Private institutions have lower admission rates than its public counterparts. This is shown by private universities having lower mean and medium admission rates than public universities. Even though its close, one can make the argument that private colleges are more selective in admissions.
I want to see if the larger a university is, the higher the expenses are for the said university. To analyze this, I will compare the number of undergraduates and the faculty salary avg of the school. To visualize this relationship, I will use a scatter plot.
ggplot(data = college, aes(x = undergrads, y = faculty_salary_avg))+
geom_point()+
geom_smooth()+
scale_x_log10()+
labs(title = "Number of Undergraduates v. Average Faculty Salary", x = "Number of Undergraduates", y = "Average Faculty Salary")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
This output tells me that there is a relationship between the number of undergraduates and average faculty salary. The smooth line is not that steep, however, which means that universities with a higher number of undergraduates only pay their faculty slightly more than schools with a medium amount of undergraduates.
For my final question, I wanted to see if highest degree choice offered has an impact on debt accumulation. I also want to compare this between private and public universities to see if the type of school has an impact as well. To visualize this, I decided to make a barchart.
ggplot(data = college, aes(x = highest_degree, y = median_debt, fill = control))+
geom_bar(position = "dodge", stat = "summary", fun = "mean")+
labs(title = "Debt v. Highest Degree", subtitle = "Public v. Private Colleges", x = "Highest Degree", y = "Median Debt")
The barplot shows that students attending schools that offer an associates degree as its highest degree accumulate much less debt than students who attend schools that offer a bachelors or graduate degree as its highest degree. The barchart also shows that students who attend schools that offer a bachelors degree as its highest degree accumulate about the same debt as students who attend schools which offer a graduate degree as its highest degree. This means that the only real effect highest degree offered has on debt accumulation is whether or not the university only offers an associates degree as its highest degree. As for private vs. public, the barchart also shows that students who attend private schools tend to incur more debt in all three categories than students who attend its public equivalent.