#Read CSV File
college <- read.csv("college.csv", header = TRUE, stringsAsFactors = TRUE)
summary(college)
## id name city state
## Min. :100654 Westminster College: 3 New York : 15 PA :101
## 1st Qu.:153250 Anderson University: 2 Boston : 11 NY : 84
## Median :186283 Aquinas College : 2 Chicago : 10 CA : 71
## Mean :186988 Bethany College : 2 Philadelphia: 9 TX : 63
## 3rd Qu.:215284 Bethel University : 2 Cleveland : 8 OH : 52
## Max. :484905 Emmanuel College : 2 Los Angeles : 8 IL : 47
## (Other) :1256 (Other) :1208 (Other):851
## region highest_degree control gender admission_rate
## Midwest :353 Associate: 20 Private:763 CoEd :1237 Min. :0.0509
## Northeast:299 Bachelor : 200 Public :506 Men : 4 1st Qu.:0.5339
## South :459 Graduate :1049 Women: 28 Median :0.6687
## West :158 Mean :0.6501
## 3rd Qu.:0.7859
## Max. :1.0000
##
## sat_avg undergrads tuition faculty_salary_avg
## Min. : 720 Min. : 47 Min. : 2732 Min. : 1451
## 1st Qu.: 973 1st Qu.: 1296 1st Qu.: 8970 1st Qu.: 6191
## Median :1040 Median : 2556 Median :20000 Median : 7272
## Mean :1060 Mean : 5629 Mean :21025 Mean : 7656
## 3rd Qu.:1120 3rd Qu.: 6715 3rd Qu.:30364 3rd Qu.: 8671
## Max. :1545 Max. :52280 Max. :51008 Max. :20650
##
## loan_default_rate median_debt lon lat
## 0.057 : 32 Min. : 6056 Min. :-157.92 Min. :19.71
## 0.04 : 23 1st Qu.:21250 1st Qu.: -94.17 1st Qu.:35.22
## 0.046 : 22 Median :24589 Median : -84.89 Median :39.74
## 0.027 : 21 Mean :23483 Mean : -88.29 Mean :38.61
## 0.035 : 19 3rd Qu.:27000 3rd Qu.: -78.63 3rd Qu.:41.81
## 0.038 : 19 Max. :41000 Max. : -68.59 Max. :61.22
## (Other):1133
#Load tidyverse
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.5 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ncol(college)
## [1] 17
ls(college)
## [1] "admission_rate" "city" "control"
## [4] "faculty_salary_avg" "gender" "highest_degree"
## [7] "id" "lat" "loan_default_rate"
## [10] "lon" "median_debt" "name"
## [13] "region" "sat_avg" "state"
## [16] "tuition" "undergrads"
NROW(college)
## [1] 1269
#Question 2 Are there any missing values in the data? If so, can you show/visualize how the data is missing?
# I used is.na() to flag missing values and sum() to count them across the whole data set, which keeps the script short.
sum(is.na(college))
## [1] 0
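# A total of 0 only counts values that R already recognizes as NA. The summary above shows loan_default_rate read in as a factor rather than a number, which may mean some entries are text placeholders. The sketch below, on a made-up toy data frame (not the real college data), counts NAs per column and lists entries that do not parse as numbers. The names toy, na_per_column and hidden_missing are illustrative assumptions.

```r
# Toy stand-in for `college`; the values are invented for illustration only.
toy <- data.frame(
  sat_avg = c(1040, NA, 1120),
  loan_default_rate = c("0.057", "NULL", "0.04"),
  stringsAsFactors = TRUE
)

# Count missing values per column instead of one grand total.
na_per_column <- colSums(is.na(toy))

# Entries that are not NA but cannot be converted to a number.
rate_chr <- as.character(toy$loan_default_rate)
rate_num <- suppressWarnings(as.numeric(rate_chr))
hidden_missing <- rate_chr[is.na(rate_num) & !is.na(rate_chr)]

print(na_per_column)
print(hidden_missing)
```

# The same two checks could be run on college and college$loan_default_rate to see whether any placeholders are present.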
#Question 3 For each variable in the data, please describe what you observe, such as some summary statistics, their distributions, etc.
# I am going to start off by observing the summary of the data set, summary(college), shown above. Something I noticed is that many of the distributions are difficult to understand just by looking at the statistics. An easier way for me to visualize some of these variables is to create a box plot or a histogram. I am going to create a box plot for the SAT average and a histogram for the tuition. My prediction is that the SAT scores have a fairly even distribution and that the tuition has a negative skew.
boxplot(college$sat_avg, col = "green", xlab = "All Schools", ylab = "AVG SAT Score", main = "Distribution of the Average SAT Score")
# From this box plot, we can see a fairly normal distribution with a slight positive skew. One thing that stuck out to me is that there are far more outliers above the upper whisker than below the lower whisker. This makes sense to me because I hear about a lot more over-achieving schools than under-achieving schools; under-achieving schools are usually forced to shut down. Next, we will look at the histogram of the tuition.
hist(college$tuition, xlab = "College Tuition in Dollars", ylab = "Number of Schools", main = "Distribution of College Tuition", col = "blue")
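#This histogram shows a slightly positive skew, which is consistent with the mean tuition (21025) sitting above the median (20000) in the summary.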
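# Skew can also be checked with a number instead of by eye. The sketch below defines a simple moment-based skewness helper (a hypothetical function, not part of base R) and tries it on two made-up vectors. A positive value suggests a longer right tail, a value near zero suggests symmetry.

```r
# Moment-based sample skewness; `skewness` is an illustrative helper name.
skewness <- function(x) {
  x <- x[!is.na(x)]                      # drop missing values first
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5  # third moment over variance^(3/2)
}

right_tailed <- c(1, 2, 2, 3, 3, 3, 20)  # made-up data with one large value
symmetric <- c(1, 2, 3, 4, 5)            # made-up symmetric data

print(skewness(right_tailed))
print(skewness(symmetric))
```

# Calling skewness(college$tuition) and skewness(college$sat_avg) would give a comparable figure for the two plots.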
#Question 4 Visualize the association between two variables.
# I will use a scatter plot of tuition against median debt to see if there is a correlation between the two.
ggplot(data = college, aes(x = tuition, y = median_debt)) + geom_point() + geom_smooth () + labs(title = "Tuition vs Median Debt")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
# The message above only reports that geom_smooth() picked its default smoother (a GAM with formula y ~ s(x, bs = "cs")); it is not an error.
#The scatter plot shows tuition vs median debt. The results from this plot were surprising to me. I was expecting higher tuition to result in higher debt. This was not the case, however. The median debt stayed roughly constant across the plot, giving no indication that median debt is correlated with tuition cost.
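# A correlation coefficient can back up what the scatter plot suggests. The sketch below uses cor() on a made-up toy data frame (the numbers are not from the college data) to show the call. Values near 0 point to a weak linear relationship, values near 1 or -1 to a strong one.

```r
# Toy stand-in for `college`; values are invented for illustration only.
toy <- data.frame(
  tuition = c(5000, 10000, 20000, 30000, 40000),
  median_debt = c(21000, 25000, 23000, 27000, 24000)
)

# Pearson correlation, ignoring rows where either value is missing.
r <- cor(toy$tuition, toy$median_debt, use = "complete.obs")
print(r)
```

# On the real data the equivalent call would be cor(college$tuition, college$median_debt, use = "complete.obs").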
#Question 5 Visualize the association between some variable pairs (of your choice) conditional on some other variables (of your choice). This is similar to the previous questions but your visualization involves more than two variables.
# I am going to create another scatter plot, this time with three variables: tuition, admission rate, and whether the college is public or private.
ggplot(data = college, aes(x = tuition, y = admission_rate, color = control)) + geom_point() + geom_smooth() + labs(title = "Tuition vs Admission Rate by Control")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#In this scatter plot we can see that most public colleges have lower tuition and a higher acceptance rate. We can see that private colleges are more expensive than public colleges. We can also see that when public colleges are very expensive, they have a very low acceptance rate.
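# The comparison between public and private colleges can also be summarized in a small table. The sketch below uses base R aggregate() on a made-up toy data frame (not the real college data) to compute the median tuition and admission rate per control type. The names toy and by_control are illustrative assumptions.

```r
# Toy stand-in for `college`; values are invented for illustration only.
toy <- data.frame(
  control = c("Private", "Private", "Private", "Public", "Public", "Public"),
  tuition = c(30000, 34000, 26000, 9000, 11000, 7000),
  admission_rate = c(0.5, 0.6, 0.7, 0.7, 0.8, 0.9)
)

# Median of both variables within each control group.
by_control <- aggregate(cbind(tuition, admission_rate) ~ control,
                        data = toy, FUN = median)
print(by_control)
```

# Replacing toy with college would give the group medians behind the scatter plot.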
#Question 6 Propose three questions you are interested in about this data set, and then answer these questions using summary statistics and visualization. Some example questions can be: Are private universities harder to get in? Is there an association between SAT and admission rate? Are west coast universities cheaper than east coast universities? Please propose your own questions and do not use the exactly same questions listed above.
ggplot(data = college, aes(x = region, y = sat_avg))+
geom_boxplot()+
stat_summary(fun = "mean")+
labs(title = "Distribution of SAT Average Per Region", x = "Region", y = "SAT Average")
## Warning: Removed 4 rows containing missing values (geom_segment).
# In this box plot, each region is represented by a box showing the distribution of its colleges' SAT averages. There is definitely variation in SAT average between regions. Looking at the graph, we see a positive skew in both the Midwest and the Northeast, and a roughly normal distribution in the South and the West. The high positive skew in the Northeast is explained by the large number of selective schools, which require a higher SAT score to get into.
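# The "Removed 4 rows" warning printed with the box plot most likely comes from stat_summary() and not from missing SAT data: its default geom is a point with a range bar, and when only fun is supplied there is no ymin/ymax to draw, one row per region. A minimal sketch, on made-up toy data, that asks for a plain point instead:

```r
library(ggplot2)

# Toy stand-in for `college`; values are invented for illustration only.
toy <- data.frame(
  region = rep(c("Midwest", "South"), each = 4),
  sat_avg = c(980, 1020, 1050, 1300, 950, 1000, 1040, 1090)
)

# Box plot per region with the group mean overlaid as a single point.
p <- ggplot(data = toy, aes(x = region, y = sat_avg)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape = 18, size = 3) +
  labs(title = "Distribution of SAT Average Per Region",
       x = "Region", y = "SAT Average")
```

# Printing p draws the plot; the mean appears as a diamond inside each box.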
#My Question 2 Do people with a higher SAT score attain more advanced degrees?
ggplot(data = college, aes(x = highest_degree, y = sat_avg))+
geom_bar(stat = "summary", fun = "mean")+
labs(title = "SAT Average for Each Degree", x = "Degree", y = "SAT Average")
ggplot(data = college, aes(x = undergrads, y = faculty_salary_avg))+
geom_point()+
geom_smooth()+
labs(title = "Undergrads vs Faculty Salaries", x = "Number of Undergrads", y = "Average Faculty Salary")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'