#Read CSV File

college <- read.csv("college.csv", header = TRUE, stringsAsFactors = TRUE)
summary(college)

##        id                          name                city          state    
##  Min.   :100654   Westminster College:   3   New York    :  15   PA     :101  
##  1st Qu.:153250   Anderson University:   2   Boston      :  11   NY     : 84  
##  Median :186283   Aquinas College    :   2   Chicago     :  10   CA     : 71  
##  Mean   :186988   Bethany College    :   2   Philadelphia:   9   TX     : 63  
##  3rd Qu.:215284   Bethel University  :   2   Cleveland   :   8   OH     : 52  
##  Max.   :484905   Emmanuel College   :   2   Los Angeles :   8   IL     : 47  
##                   (Other)            :1256   (Other)     :1208   (Other):851  
##        region      highest_degree    control      gender     admission_rate  
##  Midwest  :353   Associate:  20   Private:763   CoEd :1237   Min.   :0.0509  
##  Northeast:299   Bachelor : 200   Public :506   Men  :   4   1st Qu.:0.5339  
##  South    :459   Graduate :1049                 Women:  28   Median :0.6687  
##  West     :158                                               Mean   :0.6501  
##                                                              3rd Qu.:0.7859  
##                                                              Max.   :1.0000  
##                                                                              
##     sat_avg       undergrads       tuition      faculty_salary_avg
##  Min.   : 720   Min.   :   47   Min.   : 2732   Min.   : 1451     
##  1st Qu.: 973   1st Qu.: 1296   1st Qu.: 8970   1st Qu.: 6191     
##  Median :1040   Median : 2556   Median :20000   Median : 7272     
##  Mean   :1060   Mean   : 5629   Mean   :21025   Mean   : 7656     
##  3rd Qu.:1120   3rd Qu.: 6715   3rd Qu.:30364   3rd Qu.: 8671     
##  Max.   :1545   Max.   :52280   Max.   :51008   Max.   :20650     
##                                                                   
##  loan_default_rate  median_debt         lon               lat       
##  0.057  :  32      Min.   : 6056   Min.   :-157.92   Min.   :19.71  
##  0.04   :  23      1st Qu.:21250   1st Qu.: -94.17   1st Qu.:35.22  
##  0.046  :  22      Median :24589   Median : -84.89   Median :39.74  
##  0.027  :  21      Mean   :23483   Mean   : -88.29   Mean   :38.61  
##  0.035  :  19      3rd Qu.:27000   3rd Qu.: -78.63   3rd Qu.:41.81  
##  0.038  :  19      Max.   :41000   Max.   : -68.59   Max.   :61.22  
##  (Other):1133

#Load tidyverse

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.5     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

#Question 1 Find the number of variables and observations in the data

I will use the ncol function to find the number of columns, which represents the number of variables in the data set. I will use the ls function to list the names of those variables; ls returns them in alphabetical order rather than file order. The NROW function will show the number of observations (rows) in the data set.

ncol(college)

## [1] 17

ls(college)

##  [1] "admission_rate"     "city"               "control"           
##  [4] "faculty_salary_avg" "gender"             "highest_degree"    
##  [7] "id"                 "lat"                "loan_default_rate" 
## [10] "lon"                "median_debt"        "name"              
## [13] "region"             "sat_avg"            "state"             
## [16] "tuition"            "undergrads"

NROW(college)

## [1] 1269

There are 17 variables in this data set: admission rate, city, control, faculty salary average, gender, highest degree, id, lat, loan default rate, lon, median debt, name, region, SAT average, state, tuition, and undergrads. There are 1269 observations in this data set.
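The same counts can be read in a single call. The sketch below is a minimal illustration on a small made-up data frame (`college_demo` and its values are hypothetical stand-ins, not rows from college.csv). `dim()` returns the row and column counts together, and `names()` lists the columns in file order, whereas `ls()` sorts them alphabetically.

```r
# Hypothetical stand-in for the college data (values are made up).
college_demo <- data.frame(
  id = c(1, 2, 3),
  name = c("A College", "B University", "C College"),
  tuition = c(9000, 21000, 30500)
)

dims <- dim(college_demo)          # c(number of rows, number of columns)
n_obs <- dims[1]
n_vars <- dims[2]
var_names <- names(college_demo)   # column names in their original order

cat("Observations:", n_obs, "\n")
cat("Variables:", n_vars, "\n")
print(var_names)
```

On the real data, the equivalent calls would be dim(college) and names(college).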

#Question 2 Are there any missing values in the data? If so, can you show/visualize how the data is missing?

# I used the is.na function to find the number of NA values in the data set, and wrapped it in the sum function to get a single total instead of a long table of TRUE/FALSE values.

sum(is.na(college))

## [1] 0

There are no missing values in the data set.
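A single total of zero does not show where values would be missing, and is.na() only catches true NA values. In the summary at the top, loan_default_rate was read as a factor rather than a number, which can happen when a numeric column contains text placeholders; whether that is the case here is an assumption that would need checking. A minimal sketch on made-up data (the `demo` data frame and the "NULL" placeholder are hypothetical):

```r
# Hypothetical data: one true NA and one text placeholder hiding in a factor.
demo <- data.frame(
  sat_avg = c(1040, NA, 1120, 980),
  loan_default_rate = c("0.057", "NULL", "0.04", "0.046"),
  stringsAsFactors = TRUE
)

# Missing values per column rather than one grand total.
na_per_column <- colSums(is.na(demo))
print(na_per_column)

# Convert the factor to numbers via character; text placeholders become NA.
rate_numeric <- suppressWarnings(as.numeric(as.character(demo$loan_default_rate)))
hidden_missing <- sum(is.na(rate_numeric))
cat("Non-numeric entries in loan_default_rate:", hidden_missing, "\n")

# Simple visual: bar chart of the NA count in each column.
barplot(na_per_column, main = "Missing Values per Column", ylab = "Count of NA")
```

On the real data, colSums(is.na(college)) would give the per-column counts, and the same conversion check could be applied to college$loan_default_rate.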

#Question 3 For each variable in the data, please describe what you observe, such as some summary statistics, their distributions, etc.

# I am going to start off by creating and observing the summary of the data set.

summary(college)

# Something that I noticed is that a lot of the distributions are difficult to understand just by looking at the statistics. An easier way for me to visualize the data for some of these variables is to create a box plot or a histogram. I am going to create a box plot for the SAT average and a histogram for the tuition. My prediction is that the SAT scores have a fairly even distribution and that the tuition has a negative skew.

boxplot(college$sat_avg, col = "green", xlab = "All Schools", ylab = "AVG SAT Score", main = "Distribution of the Average SAT Score")

# From this box plot, we can see that the distribution is fairly symmetric, with a slight positive skew. One thing that stuck out to me is that there are far more outliers above the upper whisker than below the lower whisker. This makes sense to me because I hear about a lot more over-achieving schools than under-achieving schools; under-achieving schools are usually forced to shut down. Next, we will look at the histogram of the tuition.

hist(college$tuition, xlab = "College Tuition in Dollars", ylab = "Number of Schools", main = "Distribution of College Tuition", col = "blue")

#This histogram shows a slightly positive skew, not the negative skew I predicted.
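Skewness can also be checked numerically instead of by eye. For tuition, the summary shows a mean (21025) above the median (20000), which is consistent with a positive skew. The sketch below computes a moment-based sample skewness with a small helper; `sample_skewness` is a hypothetical name, not a base R function, and the tuition vector is made up.

```r
# Moment-based skewness: mean cubed deviation divided by the cubed
# (population) standard deviation. Positive means a longer right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}

# Hypothetical tuition values with a few expensive schools.
tuition_demo <- c(3000, 5000, 8000, 9000, 12000, 20000, 31000, 45000)
skew <- sample_skewness(tuition_demo)

cat("Mean:", mean(tuition_demo), " Median:", median(tuition_demo), "\n")
cat("Skewness:", round(skew, 3), "\n")
```

On the real data this would be called as sample_skewness(college$tuition) and sample_skewness(college$sat_avg).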

#Question 4 Visualize the association between two variables.

# To visualize two variables I will use a scatter plot of tuition and median debt to see if there is a correlation between the two.

ggplot(data = college, aes(x = tuition, y = median_debt)) + geom_point() + geom_smooth () + labs(title = "Tuition vs Median Debt")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# This message is informational, not an error: geom_smooth() was called without a method, so ggplot2 chose a GAM smoother, its default when there are 1,000 or more observations.

#The scatter plot shows tuition vs median debt. The results from this plot were surprising to me. I was expecting higher tuition to result in higher debt. This was not the case, however. The median debt was roughly constant across the plot, which indicates that median debt has little correlation with tuition cost.
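A correlation coefficient puts a number on what the scatter plot suggests. This is a minimal sketch using made-up vectors in place of college$tuition and college$median_debt (all values are hypothetical). A Pearson coefficient near 0 indicates little linear association, while values near 1 or -1 indicate a strong one.

```r
# Hypothetical tuition and median debt values for six schools.
tuition_demo <- c(5000, 10000, 15000, 20000, 25000, 30000)
debt_demo    <- c(22000, 25000, 21000, 26000, 23000, 24000)

# Pearson correlation; use = "complete.obs" drops pairs with an NA.
r <- cor(tuition_demo, debt_demo, use = "complete.obs")
cat("Pearson correlation:", round(r, 3), "\n")

# cor.test() adds a p-value for the null hypothesis of zero correlation.
result <- cor.test(tuition_demo, debt_demo)
cat("p-value:", round(result$p.value, 3), "\n")
```

On the real data the call would be cor(college$tuition, college$median_debt). A curved relationship can still have a low Pearson value, so the number complements the plot rather than replacing it.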

#Question 5 Visualize the association between some variable pairs (of your choice) conditional on some other variables (of your choice). This is similar to the previous questions but your visualization involves more than two variables.

# I am going to create another scatter plot. This time I will be visualizing three variables: the tuition cost, the admission rate, and whether the college is public or private.

ggplot(data = college, aes(x = tuition, y = admission_rate, color = control)) + geom_point() + geom_smooth() + labs(title = "Tuition vs Admission Rate by Control")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#In this scatter plot we can see that most public colleges have lower tuition and a higher acceptance rate. We can see that private colleges are more expensive than public colleges. We can also see that when private colleges are very expensive, they have a very low acceptance rate.
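The public vs private comparison can be backed with group means. The sketch below uses base R's aggregate() on a made-up data frame (`demo` and its values are hypothetical) to compute the mean tuition and mean admission rate for each control type.

```r
# Hypothetical schools: three private, three public.
demo <- data.frame(
  control = c("Private", "Private", "Private", "Public", "Public", "Public"),
  tuition = c(30000, 42000, 25000, 8000, 11000, 9500),
  admission_rate = c(0.55, 0.20, 0.70, 0.75, 0.80, 0.65)
)

# One row per control type, with the mean of each numeric column.
by_control <- aggregate(cbind(tuition, admission_rate) ~ control,
                        data = demo, FUN = mean)
print(by_control)
```

On the real data, replacing `demo` with `college` in the aggregate() call would give the same table for all 1269 schools.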

#Question 6 Propose three questions you are interested in about this data set, and then answer these questions using summary statistics and visualization. Some example questions can be: Are private universities harder to get into? Is there an association between SAT and admission rate? Are west coast universities cheaper than east coast universities? Please propose your own questions and do not use the exact same questions listed above.

My Question 1: Do different regions vary in SAT averages?

ggplot(data = college, aes(x = region, y = sat_avg))+
  geom_boxplot()+
  # geom = "point" draws only the mean; the default pointrange geom also needs ymin/ymax and warns without them
  stat_summary(fun = "mean", geom = "point")+
  labs(title = "Distribution of SAT Average Per Region", x = "Region", y = "SAT Average")

# In this box plot, each region is represented by a box that shows the distribution of the SAT averages of its colleges. There is definitely variation in SAT average across regions. Looking at the graph, we see a positive skew in both the Midwest and the Northeast, and roughly symmetric distributions in the South and the West. The strong positive skew in the Northeast may be explained by the large number of selective schools there, which require a higher SAT score to get into.
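The skew read from the boxes can be cross-checked by comparing the mean and median within each region: a mean well above the median points to a right skew. A minimal sketch with tapply() on made-up values (the `demo` data frame is hypothetical):

```r
# Hypothetical SAT averages; the Northeast has one very selective school.
demo <- data.frame(
  region = c("Midwest", "Midwest", "Midwest",
             "Northeast", "Northeast", "Northeast"),
  sat_avg = c(980, 1020, 1100, 1000, 1080, 1400)
)

# Apply median() and mean() to sat_avg separately within each region.
region_median <- tapply(demo$sat_avg, demo$region, median)
region_mean   <- tapply(demo$sat_avg, demo$region, mean)

print(rbind(median = region_median, mean = region_mean))
```

On the real data the same two calls would use college$sat_avg and college$region.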

My Question 2: Do people with a higher SAT score attain more advanced degrees?

ggplot(data = college, aes(x = highest_degree, y = sat_avg))+
  geom_bar(stat = "summary", fun = "mean")+
  labs(title = "SAT Average for Each Degree", x = "Degree", y = "SAT Average")

Although you do not need an SAT score to get into a graduate program, I thought it would be interesting to compare each degree level to see if students with higher SAT averages went further in schooling. This turned out to be true in the data: schools whose highest degree is a graduate degree had the highest average SAT score, bachelor's-level schools were second, and associate-level schools were third. Because sat_avg is a school-level average and highest_degree is the highest degree a school offers, this suggests, rather than proves, that individuals with higher SAT scores tend to pursue more advanced degrees.

My Question 3: Does the number of students affect the average faculty salary?

ggplot(data = college, aes(x = undergrads, y = faculty_salary_avg))+
  geom_point()+
  geom_smooth()+
  labs(title = "Undergrads vs Faculty Salaries", x = "Number of Undergrads", y = "Average Faculty Salary")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

There are many other factors to consider when looking at the correlation between the number of undergrads and the average faculty salary. I wanted to look at this relationship anyway to see if faculty at larger schools were paid more (or less). This was not the case. The average faculty salary did not increase or decrease based on the number of undergrads.
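One way to back the visual impression with numbers is to fit a simple linear regression and look at the slope and R-squared. This is a sketch on made-up values (both vectors are hypothetical), using base R's lm(); a slope near zero and a small R-squared would support the reading that salary does not change with enrollment.

```r
# Hypothetical enrollment and average faculty salary for six schools.
undergrads_demo <- c(500, 1500, 3000, 6000, 12000, 25000)
salary_demo     <- c(7100, 7400, 6900, 7600, 7200, 7500)

# Straight-line fit of salary on enrollment.
fit <- lm(salary_demo ~ undergrads_demo)
slope <- coef(fit)[["undergrads_demo"]]
r_squared <- summary(fit)$r.squared

cat("Slope (salary change per additional undergrad):", signif(slope, 3), "\n")
cat("R-squared:", round(r_squared, 3), "\n")
```

On the real data the model would be lm(faculty_salary_avg ~ undergrads, data = college).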

Case 1 Descriptive Analytics