Case 4

1. Q: For college.csv, how many variables and how many observations are in the data? Explain what these variables are according to your understanding of the data.

A: The college.csv dataset has 17 variables (columns) and 1,269 observations (rows). The variables are:

id: institution ID number
name: name of the institution
city: city in which the institution is located
state: state in which the institution is located
region: region in which the institution is located
highest_degree: highest degree attainable
control: whether the institution is public or private
gender: whether the institution is coed, women's, or men's
admission_rate: rate at which applicants are admitted
sat_avg: average SAT score of admitted students
undergrads: number of undergraduate students attending
tuition: average tuition per student
faculty_salary_avg: average faculty salary
loan_default_rate: rate at which students default on their loans
median_debt: median debt of students who graduate
lon: longitude
lat: latitude

To find this, we load the file with read.csv() and call dim(), which returns the number of rows and the number of columns. The variable names themselves can be listed with names(college).

library(readr)
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v dplyr   1.0.7
## v tibble  3.1.4     v stringr 1.4.0
## v tidyr   1.1.3     v forcats 0.5.1
## v purrr   0.3.4

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(tidyr)
library(RColorBrewer)
library(ggridges)
library(viridis)

## Loading required package: viridisLite

library(viridisLite)
library(ggsci)
library(grid)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(hexbin)
library(treemapify)
library(ggplot2)

college <- read.csv("college.csv")

dim(college)

## [1] 1269   17
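Since dim() reports only the two counts, it can help to look at the variable names and types as well. The following is a minimal sketch in base R; the data frame toy and all of its values are invented stand-ins for illustration and are not taken from college.csv.

```r
# Toy stand-in for college.csv; the values are invented for illustration only.
toy <- data.frame(
  name    = c("A College", "B University", "C Institute"),
  control = c("Public", "Private", "Private"),
  tuition = c(9000, 32000, 41000),
  stringsAsFactors = FALSE
)

n_obs     <- nrow(toy)           # number of observations (rows)
n_vars    <- ncol(toy)           # number of variables (columns)
var_names <- names(toy)          # variable names, in column order
var_types <- sapply(toy, class)  # storage type of each variable

print(c(observations = n_obs, variables = n_vars))
print(var_types)
```

On the real data, the same calls would be applied to college instead of toy; str(college) gives a similar overview in one call.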

2. Q: Are there missing values in the data? If so, can you show/visualize how data is missing?

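A: There are no missing values in the dataset. This can be checked with count(college, is.na(college)), which counts the rows for each distinct pattern of missing (TRUE) and non-missing (FALSE) values. The output has a single row in which every column is FALSE and n is 1269, so all 1,269 observations are complete and there is no missingness to visualize.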

count(college, is.na(college))

##   is.na(college).id is.na(college).name is.na(college).city
## 1             FALSE               FALSE               FALSE
##   is.na(college).state is.na(college).region is.na(college).highest_degree
## 1                FALSE                 FALSE                         FALSE
##   is.na(college).control is.na(college).gender is.na(college).admission_rate
## 1                  FALSE                 FALSE                         FALSE
##   is.na(college).sat_avg is.na(college).undergrads is.na(college).tuition
## 1                  FALSE                     FALSE                  FALSE
##   is.na(college).faculty_salary_avg is.na(college).loan_default_rate
## 1                             FALSE                            FALSE
##   is.na(college).median_debt is.na(college).lon is.na(college).lat    n
## 1                      FALSE              FALSE              FALSE 1269
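A more compact way to run the same check is to count missing values per column. The sketch below uses base R on a small invented data frame (toy, with NA values placed deliberately so the counts are visible); it is an alternative illustration, not the method used in this report.

```r
# Toy data with deliberately placed NAs; values are invented for illustration.
toy <- data.frame(
  sat_avg = c(1100, NA, 980, 1250),
  tuition = c(9000, 32000, NA, NA)
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column.
na_per_column <- colSums(is.na(toy))

# complete.cases() is TRUE for rows with no missing value in any column.
n_complete <- sum(complete.cases(toy))

print(na_per_column)
print(n_complete)
```

Applied to the real data, colSums(is.na(college)) should return zero for every column if the dataset is complete. Note that this only detects values R has read as NA; by default read.csv() leaves empty text fields as empty strings rather than NA.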

3. Q: Pick one continuous variable and visualize its distribution. Pick one discrete variable and visualize its distribution. For continuous variables, you can choose from histogram, density plot, violin, and many others. For discrete variables, you can choose from barplot and many others. Describe what you observe in the visualization. The continuous variables are admission_rate, sat_avg, undergrads, tuition, faculty_salary_avg, loan_default_rate, median_debt, lon, and lat. The discrete variables are name, city, state, region, highest_degree, control, and gender.

Plot 1: I picked the sat_avg variable and plotted it as a histogram. The distribution looks approximately normal, and the most frequent scores lie between 1000 and 1200.

Plot 2: The second variable I picked was state, shown as a bar plot of the number of institutions in each state. The state with the most colleges was Pennsylvania, followed by New York and California. The states with the fewest colleges were Alaska, Wyoming, and Nevada.

I created these plots with ggplot() together with geom_histogram() and geom_bar().

g1 <- ggplot(college, aes(x=sat_avg))+
  geom_histogram(binwidth = 10)+
  ggtitle("Distribution of SAT Average")+
  labs(x="SAT Average")
g1

g2 <- ggplot(college, aes(y=state))+
  geom_bar()+
  ggtitle("Count of Institutions per State")+
  labs(x="Count of Colleges",
      y="States")
g2
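The ranking of states read off the bar plot can also be checked numerically by tabulating and sorting the counts. This is a minimal base R sketch; the vector states is made up for illustration and does not reflect the real counts in college.csv.

```r
# Toy state codes; invented for illustration, not taken from college.csv.
states <- c("PA", "NY", "PA", "CA", "NY", "PA", "AK")

# table() counts each distinct value; sort() orders the counts.
state_counts <- sort(table(states), decreasing = TRUE)

most_common  <- head(state_counts, 2)  # states with the most institutions
least_common <- tail(state_counts, 2)  # states with the fewest institutions

print(most_common)
print(least_common)
```

On the real data, the equivalent call would be sort(table(college$state), decreasing = TRUE).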

4. Q: Pick three pairs of variables (i.e., continuous vs continuous, continuous vs discrete, and discrete vs discrete). For each pair of variables, visualize the association between them. Hopefully, there are some interesting patterns in your visualization. Describe what you observe in the visualization.

Plot 1: I chose sat_avg and admission_rate (two continuous variables) and used a scatter plot to visualize them. In general, the average SAT score decreases as the admission rate approaches 1.

Plot 2: For the second plot I chose tuition and region (continuous vs. discrete) and used a box-and-whisker plot. The region with the highest median tuition is the Northeast, and the region with the lowest is the South. The Midwest has the second-highest median tuition, ahead of the West.

Plot 3: For the third plot I chose region and control (two discrete variables) and used a bar graph. The region with the most public and private schools is the South, and the region with the fewest is the West.

I created these visualizations with ggplot() together with geom_point(), geom_boxplot(), and geom_bar().

g3 <- ggplot(college)+
  geom_point(aes(x=admission_rate, y=sat_avg))+
  ggtitle("SAT Average vs Admission Rate")+
  labs(x="Admission Rates",
       y="SAT Average")
g3

g4 <- ggplot(college)+
  geom_boxplot(aes(x=tuition, y=region))+
  ggtitle("Regional Tuition Statistics in the United States")+
  labs(x="Average Tuition",
       y="Region")
g4

g5 <- ggplot(college)+
  geom_bar(aes(x=region, fill=control))+
  ggtitle("Amount of Public or Private Institutions by Region")+
  labs(x="Region",
       y="Count of Colleges")
g5
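Each of the three plots above has a simple numeric counterpart that can back up what is read off the picture: a correlation for the scatter plot, group medians for the box plot, and a cross-tabulation for the stacked bar graph. The sketch below shows all three in base R on a small invented data frame; the values in toy are assumptions for illustration and say nothing about the real data.

```r
# Toy data; all values are invented for illustration only.
toy <- data.frame(
  region         = c("Northeast", "Northeast", "South", "South", "West", "West"),
  control        = c("Private", "Public", "Public", "Public", "Private", "Public"),
  admission_rate = c(0.20, 0.45, 0.70, 0.90, 0.55, 0.80),
  sat_avg        = c(1400, 1250, 1050, 980, 1150, 1020),
  tuition        = c(42000, 15000, 8000, 9500, 30000, 11000)
)

# Continuous vs continuous: Pearson correlation (negative means one falls as the other rises).
r <- cor(toy$admission_rate, toy$sat_avg)

# Continuous vs discrete: median tuition within each region.
median_tuition <- tapply(toy$tuition, toy$region, median)

# Discrete vs discrete: number of institutions per region and control type.
counts <- table(toy$region, toy$control)

print(r)
print(median_tuition)
print(counts)
```

On the real data, these calls would use the columns of college in place of toy.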

5. Q: Visualize the association between a pair of variables (of your choice) conditional on a third variable (of your choice). Describe what you observe in the visualization. This is similar to the previous questions but your visualization involves more than two variables. You can often use color, shape, size, facet to represent the third variable.

A: I plotted SAT average against admission rate in a scatter plot and mapped the size of each point to the number of undergraduates currently attending. Because the dataset is large and covers 50 states, I used filter() to keep only institutions in the state of Ohio and stored the result in a new data table, d1. As the admission rate increases, the SAT average decreases. In general, schools with a higher SAT average also have a larger undergraduate population, with a few outliers. I created this visualization with filter(), ggplot(), and geom_point().

d1 <- filter(college, state == "OH")

g6 <- ggplot(d1)+
  geom_point(aes(x=admission_rate, y=sat_avg, size=undergrads), alpha = 0.3)+
  ggtitle("SAT Average vs. Admission Rate in the State of Ohio")+
  labs(x="Admission Rate",
       y="SAT Average")
g6

6a. Q: What is the relationship between the number of undergraduates enrolled and the cost of tuition across the different regions of the United States?

A: In general, in every region, the fewer undergraduate students an institution enrolls, the higher its tuition. The Northeast has quite a few outliers to that trend, but overall it resembles the other regions. In addition, the Northeast and Midwest appear to contain more institutions with high average tuition than the South and the West combined.

I created this plot with ggplot() and geom_treemap(). I used the undergrads variable as the area of each square on the treemap and grouped the squares by the region variable. Finally, I used tuition to color each square according to how expensive the average tuition is at each college or university.

g7 <- ggplot(college,
       aes(area = undergrads,
           subgroup = region,
           fill = tuition))+
  geom_treemap()+
  geom_treemap_subgroup_border(color = "red", size = 5)+ 
  geom_treemap_text(aes(label=name),colour = "white", place = "topleft", reflow = TRUE, size=10)+
  geom_treemap_subgroup_text(place = "topright", grow = FALSE, alpha = 1.0, colour ="black")+
  scale_fill_viridis_c()
g7

6b. Q: How does the distribution of the admission_rate affect the average cost of tuition for public and private institutions?

A: In general, for both public and private institutions, tuition tends to decrease as the admission rate increases (gets closer to 1.00).

Plot 1 (g8) is for public institutions, and Plot 2 (g9) is for private institutions. I first split the college dataset by the control variable using filter(). I then used ggplot() and geom_hex() to create the two plots, and grid.arrange() to combine them into one visualization.

# Split the data by institution type
d2 <- filter(college, control == "Public")
d3 <- filter(college, control == "Private")

# Hexbin plot of tuition against admission rate for each type
g8 <- ggplot(d2, aes(x=admission_rate, y=tuition))+
  geom_hex(bins = 20)+
  ggtitle("Admission Rate vs. Tuition for Public Institutions")+
  labs(x="Admission Rate",
       y="Tuition")
g9 <- ggplot(d3, aes(x=admission_rate, y=tuition))+
  geom_hex(bins = 20)+
  ggtitle("Admission Rate vs. Tuition for Private Institutions")+
  labs(x="Admission Rate",
       y="Tuition")

grid.arrange(g8, g9)
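The direction of the trend in each panel can also be summarized with one correlation per group. The following base R sketch splits an invented data frame by control and computes the correlation between admission rate and tuition within each group; toy and its values are assumptions for illustration only.

```r
# Toy data; all values are invented for illustration only.
toy <- data.frame(
  control        = rep(c("Public", "Private"), each = 4),
  admission_rate = c(0.40, 0.60, 0.80, 0.90, 0.20, 0.50, 0.70, 0.90),
  tuition        = c(14000, 11000, 9000, 8000, 48000, 36000, 30000, 24000)
)

# split() makes one data frame per control type; sapply() applies cor() to each.
r_by_control <- sapply(
  split(toy, toy$control),
  function(d) cor(d$admission_rate, d$tuition)
)

print(r_by_control)
```

A negative value for a group corresponds to tuition falling as the admission rate rises in that group. On the real data, the same pattern would use college in place of toy.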

Jordan

11/8/2021
