Before we can begin the analysis we first have to download the packages that we will be using to run our analysis in R:
Tidyverse: Gives us a bunch of analysis tools combined into one seamless package. Overall, this package allows us to manipulate, wrangle, and visualize data for its intended purpose.
ggplot2: plotting package that provides helpful commands to create complex plots from data in a data frame
cowplot: The cowplot package is a simple add-on to ggplot.
GGally: ‘GGally’ is an extension of ‘ggplot2’ by adding several functions to reduce the complexity of combining geometric objects with transformed data.
scales: scales used to demonstrate ggplot2 style scales for specific types of data
DT: A simple package that lets us design data tables.
Dplyr: This package gives us a lot of data wrangling capabilities in order to narrow down observations to create in-depth analyses
Highchart: It’s an interactive graph, just explored with addition to the ggplot
library(tidyverse)
library(dplyr)
library(ggplot2)
library(cowplot)
library(scales)
library(DT)
library("GGally")
library("highcharter")
getwd()
setwd("C:\\Users\\swath\\Downloads")
df <- readRDS('college_expenses_and_enrollment.Rdata')
vtable::vtable(df)
| Name | Class | Label | Values |
|---|---|---|---|
| UNITID | numeric | College ID | Num: 100654 to 491118 |
| STABBR | factor | State | ‘AK’ ‘AL’ ‘AR’ ‘AS’ ‘AZ’ and more |
| type | factor | College Type (NP = Nonprofit) | ‘For-Profit 2-Year’ ‘For-Profit 4-Year’ ‘Private NP 2-Year’ ‘Private NP 4-Year’ ‘Public 2-Year’ and more |
| Total.Income | numeric | Total Income | Num: 6067 to 9703300000 |
| Total.Expenses | numeric | Total Expenses | Num: 4600 to 8629847000 |
| Tuition | numeric | Tuition (Income) | Num: 0 to 1753736000 |
| Federal | numeric | Federal Support (Income) | Num: 0 to 1071432208 |
| State | numeric | State Support (Income) | Num: 0 to 909496000 |
| Local | numeric | Local Support (Income) | Num: 0 to 386724142 |
| Private | numeric | Private Support (Income) | Num: -246016 to 1568865000 |
| Sales | numeric | Sales (Income) | Num: -1073664 to 8471746000 |
| Research | numeric | Research (Expense) | Num: 0 to 2910863000 |
| Public.Service | numeric | Public Service (Expense) | Num: -32 to 807399000 |
| Student.Services | numeric | Student Services (Expense) | Num: 0 to 284363000 |
| Instruction | numeric | Instruction (Expense) | Num: 0 to 2611938000 |
| Academic.Support | numeric | Academic Support (Expense) | Num: 0 to 919577000 |
| Other | numeric | Other Expenses | Num: -1 to 5543912000 |
| Institution.Name | character | NULL | |
| Total.Enrollment | numeric | NULL | Num: 9 to 131629 |
| Full.Time.Enrollment | numeric | NULL | Num: 0 to 131629 |
| Undergraduate.Enrollment | numeric | NULL | Num: 0 to 103711 |
Below is an interactive data table in order to explore the primary dataframe.
datatable(df)
colnames(df)
## [1] "UNITID" "STABBR"
## [3] "type" "Total.Income"
## [5] "Total.Expenses" "Tuition"
## [7] "Federal" "State"
## [9] "Local" "Private"
## [11] "Sales" "Research"
## [13] "Public.Service" "Student.Services"
## [15] "Instruction" "Academic.Support"
## [17] "Other" "Institution.Name"
## [19] "Total.Enrollment" "Full.Time.Enrollment"
## [21] "Undergraduate.Enrollment"
summary(df)
## UNITID STABBR type Total.Income
## Min. :100654 CA : 468 For-Profit 2-Year: 754 Min. :6.067e+03
## 1st Qu.:157674 NY : 324 For-Profit 4-Year: 477 1st Qu.:6.043e+06
## Median :203614 TX : 272 Private NP 2-Year: 159 Median :2.370e+07
## Mean :247693 PA : 239 Private NP 4-Year:1628 Mean :1.255e+08
## 3rd Qu.:366433 FL : 228 Public 2-Year : 992 3rd Qu.:7.395e+07
## Max. :491118 OH : 223 Public 4-Year : 761 Max. :9.703e+09
## (Other):3017
## Total.Expenses Tuition Federal
## Min. :4.600e+03 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.:3.202e+06 1st Qu.:2.329e+06 1st Qu.:6.854e+05
## Median :1.450e+07 Median :7.655e+06 Median :2.536e+06
## Mean :1.010e+08 Mean :3.559e+07 Mean :1.247e+07
## 3rd Qu.:5.072e+07 3rd Qu.:2.594e+07 3rd Qu.:8.038e+06
## Max. :8.630e+09 Max. :1.754e+09 Max. :1.071e+09
##
## State Local Private
## Min. : 0 Min. : 0 Min. : -246016
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 883000 Median : 0 Median : 320041
## Mean : 17874966 Mean : 3421588 Mean : 10226745
## 3rd Qu.: 11440117 3rd Qu.: 67965 3rd Qu.: 3145640
## Max. :909496000 Max. :386724142 Max. :1568865000
##
## Sales Research Public.Service
## Min. : -1073664 Min. :0.000e+00 Min. : -32
## 1st Qu.: 0 1st Qu.:0.000e+00 1st Qu.: 0
## Median : 102720 Median :0.000e+00 Median : 0
## Mean : 45901782 Mean :1.234e+07 Mean : 3893081
## 3rd Qu.: 9142570 3rd Qu.:1.228e+04 3rd Qu.: 445858
## Max. :8471746000 Max. :2.911e+09 Max. :807399000
##
## Student.Services Instruction Academic.Support
## Min. : 0 Min. :0.000e+00 Min. : 0
## 1st Qu.: 544548 1st Qu.:1.552e+06 1st Qu.: 347380
## Median : 2770299 Median :7.589e+06 Median : 1716867
## Mean : 8628289 Mean :3.769e+07 Mean : 10321013
## 3rd Qu.: 8839072 3rd Qu.:2.713e+07 3rd Qu.: 6072095
## Max. :284363000 Max. :2.612e+09 Max. :919577000
##
## Other Institution.Name Total.Enrollment
## Min. : -1 Length:4771 Min. : 9.0
## 1st Qu.: 0 Class :character 1st Qu.: 365.5
## Median : 35172 Mode :character Median : 1573.0
## Mean : 28176156 Mean : 4646.3
## 3rd Qu.: 5461534 3rd Qu.: 4993.0
## Max. :5543912000 Max. :131629.0
## NA's :516
## Full.Time.Enrollment Undergraduate.Enrollment
## Min. : 0.0 Min. : 0
## 1st Qu.: 263.5 1st Qu.: 279
## Median : 972.0 Median : 1384
## Mean : 2850.4 Mean : 3956
## 3rd Qu.: 2701.5 3rd Qu.: 4312
## Max. :131629.0 Max. :103711
## NA's :516 NA's :516
str(df)
## tibble [4,771 x 21] (S3: tbl_df/tbl/data.frame)
## $ UNITID : num [1:4771] 100654 100663 100690 100706 100724 ...
## ..- attr(*, "label")= chr "College ID"
## $ STABBR : Factor w/ 59 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
## ..- attr(*, "label")= chr "State"
## $ type : Factor w/ 6 levels "For-Profit 2-Year",..: 6 6 4 6 6 6 5 6 6 6 ...
## ..- attr(*, "label")= chr "College Type (NP = Nonprofit)"
## $ Total.Income : num [1:4771] 1.59e+08 2.96e+09 8.32e+06 2.26e+08 1.33e+08 ...
## ..- attr(*, "label")= chr "Total Income"
## $ Total.Expenses : num [1:4771] 1.03e+08 2.69e+09 3.56e+06 1.95e+08 8.94e+07 ...
## ..- attr(*, "label")= chr "Total Expenses"
## $ Tuition : num [1:4771] 4.65e+07 2.09e+08 6.60e+06 6.24e+07 4.53e+07 ...
## ..- attr(*, "label")= chr "Tuition (Income)"
## $ Federal : num [1:4771] 4.52e+07 3.85e+08 9.04e+05 8.04e+07 2.84e+07 ...
## ..- attr(*, "label")= chr "Federal Support (Income)"
## $ State : num [1:4771] 4.22e+07 2.83e+08 0.00 5.20e+07 4.66e+07 ...
## ..- attr(*, "label")= chr "State Support (Income)"
## $ Local : num [1:4771] 0 2350134 0 0 0 ...
## ..- attr(*, "label")= chr "Local Support (Income)"
## $ Private : num [1:4771] 2.28e+06 1.06e+08 7.78e+05 5.59e+06 2.91e+06 ...
## ..- attr(*, "label")= chr "Private Support (Income)"
## $ Sales : num [1:4771] 2.32e+07 1.97e+09 3.34e+04 2.52e+07 9.83e+06 ...
## ..- attr(*, "label")= chr "Sales (Income)"
## $ Research : num [1:4771] 1.02e+07 2.98e+08 0.00 7.30e+07 4.25e+06 ...
## ..- attr(*, "label")= chr "Research (Expense)"
## $ Public.Service : num [1:4771] 1.62e+07 1.73e+08 0.00 7.66e+06 4.52e+06 ...
## ..- attr(*, "label")= chr "Public Service (Expense)"
## $ Student.Services : num [1:4771] 20550501 50272077 771771 20998648 17139324 ...
## ..- attr(*, "label")= chr "Student Services (Expense)"
## $ Instruction : num [1:4771] 3.68e+07 3.11e+08 2.36e+06 6.97e+07 4.01e+07 ...
## ..- attr(*, "label")= chr "Instruction (Expense)"
## $ Academic.Support : num [1:4771] 7.74e+06 1.87e+08 4.28e+05 1.34e+07 1.21e+07 ...
## ..- attr(*, "label")= chr "Academic Support (Expense)"
## $ Other : num [1:4771] 1.19e+07 1.67e+09 0.00 1.00e+07 1.13e+07 ...
## ..- attr(*, "label")= chr "Other Expenses"
## $ Institution.Name : chr [1:4771] "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
## $ Total.Enrollment : num [1:4771] 5859 19535 597 8468 5318 ...
## $ Full.Time.Enrollment : num [1:4771] 5040 12691 216 6017 4694 ...
## $ Undergraduate.Enrollment: num [1:4771] 4851 12369 294 6507 4727 ...
If we take a look at the data set, we can see that there are a handful of missing values. We don’t want to elimintate all of these values because then we would remove nearly all of our observations. We do, however, still want to be mindful of where these missing values are. The code below and its corresponding output shows us the number of rows of missing values (marked “NA”) there are in each column:
colSums(is.na(df))
## UNITID STABBR type
## 0 0 0
## Total.Income Total.Expenses Tuition
## 0 0 0
## Federal State Local
## 0 0 0
## Private Sales Research
## 0 0 0
## Public.Service Student.Services Instruction
## 0 0 0
## Academic.Support Other Institution.Name
## 0 0 426
## Total.Enrollment Full.Time.Enrollment Undergraduate.Enrollment
## 516 516 516
There are six types of institution, Private Non-Profit 2-years, Private Non-Profit 4-years, Public 2-Years, Public 4-Years, For Profit 2_years, For Profit 4-Years. Hence, I regrouped these fields as Private, Public and For Profit type of institutions as “type-recode”
df <- df %>%
mutate(df,type_recode = case_when(type == "Private NP 2-Year" | type == "Private NP 4-Year" ~ "Private",
type == "Public 2-Year" | type == "Public 4-Year" ~ "Public",
type == "For-Profit 2-Year" | type == "For-Profit 4-Year" ~ "For Profit"))
ggplot(data=df, aes(fill = type_recode)) +
geom_bar(mapping=aes(x=type_recode)) +
theme(panel.background=element_blank()) +
theme(plot.background=element_blank()) +
theme(panel.grid.major.y=element_line(color="grey")) +
scale_x_discrete(name="School Type") +
scale_y_continuous(name="Number of Schools") +
scale_fill_manual(values=c("pink","skyblue","lightgreen"), guide=guide_legend(title="Institution Type", label.position="top", nrow=2000, keywidth=1.2))
Majority of institution type in the dataset are private and Public institutions, however, the number of Private institutions are more compared to the Public Institutions. Compared to both the Public and the Private institution For Profit type of institution is less in number.Further I would like to explore the expense and the income among these type of institutions.
I have calculated the average income and the average expenses among the Private, Public and For Profit type of instituions and comparing them.
by_type_degree <- df %>%
group_by(type_recode)
mean_type_degree <- by_type_degree %>%
summarise(mean_income= round(mean(Total.Income), 3),
mean_Expense = round(mean(Total.Expenses), 3))
g3 <- ggplot(mean_type_degree, aes(type_recode, mean_income)) +
geom_bar(aes(fill = type_recode), stat = "identity", show.legend = FALSE) +
geom_text(aes(label = paste(round(mean_income / 1e6, 1), "M")))+
labs(title = "Average Income ($)",
x = "Type of Institution",
y = "Average Income in Million Dollars")+scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))
g4 <- ggplot(mean_type_degree, aes(type_recode, mean_Expense)) +
geom_bar(aes(fill = type_recode), stat = "identity", show.legend = FALSE) +
geom_text(aes(label = paste(round(mean_Expense / 1e6, 1), "M")))+
labs(title = "Average Expense ($)",
x = "Type of Institution",
y = "Average Expense in Million Dollars") +scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))
plot_grid(g3, g4, ncol = 2, align = "h")
According to the graph, Public institutions have a higher Average Income compared to the Private and “For Profit” type of institutions.On the other hand, Public institutions have a higher average expense as well compared to the other two types of institutions.Eventhough, the Private institutions were higher in number in the dataset the income and the expense of the private institutions are way lower than the Public institutions.Hence, we could suspect two reasons for it,one is that we can say that private institutions might get less fundings from State, Federal etc,and second reason might be private institutions might have limited seats to offer because of which the expenses might be low. From this graph, we could also infer that as the income is higher in the institutions the expenses are also higher. Further, I would be analysing on all of these points.
options(repr.plot.width=30, repr.plot.height=8)
df %>%
group_by(type_recode) %>%
summarize(enrollment_by_State=n(),na.rm=TRUE) %>%
ggplot(aes(x = type_recode, y = enrollment_by_State)) +
geom_col(fill = "dodgerblue3") +
ggtitle("Number of Enrollments by Instituion Type") +
xlab("States") +
ylab("Total Number of Enrollment")+
coord_flip()
From the above visualization, we can say that Private institution has a higher rate of enrollment than the Public institution.However,in the dataset the number of students going to Private institution are higher. Hence, it might be purely based on the bias in the dataset. So Further, to know if tution is the major contributor for the income of the private institution and if the enrollment number contributes to the income we might have to find the relationship between the income and tution per capita.
df$tuition_percapita <- df$Tuition/df$Total.Enrollment
df$income_percapita <- df$Total.Income/df$Total.Enrollment
df$expense_percapita <- df$Total.Expenses/df$Total.Enrollment
tution <-df%>%
filter(type_recode=='Private')
ggplot(tution, aes(x=income_percapita, y=tuition_percapita)) + geom_point()+
geom_smooth(method=lm)+
ggtitle("Relationship between Income Per capita and Tuition Per Capita") +
xlab("Income per capita") +
ylab('Tuition per capita')+
scale_y_continuous(labels = dollar)+scale_x_continuous(labels = dollar)
The relationship between income per capita and the Tuition per capita is a positive linear relationship but the relationship between them is very weak.The strength of the relationship between two variables is a crucial piece of information. The strongest linear relationship occurs when the slope is 1 but the above scatter plot shows the slope is lower than 1. Hence, there is a weak relationship between the tuition per capita and Income per capita.Private universities might have the fundings from other sources but they are not as much as the Public universities.
#####Further , I would be exploring more on the relationship between the income and Expense. Whether if the Income is more Expense would also be more?
options(repr.plot.width=30, repr.plot.height=8)
df %>%
group_by(STABBR) %>%
summarize(enrollment_State=n(),na.rm=TRUE) %>%
ggplot(aes(x = STABBR, y = enrollment_State)) +
geom_col(fill = "dodgerblue3") +
ggtitle("Number of Enrollments in Institutions by State") +
xlab("States") +
ylab("Total Number of Enrollment")+
coord_flip()
The above analysis, shows that the top five states with highest enrollment of students are the states California, New York, Texas, Pensylvania and Florida.
data<-df %>%
select(UNITID,STABBR,type,income_percapita,Total.Income,Tuition,Federal,State,Local,Private,Sales,Research,Public.Service,Student.Services,Instruction,Academic.Support,Other,Institution.Name,Total.Enrollment,Undergraduate.Enrollment) %>%
filter(STABBR=="CA" | STABBR== "NY" | STABBR== "TX"| STABBR== "PA"| STABBR== "FL")
ggplot(data, aes(x=STABBR, y =income_percapita)) +
geom_point() +
geom_smooth(method = 'gam') +
geom_point(data = subset(data, STABBR %in% c("CA", "NY", "TX", "PA", "FL")), aes(color = STABBR)) +
ggtitle("Total Income in the top-five student with highest enrollment") +
xlab("States") +
ylab("Total Income")+
scale_y_continuous(labels = dollar)
The above analysis is carried out on income per capita by top five with high enrollment states. We can see that New York has comparitevely higher income compared to other states. However, there is an outlier of more than 10,000,000 income in the Texas State.
data<-df %>%
select(UNITID,STABBR,type,expense_percapita,Total.Expenses,Tuition,Federal,State,Local,Private,Sales,Research,Public.Service,Student.Services,Instruction,Academic.Support,Other,Institution.Name,Total.Enrollment,Undergraduate.Enrollment) %>%
filter(STABBR=="CA" | STABBR== "NY" | STABBR== "TX"| STABBR== "PA"| STABBR== "FL")
ggplot(data, aes(x=STABBR, y =expense_percapita)) +
geom_point() +
geom_smooth(method = 'gam') +
geom_point(data = subset(data, STABBR %in% c("CA", "NY", "TX", "PA", "FL")), aes(color = STABBR)) +
ggtitle("Total Expenses in the top-five states with highest enrollment") +
xlab("States") +
ylab('Total Expenses')+
scale_y_continuous(labels = dollar)
The above analysis is carried out on income per capita by top five with high enrollment states. We can see that New York has comparitevely higher expense compared to other states. However, there is an outlier of around 14,000,000 expense in the Texas State. This looks very similar to the income pattern among the top five states with high enrollment. Hence, it provides more evidence that as the income gets higher the expense is higher too.
ggplot(df, aes(x=Total.Income, y=Total.Expenses)) + geom_point()+
geom_smooth(method=lm)+
ggtitle("Relationship between Total Expenses and Total Income") +
xlab("Total Income") +
ylab('Total Expenses')+
scale_y_continuous(labels = dollar)+scale_x_continuous(labels = dollar)
The relationship between the Total income and the Total Expense is Strong, Positive and linear relationship.The trend line to the plot is showing the mathematically best fit to the data. The above graph shows that, one unit increase in the Total income increases one unit of the total expense.
Further, exploring more on the relationship between the Total Income and the Total Expense.
df %>%
group_by(STABBR,Tuition) %>%
summarize(tuition_by_state=mean(tuition_percapita),na.rm=FALSE) %>%
ggplot(aes(x = STABBR, y = tuition_by_state)) +
geom_col(fill = "dodgerblue3") +
ggtitle("Average Tution per capita across different States") +
xlab("States") +
ylab("Average Tution per capita")+
coord_flip()+
scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))
The above graph shows that States California, New York, Pensylvania,Florida and Texas have a higher average tuition per capita. This is reasonable as these states are the top five states with highest enrollment, hence, Tuition per capita might be higher.However, the order has been changed.
df %>%
group_by(STABBR,Total.Income) %>%
summarize(income_by_state=mean(income_percapita),na.rm=FALSE) %>%
ggplot(aes(x = STABBR, y = income_by_state)) +
geom_col(fill = "dodgerblue3") +
ggtitle("Income per capita across different States") +
xlab("States") +
ylab("Average Income per capita")+
coord_flip()
The above graph shows that States Texas,New York,California, Pensylvania and Florida have a higher average income per capita.
df %>%
group_by(STABBR,Total.Expenses) %>%
summarize(expense_by_state=mean(expense_percapita),na.rm=FALSE) %>%
ggplot(aes(x = STABBR, y = expense_by_state)) +
geom_col(fill = "dodgerblue3") +
ggtitle("Expense per capita across different States") +
xlab("States") +
ylab("Average Expense per capita")+
coord_flip()
The above graph shows that States Texas,New York,California, Pensylvania,Illinois and Florida have a higher average expense per capita.This follows the same pattern as the income per capita. Hence, there is another evidence of strong relationship between the Expense and the income.
ggplot(df, aes(x=income_percapita, y=expense_percapita)) + geom_point()+
geom_smooth(method=lm)+
ggtitle("Relationship between Expense per capita and Income per capita") +
xlab("Income Per Capita") +
ylab('Expense Per Capita')+
scale_y_continuous(labels = dollar)+scale_x_continuous(labels = dollar)
The relationship between the income per capita and the expense per capita is strong, linear and positive relationship. The slope between these two variables is 1 which makes the relationship stronger. We can see an outlier in the scatter plot. Further I would be filtering the outlier and replot the relationship between the income per capita and expense per capita.
ggplot(df, aes(x=income_percapita, y=expense_percapita)) +
geom_boxplot(fill='#A4A4A4', color="black")+
ggtitle("Boxplot between income per capita and expense per capita to identify the outlier") +
xlab("Income Per Capita") +
ylab('Expense Per Capita')+
theme_classic()
There is an outlier above the 100,000,00. Hence, in the coming steps I will remove the outlier and replot the relationship.
df1<-df %>%
filter(expense_percapita < 1100000)
The data is filtered on expense per capita lower than 11,000,00 to remove the outlier.
ggplot(df1, aes(x=income_percapita, y=expense_percapita)) +
geom_boxplot(fill='#A4A4A4', color="black")+
ggtitle("Boxplot between income per capita and expense per capita to identify the outlier") +
xlab("Income Per Capita") +
ylab('Expense Per Capita')+
theme_classic()
After filtering the data on expense per capita lower than 11,000,00, the box plot shows no outlier in the dataset.
ggplot(df1, aes(x = income_percapita, y = expense_percapita))+
geom_point()+
geom_smooth(method = 'lm')+
labs(x = 'Income per capita',
y = 'Expense per capita',
title = ' income per capita by expense per capita')
After the removal of the outliers, the relationship between the income per capita and expense per capita is Strong, Positive and linear relationship.There is a higher correlation between the income per capita and the expense per capita. Hence, One unit increase in the income increases a unit increase in the expense.
df3<-df %>%
select(Total.Income,Tuition,Federal,State,Local,Private,Sales)
ggpairs(df3)+theme_bw()
To understand what contributes more to the total income in the universities used the GGally which is the extension of the ggplot2 to find the correlation coefficient among the income sources and the total income. There is a higher correlation between the Total income and the income through sales which is 0.942 and it shows a strong positive relationship. Hence, There is a higher contribution from the Sales income to the impact on the Total income among all the colleges.
df4<-df %>%
select(Total.Expenses,Research,Public.Service,Student.Services,Instruction,Academic.Support,Other)
ggpairs(df4)+theme_bw()
To understand what contributes more to the total expenses in the universities used the GGally which is the extension of the ggplot2 to find the correlation coefficient among the types of spending and the total expense. There is a higher correlation between the Total expense and the other expenses is 0.881 and it shows a strong positive relationship. Hence, There is a higher contribution from the other expenses to the impact on the Total expense among all the colleges.
Private universities are more in this dataset. However, the income of the private universities are lower than that of the Public universities.We also found that there is a higher percentage of enrollment to the private universities than the public universities but the tuition per capita is not a major factor contributing to the total income for the Private Universities eventhough there is a higher enrollment of students. This points us towards the Private Universities getting income from other sources such as State, Local,Private etc but the Private Universities are not getting as much support as Public Universities in funding.
We also found that there is a strong positive relationship between the income and the expenses. The universities with the higher income are spending more.We also found that all the universities are getting the income majorly through sales and the expenses is majorly on the Other expenses.