College Expenses EDA

Loading the Libraries

Before we can begin the analysis we first have to download the packages that we will be using to run our analysis in R:

Tidyverse: Gives us a bunch of analysis tools combined into one seamless package. Overall, this package allows us to manipulate, wrangle, and visualize data for its intended purpose.

ggplot2: plotting package that provides helpful commands to create complex plots from data in a data frame

cowplot: The cowplot package is a simple add-on to ggplot.

GGally: ‘GGally’ is an extension of ‘ggplot2’ by adding several functions to reduce the complexity of combining geometric objects with transformed data.

scales: scales used to demonstrate ggplot2 style scales for specific types of data

DT: A simple package that lets us design data tables.

Dplyr: This package gives us a lot of data wrangling capabilities in order to narrow down observations to create in-depth analyses

Highchart: It’s an interactive graph, just explored with addition to the ggplot

library(tidyverse)
library(dplyr)
library(ggplot2)
library(cowplot)
library(scales)
library(DT)
library("GGally")
library("highcharter")

Setting the working directory

getwd()
setwd("C:\\Users\\swath\\Downloads")

Reading the file

df <- readRDS('college_expenses_and_enrollment.Rdata')

Understanding the data

vtable::vtable(df)

df
Name	Class	Label	Values
UNITID	numeric	College ID	Num: 100654 to 491118
STABBR	factor	State	‘AK’ ‘AL’ ‘AR’ ‘AS’ ‘AZ’ and more
type	factor	College Type (NP = Nonprofit)	‘For-Profit 2-Year’ ‘For-Profit 4-Year’ ‘Private NP 2-Year’ ‘Private NP 4-Year’ ‘Public 2-Year’ and more
Total.Income	numeric	Total Income	Num: 6067 to 9703300000
Total.Expenses	numeric	Total Expenses	Num: 4600 to 8629847000
Tuition	numeric	Tuition (Income)	Num: 0 to 1753736000
Federal	numeric	Federal Support (Income)	Num: 0 to 1071432208
State	numeric	State Support (Income)	Num: 0 to 909496000
Local	numeric	Local Support (Income)	Num: 0 to 386724142
Private	numeric	Private Support (Income)	Num: -246016 to 1568865000
Sales	numeric	Sales (Income)	Num: -1073664 to 8471746000
Research	numeric	Research (Expense)	Num: 0 to 2910863000
Public.Service	numeric	Public Service (Expense)	Num: -32 to 807399000
Student.Services	numeric	Student Services (Expense)	Num: 0 to 284363000
Instruction	numeric	Instruction (Expense)	Num: 0 to 2611938000
Academic.Support	numeric	Academic Support (Expense)	Num: 0 to 919577000
Other	numeric	Other Expenses	Num: -1 to 5543912000
Institution.Name	character	NULL
Total.Enrollment	numeric	NULL	Num: 9 to 131629
Full.Time.Enrollment	numeric	NULL	Num: 0 to 131629
Undergraduate.Enrollment	numeric	NULL	Num: 0 to 103711

Data Table

Below is an interactive data table in order to explore the primary dataframe.

datatable(df)

Exploring the variables(columns) in the dataset

colnames(df)

##  [1] "UNITID"                   "STABBR"                  
##  [3] "type"                     "Total.Income"            
##  [5] "Total.Expenses"           "Tuition"                 
##  [7] "Federal"                  "State"                   
##  [9] "Local"                    "Private"                 
## [11] "Sales"                    "Research"                
## [13] "Public.Service"           "Student.Services"        
## [15] "Instruction"              "Academic.Support"        
## [17] "Other"                    "Institution.Name"        
## [19] "Total.Enrollment"         "Full.Time.Enrollment"    
## [21] "Undergraduate.Enrollment"

Summary of the dataset

summary(df)

##      UNITID           STABBR                    type       Total.Income      
##  Min.   :100654   CA     : 468   For-Profit 2-Year: 754   Min.   :6.067e+03  
##  1st Qu.:157674   NY     : 324   For-Profit 4-Year: 477   1st Qu.:6.043e+06  
##  Median :203614   TX     : 272   Private NP 2-Year: 159   Median :2.370e+07  
##  Mean   :247693   PA     : 239   Private NP 4-Year:1628   Mean   :1.255e+08  
##  3rd Qu.:366433   FL     : 228   Public 2-Year    : 992   3rd Qu.:7.395e+07  
##  Max.   :491118   OH     : 223   Public 4-Year    : 761   Max.   :9.703e+09  
##                   (Other):3017                                               
##  Total.Expenses         Tuition             Federal         
##  Min.   :4.600e+03   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:3.202e+06   1st Qu.:2.329e+06   1st Qu.:6.854e+05  
##  Median :1.450e+07   Median :7.655e+06   Median :2.536e+06  
##  Mean   :1.010e+08   Mean   :3.559e+07   Mean   :1.247e+07  
##  3rd Qu.:5.072e+07   3rd Qu.:2.594e+07   3rd Qu.:8.038e+06  
##  Max.   :8.630e+09   Max.   :1.754e+09   Max.   :1.071e+09  
##                                                             
##      State               Local              Private          
##  Min.   :        0   Min.   :        0   Min.   :   -246016  
##  1st Qu.:        0   1st Qu.:        0   1st Qu.:         0  
##  Median :   883000   Median :        0   Median :    320041  
##  Mean   : 17874966   Mean   :  3421588   Mean   :  10226745  
##  3rd Qu.: 11440117   3rd Qu.:    67965   3rd Qu.:   3145640  
##  Max.   :909496000   Max.   :386724142   Max.   :1568865000  
##                                                              
##      Sales               Research         Public.Service     
##  Min.   :  -1073664   Min.   :0.000e+00   Min.   :      -32  
##  1st Qu.:         0   1st Qu.:0.000e+00   1st Qu.:        0  
##  Median :    102720   Median :0.000e+00   Median :        0  
##  Mean   :  45901782   Mean   :1.234e+07   Mean   :  3893081  
##  3rd Qu.:   9142570   3rd Qu.:1.228e+04   3rd Qu.:   445858  
##  Max.   :8471746000   Max.   :2.911e+09   Max.   :807399000  
##                                                              
##  Student.Services     Instruction        Academic.Support   
##  Min.   :        0   Min.   :0.000e+00   Min.   :        0  
##  1st Qu.:   544548   1st Qu.:1.552e+06   1st Qu.:   347380  
##  Median :  2770299   Median :7.589e+06   Median :  1716867  
##  Mean   :  8628289   Mean   :3.769e+07   Mean   : 10321013  
##  3rd Qu.:  8839072   3rd Qu.:2.713e+07   3rd Qu.:  6072095  
##  Max.   :284363000   Max.   :2.612e+09   Max.   :919577000  
##                                                             
##      Other            Institution.Name   Total.Enrollment  
##  Min.   :        -1   Length:4771        Min.   :     9.0  
##  1st Qu.:         0   Class :character   1st Qu.:   365.5  
##  Median :     35172   Mode  :character   Median :  1573.0  
##  Mean   :  28176156                      Mean   :  4646.3  
##  3rd Qu.:   5461534                      3rd Qu.:  4993.0  
##  Max.   :5543912000                      Max.   :131629.0  
##                                          NA's   :516       
##  Full.Time.Enrollment Undergraduate.Enrollment
##  Min.   :     0.0     Min.   :     0          
##  1st Qu.:   263.5     1st Qu.:   279          
##  Median :   972.0     Median :  1384          
##  Mean   :  2850.4     Mean   :  3956          
##  3rd Qu.:  2701.5     3rd Qu.:  4312          
##  Max.   :131629.0     Max.   :103711          
##  NA's   :516          NA's   :516

Internal structure of the data

str(df)

## tibble [4,771 x 21] (S3: tbl_df/tbl/data.frame)
##  $ UNITID                  : num [1:4771] 100654 100663 100690 100706 100724 ...
##   ..- attr(*, "label")= chr "College ID"
##  $ STABBR                  : Factor w/ 59 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "State"
##  $ type                    : Factor w/ 6 levels "For-Profit 2-Year",..: 6 6 4 6 6 6 5 6 6 6 ...
##   ..- attr(*, "label")= chr "College Type (NP = Nonprofit)"
##  $ Total.Income            : num [1:4771] 1.59e+08 2.96e+09 8.32e+06 2.26e+08 1.33e+08 ...
##   ..- attr(*, "label")= chr "Total Income"
##  $ Total.Expenses          : num [1:4771] 1.03e+08 2.69e+09 3.56e+06 1.95e+08 8.94e+07 ...
##   ..- attr(*, "label")= chr "Total Expenses"
##  $ Tuition                 : num [1:4771] 4.65e+07 2.09e+08 6.60e+06 6.24e+07 4.53e+07 ...
##   ..- attr(*, "label")= chr "Tuition (Income)"
##  $ Federal                 : num [1:4771] 4.52e+07 3.85e+08 9.04e+05 8.04e+07 2.84e+07 ...
##   ..- attr(*, "label")= chr "Federal Support (Income)"
##  $ State                   : num [1:4771] 4.22e+07 2.83e+08 0.00 5.20e+07 4.66e+07 ...
##   ..- attr(*, "label")= chr "State Support (Income)"
##  $ Local                   : num [1:4771] 0 2350134 0 0 0 ...
##   ..- attr(*, "label")= chr "Local Support (Income)"
##  $ Private                 : num [1:4771] 2.28e+06 1.06e+08 7.78e+05 5.59e+06 2.91e+06 ...
##   ..- attr(*, "label")= chr "Private Support (Income)"
##  $ Sales                   : num [1:4771] 2.32e+07 1.97e+09 3.34e+04 2.52e+07 9.83e+06 ...
##   ..- attr(*, "label")= chr "Sales (Income)"
##  $ Research                : num [1:4771] 1.02e+07 2.98e+08 0.00 7.30e+07 4.25e+06 ...
##   ..- attr(*, "label")= chr "Research (Expense)"
##  $ Public.Service          : num [1:4771] 1.62e+07 1.73e+08 0.00 7.66e+06 4.52e+06 ...
##   ..- attr(*, "label")= chr "Public Service (Expense)"
##  $ Student.Services        : num [1:4771] 20550501 50272077 771771 20998648 17139324 ...
##   ..- attr(*, "label")= chr "Student Services (Expense)"
##  $ Instruction             : num [1:4771] 3.68e+07 3.11e+08 2.36e+06 6.97e+07 4.01e+07 ...
##   ..- attr(*, "label")= chr "Instruction (Expense)"
##  $ Academic.Support        : num [1:4771] 7.74e+06 1.87e+08 4.28e+05 1.34e+07 1.21e+07 ...
##   ..- attr(*, "label")= chr "Academic Support (Expense)"
##  $ Other                   : num [1:4771] 1.19e+07 1.67e+09 0.00 1.00e+07 1.13e+07 ...
##   ..- attr(*, "label")= chr "Other Expenses"
##  $ Institution.Name        : chr [1:4771] "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
##  $ Total.Enrollment        : num [1:4771] 5859 19535 597 8468 5318 ...
##  $ Full.Time.Enrollment    : num [1:4771] 5040 12691 216 6017 4694 ...
##  $ Undergraduate.Enrollment: num [1:4771] 4851 12369 294 6507 4727 ...

Missing Values

If we take a look at the data set, we can see that there are a handful of missing values. We don’t want to elimintate all of these values because then we would remove nearly all of our observations. We do, however, still want to be mindful of where these missing values are. The code below and its corresponding output shows us the number of rows of missing values (marked “NA”) there are in each column:

colSums(is.na(df))

##                   UNITID                   STABBR                     type 
##                        0                        0                        0 
##             Total.Income           Total.Expenses                  Tuition 
##                        0                        0                        0 
##                  Federal                    State                    Local 
##                        0                        0                        0 
##                  Private                    Sales                 Research 
##                        0                        0                        0 
##           Public.Service         Student.Services              Instruction 
##                        0                        0                        0 
##         Academic.Support                    Other         Institution.Name 
##                        0                        0                      426 
##         Total.Enrollment     Full.Time.Enrollment Undergraduate.Enrollment 
##                      516                      516                      516

Regrouping the Institution type into three groups

There are six types of institution, Private Non-Profit 2-years, Private Non-Profit 4-years, Public 2-Years, Public 4-Years, For Profit 2_years, For Profit 4-Years. Hence, I regrouped these fields as Private, Public and For Profit type of institutions as “type-recode”

df <- df %>%
  mutate(df,type_recode = case_when(type == "Private NP 2-Year" | type == "Private NP 4-Year" ~ "Private",
                                         type == "Public 2-Year" | type == "Public 4-Year" ~ "Public",
                                        type == "For-Profit 2-Year" | type == "For-Profit 4-Year" ~ "For Profit"))

Exploratory Data Analysis

Distribution of the three types of institutions in the dataset.

ggplot(data=df, aes(fill = type_recode)) +
  geom_bar(mapping=aes(x=type_recode)) +

  theme(panel.background=element_blank()) +
  theme(plot.background=element_blank()) +
  theme(panel.grid.major.y=element_line(color="grey")) +
  scale_x_discrete(name="School Type") +
  scale_y_continuous(name="Number of Schools") +
  scale_fill_manual(values=c("pink","skyblue","lightgreen"), guide=guide_legend(title="Institution Type", label.position="top", nrow=2000, keywidth=1.2))

Majority of institution type in the dataset are private and Public institutions, however, the number of Private institutions are more compared to the Public Institutions. Compared to both the Public and the Private institution For Profit type of institution is less in number.Further I would like to explore the expense and the income among these type of institutions.

Income vs Expenses across Private, Public and For Profit type of Institutions

I have calculated the average income and the average expenses among the Private, Public and For Profit type of instituions and comparing them.

by_type_degree <- df %>%
  group_by(type_recode) 
mean_type_degree <- by_type_degree %>%
  summarise(mean_income= round(mean(Total.Income), 3),
            mean_Expense = round(mean(Total.Expenses), 3))

g3 <- ggplot(mean_type_degree, aes(type_recode, mean_income)) +
  geom_bar(aes(fill = type_recode), stat = "identity", show.legend = FALSE) +
   geom_text(aes(label = paste(round(mean_income / 1e6, 1), "M")))+

  labs(title = "Average Income ($)",
       x = "Type of Institution", 
       y = "Average Income in Million Dollars")+scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))

g4 <- ggplot(mean_type_degree, aes(type_recode, mean_Expense)) +
  geom_bar(aes(fill = type_recode), stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = paste(round(mean_Expense / 1e6, 1), "M")))+
  labs(title = "Average Expense ($)",
       x = "Type of Institution", 
       y = "Average Expense in Million Dollars") +scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))

plot_grid(g3, g4, ncol = 2, align = "h")

According to the graph, Public institutions have a higher Average Income compared to the Private and “For Profit” type of institutions.On the other hand, Public institutions have a higher average expense as well compared to the other two types of institutions.Eventhough, the Private institutions were higher in number in the dataset the income and the expense of the private institutions are way lower than the Public institutions.Hence, we could suspect two reasons for it,one is that we can say that private institutions might get less fundings from State, Federal etc,and second reason might be private institutions might have limited seats to offer because of which the expenses might be low. From this graph, we could also infer that as the income is higher in the institutions the expenses are also higher. Further, I would be analysing on all of these points.

Which type of Institution has a higher enrollment of Students?

options(repr.plot.width=30, repr.plot.height=8)
df %>%
  group_by(type_recode) %>% 
  summarize(enrollment_by_State=n(),na.rm=TRUE) %>% 
  ggplot(aes(x = type_recode, y = enrollment_by_State)) +
  geom_col(fill = "dodgerblue3") +
  ggtitle("Number of Enrollments by Instituion Type") +
  xlab("States") +
  ylab("Total Number of Enrollment")+
  coord_flip()

From the above visualization, we can say that Private institution has a higher rate of enrollment than the Public institution.However,in the dataset the number of students going to Private institution are higher. Hence, it might be purely based on the bias in the dataset. So Further, to know if tution is the major contributor for the income of the private institution and if the enrollment number contributes to the income we might have to find the relationship between the income and tution per capita.

Creating a columns “TUITION PER CAPITA, EXPENSE PER CAPITA AND INCOME PER CAPITA”

df$tuition_percapita <- df$Tuition/df$Total.Enrollment
df$income_percapita <- df$Total.Income/df$Total.Enrollment
df$expense_percapita <- df$Total.Expenses/df$Total.Enrollment

Relationship between Income per Capita and Tuition per Capita for the Private instituition

tution <-df%>%
  filter(type_recode=='Private')
  
ggplot(tution, aes(x=income_percapita, y=tuition_percapita)) + geom_point()+
geom_smooth(method=lm)+
ggtitle("Relationship between Income Per capita and Tuition Per Capita") +
xlab("Income per capita") +
ylab('Tuition per capita')+
scale_y_continuous(labels = dollar)+scale_x_continuous(labels = dollar)

The relationship between income per capita and the Tuition per capita is a positive linear relationship but the relationship between them is very weak.The strength of the relationship between two variables is a crucial piece of information. The strongest linear relationship occurs when the slope is 1 but the above scatter plot shows the slope is lower than 1. Hence, there is a weak relationship between the tuition per capita and Income per capita.Private universities might have the fundings from other sources but they are not as much as the Public universities.

#####Further , I would be exploring more on the relationship between the income and Expense. Whether if the Income is more Expense would also be more?

Which state has the highest enrollment?

options(repr.plot.width=30, repr.plot.height=8)
df %>%
  group_by(STABBR) %>% 
  summarize(enrollment_State=n(),na.rm=TRUE) %>% 
  ggplot(aes(x = STABBR, y = enrollment_State)) +
  geom_col(fill = "dodgerblue3") +
  ggtitle("Number of Enrollments in Institutions by State") +
  xlab("States") +
  ylab("Total Number of Enrollment")+
  coord_flip()

The above analysis, shows that the top five states with highest enrollment of students are the states California, New York, Texas, Pensylvania and Florida.

Total Income among the top five states with highest enrollment

data<-df %>%
  select(UNITID,STABBR,type,income_percapita,Total.Income,Tuition,Federal,State,Local,Private,Sales,Research,Public.Service,Student.Services,Instruction,Academic.Support,Other,Institution.Name,Total.Enrollment,Undergraduate.Enrollment) %>%
  filter(STABBR=="CA" | STABBR== "NY" | STABBR== "TX"| STABBR== "PA"| STABBR== "FL")
  
ggplot(data, aes(x=STABBR, y =income_percapita)) +
 geom_point() +
  geom_smooth(method = 'gam') +
  geom_point(data = subset(data, STABBR %in% c("CA", "NY", "TX", "PA", "FL")), aes(color = STABBR)) +
  ggtitle("Total Income in  the top-five student with highest enrollment") +
  xlab("States") +
  ylab("Total Income")+
  scale_y_continuous(labels = dollar)

The above analysis is carried out on income per capita by top five with high enrollment states. We can see that New York has comparitevely higher income compared to other states. However, there is an outlier of more than 10,000,000 income in the Texas State.

Total Expense among the top five states with highest enrollment

data<-df %>%
  select(UNITID,STABBR,type,expense_percapita,Total.Expenses,Tuition,Federal,State,Local,Private,Sales,Research,Public.Service,Student.Services,Instruction,Academic.Support,Other,Institution.Name,Total.Enrollment,Undergraduate.Enrollment) %>%
  filter(STABBR=="CA" | STABBR== "NY" | STABBR== "TX"| STABBR== "PA"| STABBR== "FL")
  
ggplot(data, aes(x=STABBR, y =expense_percapita)) +
 geom_point() +
  geom_smooth(method = 'gam') +
  geom_point(data = subset(data, STABBR %in% c("CA", "NY", "TX", "PA", "FL")), aes(color = STABBR)) +
  ggtitle("Total Expenses in the top-five states with highest enrollment") +
  xlab("States") +
  ylab('Total Expenses')+
  scale_y_continuous(labels = dollar)

The above analysis is carried out on income per capita by top five with high enrollment states. We can see that New York has comparitevely higher expense compared to other states. However, there is an outlier of around 14,000,000 expense in the Texas State. This looks very similar to the income pattern among the top five states with high enrollment. Hence, it provides more evidence that as the income gets higher the expense is higher too.

Relationship between Total Income and Total Expenses

ggplot(df, aes(x=Total.Income, y=Total.Expenses)) + geom_point()+
geom_smooth(method=lm)+
ggtitle("Relationship between Total Expenses and Total Income") +
xlab("Total Income") +
ylab('Total Expenses')+
scale_y_continuous(labels = dollar)+scale_x_continuous(labels = dollar)

The relationship between the Total income and the Total Expense is Strong, Positive and linear relationship.The trend line to the plot is showing the mathematically best fit to the data. The above graph shows that, one unit increase in the Total income increases one unit of the total expense.

Further, exploring more on the relationship between the Total Income and the Total Expense.

Average Tution Per capita across different States

df %>%
  group_by(STABBR,Tuition) %>% 
  summarize(tuition_by_state=mean(tuition_percapita),na.rm=FALSE) %>% 
  ggplot(aes(x = STABBR, y = tuition_by_state)) +
  geom_col(fill = "dodgerblue3") +
  ggtitle("Average Tution per capita across different States") +
  xlab("States") +
  ylab("Average Tution per capita")+
  coord_flip()+
  scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))

The above graph shows that States California, New York, Pensylvania,Florida and Texas have a higher average tuition per capita. This is reasonable as these states are the top five states with highest enrollment, hence, Tuition per capita might be higher.However, the order has been changed.

Average Income per capita across different States

df %>%
  group_by(STABBR,Total.Income) %>% 
  summarize(income_by_state=mean(income_percapita),na.rm=FALSE) %>% 
  ggplot(aes(x = STABBR, y = income_by_state)) +
  geom_col(fill = "dodgerblue3") +
  ggtitle("Income per capita across different States") +
  xlab("States") +
  ylab("Average Income per capita")+
  coord_flip()

The above graph shows that States Texas,New York,California, Pensylvania and Florida have a higher average income per capita.

df %>%
  group_by(STABBR,Total.Expenses) %>% 
  summarize(expense_by_state=mean(expense_percapita),na.rm=FALSE) %>% 
  ggplot(aes(x = STABBR, y = expense_by_state)) +
  geom_col(fill = "dodgerblue3") +
  ggtitle("Expense per capita across different States") +
  xlab("States") +
  ylab("Average Expense per capita")+
  coord_flip()

The above graph shows that States Texas,New York,California, Pensylvania,Illinois and Florida have a higher average expense per capita.This follows the same pattern as the income per capita. Hence, there is another evidence of strong relationship between the Expense and the income.

Relationship between Expense per capita and Income per capita

ggplot(df, aes(x=income_percapita, y=expense_percapita)) + geom_point()+
geom_smooth(method=lm)+
ggtitle("Relationship between Expense per capita and Income per capita") +
xlab("Income Per Capita") +
ylab('Expense Per Capita')+
scale_y_continuous(labels = dollar)+scale_x_continuous(labels = dollar)

The relationship between the income per capita and the expense per capita is strong, linear and positive relationship. The slope between these two variables is 1 which makes the relationship stronger. We can see an outlier in the scatter plot. Further I would be filtering the outlier and replot the relationship between the income per capita and expense per capita.

Boxplot between income per capita and expense per capita to identify the outlier

ggplot(df, aes(x=income_percapita, y=expense_percapita)) + 
 geom_boxplot(fill='#A4A4A4', color="black")+
ggtitle("Boxplot between income per capita and expense per capita to identify the outlier") +
xlab("Income Per Capita") +
ylab('Expense Per Capita')+
theme_classic()

There is an outlier above the 100,000,00. Hence, in the coming steps I will remove the outlier and replot the relationship.

Filtering the data to remove the outlier

df1<-df %>% 
  filter(expense_percapita < 1100000)

The data is filtered on expense per capita lower than 11,000,00 to remove the outlier.

Boxplot between income per capita and expense per capita to identify the outlier

ggplot(df1, aes(x=income_percapita, y=expense_percapita)) + 
 geom_boxplot(fill='#A4A4A4', color="black")+
  ggtitle("Boxplot between income per capita and expense per capita to identify the outlier") +
xlab("Income Per Capita") +
ylab('Expense Per Capita')+
theme_classic()

After filtering the data on expense per capita lower than 11,000,00, the box plot shows no outlier in the dataset.

Replotting the relationship between the income per capita and expense percapita

ggplot(df1, aes(x = income_percapita, y = expense_percapita))+
  geom_point()+
  geom_smooth(method = 'lm')+
   labs(x = 'Income per capita',
      y = 'Expense per capita',
      title = ' income per capita by expense per capita')

After the removal of the outliers, the relationship between the income per capita and expense per capita is Strong, Positive and linear relationship.There is a higher correlation between the income per capita and the expense per capita. Hence, One unit increase in the income increases a unit increase in the expense.

Correlation between the Total Income and the types of income

df3<-df %>% 
  select(Total.Income,Tuition,Federal,State,Local,Private,Sales)
ggpairs(df3)+theme_bw()

To understand what contributes more to the total income in the universities used the GGally which is the extension of the ggplot2 to find the correlation coefficient among the income sources and the total income. There is a higher correlation between the Total income and the income through sales which is 0.942 and it shows a strong positive relationship. Hence, There is a higher contribution from the Sales income to the impact on the Total income among all the colleges.

Correlation between the Total Expenses and the types of Spending

df4<-df %>% 
  select(Total.Expenses,Research,Public.Service,Student.Services,Instruction,Academic.Support,Other)
ggpairs(df4)+theme_bw()

To understand what contributes more to the total expenses in the universities used the GGally which is the extension of the ggplot2 to find the correlation coefficient among the types of spending and the total expense. There is a higher correlation between the Total expense and the other expenses is 0.881 and it shows a strong positive relationship. Hence, There is a higher contribution from the other expenses to the impact on the Total expense among all the colleges.

Conclusion

Private universities are more in this dataset. However, the income of the private universities are lower than that of the Public universities.We also found that there is a higher percentage of enrollment to the private universities than the public universities but the tuition per capita is not a major factor contributing to the total income for the Private Universities eventhough there is a higher enrollment of students. This points us towards the Private Universities getting income from other sources such as State, Local,Private etc but the Private Universities are not getting as much support as Public Universities in funding.

We also found that there is a strong positive relationship between the income and the expenses. The universities with the higher income are spending more.We also found that all the universities are getting the income majorly through sales and the expenses is majorly on the Other expenses.

College Expenses EDA

Swathi Bangalore Madhuranath

Loading the Libraries

Setting the working directory

Reading the file

Understanding the data

Data Table

Exploring the variables(columns) in the dataset

Summary of the dataset

Internal structure of the data

Missing Values

Regrouping the Institution type into three groups

Exploratory Data Analysis

Distribution of the three types of institutions in the dataset.

Income vs Expenses across Private, Public and For Profit type of Institutions

Which type of Institution has a higher enrollment of Students?

Creating a columns “TUITION PER CAPITA, EXPENSE PER CAPITA AND INCOME PER CAPITA”

Relationship between Income per Capita and Tuition per Capita for the Private instituition

Which state has the highest enrollment?

Total Income among the top five states with highest enrollment

Total Expense among the top five states with highest enrollment

Relationship between Total Income and Total Expenses

Average Tution Per capita across different States

Average Income per capita across different States

Relationship between Expense per capita and Income per capita

Boxplot between income per capita and expense per capita to identify the outlier

Filtering the data to remove the outlier

Boxplot between income per capita and expense per capita to identify the outlier

Replotting the relationship between the income per capita and expense percapita

Correlation between the Total Income and the types of income

Correlation between the Total Expenses and the types of Spending

Conclusion