Introduction

The data was obtained through Kaggle. The original dataset was published by The Center for World University Rankings (CWUR).

According to the CWUR website, the methodology for their measurements is as follows:

1. Quality of Education, measured by the number of a university’s alumni who have won major international awards, prizes, and medals relative to the university’s size [25%]

2. Alumni Employment, measured by the number of a university’s alumni who have held CEO positions at the world’s top companies relative to the university’s size [25%]

3. Quality of Faculty, measured by the number of academics who have won major international awards, prizes, and medals [25%]

4. Publications, measured by the number of research papers appearing in reputable journals [5%]

5. Influence, measured by the number of research papers appearing in highly-influential journals [5%]

6. Citations, measured by the number of highly-cited research papers [5%]

7. Broad Impact, measured by the university’s h-index [5%]

8. Patents, measured by the number of international patent filings [5%]
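As a quick sanity check on the weights listed above, here is a minimal sketch in R. It only confirms that the eight weights add up to 100% and shows how a weighted composite could be formed. The indicator values are made up, and the simple weighted sum is my assumption: the exact formula CWUR uses to combine the indicators is not given here.

```r
# Weights as listed above, in percent (the names are just labels)
weights <- c(quality_of_education = 25, alumni_employment = 25,
             quality_of_faculty = 25, publications = 5, influence = 5,
             citations = 5, broad_impact = 5, patents = 5)

# The weights should account for the whole score
stopifnot(sum(weights) == 100)

# Hypothetical indicator scores (0-100) for one made-up university
indicators <- c(quality_of_education = 80, alumni_employment = 60,
                quality_of_faculty = 90, publications = 70, influence = 75,
                citations = 65, broad_impact = 85, patents = 40)

# Assumed aggregation: weighted sum of the indicators, divided by 100
composite <- sum(weights * indicators[names(weights)]) / 100
composite
```

Note how the three 25% indicators dominate: together they carry three quarters of the total weight.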

Reading in the Data

library(DT)

data <- read.csv("cwurData.csv")

Since this list has separate entries for each year for the same schools, we need to choose a year to analyze.

In this case, I will analyze the most recent available data: 2015.

The code below keeps only the 2015 rows, so each school should appear only once. (An earlier approach that dropped duplicated institutions is left commented out.)

#data <- subset(data,!duplicated(data$institution))
data <- subset(data, data$year == 2015)
str(data)
## 'data.frame':    1000 obs. of  14 variables:
##  $ world_rank          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ institution         : Factor w/ 1024 levels "Örebro University",..: 194 520 322 653 833 106 643 664 442 110 ...
##  $ country             : Factor w/ 59 levels "Argentina","Australia",..: 59 59 59 57 57 59 59 59 59 59 ...
##  $ national_rank       : int  1 2 3 1 2 4 5 6 7 8 ...
##  $ quality_of_education: int  1 9 3 2 7 13 5 11 4 12 ...
##  $ alumni_employment   : int  1 2 11 10 13 6 21 14 15 18 ...
##  $ quality_of_faculty  : int  1 4 2 5 10 9 6 8 3 14 ...
##  $ publications        : int  1 5 15 11 7 13 10 17 72 24 ...
##  $ influence           : int  1 3 2 6 12 13 4 16 25 15 ...
##  $ citations           : int  1 3 2 12 7 11 4 12 24 25 ...
##  $ broad_impact        : int  1 4 2 13 9 12 7 22 33 22 ...
##  $ patents             : int  3 10 1 48 15 4 29 141 225 11 ...
##  $ score               : num  100 98.7 97.5 96.8 96.5 ...
##  $ year                : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
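Instead of hard-coding 2015, the most recent year could also be picked from the data itself. The sketch below uses a small made-up data frame (toy, with invented values) in place of cwurData.csv so that it runs on its own. It also checks that no institution appears twice after filtering.

```r
# Toy stand-in for cwurData.csv using the same column names (values are made up)
toy <- data.frame(
  institution = c("A", "B", "A", "B", "C"),
  year        = c(2014, 2014, 2015, 2015, 2015),
  world_rank  = c(1, 2, 2, 1, 3)
)

# Pick the most recent year present in the data
latest_year <- max(toy$year)
latest <- toy[toy$year == latest_year, ]

# After filtering to a single year, each institution should appear exactly once
stopifnot(!anyDuplicated(latest$institution))
nrow(latest)
```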

Visualizing the Top Countries in Higher Learning

height <- sort(table(data$country), decreasing = TRUE)

barplot(height[1:10], las = 3, main = "Top Countries in University Rankings 2015")

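The US has far more universities on the 2015 list than any other country: 229 of the 1000 ranked schools. Note that this counts how many schools made the list, not how highly they are ranked.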
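To put numbers on a bar chart like this, the counts can be turned into percentage shares. This is a minimal sketch on a made-up vector of countries, not the real data; with the real data one would presumably pass data$country instead.

```r
# Made-up country labels standing in for data$country
countries <- c("USA", "USA", "USA", "China", "China", "Japan")

# Count schools per country, largest first
counts <- sort(table(countries), decreasing = TRUE)

# Convert the counts to percentage shares of the whole list
share <- round(100 * prop.table(counts), 1)
head(share, 10)
```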

Exploring US Schools

Let’s explore the top Universities in the US.

Since we are only interested in American schools, we need to build a data frame that contains only USA schools.

usa <- subset(data, data$country == "USA")

Use the search bar to see if your school made the list!

library(DT)

datatable(usa)

Here is a matrix plot of the data (the call is commented out here):

#pairs(data)

Analysis

Looking at the data, it is clear that national ranking is determined by multiple factors. To see how different factors correlate with national rank, we begin by plotting national rank against different variables.

I chose these three variables for comparison:

Quality Of Faculty

plot(usa$quality_of_faculty, usa$national_rank, xlab = "Quality of Faculty", ylab = "National Rank", main = "Quality of Faculty vs National Rank")
c <- lm(national_rank ~ quality_of_faculty, data = usa)
abline(c)

summary(c) #regression model
## 
## Call:
## lm(formula = national_rank ~ quality_of_faculty, data = usa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -118.610  -28.610   -0.349   31.390   82.390 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.4456     7.4219    0.33    0.742    
## quality_of_faculty   0.6613     0.0400   16.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.72 on 227 degrees of freedom
## Multiple R-squared:  0.5463, Adjusted R-squared:  0.5443 
## F-statistic: 273.3 on 1 and 227 DF,  p-value: < 2.2e-16

Influence

plot(usa$influence, usa$national_rank, xlab = "Influence", ylab = "National Rank", main = "Influence vs National Rank")
c <- lm(national_rank ~ influence, data = usa)
abline(c)

summary(c) #regression model
## 
## Call:
## lm(formula = national_rank ~ influence, data = usa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -120.27  -17.67   -0.14   17.16   70.99 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.252418   3.075865   11.46   <2e-16 ***
## influence    0.233616   0.007202   32.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.97 on 227 degrees of freedom
## Multiple R-squared:  0.8225, Adjusted R-squared:  0.8218 
## F-statistic:  1052 on 1 and 227 DF,  p-value: < 2.2e-16

Citations

plot(usa$citations, usa$national_rank, xlab = "Citations", ylab = "National Rank", main = "Citations vs National Rank")
c <- lm(national_rank ~ citations, data = usa)
abline(c)

summary(c) #regression model
## 
## Call:
## lm(formula = national_rank ~ citations, data = usa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.526  -19.187   -2.066   18.193  106.955 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.855309   3.403867   12.00   <2e-16 ***
## citations    0.231122   0.008354   27.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.76 on 227 degrees of freedom
## Multiple R-squared:  0.7713, Adjusted R-squared:  0.7702 
## F-statistic: 765.4 on 1 and 227 DF,  p-value: < 2.2e-16

Interpretation

According to the models, these three variables are positively correlated with national rank.

This is to be expected: every variable here is a rank, where 1 is best, so a positive slope means that schools with better (lower) ranks for quality of faculty, influence and citations also tend to have a better national rank.
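Since all of these variables are ranks, a rank correlation is a natural companion to the regressions. The sketch below runs on simulated ranks (not the CWUR data) and shows that Spearman's correlation is the same thing as Pearson's correlation computed on ranks. The variable names are made up for the example.

```r
set.seed(1)
n <- 50

# Simulated ranks: a random permutation of 1..n, and a noisy copy re-ranked
faculty_rank <- sample(n)
national_rank_toy <- rank(faculty_rank + rnorm(n, sd = 10))

# Spearman's rho, computed directly
rho <- cor(faculty_rank, national_rank_toy, method = "spearman")

# The same value, computed as Pearson's correlation on the ranks
pearson_on_ranks <- cor(rank(faculty_rank), rank(national_rank_toy))

all.equal(rho, pearson_on_ranks)
```

On the real data, something like cor(usa$quality_of_faculty, usa$national_rank, method = "spearman") should give the equivalent number.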

Let’s look at a regression model with these three variables.

regline <- lm(national_rank ~ quality_of_faculty + influence + citations, data = usa)

summary(regline)
## 
## Call:
## lm(formula = national_rank ~ quality_of_faculty + influence + 
##     citations, data = usa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -98.367 -11.842  -1.757  14.038  55.969 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        14.29659    3.77518   3.787 0.000196 ***
## quality_of_faculty  0.16760    0.02786   6.015 7.20e-09 ***
## influence           0.12428    0.01152  10.786  < 2e-16 ***
## citations           0.09275    0.01118   8.295 9.93e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.56 on 225 degrees of freedom
## Multiple R-squared:  0.8855, Adjusted R-squared:  0.884 
## F-statistic: 580.1 on 3 and 225 DF,  p-value: < 2.2e-16

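In this model, quality of faculty has the largest coefficient (0.168, versus 0.124 for influence and 0.093 for citations), so a one-place improvement in faculty rank is associated with the largest improvement in national rank. Note that a larger coefficient is not the same as a stronger correlation: influence has the largest t value here (10.8), and in the single-variable models it explained the most variance (R-squared 0.82, versus 0.55 for quality of faculty).

In other words, taken at face value, this model suggests that the best way for a school to improve its national ranking would be to invest in better faculty.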
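Raw regression coefficients depend on the scale of each predictor, so comparing them directly can mislead. One common remedy is to standardize the variables first. The sketch below uses simulated data (x1, x2 and y are made up, not columns from the data set) to show a case where the predictor with the smaller raw coefficient actually has the larger standardized effect.

```r
set.seed(42)
n <- 200

# Two simulated predictors on very different scales
x1 <- rnorm(n, mean = 100, sd = 50)   # wide spread
x2 <- rnorm(n, mean = 10,  sd = 2)    # narrow spread
y  <- 0.1 * x1 + 1.0 * x2 + rnorm(n)

# Raw coefficients (intercept dropped): x2 looks ten times more important
raw <- coef(lm(y ~ x1 + x2))[-1]

# Standardized coefficients: the effect per standard deviation of each predictor
std <- coef(lm(scale(y) ~ scale(x1) + scale(x2)))[-1]

round(raw, 3)
round(std, 3)
```

In the simulated case, x1 moves y by about 5 units per standard deviation and x2 by only about 2, which the standardized fit reflects.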

NOTE: one caveat to keep in mind is multicollinearity. In other words, the predictor variables could be highly correlated with one another as well, which makes the individual coefficients less reliable.
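A common way to measure multicollinearity is the variance inflation factor (VIF): regress each predictor on all the others and compute 1 / (1 - R-squared). The sketch below does this by hand in base R on simulated predictors (a, b and d are made up); values well above 5 or 10 are usually read as a warning sign.

```r
set.seed(7)
n <- 200

# Simulated predictors: b is built from a, d is unrelated to both
a <- rnorm(n)
b <- a + rnorm(n, sd = 0.3)
d <- rnorm(n)
X <- data.frame(a, b, d)

# VIF for each predictor: regress it on the others, then 1 / (1 - R^2)
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

The same loop could be pointed at the predictor columns of usa to see how strongly the ranking factors overlap.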

Just out of curiosity, let’s perform a multi-variable regression on all of the factors provided by the data set.

NOTE: The multicollinearity in this model will be HUGE, because it is not far-fetched to think that the predictor variables all affect one another. For example, a high publication rating could be explained by a high faculty quality.

regline <- lm(national_rank ~ quality_of_education + alumni_employment + quality_of_faculty + publications + influence + citations + broad_impact + patents + score, data = usa)

summary(regline)
## 
## Call:
## lm(formula = national_rank ~ quality_of_education + alumni_employment + 
##     quality_of_faculty + publications + influence + citations + 
##     broad_impact + patents + score, data = usa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -86.688  -5.390   3.724   8.233  18.032 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          28.967883  10.130419   2.859  0.00465 ** 
## quality_of_education  0.029685   0.011377   2.609  0.00970 ** 
## alumni_employment     0.056763   0.005802   9.784  < 2e-16 ***
## quality_of_faculty    0.069824   0.021803   3.202  0.00157 ** 
## publications          0.004607   0.011134   0.414  0.67942    
## influence             0.010750   0.010273   1.046  0.29654    
## citations             0.015292   0.007986   1.915  0.05681 .  
## broad_impact          0.146608   0.013179  11.124  < 2e-16 ***
## patents               0.014724   0.005827   2.527  0.01221 *  
## score                -0.383529   0.138639  -2.766  0.00615 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.43 on 219 degrees of freedom
## Multiple R-squared:  0.9605, Adjusted R-squared:  0.9589 
## F-statistic: 592.1 on 9 and 219 DF,  p-value: < 2.2e-16

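If we were to take the results at face value, quality of faculty has one of the larger coefficients (0.070): among the rank predictors, only broad impact (0.147) is larger.

In this model, score has a negative coefficient. That sign is expected: score is the only predictor where a higher value is better, while for national rank a lower number is better, so a higher score goes with a smaller national rank. Because score is presumably built from the other factors, including it also adds to the multicollinearity among the predictor variables.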

What About Employment Rank?

Because the ranks in the dataset are calculated on a global level and we are only interested in the USA, we need to recalculate employment rank on a national level.

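# Sort US schools by their global alumni employment rank (ties keep their original order)
usaEmployment <- usa[order(usa$alumni_employment), ]
# Number the sorted rows 1..n; n is 229, the number of US schools on the list
usaEmploymentRank <- seq_len(nrow(usaEmployment))
usaEmployment <- cbind(usaEmployment, usaEmploymentRank)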

Now that each school has a national employment rank, we can proceed with the analysis.
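One thing to be aware of: numbering the sorted rows gives tied schools different national ranks, depending on row order. If tied schools should share a rank, rank() with an explicit ties.method is an alternative. A minimal sketch on made-up global ranks:

```r
# Made-up global employment ranks for five schools, including one tie
alumni_employment <- c(10, 3, 3, 25, 7)

# Numbering sorted rows: the two tied schools get positions 1 and 2
sorted_position <- order(order(alumni_employment))

# rank() with ties.method = "min": the two tied schools both get rank 1
rank_min <- rank(alumni_employment, ties.method = "min")

sorted_position
rank_min
```

Whether ties matter here depends on how many US schools share the same global alumni_employment value, which I have not checked.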

The list with employment rank is shown below. I have also included the national rank in the table.

library(DT)
uTable <- data.frame(usaEmployment$institution, usaEmployment$usaEmploymentRank,usaEmployment$national_rank)
datatable(uTable, colnames = c("Institution", "National Employment Rank", "National Rank"))

Some variables that would be interesting to compare with employment rank:

  • National Rank

  • Quality Of Education

  • Quality Of Faculty

National Rank

string <- "Employment Rank"

plot(usaEmployment$national_rank,usaEmployment$usaEmploymentRank, xlab = "National Rank", ylab = string ,main = "National Rank vs Employment Rank")
c<-lm(usaEmploymentRank ~ national_rank, data = usaEmployment)
abline(c)

summary(c)
## 
## Call:
## lm(formula = usaEmploymentRank ~ national_rank, data = usaEmployment)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -87.85 -41.02 -12.07  45.58 111.50 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   40.07791    6.67920    6.00 7.71e-09 ***
## national_rank  0.65150    0.05035   12.94  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 227 degrees of freedom
## Multiple R-squared:  0.4244, Adjusted R-squared:  0.4219 
## F-statistic: 167.4 on 1 and 227 DF,  p-value: < 2.2e-16

Quality Of Education

plot(usaEmployment$quality_of_education,usaEmployment$usaEmploymentRank, xlab = "Quality of Education ", ylab = string, main = "Quality of Education vs Employment Rank")
c<-lm(usaEmploymentRank ~ quality_of_education, data = usaEmployment)
abline(c)

summary(c)
## 
## Call:
## lm(formula = usaEmploymentRank ~ quality_of_education, data = usaEmployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -122.416  -35.706   -5.416   39.809  133.237 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          32.38255    7.92049   4.088 6.03e-05 ***
## quality_of_education  0.31617    0.02724  11.608  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52.6 on 227 degrees of freedom
## Multiple R-squared:  0.3725, Adjusted R-squared:  0.3697 
## F-statistic: 134.8 on 1 and 227 DF,  p-value: < 2.2e-16

Quality Of Faculty

plot(usaEmployment$quality_of_faculty,usaEmployment$usaEmploymentRank, xlab = "Quality of Faculty", ylab = string, main = "Quality of Faculty vs Employment Rank")
c<-lm(usaEmploymentRank ~ quality_of_faculty, data = usaEmployment)
abline(c)

summary(c)
## 
## Call:
## lm(formula = usaEmploymentRank ~ quality_of_faculty, data = usaEmployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -123.354  -44.116   -7.686   51.646  113.056 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        46.08519    9.82576   4.690 4.71e-06 ***
## quality_of_faculty  0.40490    0.05296   7.646 5.82e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59.21 on 227 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.2013 
## F-statistic: 58.46 on 1 and 227 DF,  p-value: 5.821e-13

Multi-variable regression of the three variables together:

linReg<- lm(usaEmploymentRank~national_rank+quality_of_education+quality_of_faculty, data = usaEmployment)
summary(linReg)
## 
## Call:
## lm(formula = usaEmploymentRank ~ national_rank + quality_of_education + 
##     quality_of_faculty, data = usaEmployment)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -93.799 -34.415  -7.601  42.272 116.587 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          33.24391    8.18938   4.059 6.79e-05 ***
## national_rank         0.54415    0.07659   7.105 1.57e-11 ***
## quality_of_education  0.19351    0.03676   5.265 3.27e-07 ***
## quality_of_faculty   -0.18441    0.06775  -2.722    0.007 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.66 on 225 degrees of freedom
## Multiple R-squared:  0.4892, Adjusted R-squared:  0.4824 
## F-statistic: 71.83 on 3 and 225 DF,  p-value: < 2.2e-16

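Because Quality of Education and Quality of Faculty may have some overlap, it may be better to leave out one of the variables. The overlap shows in the model above: the quality_of_faculty coefficient is positive on its own (0.405) but turns negative (-0.184) once the other two variables are included.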

linReg<- lm(usaEmploymentRank~national_rank+quality_of_education, data = usaEmployment)
summary(linReg)
## 
## Call:
## lm(formula = usaEmploymentRank ~ national_rank + quality_of_education, 
##     data = usaEmployment)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -97.275 -36.367  -8.261  45.342 126.089 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          23.19474    7.41297   3.129  0.00199 ** 
## national_rank         0.43971    0.06722   6.542 4.01e-10 ***
## quality_of_education  0.15781    0.03482   4.532 9.46e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.33 on 226 degrees of freedom
## Multiple R-squared:  0.4724, Adjusted R-squared:  0.4677 
## F-statistic: 101.2 on 2 and 226 DF,  p-value: < 2.2e-16

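According to the model, national rank has the strongest association with employment rank: it has the larger coefficient (0.44 versus 0.16 for quality of education) and the larger t value (6.5 versus 4.5).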
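When deciding whether a variable can be left out, the two nested models can be compared directly with anova() and with adjusted R-squared. The sketch below uses simulated data (x1, x2, x3 and y are made up), where x3 is pure noise, so it shows the mechanics and not the result for this data set.

```r
set.seed(3)
n <- 150

# Simulated data: y depends on x1 only; x2 overlaps with x1; x3 is noise
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

small <- lm(y ~ x1 + x2)        # model without the candidate variable
full  <- lm(y ~ x1 + x2 + x3)   # model with it

# F-test for whether the extra variable improves the fit
cmp <- anova(small, full)
cmp

# Adjusted R-squared penalizes the extra term, unlike plain R-squared
c(adj_small = summary(small)$adj.r.squared,
  adj_full  = summary(full)$adj.r.squared)
```

For the models above, the adjusted R-squared values already printed (0.4824 with quality of faculty, 0.4677 without) allow the same comparison.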

Interpretation

If we were to take this data at face value, the prestige of the university matters more when it comes to employment ranking.

However, as I mentioned before, the factors within the provided dataset may be multicollinear, which may distort the results.

Still, it is interesting to see that prestige seems to be more strongly correlated with employment rank than factors such as quality of education.

Main Take Away

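Taken at face value, the models suggest that the most effective way for universities to improve the employment rank of their alumni is to improve their national ranking, and that the best way to improve national ranking is to invest in better faculty. In other words, faculty quality is linked to alumni employment mainly through national rank: on its own it explains only about 20% of the variance in employment rank (R-squared 0.2048). These are correlations within a single year of data, so they do not show cause and effect.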