Introduction
The data was obtained through Kaggle. The original dataset was published by The Center for World University Rankings (CWUR).
According to the CWUR website, the methodology for their measurements is as follows:
1. Quality of Education, measured by the number of a university’s alumni who have won major international awards, prizes, and medals relative to the university’s size [25%]
2. Alumni Employment, measured by the number of a university’s alumni who have held CEO positions at the world’s top companies relative to the university’s size [25%]
3. Quality of Faculty, measured by the number of academics who have won major international awards, prizes, and medals [25%]
4. Publications, measured by the number of research papers appearing in reputable journals [5%]
5. Influence, measured by the number of research papers appearing in highly-influential journals [5%]
6. Citations, measured by the number of highly-cited research papers [5%]
7. Broad Impact, measured by the university’s h-index [5%]
8. Patents, measured by the number of international patent filings [5%]
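As a quick check on the weights listed above, the sketch below confirms that they sum to 100% and shows how a weighted composite could be formed. The indicator scores are made-up numbers on a 0-100 scale, and the plain weighted sum is an assumption for illustration only; CWUR's exact scoring formula is not given here.

```r
# Weights as listed in the CWUR methodology above
weights <- c(
  quality_of_education = 0.25,
  alumni_employment    = 0.25,
  quality_of_faculty   = 0.25,
  publications         = 0.05,
  influence            = 0.05,
  citations            = 0.05,
  broad_impact         = 0.05,
  patents              = 0.05
)
total_weight <- sum(weights)

# Hypothetical indicator scores (0-100) for one imaginary university
indicator_scores <- c(90, 80, 70, 60, 60, 60, 60, 60)

# Assumed composite: a plain weighted sum of the indicator scores
composite <- sum(weights * indicator_scores)
print(total_weight)
print(composite)
```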
Reading in the Data
library(DT)
data<- read.csv("cwurData.csv")
Since this list has different entries for different years for the same schools, we need to choose a year to analyze.
In this case, I will analyze the most recent available data: 2015.
The code below keeps only the 2015 rows, which also removes the duplicate entries for each school.
#data <- subset(data,!duplicated(data$institution))
data <- subset(data, data$year == 2015)
str(data)
## 'data.frame': 1000 obs. of 14 variables:
## $ world_rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ institution : Factor w/ 1024 levels "Örebro University",..: 194 520 322 653 833 106 643 664 442 110 ...
## $ country : Factor w/ 59 levels "Argentina","Australia",..: 59 59 59 57 57 59 59 59 59 59 ...
## $ national_rank : int 1 2 3 1 2 4 5 6 7 8 ...
## $ quality_of_education: int 1 9 3 2 7 13 5 11 4 12 ...
## $ alumni_employment : int 1 2 11 10 13 6 21 14 15 18 ...
## $ quality_of_faculty : int 1 4 2 5 10 9 6 8 3 14 ...
## $ publications : int 1 5 15 11 7 13 10 17 72 24 ...
## $ influence : int 1 3 2 6 12 13 4 16 25 15 ...
## $ citations : int 1 3 2 12 7 11 4 12 24 25 ...
## $ broad_impact : int 1 4 2 13 9 12 7 22 33 22 ...
## $ patents : int 3 10 1 48 15 4 29 141 225 11 ...
## $ score : num 100 98.7 97.5 96.8 96.5 ...
## $ year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
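Note that the output above still reports 1024 institution levels for 1000 rows, because subsetting a data frame keeps unused factor levels. A minimal sketch of one way to handle this, using a small made-up data frame (`toy` and its values are hypothetical) rather than the real file: pick the latest year with `max()` instead of hard-coding it, then call `droplevels()`.

```r
# Hypothetical stand-in for the CWUR data: school "D" appears only in 2014
toy <- data.frame(
  institution = factor(c("A", "D", "A", "B", "C")),
  year        = c(2014, 2014, 2015, 2015, 2015),
  world_rank  = c(1, 2, 1, 3, 2)
)

# Keep the most recent year without hard-coding it
latest <- subset(toy, year == max(year))

# Drop factor levels that no longer occur after filtering
latest <- droplevels(latest)
str(latest)
```

With a single year selected, each institution should appear at most once, so a separate de-duplication step is not needed.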
Visualizing the Top Countries in Higher Learning
height<- sort(table(data$country), decreasing = TRUE)
barplot(height[1:10], las = 3, main = "Top Countries in University Rankings 2015")

The US has far more universities on the 2015 list than any other country.
Exploring US Schools
Let’s explore the top Universities in the US.
Since we are only interested in American schools, we need to build a data frame that contains only USA schools.
usa <- subset(data, data$country == "USA")
Use the search bar to see if your school made the list!
library(DT)
datatable(usa)
A matrix plot of the data can be drawn with the call below (commented out here).
#pairs(data)
Analysis
Looking at the data, it is clear that national ranking is determined by multiple factors. To see how different factors correlate with national rank, we begin by plotting national rank against individual variables.
I chose these three variables for comparison.
Quality of Faculty
Influence
Citations
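Since all of these variables are ranks, a rank-based (Spearman) correlation is one quick way to screen them before fitting regressions. The sketch below uses simulated data (`toy` is hypothetical, with column names borrowed from the dataset); with the real data the same `cor()` call could be applied to the numeric columns of `usa`.

```r
set.seed(7)
n <- 100

# Simulated ranks: national_rank is built to track quality_of_faculty,
# while citations is an unrelated random ordering
quality_of_faculty <- sample(n)
toy <- data.frame(
  national_rank      = rank(quality_of_faculty + rnorm(n, sd = 20)),
  quality_of_faculty = quality_of_faculty,
  citations          = sample(n)
)

# Spearman correlation matrix; the national_rank row shows how each
# variable moves with national rank
rho <- cor(toy, method = "spearman")
print(round(rho["national_rank", ], 2))
```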
Quality Of Faculty
plot (usa$quality_of_faculty, usa$national_rank, xlab ="Quality of Faculty", ylab = "National Rank", main = "Quality of Faculty vs National Rank")
c <- lm(national_rank ~ quality_of_faculty, data = usa)
abline(c)

summary(c) #regression model
##
## Call:
## lm(formula = national_rank ~ quality_of_faculty, data = usa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -118.610 -28.610 -0.349 31.390 82.390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4456 7.4219 0.33 0.742
## quality_of_faculty 0.6613 0.0400 16.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.72 on 227 degrees of freedom
## Multiple R-squared: 0.5463, Adjusted R-squared: 0.5443
## F-statistic: 273.3 on 1 and 227 DF, p-value: < 2.2e-16
Influence
plot(usa$influence, usa$national_rank, xlab = "Influence", ylab = "National Rank", main = "Influence vs National Rank")
c <- lm(national_rank ~ influence, data = usa)
abline(c)

summary(c) #regression model
##
## Call:
## lm(formula = national_rank ~ influence, data = usa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.27 -17.67 -0.14 17.16 70.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.252418 3.075865 11.46 <2e-16 ***
## influence 0.233616 0.007202 32.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.97 on 227 degrees of freedom
## Multiple R-squared: 0.8225, Adjusted R-squared: 0.8218
## F-statistic: 1052 on 1 and 227 DF, p-value: < 2.2e-16
Citations
plot (usa$citations, usa$national_rank, xlab = "Citations", ylab = "National Rank", main = "Citations vs National Rank")
c <- lm(national_rank ~ citations, data = usa)
abline(c)

summary(c) #regression model
##
## Call:
## lm(formula = national_rank ~ citations, data = usa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.526 -19.187 -2.066 18.193 106.955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.855309 3.403867 12.00 <2e-16 ***
## citations 0.231122 0.008354 27.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.76 on 227 degrees of freedom
## Multiple R-squared: 0.7713, Adjusted R-squared: 0.7702
## F-statistic: 765.4 on 1 and 227 DF, p-value: < 2.2e-16
Interpretation
According to the models, all three variables are positively correlated with national rank.
This is to be expected: a better (numerically lower) rank in faculty quality, influence, or citations should go with a better national rank.
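Because each of the three models above has a single predictor, the correlation coefficient can be recovered as the square root of the reported Multiple R-squared, taking the sign of the slope (positive in all three cases). A small worked example using the R-squared values printed above:

```r
# Multiple R-squared values copied from the three summaries above
r_squared <- c(
  quality_of_faculty = 0.5463,
  influence          = 0.8225,
  citations          = 0.7713
)

# In simple linear regression, |r| = sqrt(R^2); all slopes are positive here
r <- sqrt(r_squared)
print(round(r, 3))

# Which single variable is most strongly correlated with national rank?
strongest <- names(which.max(r))
print(strongest)
```

On their own, then, influence tracks national rank most closely, followed by citations and quality of faculty.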
Let’s look at a regression model with these three variables.
regline <- lm(national_rank ~ quality_of_faculty + influence + citations, data = usa)
summary(regline)
##
## Call:
## lm(formula = national_rank ~ quality_of_faculty + influence +
## citations, data = usa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.367 -11.842 -1.757 14.038 55.969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.29659 3.77518 3.787 0.000196 ***
## quality_of_faculty 0.16760 0.02786 6.015 7.20e-09 ***
## influence 0.12428 0.01152 10.786 < 2e-16 ***
## citations 0.09275 0.01118 8.295 9.93e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.56 on 225 degrees of freedom
## Multiple R-squared: 0.8855, Adjusted R-squared: 0.884
## F-statistic: 580.1 on 3 and 225 DF, p-value: < 2.2e-16
In this model, quality of faculty has the largest estimated coefficient (0.168) of the three predictors, compared with 0.124 for influence and 0.093 for citations.
In other words, according to this model, the best way for a school to improve its national ranking would be to invest in better faculty.
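One caveat when comparing coefficients: they depend on the scale of each predictor, and the rank columns here do not all span the same range. A common way to put them on equal footing is to standardize every variable before fitting. The sketch below uses simulated data (`toy`, `a`, and `b` are hypothetical) to show how the ordering of raw and standardized coefficients can differ.

```r
set.seed(42)
n <- 150

# Two predictors on very different scales
toy <- data.frame(a = rnorm(n, sd = 1), b = rnorm(n, sd = 100))
toy$y <- 2 * toy$a + 0.05 * toy$b + rnorm(n)

# Raw coefficients: change in y per one unit of each predictor
raw <- coef(lm(y ~ a + b, data = toy))

# Standardized coefficients: change in y (in SDs) per one SD of each predictor
std <- coef(lm(y ~ a + b, data = as.data.frame(scale(toy))))

print(round(raw, 3))
print(round(std, 3))
```

Here `a` has the larger raw coefficient, but `b` has the larger standardized one, because one standard deviation of `b` covers many more units.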
NOTE: Multicollinearity in the full model below is likely to be severe, because it is not far-fetched to think that the predictor variables all affect one another. For example, a high publication ranking could be explained by high faculty quality.
regline <- lm(national_rank ~quality_of_education + alumni_employment + quality_of_faculty + publications + influence + citations +broad_impact + patents + score, data = usa)
summary(regline)
##
## Call:
## lm(formula = national_rank ~ quality_of_education + alumni_employment +
## quality_of_faculty + publications + influence + citations +
## broad_impact + patents + score, data = usa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -86.688 -5.390 3.724 8.233 18.032
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.967883 10.130419 2.859 0.00465 **
## quality_of_education 0.029685 0.011377 2.609 0.00970 **
## alumni_employment 0.056763 0.005802 9.784 < 2e-16 ***
## quality_of_faculty 0.069824 0.021803 3.202 0.00157 **
## publications 0.004607 0.011134 0.414 0.67942
## influence 0.010750 0.010273 1.046 0.29654
## citations 0.015292 0.007986 1.915 0.05681 .
## broad_impact 0.146608 0.013179 11.124 < 2e-16 ***
## patents 0.014724 0.005827 2.527 0.01221 *
## score -0.383529 0.138639 -2.766 0.00615 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.43 on 219 degrees of freedom
## Multiple R-squared: 0.9605, Adjusted R-squared: 0.9589
## F-statistic: 592.1 on 9 and 219 DF, p-value: < 2.2e-16
If we were to take the results at face value, quality of faculty has one of the larger coefficients for national rank.
In this model, score has a negative coefficient. That direction is expected, since a higher score corresponds to a better (numerically lower) rank, but multicollinearity among such similar predictor variables makes the individual coefficients hard to interpret.
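One way to quantify the multicollinearity concern is the variance inflation factor (VIF): for each predictor, regress it on the other predictors and compute 1 / (1 - R-squared). The sketch below does this in base R on simulated data; `vif_base` is a hypothetical helper written for this example, and `toy` stands in for the real `usa` data frame.

```r
set.seed(1)
n <- 200

# Simulated predictors: x2 is nearly a copy of x1, x3 is unrelated
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
toy <- data.frame(y = x1 + x3 + rnorm(n), x1, x2, x3)

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors
vif_base <- function(df, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_base(toy, c("x1", "x2", "x3"))
print(round(vifs, 1))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign that a coefficient is poorly determined. Applied to the real data, the predictor names from the full model would be passed in place of `x1`, `x2`, and `x3`.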
What About Employment Rank?
Because the dataset is calculated on a global level and we are only interested in the USA region, we need to recalculate employment rank on a national level.
usaEmployment <- usa[order(usa$alumni_employment), ]
# Number the sorted schools 1..n rather than hard-coding 229 (the number of US schools on the list)
usaEmploymentRank <- seq_len(nrow(usaEmployment))
usaEmployment <- cbind(usaEmployment, usaEmploymentRank)
Now that each school has been assigned a national employment rank, we can proceed with the analysis.
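Numbering the sorted rows gives schools with the same global employment rank different national ranks. If tied schools should share a rank, `rank()` with `ties.method = "min"` is one alternative. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical):

```r
# Hypothetical schools; A and C share the same global employment rank
toy <- data.frame(
  institution       = c("A", "B", "C", "D"),
  alumni_employment = c(12, 3, 12, 40)
)

# Tied schools receive the same (lowest) national rank
toy$usaEmploymentRank <- rank(toy$alumni_employment, ties.method = "min")

# Sort by the new national employment rank
toy <- toy[order(toy$usaEmploymentRank), ]
print(toy)
```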
The list with Employment Rank is as follows
I have also included the national rank in the table.
library(DT)
uTable <- data.frame(usaEmployment$institution, usaEmployment$usaEmploymentRank,usaEmployment$national_rank)
datatable(uTable, colnames = c("Institution", "National Employment Rank", "National Rank"))
Some variables that would be interesting to compare with employment rank:
National Rank
Quality Of Education
Quality Of Faculty
National Rank
string= "Employment Rank"
plot(usaEmployment$national_rank,usaEmployment$usaEmploymentRank, xlab = "National Rank", ylab = string ,main = "National Rank vs Employment Rank")
c<-lm(usaEmploymentRank ~ national_rank, data = usaEmployment)
abline(c)

summary(c)
##
## Call:
## lm(formula = usaEmploymentRank ~ national_rank, data = usaEmployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.85 -41.02 -12.07 45.58 111.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.07791 6.67920 6.00 7.71e-09 ***
## national_rank 0.65150 0.05035 12.94 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 227 degrees of freedom
## Multiple R-squared: 0.4244, Adjusted R-squared: 0.4219
## F-statistic: 167.4 on 1 and 227 DF, p-value: < 2.2e-16
Quality Of Education
plot(usaEmployment$quality_of_education,usaEmployment$usaEmploymentRank, xlab = "Quality of Education ", ylab = string, main = "Quality of Education vs Employment Rank")
c<-lm(usaEmploymentRank ~ quality_of_education, data = usaEmployment)
abline(c)

summary(c)
##
## Call:
## lm(formula = usaEmploymentRank ~ quality_of_education, data = usaEmployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -122.416 -35.706 -5.416 39.809 133.237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.38255 7.92049 4.088 6.03e-05 ***
## quality_of_education 0.31617 0.02724 11.608 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.6 on 227 degrees of freedom
## Multiple R-squared: 0.3725, Adjusted R-squared: 0.3697
## F-statistic: 134.8 on 1 and 227 DF, p-value: < 2.2e-16
Quality Of Faculty
plot(usaEmployment$quality_of_faculty,usaEmployment$usaEmploymentRank, xlab = "Quality of Faculty", ylab = string, main = "Quality of Faculty vs Employment Rank")
c<-lm(usaEmploymentRank ~ quality_of_faculty, data = usaEmployment)
abline(c)

summary(c)
##
## Call:
## lm(formula = usaEmploymentRank ~ quality_of_faculty, data = usaEmployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -123.354 -44.116 -7.686 51.646 113.056
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.08519 9.82576 4.690 4.71e-06 ***
## quality_of_faculty 0.40490 0.05296 7.646 5.82e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59.21 on 227 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.2013
## F-statistic: 58.46 on 1 and 227 DF, p-value: 5.821e-13
Multivariable regression with the three variables together
linReg<- lm(usaEmploymentRank~national_rank+quality_of_education+quality_of_faculty, data = usaEmployment)
summary(linReg)
##
## Call:
## lm(formula = usaEmploymentRank ~ national_rank + quality_of_education +
## quality_of_faculty, data = usaEmployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.799 -34.415 -7.601 42.272 116.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.24391 8.18938 4.059 6.79e-05 ***
## national_rank 0.54415 0.07659 7.105 1.57e-11 ***
## quality_of_education 0.19351 0.03676 5.265 3.27e-07 ***
## quality_of_faculty -0.18441 0.06775 -2.722 0.007 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.66 on 225 degrees of freedom
## Multiple R-squared: 0.4892, Adjusted R-squared: 0.4824
## F-statistic: 71.83 on 3 and 225 DF, p-value: < 2.2e-16
Because Quality of Education and Quality of Faculty may have some overlap, it may be better to leave out one of the variables.
linReg<- lm(usaEmploymentRank~national_rank+quality_of_education, data = usaEmployment)
summary(linReg)
##
## Call:
## lm(formula = usaEmploymentRank ~ national_rank + quality_of_education,
## data = usaEmployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97.275 -36.367 -8.261 45.342 126.089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.19474 7.41297 3.129 0.00199 **
## national_rank 0.43971 0.06722 6.542 4.01e-10 ***
## quality_of_education 0.15781 0.03482 4.532 9.46e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.33 on 226 degrees of freedom
## Multiple R-squared: 0.4724, Adjusted R-squared: 0.4677
## F-statistic: 101.2 on 2 and 226 DF, p-value: < 2.2e-16
According to the model, national rank has the strongest correlation with employment rank.
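Whether dropping a predictor is justified can be checked by comparing the two nested models directly. In the output above, the adjusted R-squared falls slightly from 0.4824 (three predictors) to 0.4677 (two predictors). The sketch below shows the mechanics of such a comparison with `anova()` on simulated data; `toy`, `small`, and `full` are hypothetical names, and with the real data the two `linReg` fits would need to be stored under separate names.

```r
set.seed(3)
n <- 120

# Simulated data in which x3 has no true effect on y
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1.5 * toy$x1 + 0.8 * toy$x2 + rnorm(n)

small <- lm(y ~ x1 + x2, data = toy)
full  <- lm(y ~ x1 + x2 + x3, data = toy)

# F test: does the extra predictor significantly improve the fit?
cmp <- anova(small, full)
print(cmp)

# Adjusted R-squared penalizes the additional term
adj <- c(small = summary(small)$adj.r.squared,
         full  = summary(full)$adj.r.squared)
print(round(adj, 4))
```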
Interpretation
If we were to take this data at face value, the prestige of the university matters more than the other factors when it comes to employment ranking.
However, as I mentioned before, the factors within the provided dataset may be multicollinear, which may distort the results.
Still, it is interesting to see that prestige seems to be more strongly correlated with employment than factors such as quality of education.
Main Take Away
For universities to improve the employment ranking of their alumni, these models suggest that the most effective route is to improve their national ranking. To improve national ranking, the models point to investing in better faculty. In other words, faculty quality appears linked to alumni employment mainly through national rank; on its own it explains only about 20% of the variance in employment rank.