Introduction

This article aims to answer my research question: “What variables influence the world rank of universities in 2016?” I used ‘world_rank’ as my dependent variable, representing a university’s position in the 2016 global ranking, which covered 800 universities from all over the world. Because the ranking spans institutions worldwide, the results of my analysis may be interpreted from a real-life perspective and may provide a valuable research outcome. I used the other variables, representing various university features, as independent variables in a Principal Component Analysis (PCA) to determine which of them have the most influence on my dependent variable. This research question is particularly interesting to me because I have recently been choosing an international university abroad to study at after my master’s degree, so this research was a very useful and practical asset for me, as I hope it will be for readers of this article.

Data

The data comes from the “World University Rankings” data set available on Kaggle, specifically the “timesData.csv” file, which covers the Times Higher Education (THE) World University Rankings from 2011 to 2016. The file contains 2603 rows and 14 columns, with each row representing a university and each column representing a variable, such as the university name, country, and performance scores across various measures of research, teaching, and internationalization. I chose only this file because it contained all the relevant information, in a concise form, needed to answer my research question. I used only the 2016 data to ensure the correctness of my analysis, because applying PCA to time-series data can be challenging.

Principal Component Analysis (PCA) methodology

According to Cord and Cunningham (2014), Principal Component Analysis (PCA) is a widely used technique for dimension reduction in multivariate data analysis. Commonly cited strengths and weaknesses of the method include:

Strengths:

  • It reduces dimensionality while retaining most of the variance in the data.
  • The resulting components are uncorrelated, which removes multicollinearity among the original variables.
  • It makes high-dimensional data easier to visualize in two or three dimensions.

Weaknesses:

  • It is sensitive to the scaling of the variables, so standardization is usually required.
  • It captures only linear relationships between variables.
  • The components are linear combinations of the original variables and can be hard to interpret.
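To make the mechanics concrete, here is a minimal sketch (my own illustration, not taken from the book) showing that, for standardized data, the principal axes are the eigenvectors of the correlation matrix:

# minimal illustration: PCA as an eigendecomposition of the correlation matrix
set.seed(1)
X <- matrix(rnorm(200), ncol = 4)      # toy data: 50 observations, 4 variables
eig <- eigen(cor(X))                   # eigenvectors are the principal axes
pca_toy <- prcomp(X, scale. = TRUE)    # the same analysis via prcomp()
# loadings agree up to sign (eigenvectors are defined only up to sign)
all.equal(abs(eig$vectors), abs(unname(pca_toy$rotation)))
# proportion of variance explained by each component
eig$values / sum(eig$values)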

Analysis

Loading packages

To do the analysis, the following packages will be needed.

packages <- c("tidyverse", "readr", "corrplot", "ggplot2", "GGally", 
              "patchwork", "tibble", "gridExtra", "RColorBrewer", 
              "factoextra", "labdsv", "maptools", "psych", "ClusterR", 
              "readxl", "cluster", "flexclust", "fpc", "clustertend", 
              "ggthemes", "plotly", "stringr", "missMDA", "ade4", "smacof", 
              "Rtsne", "psy", "scales", "kableExtra", "pdp", "jpeg", "dplyr")

# Install any missing packages, then load them
for (package in packages) {
         if (!require(package, character.only = TRUE)) {
                  install.packages(package)
                  library(package, character.only = TRUE)
         }
}

Data input

The first step is to read in the “timesData.csv” file and filter the data to include only observations from the year 2016. Then I set the row names of the data frame to the university names for easier lookup in the later analysis, and dropped the “university_name” column accordingly. I also deleted the “country” column, because PCA works on numeric data and this categorical variable would not provide meaningful input. Finally, I printed the column names of the cleaned data frame to check that the cleaning process was successful.

# read in the CSV file
df <- read.csv("/Users/annaczarnocka/Desktop/UL2/timesData.csv", header = TRUE, stringsAsFactors = FALSE)

# filter data for the year 2016
df <- df[df$year == 2016,]
df <- subset(df, select = -year)

#set row_names to university names
rownames(df) <- df[,2]
# drop the university_name column
df <- df[, -2] 
# drop the country column
df <- df[, -which(names(df) == "country")]

# print column names to verify the cleaning
print(colnames(df))
##  [1] "world_rank"             "teaching"               "international"         
##  [4] "research"               "citations"              "income"                
##  [7] "total_score"            "num_students"           "student_staff_ratio"   
## [10] "international_students" "female_male_ratio"
str(df)
## 'data.frame':    800 obs. of  11 variables:
##  $ world_rank            : chr  "1" "2" "3" "4" ...
##  $ teaching              : num  95.6 86.5 92.5 88.2 89.4 83.6 85.1 83.3 77 85.7 ...
##  $ international         : chr  "64.0" "94.4" "76.3" "91.5" ...
##  $ research              : num  97.6 98.9 96.2 96.7 88.6 99 91.9 88.5 95 88.9 ...
##  $ citations             : num  99.8 98.8 99.9 97 99.7 99.8 99.3 96.7 91.1 99.2 ...
##  $ income                : chr  "97.8" "73.1" "63.3" "55.0" ...
##  $ total_score           : chr  "95.2" "94.2" "93.9" "92.8" ...
##  $ num_students          : chr  "2,243" "19,919" "15,596" "18,812" ...
##  $ student_staff_ratio   : num  6.9 11.6 7.8 11.8 9 8.9 8.4 11.7 14.7 6.9 ...
##  $ international_students: chr  "27%" "34%" "22%" "34%" ...
##  $ female_male_ratio     : chr  "33 : 67" "46 : 54" "42 : 58" "46 : 54" ...

Data cleaning

Deleting rows with missing values

# Check for missing values
sum(is.na(df))
## [1] 7
# Remove rows with missing values
df <- df[complete.cases(df), ]

Converting world_rank to numeric

Apart from plain integers, the ‘world_rank’ variable also contained tied ranks prefixed with ‘=’ (e.g. “=100”) and rank ranges such as “201-250”, so the column had to be converted to numeric without losing the rank information. Ranges were replaced with the midpoint of their bounds, which is why half ranks such as 225.5 appear after conversion.

# strip the '=' prefix from tied (ex aequo) ranks
df$world_rank <- gsub("^=", "", df$world_rank)

# convert range values to the average of lower and upper bounds
range_vals <- grepl("-", df$world_rank)
lower_bounds <- as.numeric(gsub("-.*", "", df$world_rank[range_vals]))
upper_bounds <- as.numeric(gsub(".*-", "", df$world_rank[range_vals]))
df$world_rank[range_vals] <- ifelse(!is.na(lower_bounds) & !is.na(upper_bounds), (lower_bounds + upper_bounds) / 2, NA)

# convert remaining values to numeric
df$world_rank <- as.numeric(df$world_rank)

# checking outcome
sum(is.na(df$world_rank))
## [1] 0
summary(df$world_rank)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   225.5   450.5   400.6   700.5   700.5

Converting the rest of variables to numeric

All remaining character variables needed to be converted to numeric, as below. The coercion introduced some missing values, in particular 595 missing values in the ‘total_score’ column, so I decided to remove that column: it was not essential for my research question compared with the rest of the variables. I also had to convert ‘female_male_ratio’ from its string format (for instance “33 : 67”) to the corresponding numeric ratio. At the end there is a summary of the changes, with all variables now in numeric format.

# convert 'international' column to numeric
df$international <- as.numeric(df$international)

# convert 'income' column to numeric
df$income <- as.numeric(df$income)
## Warning: NAs introduced by coercion
# convert 'total_score' column to numeric
df$total_score <- as.numeric(df$total_score)
## Warning: NAs introduced by coercion
colSums(is.na(df))
##             world_rank               teaching          international 
##                      0                      0                      0 
##               research              citations                 income 
##                      0                      0                     37 
##            total_score           num_students    student_staff_ratio 
##                    595                      0                      0 
## international_students      female_male_ratio 
##                      0                      0
# total_score has 595 missing values, so drop it
df <- subset(df, select = -total_score)
df <- na.omit(df)
sum(is.na(df))
## [1] 0
# convert 'num_students' column to numeric
df$num_students <- as.numeric(gsub(",", "", df$num_students))

# convert 'international_students' column to numeric
df$international_students <- as.numeric(gsub("%", "", df$international_students))/100

# split the "female_male_ratio" column into two separate columns
df$female_ratio <- as.numeric(gsub("^([0-9]+).*", "\\1", df$female_male_ratio))
## Warning: NAs introduced by coercion
df$male_ratio <- as.numeric(gsub(".*: ([0-9]+)$", "\\1", df$female_male_ratio))
## Warning: NAs introduced by coercion
# calculate the female-male ratio
df$female_male_ratio <- df$female_ratio / df$male_ratio

# remove the female_ratio and male_ratio columns
df$female_ratio <- NULL
df$male_ratio <- NULL
df$female_male_ratio <- replace(df$female_male_ratio, is.infinite(df$female_male_ratio), NA)
sum(is.na(df$female_male_ratio))
## [1] 54
df <- na.omit(df)

# see the changes
str(df)
## 'data.frame':    701 obs. of  10 variables:
##  $ world_rank            : num  1 2 3 4 5 7 8 9 10 11 ...
##  $ teaching              : num  95.6 86.5 92.5 88.2 89.4 85.1 83.3 77 85.7 77.6 ...
##  $ international         : num  64 94.4 76.3 91.5 84 78.5 96 97.9 65 70 ...
##  $ research              : num  97.6 98.9 96.2 96.7 88.6 91.9 88.5 95 88.9 90.4 ...
##  $ citations             : num  99.8 98.8 99.9 97 99.7 99.3 96.7 91.1 99.2 98.2 ...
##  $ income                : num  97.8 73.1 63.3 55 95.4 52.1 53.7 80 36.6 100 ...
##  $ num_students          : num  2243 19919 15596 18812 11074 ...
##  $ student_staff_ratio   : num  6.9 11.6 7.8 11.8 9 8.4 11.7 14.7 6.9 3.6 ...
##  $ international_students: num  0.27 0.34 0.22 0.34 0.33 0.27 0.51 0.37 0.21 0.23 ...
##  $ female_male_ratio     : num  0.493 0.852 0.724 0.852 0.587 ...
##  - attr(*, "na.action")= 'omit' Named int [1:55] 6 18 40 41 47 49 57 59 83 86 ...
##   ..- attr(*, "names")= chr [1:55] "Harvard University" "University of Toronto" "Peking University" "University of Tokyo" ...
sum(is.na(df))
## [1] 0

Variables

The description of my variables:

  1. world_rank: Ranking of universities in the world, where a lower number indicates a higher rank.
  2. teaching: Score for the quality of teaching, based on a survey of academics.
  3. international: Score for international outlook, based on the number of international staff, students, and research collaborations.
  4. research: Score for research output, based on the number of research papers and citations per faculty member.
  5. citations: Score for research influence, based on the number of citations per paper.
  6. income: Score for industry income, based on the amount of research funding from industry.
  7. num_students: Number of students at the university.
  8. student_staff_ratio: Ratio of students to academic staff.
  9. international_students: Percentage of international students at the university.
  10. female_male_ratio: Ratio of female to male students at the university.

Based on the summary, the range and distribution of each variable differ: some variables span a wide range and others a narrow one, and some distributions appear skewed (such as income and international_students) while others look closer to normal. All variables are continuous and numeric, which suits PCA-based dimension reduction well. PCA finds the linear combinations of variables that explain the most variation in the data, so variables with a wide range of potential values help capture more of the underlying variation.

Below we can also see the first 6 rows of the cleaned data set, i.e. the 6 best-ranked universities and their features. I also added examples of searching for a particular university, which can be useful if one is interested in getting information about a chosen university.

print(colnames(df))
##  [1] "world_rank"             "teaching"               "international"         
##  [4] "research"               "citations"              "income"                
##  [7] "num_students"           "student_staff_ratio"    "international_students"
## [10] "female_male_ratio"
summary(df)
##    world_rank       teaching     international      research    
##  Min.   :  1.0   Min.   : 9.90   Min.   : 7.10   Min.   : 5.40  
##  1st Qu.:199.0   1st Qu.:20.90   1st Qu.:30.10   1st Qu.:14.60  
##  Median :375.5   Median :27.00   Median :46.40   Median :22.50  
##  Mean   :395.3   Mean   :31.41   Mean   :49.26   Mean   :28.22  
##  3rd Qu.:550.5   3rd Qu.:37.30   3rd Qu.:66.60   3rd Qu.:35.30  
##  Max.   :700.5   Max.   :95.60   Max.   :99.90   Max.   :98.90  
##    citations          income        num_students    student_staff_ratio
##  Min.   :  1.20   Min.   : 28.00   Min.   :   462   Min.   :  0.6      
##  1st Qu.: 28.30   1st Qu.: 31.60   1st Qu.: 12063   1st Qu.: 12.6      
##  Median : 52.20   Median : 38.10   Median : 19646   Median : 16.9      
##  Mean   : 52.07   Mean   : 46.27   Mean   : 23976   Mean   : 19.4      
##  3rd Qu.: 75.40   3rd Qu.: 53.70   3rd Qu.: 29396   3rd Qu.: 22.8      
##  Max.   :100.00   Max.   :100.00   Max.   :379231   Max.   :162.6      
##  international_students female_male_ratio
##  Min.   :0.0000         Min.   :0.0101   
##  1st Qu.:0.0500         1st Qu.:0.7857   
##  Median :0.1000         Median :1.0833   
##  Mean   :0.1307         Mean   :1.0781   
##  3rd Qu.:0.1800         3rd Qu.:1.3256   
##  Max.   :0.8200         Max.   :3.5455
head(df)
##                                       world_rank teaching international
## California Institute of Technology             1     95.6          64.0
## University of Oxford                           2     86.5          94.4
## Stanford University                            3     92.5          76.3
## University of Cambridge                        4     88.2          91.5
## Massachusetts Institute of Technology          5     89.4          84.0
## Princeton University                           7     85.1          78.5
##                                       research citations income num_students
## California Institute of Technology        97.6      99.8   97.8         2243
## University of Oxford                      98.9      98.8   73.1        19919
## Stanford University                       96.2      99.9   63.3        15596
## University of Cambridge                   96.7      97.0   55.0        18812
## Massachusetts Institute of Technology     88.6      99.7   95.4        11074
## Princeton University                      91.9      99.3   52.1         7929
##                                       student_staff_ratio
## California Institute of Technology                    6.9
## University of Oxford                                 11.6
## Stanford University                                   7.8
## University of Cambridge                              11.8
## Massachusetts Institute of Technology                 9.0
## Princeton University                                  8.4
##                                       international_students female_male_ratio
## California Institute of Technology                      0.27         0.4925373
## University of Oxford                                    0.34         0.8518519
## Stanford University                                     0.22         0.7241379
## University of Cambridge                                 0.34         0.8518519
## Massachusetts Institute of Technology                   0.33         0.5873016
## Princeton University                                    0.27         0.8181818
# example searching
df["Stanford University", ]
##                     world_rank teaching international research citations income
## Stanford University          3     92.5          76.3     96.2      99.9   63.3
##                     num_students student_staff_ratio international_students
## Stanford University        15596                 7.8                   0.22
##                     female_male_ratio
## Stanford University         0.7241379
df["University of Warsaw", ]
##                      world_rank teaching international research citations
## University of Warsaw      550.5     26.1          41.6     15.8      42.2
##                      income num_students student_staff_ratio
## University of Warsaw   28.5        49292                14.1
##                      international_students female_male_ratio
## University of Warsaw                   0.07          2.030303

Correlations

In general, a correlation coefficient of 1 indicates a perfect positive correlation, while a coefficient of -1 indicates a perfect negative correlation. Coefficients between 0.7 and 1.0 in absolute value are generally considered strong, those between 0.3 and 0.7 moderate, and those below 0.3 weak.
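As a small, hypothetical helper (not part of the original analysis), these bands can be encoded directly:

# classify correlation strength by absolute value, using the bands above
cor_strength <- function(r) {
         cut(abs(r), breaks = c(0, 0.3, 0.7, 1),
             labels = c("weak", "moderate", "strong"),
             include.lowest = TRUE)
}
cor_strength(c(0.9, 0.8, 0.5, 0.1))  # strong, strong, moderate, weak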

Looking at the correlation matrix, there are two notably strong correlations: teaching and research (r = 0.9), and international and international_students (r = 0.8).

These correlations indicate a strong relationship between the variables involved: as one increases, the other is likely to increase as well. Some correlation among variables in a data set is common, and it is not necessary to remove strongly correlated variables before performing PCA; doing so can oversimplify or bias the analysis. In fact, PCA is designed to capture the underlying relationships among variables, including correlated ones. However, highly correlated variables can exert a greater joint influence on the principal components, which is worth keeping in mind when interpreting the results. Fortunately, most of the correlations between my variables are not strong.

# correlation matrix
cor.matrix <- cor(df[, 2:10], method = "pearson")
mypal <- RColorBrewer::brewer.pal(9, "Blues")
corrplot::corrplot(cor.matrix, type = "lower", order = "alphabet", tl.col = "black", tl.cex = 1, col = mypal)

# Compute the correlation matrix
cor.matrix <- cor(df[, 2:10], method = "pearson")
# Round the correlation matrix to one decimal place
cor.matrix <- round(cor.matrix, 1)
# Print the correlation matrix
print(cor.matrix)
##                        teaching international research citations income
## teaching                    1.0           0.3      0.9       0.6    0.4
## international               0.3           1.0      0.4       0.5    0.1
## research                    0.9           0.4      1.0       0.6    0.5
## citations                   0.6           0.5      0.6       1.0    0.2
## income                      0.4           0.1      0.5       0.2    1.0
## num_students                0.0          -0.1      0.0      -0.1    0.0
## student_staff_ratio        -0.2           0.0      0.0      -0.1    0.1
## international_students      0.3           0.8      0.4       0.4    0.1
## female_male_ratio          -0.1           0.2     -0.1       0.1   -0.3
##                        num_students student_staff_ratio international_students
## teaching                        0.0                -0.2                    0.3
## international                  -0.1                 0.0                    0.8
## research                        0.0                 0.0                    0.4
## citations                      -0.1                -0.1                    0.4
## income                          0.0                 0.1                    0.1
## num_students                    1.0                 0.5                   -0.2
## student_staff_ratio             0.5                 1.0                   -0.1
## international_students         -0.2                -0.1                    1.0
## female_male_ratio               0.2                 0.2                    0.1
##                        female_male_ratio
## teaching                            -0.1
## international                        0.2
## research                            -0.1
## citations                            0.1
## income                              -0.3
## num_students                         0.2
## student_staff_ratio                  0.2
## international_students               0.1
## female_male_ratio                    1.0
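Rather than scanning the matrix by eye, the strong pairs can also be extracted programmatically; a short sketch using the cor.matrix object from above:

# list variable pairs with |r| >= 0.7 (upper triangle, so each pair appears once)
strong <- which(abs(cor.matrix) >= 0.7 & upper.tri(cor.matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor.matrix)[strong[, 1]],
           var2 = colnames(cor.matrix)[strong[, 2]],
           r    = cor.matrix[strong])
# expected here: teaching-research (0.9) and international-international_students (0.8)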

Outliers

First, I checked for outliers in my dataset by creating boxplots for each variable.

Next, I calculated z-scores for each variable to help identify potential outliers. I added the z-scores to the data frame and looked for values greater than 3 or less than -3. There were 49 potential outliers in total, which, out of roughly 700 observations, is not an especially high number.

To further investigate, I used a function to calculate the number of outliers in each column of the data frame. This allowed me to see which variables had the most outliers.

Finally, I used another function to find the outlying values themselves, using the interquartile range (IQR) to identify them. I stored the result and checked its dimensions and head to confirm that the function had worked properly (the result is a list, so dim() returns NULL).

Overall, identifying and investigating the outliers allowed me to better understand the distribution of the data and to make sure extreme values were handled appropriately. However, the outliers may simply reflect the large differences in scores and characteristics between very highly and very lowly ranked universities, so I decided not to delete them.

#checking for outliers

# create boxplots for each variable to identify outliers
par(mfrow=c(3,3))
for(i in 2:10) {
         boxplot(df[,i], main=names(df)[i], col=mypal[5])
}

# Calculate z-scores for each variable
df_z <- apply(df[, 2:10], 2, function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
# Add z-scores to the dataframe
df[, paste0(names(df)[2:10], "_z")] <- df_z
# Identify potential outliers (z-score > 3 or < -3)
outliers <- apply(df[, 11:19], 1, function(x) any(abs(x) > 3))
# Print number of outliers
cat("Number of potential outliers:", sum(outliers))
## Number of potential outliers: 49
# Calculate potential outliers
outliers <- boxplot(df)$out
# Calculate outliers for each column
num_outliers_all <- function(df) {
         num_vars <- sapply(df, is.numeric)
         outlier_counts <- sapply(df[num_vars], function(x) {
                  length(boxplot(x, plot = FALSE)$out)
         })
         names(outlier_counts) <- names(df)[num_vars]
         outlier_counts
}
num_outliers_all(df)
##               world_rank                 teaching            international 
##                        0                       35                        0 
##                 research                citations                   income 
##                       47                        0                       54 
##             num_students      student_staff_ratio   international_students 
##                       28                       48                       24 
##        female_male_ratio               teaching_z          international_z 
##                       17                       35                        0 
##               research_z              citations_z                 income_z 
##                       47                        0                       54 
##           num_students_z    student_staff_ratio_z international_students_z 
##                       28                       49                       24 
##      female_male_ratio_z 
##                       17
# Run the function to find outliers
# Find outliers in each variable of the data frame
find_outliers <- function(data) {
         outliers <- lapply(data, function(x) {
                  qnt <- quantile(x, probs=c(.25, .75), na.rm = TRUE)
                  H <- 1.5 * IQR(x, na.rm = TRUE)
                  x[x < (qnt[1] - H) | x > (qnt[2] + H)]
         })
         return(outliers)
}

# Generate a variable containing the outliers in the data frame
outliers <- find_outliers(df)

# Check the dimensions and head of the outliers variable
dim(outliers)
## NULL
head(outliers)
## $world_rank
## numeric(0)
## 
## $teaching
##  [1] 95.6 86.5 92.5 88.2 89.4 85.1 83.3 77.0 85.7 77.6 86.5 80.4 78.1 80.8 82.0
## [16] 77.9 76.0 76.8 67.4 68.6 69.8 71.7 64.5 70.5 74.7 67.1 62.0 64.5 68.8 66.1
## [31] 64.6 73.3 65.1 70.6 75.4
## 
## $international
## numeric(0)
## 
## $research
##  [1] 97.6 98.9 96.2 96.7 88.6 91.9 88.5 95.0 88.9 90.4 87.8 91.1 91.0 88.6 86.9
## [16] 86.1 78.0 85.2 88.8 77.2 78.4 84.5 75.8 81.1 77.4 72.3 67.5 70.0 75.5 73.2
## [31] 76.9 81.2 69.6 72.1 69.8 72.7 72.7 72.8 69.7 83.0 66.9 68.2 77.3 68.1 73.8
## [46] 72.2 66.7
## 
## $citations
## numeric(0)
## 
## $income
##  [1]  97.8  95.4 100.0 100.0 100.0 100.0  90.4 100.0 100.0  99.2  99.9  89.1
## [13] 100.0  98.5  92.4 100.0  95.4  88.0  99.9  98.1  92.6 100.0  88.3  92.7
## [25]  99.4  99.5 100.0  91.2  97.5  99.7  95.7  92.4  99.8  99.1  99.7  98.7
## [37]  98.9  96.2  91.4  96.9  92.2  92.9 100.0  88.9 100.0  87.8  99.8  99.5
## [49] 100.0 100.0  93.3 100.0  89.7  90.8
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df <- df %>% select(-contains("_z"))
str(df)
## 'data.frame':    701 obs. of  10 variables:
##  $ world_rank            : num  1 2 3 4 5 7 8 9 10 11 ...
##  $ teaching              : num  95.6 86.5 92.5 88.2 89.4 85.1 83.3 77 85.7 77.6 ...
##  $ international         : num  64 94.4 76.3 91.5 84 78.5 96 97.9 65 70 ...
##  $ research              : num  97.6 98.9 96.2 96.7 88.6 91.9 88.5 95 88.9 90.4 ...
##  $ citations             : num  99.8 98.8 99.9 97 99.7 99.3 96.7 91.1 99.2 98.2 ...
##  $ income                : num  97.8 73.1 63.3 55 95.4 52.1 53.7 80 36.6 100 ...
##  $ num_students          : num  2243 19919 15596 18812 11074 ...
##  $ student_staff_ratio   : num  6.9 11.6 7.8 11.8 9 8.4 11.7 14.7 6.9 3.6 ...
##  $ international_students: num  0.27 0.34 0.22 0.34 0.33 0.27 0.51 0.37 0.21 0.23 ...
##  $ female_male_ratio     : num  0.493 0.852 0.724 0.852 0.587 ...
##  - attr(*, "na.action")= 'omit' Named int [1:55] 6 18 40 41 47 49 57 59 83 86 ...
##   ..- attr(*, "names")= chr [1:55] "Harvard University" "University of Toronto" "Peking University" "University of Tokyo" ...

Exploratory data analysis

Tibble conversion

A tibble is a modern, more predictable version of the data frame, particularly useful for data analysis and manipulation. It offers several advantages, such as a more informative print method, stricter and more predictable subsetting behavior, and better compatibility with the tidyverse packages, which are designed for data science workflows.

Converting the data frame to a tibble before the PCA is not strictly necessary, but it streamlines later data manipulation, and some tidyverse functions work more smoothly with tibbles than with plain data frames. One caveat: the conversion drops row names, so the university names set earlier are no longer attached to the rows.

# convert the data frame to a tibble
df <- tibble::as_tibble(df)

Scatterplots

Each scatterplot examines the relationship between two variables, with the color (and, in the first plot, the size) of the points encoding additional variables. The scatterplots help visualize the relationships between variables and can reveal patterns or trends in the data.

In this code block, the ggplot2 and tidyverse libraries are used to create a series of scatterplots from the data frame df (the Times Higher Education World University Rankings data). Each plot visualizes the relationship between two variables; scatterplots 2-10 show the relationship between the dependent variable ‘world_rank’ and each independent variable. Keep in mind that a larger ‘world_rank’ value means a worse position in the ranking, which matters when reading the directions below.

  1. The first scatterplot shows a positive relationship between teaching and research: universities with higher teaching scores also tend to have higher research scores. The percentage of international students also appears to rise with the research score, although with some variance, and universities with higher income tend to have higher research and teaching scores.

  2. The scatterplot of world rank vs. teaching shows a negative relationship: as the rank number increases (i.e., the position worsens), teaching scores fall, so the best-ranked universities tend to have the highest teaching scores. There are, however, some outliers with high teaching scores despite weaker rank positions.

  3. The scatterplot of world rank vs. international shows a positive relationship: universities further down the ranking tend to have higher international outlook scores.

  4. The scatterplot of world rank vs. research shows a negative relationship: research scores fall as the rank position worsens.

  5. The scatterplot of world rank vs. citations also shows a negative relationship: citation scores fall as the rank position worsens.

  6. The scatterplot of world rank vs. income suggests a positive relationship, with income scores tending to be higher further down the ranking.

  7. The scatterplot of world rank vs. number of students shows a negative relationship: universities further down the ranking tend to have fewer students.

  8. The scatterplot of world rank vs. student-staff ratio shows a positive relationship: the ratio rises further down the ranking, so the best-ranked universities tend to have the lowest student-to-staff ratios.

  9. The scatterplot of world rank vs. international students shows that universities further down the ranking tend to have a higher percentage of international students, suggesting that the best-ranked universities may be less accessible to international students.

  10. The scatterplot of world rank vs. female-male ratio shows a negative relationship: the female-to-male ratio tends to fall further down the ranking. (The signs of all these relationships are verified numerically in the short check after this list.)
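The direction of each relationship described above can be checked numerically as well; a quick sketch correlating ‘world_rank’ with every other variable (a positive sign means the value rises as the rank position worsens):

# correlation of world_rank with each independent variable
round(sapply(df[, 2:10], function(x) cor(df$world_rank, x)), 2)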

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ stringr 1.5.0
## ✔ tidyr   1.3.0     ✔ forcats 1.0.0
## ✔ readr   2.1.3     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)

# create scatterplot matrix with customized plots
ggplot(df, aes(x = teaching, y = research, color = international_students, size = income)) +
         geom_point() + labs(x = "Teaching", y = "Research", color = "International Students", size = "Income")

#scatterplots

ggplot(df, aes(x = world_rank, y = teaching)) + 
         geom_point(aes(color = international_students), size = 3) +
         labs(title = "Scatter plot: World Rank vs. Teaching", x = "World Rank", y = "Teaching", color = "International Students") +
         scale_color_gradientn(colors = mypal)

ggplot(df, aes(x = world_rank, y = international)) + 
         geom_point(aes(color = female_male_ratio), size = 3) +
         labs(title = "Scatter plot: World Rank vs. International", x = "World Rank", y = "International", color = "Female-Male Ratio") +
         scale_color_gradientn(colors = mypal)

ggplot(df, aes(x = world_rank, y = research)) + 
         geom_point(aes(color = international_students), size = 3) +
         labs(title = "Scatter plot: World Rank vs. Research", x = "World Rank", y = "Research", color = "International Students") +
         scale_color_gradientn(colors = mypal)

ggplot(df, aes(x = world_rank, y = citations)) + 
         geom_point(aes(color = international_students), size = 3) +
         labs(title = "Scatter plot: World Rank vs. Citations", x = "World Rank", y = "Citations", color = "International Students") +
         scale_color_gradientn(colors = mypal)

ggplot(df, aes(x = world_rank, y = income)) + 
         geom_point(aes(color = international_students), size = 3) +
         labs(title = "Scatter plot: World Rank vs. Income", x = "World Rank", y = "Income", color = "International Students") +
         scale_color_gradientn(colors = mypal)

ggplot(df, aes(x = world_rank, y = num_students)) + 
         geom_point(aes(color = international_students), size = 3) +
         labs(title = "Scatter plot: World Rank vs. Number of Students", x = "World Rank", y = "Number of Students", color = "International Students") +
         scale_color_gradientn(colors = mypal)

ggplot(df, aes(x = world_rank, y = student_staff_ratio)) + 
         geom_point(aes(color = international_students), size = 3) +
         labs(title = "Scatter plot: World Rank vs. Student-Staff Ratio", x = "World Rank", y = "Student-Staff Ratio", color = "International Students") +
         scale_color_gradientn(colors = mypal)

ggplot(df, aes(x = world_rank, y = international_students)) + 
         geom_point(aes(color = female_male_ratio), size = 3) +
         labs(title = "Scatter plot: World Rank vs. International Students", x = "World Rank", y = "International Students", color = "Female-Male Ratio") +
         scale_color_gradientn(colors = mypal)

ggplot(df, aes(x = world_rank, y = female_male_ratio)) + 
         geom_point(aes(color = international_students), size = 3) +
         labs(title = "Scatter plot: World Rank vs. Female-Male Ratio", x = "World Rank", y = "Female-Male Ratio", color = "International Students") +
         scale_color_gradientn(colors = mypal)

Inspecting highly correlated variables

Summary of relationships shown in each of the scatterplots below:

  1. Teaching vs. Research: This plot shows the relationship between the scores universities received for teaching and research in the THE World University Rankings. The plot suggests that universities with higher scores for teaching tend to also have higher scores for research.

  2. Teaching vs. Citations: This plot shows the relationship between the scores universities received for teaching and the number of citations their research publications have received. The plot suggests that universities with higher scores for teaching tend to also have higher numbers of citations.

  3. Research vs. Citations: This plot shows the relationship between the scores universities received for research and the number of citations their research publications have received. The plot suggests that universities with higher scores for research tend to also have higher numbers of citations.

  4. International vs. International Students: This plot shows the relationship between a university’s international outlook score and the proportion of international students enrolled there; the two rise together, as expected given their strong correlation.

However, there is still some variability in the data, suggesting that there may be universities with high scores for one variable but lower scores for the other.

library(tidyverse)
library(ggplot2)

# Scatter plot: Teaching vs. Research
ggplot(df, aes(x = teaching, y = research)) +
         geom_point(aes(color = teaching), size = 3) +
         labs(title = "Teaching vs. Research",
              x = "Teaching",
              y = "Research",
              color = "Teaching") +
         scale_color_gradient(low = mypal[1], high = mypal[9]) +
         theme(plot.title = element_text(face = "bold", size = 20),
               axis.title = element_text(face = "bold", size = 16),
               legend.title = element_text(face = "bold", size = 14),
               legend.position = "bottom",
               legend.key.size = unit(1.5, "lines"))

# Scatter plot: Teaching vs. Citations
ggplot(df, aes(x = teaching, y = citations)) +
         geom_point(aes(color = teaching), size = 3) +
         labs(title = "Teaching vs. Citations",
              x = "Teaching",
              y = "Citations",
              color = "Teaching") +
         scale_color_gradient(low = mypal[1], high = mypal[9]) +
         theme(plot.title = element_text(face = "bold", size = 20),
               axis.title = element_text(face = "bold", size = 16),
               legend.title = element_text(face = "bold", size = 14),
               legend.position = "bottom",
               legend.key.size = unit(1.5, "lines"))

# Scatter plot: Research vs. Citations
ggplot(df, aes(x = research, y = citations)) +
         geom_point(aes(color = research), size = 3) +
         labs(title = "Research vs. Citations",
              x = "Research",
              y = "Citations",
              color = "Research") +
         scale_color_gradient(low = mypal[1], high = mypal[9]) +
         theme(plot.title = element_text(face = "bold", size = 20),
               axis.title = element_text(face = "bold", size = 16),
               legend.title = element_text(face = "bold", size = 14),
               legend.position = "bottom",
               legend.key.size = unit(1.5, "lines"))

# Scatter plot: International vs. International Students
ggplot(df, aes(x = international, y = international_students)) +
         geom_point(aes(color = international), size = 3) +
         labs(title = "International vs. International Students",
              x = "International",
              y = "International Students",
              color = "International") +
         scale_color_gradient(low = mypal[1], high = mypal[9]) +
         theme(plot.title = element_text(face = "bold", size = 20),
               axis.title = element_text(face = "bold", size = 16),
               legend.title = element_text(face = "bold", size = 14),
               legend.position = "bottom",
               legend.key.size = unit(1.5, "lines"))

Variability

The variance values provide insight into the spread of each variable. Because the variables are measured on very different scales, their raw variances differ by orders of magnitude: num_students has an enormous variance (around 5.5 × 10^8), the score variables (teaching, international, research, citations, income) have variances in the hundreds, and the ratio variables (international_students, female_male_ratio) have variances below 1. If PCA were performed on the raw data, the variables with the largest variances would dominate the components, which is why the variables are standardized before the analysis (see the scale. = TRUE argument in the next section).

# Set numeric output format
options(scipen = 999)
# Calculate variances and print them without scientific notation
variances <- apply(df[,2:10], 2, var)
print(variances)
##               teaching          international               research 
##           216.98466297           561.06879283           367.24669676 
##              citations                 income           num_students 
##           723.93890532           404.23451873     546356050.71519876 
##    student_staff_ratio international_students      female_male_ratio 
##           160.74175969             0.01182961             0.22746894
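As a quick check, after z-standardization every variable has unit variance, so no single variable dominates:

# after standardization each variable has variance 1
round(apply(scale(df[, 2:10]), 2, var), 10)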

Principal Component Analysis (PCA) analysis

The prcomp() function performs principal component analysis (PCA) on the data in the data frame. The part df[, 2:10] selects the numeric variables to be used in the analysis, and the argument scale. = TRUE centers and scales (standardizes) the variables before the PCA, which is generally recommended so that variables on different scales do not disproportionately influence the result.

#PCA
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
# Perform PCA on the data
pca <- prcomp(df[, 2:10], scale. = TRUE)
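As a quick sanity check (a sketch, not part of the original pipeline), standardizing manually with scale() yields the same loadings:

# equivalent: standardize first, then run PCA without further scaling
pca_manual <- prcomp(scale(df[, 2:10]))
all.equal(abs(pca_manual$rotation), abs(pca$rotation))  # expect TRUE (loadings match up to sign)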

PCA scree plot and biplot

Below is an interpretation of the plots, including the percentage of variance explained by each principal component.

Scree plot

The scree plot shows the proportion of variance explained by each principal component (PC). The first three PCs explain a large proportion of the variance in the data, while the remaining PCs explain relatively little. Specifically, the first PC explains about 36% of the variance, the second about 18%, and the third about 17%; together, the first three PCs explain about 71% of the variance. This suggests that we could retain the first three PCs in the analysis and drop the remaining ones.
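These proportions can be read directly from the PCA object; a quick check (compare with the summary(pca) output further below):

# cumulative proportion of variance explained per component
round(cumsum(pca$sdev^2 / sum(pca$sdev^2)), 3)  # first three: 0.359, 0.541, 0.713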

Biplot

The biplot displays the scores and loadings for the first two principal components. The plot shows the position of each observation in the space defined by the first two PCs, as well as the direction and strength of the relationship between each variable and the PCs. The plot indicates a positive association between the research, citations, and international variables, and a negative association with the student-to-staff ratio variable. This suggests that universities that score high on research, citations, and international also tend to have a lower student-to-staff ratio. The plot also suggests that income and the female-to-male ratio associate less strongly with the first two PCs.

Overall, these plots suggest that the first three principal components could be used to summarize the information in the data while retaining most of the variation, and that the variables of research, citations, and international could be used to differentiate between universities.

# Create the scree plot
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
scree_plot <- ggplot() +
         geom_line(aes(x = 1:length(prop_var), y = prop_var), color = mypal[6], linewidth = 1) +
         geom_point(aes(x = 1:length(prop_var), y = prop_var), color = mypal[6], size = 3) +
         scale_x_continuous(breaks = 1:length(prop_var), labels = paste0("PC", 1:length(prop_var))) +
         labs(title = "Scree plot", x = "Principal component", y = "Proportion of variance explained") +
         theme_bw()
# Create the biplot
biplot <- ggplot() +
         geom_point(data = as.data.frame(pca$x), aes(x = PC1, y = PC2), size = 3, alpha = 0.5) +
         geom_hline(yintercept = 0, color = "grey50") +
         geom_vline(xintercept = 0, color = "grey50") +
         geom_text(data = as.data.frame(pca$rotation), aes(x = PC1, y = PC2, label = rownames(pca$rotation)), size = 3) +
         labs(title = "Biplot") +
         theme_bw()

# Combine the two plots into a single plot
combined_plot <- grid.arrange(scree_plot, biplot, ncol = 2)

# Print the combined plot
print(combined_plot)
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]

Eigenvalues

Summary of PCA

  • The first row of the summary(pca) table shows the standard deviation of each principal component (PC). PC1 has the largest standard deviation (1.7979) followed by PC2 (1.2782) and so on.
  • The second row shows the proportion of variance explained by each PC. PC1 explains the largest proportion of variance (0.3592) followed by PC2 (0.1815) etc.
  • The third row shows the cumulative proportion of variance explained by each PC. This row tells us that the first PC explains 35.92% of the variance, the first two PCs together explain 54.07% of the variance, the first three PCs together explain 71.35% of the variance, and so on, until all nine PCs explain 100% of the variance.

Eigenvalues

  • The eig.val output provides similar information in a different format. Each row corresponds to a PC, and the columns provide the eigenvalue, the percentage of variance explained by that PC, and the cumulative percentage of variance explained up to that PC.

  • Overall, these results suggest that the first few PCs, for instance the first three, explain the majority of the variance in the data, and that it may be possible to reduce the dimensionality of the data by focusing on those components. This is also confirmed by the scree plot below.

summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.7979 1.2782 1.2469 0.9583 0.77097 0.71942 0.58577
## Proportion of Variance 0.3592 0.1815 0.1728 0.1021 0.06604 0.05751 0.03813
## Cumulative Proportion  0.3592 0.5407 0.7135 0.8155 0.88156 0.93906 0.97719
##                            PC8     PC9
## Standard deviation     0.36902 0.26291
## Proportion of Variance 0.01513 0.00768
## Cumulative Proportion  0.99232 1.00000
eig.val <- get_eigenvalue(pca)
eig.val
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.23248269       35.9164744                    35.91647
## Dim.2 1.63388730       18.1543033                    54.07078
## Dim.3 1.55482009       17.2757788                    71.34656
## Dim.4 0.91843206       10.2048006                    81.55136
## Dim.5 0.59438923        6.6043248                    88.15568
## Dim.6 0.51756014        5.7506683                    93.90635
## Dim.7 0.34312959        3.8125510                    97.71890
## Dim.8 0.13617878        1.5130975                    99.23200
## Dim.9 0.06912012        0.7680013                   100.00000

Scree plot PCA

The plot shows the scree plot of the eigenvalues obtained from the PCA. The eigenvalues represent the amount of variance explained by each principal component. The plot has the number of principal components on the x-axis and the corresponding eigenvalues on the y-axis.

The plot suggests that the first three principal components should be retained because their eigenvalues are higher than 1, indicating that they explain a substantial amount of variance in the data. The plot also shows a steep drop in eigenvalues after the third component, which suggests that the remaining components may not be as important in explaining the variability in the data.
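The eigenvalue criterion mentioned above (often called the Kaiser rule) can be applied in one line; a small sketch using the eig.val object computed earlier:

# number of components with eigenvalue greater than 1 (Kaiser rule)
sum(eig.val[, "eigenvalue"] > 1)  # 3 for this data set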

library(psych)
library(factoextra)
library(labdsv)
## Loading required package: mgcv
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-41. For overview type 'help("mgcv-package")'.
## This is labdsv 2.0-1
## convert existing ordinations with as.dsvord()
## 
## Attaching package: 'labdsv'
## The following object is masked from 'package:psych':
## 
##     pca
## The following object is masked from 'package:stats':
## 
##     density
library(psy)
## 
## Attaching package: 'psy'
## The following object is masked from 'package:psych':
## 
##     wkappa
scree.plot(eig.val[,1], title = "Scree Plot", type = "E",  simu = "P")

Barplot screeplot of PCA

The plot generated by the fviz_eig function shows the scree plot for the PCA. The x-axis represents the principal components (PCs), while the y-axis shows the percentage of variance explained by each PC. The height of each bar shows the amount of variance explained by that PC, and the connected line traces these percentages across successive components.

From the plot, we can see that the first three PCs explain a substantial share of the total variance: the first PC explains 35.92%, the second 18.15%, and the third 17.28%. A total of three components is required to account for 71.35% of the variance.

fviz_eig(pca, 
         geom = "bar", 
         barfill = "lightblue", 
         barcolor = "#990066", 
         linecolor = "blue",
         ylim = c(0, 50),
         ggtheme = theme_classic() + theme(axis.text = element_text(size = 12), 
                                           axis.title = element_text(size = 14, face = "bold"), 
                                           plot.title = element_text(size = 16, face = "bold"), 
                                           legend.position = "none"),
         main = "Scree Plot of PCA",
         xlab = "Principal Component",
         ylab = "Percent Variance Explained"
)

Correlation circle

The plot below shows the contribution of each variable to the principal components. The darker the color of a variable’s arrow, the greater its contribution. From the plot, we can see that the variables with the darkest colors, namely ‘research’, ‘teaching’, ‘citations’ and ‘international’, have the most influence on the principal components. Conversely, the variable ‘income’ contributes the least, as it appears in the lightest color on the plot.

fviz_pca_var(pca, col.var = "contrib") +
         scale_color_gradient2(low = "lightblue", mid = "blue", high = "black", midpoint = 11)

Loading plots

The first plot shows the loadings of each variable on the first two principal components (PC1 and PC2). The color of each variable represents its contribution to the corresponding PCs; the darker the color, the higher the contribution. The plot suggests that the variables ‘research’, ‘teaching’, ‘citations’ and ‘international’ contribute most to PC1 and PC2, while ‘income’ has the least influence.

The second plot shows the loadings of each variable on the second and third principal components (PC2 and PC3), with a gradient color scale for the contributions. The plot suggests that ‘num_students’, ‘student_staff_ratio’ and ‘female_male_ratio’ contribute most to PC2 and PC3, while ‘teaching’ and ‘research’ have the least influence. The repel = TRUE option avoids overlapping variable labels, making the plot easier to read.

fviz_pca_var(pca, col.var="contrib", repel = TRUE, axes = c(1, 2)) +
         labs(title="Variables loadings for PC1 and PC2", x="PC1", y="PC2")+
         scale_color_gradient2(low="lightblue", mid="blue", 
                               high="black", midpoint=11)

fviz_pca_var(pca, col.var = "contrib", repel = TRUE, axes = c(2, 3)) +
         labs(title = "Variables Loadings for PC2 and PC3", x="PC2", y="PC3") +
         scale_color_gradientn(colors = mypal, na.value = "grey50", 
                               limits = c(0, 50), breaks = c(seq(0, 20, by = 5), 25, 50))

Aggregate variance plot

The first plot below shows the cumulative percentage of variance explained as principal components are added. The x-axis gives the number of principal components, and the y-axis the cumulative proportion of variance explained. The first principal component accounts for the largest share of the variance, followed by the second and third; as we move further right, each additional component adds less and less. The plot is useful for deciding how many components to retain, as it helps identify the point of diminishing returns, where additional components explain only a small amount of the total variance.

The second plot below is a scree plot, which also shows the proportion of variance explained by each principal component (PC) in descending order. The x-axis represents the PCs, and the y-axis represents the percentage of variance explained.

The plot shows that the first PC explains the highest proportion of variance, with the second and third PCs each adding a substantial share; after the third PC, the remaining components explain relatively small proportions of variance.

# cumulative proportion of variance (third row of the importance table)
plot(summary(pca)$importance[3, ])

# Scree plot with colors
fviz_eig(pca, barfill = mypal[2], barcolor = mypal[9], linecolor = mypal[4])

Individuals PCA plot

The plot shows the distribution of individual scores on the principal component axes, where each point represents an individual observation (a university). The colors represent the quality of representation of each observation in the principal component space, as measured by its squared cosine (cos2). The darker the color, the better the observation is represented by the first two components; the gradient runs from the lightest color for the lowest cos2 values to the darkest for the highest.

The plot also shows that there are some individuals that are clearly separated from the rest, indicating that they have distinct characteristics that are well-represented by the principal components.
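The underlying cos2 values can also be inspected directly; a short sketch using factoextra’s accessor for individuals:

# representation quality (cos2) of the first few observations on PC1-PC3
ind <- get_pca_ind(pca)
head(round(ind$cos2[, 1:3], 2))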

# Individual plot with a color scale for the squared cosine (cos2)
fviz_pca_ind(pca, col.ind = "cos2", geom = "point", 
             gradient.cols = c(mypal[1], mypal[5], mypal[9]))

Contributions of variables

These bar plots show the contributions of each variable to each principal component (PC). Each bar represents a variable, and its height gives the variable’s percentage contribution to the respective PC, with the variables sorted from largest to smallest contribution. The dashed reference line marks the average contribution expected if all variables contributed equally, so variables whose bars rise above it are the important ones for that component.

  • The plot for PC1, the most important component, shows that five variables (‘research’, ‘teaching’, ‘citations’, ‘international’ and ‘international_students’) have a high contribution to PC1.
  • PC2 is driven mainly by three variables: ‘student_staff_ratio’, ‘num_students’ and ‘female_male_ratio’.
  • PC3 is driven mainly by four variables: ‘income’, ‘international’, ‘international_students’ and ‘num_students’. (The sketch below shows how these contribution percentages follow from the loadings.)
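Because prcomp loadings for each component have unit length for standardized data, the contribution percentages plotted below can be recovered directly from the squared loadings; a small sketch:

# variable contributions (%) to each PC, equal to 100 * squared loadings here
contrib <- 100 * pca$rotation^2
round(contrib[, 1:3], 1)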

library(factoextra)

PC1 <- fviz_contrib(pca, choice = "var", axes = 1, fill = mypal[1], color = mypal[5], title = "Contributions for PC1")
PC2 <- fviz_contrib(pca, choice = "var", axes = 2, fill = mypal[5], color = mypal[7], title = "Contributions for PC2")
PC3 <- fviz_contrib(pca, choice = "var", axes = 3, fill = mypal[9], color = mypal[9], title = "Contributions for PC3")

grid.arrange(PC1, PC2, PC3, ncol = 3)

Hierarchical clustering

The first plot shows a dendrogram of hierarchical clustering applied to the variables (note the transpose t(df) in the code: distances are computed between columns, not universities). The height of the branches represents the distance between clusters or individual variables, and the dendrogram can be used to choose the number of clusters; here it suggests three distinct clusters.

The second plot shows the same objects colored by their assigned cluster membership. The clusters were obtained by cutting the dendrogram at k = 3 with cutree, and the fviz_cluster function draws the result, with the data and the cluster assignments (the sub_grp variable) provided as inputs. The different colors represent the different clusters.

Both the dendrogram and the second plot show the same division of the variables into clusters for k = 3.

# standardize all variables (including world_rank) before clustering
df <- as.data.frame(lapply(df, scale))

# distances between variables: t(df) transposes so that columns are clustered
distance.m <- dist(t(df))
hc <- hclust(distance.m, method = "complete") 
plot(hc, hang = -1)
rect.hclust(hc, k = 3, border = 'blue')

# cut the dendrogram into three clusters and visualize the grouping
sub_grp <- cutree(hc, k = 3) 
fviz_cluster(list(data = distance.m, cluster = sub_grp),
             palette = c("aquamarine4", "cyan3", "darkblue"))
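To see exactly which variables ended up in each cluster, the membership vector can be listed directly; a one-line sketch using the sub_grp object from above:

# variables assigned to each of the three clusters
split(names(sub_grp), sub_grp)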

Conclusions

  1. The purpose of this article was to analyze whether the university ranking position can be described by a smaller number of variables using a dimension reduction method, PCA. The analysis reduced the nine numeric input variables to three principal components.

  2. The PCA identified the variables with the most influence on the world rank of universities in 2016, namely: teaching, research, citations, and international.

  3. The PCA showed that three dimensions summarize the data well: PC1 is driven by research, teaching, citations, international, and international_students; PC2 by num_students, student_staff_ratio, and female_male_ratio; and PC3 by income, international, international_students, and num_students.

  4. Hierarchical clustering was applied to the same data set to explore whether the two methods give the same results. The final results of PCA and hierarchical clustering do not differ: the clustering grouped the variables into three main clusters that mirror the three principal components, indicating the robustness of the findings.

  5. Therefore, this study provides insights into the factors that influence the world rank of universities and suggests that a smaller number of dimensions can describe this complex phenomenon well.

  6. Overall, the findings suggest that universities should prioritize investments in teaching, research, citations, and internationalization to improve their world rank.

References

Cord, M., & Cunningham, P. (2014). Machine Learning Techniques for Multimedia: Case Studies on Organization and Retrieval. Springer.

RPubs - Dimension Reduction. (n.d.). https://rpubs.com/paulinazabska/DimensionReduction