Oh, and here is the code for the first few variables I created as well as the packages and data set used.
un <- read.csv("THE World University Rankings 2016-2024.csv",sep=",",header=TRUE)
library(dplyr)
library(plotly)
library(ggplot2)
library(knitr)
basicD1 <- length(unique(un$Year)) #totally unnecessary, I know
basicD2 <- length(unique(un$Name))
basicD3 <- length(unique(un$Country))
This whole document revolves around the data set THE World University Rankings 2016-2024, which originates from kaggle.com. I’ve attempted to use it in past assignments, but never to my liking, so I decided to do a proper deep dive into it.
My goal here, in the broadest sense, is to clear up every detail about this data set. That begins with me laying out the most basic details about it, then I’ll explain what I’m going to do with them.
THE World University Rankings 2016-2024 has 14 variables: Rank, Name, Country, Student.Population, Student.to.Staff.Ratio, International.Students, Female.to.Male.Ratio, Overall.Score, Teaching, Research.Environment, Research.Quality, Industry.Impact, International.Outlook, and Year.
According to the kaggle page that I got this data set from, it was derived from timeshighereducation.com’s list(s), which is where the data and premise of the data set truly originate.
Each of these variables contains 12430 objects, not all of which are unique. What’s included is:
The full methodologies for each year can be found here:
The problems, or rather the questions, I want to answer are:
Answering those questions starts with a mostly simple process: counting the number of unique Countries and Universities in the data set per year.
This is how I did it:
#List of Years
LOY <- unique(un$Year)
#Amount of Unique Countries Per Year
AOUCPY <- c()
#Amount of Universities Per Year
AOUPY <- c()
#And here are two 'for' loops for counting
#For AOUCPY
for (c in 1:9) {
Amountyear <- filter(select(un, Year, Country), Year == LOY[c])
Amountyear <- nrow(distinct(Amountyear, Country))
AOUCPY <- append(AOUCPY, Amountyear)
}
#For AOUPY
for (u in 1:9) {
Amountyear <- nrow(filter(select(un, Year, Country), Year == LOY[u]))
AOUPY <- append(AOUPY, Amountyear)
}
#This is the total per year, not the changes. I'm going to be looking at the net change per year from left to right. I could do this in my head, but I'm going to do it in a more convoluted fashion here.
#Net Change in Universities Per Year
NCIUPY <- c(AOUPY[2]-AOUPY[1], AOUPY[3]-AOUPY[2], AOUPY[4]-AOUPY[3], AOUPY[5]-AOUPY[4], AOUPY[6]-AOUPY[5], AOUPY[7]-AOUPY[6], AOUPY[8]-AOUPY[7], AOUPY[9]-AOUPY[8])
#Net Change in Countries Per Year
NCICPY <- c(AOUCPY[2]-AOUCPY[1], AOUCPY[3]-AOUCPY[2], AOUCPY[4]-AOUCPY[3], AOUCPY[5]-AOUCPY[4], AOUCPY[6]-AOUCPY[5], AOUCPY[7]-AOUCPY[6], AOUCPY[8]-AOUCPY[7], AOUCPY[9]-AOUCPY[8])
#I probably should have just written a loop to do this for me, but I've already done that, so I'm doing it manually for a change of pace.
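For comparison, the same per-year counts and net changes can be sketched without loops. This is only an illustration under assumptions: `toy` is a made-up stand-in for `un` (not the real Kaggle file), and the column names `universities` and `countries` are hypothetical. It uses `group_by()`, `summarise()`, `n()` and `n_distinct()` from dplyr, plus base R's `diff()`.

```r
library(dplyr)

# Hypothetical toy data standing in for `un` (NOT the real data set)
toy <- tibble(
  Year    = c(2016, 2016, 2016, 2017, 2017, 2017, 2017,
              2018, 2018, 2018, 2018, 2018, 2018),
  Name    = paste("University", c(1:3, 1:4, 1:6)),
  Country = c("A", "A", "B", "A", "A", "B", "C",
              "A", "A", "B", "C", "C", "D")
)

# One row per year: number of rows (universities) and distinct countries
per_year <- toy %>%
  group_by(Year) %>%
  summarise(universities = n(), countries = n_distinct(Country)) %>%
  arrange(Year)

# diff() returns each value minus the previous one, i.e. the net change per year
net_universities <- diff(per_year$universities)
net_countries    <- diff(per_year$countries)

per_year
net_universities
net_countries
```

Sorting by `Year` before calling `diff()` matters, since the net change only makes sense when the years are in order.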
This tells me exactly what the first 35 objects I wanted are and arranges them into 4 nice vectors. But what about those linear relationships I was mentioning earlier, and what will I do with them?
Well here’s a bit of the code for that, as well as something to represent the average change between the number of Universities and Countries per year.
#Average Change for Countries
ACFC <- sum(NCICPY) / 8
#Average Change for Universities
ACFU <- sum(NCIUPY)/8
#Now for the Regression shenanigans that I have a lesser understanding of. Using just the lm function with the
#But before that I'll create two tibbles.
UnChange <- tibble(LOY, AOUPY)
CChange <- tibble(LOY, AOUCPY)
#Yes, I realize that the acronyms are getting confusing. Even I, the guy who wrote them, have to scroll up occasionally to reassure myself that I'm not doing this all horribly wrong.
UnLinear <- lm(AOUPY ~ LOY, data = UnChange)
CLinear <- lm(AOUCPY ~ LOY, data = CChange)
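As a small aside on reading these objects, here is a sketch on made-up numbers (the vectors `years` and `totals` are hypothetical, not taken from the data set). It shows that the average change can be written as `mean(diff(...))`, and that the two numbers an `lm()` fit produces can be pulled out with `coef()`.

```r
# Hypothetical toy totals for five years (NOT the real counts)
years  <- 2016:2020
totals <- c(10, 11, 15, 18, 20)

# Average net change: the mean of the year-to-year differences
avg_change <- mean(diff(totals))

# The differences telescope, so this equals (last - first) / (number of gaps)
endpoint_change <- (totals[length(totals)] - totals[1]) / (length(totals) - 1)

# Fit the straight line and read off intercept (b_0) and slope (b_1)
fit <- lm(totals ~ years)
b   <- coef(fit)
b_0 <- unname(b["(Intercept)"])
b_1 <- unname(b["years"])

avg_change
endpoint_change
b_0
b_1
```

`coef()` returns a named vector, so the intercept sits under `"(Intercept)"` and the slope under the name of the predictor.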
As far as the math in normal math terms goes:
The Average Yearly Net Change In Universities is:
\[ A = (181 + 122 + 155 + 139 + 129 + 136 + 137 + 105) / 8 = 138 \]
The Formula For the Linear Relationship between the Total Number of Universities and the Years is:
\[ Total = b_0 + b_1 \cdot Year + \varepsilon ; \quad b_0 = -276570.9 , \; b_1 = 137.6 \]
The Average Change In Unique Countries is:
\[ A = (10 + 1 + 5 + 6 + 1 + 6 + 5 + 4) / 8 = 4.75 \]
The Formula For the Linear Relationship between the Total Number of Unique Countries and the Years is:
\[ Total = b_0 + b_1 \cdot Year + \varepsilon ; \quad b_0 = -8898.67 , \; b_1 = 4.45 \]
Note that each slope \( b_1 \) is close to, but not the same as, the matching average change: the least-squares fit uses all nine yearly totals, while the average only depends on the first and last ones.
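The slope that `lm()` reports depends only on the year-to-year changes, not on the starting total, so it can be checked from the changes listed above. The sketch below makes two assumptions: the years run 2016 to 2024 in order, and the first year's total is set to 0 as an arbitrary baseline (this shifts the intercept but not the slope). The variable names are hypothetical.

```r
# Yearly net changes exactly as listed in the text above
uni_changes     <- c(181, 122, 155, 139, 129, 136, 137, 105)
country_changes <- c(10, 1, 5, 6, 1, 6, 5, 4)

# Assumption: nine consecutive years, 2016 through 2024
years <- 2016:2024

# Totals relative to the first year (baseline 0 is an assumption)
uni_rel     <- cumsum(c(0, uni_changes))
country_rel <- cumsum(c(0, country_changes))

# Least-squares slopes of the relative totals against the years
uni_slope     <- unname(coef(lm(uni_rel ~ years))[2])
country_slope <- unname(coef(lm(country_rel ~ years))[2])

mean(uni_changes)      # 138
uni_slope              # 137.6
mean(country_changes)  # 4.75
country_slope          # 4.45
```

The average change only looks at the first and last totals, while the regression slope weighs every year, which is why the two can differ slightly.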
Four graphs
Graphing time! I’ll show the code for all four since that seems to be a big part of this project.
(This is the point where I ran into a few problems, solely to do with deciding whether this report should take the form of a PDF or an HTML report. I simply hoped that the latter was acceptable, because otherwise the report would have really boring graphs, or I could have included screenshots of the code and the graphs, or I could have rearranged all of this as an Ioslides Presentation.)
The first two are for the Universities:
UnG1 <- plot_ly(x = UnChange$LOY, y = UnChange$AOUPY, type = "scatter", name = "Year vs Universities", mode = "markers") %>%
  add_lines(x = UnChange$LOY, y = fitted(UnLinear), name = "Linear Relationship")
UnG1 <- UnG1 %>% layout(title = 'Total Universities Per Year',
xaxis = list(title = 'Years'),
yaxis = list (title = 'Universities'))
UnG1
UnG2 <- plot_ly(x = LOY[-1],y = NCIUPY, type = "scatter", mode = "markers", name = "Change in Universities vs Previous Year") %>% add_lines(y = ACFU, name = "Average Change Per Year")
UnG2 <- UnG2 %>% layout(title = 'Change in Universities Per Year',
xaxis = list(title = 'Years'),
yaxis = list (title = 'Universities'))
UnG2
The second two are for the Countries:
CG1 <- plot_ly(x = CChange$LOY, y = CChange$AOUCPY, type = "scatter", name = "Year vs Countries", mode = "markers") %>%
  add_lines(x = CChange$LOY, y = fitted(CLinear), name = "Linear Relationship")
CG1 <- CG1 %>% layout(title = 'Total Unique Countries Per Year',
xaxis = list(title = 'Years'),
yaxis = list (title = 'Countries'))
CG1
CG2 <- plot_ly(x = LOY[-1],y = NCICPY, type = "scatter", mode = "markers", name = "Change in Unique Countries vs Previous Year") %>% add_lines(y = ACFC, name = "Average Change Per Year")
CG2 <- CG2 %>% layout(title = 'Change in Unique Countries Per Year',
xaxis = list(title = 'Years'),
yaxis = list (title = 'Countries'))
CG2
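Since the "totals plus fitted line" graphs share the same shape, they could be produced by one small function. This is only a sketch: `trend_plot()` and its arguments are hypothetical names that do not appear in the report, and the data used to try it out is made up. It relies on the same plotly calls used above (`plot_ly()`, `add_lines()`, `layout()`).

```r
library(plotly)

# Hypothetical helper (not part of the original report):
# markers for the yearly totals plus a line for the fitted values
trend_plot <- function(years, totals, model, title, y_label) {
  plot_ly(x = years, y = totals, type = "scatter", mode = "markers",
          name = paste("Year vs", y_label)) %>%
    add_lines(x = years, y = as.numeric(fitted(model)),
              name = "Linear Relationship") %>%
    layout(title = title,
           xaxis = list(title = "Years"),
           yaxis = list(title = y_label))
}

# Made-up data just to exercise the helper
toy_years  <- 2016:2018
toy_totals <- c(10, 14, 15)
toy_fit    <- lm(toy_totals ~ toy_years)

toy_plot <- trend_plot(toy_years, toy_totals, toy_fit,
                       "Toy Totals Per Year", "Universities")
toy_plot
```

With something like this, the Universities and Countries versions would differ only in the vectors, the model, and the labels passed in.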
And I guess that’s really all?! Wait, I still have to do a recording…
This is between me and whoever actually ever reads this.
I don’t think I do very well with these sorts of open-ended assignments. I can never tell if I’m doing far too much or far too little.
Cookies + Milk + Programming = Neato^4!