Python Assignment

Introduction

This report studies a global university ranking dataset to understand, from several angles, how universities perform across a number of focus areas. The dataset covers teaching, research, citations, industry income, and international outlook at the institution level. Together these scores capture a university’s overall academic performance, provide a basis for comparison across countries, and highlight each university’s strengths and weaknesses.

In this report, I use descriptive statistics and data visualization tools to identify structures and relationships in the dataset. I focus mainly on country-to-country performance comparisons, the interrelation of the scoring categories, and the categories most closely related to the top-ranked universities. The visualizations are used primarily to bring out patterns in the dataset, such as the relationship between research and citations and the disparity in international outlook and performance across countries. The overall aim is to describe the performance of the global higher education system and the main determinants of a university’s ranking.

Dataset

Descriptive Statistics

The dataset contains both numeric and categorical variables and several thousand observations from higher-education institutions worldwide. The key numeric variables are the teaching, research, citations, industry income, and international outlook scores. On average, citations scores are higher than the scores in the other categories, while teaching and research scores are rather modest. Scores generally range from about 10–20 at the low end to values close to 100, which indicates a wide range of performance among universities internationally.
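A summary of this kind can be produced with pandas' `describe`. The sketch below is a minimal illustration on a small made-up frame (`demo` is hypothetical): the column names follow the dataset, but the values are invented for the example and are not taken from it.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from the dataset.
demo = pd.DataFrame({
    "Teaching Score": [18.5, 24.0, 31.2, 92.3],
    "Research Score": [12.1, 20.4, 35.0, 99.7],
    "Citations Score": [40.3, 55.8, 71.9, 99.0],
})

# describe() reports count, mean, std, min, quartiles, and max per column;
# keep only the three rows discussed in the text.
summary = demo.describe().loc[["mean", "min", "max"]]
print(summary.round(2))
```

Comparing the `mean` row across columns is one way to check which category scores highest on average, and the `min` and `max` rows give the range.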

The data consists of university-level records: each row holds one institution’s scores in each category. In addition to the numeric scores, the data includes the university name and its country, which makes regional comparisons possible. Together, these variables support both geographic and performance trend analysis.

There are no major missing values in the dataset, which keeps the analysis and the visualizations consistent. The range of performance metrics in the dataset offers rich ground for exploring patterns, relationships, and discrepancies in higher-education outcomes globally.
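A missing-value check can be sketched as follows. This is a minimal example on a made-up frame (`demo` is hypothetical, with one value deliberately set to NaN), not the check that was run on the actual file.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one deliberately missing value.
demo = pd.DataFrame({
    "Teaching Score": [18.5, 24.0, np.nan, 92.3],
    "Citations Score": [40.3, 55.8, 71.9, 99.0],
})

missing_per_column = demo.isna().sum()        # number of NaN values per column
missing_share = demo.isna().mean().round(2)   # fraction of rows missing per column

print(missing_per_column)
print(missing_share)
```

A column whose share is close to zero can usually be handled with `dropna` on just the columns a plot needs, as the scatterplot code does.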

Findings

The analysis shows how global university rankings vary by country and by category score. The US, UK, and Japan have the highest university rankings, which coincides with their citations and research scores. There also appears to be a positive correlation among the teaching, research, and citations scores: universities with higher teaching and research scores tend to have higher citations scores. By contrast, some countries with few ranked universities and low teaching, research, and citations scores still have universities that score highly in industry income and international outlook. Citations scores show the most variability, while the other criteria are more evenly distributed. This also suggests that countries rank differently on research and citations.
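The two claims above (correlation among scores, and variability per category) correspond to a correlation matrix and a per-column standard deviation. The sketch below is a minimal illustration on invented values (`demo` is hypothetical), so its numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical scores for illustration only.
demo = pd.DataFrame({
    "Teaching Score": [15.0, 22.0, 30.0, 48.0, 90.0],
    "Research Score": [10.0, 18.0, 33.0, 45.0, 95.0],
    "Citations Score": [35.0, 80.0, 50.0, 70.0, 99.0],
})

corr = demo.corr()    # pairwise Pearson correlation (the pandas default)
spread = demo.std()   # sample standard deviation per column

print(corr.round(2))
print(spread.round(2))
```

Values of `corr` near 1 indicate that two scores rise together, and a larger entry in `spread` indicates a category whose scores vary more.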

Scatterplot

The scatter plots show a positive correlation between teaching and research scores: universities that are rated higher in teaching are also rated higher in research. In the first plot, most teaching and research scores fall between 10 and 40, which is why points are sparse at the higher scores.

The color gradient shows that citations scores rise as teaching and research scores increase. This suggests that stronger research performance goes together with higher citation impact. There are, however, universities with a high citations score and low teaching and research scores.

In the second plot, scores are grouped into 10-point bins. The mid range, specifically 20–40, holds a notable concentration of universities, as shown by the larger markers. This shows that the mid range is the most populated and that only a few universities reach the top spots on all of the metrics.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore")
path = "/Users/kylie/Desktop/IS 460/python/"
filename = path + "World University Rankings 2023 Dataset.csv"
df = pd.read_csv(filename)

x = df[['Teaching Score', 'Research Score', 'Citations Score']].copy()
x = x.dropna()

x['Teaching'] = pd.cut(x['Teaching Score'], bins=[10,20,30,40,50,60,70,80,90,100],
                             labels=[15,25,35,45,55,65,75,85,95], include_lowest=True)

x['Research'] = pd.cut(x['Research Score'], bins=[0,10,20,30,40,50,60,70,80,90,100],
                             labels=[5,15,25,35,45,55,65,75,85,95], include_lowest=True)
                        
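# observed=True keeps only the teaching/research bin pairs that contain at least one university
x2 = x.groupby(['Teaching', 'Research'], observed=True).agg(
    count=('Citations Score', 'size'),
    avg_citations=('Citations Score', 'mean')).reset_index()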
x2['Teaching'] = x2['Teaching'].astype(int)
x2['Research'] = x2['Research'].astype(int)
x2['count_scaled'] = x2['count'] * 8
                             
plt.figure(figsize=(14,8))

plt.scatter(
    x['Teaching Score'],
    x['Research Score'],
    c=x['Citations Score'],
    s=x['Citations Score'] * 1.0,
    cmap='viridis',
    alpha=0.8,
    edgecolors='black',
    linewidths=0.2)

plt.title('Citations by Teaching and Research Scores', fontsize=18)
plt.xlabel('Teaching Score', fontsize=14)
plt.ylabel('Research Score', fontsize=14)

cbar = plt.colorbar()
cbar.set_label('Citations Score', rotation=270, fontsize=14, color='black', labelpad=30)

plt.grid(alpha=0.2)
plt.show()

plt.figure(figsize=(18,10))

plt.scatter(x2['Teaching'], x2['Research'], marker='8', cmap='viridis', c=x2['avg_citations'], s=x2['count_scaled'], edgecolors='black')

plt.title('Average Citations by Teaching and Research Scores', fontsize=18)
plt.xlabel('Teaching Score', fontsize=14)
plt.ylabel('Research Score', fontsize=14)

cbar = plt.colorbar()
cbar.set_label('Average Citations Score', rotation=270, fontsize=14, color='black', labelpad=30)

my_colorbar_ticks = [*range(0, int(x2['avg_citations'].max()) + 1, 20)]
cbar.set_ticks(my_colorbar_ticks)

my_colorbar_tick_labels = [f'{each:,}' for each in my_colorbar_ticks]
cbar.set_ticklabels(my_colorbar_tick_labels)

plt.xticks([15,25,35,45,55,65,75,85,95],
           ['10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100'],
           fontsize=14, color='black')

plt.yticks([5,15,25,35,45,55,65,75,85,95],
           ['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100'],
           fontsize=14, color='black')

plt.show()
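The binning step behind the second plot can be checked in isolation. The sketch below is a minimal illustration on made-up values (`scores` is hypothetical) using the same bin edges and midpoint labels as the teaching bins above.

```python
import pandas as pd

# Hypothetical scores, chosen to sit on and around the bin edges.
scores = pd.Series([5.0, 10.0, 19.9, 20.0, 20.1, 35.0, 100.0])

# Right-closed bins: (10, 20], (20, 30], ..., (90, 100].
# include_lowest=True also admits the left edge, 10.0, into the first bin.
bins = list(range(10, 101, 10))
midpoints = [b + 5 for b in bins[:-1]]   # 15, 25, ..., 95

binned = pd.cut(scores, bins=bins, labels=midpoints, include_lowest=True)
print(binned.tolist())
```

Two details are worth noting: a score of exactly 20.0 falls in the 10-20 bin because the bins are closed on the right, and a score below the first edge (5.0 here) is assigned NaN, so it drops out of any later `groupby`.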

Vertical Bar Chart

The vertical bar charts show the number of universities per country, first for all countries and then for the ten countries with the most universities. In the first chart the distribution is skewed: only a few countries have a large number of universities, and many countries fall below the overall mean of 20.17. This shows that higher-education institutions are not spread evenly across countries.

In the second chart, which covers the top 10 countries, the USA has the most universities, followed by Japan and the UK; all three are above the top-10 mean of 96.9. India, China, and the rest are below that mean even though they have large numbers of universities, which shows the gap between them and the top three.

Overall, the charts show a large disparity in the number of universities globally: a few countries have many universities, while most have very few.

location_cols = [col for col in df.columns if col.startswith('Location_')]

x = pd.DataFrame({'Country': [col.replace('Location_', '') for col in location_cols],
                  'Count': [df[col].sum() for col in location_cols]})

x = x[x['Count'] > 0].copy()
x = x.sort_values('Count', ascending=False).reset_index(drop=True)
top10 = x[x['Country'] != 'Unknown Location'].head(10)
top10

def pick_colors_according_to_mean_count(this_data):
    colors=[]
    avg = this_data['Count'].mean()
    for each in this_data['Count']:
        if each > avg*1.01:
            colors.append('lightcoral')
        elif each < avg*0.99:
            colors.append('green')
        else:
            colors.append('black')
    return colors


import matplotlib.patches as mpatches

d1 = x.copy()
my_colors1 = pick_colors_according_to_mean_count(d1)

d2 = x[x['Country'] != 'Unknown Location'].head(10)
my_colors2 = pick_colors_according_to_mean_count(d2)

Above = mpatches.Patch(color='lightcoral', label='Above Average')
At = mpatches.Patch(color='black', label='Within 1% of the Average')
Below = mpatches.Patch(color='green', label='Below Average')

fig = plt.figure(figsize=(18,10))
fig.suptitle('Frequency of Universities by Country:\nAll Countries and Top 10',
             fontsize=18, fontweight='bold')

ax1 = fig.add_subplot(2,1,1)
ax1.bar(d1['Country'], d1['Count'], color=my_colors1)
ax1.legend(handles=[Above, At, Below], fontsize=14)
ax1.axhline(d1['Count'].mean(), color='black', linestyle='dashed')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.axes.xaxis.set_visible(False)
ax1.set_title('All Countries', size=20)
ax1.text(len(d1)-5, d1['Count'].mean()+2, 'Mean = ' + str(round(d1['Count'].mean(),2)), fontsize=14)

ax2 = fig.add_subplot(2,1,2)
ax2.bar(d2['Country'], d2['Count'], color=my_colors2)
ax2.legend(handles=[Above, At, Below], fontsize=14)
ax2.axhline(d2['Count'].mean(), color='black', linestyle='solid')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.set_title('Top 10 Countries', size=20)
ax2.set_xlabel('Country', fontsize=14)
ax2.text(len(d2)-3, d2['Count'].mean()+2, 'Mean = ' + str(round(d2['Count'].mean(),2)), fontsize=14)

fig.subplots_adjust(hspace=0.45)

plt.show()
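The country counts used in these charts come from summing one-hot `Location_` columns. The sketch below is a minimal illustration on a made-up frame (`demo` and its three countries are hypothetical) and also shows why a skewed distribution leaves most countries under the mean.

```python
import pandas as pd

# Hypothetical one-hot rows; the 'Location_' prefix follows the dataset's columns.
demo = pd.DataFrame({
    "Location_United States": [1, 1, 1, 1, 0, 0],
    "Location_Japan":         [0, 0, 0, 0, 1, 0],
    "Location_Iceland":       [0, 0, 0, 0, 0, 1],
})

location_cols = [c for c in demo.columns if c.startswith("Location_")]

# Summing each indicator column counts the universities in that country.
counts = demo[location_cols].sum().rename(lambda c: c.replace("Location_", ""))
counts = counts.sort_values(ascending=False)

below_mean = int((counts < counts.mean()).sum())
print(counts)
print("mean:", counts.mean(), "median:", counts.median(), "below mean:", below_mean)
```

When one country dominates, the mean is pulled above the median, so a majority of countries can sit below the mean line drawn on the chart.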

Horizontal Bar Chart

The chart shows that the United States leads in the number of universities, followed by Japan and the United Kingdom, all exceeding the mean of 68.9. There is a clear drop after these top three. India, China, Turkey, and Brazil follow at a distance, still above average but at noticeably lower levels. Most of the remaining countries are below the mean, clustered between 30 and 60. This illustrates the uneven global distribution of universities, where a small number of countries host a large share and the rest have far fewer.

bottom3 = 0
top3 = 19

d3 = x[x['Country'] != 'Unknown Location'].head(20).copy()
d3 = d3.sort_values('Count', ascending=True).reset_index(drop=True)
my_colors3 = pick_colors_according_to_mean_count(d3)

Above = mpatches.Patch(color='lightcoral', label='Above Average')
At = mpatches.Patch(color='black', label='Within 1% of the Average')
Below = mpatches.Patch(color='green', label='Below Average')

fig = plt.figure(figsize=(18, 12))
ax1 = fig.add_subplot(1, 1, 1)
ax1.barh(d3['Country'], d3['Count'], color=my_colors3)

for row_counter, value_at_row_counter in enumerate(d3['Count']):
    if value_at_row_counter > d3['Count'].mean() * 1.01:
        color = 'lightcoral'
    elif value_at_row_counter < d3['Count'].mean() * 0.99:
        color = 'green'
    else:
        color = 'black'
    ax1.text(value_at_row_counter + 1, row_counter, str(value_at_row_counter), color=color, size=12,
             fontweight='bold', ha='left', va='center', backgroundcolor='white')
plt.xlim(0, d3['Count'].max() * 1.1)

ax1.legend(loc='lower right', handles=[Above, At, Below], fontsize=14)
plt.axvline(d3['Count'].mean(), color='black', linestyle='dashed')
ax1.text(d3['Count'].mean() + 1, 0, 'Mean = ' + str(round(d3['Count'].mean(), 2)), rotation=0, fontsize=14)

ax1.set_title('Top 20 Countries by Number of Universities', size=20)
ax1.set_xlabel('Number of Universities', fontsize=16)
ax1.set_ylabel('Country', fontsize=16)
plt.xticks(fontsize=14)

plt.yticks(fontsize=14)

plt.show()
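The above, at, and below average grouping that colors these bars can also be written without an explicit loop. The sketch below is one possible vectorized form using `numpy.select`, with the same 1% thresholds; the `counts` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical counts whose mean is exactly 100.
counts = pd.Series([200, 100, 80, 20], index=["A", "B", "C", "D"])
avg = counts.mean()

# Conditions are checked in order; anything matching neither falls to the default.
labels = np.select(
    [counts > avg * 1.01, counts < avg * 0.99],
    ["above", "below"],
    default="at",
)
print(dict(zip(counts.index, labels)))
```

Mapping these labels to colors with a small dictionary would give the same result as `pick_colors_according_to_mean_count`, assuming the same thresholds.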

Dual Axis Bar Chart

The chart indicates that the United States leads in both the number of universities and the average citations score, which signals the best performance overall. The United Kingdom and Japan differ from each other: the United Kingdom has fewer universities but a high citations score, while Japan has more universities but a lower citations score. China and Iran are in a similar position to the United Kingdom: they have fewer universities but comparatively high citations scores, which implies greater research impact. The chart suggests that a greater number of universities does not always mean a higher average citations score.

location_cols = [col for col in df.columns if col.startswith('Location_')]

country_counts = pd.DataFrame({
    'Country': [col.replace('Location_', '') for col in location_cols],
    'Count': [df[col].sum() for col in location_cols]})

avg_citations = pd.DataFrame({
    'Country': [col.replace('Location_', '') for col in location_cols],
    'Average_Citations': [df.loc[df[col] == 1, 'Citations Score'].mean() for col in location_cols]})

x = pd.merge(country_counts, avg_citations, on='Country')
x = x[x['Count'] > 0]
x = x[x['Country'] != 'Unknown Location']

d2 = x.sort_values('Count', ascending=False).head(10).reset_index(drop=True)
d2

def autolabel(these_bars, this_ax, place_of_decimals, symbol):
    for each_bar in these_bars:
        height = each_bar.get_height()
        this_ax.text(each_bar.get_x()+each_bar.get_width()/2, height*1.01, symbol+format(height, place_of_decimals),
                    fontsize=11, color='black', ha='center', va='bottom')
                    
fig = plt.figure(figsize=(18,10))
ax1 = fig.add_subplot(1, 1, 1)
ax2 = ax1.twinx()
bar_width = 0.4

x_pos = np.arange(len(d2))
count_bars = ax1.bar(x_pos - (0.5 * bar_width), d2['Count'], bar_width, color='gray', edgecolor='black',
                     label='Number of Universities')
citation_bars = ax2.bar(x_pos + (0.5 * bar_width), d2['Average_Citations'], bar_width, color='green',
                        edgecolor='black', label='Average Citations Score')

ax1.set_xlabel('Country', fontsize=18)
ax1.set_ylabel('Number of Universities', fontsize=18, labelpad=20)
ax2.set_ylabel('Average Citations Score', fontsize=18, rotation=270, labelpad=20)
ax1.tick_params(axis='y', labelsize=14)
ax2.tick_params(axis='y', labelsize=14)

plt.title('Number of Universities and Average Citations Score\nTop 10 Countries', fontsize=18)
ax1.set_xticks(x_pos)
ax1.set_xticklabels(d2['Country'], fontsize=14, rotation=45, ha='right')

count_color, count_label = ax1.get_legend_handles_labels()
citation_color, citation_label = ax2.get_legend_handles_labels()
ax1.legend(count_color + citation_color, count_label + citation_label, loc='upper right',
           frameon=True, ncol=1, shadow=True, borderpad=1, fontsize=14)
ax1.set_ylim(0, d2['Count'].max() * 1.35)

autolabel(count_bars, ax1, '.0f', '')
autolabel(citation_bars, ax2, '.2f', '')

plt.show()
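The two series on this chart, a count and an average per country, can be computed in one step when the country is held in a single column. The sketch below is a minimal illustration that assumes such a `Location` column (the code above works from one-hot columns instead); the rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows: one per university.
demo = pd.DataFrame({
    "Location": ["X", "X", "X", "X", "Y", "Y"],
    "Citations Score": [40.0, 50.0, 60.0, 50.0, 90.0, 80.0],
})

# Named aggregation: number of universities and mean citations score per country.
by_country = (
    demo.groupby("Location")
        .agg(Count=("Citations Score", "size"),
             Average_Citations=("Citations Score", "mean"))
        .sort_values("Count", ascending=False)
)
print(by_country)
```

In this made-up example country X has more universities but the lower average, which is the kind of pattern the dual-axis chart is meant to expose.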

Bump Chart

This bump chart shows how the leading universities rank on each scoring metric (Teaching, Research, Citations, Industry Income, and International Outlook). Each line represents a university, and its position in each category shows that university’s strengths and weaknesses. For instance, institutions such as MIT and Oxford rank close to the very top in research and citations, while Imperial College London does much better in International Outlook. The movement of the lines shows that a university’s rank changes from metric to metric: a university can perform very well in one metric but rank much lower in another. The chart illustrates that even leading universities do not top every metric.

university_cols = [col for col in df.columns if col.startswith('Name of University_')]
df['University'] = df[university_cols].idxmax(axis=1).str.replace('Name of University_', '', regex=False)

df['University Rank Numeric'] = pd.to_numeric(df['University Rank'], errors='coerce')

bump_df = df[['University', 'University Rank Numeric', 'Teaching Score',
              'Research Score', 'Citations Score',
              'Industry Income Score', 'International Outlook Score']]

bump_df = bump_df.dropna(subset=['University Rank Numeric'])

bump_df = bump_df.sort_values('University Rank Numeric').head(10)

bump_df = bump_df.set_index('University')
bump_df = bump_df.drop(columns='University Rank Numeric')

bump_df_ranked = bump_df.rank(ascending=False, method='min').T

# Plotting a Bump Chart with text labels inside the markers

fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(1, 1, 1)

bump_df_ranked.plot(kind='line', ax=ax, marker='o', markeredgewidth=1, linewidth=6,
                    markersize=44, markerfacecolor='white')

ax.invert_yaxis()

num_rows = bump_df_ranked.shape[0]
num_cols = bump_df_ranked.shape[1]

plt.ylabel('University Ranking', fontsize=18, labelpad=10)
plt.title('Ranking of Top Universities Across Score Categories\n Bump Chart', fontsize=18, pad=15)
plt.xticks(np.arange(num_rows), bump_df_ranked.index, fontsize=14)

plt.yticks(range(1, num_cols + 1, 1), fontsize=14)

ax.set_xlabel('Score Category', fontsize=18)

handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels, bbox_to_anchor=(1.01, 1.01), fontsize=14,
          labelspacing=1,
          markerscale=.4,
          borderpad=1,
          handletextpad=0.8)

# Write each university's actual score inside its marker at (category, rank)
for j in range(num_cols):        # universities (columns of bump_df_ranked)
    for i in range(num_rows):    # score categories (rows of bump_df_ranked)
        this_rank = bump_df_ranked.iloc[i, j]
        ax.text(i, this_rank, str(round(bump_df.iloc[j, i], 1)),
                ha='center', va='center', fontsize=12)
plt.tight_layout()
plt.show()
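The per-category ranking that drives the bump chart can be checked on a tiny example. The sketch below is a minimal illustration with hypothetical universities and scores, using the same `rank(ascending=False, method='min')` call as the code above.

```python
import pandas as pd

# Hypothetical scores; rows are universities, columns are score categories.
scores = pd.DataFrame(
    {"Teaching Score": [92.0, 88.0, 95.0],
     "Citations Score": [99.0, 99.0, 90.0]},
    index=["Uni A", "Uni B", "Uni C"],
)

# Rank within each column: the highest score gets rank 1.
# With method="min", tied scores share the best (lowest) rank of their group.
ranks = scores.rank(ascending=False, method="min").astype(int)
print(ranks)
```

Because ties share a rank, two lines on the bump chart can land on the same marker position in a category, and the next rank is skipped.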

Multiple Lines Plot / Stacked Bar Chart

The line plot illustrates the consistent strength of the United States in all the categories shown, particularly citations and teaching. The United Kingdom also stands out, with high scores in citations and international outlook. The same is not true of countries in South America and the Middle East, particularly Brazil and Turkey, which perform poorly in most categories. Citations scores vary widely across countries, whereas teaching and research scores are more closely grouped.

location_cols = [col for col in df.columns if col.startswith("Location_")]
df["Location"] = (df[location_cols].idxmax(axis=1).str.replace("Location_", "", regex=False))
df = df[df["Location"] != "Unknown Location"]

score_cols = ["Teaching Score","Research Score","Citations Score","Industry Income Score","International Outlook Score"]

top_locations = df["Location"].value_counts().head(7).index.tolist()

plot_df = df[df["Location"].isin(top_locations)].melt(id_vars="Location", value_vars=score_cols, var_name="ScoreType", value_name="Score")
plot_df = (plot_df.groupby(["ScoreType", "Location"])["Score"].mean().reset_index())

score_order = score_cols
plot_df["ScoreType"] = pd.Categorical(plot_df["ScoreType"], categories=score_order, ordered=True)
plot_df = plot_df.sort_values(["Location", "ScoreType"])

fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(1, 1, 1)

my_colors = {top_locations[0]: "blue",
             top_locations[1]: "red",
             top_locations[2]: "green",
             top_locations[3]: "gray",
             top_locations[4]: "purple",
             top_locations[5]: "gold",
             top_locations[6]: "brown"}

for key, grp in plot_df.groupby("Location"):
    grp.plot(ax=ax, kind="line", x="ScoreType", y="Score", color=my_colors[key], label=key, marker="8")

plt.title("Average University Scores by Country", fontsize=18)
ax.set_xlabel("Ranking Score Category", fontsize=18)
ax.set_ylabel("Average Score", fontsize=18, labelpad=20)
ax.tick_params(axis='x', labelsize=14, rotation=20)
ax.tick_params(axis='y', labelsize=14)

plt.legend(loc='best', fontsize=14, ncol=1)
plt.subplots_adjust(bottom=0.3)
plt.show()
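The reshaping behind the line plot (wide scores to long form, then a mean per country and category) can be sketched on a small example. The frame below is hypothetical, and the final `unstack` is an extra step added here only to make the result easy to read.

```python
import pandas as pd

# Hypothetical rows: one per university.
demo = pd.DataFrame({
    "Location": ["X", "X", "Y"],
    "Teaching Score": [20.0, 40.0, 60.0],
    "Citations Score": [50.0, 70.0, 90.0],
})
score_cols = ["Teaching Score", "Citations Score"]

# Wide to long: one row per (university, score category) pair.
long_df = demo.melt(id_vars="Location", value_vars=score_cols,
                    var_name="ScoreType", value_name="Score")

# Mean score per category and country; unstack puts each country in its own column.
means = long_df.groupby(["ScoreType", "Location"])["Score"].mean().unstack("Location")
print(means)
```

Each column of `means` corresponds to one line on the plot, with the score categories along the x-axis.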

The stacked bar chart shows overall performance by country, broken down by score category. The United States has the highest overall performance, driven mostly by its citations and industry income scores. The United Kingdom and China have lower totals, while Brazil and Turkey have much lower scores across the categories, indicating weaker performance across the board.

stacked_df = (df[df["Location"].isin(top_locations)].groupby("Location")[score_cols].mean())
stacked_df = stacked_df.reindex(top_locations)

stacked_df["Total"] = stacked_df.sum(axis=1)
stacked_df = stacked_df.sort_values("Total", ascending=False).drop(columns="Total")

fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(1, 1, 1)

stacked_df.plot(kind='bar', stacked=True, ax=ax)

plt.ylabel('Average Score', fontsize=18, labelpad=10)
plt.title('Average University Scores by Country and Category\nStacked Bar Plot', fontsize=18)
plt.xticks(rotation=0, ha='center', fontsize=14)

plt.yticks(fontsize=14)

ax.set_xlabel('Country', fontsize=18)

handles, labels = ax.get_legend_handles_labels()
handles = handles[::-1]
labels = labels[::-1]
plt.legend(handles, labels, loc='best', fontsize=14)

plt.show()
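A stacked bar combines two things: the bar's total height and how that height splits across categories. The sketch below separates the two on a hypothetical table of mean scores; the unweighted sum is an assumption of this illustration and of the chart, not an official ranking formula.

```python
import pandas as pd

# Hypothetical mean scores per country (rows) and category (columns).
means = pd.DataFrame(
    {"Teaching Score": [30.0, 20.0], "Citations Score": [70.0, 60.0]},
    index=["X", "Y"],
)

totals = means.sum(axis=1)                    # height of each stacked bar
shares = means.div(totals, axis=0).round(2)   # each category's fraction of its bar
order = totals.sort_values(ascending=False).index

print(shares.loc[order])
```

Looking at `shares` alongside `totals` helps distinguish a country that is strong overall from one whose bar is dominated by a single category.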

Heatmap

The heatmap shows a range of performance across universities and score types. The University of Oxford and Imperial College London consistently score high, particularly in the research and citation categories, reflecting high academic esteem. Conversely, Indian institutions score significantly lower across the board, although they perform relatively better in industry research than in industry income. The heatmap suggests that, among the categories shown, industry income is the least variable and has the most consistent score distribution across universities.

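# keep only top 5 universities based on rank
# note: this sorts the raw 'University Rank' column; if that column is stored as text,
# the order is lexicographic ('10' sorts before '2'), unlike 'University Rank Numeric'
top_df = df.sort_values('University Rank').head(5).copy()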

university_cols = [col for col in df.columns if col.startswith('Name of University_')]
top_df['University'] = top_df[university_cols].idxmax(axis=1).str.replace('Name of University_', '', regex=False)

bump_df = top_df[['University',
                  'Teaching Score',
                  'Research Score',
                  'Citations Score',
                  'Industry Income Score',
                  'International Outlook Score']]

bump_df = bump_df.set_index('University')

import seaborn as sns

fig = plt.figure(figsize=(16, 10))
ax = fig.add_subplot(1, 1, 1)

ax = sns.heatmap(bump_df, linewidths=0.2, annot=True, cmap='coolwarm',
                 fmt='.1f', square=True, annot_kws={'size': 14},
                 cbar_kws={'orientation': 'vertical'})

plt.title('Heatmap of Top 5 University Scores Across Categories', fontsize=18, pad=15)
plt.xlabel('Score Category', fontsize=18, labelpad=10)
plt.ylabel('Top 5 Universities', fontsize=18, labelpad=12)
plt.xticks(rotation=45, ha='right', fontsize=12)

plt.yticks(fontsize=12)

cbar = ax.collections[0].colorbar
cbar.set_label('Score', rotation=270, fontsize=14, labelpad=20)
plt.subplots_adjust(bottom=0.3)
plt.show()
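Which rows count as the "top 5" depends on how the rank column is stored. The sketch below is a minimal illustration, assuming ranks may be held as text and may include banded values; the frame and its four universities are hypothetical.

```python
import pandas as pd

# Hypothetical rank column stored as text, including a banded rank.
demo = pd.DataFrame({
    "University": ["A", "B", "C", "D"],
    "University Rank": ["2", "10", "1", "1001-1200"],
})

# Sorting text compares character by character, not by numeric value.
text_order = demo.sort_values("University Rank")["University"].tolist()

# Coerce to numbers; entries that are not plain numbers become NaN.
demo["Rank Numeric"] = pd.to_numeric(demo["University Rank"], errors="coerce")
numeric_top3 = demo.nsmallest(3, "Rank Numeric")["University"].tolist()

print(text_order)     # lexicographic order
print(numeric_top3)   # numeric order
```

If the real column is text, sorting on a numeric copy such as `University Rank Numeric` (created in the bump chart code) would select a different set of universities than sorting on the raw column.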

Conclusion

The visualizations presented show clear patterns in the performance of universities worldwide. Universities in the United States, the United Kingdom, and Japan have better scores in research and citations, signaling the importance of academic output and research activity. Universities in the rest of the countries score lower overall, and their performance in industry income and international outlook is relatively better.

These patterns are evident both when countries are compared with each other and when score categories are compared with each other. Citations and research show a strong relationship, so the output of a university’s researchers can bring it considerable recognition relative to other universities. The differing strength of the relationships among the other categories could also reflect different areas of focus among universities.

The data shows geographic differences in university performance in the global ranking. Top universities are unevenly spread across the range of performance, which points to a continued need for universities to improve in these same areas to move up the global ranking.