---
title: "Project 1: Is There a Rise In Bandwidth Usage"
output: pdf_document
author: "Nhat Thanh Tran"
date: "2025-11-18"
---
I came across a data set measuring international bandwidth usage per
capita by country (also grouped by region, continent, etc.). With the
recent global boom in AI, cryptocurrencies, and social media, I think
this is a good opportunity to connect these social phenomena to the
underlying data. The scope of this project is limited to the countries
and regions in the bandwidth data, because the data set provides no
other social measurements. This is a limiting constraint: future data
collection could measure bandwidth usage against other social
conditions such as wealth, family size, or age group. As such, many
explorations of the data will be backed up by social context as of 2025.
The original data set has many attributes that I chose not to showcase,
because they are either irrelevant to the measured data (identifiers for
the organization that compiled the data set) or simply columns giving
the full label of an acronym used elsewhere. As a result, the columns
needed are the geographical location where the data was measured, the
time period in years, and the observed value in KbPS (kilobits per
second). Note that, as stated by the data collectors, the observed value
is bandwidth usage divided by the population, which can cause issues
discussed later.
The code below loads the data set used for the project and selects the 3
relevant columns that will be analyzed. Most of the plots are made with
ggplot2, since there is little need for interactive plots through
plotly, and dplyr is the standard choice for data manipulation.
library(readr)
library(ggplot2)
library(dplyr)
bandwidthdf <- readr::read_csv("Band_Per_Cap.csv", col_select = c("REF_AREA_LABEL", "TIME_PERIOD", "OBS_VALUE"))
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
The first 20 entries show that this data is not uniformly collected. In
other words, not all countries have the same number of observations or
the same time frame of data collected. This makes sense, as countries go
through wars and shifting political landscapes can change policies on
data collection. Therefore, it will be important later to specify the
time period for analysis.
head(bandwidthdf, 20)
## # A tibble: 20 × 3
## REF_AREA_LABEL TIME_PERIOD OBS_VALUE
## <chr> <dbl> <dbl>
## 1 Aruba 2001 997.
## 2 Aruba 2002 985.
## 3 Aruba 2003 975.
## 4 Aruba 2004 1448.
## 5 Aruba 2005 1910.
## 6 Aruba 2006 6783.
## 7 Aruba 2007 6440.
## 8 Aruba 2008 12699.
## 9 Aruba 2009 40394.
## 10 Aruba 2010 51142.
## 11 Aruba 2011 50703.
## 12 Aruba 2012 50313
## 13 Afghanistan 2005 0.164
## 14 Afghanistan 2006 0.157
## 15 Afghanistan 2007 0.811
## 16 Afghanistan 2008 5.66
## 17 Afghanistan 2009 58.3
## 18 Afghanistan 2010 70.7
## 19 Afghanistan 2011 102.
## 20 Afghanistan 2012 131.
Note that bandwidth is NOT the total amount of data used, as in a phone
plan offering “up to 100GB of data per month”. Bandwidth is the amount
of data that a network can handle at a given time, measured in data per
second. For example, for a website that needs to transmit 10 kilobits of
data, a network with a bandwidth of 1 KbPS will take 10 seconds to load
the website, while a network at 10 KbPS will take one second. In
practice, though, networks usually have multiple devices splitting the
bandwidth, and bandwidth is only the upper limit, not the average.
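The arithmetic in the example above can be checked with a short sketch. The helper `transfer_time()` and the four-device scenario are illustrative assumptions, not part of the data set or the original analysis, and the calculation ignores latency and protocol overhead.

```r
# Ideal transfer time in seconds: payload size divided by bandwidth.
# Units must match (kilobits and kilobits per second here).
transfer_time <- function(size_kb, bandwidth_kbps) {
  stopifnot(all(bandwidth_kbps > 0))
  size_kb / bandwidth_kbps
}

page_size_kb <- 10        # the 10-kilobit website from the text
bandwidths   <- c(1, 10)  # KbPS

times <- transfer_time(page_size_kb, bandwidths)
print(times)

# Hypothetical case: 4 devices sharing the 10 KbPS link equally
shared_time <- transfer_time(page_size_kb, 10 / 4)
print(shared_time)
```

Sharing the link lowers the bandwidth each device effectively sees, which is why the stated bandwidth is a ceiling rather than a typical speed.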
Let’s now factor in the per capita aspect of the data set by taking a
random sample of five locales from 2022 to further illustrate what the
observed value is measuring.
bandwidth2022 <- filter(bandwidthdf, TIME_PERIOD == 2022)
set.seed(1234)
bandwidth2022_random <- bandwidth2022 %>% slice_sample(n = 5)
bandwidth2022_random
## # A tibble: 5 × 3
## REF_AREA_LABEL TIME_PERIOD OBS_VALUE
## <chr> <dbl> <dbl>
## 1 Cote d'Ivoire 2022 16722.
## 2 Singapore 2022 8986660
## 3 Zambia 2022 10604.
## 4 South Sudan 2022 72.6
## 5 Middle East & North Africa (IDA & IBRD) 2022 132939.
Below is the 2023 population for the four countries in the sample above.
The fifth entry is a region, for which this data set reports the average
observed value across all countries in the specified region, so it is
not useful for this comparison.
population2022_random <- data.frame(Countries = c("Cote D'Ivoire", "Singapore", "South Sudan", "Zambia"), Population = c("31,165,654", "5,917,648", "11,483,374", "21,913,874"))
population2022_random
## Countries Population
## 1 Cote D'Ivoire 31,165,654
## 2 Singapore 5,917,648
## 3 South Sudan 11,483,374
## 4 Zambia 21,913,874
Based on the social context behind the data, there are two more issues
to be aware of before analysis. First, because the data set reports
bandwidth per capita, the values can easily be skewed by a small
population. Singapore, for example, has around half the population of
South Sudan, a quarter of Zambia’s, and a fifth of Cote D’Ivoire’s.
Second is geographical limitation: a bigger country will find it more
difficult to distribute consistently high bandwidth to every network
across the whole country, even before considering wealth and
infrastructure. This data is best used with these contexts in mind.
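The population comparison above can be reproduced with a minimal sketch. It copies the figures from the population table and assumes the comma-formatted strings are converted to numbers first. The column names `PopulationNum` and `SingaporeRatio` are introduced here for illustration only.

```r
# Population figures copied from the table above (comma-formatted strings).
population <- data.frame(
  Countries  = c("Cote D'Ivoire", "Singapore", "South Sudan", "Zambia"),
  Population = c("31,165,654", "5,917,648", "11,483,374", "21,913,874")
)

# Strip the thousands separators so the values can be used in arithmetic.
population$PopulationNum <- as.numeric(gsub(",", "", population$Population))

# Singapore's population as a fraction of each country's population.
singapore <- population$PopulationNum[population$Countries == "Singapore"]
population$SingaporeRatio <- round(singapore / population$PopulationNum, 2)

print(population[, c("Countries", "SingaporeRatio")])
```

Because the observed value divides by population, the same total bandwidth spread over Singapore's population yields a per capita figure several times larger than it would in the other three countries.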
The code below builds the list of countries that had observed values for
every year from 2014 to 2023 (the list isn’t shown due to length). To
keep the comparison fair, I handpicked Honduras, Cuba, and the Dominican
Republic as countries of roughly similar size, infrastructure, and
population, and plotted their bandwidth usage over the year range.
bandwidth_by_range <- filter(bandwidthdf, TIME_PERIOD %in% 2014:2023) %>%
  group_by(REF_AREA_LABEL) %>%
  filter(all(2014:2023 %in% TIME_PERIOD)) %>%
  ungroup()
countries_by_range <- distinct(bandwidth_by_range, REF_AREA_LABEL) %>%
  pull(REF_AREA_LABEL)
bandwidth_by_range_filtered <- filter(bandwidth_by_range, REF_AREA_LABEL %in% c("Honduras", "Cuba", "Dominican Republic"))
ggplot(bandwidth_by_range_filtered, aes(x = TIME_PERIOD, y = OBS_VALUE, color = REF_AREA_LABEL)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Bandwidth Usage by Countries from 2014-2023",
    x = "Year",
    y = "Bandwidth by KbPS",
    color = "Countries"
  )
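To show what the grouped `filter(all(... %in% TIME_PERIOD))` step above does, here is a small sketch on made-up data. The areas "A" and "B", their values, and the shortened year range are hypothetical and chosen only to keep the example small.

```r
library(dplyr)

# Toy data (hypothetical): area "A" has every year 2014-2016, area "B" lacks 2015.
toy <- data.frame(
  REF_AREA_LABEL = c("A", "A", "A", "B", "B"),
  TIME_PERIOD    = c(2014, 2015, 2016, 2014, 2016),
  OBS_VALUE      = c(10, 12, 15, 3, 4)
)

required_years <- 2014:2016

complete_only <- toy %>%
  filter(TIME_PERIOD %in% required_years) %>%
  group_by(REF_AREA_LABEL) %>%
  # all() is evaluated once per group, so a group is kept only if
  # none of the required years is missing from it
  filter(all(required_years %in% TIME_PERIOD)) %>%
  ungroup()

print(unique(complete_only$REF_AREA_LABEL))
```

Only "A" survives because "B" has a gap, which is the same reason countries with missing years drop out of the 2014-2023 comparison.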
I will now subset the data and approximate the derivative for each
country over 2014-2023, using the year-over-year difference in the
observed value, and plot it.
# Split the filtered data by country
cuba_data <- bandwidth_by_range_filtered %>% filter(REF_AREA_LABEL == "Cuba")
domrep_data <- bandwidth_by_range_filtered %>% filter(REF_AREA_LABEL == "Dominican Republic")
honduras_data <- bandwidth_by_range_filtered %>% filter(REF_AREA_LABEL == "Honduras")
# Forward difference: the value stored for year t is the change from t to t + 1,
# so the final year has no value and is padded with NA
cuba_subset <- cuba_data %>%
  filter(TIME_PERIOD >= 2014, TIME_PERIOD <= 2023) %>%
  arrange(TIME_PERIOD) %>%
  mutate(Derivative = c(diff(OBS_VALUE) / diff(TIME_PERIOD), NA))
domrep_subset <- domrep_data %>%
  filter(TIME_PERIOD >= 2014, TIME_PERIOD <= 2023) %>%
  arrange(TIME_PERIOD) %>%
  mutate(Derivative = c(diff(OBS_VALUE) / diff(TIME_PERIOD), NA))
honduras_subset <- honduras_data %>%
  filter(TIME_PERIOD >= 2014, TIME_PERIOD <= 2023) %>%
  arrange(TIME_PERIOD) %>%
  mutate(Derivative = c(diff(OBS_VALUE) / diff(TIME_PERIOD), NA))
# Drop the NA-padded final year of each country
cuba_subset <- cuba_subset %>% filter(!is.na(Derivative))
domrep_subset <- domrep_subset %>% filter(!is.na(Derivative))
honduras_subset <- honduras_subset %>% filter(!is.na(Derivative))
combined_data <- bind_rows(cuba_subset, domrep_subset, honduras_subset)
ggplot(combined_data, aes(x = TIME_PERIOD, y = Derivative, color = REF_AREA_LABEL)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Derivative by Countries",
    x = "Year",
    y = "Derivative"
  )
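As a reading aid for the derivative plot, the sketch below applies the same forward difference to two made-up series. The numbers and the helper `forward_diff()` are hypothetical and are not taken from the data set.

```r
# Hypothetical yearly series, for illustration only (not taken from the data set)
years  <- 2014:2018
steady <- c(100, 150, 200, 250, 300)  # grows by the same amount every year
boom   <- c(100, 110, 130, 170, 250)  # grows by more every year

# Same forward difference as the Derivative column:
# change to the next year divided by the gap in years, NA for the final year
forward_diff <- function(value, year) {
  c(diff(value) / diff(year), NA)
}

steady_rate <- forward_diff(steady, years)
boom_rate   <- forward_diff(boom, years)

print(data.frame(years, steady_rate, boom_rate))
```

A flat derivative at a positive value still means usage is growing, just at a constant pace. Only a series whose growth speeds up produces a derivative that trends upward.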
The derivative in this case is the rate of change of bandwidth usage
from year to year. If the growth in bandwidth usage had picked up in
recent years, the derivative graph would show an upward trajectory near
the end. Disregarding the outlier year in the Dominican Republic, the
bandwidth usage derivative for these three countries has been relatively
stable from 2014-2023, meaning that in a vacuum with no social context,
these three countries would show that there has not been an accelerated
rise in bandwidth usage. However, knowing the social context, let’s take
a look at countries that have been the focus of the recent boom in AI,
cryptocurrencies, and social media and that also had data from
2014-2023, such as China, Hong Kong, and Singapore, through the same
process.
Due to the repetitive nature of this code, and for cleaner presentation,
I will leave the code on this page and only show the derivative graphs
after. Because China is on a very different scale from Hong Kong and
Singapore, China is plotted separately for clearer presentation.
bandwidth_by_range_filtered2 <- filter(bandwidth_by_range, REF_AREA_LABEL %in% c("China", "Singapore", "Hong Kong SAR, China"))
china_data <- bandwidth_by_range_filtered2 %>% filter(REF_AREA_LABEL == "China")
hk_data <- bandwidth_by_range_filtered2 %>% filter(REF_AREA_LABEL == "Hong Kong SAR, China")
sg_data <- bandwidth_by_range_filtered2 %>% filter(REF_AREA_LABEL == "Singapore")
china_subset <- china_data %>%
  filter(TIME_PERIOD >= 2014, TIME_PERIOD <= 2023) %>%
  arrange(TIME_PERIOD) %>%
  mutate(Derivative = c(diff(OBS_VALUE) / diff(TIME_PERIOD), NA))
hk_subset <- hk_data %>%
  filter(TIME_PERIOD >= 2014, TIME_PERIOD <= 2023) %>%
  arrange(TIME_PERIOD) %>%
  mutate(Derivative = c(diff(OBS_VALUE) / diff(TIME_PERIOD), NA))
sg_subset <- sg_data %>%
  filter(TIME_PERIOD >= 2014, TIME_PERIOD <= 2023) %>%
  arrange(TIME_PERIOD) %>%
  mutate(Derivative = c(diff(OBS_VALUE) / diff(TIME_PERIOD), NA))
china_subset <- china_subset %>% filter(!is.na(Derivative))
hk_subset <- hk_subset %>% filter(!is.na(Derivative))
sg_subset <- sg_subset %>% filter(!is.na(Derivative))
combined_data2 <- bind_rows(hk_subset, sg_subset)
ggplot(combined_data2, aes(x = TIME_PERIOD, y = Derivative, color = REF_AREA_LABEL)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Derivative by Countries",
    x = "Year",
    y = "Derivative"
  )
ggplot(china_subset, aes(x = TIME_PERIOD, y = Derivative, color = REF_AREA_LABEL)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Derivative by Country",
    x = "Year",
    y = "Derivative"
  )
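The per-country blocks above repeat the same steps. One possible way to avoid the repetition is to compute the forward difference inside a grouped `mutate()`. This is only a sketch on hypothetical toy data, not the code used to produce the plots, and it assumes dplyr is available.

```r
library(dplyr)

# Toy data (hypothetical values), two areas with three years each
toy <- data.frame(
  REF_AREA_LABEL = rep(c("A", "B"), each = 3),
  TIME_PERIOD    = rep(2014:2016, times = 2),
  OBS_VALUE      = c(10, 15, 25, 100, 90, 95)
)

rates <- toy %>%
  group_by(REF_AREA_LABEL) %>%
  # sort years within each area so diff() compares consecutive years
  arrange(TIME_PERIOD, .by_group = TRUE) %>%
  # forward difference per area, NA for each area's final year
  mutate(Derivative = c(diff(OBS_VALUE) / diff(TIME_PERIOD), NA)) %>%
  ungroup() %>%
  filter(!is.na(Derivative))

print(rates)
```

Grouping keeps the differences from crossing area boundaries, so the last year of "A" is never compared with the first year of "B".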
Across the 6 sample countries that had data for 2014-2023, the most
consistent finding is that only Hong Kong experienced a rise in its
usage rate, while the other 5 decreased or stabilized. While neither
group conclusively proves a rise or a decline, it is more probable that
the boom of AI, cryptocurrencies, and social media in recent years has
affected total data usage but not bandwidth speed, and has not prompted
heightened development to increase it. This is the most logical
conclusion given the social context and the two comparison groups:
apart from Hong Kong, neither the countries indirectly influenced by the
AI boom nor the countries directly participating in it show an
increasing trend in the derivative of their bandwidth usage.