
---
title: "Project 1: Is There a Rise In Bandwidth Usage"
output: pdf_document
author: "Nhat Thanh Tran"
date: "2025-11-18"
---
I came across a data set measuring international bandwidth usage per capita by country (also grouped by region, continent, etc.). With the recent global boom in AI, cryptocurrencies, and social media, I think this is a good opportunity to connect these social phenomena to the underlying data. The scope of this project is limited to the countries and regions in the data, because the data set does not provide other social measurements. This is a limiting constraint: future data collection and evaluation could measure bandwidth usage by other social conditions such as wealth, family size, or age group. As such, many explorations of the data will be backed by social context as of 2025.

The original data set has many attributes that I will not showcase, because each is either irrelevant to the measured data (an identifier for the organization that compiled the data set) or simply a column giving the full label of an acronym used elsewhere. The columns needed are therefore the geographical location where the data was measured, the time period in years, and the observed value in KbPS (kilobits per second). Note that the observation value is obtained, as stated by the data collectors, by dividing bandwidth usage by the population, which can cause issues discussed later.

The code below loads the data set used for the project and selects the 3 relevant columns that will be analyzed. The plots are made with ggplot2, as there is little need for interactive plots through plotly, and dplyr is used as the standard package for data manipulation.

library(readr)
library(ggplot2)
library(dplyr)
bandwidthdf <- readr::read_csv("Band_Per_Cap.csv", col_select = c("REF_AREA_LABEL", "TIME_PERIOD", "OBS_VALUE"))

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'ggplot2' was built under R version 4.5.2

The first 20 entries show that the data is not uniformly collected. In other words, not all countries have the same number of observations or the same time frame. This makes sense, as countries go through wars and shifting political landscapes can change policies on data collection. It is therefore important to specify the time period for later analysis.

head(bandwidthdf, 20)

## # A tibble: 20 × 3
##    REF_AREA_LABEL TIME_PERIOD OBS_VALUE
##    <chr>                <dbl>     <dbl>
##  1 Aruba                 2001   997.   
##  2 Aruba                 2002   985.   
##  3 Aruba                 2003   975.   
##  4 Aruba                 2004  1448.   
##  5 Aruba                 2005  1910.   
##  6 Aruba                 2006  6783.   
##  7 Aruba                 2007  6440.   
##  8 Aruba                 2008 12699.   
##  9 Aruba                 2009 40394.   
## 10 Aruba                 2010 51142.   
## 11 Aruba                 2011 50703.   
## 12 Aruba                 2012 50313    
## 13 Afghanistan           2005     0.164
## 14 Afghanistan           2006     0.157
## 15 Afghanistan           2007     0.811
## 16 Afghanistan           2008     5.66 
## 17 Afghanistan           2009    58.3  
## 18 Afghanistan           2010    70.7  
## 19 Afghanistan           2011   102.   
## 20 Afghanistan           2012   131.
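
The uneven coverage can also be summarized per country instead of read off the printout. The following is a minimal sketch, assuming a toy data frame (`coverage_demo`, a name introduced here for illustration) that only copies the year ranges visible above, not the full data set.

```r
library(dplyr)

# Toy copy of the coverage seen in the printout: Aruba 2001-2012, Afghanistan 2005-2012.
# OBS_VALUE is left out because only the years matter for this check.
coverage_demo <- data.frame(
  REF_AREA_LABEL = c(rep("Aruba", 12), rep("Afghanistan", 8)),
  TIME_PERIOD    = c(2001:2012, 2005:2012)
)

# One row per country: number of observed years and the first and last year.
coverage_summary <- coverage_demo %>%
  group_by(REF_AREA_LABEL) %>%
  summarise(
    n_years    = n(),
    first_year = min(TIME_PERIOD),
    last_year  = max(TIME_PERIOD)
  )
coverage_summary
```

The same pipeline should work on `bandwidthdf` directly, since it uses only the two columns shown.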

Note that bandwidth is NOT the total amount of data used, as in a phone plan offering “up to 100GB of data per month”. Bandwidth is the amount of data a network can handle at a given time, measured in data per second. For example, for a website that needs to transmit 10 kilobits of data, a network with a bandwidth of 1 KbPS will take 10 seconds to load the website, while a network at 10 KbPS will take 1 second. In practice, networks usually have multiple devices splitting the bandwidth, and bandwidth is only the upper limit, not the average.
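
The arithmetic in this example can be written as a small helper. This is only an illustrative sketch: `transfer_seconds` is a hypothetical function that is not part of the analysis, and it assumes a single device using the full bandwidth.

```r
# Time needed to move a payload over a link running at its full bandwidth.
# size_kbit: payload size in kilobits; bandwidth_kbps: bandwidth in kilobits per second.
transfer_seconds <- function(size_kbit, bandwidth_kbps) {
  stopifnot(all(bandwidth_kbps > 0))
  size_kbit / bandwidth_kbps
}

transfer_seconds(10, c(1, 10))  # 10 seconds at 1 KbPS, 1 second at 10 KbPS
```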

Let’s now factor in the per capita aspect of the data set by taking a random sample of locations in 2022 to further illustrate what the observation value is measuring.

bandwidth2022 <- filter(bandwidthdf, TIME_PERIOD == 2022)  # TIME_PERIOD is numeric, so compare to a number
set.seed(1234)
bandwidth2022_random <- bandwidth2022 %>% slice_sample(n = 5)
bandwidth2022_random

## # A tibble: 5 × 3
##   REF_AREA_LABEL                          TIME_PERIOD OBS_VALUE
##   <chr>                                         <dbl>     <dbl>
## 1 Cote d'Ivoire                                  2022   16722. 
## 2 Singapore                                      2022 8986660  
## 3 Zambia                                         2022   10604. 
## 4 South Sudan                                    2022      72.6
## 5 Middle East & North Africa (IDA & IBRD)        2022  132939.

Below is the population data in 2023 for the first 4 entries of the sample above. The 5th entry is a region, for which this data set takes the average observation value of all countries in the region, so it is not useful for this comparison.

population2022_random <- data.frame(Countries = c("Cote D'Ivoire", "Singapore", "South Sudan", "Zambia"), Population = c("31,165,654", "5,917,648", "11,483,374", "21,913,874"))
population2022_random

##       Countries Population
## 1 Cote D'Ivoire 31,165,654
## 2     Singapore  5,917,648
## 3   South Sudan 11,483,374
## 4        Zambia 21,913,874

Based on the social context behind the data, there are two more issues to be aware of before analysis. First, because the data set measures bandwidth per capita, the values can easily be skewed by a small population such as Singapore’s, which is around half of South Sudan’s, a quarter of Zambia’s, and a fifth of Cote D’Ivoire’s. Second, there is a geographical limitation: a bigger country will find it more difficult to distribute consistently high bandwidth to every network across the whole country, even before considering wealth and infrastructure. This data is best used alongside these contexts.
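
The population comparison can be checked with a few lines of arithmetic. This is a minimal sketch that copies the population figures from the table above. The variable names are introduced here for illustration, and the comma-formatted strings are converted to numbers first.

```r
# Population figures copied from the table above (stored there as comma-formatted strings).
population <- c(
  "Cote D'Ivoire" = "31,165,654",
  "Singapore"     = "5,917,648",
  "South Sudan"   = "11,483,374",
  "Zambia"        = "21,913,874"
)

# Strip the thousands separators so the values can be used in arithmetic.
population_num <- setNames(as.numeric(gsub(",", "", population)), names(population))

# Singapore's population as a fraction of each other country's population.
others <- c("South Sudan", "Zambia", "Cote D'Ivoire")
singapore_share <- round(population_num[["Singapore"]] / population_num[others], 2)
singapore_share  # South Sudan 0.52, Zambia 0.27, Cote D'Ivoire 0.19
```

The ratios come out close to one half, one quarter, and one fifth.
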

The code below builds the list of countries that have observation values for every year from 2014 to 2023 (the list isn’t shown due to length). For a fair comparison based on the observations above, I will handpick Honduras, Cuba, and the Dominican Republic as countries of roughly similar size, infrastructure, and population and plot their bandwidth usage over this year range.

# Keep 2014-2023, then keep only countries with an observation in every year of the range.
bandwidth_by_range <- bandwidthdf %>%
  filter(TIME_PERIOD %in% 2014:2023) %>%
  group_by(REF_AREA_LABEL) %>%
  filter(all(2014:2023 %in% TIME_PERIOD)) %>%
  ungroup()
# Names of the countries with complete coverage (not printed due to length).
countries_by_range <- bandwidth_by_range %>%
  distinct(REF_AREA_LABEL) %>%
  pull(REF_AREA_LABEL)
bandwidth_by_range_filtered <- filter(bandwidth_by_range, REF_AREA_LABEL %in% c("Honduras", "Cuba", "Dominican Republic"))
ggplot(bandwidth_by_range_filtered, aes(x = TIME_PERIOD, y = OBS_VALUE, color = REF_AREA_LABEL)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Bandwidth Usage by Countries from 2014-2023",
    x = "Year",
    y = "Bandwidth per capita (KbPS)",
    color = "Countries"
  )
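
The grouped `filter(all(2014:2023 %in% TIME_PERIOD))` step is what removes countries with gaps in the range. A toy check of that behavior, assuming two hypothetical countries "A" and "B" that are not in the data set, might look like this:

```r
library(dplyr)

# Hypothetical toy data: country "A" has every year 2014-2023, country "B" is missing 2023.
toy <- data.frame(
  REF_AREA_LABEL = c(rep("A", 10), rep("B", 9)),
  TIME_PERIOD    = c(2014:2023, 2014:2022)
)

complete_only <- toy %>%
  group_by(REF_AREA_LABEL) %>%
  filter(all(2014:2023 %in% TIME_PERIOD)) %>%  # TRUE only when no year in the range is absent
  ungroup()

unique(complete_only$REF_AREA_LABEL)  # only "A" remains
```

Because the condition is evaluated once per group, a country is kept or dropped as a whole, never row by row.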

I will now approximate the derivative of each country’s bandwidth usage over the year range 2014-2023, using the change from one year to the next, and plot it.

# Year-to-year change per country: the value stored at year t is the change from t to t + 1,
# so the last year of each country has no derivative and is dropped.
combined_data <- bandwidth_by_range_filtered %>%
  group_by(REF_AREA_LABEL) %>%
  arrange(TIME_PERIOD, .by_group = TRUE) %>%
  mutate(Derivative = c(diff(OBS_VALUE) / diff(TIME_PERIOD), NA)) %>%
  ungroup() %>%
  filter(!is.na(Derivative))

ggplot(combined_data, aes(x = TIME_PERIOD, y = Derivative, color = REF_AREA_LABEL)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Derivative by Countries",
    x = "Year",
    y = "Derivative"
  )

The derivative in this case is the rate of change of bandwidth usage from year to year. If there had been a rise in bandwidth usage in recent years, the derivative graph would have an upward trajectory near the end. Disregarding the outlier year in the Dominican Republic, the bandwidth usage derivative for these three countries has been relatively stable from 2014-2023, meaning that in a vacuum with no social context, these three countries show no rise in bandwidth usage. However, knowing the social context, let’s take a look at countries that have been the focus of the recent boom in AI, cryptocurrencies, and social media and that also have data from 2014-2023, such as China, Hong Kong, and Singapore, through the same process.
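
To make the derivative calculation concrete, here is a small worked example of the same `diff()` expression. The yearly values are made up for illustration and are not taken from the data set.

```r
# Hypothetical yearly values, chosen only to make the arithmetic easy to follow.
year  <- c(2014, 2015, 2016, 2017)
value <- c(100, 110, 130, 160)

# Change in value divided by change in time between consecutive years.
# diff() returns one fewer element than its input, so NA pads the final year.
derivative <- c(diff(value) / diff(year), NA)

data.frame(year, value, derivative)  # derivative column: 10, 20, 30, NA
```

In this toy series the derivative itself climbs, which is the upward trajectory described above. A flat but positive derivative would still correspond to steady year-over-year growth in the values, so the graph is best read as a measure of whether growth is speeding up.
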

Because this code repeats the process above, I will keep it on this page for cleaner presentation and show only the derivative graphs afterward. Because of the large difference in scale between China and the group of Hong Kong and Singapore, China is plotted separately for clearer presentation.

bandwidth_by_range_filtered2 <- filter(bandwidth_by_range, REF_AREA_LABEL %in% c("China", "Singapore", "Hong Kong SAR, China"))

# Same year-to-year change as before, computed per country; the last year has no value and is dropped.
derivatives2 <- bandwidth_by_range_filtered2 %>%
  group_by(REF_AREA_LABEL) %>%
  arrange(TIME_PERIOD, .by_group = TRUE) %>%
  mutate(Derivative = c(diff(OBS_VALUE) / diff(TIME_PERIOD), NA)) %>%
  ungroup() %>%
  filter(!is.na(Derivative))

# China is on a very different scale, so it is kept apart from Hong Kong and Singapore.
china_subset <- filter(derivatives2, REF_AREA_LABEL == "China")
combined_data2 <- filter(derivatives2, REF_AREA_LABEL != "China")

ggplot(combined_data2, aes(x = TIME_PERIOD, y = Derivative, color = REF_AREA_LABEL)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Derivative by Countries",
    x = "Year",
    y = "Derivative"
  )
ggplot(china_subset, aes(x = TIME_PERIOD, y = Derivative, color = REF_AREA_LABEL)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Derivative by Country",
    x = "Year",
    y = "Derivative"
  )
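
As an alternative to drawing separate plots for series on very different scales, ggplot2's `facet_wrap()` with `scales = "free_y"` gives each panel its own y-axis. The sketch below uses made-up derivative values and placeholder labels (`facet_demo`, "Large scale", "Small scale"), none of which come from the data set.

```r
library(ggplot2)

# Hypothetical derivative values on very different scales (not taken from the data set).
facet_demo <- data.frame(
  REF_AREA_LABEL = rep(c("Large scale", "Small scale"), each = 3),
  TIME_PERIOD    = rep(2014:2016, times = 2),
  Derivative     = c(100000, 150000, 120000, 10, 30, 20)
)

# scales = "free_y" lets each panel pick its own y-axis, so the small series is not flattened.
p <- ggplot(facet_demo, aes(x = TIME_PERIOD, y = Derivative)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ REF_AREA_LABEL, ncol = 1, scales = "free_y") +
  labs(x = "Year", y = "Derivative")
p
```

The trade-off is that free axes make panels harder to compare by eye, so separate plots as used here remain a reasonable choice.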

Across the 6 sample countries that had data points for 2014-2023, the most consistent result is that only Hong Kong experienced a rise in usage rate, while the other 5 decreased or stabilized. While neither side necessarily proves a rise or a decline, it is more probable that the boom of AI, cryptocurrencies, and social media in recent years has affected total data usage but not speed, nor has it prompted heightened development to increase speed. This is the most logical conclusion given the social context and the two comparison groups: neither the countries indirectly influenced by the AI boom nor the countries directly participating in it show an increasing trend in the derivative of their bandwidth usage.