Data 607 Final Project: Population Internet Access and Economic Growth
Author
Emily El Mouaquite, Long Lin, Pascal Hermann Kouogang Tafo, Zihao Yu
Introduction
For this project, we want to investigate the relationship between internet access and economic opportunity through the following research question:
Is higher internet access associated with stronger economic growth?
To answer this question, we will integrate two data sources. Internet access comes from the online dataset Share of the population using the Internet (CSV file). Economic indicators, including GDP (current US$) and annual GDP growth, will be retrieved through the World Bank Indicators API, which provides JSON/XML data for all countries. Our motivation is that digital connectivity has become essential for education, commerce, and innovation, so we want to see whether unequal access to the internet reinforces existing economic inequalities. By analyzing long-term trends in internet usage alongside GDP growth, this project aims to quantify how strongly digital access correlates with national economic performance.
The workflow for this project will include cleaning the internet access data from the CSV, retrieving GDP data from the API, merging the two sources, and conducting statistical analyses to investigate the correlation between internet access and GDP growth over time. Possible hurdles include misalignment between the CSV data and the API data, missing data for some countries, and the need to filter the API data, which is very large.
Packages and Libraries
library(httr2)
library(xml2)
Attaching package: 'xml2'
The following object is masked from 'package:httr2':
url_parse
library(tidyverse)
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ xml2::url_parse() masks httr2::url_parse()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tmap) # used for creating maps
library(sf) # used for tmap coordinates
Linking to GEOS 3.14.1, GDAL 3.12.1, PROJ 9.7.1; sf_use_s2() is TRUE
library(rnaturalearth) #provides map vectors for creating tmaps
Fetching the GDP Data from the World Bank API
req <- request("https://api.worldbank.org/v2/country/all/indicator/NY.GDP.MKTP.KD.ZG") |>
  req_url_query(date = "1995:2023", per_page = 32500)
resp <- req |> req_perform()
xml_data <- resp |> resp_body_xml()
records <- xml_find_all(xml_data, "//*[local-name()='data' and *[local-name()='indicator']]")
GDP_growth_df <- tibble(
  country = records |> xml_find_first(".//*[local-name()='country']") |> xml_text(),
  country_iso = records |> xml_find_first(".//*[local-name()='countryiso3code']") |> xml_text(),
  date = records |> xml_find_first(".//*[local-name()='date']") |> xml_integer(),
  value = records |> xml_find_first(".//*[local-name()='value']") |> xml_double(),
  indicator = records |> xml_find_first(".//*[local-name()='indicator']") |> xml_text()
)
head(GDP_growth_df)
# A tibble: 6 × 5
country country_iso date value indicator
<chr> <chr> <int> <dbl> <chr>
1 Africa Eastern and Southern AFE 2023 1.93 GDP growth (annual %)
2 Africa Eastern and Southern AFE 2022 3.72 GDP growth (annual %)
3 Africa Eastern and Southern AFE 2021 4.58 GDP growth (annual %)
4 Africa Eastern and Southern AFE 2020 -2.82 GDP growth (annual %)
5 Africa Eastern and Southern AFE 2019 2.03 GDP growth (annual %)
6 Africa Eastern and Southern AFE 2018 2.71 GDP growth (annual %)
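The same parsing steps are repeated for each indicator we fetch, so they could be wrapped in a small helper. The sketch below is only an illustration: `parse_wb_records()` is a hypothetical name, and the sample XML is hand-written to mimic the layout that the XPath expressions above rely on (the `wb` prefix and the second, empty-valued record are assumptions made up for the example). It runs offline, so no API call is needed to try it.

```r
library(xml2)
library(tibble)

# Hypothetical helper: convert a World Bank style XML document into a tibble.
parse_wb_records <- function(xml_doc) {
  # Select only the record-level <data> nodes (those with an <indicator> child)
  records <- xml_find_all(
    xml_doc,
    "//*[local-name()='data' and *[local-name()='indicator']]"
  )
  # Look up one child element per record, ignoring the namespace prefix
  field <- function(name) {
    xml_find_first(records, paste0(".//*[local-name()='", name, "']"))
  }
  tibble(
    country     = xml_text(field("country")),
    country_iso = xml_text(field("countryiso3code")),
    date        = xml_integer(field("date")),
    value       = xml_double(field("value")),
    indicator   = xml_text(field("indicator"))
  )
}

# Hand-written sample (assumed layout); the second record has an empty value
sample_xml <- read_xml(paste0(
  '<wb:data xmlns:wb="http://www.worldbank.org">',
  '<wb:data>',
  '<wb:indicator id="NY.GDP.MKTP.KD.ZG">GDP growth (annual %)</wb:indicator>',
  '<wb:country id="ZH">Africa Eastern and Southern</wb:country>',
  '<wb:countryiso3code>AFE</wb:countryiso3code>',
  '<wb:date>2023</wb:date>',
  '<wb:value>1.93</wb:value>',
  '</wb:data>',
  '<wb:data>',
  '<wb:indicator id="NY.GDP.MKTP.KD.ZG">GDP growth (annual %)</wb:indicator>',
  '<wb:country id="XX">Exampleland</wb:country>',
  '<wb:countryiso3code>XXX</wb:countryiso3code>',
  '<wb:date>2023</wb:date>',
  '<wb:value />',
  '</wb:data>',
  '</wb:data>'
))

parsed <- parse_wb_records(sample_xml)
parsed
```

With a helper like this, each fetch would reduce to building the request and calling the parser on `resp_body_xml()`; a missing `<value>` comes back as `NA` rather than an error.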
Internet Access Data Cleaning/ Exploration
# read csv
internet_data <- read.csv("https://raw.githubusercontent.com/emilye5/607-final-project/refs/heads/main/share-of-individuals-using-the-internet.csv")
# clean dataset
internet_data_clean <- internet_data %>%
  # rename columns
  rename(
    country = Entity,
    country_iso = Code,
    date = Year,
    internet_usage_share = Share.of.the.population.using.the.Internet
  ) %>%
  # drop rows with missing values
  drop_na(internet_usage_share) %>%
  # keep only the years 1995 through 2023
  filter(date >= 1995, date < 2024)
# take a look at the cleaned dataset
glimpse(internet_data_clean)
# check how many distinct countries are included in the data
n_distinct(internet_data_clean$country)
[1] 222
Distribution of Internet Usage Over time
# boxplot showing the distribution of internet usage by year
internet_data_clean %>%
  ggplot(aes(x = factor(date), y = internet_usage_share)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Internet Usage Across Countries Over Time",
    x = "Year",
    y = "Internet Usage (%)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
To validate our internet usage data, we created the boxplot above, which shows internet usage rising year over year. The middle half of the data (the box) shifts toward higher internet usage percentages each year, with fewer outliers over time.
Merging Internet & GDP Data
final_data <- internet_data_clean |>
  left_join(GDP_growth_df, by = c("country_iso", "date"))
head(final_data)
country.x country_iso date internet_usage_share country.y value
1 Afghanistan AFG 2001 0.00472257 Afghanistan -9.431974
2 Afghanistan AFG 2002 0.00456140 Afghanistan 28.600001
3 Afghanistan AFG 2003 0.08789130 Afghanistan 8.832278
4 Afghanistan AFG 2004 0.10580900 Afghanistan 1.414118
5 Afghanistan AFG 2005 1.22415000 Afghanistan 11.229715
6 Afghanistan AFG 2006 2.10712000 Afghanistan 5.357403
indicator
1 GDP growth (annual %)
2 GDP growth (annual %)
3 GDP growth (annual %)
4 GDP growth (annual %)
5 GDP growth (annual %)
6 GDP growth (annual %)
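Because misalignment between the CSV and the API data was one of our expected hurdles, it can help to list the rows that fail to match in either direction. The following is a minimal sketch on toy data, not the check we actually ran: the two Afghanistan rows reuse values printed above, while the `XKX` row and the `AFE` growth value are invented purely to produce one unmatched row on each side.

```r
library(dplyr)

# Toy stand-ins for internet_data_clean and GDP_growth_df
internet_toy <- data.frame(
  country_iso = c("AFG", "AFG", "XKX"),          # "XKX" is a made-up unmatched code
  date = c(2001L, 2002L, 2001L),
  internet_usage_share = c(0.00472257, 0.00456140, 1.2)
)
gdp_toy <- data.frame(
  country_iso = c("AFG", "AFG", "AFE"),          # "AFE" is a regional aggregate
  date = c(2001L, 2002L, 2001L),
  value = c(-9.431974, 28.600001, 3.1)           # the AFE value is illustrative
)

# Internet rows that would receive NA for GDP growth after the left join
unmatched_internet <- anti_join(internet_toy, gdp_toy, by = c("country_iso", "date"))

# GDP rows that never appear in the internet data (e.g., regional aggregates)
unmatched_gdp <- anti_join(gdp_toy, internet_toy, by = c("country_iso", "date"))

unmatched_internet
unmatched_gdp
```

Running the same two `anti_join()` calls on the real data frames would show which ISO codes or years are dropped or left with missing growth values by the merge.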
Fetching GDP Data from 1995 & 2023
1995:
req <- request("https://api.worldbank.org/v2/country/all/indicator/NY.GDP.MKTP.CD") |>
  req_url_query(date = "1995", per_page = 32500)
resp <- req |> req_perform()
xml_data <- resp |> resp_body_xml()
records <- xml_find_all(xml_data, "//*[local-name()='data' and *[local-name()='indicator']]")
GDP_1995_df <- tibble(
  country = records |> xml_find_first(".//*[local-name()='country']") |> xml_text(),
  country_iso = records |> xml_find_first(".//*[local-name()='countryiso3code']") |> xml_text(),
  date = records |> xml_find_first(".//*[local-name()='date']") |> xml_integer(),
  value_1995 = records |> xml_find_first(".//*[local-name()='value']") |> xml_double(),
  indicator = records |> xml_find_first(".//*[local-name()='indicator']") |> xml_text()
)
print(GDP_1995_df)
# A tibble: 266 × 5
country country_iso date value_1995 indicator
<chr> <chr> <int> <dbl> <chr>
1 Africa Eastern and Southern AFE 1995 2.73e11 GDP (cur…
2 Africa Western and Central AFW 1995 2.08e11 GDP (cur…
3 Arab World ARB 1995 5.28e11 GDP (cur…
4 Caribbean small states CSS 1995 1.57e10 GDP (cur…
5 Central Europe and the Baltics CEB 1995 3.95e11 GDP (cur…
6 Early-demographic dividend EAR 1995 2.62e12 GDP (cur…
7 East Asia & Pacific EAS 1995 8.43e12 GDP (cur…
8 East Asia & Pacific (excluding high i… EAP 1995 1.33e12 GDP (cur…
9 East Asia & Pacific (IDA & IBRD count… TEA 1995 1.32e12 GDP (cur…
10 Euro area EMU 1995 7.54e12 GDP (cur…
# ℹ 256 more rows
# A tibble: 217 × 5
country country_iso date value_1995 indicator
<chr> <chr> <int> <dbl> <chr>
1 Afghanistan AFG 1995 NA GDP (current US$)
2 Albania ALB 1995 2905092799. GDP (current US$)
3 Algeria DZA 1995 41764291672. GDP (current US$)
4 American Samoa ASM 1995 NA GDP (current US$)
5 Andorra AND 1995 1178745283. GDP (current US$)
6 Angola AGO 1995 5538749260. GDP (current US$)
7 Antigua and Barbuda ATG 1995 616051852. GDP (current US$)
8 Argentina ARG 1995 258031750000 GDP (current US$)
9 Armenia ARM 1995 1468317435. GDP (current US$)
10 Aruba ABW 1995 1320670391. GDP (current US$)
# ℹ 207 more rows
2023:
req <- request("https://api.worldbank.org/v2/country/all/indicator/NY.GDP.MKTP.CD") |>
  req_url_query(date = "2023", per_page = 32500)
resp <- req |> req_perform()
xml_data <- resp |> resp_body_xml()
records <- xml_find_all(xml_data, "//*[local-name()='data' and *[local-name()='indicator']]")
GDP_2023_df <- tibble(
  country = records |> xml_find_first(".//*[local-name()='country']") |> xml_text(),
  country_iso = records |> xml_find_first(".//*[local-name()='countryiso3code']") |> xml_text(),
  date = records |> xml_find_first(".//*[local-name()='date']") |> xml_integer(),
  value_2023 = records |> xml_find_first(".//*[local-name()='value']") |> xml_double(),
  indicator = records |> xml_find_first(".//*[local-name()='indicator']") |> xml_text()
)
print(GDP_2023_df)
# A tibble: 266 × 5
country country_iso date value_2023 indicator
<chr> <chr> <int> <dbl> <chr>
1 Africa Eastern and Southern AFE 2023 1.18e12 GDP (cur…
2 Africa Western and Central AFW 2023 9.38e11 GDP (cur…
3 Arab World ARB 2023 3.62e12 GDP (cur…
4 Caribbean small states CSS 2023 7.95e10 GDP (cur…
5 Central Europe and the Baltics CEB 2023 2.27e12 GDP (cur…
6 Early-demographic dividend EAR 2023 1.49e13 GDP (cur…
7 East Asia & Pacific EAS 2023 3.14e13 GDP (cur…
8 East Asia & Pacific (excluding high i… EAP 2023 2.16e13 GDP (cur…
9 East Asia & Pacific (IDA & IBRD count… TEA 2023 2.16e13 GDP (cur…
10 Euro area EMU 2023 1.59e13 GDP (cur…
# ℹ 256 more rows
# A tibble: 217 × 5
country country_iso date value_2023 indicator
<chr> <chr> <int> <dbl> <chr>
1 Afghanistan AFG 2023 17152234637. GDP (current US$)
2 Albania ALB 2023 23491242727. GDP (current US$)
3 Algeria DZA 2023 247923887215. GDP (current US$)
4 American Samoa ASM 2023 NA GDP (current US$)
5 Andorra AND 2023 3785065525. GDP (current US$)
6 Angola AGO 2023 107167747140. GDP (current US$)
7 Antigua and Barbuda ATG 2023 2005785185. GDP (current US$)
8 Argentina ARG 2023 649461687959. GDP (current US$)
9 Armenia ARM 2023 24185982216. GDP (current US$)
10 Aruba ABW 2023 3834729616. GDP (current US$)
# ℹ 207 more rows
Changed_GDP <- GDP_1995_Remove_Regions_df |>
  left_join(GDP_2023_Remove_Regions_df, by = c("country_iso"))
print(Changed_GDP)
# A tibble: 217 × 9
country.x country_iso date.x value_1995 indicator.x country.y date.y
<chr> <chr> <int> <dbl> <chr> <chr> <int>
1 Afghanistan AFG 1995 NA GDP (curre… Afghanis… 2023
2 Albania ALB 1995 2.91e 9 GDP (curre… Albania 2023
3 Algeria DZA 1995 4.18e10 GDP (curre… Algeria 2023
4 American Samoa ASM 1995 NA GDP (curre… American… 2023
5 Andorra AND 1995 1.18e 9 GDP (curre… Andorra 2023
6 Angola AGO 1995 5.54e 9 GDP (curre… Angola 2023
7 Antigua and Barbu… ATG 1995 6.16e 8 GDP (curre… Antigua … 2023
8 Argentina ARG 1995 2.58e11 GDP (curre… Argentina 2023
9 Armenia ARM 1995 1.47e 9 GDP (curre… Armenia 2023
10 Aruba ABW 1995 1.32e 9 GDP (curre… Aruba 2023
# ℹ 207 more rows
# ℹ 2 more variables: value_2023 <dbl>, indicator.y <chr>
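We removed the first 49 observations from each GDP data frame (266 rows down to 217) because the World Bank API lists regional aggregates such as "Arab World" and "Euro area" ahead of the individual countries, and these regions are not part of our internet usage data.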
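The removal of regional aggregates can be sketched as follows. This is a hedged illustration on a four-row toy table (values copied from the 1995 output above), not the exact code we ran: `n_aggregates` and `internet_iso` are hypothetical names, and the ISO-based filter is an alternative we show for comparison because it does not depend on the order in which the API returns rows.

```r
library(dplyr)

# Toy stand-in for GDP_1995_df: two regional aggregates followed by two countries
gdp_toy <- data.frame(
  country = c("Africa Eastern and Southern", "Arab World", "Afghanistan", "Albania"),
  country_iso = c("AFE", "ARB", "AFG", "ALB"),
  value_1995 = c(2.73e11, 5.28e11, NA, 2905092799)
)

# Positional removal: drop the leading aggregate rows
# (49 in the real data; 2 in this toy example)
n_aggregates <- 2
by_position <- gdp_toy |> slice(-seq_len(n_aggregates))

# Alternative (assumption): keep only ISO codes that occur in the internet data
internet_iso <- c("AFG", "ALB")
by_iso <- gdp_toy |> filter(country_iso %in% internet_iso)

by_position
by_iso
```

Both approaches give the same rows here. The positional version is shorter, while the ISO-based version would keep working if the API changed the number or position of aggregate rows.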
# A tibble: 20 × 11
country.x country_iso date.x value_1995 indicator.x country.y date.y
<chr> <chr> <int> <dbl> <chr> <chr> <int>
1 United States USA 1995 7.64e12 GDP (curre… United S… 2023
2 China CHN 1995 7.38e11 GDP (curre… China 2023
3 India IND 1995 3.60e11 GDP (curre… India 2023
4 United Kingdom GBR 1995 1.35e12 GDP (curre… United K… 2023
5 Germany DEU 1995 2.59e12 GDP (curre… Germany 2023
6 Russian Federation RUS 1995 3.96e11 GDP (curre… Russian … 2023
7 Canada CAN 1995 6.06e11 GDP (curre… Canada 2023
8 France FRA 1995 1.60e12 GDP (curre… France 2023
9 Brazil BRA 1995 7.69e11 GDP (curre… Brazil 2023
10 Mexico MEX 1995 3.80e11 GDP (curre… Mexico 2023
11 Australia AUS 1995 3.69e11 GDP (curre… Australia 2023
12 Korea, Rep. KOR 1995 5.86e11 GDP (curre… Korea, R… 2023
13 Indonesia IDN 1995 2.02e11 GDP (curre… Indonesia 2023
14 Italy ITA 1995 1.18e12 GDP (curre… Italy 2023
15 Saudi Arabia SAU 1995 1.43e11 GDP (curre… Saudi Ar… 2023
16 Spain ESP 1995 6.14e11 GDP (curre… Spain 2023
17 Turkiye TUR 1995 2.35e11 GDP (curre… Turkiye 2023
18 Netherlands NLD 1995 4.53e11 GDP (curre… Netherla… 2023
19 Poland POL 1995 1.43e11 GDP (curre… Poland 2023
20 Switzerland CHE 1995 3.53e11 GDP (curre… Switzerl… 2023
# ℹ 4 more variables: value_2023 <dbl>, indicator.y <chr>, change_in_gdp <dbl>,
# average_change <dbl>
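The table above contains two derived columns, `change_in_gdp` and `average_change`, whose code is not shown. A minimal sketch of how such columns could be computed is given below, under the assumption that `average_change` is the total 1995 to 2023 change divided by the 28 years between the two observations; the divisor and the name `n_years` are our assumptions for the illustration. The three rows reuse GDP values printed in the 1995 and 2023 tables.

```r
library(dplyr)

# Values copied from the 1995 and 2023 tables above (current US$)
gdp_toy <- data.frame(
  country_iso = c("ALB", "DZA", "ARG"),
  value_1995 = c(2905092799, 41764291672, 258031750000),
  value_2023 = c(23491242727, 247923887215, 649461687959)
)

# Assumed divisor: number of years between the two observations
n_years <- 2023 - 1995

ranked <- gdp_toy |>
  mutate(
    change_in_gdp = value_2023 - value_1995,     # total change in USD
    average_change = change_in_gdp / n_years     # average change per year in USD
  ) |>
  arrange(desc(average_change)) |>
  slice_head(n = 20)                             # keeps all rows if fewer than 20

ranked
```

Note that this measures absolute growth in dollars, so it favors economies that were already large in 1995; it is a different quantity from the annual GDP growth rate used in the correlation analysis.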
top_20_avg_gdp_growth_clean <- top_20_avg_gdp_growth_df %>%
  mutate(
    "Average Annual Change in USD (billions)" = average_change / 1000000000
  ) %>%
  rename(Country = country.x) %>%
  # columns cleaned to remove date and indicator (always the same)
  select(
    -country.y, -indicator.x, -indicator.y, -value_1995, -value_2023,
    -change_in_gdp, -average_change, -date.x, -date.y
  )
top_20_avg_gdp_growth_clean
# A tibble: 20 × 3
Country country_iso `Average Annual Change in USD (billions)`
<chr> <chr> <dbl>
1 United States USA 702.
2 China CHN 626.
3 India IND 117.
4 United Kingdom GBR 74.0
5 Germany DEU 70.3
6 Russian Federation RUS 59.9
7 Canada CAN 56.0
8 France FRA 52.2
9 Brazil BRA 50.8
10 Mexico MEX 50.6
11 Australia AUS 48.8
12 Korea, Rep. KOR 44.9
13 Indonesia IDN 41.8
14 Italy ITA 40.7
15 Saudi Arabia SAU 38.4
16 Spain ESP 35.9
17 Turkiye TUR 32.4
18 Netherlands NLD 24.4
19 Poland POL 23.9
20 Switzerland CHE 19.3
The Pearson correlation coefficient is -0.159, which indicates a very weak inverse linear relationship between internet usage and annual GDP growth. Moreover, the effect size is small (r² ≈ 0.025, so a linear fit on internet usage accounts for only about 2.5% of the variance in GDP growth), suggesting the relationship is not practically meaningful.
Countries with the top 10 positive & top 10 negative correlations between internet usage and GDP growth
# Top 10 positive correlations
cat("\n--- Top 10 Positive Correlations ---\n")
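For reference, a pooled Pearson coefficient of this kind can be computed with base R's `cor()`. The sketch below uses a small toy data frame (the numbers are illustrative and are not taken from our data) with the same column names as the merged data, and drops incomplete pairs with `use = "complete.obs"`. It also squares the reported coefficient to express it as a share of explained variance.

```r
# Toy stand-in for the merged data; values are illustrative only
toy <- data.frame(
  internet_usage_share = c(1, 5, 20, 40, 60, NA, 85),
  value = c(6.0, 7.5, 4.0, 3.5, NA, 2.0, 1.5)
)

# Pearson correlation over the rows where both variables are present
overall_cor_toy <- cor(
  toy$internet_usage_share,
  toy$value,
  method = "pearson",
  use = "complete.obs"
)

# Squaring r gives the share of variance a linear fit would account for
reported_r <- -0.159
reported_r2 <- reported_r^2

overall_cor_toy
reported_r2
```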
--- Top 10 Positive Correlations ---
print(head(country_correlations, 10))
# A tibble: 10 × 4
country country_iso correlation n_years
<chr> <chr> <dbl> <int>
1 Guyana GUY 0.725 28
2 Guinea GIN 0.433 29
3 Benin BEN 0.397 28
4 Kenya KEN 0.379 26
5 Cayman Islands CYM 0.354 11
6 Cote d'Ivoire CIV 0.352 29
7 Bangladesh BGD 0.341 27
8 Democratic Republic of Congo COD 0.248 28
9 Senegal SEN 0.193 29
10 Zimbabwe ZWE 0.192 29
# Top 10 negative correlations
cat("\n--- Top 10 Negative Correlations ---\n")
--- Top 10 Negative Correlations ---
print(tail(country_correlations, 10))
# A tibble: 10 × 4
country country_iso correlation n_years
<chr> <chr> <dbl> <int>
1 Bermuda BMU -0.604 23
2 Angola AGO -0.629 28
3 Yemen YEM -0.637 22
4 Mozambique MOZ -0.644 28
5 Sri Lanka LKA -0.673 23
6 Trinidad and Tobago TTO -0.673 29
7 China CHN -0.689 29
8 Laos LAO -0.704 26
9 Sudan SDN -0.760 17
10 Myanmar MMR -0.803 24
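The `country_correlations` table printed above has one row per country, sorted from the most positive to the most negative correlation. A minimal sketch of how such a table could be built is shown below on toy data; the minimum of three observations per country (`min_years`) is a hypothetical threshold chosen for the example, not necessarily the one we used.

```r
library(dplyr)

# Toy stand-in for the merged data: two countries, four years each
toy <- data.frame(
  country = rep(c("A", "B"), each = 4),
  country_iso = rep(c("AAA", "BBB"), each = 4),
  internet_usage_share = c(1, 2, 3, 4, 1, 2, 3, 4),
  value = c(1, 2, 3, 5, 4, 3, 2, 0)
)

min_years <- 3  # hypothetical threshold to avoid correlations from very few points

country_cor_toy <- toy |>
  filter(!is.na(internet_usage_share), !is.na(value)) |>
  group_by(country, country_iso) |>
  summarise(
    correlation = cor(internet_usage_share, value),
    n_years = n(),
    .groups = "drop"
  ) |>
  filter(n_years >= min_years) |>
  arrange(desc(correlation))

country_cor_toy
```

With the table sorted this way, `head()` returns the most positive correlations and `tail()` the most negative ones, as in the output above.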
Scatter plot: internet usage vs GDP growth (all countries, all years)
# Scatter plot
correlation_data |>
  ggplot(aes(x = internet_usage_share, y = value)) +
  geom_point(alpha = 0.2, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(
    title = "Internet Usage vs GDP Growth (1995–2023)",
    subtitle = paste("Overall Pearson r =", round(overall_cor, 3)),
    x = "Internet Usage Share (%)",
    y = "GDP Growth Rate (%)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
To answer our main research question, we generated a scatter plot and computed the Pearson correlation to quantify how strongly internet usage and annual GDP growth move together linearly across all countries and years.
Correlation Between Internet Usage and GDP Growth by Year
yearly_correlation |>
  ggplot(aes(x = date, y = correlation)) +
  geom_line(color = "steelblue") +
  geom_point(color = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Yearly Correlation Between Internet Usage and GDP Growth",
    x = "Year",
    y = "Pearson Correlation"
  ) +
  theme_minimal()
The yearly correlation results show that the relationship between internet usage and GDP growth is not stable over time. Most years show weak or negative correlations, which suggests that higher internet usage is not consistently associated with higher GDP growth. The association is therefore inconsistent from year to year, and a correlation analysis of this kind cannot establish a causal relationship in any case.
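The `yearly_correlation` table plotted above holds one cross-country correlation per year. A minimal sketch of that calculation on toy data follows; the numbers are illustrative only, and the extra `n_countries` column is our addition so that years with few observations are easy to spot.

```r
library(dplyr)

# Toy stand-in for the merged data: four countries observed in two years
toy <- data.frame(
  date = rep(c(2000L, 2010L), each = 4),
  internet_usage_share = c(1, 5, 10, 20, 30, 50, 70, 90),
  value = c(2, 3, 4, 6, 5, 4, 2, 1)
)

yearly_cor_toy <- toy |>
  group_by(date) |>
  summarise(
    # correlation across countries within a single year
    correlation = cor(internet_usage_share, value, use = "complete.obs"),
    n_countries = n()
  )

yearly_cor_toy
```

In this toy example the sign flips from positive in 2000 to negative in 2010, which is the kind of instability the line chart is meant to reveal.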
Mapping Internet Usage Over Time
world <- ne_countries(scale = "medium", returnclass = "sf")
selected_years <- c(1995, 2005, 2015, 2023)
world_years <- world[rep(seq_len(nrow(world)), times = length(selected_years)), ]
world_years$date <- rep(selected_years, each = nrow(world))
map_internet_over_years <- world_years |>
  left_join(
    internet_data_clean |>
      filter(date %in% selected_years) |>
      select(country_iso, date, internet_usage_share),
    by = c("iso_a3" = "country_iso", "date" = "date")
  )
tmap_mode("plot")
ℹ tmap modes "plot" - "view"
ℹ toggle with `tmap::ttm()`
── tmap v3 code detected ───────────────────────────────────────────────────────
[v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
`tm_scale_intervals()`.
ℹ Migrate the argument(s) 'style', 'palette' (rename to 'values'), 'colorNA'
(rename to 'value.na'), 'textNA' (rename to 'label.na') to
'tm_scale_intervals(<HERE>)'
For small multiples, specify a 'tm_scale_' for each multiple, and put them in a
list: 'fill'.scale = list(<scale1>, <scale2>, ...)'
[v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
[v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
[tip] Consider a suitable map projection, e.g. by adding `+ tm_crs("auto")`.
[cols4all] color palettes: use palettes from the R package cols4all. Run
`cols4all::c4a_gui()` to explore them. The old palette name "Blues" is named
"brewer.blues"
These maps show the global spread of internet usage from 1995 to 2023. Internet usage was limited in many countries in 1995, but by 2023, most regions show much higher levels of internet access.
tmap_options(component.autoscale = FALSE)
tm_shape(map_data_over_years) +
  tm_polygons(
    "internet_gdp_group",
    title = "Internet Access and GDP Growth",
    colorNA = "brown",
    textNA = "No data"
  ) +
  tm_facets(by = "date", ncol = 2) +
  tm_layout(
    main.title = "Internet Access and GDP Growth Over Time",
    main.title.size = 1.2,
    main.title.position = "center",
    panel.label.size = 1.1,
    legend.outside = TRUE,
    legend.outside.position = "right",
    legend.title.size = 0.9,
    legend.text.size = 0.8,
    inner.margins = c(0.01, 0.01, 0.01, 0.01),
    outer.margins = c(0.02, 0.02, 0.02, 0.02),
    frame = FALSE
  )
── tmap v3 code detected ───────────────────────────────────────────────────────
[v3->v4] `tm_tm_polygons()`: migrate the argument(s) related to the scale of
the visual variable `fill` namely 'colorNA' (rename to 'value.na'), 'textNA'
(rename to 'label.na') to fill.scale = tm_scale(<HERE>).
ℹ For small multiples, specify a 'tm_scale_' for each multiple, and put them in
a list: 'fill.scale = list(<scale1>, <scale2>, ...)'
[v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
[v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
These maps compare countries by whether they are above or below the yearly median for internet usage and for GDP growth. They show that the relationship between internet access and GDP growth varies between countries and years, rather than following one clear global pattern.
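The grouping variable `internet_gdp_group` used in the maps can be derived by comparing each country with the median of its year. The sketch below shows one way to do this on toy data; the group labels and the helper columns `high_internet` and `high_growth` are hypothetical names chosen for the illustration, and the numbers are made up.

```r
library(dplyr)

# Toy stand-in for the merged data: four countries in two years
toy <- data.frame(
  country_iso = rep(c("AAA", "BBB", "CCC", "DDD"), times = 2),
  date = rep(c(1995L, 2023L), each = 4),
  internet_usage_share = c(0.1, 0.5, 2, 5, 40, 60, 80, 95),
  value = c(1, 6, 2, 7, 5, 1, 6, 2)
)

grouped <- toy |>
  group_by(date) |>
  mutate(
    # compare each country with the median of the same year
    high_internet = internet_usage_share > median(internet_usage_share, na.rm = TRUE),
    high_growth = value > median(value, na.rm = TRUE),
    internet_gdp_group = case_when(
      high_internet & high_growth ~ "High internet, high growth",
      high_internet & !high_growth ~ "High internet, low growth",
      !high_internet & high_growth ~ "Low internet, high growth",
      !high_internet & !high_growth ~ "Low internet, low growth"
    )
  ) |>
  ungroup()

grouped
```

Rows where either variable is missing fall through every `case_when()` branch and receive `NA`, which the map then displays as "No data".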
Conclusion
Our findings suggest a weak inverse linear relationship between internet usage and GDP growth. The overall Pearson correlation is -0.159, which is close to 0 and indicates only a weak association. The data therefore do not show a strong relationship between countries’ internet access and their GDP growth: internet usage, by itself, explains very little of the variation in a country’s GDP growth. This correlation analysis does not provide evidence that higher internet access is clearly associated with higher GDP growth.