Research question. Do countries with higher internet access (individuals using the internet, % of population) have higher GDP per capita (PPP)?
Dataset. I analyze a 2019 country-level cross-section built by merging two World Bank World Development Indicators (WDI) series, GDP per capita, PPP (NY.GDP.PCAP.PP.KD) and Individuals using the Internet (% of population) (IT.NET.USER.ZS), and saved locally as a CSV for this project.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# 1) Read the two files
gdp <- read.csv("WB_WDI_NY_GDP_PCAP_PP_KD.csv", stringsAsFactors = FALSE)
net <- read.csv("WB_WDI_IT_NET_USER_ZS.csv", stringsAsFactors = FALSE)
# 2) Keep 2019 rows and needed columns, then rename
gdp2019 <- gdp[gdp$TIME_PERIOD == 2019, c("REF_AREA","REF_AREA_LABEL","TIME_PERIOD","OBS_VALUE")]
names(gdp2019) <- c("iso3c","country","year","gdp")
net2019 <- net[net$TIME_PERIOD == 2019, c("REF_AREA","REF_AREA_LABEL","TIME_PERIOD","OBS_VALUE")]
names(net2019) <- c("iso3c","country","year","internet")
# 3) Merge and save
combined <- merge(gdp2019, net2019, by = c("iso3c","country","year"), all = TRUE)
write.csv(combined, "wdi_2019_internet_gdp.csv", row.names = FALSE)
# quick check
dim(combined); head(combined, 8)
## [1] 252 5
## iso3c country year gdp internet
## 1 ABW Aruba 2019 38435.427 NA
## 2 AFE Africa Eastern and Southern 2019 4073.654 22.4000
## 3 AFG Afghanistan 2019 2927.245 17.6000
## 4 AFW Africa Western and Central 2019 4822.310 28.7000
## 5 AGO Angola 2019 8274.543 32.1294
## 6 ALB Albania 2019 15079.374 68.5504
## 7 AND Andorra 2019 63215.900 90.7187
## 8 ARB Arab World 2019 16697.636 NA
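The preview shows regional aggregates (AFE, AFW, ARB) alongside individual countries, as well as rows with missing values. A minimal sketch of how one might drop aggregates and count missing values before the analysis follows. The rows are copied from the preview above. The vector aggregate_codes is a hypothetical, incomplete list holding only the aggregate codes visible in that preview, so a full analysis would need a complete list from the World Bank's country metadata.

```r
# Rows copied from the head(combined, 8) output above.
combined_sample <- data.frame(
  iso3c = c("ABW", "AFE", "AFG", "AFW", "AGO", "ALB", "AND", "ARB"),
  country = c("Aruba", "Africa Eastern and Southern", "Afghanistan",
              "Africa Western and Central", "Angola", "Albania",
              "Andorra", "Arab World"),
  gdp = c(38435.427, 4073.654, 2927.245, 4822.310,
          8274.543, 15079.374, 63215.900, 16697.636),
  internet = c(NA, 22.4000, 17.6000, 28.7000,
               32.1294, 68.5504, 90.7187, NA),
  stringsAsFactors = FALSE
)

# Assumption: hypothetical and incomplete list of aggregate codes,
# limited to the ones visible in the preview.
aggregate_codes <- c("AFE", "AFW", "ARB")

# Keep only rows whose code is not a known aggregate.
countries_only <- combined_sample[!combined_sample$iso3c %in% aggregate_codes, ]

# Count missing values per indicator among the remaining rows.
missing_counts <- colSums(is.na(countries_only[, c("gdp", "internet")]))

print(countries_only$iso3c)
print(missing_counts)
```

Filtering by code rather than by name avoids depending on the exact spelling of the aggregate labels.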
Data Analysis
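I first import the merged CSV. Next, I clean it by selecting and renaming the needed columns, removing rows with a missing value in either indicator, and creating log_gdp_pcap_ppp along with two flag columns, above_mean_internet and is_max_gdp_country. I then show the dimensions and a preview of the cleaned data to make the result explicit. Regional aggregates such as "Africa Eastern and Southern" are not removed, so the 214 remaining rows are not all individual countries. For EDA, I report summary statistics for internet_users_pct, gdp_pcap_ppp, and the log transform; then I produce a histogram of internet access to show its distribution and a scatterplot of internet access vs. GDP per capita.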
raw <- read.csv("wdi_2019_internet_gdp.csv", stringsAsFactors = FALSE)
# Keep only the needed columns, drop rows with NAs, add derived columns
wdi19 <- raw |>
  select(
    iso3c, country, year,
    gdp_pcap_ppp = gdp,
    internet_users_pct = internet
  ) |>
  filter(!is.na(gdp_pcap_ppp), !is.na(internet_users_pct)) |>
  mutate(
    log_gdp_pcap_ppp = log(gdp_pcap_ppp),
    above_mean_internet = internet_users_pct > mean(internet_users_pct, na.rm = TRUE),
    is_max_gdp_country = gdp_pcap_ppp == max(gdp_pcap_ppp, na.rm = TRUE)
  )
# Summaries (summary + mean + max explicitly shown)
dim(wdi19)
## [1] 214 8
head(wdi19, 10)
## iso3c country year gdp_pcap_ppp internet_users_pct
## 1 AFE Africa Eastern and Southern 2019 4073.654 22.4000
## 2 AFG Afghanistan 2019 2927.245 17.6000
## 3 AFW Africa Western and Central 2019 4822.310 28.7000
## 4 AGO Angola 2019 8274.543 32.1294
## 5 ALB Albania 2019 15079.374 68.5504
## 6 AND Andorra 2019 63215.900 90.7187
## 7 ARE United Arab Emirates 2019 68887.845 99.1500
## 8 ARG Argentina 2019 26629.553 79.9470
## 9 ARM Armenia 2019 16215.361 66.5439
## 10 ATG Antigua and Barbuda 2019 29651.864 73.9792
## log_gdp_pcap_ppp above_mean_internet is_max_gdp_country
## 1 8.312296 FALSE FALSE
## 2 7.981817 FALSE FALSE
## 3 8.481008 FALSE FALSE
## 4 9.020939 FALSE FALSE
## 5 9.621083 TRUE FALSE
## 6 11.054311 TRUE FALSE
## 7 11.140235 TRUE FALSE
## 8 10.189777 TRUE FALSE
## 9 9.693714 TRUE FALSE
## 10 10.297280 TRUE FALSE
summary(wdi19[, c("gdp_pcap_ppp", "internet_users_pct", "log_gdp_pcap_ppp")])
## gdp_pcap_ppp internet_users_pct log_gdp_pcap_ppp
## Min. : 855.7 Min. : 6.10 Min. : 6.752
## 1st Qu.: 5830.7 1st Qu.:36.35 1st Qu.: 8.671
## Median : 15719.5 Median :63.85 Median : 9.663
## Mean : 24055.3 Mean :58.67 Mean : 9.555
## 3rd Qu.: 34208.0 3rd Qu.:81.04 3rd Qu.:10.440
## Max. :133549.2 Max. :99.70 Max. :11.802
mean_internet <- mean(wdi19$internet_users_pct, na.rm = TRUE)
max_gdp <- max(wdi19$gdp_pcap_ppp, na.rm = TRUE)
mean_internet
## [1] 58.66679
max_gdp
## [1] 133549.2
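The above_mean_internet flag is created during cleaning but not summarized. One way it could be used, sketched here on the ten rows shown in the preview only, is to compare median GDP per capita between the two groups. The data frame sample10 and the constant mean_internet_full are illustrative names. The constant is the full-sample mean printed above, so the flag matches the report. The resulting numbers describe this ten-row sample, not the full data.

```r
# First ten rows of the cleaned data, copied from the head(wdi19, 10) output.
sample10 <- data.frame(
  iso3c = c("AFE", "AFG", "AFW", "AGO", "ALB", "AND", "ARE", "ARG", "ARM", "ATG"),
  gdp_pcap_ppp = c(4073.654, 2927.245, 4822.310, 8274.543, 15079.374,
                   63215.900, 68887.845, 26629.553, 16215.361, 29651.864),
  internet_users_pct = c(22.4000, 17.6000, 28.7000, 32.1294, 68.5504,
                         90.7187, 99.1500, 79.9470, 66.5439, 73.9792),
  stringsAsFactors = FALSE
)

# Full-sample mean of internet use, taken from the output above.
mean_internet_full <- 58.66679

# Recreate the flag against the full-sample mean.
sample10$above_mean_internet <- sample10$internet_users_pct > mean_internet_full

# Median GDP per capita within each group (names are "FALSE" and "TRUE").
median_gdp_by_group <- tapply(
  sample10$gdp_pcap_ppp,
  sample10$above_mean_internet,
  median
)

# Code of the row with the highest GDP per capita within this sample only.
top_in_sample <- sample10$iso3c[which.max(sample10$gdp_pcap_ppp)]

print(median_gdp_by_group)
print(top_in_sample)
```

Medians are used instead of means because GDP per capita is strongly right-skewed in the summary table.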
hist(
  wdi19$internet_users_pct,
  breaks = 30,
  main = "Distribution of Internet Use (% of Population), 2019",
  xlab = "Internet users (% of population)",
  ylab = "Number of countries"
)
plot(
  wdi19$internet_users_pct,
  wdi19$gdp_pcap_ppp,
  xlab = "Internet users (% of population)",
  ylab = "GDP per capita, PPP (2017 intl $)",
  main = "Internet Access vs. GDP per Capita (PPP), 2019",
  pch = 19, cex = 0.7
)
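The scatterplot uses GDP per capita in levels, while the log transform created earlier is not put to use. A possible way to quantify the association, sketched on the ten preview rows only, is a Spearman rank correlation together with a simple linear fit of log GDP per capita on internet use. The names sample10, spearman_rho, and slope are illustrative, and the values apply to this small sample rather than to the full data set.

```r
# First ten rows of the cleaned data, copied from the head(wdi19, 10) output.
sample10 <- data.frame(
  iso3c = c("AFE", "AFG", "AFW", "AGO", "ALB", "AND", "ARE", "ARG", "ARM", "ATG"),
  gdp_pcap_ppp = c(4073.654, 2927.245, 4822.310, 8274.543, 15079.374,
                   63215.900, 68887.845, 26629.553, 16215.361, 29651.864),
  internet_users_pct = c(22.4000, 17.6000, 28.7000, 32.1294, 68.5504,
                         90.7187, 99.1500, 79.9470, 66.5439, 73.9792),
  stringsAsFactors = FALSE
)

# Natural log of GDP per capita, as in the cleaning step.
sample10$log_gdp_pcap_ppp <- log(sample10$gdp_pcap_ppp)

# Rank-based correlation: unaffected by the skew in GDP per capita.
spearman_rho <- cor(
  sample10$internet_users_pct,
  sample10$log_gdp_pcap_ppp,
  method = "spearman"
)

# Linear fit of log GDP per capita on internet use (descriptive only).
fit <- lm(log_gdp_pcap_ppp ~ internet_users_pct, data = sample10)
slope <- unname(coef(fit)["internet_users_pct"])

print(spearman_rho)
print(slope)
```

With a log outcome, the slope reads approximately as the proportional difference in GDP per capita associated with one additional percentage point of internet use; it remains a descriptive measure, not a causal estimate.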
Conclusion
Key findings. In 2019, countries with higher shares of internet users generally show higher GDP per capita (PPP). The histogram indicates wide dispersion in connectivity across countries, while the scatterplot reveals a visible positive association. Income-group summaries (not shown in this report) align with this pattern: higher-income groups tend to have both greater internet penetration and higher average GDP per capita. These results are descriptive, not causal.
Implications and next steps. The pattern is consistent with the idea that digital connectivity and economic prosperity move together.
References
World Bank. (2019). World Development Indicators (WDI) [Data set; variables NY.GDP.PCAP.PP.KD and IT.NET.USER.ZS, 2019 cross-section]. The World Bank Group. DataBank: World Development Indicators.