The research question for this data analysis is: How do internet usage and literacy rate affect income inequality? The data set combines 3 data sets and is reported as 1706 x 5; the cleaned data set used for modeling is 1252 x 5. The variables in the final combined data set are country, year, literacyRate1950, giniCoefficient, and internetUsage. The analysis covers the years 2000 to 2023, with the literacy rate held fixed at its 1950 value to see how literacy in 1950 relates to today's income inequality. All 3 data sets come from Our World in Data. I chose this topic to see whether how literate people are and how much they use the internet have an impact on income inequality, or in other words, whether they give people a better opportunity to earn a good income.
This project performs a descriptive analysis together with an exploratory data analysis (EDA). The plot used to answer my research question is a scatter plot with a fitted linear regression line, showing the trend and relationship between internetUsage and giniCoefficient. I created 2 functions: search_country(), which displays the row of data for a specified country and year, and summarize_giniCoefficient_of_country(), which gives a statistical summary (mean, median, sd, min, max) of the giniCoefficient for a specified country.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 4.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.95 loaded
literacy_rate <- read.csv("cross-country-literacy-rates.csv")
income_inequality <- read.csv("economic-inequality-gini-index.csv")
internet_usage <- read.csv("share-of-individuals-using-the-internet.csv")
colnames(literacy_rate)
## [1] "Entity" "Code" "Year" "Literacy.rate"
colnames(income_inequality)
## [1] "Entity" "Code"
## [3] "Year" "Gini.coefficient..2021.prices."
## [5] "World.region.according.to.OWID"
colnames(internet_usage)
## [1] "Entity"
## [2] "Code"
## [3] "Year"
## [4] "Individuals.using.the.Internet....of.population."
literacy_rate <- literacy_rate |>
  rename("country" = "Entity",
         "year" = "Year",
         "literacyRate" = "Literacy.rate")
income_inequality <- income_inequality |>
  rename("country" = "Entity",
         "year" = "Year",
         "giniCoefficient" = "Gini.coefficient..2021.prices.")
internet_usage <- internet_usage |>
  rename("country" = "Entity",
         "year" = "Year",
         "internetUsage" = "Individuals.using.the.Internet....of.population.")
literacy_rate_1950 <- literacy_rate |>
  filter(
    year == 1950
  )
income_inequality <- income_inequality |>
  filter(
    year >= 2000, year <= 2023
  )
internet_usage <- internet_usage |>
  filter(
    year >= 2000, year <= 2023
  )
# I used inner_join and left_join referenced from dplyr.tidyverse.org as it helped me combine columns but keep intact the data that's needed from all data sets.
combined <- inner_join(income_inequality, internet_usage, by = c("country", "year"))
final_dataset <- left_join(combined, literacy_rate_1950,
                           by = "country")
final_dataset <- final_dataset |>
  select(country, year.x, literacyRate, giniCoefficient, internetUsage)
final_dataset <- final_dataset |>
  rename("year" = "year.x",
         "literacyRate1950" = "literacyRate")
# removes rows that don't have data present for literacyRate1950 and giniCoefficient
final_dataset <- final_dataset |>
  filter(!is.na(literacyRate1950), !is.na(giniCoefficient))
head(final_dataset)
## country year literacyRate1950 giniCoefficient internetUsage
## 1 Albania 2002 72.5 0.3173898 0.390081
## 2 Albania 2005 72.5 0.3059565 6.043890
## 3 Albania 2008 72.5 0.2998467 23.860000
## 4 Albania 2012 72.5 0.2896048 49.400000
## 5 Albania 2014 72.5 0.3459890 54.300000
## 6 Albania 2015 72.5 0.3275373 56.900000
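Because left_join() keeps every country-year row even when no 1950 literacy value exists, the NA filter is what removes those countries. The following is a minimal sketch, assuming small made-up data frames (toy_combined and toy_literacy_1950 are hypothetical stand-ins, not the Our World in Data files), of how anti_join() could list the countries that would be dropped:

```r
library(dplyr)

# Hypothetical toy data standing in for the real tables
toy_combined <- data.frame(
  country = c("A", "A", "B", "C"),
  year = c(2000, 2001, 2000, 2000),
  giniCoefficient = c(0.30, 0.31, 0.45, 0.40),
  internetUsage = c(5, 8, 1, 20)
)
toy_literacy_1950 <- data.frame(
  country = c("A", "C"),
  literacyRate = c(72.5, 96.5)
)

# Rows of toy_combined that have no match in the 1950 literacy table
unmatched <- anti_join(toy_combined, toy_literacy_1950, by = "country")

# Countries that a later !is.na(literacyRate1950) filter would remove
dropped_countries <- unique(unmatched$country)
dropped_countries
```

Running the same check on the real tables would show which countries leave the analysis because they lack a 1950 literacy value.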
# Function 1: get data information for a specific country based on year
search_country <- function(c, yyyy){
  # check the type first, then check that the value exists in the data
  if (!is.character(c) || !(c %in% final_dataset$country)) {
    cat("Error: Variable", c, "is not a character or not found.\n")
    return(NULL)
  }
  if (!is.numeric(yyyy) || !(yyyy %in% final_dataset$year)) {
    cat("Error: Variable", yyyy, "is not numeric or not found.\n")
    return(NULL)
  }
  # note: returns 0 rows if the year exists in the data but not for this country
  info <- final_dataset |>
    filter(country == c, year == yyyy)
  return(info)
}
# Function 2: summarize the giniCoefficient of a specified country
summarize_giniCoefficient_of_country <- function(c){
  # check the type first, then check that the country exists in the data
  if (!is.character(c) || !(c %in% final_dataset$country)) {
    cat("Error: Variable", c, "is not a character or not found.\n")
    return(NULL)
  }
  # summary columns are left unnamed, so the printed headers are the expressions
  info <- final_dataset |>
    filter(country == c) |>
    summarise(
      mean(giniCoefficient),
      median(giniCoefficient),
      sd(giniCoefficient),
      min(giniCoefficient),
      max(giniCoefficient)
    )
  return(info)
}
# performing EDA
dim(final_dataset)
## [1] 1252 5
str(final_dataset)
## 'data.frame': 1252 obs. of 5 variables:
## $ country : chr "Albania" "Albania" "Albania" "Albania" ...
## $ year : int 2002 2005 2008 2012 2014 2015 2016 2017 2018 2019 ...
## $ literacyRate1950: num 72.5 72.5 72.5 72.5 72.5 72.5 72.5 72.5 72.5 72.5 ...
## $ giniCoefficient : num 0.317 0.306 0.3 0.29 0.346 ...
## $ internetUsage : num 0.39 6.04 23.86 49.4 54.3 ...
head(final_dataset)
## country year literacyRate1950 giniCoefficient internetUsage
## 1 Albania 2002 72.5 0.3173898 0.390081
## 2 Albania 2005 72.5 0.3059565 6.043890
## 3 Albania 2008 72.5 0.2998467 23.860000
## 4 Albania 2012 72.5 0.2896048 49.400000
## 5 Albania 2014 72.5 0.3459890 54.300000
## 6 Albania 2015 72.5 0.3275373 56.900000
colSums(is.na(final_dataset))
## country year literacyRate1950 giniCoefficient
## 0 0 0 0
## internetUsage
## 0
# testing function 1 and 2
search_country("United States", 2001)
## country year literacyRate1950 giniCoefficient internetUsage
## 1 United States 2001 96.5 0.4059427 49.0808
summarize_giniCoefficient_of_country("United States")
## mean(giniCoefficient) median(giniCoefficient) sd(giniCoefficient)
## 1 0.4099162 0.4109227 0.006191934
## min(giniCoefficient) max(giniCoefficient)
## 1 0.3969645 0.4190184
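The summary columns above are headed by the expressions themselves because summarise() was called without names. As a hedged alternative, the sketch below names each statistic; the toy data frame and the column names such as mean_gini are hypothetical choices for illustration, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data; the real function filters final_dataset instead
toy <- data.frame(
  country = c("A", "A", "A", "B"),
  giniCoefficient = c(0.30, 0.32, 0.34, 0.50)
)

# Naming each statistic gives short, stable column names
toy_summary <- toy |>
  filter(country == "A") |>
  summarise(
    mean_gini = mean(giniCoefficient),
    median_gini = median(giniCoefficient),
    sd_gini = sd(giniCoefficient),
    min_gini = min(giniCoefficient),
    max_gini = max(giniCoefficient)
  )
toy_summary
```

Named columns are easier to reference later, for example toy_summary$mean_gini.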
ggplot(final_dataset, aes(x = internetUsage, y = giniCoefficient)) +
  geom_point(alpha = 0.5, color = "#3B7FD9") +
  geom_smooth(method = "lm", color = "#000000", se = FALSE) +
  labs(title = "Internet Usage vs. Income Inequality",
       x = "Internet Usage (% of Population)",
       y = "Income Inequality (Gini Coefficient)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The final model used for this project is a multiple linear regression model fit with lm().
multiple_model <- lm(giniCoefficient ~ internetUsage + literacyRate1950 + year, data = final_dataset)
summary(multiple_model)
##
## Call:
## lm(formula = giniCoefficient ~ internetUsage + literacyRate1950 +
## year, data = final_dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.18176 -0.04297 -0.01169 0.04609 0.23784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.333e+00 9.585e-01 -5.564 3.23e-08 ***
## internetUsage -1.821e-03 1.287e-04 -14.144 < 2e-16 ***
## literacyRate1950 -9.528e-06 1.045e-04 -0.091 0.927
## year 2.887e-03 4.770e-04 6.054 1.87e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0684 on 1248 degrees of freedom
## Multiple R-squared: 0.3392, Adjusted R-squared: 0.3376
## F-statistic: 213.5 on 3 and 1248 DF, p-value: < 2.2e-16
Interpretations:
The intercept is -5.333, the predicted giniCoefficient when all
predictors are 0. This is not practically meaningful, but mathematically
it is the y-intercept. Holding literacyRate1950 and year constant, the
internetUsage coefficient is -1.821e-3, meaning that as internet usage
increases by 1 percentage point, the predicted giniCoefficient decreases
by 1.821e-3. Holding internetUsage and year constant, the
literacyRate1950 coefficient is -9.528e-6, meaning that as
literacyRate1950 increases by 1 percentage point, the predicted
giniCoefficient decreases by 9.528e-6. Holding internetUsage and
literacyRate1950 constant, the coefficient of year is 2.887e-3, meaning
that as year increases by 1, the predicted giniCoefficient increases by
2.887e-3.
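To make the size of these effects concrete, here is a small worked example that scales the reported coefficients to a 10-unit change. The coefficient values are copied from the summary(multiple_model) output; the variable names are chosen only for this illustration.

```r
# Coefficients copied from the summary(multiple_model) output
b_internet <- -1.821e-03
b_year <- 2.887e-03

# Predicted change in giniCoefficient for a 10-percentage-point rise in
# internet usage, with the other predictors held fixed
delta_gini_internet <- b_internet * 10

# Predicted change in giniCoefficient over 10 years, with the other
# predictors held fixed
delta_gini_year <- b_year * 10

c(internet = delta_gini_internet, year = delta_gini_year)
```

On a Gini scale that runs from 0 to 1, both changes are a few hundredths, which gives a sense of how small the per-unit coefficients are.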
Looking at the p-values: the internetUsage p-value is < 2e-16, which is < 0.05, so it is statistically significant. The literacyRate1950 p-value is 0.927, which is > 0.05, so it is not statistically significant. The year p-value is 1.87e-9, which is < 0.05, so it is statistically significant.
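As a rough check on where these p-values come from, the sketch below recomputes the t statistics and two-sided p-values from the rounded estimates and standard errors in the summary output. Because the inputs are rounded, the results should only approximately match the printed table.

```r
# Estimates and standard errors copied (rounded) from the summary output
est <- c(internetUsage = -1.821e-03,
         literacyRate1950 = -9.528e-06,
         year = 2.887e-03)
se <- c(internetUsage = 1.287e-04,
        literacyRate1950 = 1.045e-04,
        year = 4.770e-04)
df_resid <- 1248  # residual degrees of freedom from the summary output

# t statistic = estimate / standard error
t_val <- est / se

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_val), df = df_resid)

round(t_val, 3)
signif(p_val, 3)
```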
library(car)
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:dplyr':
##
## recode
# Linearity Check
crPlots(multiple_model)
# Independence of Observations
plot(resid(multiple_model), type="b",
main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)
# Core diagnostics (covers: linearity, homoscedasticity, normality, influence)
par(mfrow=c(2,2)); plot(multiple_model); par(mfrow=c(1,1))
# Check Multicollinearity
cor(final_dataset[, c("giniCoefficient", "internetUsage", "literacyRate1950", "year")], use = "complete.obs")
## giniCoefficient internetUsage literacyRate1950 year
## giniCoefficient 1.0000000 -0.5547734 -0.44925145 -0.19235055
## internetUsage -0.5547734 1.0000000 0.66185467 0.60176911
## literacyRate1950 -0.4492514 0.6618547 1.00000000 0.03391996
## year -0.1923506 0.6017691 0.03391996 1.00000000
# Calculate residuals (their normality is assessed with the Q-Q plot from plot(multiple_model))
residuals_multiple <- resid(multiple_model)
# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 0.06829407
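The RMSE printed here can be cross-checked against the residual standard error in the model summary. Both are built from the same residual sum of squares, but the residual standard error divides by the residual degrees of freedom while the RMSE divides by the number of observations. A short sketch using the rounded values from the output:

```r
rse <- 0.0684      # residual standard error from the summary output (rounded)
df_resid <- 1248   # residual degrees of freedom
n <- 1252          # number of rows in final_dataset

# RSE = sqrt(RSS / df_resid) and RMSE = sqrt(RSS / n),
# so RMSE = RSE * sqrt(df_resid / n)
rmse_from_rse <- rse * sqrt(df_resid / n)
rmse_from_rse
```

The result agrees with the printed RMSE of 0.06829407 up to the rounding of the residual standard error.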
Results: In the component + residual plots, internetUsage shows a downward trend, which means there is a negative relationship between internetUsage and the giniCoefficient. For literacyRate1950, the line is mostly straight but curves upward in the center, so a nonlinear relationship is possible. The year plot shows a very shallow positive relationship between year and giniCoefficient.
For independence, the plot shows that the residuals are centered near 0; however, they follow a wavy up-and-down pattern, which may indicate that neighboring observations are not fully independent. For Residuals vs Fitted, a flat cloud is present, so linearity looks good. In terms of homoscedasticity (equal variance), the spread of residuals is roughly constant, which suggests no evidence of heteroscedasticity. For Scale-Location, the spread is also constant, so there is again no evidence of heteroscedasticity. For Q-Q Residuals, there is a slight deviation in the right tail, but it is not critical because the points still follow the reference line. For Residuals vs Leverage, there are no obvious influential points, so this check also looks good.
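A wavy residuals-vs-order pattern can be quantified with the Durbin-Watson statistic, which is close to 2 for uncorrelated residuals and falls toward 0 under positive autocorrelation. The sketch below computes it by hand on simulated residuals (not the model's residuals), so every value in it is an assumption for illustration only.

```r
set.seed(1)
n <- 500

# Simulated independent residuals
e_iid <- rnorm(n)

# Simulated AR(1) residuals: each value carries over 0.8 of the previous one
e_ar <- as.numeric(stats::filter(rnorm(n), filter = 0.8, method = "recursive"))

# Durbin-Watson statistic: squared successive differences over squared residuals
durbin_watson <- function(e) sum(diff(e)^2) / sum(e^2)

dw_iid <- durbin_watson(e_iid)
dw_ar <- durbin_watson(e_ar)
c(iid = dw_iid, ar1 = dw_ar)
```

With the fitted model available, car::durbinWatsonTest(multiple_model) offers a formal version of this check. Since the rows are ordered by country and then year, the statistic would also treat the last year of one country and the first year of the next as neighbors, so it should be read with that in mind.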
For multicollinearity, the pairwise correlations are moderate: some are negative, and none exceeds 0.70 in absolute value. The largest correlation among the predictors is 0.66, between internetUsage and literacyRate1950.
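Pairwise correlations can miss collinearity that involves several predictors at once, so variance inflation factors (VIFs) are a common complement. For a linear model with an intercept and only numeric predictors, the VIFs equal the diagonal of the inverse of the predictor correlation matrix. The sketch below applies that identity to the correlations printed by cor(); the object names are chosen for illustration.

```r
predictors <- c("internetUsage", "literacyRate1950", "year")

# Predictor correlations copied from the cor() output
R <- matrix(c(
  1.00000000, 0.66185467, 0.60176911,
  0.66185467, 1.00000000, 0.03391996,
  0.60176911, 0.03391996, 1.00000000
), nrow = 3, byrow = TRUE, dimnames = list(predictors, predictors))

# VIF_j is the j-th diagonal element of the inverse correlation matrix
vif_values <- diag(solve(R))
round(vif_values, 2)
```

With the fitted model available, car::vif(multiple_model) should report the same quantities. Common rules of thumb treat VIFs above 5 or 10 as a sign of problematic multicollinearity.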
The RMSE is 0.06829407, meaning predictions of the giniCoefficient miss by about 0.068 on average.
Overall, we can conclude that as internet usage (% of population) increases, the giniCoefficient decreases, so a negative relationship is present. This suggests that countries where more people have access to the internet tend to have a more equal income distribution, which is consistent with better access to income and job opportunities, although this analysis alone cannot establish causation. Linearity and homoscedasticity look good, and the same goes for residual normality. Furthermore, the RMSE is low at 0.06829407, which means the model misses by about that amount on average when predicting the giniCoefficient. The adjusted R-squared is 0.3376, so the model explains about 33.76% of the variance in giniCoefficient; this is on the lower end, so much of the variation remains unexplained. In terms of limitations, the literacyRate1950 component + residual plot shows a possibility of a nonlinear relationship, as the line is curved rather than straight.
For future directions, we could add gender inequality and examine how it impacts income inequality. We could also use data on poverty and examine its impact on the giniCoefficient. In terms of income, how jobs in different industries impact the giniCoefficient would also be worth analyzing.
“Income Inequality: Gini Coefficient.” Our World in Data, 2025, ourworldindata.org/grapher/economic-inequality-gini-index?overlay=download-data. Accessed 20 Dec. 2025.
“Literacy Rate.” Our World in Data, 2025, ourworldindata.org/grapher/cross-country-literacy-rates?overlay=download-data. Accessed 20 Dec. 2025.
“Mutating Joins — Mutate-Joins.” Dplyr.tidyverse.org, dplyr.tidyverse.org/reference/mutate-joins.html.
“Share of the Population Using the Internet.” Our World in Data, 2025, ourworldindata.org/grapher/share-of-individuals-using-the-internet?country=WB_SAWB_NAWB_SSA~WB_EAP&overlay=download-data. Accessed 20 Dec. 2025.