Final Project

A) Introduction

The research question for this data analysis is: How do internet usage and literacy rate affect income inequality? The final data set combines 3 data sets and has dimensions 1252 x 5 after rows with missing values are removed. The variables in the final combined data set are country, year, literacyRate1950, giniCoefficient, and internetUsage. The analysis covers the years 2000 to 2023, with the literacy rate held constant at its 1950 value to see how literacy in 1950 relates to today’s income inequality. All 3 data sets come from Our World in Data. I chose this topic to see whether how literate people are and how much they use the internet have an impact on income equality, or in other words, whether they give people a better opportunity to earn a good income.

B) Data Analysis

The data analysis for this project consists of a descriptive analysis and an exploratory data analysis (EDA). The plot used to answer my research question is a scatter plot with a fitted linear regression line, which shows the trend and relationship between internetUsage and giniCoefficient. I created 2 functions: search_country(), which displays the row of data for a specified country and year, and summarize_giniCoefficient_of_country(), which gives a statistical summary (mean, median, sd, min, max) of the giniCoefficient for a specified country.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   4.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.3

## corrplot 0.95 loaded

literacy_rate <- read.csv("cross-country-literacy-rates.csv")
income_inequality <- read.csv("economic-inequality-gini-index.csv")
internet_usage <- read.csv("share-of-individuals-using-the-internet.csv")

colnames(literacy_rate)

## [1] "Entity"        "Code"          "Year"          "Literacy.rate"

colnames(income_inequality)

## [1] "Entity"                         "Code"                          
## [3] "Year"                           "Gini.coefficient..2021.prices."
## [5] "World.region.according.to.OWID"

colnames(internet_usage)

## [1] "Entity"                                          
## [2] "Code"                                            
## [3] "Year"                                            
## [4] "Individuals.using.the.Internet....of.population."

literacy_rate <- literacy_rate |>
  rename("country" = "Entity",
         "year" = "Year",
         "literacyRate" = "Literacy.rate")

income_inequality <- income_inequality |>
    rename("country" = "Entity",
           "year" = "Year",
           "giniCoefficient" = "Gini.coefficient..2021.prices.")

internet_usage <- internet_usage |>
    rename("country" = "Entity", 
           "year" = "Year",
           "internetUsage" = "Individuals.using.the.Internet....of.population.")

literacy_rate_1950 <- literacy_rate |>
  filter(
    year == 1950
  )

income_inequality <- income_inequality |>
  filter(
    year >= 2000, year <= 2023
  )

internet_usage <- internet_usage |>
  filter(
    year >= 2000, year <= 2023
  )

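# I used inner_join() and left_join() (referenced from dplyr.tidyverse.org) to combine columns across the data sets: inner_join() keeps only the country-year pairs present in both the Gini and internet tables, and left_join() then attaches the 1950 literacy rate by country.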

combined <- inner_join(income_inequality, internet_usage, by = c("country", "year"))

final_dataset <- left_join(combined, literacy_rate_1950, 
                      by = "country")

final_dataset <- final_dataset |>
  select(country, year.x, literacyRate, giniCoefficient, internetUsage)

final_dataset <- final_dataset |>
  rename("year" = "year.x",
         "literacyRate1950" = "literacyRate")

# removes rows that don't have data present for literacyRate1950 and giniCoefficient
final_dataset <- final_dataset |>
  filter(!is.na(literacyRate1950), !is.na(giniCoefficient))

head(final_dataset)

##   country year literacyRate1950 giniCoefficient internetUsage
## 1 Albania 2002             72.5       0.3173898      0.390081
## 2 Albania 2005             72.5       0.3059565      6.043890
## 3 Albania 2008             72.5       0.2998467     23.860000
## 4 Albania 2012             72.5       0.2896048     49.400000
## 5 Albania 2014             72.5       0.3459890     54.300000
## 6 Albania 2015             72.5       0.3275373     56.900000
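The difference between the two joins used above can be seen on a small example. This is only a sketch on hypothetical toy tables (gini_toy, internet_toy, and literacy_toy are made-up stand-ins, not the Our World in Data files); it also shows where the year.x column selected above comes from.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy tables (made up for illustration, not the real CSV files)
gini_toy <- data.frame(
  country = c("A", "A", "B"),
  year = c(2000, 2001, 2000),
  giniCoefficient = c(0.30, 0.31, 0.45)
)
internet_toy <- data.frame(
  country = c("A", "B", "C"),
  year = c(2000, 2000, 2000),
  internetUsage = c(10, 5, 20)
)
literacy_toy <- data.frame(
  country = "A",
  year = 1950,
  literacyRate = 80
)

# inner_join() keeps only country-year pairs present in BOTH tables,
# so (A, 2001) and (C, 2000) are dropped
combined_toy <- inner_join(gini_toy, internet_toy, by = c("country", "year"))

# left_join() keeps every row of combined_toy; country B has no 1950
# literacy value, so its literacyRate becomes NA.
# Both tables have a "year" column that is not a join key, so dplyr
# suffixes them as year.x (left table) and year.y (right table).
final_toy <- left_join(combined_toy, literacy_toy, by = "country")

print(final_toy)
```

In this sketch the row for country B survives the left join with a missing literacy rate, which is the kind of row the is.na() filter removes.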

# Function 1: get data information for a specific country based on year
search_country <- function(c, yyyy){
  
  if (!(c %in% final_dataset$country) || !is.character(c)) {
    cat("Error: Variable", c, "is not a character or not found.\n")
    return(NULL)
  }
  
  if (!(yyyy %in% final_dataset$year) || !is.numeric(yyyy)) {
    cat("Error: Variable", yyyy, "is not numeric or not found.\n")
    return(NULL)
  }
  
  info <- final_dataset |>
    filter(c == country, year == yyyy)
    
  return(info)
}

# Function 2: summarize the giniCoefficient of a specified country
summarize_giniCoefficient_of_country <- function(c){
  if (!(c %in% final_dataset$country) || !is.character(c)) {
    cat("Error: Variable", c, "is not a character or not found.\n")
    return(NULL)
  }
  
  info <- final_dataset |>
    filter(country == c) |>
    summarise(
      mean(giniCoefficient),
      median(giniCoefficient),
      sd(giniCoefficient),
      min(giniCoefficient),
      max(giniCoefficient)
    )
  
  return(info)
}

# performing EDA
dim(final_dataset)

## [1] 1252    5

str(final_dataset)

## 'data.frame':    1252 obs. of  5 variables:
##  $ country         : chr  "Albania" "Albania" "Albania" "Albania" ...
##  $ year            : int  2002 2005 2008 2012 2014 2015 2016 2017 2018 2019 ...
##  $ literacyRate1950: num  72.5 72.5 72.5 72.5 72.5 72.5 72.5 72.5 72.5 72.5 ...
##  $ giniCoefficient : num  0.317 0.306 0.3 0.29 0.346 ...
##  $ internetUsage   : num  0.39 6.04 23.86 49.4 54.3 ...

head(final_dataset)

##   country year literacyRate1950 giniCoefficient internetUsage
## 1 Albania 2002             72.5       0.3173898      0.390081
## 2 Albania 2005             72.5       0.3059565      6.043890
## 3 Albania 2008             72.5       0.2998467     23.860000
## 4 Albania 2012             72.5       0.2896048     49.400000
## 5 Albania 2014             72.5       0.3459890     54.300000
## 6 Albania 2015             72.5       0.3275373     56.900000

colSums(is.na(final_dataset))

##          country             year literacyRate1950  giniCoefficient 
##                0                0                0                0 
##    internetUsage 
##                0

# testing function 1 and 2
search_country("United States", 2001)

##         country year literacyRate1950 giniCoefficient internetUsage
## 1 United States 2001             96.5       0.4059427       49.0808

summarize_giniCoefficient_of_country("United States")

##   mean(giniCoefficient) median(giniCoefficient) sd(giniCoefficient)
## 1             0.4099162               0.4109227         0.006191934
##   min(giniCoefficient) max(giniCoefficient)
## 1            0.3969645            0.4190184

ggplot(final_dataset, aes(x = internetUsage, y = giniCoefficient)) +
  geom_point(alpha = 0.5, color = "#3B7FD9") +
  geom_smooth(method = "lm", color = "#000000", se = FALSE) + 
  labs(title = "Internet Usage vs. Income Inequality",
       x = "Internet Usage (% of Population)",
       y = "Income Inequality (Gini Coefficient)") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

C) Regression Analysis

The final model used for this project is a multiple linear regression fit with lm(), with giniCoefficient as the response and internetUsage, literacyRate1950, and year as predictors.

multiple_model <- lm(giniCoefficient ~ internetUsage + literacyRate1950 + year, data = final_dataset)

summary(multiple_model)

## 
## Call:
## lm(formula = giniCoefficient ~ internetUsage + literacyRate1950 + 
##     year, data = final_dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18176 -0.04297 -0.01169  0.04609  0.23784 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -5.333e+00  9.585e-01  -5.564 3.23e-08 ***
## internetUsage    -1.821e-03  1.287e-04 -14.144  < 2e-16 ***
## literacyRate1950 -9.528e-06  1.045e-04  -0.091    0.927    
## year              2.887e-03  4.770e-04   6.054 1.87e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0684 on 1248 degrees of freedom
## Multiple R-squared:  0.3392, Adjusted R-squared:  0.3376 
## F-statistic: 213.5 on 3 and 1248 DF,  p-value: < 2.2e-16

Interpretations:
The intercept is -5.333, which is the predicted giniCoefficient when all predictors are 0. This is not practically meaningful (it corresponds to year 0), but mathematically it is the y-intercept. Holding literacyRate1950 and year constant, the internetUsage coefficient is -1.821e-3, meaning that as internet usage increases by 1 percentage point, the giniCoefficient decreases by 1.821e-3. Holding internetUsage and year constant, the literacyRate1950 coefficient is -9.528e-6, meaning that as literacyRate1950 increases by 1 percentage point, the giniCoefficient decreases by 9.528e-6. Holding internetUsage and literacyRate1950 constant, the coefficient of year is 2.887e-3, meaning that as year increases by 1, the giniCoefficient increases by 2.887e-3.

Looking at the p-values at the 0.05 level: the internetUsage p-value is < 2e-16, which is < 0.05, so it is statistically significant. The literacyRate1950 p-value is 0.927, which is > 0.05, so it is not statistically significant. The year p-value is 1.87e-9, which is < 0.05, so it is statistically significant.
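To make the size of these coefficients more concrete, here is a small worked calculation. It is only a sketch: the estimates are typed in by hand from the summary(multiple_model) output, and the 10-unit changes are arbitrary values chosen for illustration.

```r
# Estimates copied by hand from the summary(multiple_model) output
b_internet <- -1.821e-03
b_year <- 2.887e-03

# Predicted change in giniCoefficient for a 10-percentage-point rise in
# internetUsage, holding literacyRate1950 and year fixed
delta_internet <- 10 * b_internet
delta_internet

# Predicted change in giniCoefficient over 10 calendar years,
# holding internetUsage and literacyRate1950 fixed
delta_year <- 10 * b_year
delta_year
```

Under the model, a 10-point rise in internet usage goes with a Gini about 0.018 lower, while 10 years with everything else fixed goes with a Gini about 0.029 higher.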

D) Model Assumptions and Diagnostics

library(car)

## Warning: package 'car' was built under R version 4.3.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.3

## 
## Attaching package: 'car'

## The following object is masked from 'package:purrr':
## 
##     some

## The following object is masked from 'package:dplyr':
## 
##     recode

# Linearity Check
crPlots(multiple_model)

# Independence of Observations
plot(resid(multiple_model), type="b",
     main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)

# Core diagnostics (covers: linearity, homoscedasticity, normality, influence)
par(mfrow=c(2,2)); plot(multiple_model); par(mfrow=c(1,1))

# Check Multicollinearity
cor(final_dataset[, c("giniCoefficient", "internetUsage",  "literacyRate1950", "year")], use = "complete.obs")

##                  giniCoefficient internetUsage literacyRate1950        year
## giniCoefficient        1.0000000    -0.5547734      -0.44925145 -0.19235055
## internetUsage         -0.5547734     1.0000000       0.66185467  0.60176911
## literacyRate1950      -0.4492514     0.6618547       1.00000000  0.03391996
## year                  -0.1923506     0.6017691       0.03391996  1.00000000

# Residuals of the multiple model (their normality is assessed with the Q-Q plot from plot(multiple_model))
residuals_multiple <- resid(multiple_model)

# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple

## [1] 0.06829407

Results: For the component + residual plots, internetUsage shows a downward trend, which indicates a negative relationship between internetUsage and the giniCoefficient. For literacyRate1950, the line is mostly straight but curves upward in the center, so a nonlinear relationship is possible. The year plot shows a very shallow positive relationship between year and giniCoefficient.

For Independence, the plot shows that the residuals are centered near 0; however, they follow a wavy up-and-down pattern across row order. Since the rows appear to be ordered by country and then year, this pattern suggests that residuals within the same country may be correlated, so independence may not fully hold. For Residuals vs Fitted, a flat cloud is present, so linearity is good. In terms of homoscedasticity (equal variance), the spread of the residuals is roughly constant, which suggests no evidence of heteroscedasticity. For Scale-Location, the spread is also constant, so there is again no evidence of heteroscedasticity. For Q-Q Residuals, there is a slight deviation in the right tail, but it is not critical because the points still follow the reference line. For Residuals vs Leverage, there are no obvious influential points, so this is good.
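One way to put a number on a wavy residual pattern is the Durbin-Watson statistic, which is near 2 when consecutive residuals are uncorrelated and falls toward 0 when they are positively correlated. The sketch below uses simulated series rather than the project's residuals, and durbin_watson() is a hypothetical helper written for this example, not a function used elsewhere in this report.

```r
set.seed(42)

# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals
durbin_watson <- function(e) sum(diff(e)^2) / sum(e^2)

# Simulated "residuals": independent noise vs. a positively autocorrelated series
white_noise <- rnorm(500)
ar1 <- as.numeric(arima.sim(model = list(ar = 0.8), n = 500))

dw_white <- durbin_watson(white_noise)  # expected to be close to 2
dw_ar1 <- durbin_watson(ar1)            # expected to be well below 2

c(white_noise = dw_white, ar1 = dw_ar1)
```

The same helper could be applied to resid(multiple_model), with the caveat that the row order mixes many countries, so the statistic would only be a rough indication.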

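For Multicollinearity, the correlations among the predictors are at most moderate: about 0.66 between internetUsage and literacyRate1950, 0.60 between internetUsage and year, and 0.03 between literacyRate1950 and year. The correlations with giniCoefficient are all negative. No correlation exceeds 0.70 in absolute value, so there is no strong evidence of multicollinearity.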
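Pairwise correlations can miss collinearity that involves several predictors at once, so a common complementary check is the variance inflation factor (VIF). The sketch below computes it by hand on simulated predictors (x1, x2, and x3 are made-up variables, not the project data), using the definition VIF_j = 1 / (1 - R²_j).

```r
set.seed(1)
n <- 200

# Simulated predictors: x2 is deliberately correlated with x1, x3 is independent
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all of the other predictors
vif_manual <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})

vif_manual
```

Since the car package is already loaded in this report, vif(multiple_model) should give the corresponding values for the fitted model; values close to 1 indicate little inflation.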

The RMSE is 0.06829407, meaning predictions of the giniCoefficient miss by about 0.068 on average.
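The value printed by rmse_multiple can be cross-checked against the residual standard error in the regression summary, because the two differ only in the divisor (n for the RMSE, the residual degrees of freedom for the standard error). This is a sketch: the first part uses numbers typed in by hand from the summary output, and the second part confirms the identity on the built-in mtcars data, which is only a stand-in.

```r
# Part 1: numbers copied by hand from the summary(multiple_model) output
rse <- 0.0684      # residual standard error (already rounded in the output)
df_resid <- 1248   # residual degrees of freedom
n_obs <- 1252      # number of rows in final_dataset

# RMSE = RSE * sqrt(df / n), since both are built from the same
# residual sum of squares
rmse_from_rse <- rse * sqrt(df_resid / n_obs)
rmse_from_rse

# Part 2: the same identity on a stand-in model fit to mtcars
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)
rmse_direct <- sqrt(mean(resid(toy_fit)^2))
rmse_identity <- sigma(toy_fit) * sqrt(df.residual(toy_fit) / nobs(toy_fit))
all.equal(rmse_direct, rmse_identity)
```

The first result comes out at roughly 0.0683, in line with the rmse_multiple output up to the rounding of the residual standard error.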

E) Conclusion and Future Directions

Overall, we can conclude that as Internet Usage (% of population) increases, the giniCoefficient decreases, so a negative relationship is present. This is consistent with the idea that better access to the internet goes along with better income and job opportunities, since the giniCoefficient goes down, although a country-level association does not by itself show this for individual people. The linearity and homoscedasticity are also good, and the same goes for the residual normality. Furthermore, the RMSE is low at 0.06829407, which means the model misses by about that amount on average when predicting the giniCoefficient. The adjusted R-squared shows that the model explains about 33.76% of the variance in giniCoefficient, which is on the lower end, so the model captures only part of what drives income inequality. In terms of limitations, the literacyRate1950 component + residual plot shows a possibility of a nonlinear relationship, as the line is curved rather than straight, and the wavy pattern in the residuals vs. order plot suggests the observations may not be fully independent.

For future directions, we could add gender inequality and examine how it impacts income inequality. We could also use data on poverty to see how it affects the giniCoefficient. In terms of income, how jobs in different industries affect the giniCoefficient would also be worth analyzing.

F) Citations

“Income Inequality: Gini Coefficient.” Our World in Data, 2025, ourworldindata.org/grapher/economic-inequality-gini-index?overlay=download-data. Accessed 20 Dec. 2025.

“Literacy Rate.” Our World in Data, 2025, ourworldindata.org/grapher/cross-country-literacy-rates?overlay=download-data. Accessed 20 Dec. 2025.

“Mutating Joins — Mutate-Joins.” Dplyr.tidyverse.org, dplyr.tidyverse.org/reference/mutate-joins.html.

“Share of the Population Using the Internet.” Our World in Data, 2025, ourworldindata.org/grapher/share-of-individuals-using-the-internet?country=WB_SA~WB_NA~WB_SSA~WB_EAP&overlay=download-data. Accessed 20 Dec. 2025.