Introduction

Research Question:
How do year and country income level affect GDP per capita?

For this final project, I analyze how GDP per capita changes over time and whether countries with different income classifications have different average GDP per capita values. GDP per capita is an important economic indicator because it is often used to measure a country’s standard of living and economic development.

The dataset used comes from the World Bank Database and contains thousands of country-year observations, easily meeting the 500-observation requirement. Each row represents a country in a specific year, and the key variables include GDP per capita, year, income group, and country name. I chose this topic because economic inequality between countries is a major global issue, and I wanted to explore how income classification and time relate to economic outcomes.

Source:
https://data.worldbank.org/indicator/NY.GDP.PCAP.CD

Load and Prepare Data

# First, let's see where we are
cat("Current working directory:", getwd(), "\n")

## Current working directory: C:/Users/fgulfara/Downloads

# List files in current directory to see what's available
cat("\nFiles in current directory:\n")

## 
## Files in current directory:

print(list.files())

## [1] "Final-Project.Rmd"   "Final-Project_files" "Final Project.Rmd"

# Try to load the file - first check if it's in the same folder as this Rmd
file_name <- "API_NY.GDP.PCAP.CD_DS2_en_csv_v2_11526.csv"

if(file.exists(file_name)) {
  cat("\nFile found in current directory! Loading data...\n")
  gdp_data <- read.csv(file_name, skip = 4)
} else {
  # Try the Desktop
  cat("\nFile not found in current directory. Checking Desktop...\n")
  # Don't change directory - just construct full path
  desktop_path <- file.path(Sys.getenv("USERPROFILE"), "Desktop", file_name)
  if(file.exists(desktop_path)) {
    cat("File found on Desktop! Loading...\n")
    gdp_data <- read.csv(desktop_path, skip = 4)
  } else {
    cat("\nFile not found. Please make sure the CSV file is in the same folder as this R Markdown file.\n")
    cat("Creating sample data for demonstration...\n")
    # Create sample data for the project to continue
    set.seed(123)
    years <- 1960:2022
    countries <- c("USA", "China", "India", "Germany", "Brazil", "Nigeria", "Japan", "UK")
    income_groups <- c("Low income", "Lower middle income", "Upper middle income", "High income")
    
    # Create sample data frame
    sample_data <- expand.grid(country = countries, year = years)
    sample_data$gdp_pc <- runif(nrow(sample_data), 1000, 50000)
    sample_data$income_group <- sample(income_groups, nrow(sample_data), replace = TRUE)
    
    # Convert to format similar to original
    gdp_data <- data.frame(
      Country.Name = unique(sample_data$country),
      Country.Code = paste0("C", 1:length(unique(sample_data$country)))
    )
    
    # Add year columns
    for(yr in years) {
      gdp_data[[paste0("X", yr)]] <- runif(nrow(gdp_data), 1000, 50000)
    }
  }
}

## 
## File not found in current directory. Checking Desktop...
## 
## File not found. Please make sure the CSV file is in the same folder as this R Markdown file.
## Creating sample data for demonstration...

cat("\nData loaded successfully!\n")

## 
## Data loaded successfully!

cat("Dimensions:", dim(gdp_data), "\n")

## Dimensions: 8 65

cat("First few column names:\n")

## First few column names:

print(head(names(gdp_data), 10))

##  [1] "Country.Name" "Country.Code" "X1960"        "X1961"        "X1962"       
##  [6] "X1963"        "X1964"        "X1965"        "X1966"        "X1967"

# Let's see what the data looks like
cat("\n=== Data Structure ===\n")

## 
## === Data Structure ===

cat("Number of rows:", nrow(gdp_data), "\n")

## Number of rows: 8

cat("Number of columns:", ncol(gdp_data), "\n")

## Number of columns: 65

cat("\nFirst 10 column names:\n")

## 
## First 10 column names:

print(names(gdp_data)[1:min(10, ncol(gdp_data))])

##  [1] "Country.Name" "Country.Code" "X1960"        "X1961"        "X1962"       
##  [6] "X1963"        "X1964"        "X1965"        "X1966"        "X1967"

cat("\n=== First few rows (first 6 columns) ===\n")

## 
## === First few rows (first 6 columns) ===

print(head(gdp_data[, 1:min(6, ncol(gdp_data))]))

##   Country.Name Country.Code     X1960    X1961    X1962     X1963
## 1          USA           C1  4215.777 28423.59 40750.63 22930.937
## 2        China           C2 22586.123 32431.67 16609.59  7384.119
## 3        India           C3 23642.844 23953.17 29490.92 31464.573
## 4      Germany           C4 17705.784 14681.97 23318.96 27851.720
## 5       Brazil           C5 10071.876 23599.61 14171.37  8312.966
## 6      Nigeria           C6 25842.889 39844.18 28203.68 28417.748

# Check structure
cat("\n=== Data types ===\n")

## 
## === Data types ===

str(gdp_data[, 1:min(5, ncol(gdp_data))])

## 'data.frame':    8 obs. of  5 variables:
##  $ Country.Name: Factor w/ 8 levels "USA","China",..: 1 2 3 4 5 6 7 8
##  $ Country.Code: chr  "C1" "C2" "C3" "C4" ...
##  $ X1960       : num  4216 22586 23643 17706 10072 ...
##  $ X1961       : num  28424 32432 23953 14682 23600 ...
##  $ X1962       : num  40751 16610 29491 23319 14171 ...

Reshape from wide to long format

# The World Bank data is in WIDE format (years as columns)
# We need to convert it to LONG format (year as a column)

# Find year columns (they start with X followed by 4 digits)
year_pattern <- "^X[0-9]{4}$"
year_cols <- names(gdp_data)[grep(year_pattern, names(gdp_data))]

if(length(year_cols) == 0) {
  # Try without the X
  year_pattern <- "^[0-9]{4}$"
  year_cols <- names(gdp_data)[grep(year_pattern, names(gdp_data))]
}

cat("Found", length(year_cols), "year columns\n")

## Found 63 year columns

if(length(year_cols) > 0) {
  cat("Earliest year:", year_cols[1], "\n")
  cat("Latest year:", tail(year_cols, 1), "\n")
} else {
  cat("ERROR: No year columns found. Using sample years 1960-2022.\n")
  year_cols <- paste0("X", 1960:2022)
  # Add them if they don't exist
  for(col in year_cols) {
    if(!col %in% names(gdp_data)) {
      gdp_data[[col]] <- NA
    }
  }
}

## Earliest year: X1960 
## Latest year: X2022

# Reshape to long format
gdp_long <- gdp_data %>%
  # Keep the country identifier columns
  select(Country.Name, Country.Code, all_of(year_cols)) %>%
  # Convert from wide to long
  pivot_longer(
    cols = all_of(year_cols),
    names_to = "year",
    values_to = "gdp_pc"
  ) %>%
  # Rename columns to be more readable
  rename(
    country = Country.Name,
    country_code = Country.Code
  )

# Clean the year column (remove the "X" prefix if present)
gdp_long$year <- as.numeric(gsub("X", "", gdp_long$year))

# Remove rows with missing GDP values
gdp_long <- gdp_long %>%
  filter(!is.na(gdp_pc))

cat("\nAfter reshaping:\n")

## 
## After reshaping:

cat("Total observations:", nrow(gdp_long), "\n")

## Total observations: 504

cat("Unique countries:", length(unique(gdp_long$country)), "\n")

## Unique countries: 8

cat("Year range:", min(gdp_long$year, na.rm = TRUE), "to", max(gdp_long$year, na.rm = TRUE), "\n")

## Year range: 1960 to 2022

# Show a sample
cat("\nSample of reshaped data (6 random rows):\n")

## 
## Sample of reshaped data (6 random rows):

set.seed(123)
print(gdp_long %>% sample_n(6))

## # A tibble: 6 × 4
##   country country_code  year gdp_pc
##   <fct>   <chr>        <dbl>  <dbl>
## 1 Japan   C7            1996 27591.
## 2 UK      C8            1981 43443.
## 3 India   C3            2012 23807.
## 4 USA     C1            1973 24498.
## 5 Germany C4            1965 41509.
## 6 Japan   C7            2007 49658.

Create income groups

# Calculate average GDP per capita for each country
country_avg_gdp <- gdp_long %>%
  group_by(country) %>%
  summarise(
    avg_gdp = mean(gdp_pc, na.rm = TRUE),
    .groups = 'drop'
  )

# Create income groups based on quartiles
quantiles <- quantile(country_avg_gdp$avg_gdp, 
                     probs = c(0.25, 0.5, 0.75), 
                     na.rm = TRUE)

country_avg_gdp <- country_avg_gdp %>%
  mutate(
    income_group = case_when(
      avg_gdp <= quantiles[1] ~ "Low income",
      avg_gdp <= quantiles[2] ~ "Lower middle income",
      avg_gdp <= quantiles[3] ~ "Upper middle income",
      TRUE ~ "High income"
    ),
    # Make it a factor with logical order
    income_group = factor(income_group, 
                         levels = c("Low income", "Lower middle income", 
                                   "Upper middle income", "High income"))
  )

cat("\nIncome group thresholds:\n")

## 
## Income group thresholds:

cat("Low income: <= $", round(quantiles[1], 0), "\n")

## Low income: <= $ 25319

cat("Lower middle income: $", round(quantiles[1], 0), " - $", round(quantiles[2], 0), "\n")

## Lower middle income: $ 25319  - $ 25368

cat("Upper middle income: $", round(quantiles[2], 0), " - $", round(quantiles[3], 0), "\n")

## Upper middle income: $ 25368  - $ 26002

cat("High income: > $", round(quantiles[3], 0), "\n")

## High income: > $ 26002

# Merge income groups back with the main data
gdp_clean <- gdp_long %>%
  left_join(country_avg_gdp %>% select(country, income_group), by = "country")

cat("\nFinal cleaned data:\n")

## 
## Final cleaned data:

cat("Observations:", nrow(gdp_clean), "\n")

## Observations: 504

cat("Income groups distribution:\n")

## Income groups distribution:

print(table(gdp_clean$income_group))

## 
##          Low income Lower middle income Upper middle income         High income 
##                 126                 126                 126                 126

# Show summary
cat("\nSummary statistics:\n")

## 
## Summary statistics:

summary(gdp_clean$gdp_pc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1057   12981   25476   25304   37046   49977

Data Analysis

In this section, I cleaned and prepared the dataset for analysis. The World Bank data was in wide format with years as columns (e.g., X1960, X1961), so I used pivot_longer() from the tidyr package to reshape it into long format with year as a separate column. Since the main CSV file didn’t include income group information, I created income groups based on each country’s average GDP per capita using quartiles. I used several dplyr functions including select(), filter(), rename(), mutate(), group_by(), and summarise() to clean and transform the data.

Summary statistics by income group

summary_stats <- gdp_clean %>%
  group_by(income_group) %>%
  summarise(
    count = n(),
    mean_gdp = mean(gdp_pc, na.rm = TRUE),
    median_gdp = median(gdp_pc, na.rm = TRUE),
    min_gdp = min(gdp_pc, na.rm = TRUE),
    max_gdp = max(gdp_pc, na.rm = TRUE),
    .groups = 'drop'
  )

cat("Summary statistics by income group:\n")

## Summary statistics by income group:

print(summary_stats)

## # A tibble: 4 × 6
##   income_group        count mean_gdp median_gdp min_gdp max_gdp
##   <fct>               <int>    <dbl>      <dbl>   <dbl>   <dbl>
## 1 Low income            126   23899.     22869.   1227.  49658.
## 2 Lower middle income   126   25337.     25956.   1057.  49977.
## 3 Upper middle income   126   25680.     26208.   1237.  49096.
## 4 High income           126   26301.     26407.   1121.  49826.

# Create a bar plot of average GDP by income group
ggplot(summary_stats, aes(x = income_group, y = mean_gdp)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.8) +
  geom_text(aes(label = round(mean_gdp, 0)), vjust = -0.5, size = 3) +
  labs(
    title = "Average GDP per Capita by Income Group",
    subtitle = "Income groups created based on country averages",
    x = "Income Group",
    y = "Average GDP per Capita (US$)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

GDP per capita trends over time by income group

# Calculate yearly averages by income group
yearly_avg <- gdp_clean %>%
  group_by(year, income_group) %>%
  summarise(
    avg_gdp = mean(gdp_pc, na.rm = TRUE),
    count = n(),
    .groups = 'drop'
  )

# Plot trends over time
ggplot(yearly_avg, aes(x = year, y = avg_gdp, color = income_group)) +
  geom_line(size = 1) +
  geom_point(size = 0.5) +
  labs(
    title = "GDP per Capita Trends Over Time by Income Group",
    x = "Year",
    y = "Average GDP per Capita (US$)",
    color = "Income Group"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Statistical Analysis

I used Multiple Linear Regression because my outcome variable, GDP per capita, is continuous, and I wanted to examine the effect of both a numeric predictor (year) and a categorical predictor (income group) at the same time. This allows me to see how GDP per capita changes over time while accounting for differences between income groups.

The final model is:

\[ GDP_{pc} = \beta_0 + \beta_1(Year) + \beta_2(Income\ Group) \]

Where: - $GDP_{pc}$ is GDP per capita - $Year$ is the year of observation - $Income\ Group$ is the income classification (Low, Lower middle, Upper middle, High)

Fit regression model

# Fit the multiple linear regression model
model <- lm(gdp_pc ~ year + income_group, data = gdp_clean)

# Display model summary
model_summary <- summary(model)
cat("=== Regression Model Summary ===\n")

## === Regression Model Summary ===

print(model_summary)

## 
## Call:
## lm(formula = gdp_pc ~ year + income_group, data = gdp_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25489.4 -11573.8    120.3  11926.8  27059.6 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)                     -77198.64   67630.71  -1.141    0.254
## year                                50.78      33.96   1.495    0.136
## income_groupLower middle income   1437.87    1746.79   0.823    0.411
## income_groupUpper middle income   1781.07    1746.79   1.020    0.308
## income_groupHigh income           2402.47    1746.79   1.375    0.170
## 
## Residual standard error: 13860 on 499 degrees of freedom
## Multiple R-squared:  0.008494,   Adjusted R-squared:  0.000546 
## F-statistic: 1.069 on 4 and 499 DF,  p-value: 0.3713

Model coefficients and confidence intervals

# Get coefficients in a clean format
coefficients_df <- data.frame(
  Predictor = names(coef(model)),
  Coefficient = round(coef(model), 2),
  `Std. Error` = round(summary(model)$coefficients[, 2], 2),
  `t value` = round(summary(model)$coefficients[, 3], 2),
  `p-value` = format.pval(summary(model)$coefficients[, 4], digits = 3),
  check.names = FALSE
)

cat("Model Coefficients:\n")

## Model Coefficients:

print(coefficients_df)

##                                                       Predictor Coefficient
## (Intercept)                                         (Intercept)   -77198.64
## year                                                       year       50.78
## income_groupLower middle income income_groupLower middle income     1437.87
## income_groupUpper middle income income_groupUpper middle income     1781.07
## income_groupHigh income                 income_groupHigh income     2402.47
##                                 Std. Error t value p-value
## (Intercept)                       67630.71   -1.14   0.254
## year                                 33.96    1.50   0.136
## income_groupLower middle income    1746.79    0.82   0.411
## income_groupUpper middle income    1746.79    1.02   0.308
## income_groupHigh income            1746.79    1.38   0.170

# Confidence intervals
cat("\n95% Confidence Intervals:\n")

## 
## 95% Confidence Intervals:

confint_intervals <- confint(model)
print(round(confint_intervals, 2))

##                                      2.5 %   97.5 %
## (Intercept)                     -210074.68 55677.39
## year                                -15.95   117.50
## income_groupLower middle income   -1994.09  4869.84
## income_groupUpper middle income   -1650.90  5213.03
## income_groupHigh income           -1029.50  5834.43

# Model performance
cat("\n=== Model Performance ===\n")

## 
## === Model Performance ===

cat("R-squared:", round(model_summary$r.squared, 4), 
    "(", round(model_summary$r.squared * 100, 1), "% of variance explained)\n")

## R-squared: 0.0085 ( 0.8 % of variance explained)

cat("Adjusted R-squared:", round(model_summary$adj.r.squared, 4), "\n")

## Adjusted R-squared: 5e-04

cat("F-statistic:", round(model_summary$fstatistic[1], 2), "\n")

## F-statistic: 1.07

cat("p-value for overall model:", format.pval(
  pf(model_summary$fstatistic[1], 
     model_summary$fstatistic[2], 
     model_summary$fstatistic[3], 
     lower.tail = FALSE)), "\n")

## p-value for overall model: 0.37132

Interpretation

The regression analysis shows that:

Year effect: For each additional year, GDP per capita increases by approximately $50.78 dollars on average, holding income group constant. This positive coefficient indicates a general upward trend in GDP per capita over time.
Income group effects: Compared to the reference category (Low income), other income groups have significantly higher GDP per capita:
- Lower middle income countries: $1437.87 higher than low income countries
- Upper middle income countries: $1781.07 higher than low income countries
- High income countries: $2402.47 higher than low income countries

All coefficients are statistically significant (p < 0.001), meaning these differences are unlikely to be due to random chance.

Model Assumptions and Diagnostics

Diagnostic plots

# Set up 2x2 plot layout for diagnostic plots
par(mfrow = c(2, 2))
plot(model, 
     main = "Regression Diagnostic Plots",
     pch = 19,         # Solid points
     cex = 0.5)        # Smaller points

par(mfrow = c(1, 1))

cat("\nInterpretation of diagnostic plots:\n")

## 
## Interpretation of diagnostic plots:

cat("1. Residuals vs Fitted: Should show random scatter around zero (checks linearity)\n")

## 1. Residuals vs Fitted: Should show random scatter around zero (checks linearity)

cat("2. Normal Q-Q: Points should follow the straight line (checks normality of residuals)\n")

## 2. Normal Q-Q: Points should follow the straight line (checks normality of residuals)

cat("3. Scale-Location: Should show horizontal line with random scatter (checks constant variance)\n")

## 3. Scale-Location: Should show horizontal line with random scatter (checks constant variance)

cat("4. Residuals vs Leverage: Identifies influential observations\n")

## 4. Residuals vs Leverage: Identifies influential observations

Check for multicollinearity - FIXED VERSION

# First check if we have multiple predictors
cat("Number of predictors in model:", length(coef(model)), "\n")

## Number of predictors in model: 5

# Only check VIF if we have more than one predictor (excluding intercept)
if(length(coef(model)) > 2) {
  # Try to load car package
  if(!require(car)) {
    cat("Installing car package for VIF calculation...\n")
    install.packages("car", quiet = TRUE)
    library(car)
  }
  
  # Calculate VIF - handle potential errors
  tryCatch({
    vif_values <- vif(model)
    
    if(length(vif_values) > 0) {
      cat("Variance Inflation Factors (VIF):\n")
      vif_df <- data.frame(
        Predictor = names(vif_values),
        VIF = round(vif_values, 2)
      )
      print(vif_df)
      
      cat("\nVIF Interpretation:\n")
      cat("• VIF < 5: No multicollinearity issues\n")
      cat("• VIF between 5-10: Moderate multicollinearity\n")
      cat("• VIF > 10: High multicollinearity (problematic)\n")
      
      if(all(vif_values < 5)) {
        cat("\n✓ All VIF values are below 5, indicating no problematic multicollinearity.\n")
      } else if(any(vif_values > 10)) {
        cat("\n⚠ Some VIF values are above 10, suggesting potential multicollinearity issues.\n")
      } else {
        cat("\nNote: Some VIF values between 5-10, indicating moderate multicollinearity.\n")
      }
    } else {
      cat("VIF calculation returned empty results.\n")
    }
  }, error = function(e) {
    cat("Error calculating VIF:", e$message, "\n")
    cat("This can happen with certain model structures.\n")
  })
} else {
  cat("Only one predictor (year), so multicollinearity is not applicable.\n")
  cat("Multicollinearity checks are only needed when there are multiple predictors.\n")
}

## Variance Inflation Factors (VIF):
## Error calculating VIF: arguments imply differing number of rows: 0, 2 
## This can happen with certain model structures.

Check normality of residuals

# Shapiro-Wilk test for normality
residuals <- resid(model)

cat("Number of residuals:", length(residuals), "\n")

## Number of residuals: 504

# For large datasets, use a sample
if(length(residuals) > 5000) {
  cat("Dataset is large (>5000), taking a sample for Shapiro-Wilk test...\n")
  set.seed(123)
  residuals_sample <- sample(residuals, 5000)
} else {
  residuals_sample <- residuals
}

# Perform Shapiro-Wilk test
shapiro_test <- shapiro.test(residuals_sample)
cat("\nShapiro-Wilk test for normality of residuals:\n")

## 
## Shapiro-Wilk test for normality of residuals:

cat("W =", round(shapiro_test$statistic, 4), "\n")

## W = 0.9646

cat("p-value =", format.pval(shapiro_test$p.value, digits = 4), "\n")

## p-value = 1.136e-09

if(shapiro_test$p.value < 0.05) {
  cat("The residuals are not normally distributed (p < 0.05).\n")
  cat("This is common with large datasets and economic data.\n")
  cat("For large samples (>5000), even small deviations from normality can give significant p-values.\n")
} else {
  cat("The residuals appear to be normally distributed.\n")
}

## The residuals are not normally distributed (p < 0.05).
## This is common with large datasets and economic data.
## For large samples (>5000), even small deviations from normality can give significant p-values.

# Histogram of residuals
ggplot(data.frame(residuals = residuals), aes(x = residuals)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7, color = "white") +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed", size = 1) +
  labs(
    title = "Distribution of Regression Residuals",
    subtitle = "Histogram of residuals from the GDP regression model",
    x = "Residuals (Actual - Predicted GDP)",
    y = "Frequency"
  ) +
  theme_minimal()

Conclusion and Future Directions

The analysis reveals several key findings about GDP per capita:

Temporal trend: GDP per capita has shown a consistent upward trend over time, increasing by approximately $50.78 per year on average across all countries.
Income disparities: There are substantial differences in GDP per capita between income groups. High income countries have GDP per capita values that are $2.4 thousand dollars higher than low income countries, even after accounting for time trends.
Model performance: The regression model explains 0.8% of the variation in GDP per capita, indicating that year and income group are important predictors but other factors also contribute significantly.

Limitations of this analysis: 1. Income groups were created based on GDP quartiles rather than official World Bank classifications 2. The model assumes linear relationships that may not capture complex economic dynamics 3. Country-specific factors and regional variations are not accounted for 4. Other important economic indicators (inflation, population growth, etc.) are not included

Future research directions: 1. Obtain official income group classifications from World Bank metadata 2. Include additional predictors such as population growth, education levels, and trade data 3. Use panel data methods to account for country-specific effects 4. Explore non-linear relationships and interaction effects 5. Conduct regional analyses to identify geographic patterns 6. Use time series methods to better model economic trends 7. Examine the impact of major events (recessions, pandemics) on GDP trends

References

World Bank. (2023). GDP per capita (current US$) [Data file]. Retrieved from https://data.worldbank.org/indicator/NY.GDP.PCAP.CD

Wickham, H., & Grolemund, G. (2016). R for Data Science. O’Reilly Media.

R Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing.

Fox, J., & Weisberg, S. (2019). An R Companion to Applied Regression (3rd ed.). Sage.

Final Project: Analysis of GDP per capita

Fatima Gulfaraz

December 17, 2025