Project 1

Author

R Kwan

Introduction: The datasets I have chosen show the gross savings rate, as a percentage of GDP, for each country that is a member of the World Bank Group (189 countries). Savings is defined as the amount of disposable income that is not spent on final consumption. A country's gross savings rate is therefore its gross savings divided by its total GDP, expressed as a percentage. The four variables I use from my datasets are year, gross savings (% of GDP), income level, and country.
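As a quick illustration of the "% of GDP" unit, here is a minimal sketch of the arithmetic. The two dollar figures are hypothetical and are not taken from the World Bank files.

```r
# Hypothetical figures for one country-year (not from the World Bank data)
gdp_usd     <- 400e9   # total GDP
savings_usd <- 90e9    # gross savings

# Gross savings expressed as a share of GDP, in percent
savings_pct <- savings_usd / gdp_usd * 100
savings_pct
```

A value of 22.5 here would mean that 22.5% of GDP was saved rather than spent on final consumption.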

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
setwd("C:/Users/ronnk/OneDrive/Desktop/DATA 110")
gross_savings <- read_csv("C:/Users/ronnk/OneDrive/Desktop/DATA 110/gross_savings.csv")
New names:
• `` -> `...3`
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 268 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Data Source, World Development Indicators, ...3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
income_level <- read_csv("C:/Users/ronnk/OneDrive/Desktop/DATA 110/income_level.csv")
New names:
• `` -> `...6`
Rows: 265 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Country Code, Region, IncomeGroup, SpecialNotes, TableName
lgl (1): ...6

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Use pivot_longer to put year and gross savings into separate columns. I had to do a Google search to find out how to select a column with an unusual name such as "...70" (cite Google Gemini search for help).

savings <- read_csv("gross_savings.csv", skip = 4)
New names:
• `` -> `...70`
Rows: 266 Columns: 70
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): Country Name, Country Code, Indicator Name, Indicator Code
dbl (65): 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ...
lgl  (1): ...70

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
savings_long <- savings %>%
  select(-`...70`) %>%   
  pivot_longer(
    cols = `1960`:`2024`,   
    names_to = "year",
    values_to = "gross_savings"
  ) %>%
  mutate(year = as.integer(year))  
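The reshaping step above can be tried on a small table first. This is a minimal sketch on made-up values; the names toy_wide and toy_long and the "...4" column are illustrative stand-ins for the real file, which has 65 year columns and a trailing "...70" column.

```r
library(dplyr)
library(tidyr)

# Toy wide table (made-up values) shaped like the World Bank export:
# one column per year plus an empty trailing column named "...4"
toy_wide <- tibble(
  `Country Code` = c("AAA", "BBB"),
  `1960` = c(10.5, NA),
  `1961` = c(11.0, 20.2),
  `...4` = c(NA, NA)
)

toy_long <- toy_wide %>%
  select(-`...4`) %>%            # backticks allow a non-syntactic column name
  pivot_longer(
    cols = `1960`:`1961`,        # range of year columns, also in backticks
    names_to = "year",
    values_to = "gross_savings"
  ) %>%
  mutate(year = as.integer(year))  # column names arrive as character

toy_long
```

Each country now contributes one row per year, and year is an integer rather than a character string, which is what the regression and the plot later rely on.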

Use the merge function to merge income_level and savings_long

merged_data <- merge(savings_long, income_level, by = "Country Code")
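One thing worth knowing about merge() is that by default it keeps only rows whose key appears in both tables. The two files here have 266 and 265 rows, so it is possible that some code has no match. A minimal sketch on made-up tables (the names toy_savings, toy_income, and code are hypothetical):

```r
# Toy tables (made-up codes) to show how merge() treats unmatched keys
toy_savings <- data.frame(
  code = c("AAA", "BBB", "CCC"),
  gross_savings = c(12.1, 25.4, 18.0)
)
toy_income <- data.frame(
  code = c("AAA", "BBB"),
  IncomeGroup = c("Low income", "High income")
)

# Default: inner join, so "CCC" is dropped
inner <- merge(toy_savings, toy_income, by = "code")

# all.x = TRUE keeps every row of the first table; "CCC" gets NA
left <- merge(toy_savings, toy_income, by = "code", all.x = TRUE)

nrow(inner)
nrow(left)
```

Comparing the row counts before and after a merge is a quick way to check whether anything was silently dropped.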

Use the select function to delete unwanted columns (cite Google Gemini search for help)

merged_data <- merged_data %>%
  select(-TableName, -`...6`)

Filter data by low income countries

low_income <- merged_data %>%
  filter(IncomeGroup == "Low income") %>%
  na.omit()   # the pipe already supplies the data; drops rows with NA in any column
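Note that na.omit() removes a row if any of its columns is NA, not only the savings column. Whether that matters here depends on how many rows have NA in other columns such as SpecialNotes, which the output above does not show. A minimal sketch on made-up data (toy and its columns are hypothetical):

```r
# Toy data (made-up): the NA in `notes` is unrelated to the savings value
toy <- data.frame(
  gross_savings = c(10.2, NA, 15.8),
  notes = c(NA, "estimate", "final"),
  stringsAsFactors = FALSE
)

# Drops rows 1 and 2: each has an NA somewhere
any_na_dropped <- na.omit(toy)

# Drops row 2 only: the one with a missing savings value
value_na_dropped <- toy[!is.na(toy$gross_savings), ]

nrow(any_na_dropped)
nrow(value_na_dropped)
```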

Linear Regression (x = year, y = gross savings)

fit <- lm(gross_savings ~ year, data = low_income)
plot(fit)

summary(fit)

Call:
lm(formula = gross_savings ~ year, data = low_income)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.413  -6.745  -2.155   5.481  56.183 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -228.38013  110.38556  -2.069   0.0398 *
year           0.12082    0.05513   2.192   0.0295 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.814 on 204 degrees of freedom
Multiple R-squared:  0.023, Adjusted R-squared:  0.01821 
F-statistic: 4.803 on 1 and 204 DF,  p-value: 0.02954
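To read the coefficients in plain terms, the arithmetic below uses only the numbers printed in the summary above. The variable names are illustrative, and the year 2020 is just an example input.

```r
# Values copied from the summary(fit) output above
intercept <- -228.38013
slope     <- 0.12082   # change in gross savings (% of GDP) per year
r_squared <- 0.023

# Trend per decade, in percentage points of GDP
slope_per_decade <- slope * 10

# Fitted gross savings rate for an example year
pred_2020 <- intercept + slope * 2020

# Share of the variation in gross savings explained by year, in percent
pct_explained <- r_squared * 100

c(slope_per_decade, pred_2020, pct_explained)
```

So the fitted line rises by roughly 1.2 percentage points per decade and gives about 15.7% of GDP for 2020, but year explains only about 2.3% of the variation, so the trend is statistically significant at the 5% level yet weak as a predictor.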

Create a new dataset with the average gross savings per year for each income level (low income, lower middle, upper middle, and high income) from 1980 onward. Before 1980 there were too many NAs, which left the graphs looking distorted.

avg_data <- merged_data %>%
  filter(year >= 1980) %>%
  group_by(year, IncomeGroup) %>%
  summarise(avg_gross_savings = mean(gross_savings, na.rm = TRUE)) %>%
  na.omit()   # the pipe already supplies the data
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
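The message above is informational. A minimal sketch of the same summary on made-up rows, assuming an ungrouped result is what is wanted, shows how the .groups argument makes the choice explicit (toy and toy_avg are hypothetical names):

```r
library(dplyr)

# Toy rows (made-up values), including one missing savings value
toy <- tibble(
  year = c(1980, 1980, 1980, 1981),
  IncomeGroup = c("Low income", "Low income", "High income", "Low income"),
  gross_savings = c(10, NA, 30, 14)
)

toy_avg <- toy %>%
  group_by(year, IncomeGroup) %>%
  summarise(
    avg_gross_savings = mean(gross_savings, na.rm = TRUE),  # ignore NAs in the mean
    .groups = "drop"   # return an ungrouped tibble and silence the message
  )

toy_avg
```

With na.rm = TRUE the 1980 low-income average is computed from the one non-missing value, so a group only ends up as NaN when every value in it is missing.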

Plot of average gross savings

ggplot(avg_data, aes(x = year, y = avg_gross_savings, color = IncomeGroup)) +
  geom_line(linewidth = 1.2) +   # `size` for lines was deprecated in ggplot2 3.4.0
  geom_point(size = 2) +
  labs(
    title = "Average Gross Savings per Year by Income Level",
    x = "Year",
    y = "Average Gross Savings (% of GDP)",
    color = "Income Level",
    caption = "Source: World Bank Data"
  ) +
  scale_color_manual(values = c(
    "Low income" = "#1b9e77",
    "Lower middle income" = "#d95f02",
    "Upper middle income" = "#7570b3",
    "High income" = "#e7298a"
  )) +
  theme_classic()

During this project, cleaning the data was one of the hardest parts of my work. I was familiar with the pivot_longer function, but the column I was trying to reference had an unusual name, "...70", and I had a hard time selecting it until I did a Google search and found out I could wrap the name in backticks. I chose this visualization because I wanted to see how large the difference in savings rate was among countries classified as low, lower middle, upper middle, and high income. I took a lot of inspiration from the Nations Charts HW, which compared different countries, and that is why my graph looks relatively similar in format. What surprised me was that in recent years the savings rates of the income groups do not seem to differ by more than around 5 percentage points. I wanted to make a graph of the gross savings rate for the Philippines, but for the years in the 1900s there were too many NAs to create a meaningful graph, in my opinion.