Project 2 Code Base Submission Dataset 3

Author

Long Lin

Overview

For this project, I used three datasets from the Week 5 Discussion 5A post. I prepared each one by creating a .csv file and importing the data, then tidied the data and performed an analysis. I also made sure that the code in the Quarto Markdown file is reproducible in a clean environment. The process is similar to the one we used in Assignment 5A with the Airline Delays data, since the two assignments are very much alike.

Dataset 3:

World GDP by Country: 1960-2022 posted by Sinem Kilicdere

source: https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Project%202/World%20GDP%20by%20Country%201960-2022.csv

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(dplyr)
library(gt)

world_GDP_url <- "https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Project%202/World%20GDP%20by%20Country%201960-2022.csv"

world_GDP_df <- read_csv(
  file = world_GDP_url,
  show_col_types = FALSE,
  progress = FALSE
)

head(world_GDP_df)
# A tibble: 6 × 65
  Country   `Country Code`   `1960`   `1961`   `1962`   `1963`   `1964`   `1965`
  <chr>     <chr>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 Aruba     ABW            NA       NA       NA       NA       NA       NA      
2 Africa E… AFE             2.11e10  2.16e10  2.35e10  2.80e10  2.59e10  2.95e10
3 Afghanis… AFG             5.38e 8  5.49e 8  5.47e 8  7.51e 8  8.00e 8  1.01e 9
4 Africa W… AFW             1.04e10  1.12e10  1.20e10  1.27e10  1.39e10  1.49e10
5 Angola    AGO            NA       NA       NA       NA       NA       NA      
6 Albania   ALB            NA       NA       NA       NA       NA       NA      
# ℹ 57 more variables: `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
#   `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
#   `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
#   `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
#   `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
#   `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>,
#   `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, …

I handled the missing values in this dataset by replacing them with 0, using the following code chunk.

world_GDP_df <- world_GDP_df |>
  mutate(across("1960":"2022", ~replace_na(.x, 0)))

head(world_GDP_df)
# A tibble: 6 × 65
  Country `Country Code`  `1960`  `1961`  `1962`  `1963`  `1964`  `1965`  `1966`
  <chr>   <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 Aruba   ABW            0       0       0       0       0       0       0      
2 Africa… AFE            2.11e10 2.16e10 2.35e10 2.80e10 2.59e10 2.95e10 3.20e10
3 Afghan… AFG            5.38e 8 5.49e 8 5.47e 8 7.51e 8 8.00e 8 1.01e 9 1.40e 9
4 Africa… AFW            1.04e10 1.12e10 1.20e10 1.27e10 1.39e10 1.49e10 1.59e10
5 Angola  AGO            0       0       0       0       0       0       0      
6 Albania ALB            0       0       0       0       0       0       0      
# ℹ 56 more variables: `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
#   `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
#   `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
#   `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
#   `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
#   `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>,
#   `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, …
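
Replacing missing values with 0 keeps every row, but a 0 then looks the same as a reported GDP of zero and pulls down any summary statistic. The sketch below uses a small made-up tibble (the values are illustrative, not taken from the dataset) to compare zero-filling with keeping NA and skipping it at summary time.

```r
library(dplyr)
library(tidyr)

# Illustrative data only: two year columns, one country with missing values.
toy_df <- tibble(
  Country = c("A", "B", "C"),
  `1960` = c(NA, 10, 20),
  `1961` = c(NA, 12, 24)
)

# Treatment used in this report: missing values become 0.
zero_filled <- toy_df |>
  mutate(across(`1960`:`1961`, ~ replace_na(.x, 0)))

# The zero counts as a data point in the mean.
mean_with_zeros <- mean(zero_filled$`1960`)

# Alternative: keep NA and skip it when summarising.
mean_ignoring_na <- mean(toy_df$`1960`, na.rm = TRUE)

mean_with_zeros
mean_ignoring_na
```

Which treatment is appropriate depends on the analysis. For the single-country line charts in this report the choice has no visible effect, because the US and China have values for every year.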

I converted the data from a wide data format to a long data format using pivot_longer.

long_world_GDP_df <- world_GDP_df |>
  pivot_longer(
    cols = c("1960":"2022"),
    names_to = "Year",
    values_to = "GDP"
  )
head(long_world_GDP_df, 70)
# A tibble: 70 × 4
   Country `Country Code` Year    GDP
   <chr>   <chr>          <chr> <dbl>
 1 Aruba   ABW            1960      0
 2 Aruba   ABW            1961      0
 3 Aruba   ABW            1962      0
 4 Aruba   ABW            1963      0
 5 Aruba   ABW            1964      0
 6 Aruba   ABW            1965      0
 7 Aruba   ABW            1966      0
 8 Aruba   ABW            1967      0
 9 Aruba   ABW            1968      0
10 Aruba   ABW            1969      0
# ℹ 60 more rows
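
In the output above, Year is stored as character, which is why the plots later wrap it in as.numeric(). As a minimal sketch on a made-up tibble (not the real dataset), pivot_longer() can convert the year names during the pivot with names_transform, and values_drop_na = TRUE is one way to drop country-years without a value instead of zero-filling them first.

```r
library(dplyr)
library(tidyr)

# Illustrative wide data with the same layout as the GDP file.
toy_wide <- tibble(
  Country = c("A", "B"),
  `Country Code` = c("AAA", "BBB"),
  `1960` = c(NA, 10),
  `1961` = c(5, 12)
)

toy_long <- toy_wide |>
  pivot_longer(
    cols = `1960`:`1961`,
    names_to = "Year",
    names_transform = list(Year = as.integer),  # store Year as integer, not character
    values_to = "GDP",
    values_drop_na = TRUE                       # drop country-years with no GDP value
  )

toy_long
```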

Next, I filtered the data to keep only the United States rows using the following code chunk. I then converted the GDP values to billions by dividing each value by 1,000,000,000.

US_only_GDP_df <- long_world_GDP_df |>
  filter(`Country` == 'United States')
head(US_only_GDP_df, 70)
# A tibble: 63 × 4
   Country       `Country Code` Year            GDP
   <chr>         <chr>          <chr>         <dbl>
 1 United States USA            1960   543300000000
 2 United States USA            1961   563300000000
 3 United States USA            1962   605100000000
 4 United States USA            1963   638600000000
 5 United States USA            1964   685800000000
 6 United States USA            1965   743700000000
 7 United States USA            1966   815000000000
 8 United States USA            1967   861700000000
 9 United States USA            1968   942500000000
10 United States USA            1969  1019900000000
# ℹ 53 more rows
US_only_GDP_df <- US_only_GDP_df |>
  mutate(GDP = as.numeric(GDP) / 1000000000)

US_only_GDP_df |>
  gt() |>
  cols_hide(columns = c(`Country`, `Country Code`)) |>
  tab_header(
    title = "US GDP by Year (billions)",
  )
US GDP by Year (billions)
Year GDP
1960 543.300
1961 563.300
1962 605.100
1963 638.600
1964 685.800
1965 743.700
1966 815.000
1967 861.700
1968 942.500
1969 1019.900
1970 1073.303
1971 1164.850
1972 1279.110
1973 1425.376
1974 1545.243
1975 1684.904
1976 1873.412
1977 2081.826
1978 2351.599
1979 2627.333
1980 2857.307
1981 3207.041
1982 3343.789
1983 3634.038
1984 4037.613
1985 4338.979
1986 4579.631
1987 4855.215
1988 5236.438
1989 5641.580
1990 5963.144
1991 6158.129
1992 6520.327
1993 6858.559
1994 7287.236
1995 7639.749
1996 8073.122
1997 8577.554
1998 9062.818
1999 9631.174
2000 10250.948
2001 10581.930
2002 10929.113
2003 11456.442
2004 12217.193
2005 13039.199
2006 13815.587
2007 14474.227
2008 14769.858
2009 14478.065
2010 15048.964
2011 15599.728
2012 16253.972
2013 16843.191
2014 17550.680
2015 18206.021
2016 18695.111
2017 19477.337
2018 20533.057
2019 21380.976
2020 21060.474
2021 23315.081
2022 25462.700

I created a line chart for the US GDP data using the following code chunk. This was done so that it is easier to see trends.

library(ggplot2)

ggplot(US_only_GDP_df, aes(x = as.numeric(Year), y = GDP)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(1960, 2022, by = 10)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  # GDP was divided by 1e9 above, so the y axis is in billions. No colour
  # aesthetic is mapped in this plot, so no colour label is set.
  labs(
    title = "US GDP Over Time",
    subtitle = "GDP (billions)",
    x = "Year",
    y = "GDP (billions)"
  )

The plot shows US GDP trending upward between 1960 and 2022, with brief dips in 2009 and 2020.
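
A line chart shows the overall direction, but year-over-year changes make individual declines easier to find. The sketch below copies four rows from the US table above (GDP in billions) and uses dplyr's lag() to compute the change from the previous year. The column names change and pct_change are my own choices for illustration.

```r
library(dplyr)

# Four consecutive rows copied from the US table above (GDP in billions).
us_sample <- tibble(
  Year = c(2007, 2008, 2009, 2010),
  GDP  = c(14474.227, 14769.858, 14478.065, 15048.964)
)

us_changes <- us_sample |>
  arrange(Year) |>
  mutate(
    change = GDP - lag(GDP),              # change from the previous year, in billions
    pct_change = 100 * change / lag(GDP)  # the same change as a percentage
  )

# Years in which GDP fell relative to the previous year.
declines <- us_changes |>
  filter(change < 0)

declines
```

Applying the same steps to the full US_only_GDP_df would need Year converted to a number first, since it is stored as character there.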

Next, I created a data frame of China’s GDP data using a filter. I also converted China’s GDP values to billions by dividing each value by 1,000,000,000.

China_only_GDP_df <- long_world_GDP_df |>
  filter(`Country` == 'China')
head(China_only_GDP_df, 70)
China_only_GDP_df <- China_only_GDP_df |>
  mutate(GDP = as.numeric(GDP) / 1000000000)

China_only_GDP_df |>
  gt() |>
  cols_hide(columns = c(`Country`, `Country Code`)) |>
  tab_header(
    title = "China GDP by Year (billions)",
  )
China GDP by Year (billions)
Year GDP
1960 59.71625
1961 50.05669
1962 47.20919
1963 50.70662
1964 59.70813
1965 70.43601
1966 76.72001
1967 72.88137
1968 70.84628
1969 79.70562
1970 92.60264
1971 99.80060
1972 113.68929
1973 138.54320
1974 144.18896
1975 163.42950
1976 153.93924
1977 174.93590
1978 149.54075
1979 178.28059
1980 191.14921
1981 195.86638
1982 205.08970
1983 230.68675
1984 259.94651
1985 309.48803
1986 300.75810
1987 272.97297
1988 312.35363
1989 347.76805
1990 360.85791
1991 383.37332
1992 426.91571
1993 444.73128
1994 564.32188
1995 734.48486
1996 863.74931
1997 961.60202
1998 1029.06071
1999 1094.01048
2000 1211.33163
2001 1339.40084
2002 1470.55757
2003 1660.28061
2004 1955.34681
2005 2285.96124
2006 2752.11854
2007 3550.32757
2008 4594.33679
2009 5101.69109
2010 6087.19172
2011 7551.54532
2012 8532.18562
2013 9570.47058
2014 10475.62478
2015 11061.57320
2016 11233.31402
2017 12310.49118
2018 13894.90749
2019 14279.96849
2020 14687.74356
2021 17820.45934
2022 17963.17052
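
The US and China steps above repeat the same filter-and-scale pattern. A minimal sketch of how the two steps could be wrapped in one function follows. The name country_gdp_billions is hypothetical and not part of the original code, and the tiny data frame stands in for long_world_GDP_df with 1960 values taken from the tables above (the China figure is rounded).

```r
library(dplyr)

# Hypothetical helper: keep one country's rows and express GDP in billions.
country_gdp_billions <- function(long_df, country) {
  long_df |>
    filter(Country == country) |>
    mutate(GDP = GDP / 1e9)
}

# Stand-in for long_world_GDP_df (two rows only, raw GDP values).
toy_long <- tibble(
  Country = c("United States", "China"),
  Year = c("1960", "1960"),
  GDP = c(543300000000, 59716250000)
)

us_toy <- country_gdp_billions(toy_long, "United States")
china_toy <- country_gdp_billions(toy_long, "China")

us_toy
china_toy
```

Using one function for both countries also avoids copy-paste slips, such as printing the wrong data frame after a filter.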

I created a line chart for China’s GDP data using the following code chunk. This was done so that it is easier to see trends.

library(ggplot2)

ggplot(China_only_GDP_df, aes(x = as.numeric(Year), y = GDP)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(1960, 2022, by = 10)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  # No colour aesthetic is mapped in this plot, so no colour label is set.
  labs(
    title = "China GDP Over Time (billions)",
    x = "Year",
    y = "GDP (billions)"
  )

The plot for China’s GDP also trends upward between 1960 and 2022, with occasional declines before 1990 and most of the growth coming after 2000.

Finally, I plotted the US and China GDP data together to make the two countries easier to compare.

US_only_GDP_df <- US_only_GDP_df |>
  mutate(Source = "US")

China_only_GDP_df <- China_only_GDP_df |>
  mutate(Source = "China")

combined_df <- bind_rows(US_only_GDP_df, China_only_GDP_df) 

ggplot(combined_df, aes(x = as.numeric(Year), y = GDP, color = Source, group = Source)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(1960, 2022, by = 10)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(
    title = "US and China GDP Over Time (billions)",
    x = "Year",
    y = "GDP (billions)",
    color = "Country"
  )

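Combining the plots for the US and China shows that US GDP is larger in every year and grew more in absolute terms, by about 24,900 billion compared with about 17,900 billion for China. In relative terms, however, China grew faster: its 2022 GDP is roughly 300 times its 1960 value, compared with roughly 47 times for the US.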
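
A "faster" growth rate can mean a larger absolute increase or a higher percentage growth rate, and the two measures can disagree. As a minimal sketch, the code below takes the 1960 and 2022 values from the two tables above and computes both the absolute increase and the compound annual growth rate, (end / start)^(1 / years) - 1. The function name cagr is my own.

```r
# Endpoint values (billions) copied from the US and China tables above.
us_1960 <- 543.300
us_2022 <- 25462.700
cn_1960 <- 59.71625
cn_2022 <- 17963.17052

# Compound annual growth rate over a number of years.
cagr <- function(start, end, years) {
  (end / start)^(1 / years) - 1
}

years <- 2022 - 1960

us_cagr <- cagr(us_1960, us_2022, years)
cn_cagr <- cagr(cn_1960, cn_2022, years)

# Absolute increase over the whole period, in billions.
us_increase <- us_2022 - us_1960
cn_increase <- cn_2022 - cn_1960

# Growth rates as percentages, then absolute increases.
round(100 * c(US = us_cagr, China = cn_cagr), 1)
round(c(US = us_increase, China = cn_increase))
```

On a shared linear axis the absolute gap dominates the picture, so a log-scaled y axis (for example ggplot2's scale_y_log10()) is one option when relative growth is the point of the comparison.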

Conclusion

The overall approach was similar for all three datasets, but each one had its own details to account for. One example is the scale of the values: the GDP numbers were very large, so I divided them by a billion, while the values for the takeout spending were easy to manage as they were.