Project 2 Code Base Submission Dataset 3

Author

Long Lin

Overview

For this project, I used three datasets from the Week 5 Discussion 5A post. I prepared each one by creating a .csv file and importing the data, then tidied and analyzed it. I also made sure that the code in the Quarto Markdown file is reproducible in a clean environment. The process follows what we did in Assignment 5A with the Airline Delays data, since the two assignments are very similar.

Dataset 3:

World GDP by Country: 1960-2022 posted by Sinem Kilicdere

source: https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Project%202/World%20GDP%20by%20Country%201960-2022.csv

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# tidyr and dplyr are already attached by library(tidyverse); only gt is extra
library(gt)

world_GDP_url <- "https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Project%202/World%20GDP%20by%20Country%201960-2022.csv"

world_GDP_df <- read_csv(
  file = world_GDP_url,
  show_col_types = FALSE,
  progress = FALSE
)

head(world_GDP_df)

# A tibble: 6 × 65
  Country   `Country Code`   `1960`   `1961`   `1962`   `1963`   `1964`   `1965`
  <chr>     <chr>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 Aruba     ABW            NA       NA       NA       NA       NA       NA      
2 Africa E… AFE             2.11e10  2.16e10  2.35e10  2.80e10  2.59e10  2.95e10
3 Afghanis… AFG             5.38e 8  5.49e 8  5.47e 8  7.51e 8  8.00e 8  1.01e 9
4 Africa W… AFW             1.04e10  1.12e10  1.20e10  1.27e10  1.39e10  1.49e10
5 Angola    AGO            NA       NA       NA       NA       NA       NA      
6 Albania   ALB            NA       NA       NA       NA       NA       NA      
# ℹ 57 more variables: `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
#   `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
#   `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
#   `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
#   `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
#   `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>,
#   `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, …

I handled the missing values in this dataset by replacing them with 0 using the following code chunk. A 0 here marks a year with no reported GDP, not an actual GDP of zero.

world_GDP_df <- world_GDP_df |>
  mutate(across("1960":"2022", ~replace_na(.x, 0)))

head(world_GDP_df)

# A tibble: 6 × 65
  Country `Country Code`  `1960`  `1961`  `1962`  `1963`  `1964`  `1965`  `1966`
  <chr>   <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 Aruba   ABW            0       0       0       0       0       0       0      
2 Africa… AFE            2.11e10 2.16e10 2.35e10 2.80e10 2.59e10 2.95e10 3.20e10
3 Afghan… AFG            5.38e 8 5.49e 8 5.47e 8 7.51e 8 8.00e 8 1.01e 9 1.40e 9
4 Africa… AFW            1.04e10 1.12e10 1.20e10 1.27e10 1.39e10 1.49e10 1.59e10
5 Angola  AGO            0       0       0       0       0       0       0      
6 Albania ALB            0       0       0       0       0       0       0      
# ℹ 56 more variables: `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
#   `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
#   `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
#   `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
#   `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
#   `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>,
#   `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, …
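
To make the replacement step easier to follow, here is a minimal sketch on a toy table. The table `toy` and its values are made up for illustration, and `where(is.numeric)` is used as an assumed alternative way to pick the year columns; it is not the selection used above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy wide table with missing values (hypothetical, not from the dataset)
toy <- tibble(
  Country = c("A", "B"),
  `1960` = c(NA, 2),
  `1961` = c(3, NA)
)

# Replace NA with 0 in every numeric column; Country is left untouched
filled <- toy |>
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))

filled
```

Selecting by type avoids hard-coding the first and last year, which may help if the file is later extended with more years.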

I converted the data from a wide data format to a long data format using pivot_longer.

long_world_GDP_df <- world_GDP_df |>
  pivot_longer(
    cols = c("1960":"2022"),
    names_to = "Year",
    values_to = "GDP"
  )
head(long_world_GDP_df, 70)

# A tibble: 70 × 4
   Country `Country Code` Year    GDP
   <chr>   <chr>          <chr> <dbl>
 1 Aruba   ABW            1960      0
 2 Aruba   ABW            1961      0
 3 Aruba   ABW            1962      0
 4 Aruba   ABW            1963      0
 5 Aruba   ABW            1964      0
 6 Aruba   ABW            1965      0
 7 Aruba   ABW            1966      0
 8 Aruba   ABW            1967      0
 9 Aruba   ABW            1968      0
10 Aruba   ABW            1969      0
# ℹ 60 more rows
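
The output above shows that Year is stored as character after pivoting. As a hedged sketch, `pivot_longer` can convert it while reshaping through its `names_transform` argument, and `values_drop_na` can skip missing cells instead of filling them. The toy table below is hypothetical and only illustrates the two arguments.

```r
library(tidyr)
library(tibble)

# Toy wide table with two year columns (hypothetical values)
toy_wide <- tibble(
  Country = c("A", "B"),
  `1960` = c(1, NA),
  `1961` = c(2, 3)
)

toy_long <- toy_wide |>
  pivot_longer(
    cols = -Country,
    names_to = "Year",
    values_to = "GDP",
    names_transform = list(Year = as.integer),  # store Year as integer
    values_drop_na = TRUE                        # drop cells with no value
  )

toy_long
```

With Year stored as an integer, later plotting code would not need `as.numeric(Year)`.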

Next, I filtered the data to keep only the US rows using the following code chunk. I also converted the GDP values to billions by dividing them by 1,000,000,000.

US_only_GDP_df <- long_world_GDP_df |>
  filter(`Country` == 'United States')
head(US_only_GDP_df, 70)

# A tibble: 63 × 4
   Country       `Country Code` Year            GDP
   <chr>         <chr>          <chr>         <dbl>
 1 United States USA            1960   543300000000
 2 United States USA            1961   563300000000
 3 United States USA            1962   605100000000
 4 United States USA            1963   638600000000
 5 United States USA            1964   685800000000
 6 United States USA            1965   743700000000
 7 United States USA            1966   815000000000
 8 United States USA            1967   861700000000
 9 United States USA            1968   942500000000
10 United States USA            1969  1019900000000
# ℹ 53 more rows

# GDP is already numeric, so divide directly to express it in billions
US_only_GDP_df <- US_only_GDP_df |>
  mutate(GDP = GDP / 1e9)

US_only_GDP_df |>
  gt() |>
  cols_hide(columns = c(`Country`, `Country Code`)) |>
  tab_header(
    title = "US GDP by Year (billions)",
  )

US GDP by Year (billions)
Year	GDP
1960	543.300
1961	563.300
1962	605.100
1963	638.600
1964	685.800
1965	743.700
1966	815.000
1967	861.700
1968	942.500
1969	1019.900
1970	1073.303
1971	1164.850
1972	1279.110
1973	1425.376
1974	1545.243
1975	1684.904
1976	1873.412
1977	2081.826
1978	2351.599
1979	2627.333
1980	2857.307
1981	3207.041
1982	3343.789
1983	3634.038
1984	4037.613
1985	4338.979
1986	4579.631
1987	4855.215
1988	5236.438
1989	5641.580
1990	5963.144
1991	6158.129
1992	6520.327
1993	6858.559
1994	7287.236
1995	7639.749
1996	8073.122
1997	8577.554
1998	9062.818
1999	9631.174
2000	10250.948
2001	10581.930
2002	10929.113
2003	11456.442
2004	12217.193
2005	13039.199
2006	13815.587
2007	14474.227
2008	14769.858
2009	14478.065
2010	15048.964
2011	15599.728
2012	16253.972
2013	16843.191
2014	17550.680
2015	18206.021
2016	18695.111
2017	19477.337
2018	20533.057
2019	21380.976
2020	21060.474
2021	23315.081
2022	25462.700

I created a line chart of the US GDP data using the following code chunk to make the trend easier to see.

library(ggplot2)

ggplot(US_only_GDP_df, aes(x = as.numeric(Year), y = GDP)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(1960, 2022, by = 10)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  # No colour aesthetic is mapped in this plot, so no colour label is set
  labs(
    title = "US GDP Over Time",
    subtitle = "GDP (billions)",
    x = "Year",
    y = "GDP (billions)"
  )

The plot shows US GDP trending upward between 1960 and 2022, with small dips in 2009 and 2020.

Next, I created a data frame of China’s GDP data by using a filter. I also converted China’s GDP values to billions by dividing them by 1,000,000,000.

China_only_GDP_df <- long_world_GDP_df |>
  filter(`Country` == 'China')
head(China_only_GDP_df, 70)

# GDP is already numeric, so divide directly to express it in billions
China_only_GDP_df <- China_only_GDP_df |>
  mutate(GDP = GDP / 1e9)

China_only_GDP_df |>
  gt() |>
  cols_hide(columns = c(`Country`, `Country Code`)) |>
  tab_header(
    title = "China GDP by Year (billions)",
  )

China GDP by Year (billions)
Year	GDP
1960	59.71625
1961	50.05669
1962	47.20919
1963	50.70662
1964	59.70813
1965	70.43601
1966	76.72001
1967	72.88137
1968	70.84628
1969	79.70562
1970	92.60264
1971	99.80060
1972	113.68929
1973	138.54320
1974	144.18896
1975	163.42950
1976	153.93924
1977	174.93590
1978	149.54075
1979	178.28059
1980	191.14921
1981	195.86638
1982	205.08970
1983	230.68675
1984	259.94651
1985	309.48803
1986	300.75810
1987	272.97297
1988	312.35363
1989	347.76805
1990	360.85791
1991	383.37332
1992	426.91571
1993	444.73128
1994	564.32188
1995	734.48486
1996	863.74931
1997	961.60202
1998	1029.06071
1999	1094.01048
2000	1211.33163
2001	1339.40084
2002	1470.55757
2003	1660.28061
2004	1955.34681
2005	2285.96124
2006	2752.11854
2007	3550.32757
2008	4594.33679
2009	5101.69109
2010	6087.19172
2011	7551.54532
2012	8532.18562
2013	9570.47058
2014	10475.62478
2015	11061.57320
2016	11233.31402
2017	12310.49118
2018	13894.90749
2019	14279.96849
2020	14687.74356
2021	17820.45934
2022	17963.17052
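
Because the US and China steps differ only in the country name, the filter and the conversion to billions could be wrapped in one function. The sketch below is a minimal illustration: `country_gdp_billions` is a hypothetical helper that is not part of the original code, and the toy table uses the 1960 values from the tables above (China’s value is rounded).

```r
library(dplyr)
library(tibble)

# Hypothetical helper: keep one country's rows and express GDP in billions
country_gdp_billions <- function(long_df, country_name) {
  long_df |>
    filter(Country == country_name) |>
    mutate(GDP = GDP / 1e9)
}

# Toy long table with the 1960 values (China's value is rounded)
toy_long <- tibble(
  Country = c("United States", "China"),
  Year = c("1960", "1960"),
  GDP = c(543300000000, 59716250000)
)

us_toy <- country_gdp_billions(toy_long, "United States")
china_toy <- country_gdp_billions(toy_long, "China")

us_toy
china_toy
```

Applied to the real data, the same call with `long_world_GDP_df` would be expected to produce the two country data frames in one line each.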

I created a line chart of China’s GDP data using the following code chunk to make the trend easier to see.

library(ggplot2)

ggplot(China_only_GDP_df, aes(x = as.numeric(Year), y = GDP)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(1960, 2022, by = 10)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  # No colour aesthetic is mapped in this plot, so no colour label is set
  labs(
    title = "China GDP Over Time (billions)",
    x = "Year",
    y = "GDP (billions)"
  )

The plot shows China’s GDP also trending upward between 1960 and 2022, with most of the growth coming after the mid-1990s.

Finally, I plotted the US and China GDP data together to make the two countries easier to compare.

US_only_GDP_df <- US_only_GDP_df |>
  mutate(Source = "US")

China_only_GDP_df <- China_only_GDP_df |>
  mutate(Source = "China")

combined_df <- bind_rows(US_only_GDP_df, China_only_GDP_df) 

ggplot(combined_df, aes(x = as.numeric(Year), y = GDP, color = Source, group = Source)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(1960, 2022, by = 10)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(
    title = "US and China GDP Over Time (billions)",
    x = "Year",
    y = "GDP (billions)",
    color = "Country"
  )

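Combining the plots shows that US GDP is higher than China’s in every year and grew by more in absolute terms between 1960 and 2022 (about 24,900 billion versus about 17,900 billion). In relative terms, however, China’s GDP grew faster: roughly 300-fold versus roughly 47-fold for the US, with most of the catch-up occurring after 2000.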
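
One way to compare growth rates rather than levels is to compute the fold change and the compound annual growth rate from the 1960 and 2022 values in the two tables above. This is a rough sketch using only those endpoint values; the variable names are illustrative.

```r
# 1960 and 2022 GDP in billions, copied from the two tables above
gdp_1960 <- c(US = 543.300, China = 59.71625)
gdp_2022 <- c(US = 25462.700, China = 17963.17052)
years <- 2022 - 1960

# How many times larger GDP is in 2022 than in 1960
fold_change <- gdp_2022 / gdp_1960

# Compound annual growth rate implied by the two endpoints
cagr <- (gdp_2022 / gdp_1960)^(1 / years) - 1

round(fold_change, 1)
round(100 * cagr, 1)  # percent per year
```

Endpoint-based rates ignore the path between 1960 and 2022, so they summarize the overall change rather than any single period.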

Conclusion

The approach for each of the three datasets was broadly similar, but each dataset had its own factors to account for, such as the scale of the values. The GDP numbers were very large, so I divided them by a billion, while the values for the takeout spending were easy to manage as they were.