Proj2_2

Author

ZIHAO YU

1. How will I tackle the problem?

Upload the dataset to GitHub and read it into R from the raw file URL. After cleaning and analyzing the data, generate visualizations in code and draw conclusions.

2. What data challenges do I anticipate?

Data cleaning may be challenging. If the data is complex and requires creating appropriate charts, I would utilize an LLM to assist with this task.

source: “https://github.com/XxY-coder/data607-Proj.2Y/raw/refs/heads/main/wide_format_co2_emission_dataset.csv”

3. Data source

The dataset is from Kaggle: “https://www.kaggle.com/datasets/mabdullahsajid/tracking-global-co2-emissions-1990-2023”

I used the janitor package to clean and standardize the column names.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

data_2 <- read.csv("https://github.com/XxY-coder/data607-Proj.2Y/raw/refs/heads/main/wide_format_co2_emission_dataset.csv") |>
  clean_names()
names(data_2)

 [1] "country" "x1990"   "x1991"   "x1992"   "x1993"   "x1994"   "x1995"  
 [8] "x1996"   "x1997"   "x1998"   "x1999"   "x2000"   "x2001"   "x2002"  
[15] "x2003"   "x2004"   "x2005"   "x2006"   "x2007"   "x2008"   "x2009"  
[22] "x2010"   "x2011"   "x2012"   "x2013"   "x2014"   "x2015"   "x2018"  
[29] "x2021"
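The "x" prefix in these names is worth a short illustration. The sketch below uses a hypothetical two-row CSV (not the real file) to show that read.csv() prepends "X" to headers that start with a digit. Lowercasing is used only as a base-R stand-in for what clean_names() does to these particular names.

```r
# Hypothetical two-row CSV with bare year headers, standing in for the real file.
csv_text <- "Country,1990,1991\nA,0.2,0.2\nB,2.3,1.2\n"

# read.csv() uses check.names = TRUE by default, which passes the headers
# through make.names(); a header starting with a digit gets an "X" prefix.
raw <- read.csv(text = csv_text)
raw_names <- names(raw)

# Base-R stand-in for the lowercasing that clean_names() applies to these names.
clean <- tolower(raw_names)

# With check.names = FALSE the headers are kept exactly as written.
kept_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(raw_names)
print(clean)
print(kept_names)
```

Keeping the default is reasonable here, because the prefix is easy to strip once the years are moved into a single column.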

4. Raw data structure

There are 199 rows and 29 columns.

dim(data_2)

[1] 199  29

head(data_2)

              country x1990 x1991 x1992 x1993 x1994 x1995 x1996 x1997 x1998
1         Afghanistan   0.2   0.2   0.1   0.1   0.1   0.1   0.1   0.1   0.1
2             Albania   2.3   1.2   0.7   0.7   0.6   0.7   0.6   0.5   0.6
3             Algeria     3     3     3     3     3   3.3   3.3     3   3.5
4              Angola   0.4   0.4   0.4   0.5   0.3   0.9   0.8   0.6   0.5
5 Antigua and Barbuda   4.9   4.7   4.6   4.8   4.7   4.7   4.6   4.7   4.5
6           Argentina   3.5   3.6   3.6   3.5   3.6   3.5   3.7   3.8   3.8
  x1999 x2000 x2001 x2002 x2003 x2004 x2005 x2006 x2007 x2008 x2009 x2010 x2011
1     0   0.0   0.0   0.0   0.0   0.0   0.1   0.1   0.1   0.2   0.2   0.3   0.4
2     1   1.0   1.1   1.2   1.4   1.6   1.4   1.3   1.5   1.6   1.5   1.5   1.6
3     3   2.8   2.7   2.8   2.9   2.7   3.2   3.0   3.2   3.2   3.4   3.3   3.3
4   0.7   0.7   0.7   0.8   0.6   1.1   1.2   1.2   1.4   1.4   1.4   1.4   1.4
5   4.6   4.5   4.4   4.5   4.8   4.9   4.9   5.0   5.1   5.2   5.9     6   5.8
6     4   3.8   3.8   3.5   3.8   4.1   4.1   4.4   4.7   4.8   4.4   4.3   4.6
  x2012 x2013 x2014 x2015 x2018 x2021
1   0.4   0.3   0.3   0.3   0.3  8.35
2   1.7   1.7     2   1.6   1.6  4.59
3   3.5   3.5   3.7   3.9   3.9   173
4   1.3   1.3   1.3   1.2   1.0 24.45
5   5.4   5.4   5.4   6.2   6.2  0.78
6   4.6   4.5   4.7   4.9   4.7   189

glimpse(data_2)

Rows: 199
Columns: 29
$ country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Antigua and Ba…
$ x1990   <chr> "0.2", "2.3", "3", "0.4", "4.9", "3.5", "-", "29.1", "17.2", "…
$ x1991   <chr> "0.2", "1.2", "3", "0.4", "4.7", "3.6", "-", "29.4", "16.6", "…
$ x1992   <chr> "0.1", "0.7", "3", "0.4", "4.6", "3.6", "1.1", "25", "16.9", "…
$ x1993   <chr> "0.1", "0.7", "3", "0.5", "4.8", "3.5", "0.9", "24.3", "17.2",…
$ x1994   <chr> "0.1", "0.6", "3", "0.3", "4.7", "3.6", "0.9", "23.2", "17.1",…
$ x1995   <dbl> 0.1, 0.7, 3.3, 0.9, 4.7, 3.5, 1.1, 22.5, 17.1, 7.6, 4.3, 6.0, …
$ x1996   <chr> "0.1", "0.6", "3.3", "0.8", "4.6", "3.7", "0.8", "22.1", "18.1…
$ x1997   <chr> "0.1", "0.5", "3", "0.6", "4.7", "3.8", "1", "22", "18", "7.7"…
$ x1998   <chr> "0.1", "0.6", "3.5", "0.5", "4.5", "3.8", "1.1", "19.5", "18.7…
$ x1999   <chr> "0", "1", "3", "0.7", "4.6", "4", "1", "19.3", "17.3", "7.7", …
$ x2000   <dbl> 0.0, 1.0, 2.8, 0.7, 4.5, 3.8, 1.1, 24.9, 17.2, 7.7, 3.8, 5.6, …
$ x2001   <dbl> 0.0, 1.1, 2.7, 0.7, 4.4, 3.8, 1.2, 24.4, 16.6, 7.9, 3.6, 5.2, …
$ x2002   <dbl> 0.0, 1.2, 2.8, 0.8, 4.5, 3.5, 1.0, 24.1, 17.1, 8.1, 3.6, 5.1, …
$ x2003   <dbl> 0.0, 1.4, 2.9, 0.6, 4.8, 3.8, 1.1, 23.6, 17.1, 8.6, 3.8, 4.8, …
$ x2004   <dbl> 0.0, 1.6, 2.7, 1.1, 4.9, 4.1, 1.2, 23.1, 16.9, 8.5, 4.0, 5.3, …
$ x2005   <dbl> 0.1, 1.4, 3.2, 1.2, 4.9, 4.1, 1.4, 22.9, 17.9, 8.8, 4.2, 4.9, …
$ x2006   <dbl> 0.1, 1.3, 3.0, 1.2, 5.0, 4.4, 1.5, 22.5, 18.0, 8.7, 4.1, 4.5, …
$ x2007   <dbl> 0.1, 1.5, 3.2, 1.4, 5.1, 4.7, 1.7, 23.0, 18.1, 8.3, 4.9, 4.5, …
$ x2008   <chr> "0.2", "1.6", "3.2", "1.4", "5.2", "4.8", "1.9", "21.7", "18.3…
$ x2009   <chr> "0.2", "1.5", "3.4", "1.4", "5.9", "4.4", "1.5", "21.5", "18.3…
$ x2010   <chr> "0.3", "1.5", "3.3", "1.4", "6", "4.3", "1.4", "24.2", "16.7",…
$ x2011   <chr> "0.4", "1.6", "3.3", "1.4", "5.8", "4.6", "1.7", "23.9", "16.5…
$ x2012   <chr> "0.4", "1.7", "3.5", "1.3", "5.4", "4.6", "2", "13.2", "17.1",…
$ x2013   <chr> "0.3", "1.7", "3.5", "1.3", "5.4", "4.5", "1.9", "8.4", "16.1"…
$ x2014   <chr> "0.3", "2", "3.7", "1.3", "5.4", "4.7", "1.9", "8.4", "15.4", …
$ x2015   <dbl> 0.3, 1.6, 3.9, 1.2, 6.2, 4.9, 1.7, 9.2, 16.9, 8.0, 3.4, 7.7, 2…
$ x2018   <dbl> 0.3, 1.6, 3.9, 1.0, 6.2, 4.7, 2.0, 9.3, 16.8, 8.2, 3.5, 7.7, 2…
$ x2021   <chr> "8.35", "4.59", "173", "24.45", "0.78", "189", "6.77", "1.27",…
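The glimpse() output shows many year columns typed as <chr>, and at least one of them contains "-". A minimal sketch for listing the entries that block numeric conversion is given below. It runs on a small hypothetical data frame (the countries and values are made up); on the real data the same steps could be applied to data_2.

```r
# Hypothetical data frame: one character year column with a "-" placeholder,
# one numeric year column.
toy <- data.frame(
  country = c("A", "B", "C"),
  x1990 = c("0.2", "2.3", "-"),
  x1995 = c(0.1, 0.7, 3.3),
  stringsAsFactors = FALSE
)

# Pick out the year columns that were read as character.
year_cols <- setdiff(names(toy), "country")
chr_cols <- year_cols[vapply(toy[year_cols], is.character, logical(1))]

# For each of them, list the distinct entries that fail numeric conversion.
bad_tokens <- lapply(toy[chr_cols], function(x) {
  converted <- suppressWarnings(as.numeric(x))
  unique(x[is.na(converted) & !is.na(x)])
})

print(chr_cols)
print(bad_tokens)
```

Knowing the exact tokens makes it possible to treat them as missing values on purpose instead of relying on a coercion warning.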

5. Transformation steps

Several year columns were read as character (<chr>) rather than numeric because they contain non-numeric placeholders such as “-”, so all values must be converted to a single numeric type before analysis.
The table also needs to be reshaped to long format, with the ‘x’ prefix removed from the year names.

Co2_emission <-
  data_2 |>
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "co2_emission",
    values_transform = list(co2_emission = as.character)
  ) |>
  mutate(
    year = parse_number(year),
    co2_emission = as.numeric(co2_emission)
  ) |>
  drop_na(co2_emission)

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `co2_emission = as.numeric(co2_emission)`.
Caused by warning:
! NAs introduced by coercion

Co2_emission

# A tibble: 5,445 × 3
   country      year co2_emission
   <chr>       <dbl>        <dbl>
 1 Afghanistan  1990          0.2
 2 Afghanistan  1991          0.2
 3 Afghanistan  1992          0.1
 4 Afghanistan  1993          0.1
 5 Afghanistan  1994          0.1
 6 Afghanistan  1995          0.1
 7 Afghanistan  1996          0.1
 8 Afghanistan  1997          0.1
 9 Afghanistan  1998          0.1
10 Afghanistan  1999          0  
# ℹ 5,435 more rows

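After cleaning the year column, co2_emission was still a character column, so I added the line “co2_emission = as.numeric(co2_emission)”. The conversion turns non-numeric entries into NA, which is what the warning above reports, and drop_na() then removes those rows.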
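One possible variation, sketched on hypothetical data, converts the placeholder to NA explicitly so that as.numeric() raises no warning. It assumes "-" is the only non-numeric token, which has not been verified for the full dataset. It also uses the names_prefix and names_transform arguments of pivot_longer() to clean the year during the pivot.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one character column with "-", one numeric column.
toy <- tibble::tibble(
  country = c("A", "B"),
  x1990 = c("0.2", "-"),
  x1995 = c(0.1, 0.7)
)

long <- toy |>
  pivot_longer(
    cols = -country,
    names_to = "year",
    names_prefix = "x",                         # strip the leading "x"
    names_transform = list(year = as.integer),  # "1990" -> 1990L
    values_to = "co2_emission",
    values_transform = list(co2_emission = as.character)
  ) |>
  mutate(
    co2_emission = na_if(co2_emission, "-"),    # assumed placeholder for missing
    co2_emission = as.numeric(co2_emission)
  ) |>
  drop_na(co2_emission)

print(long)
```

If other tokens exist in the real file, as.numeric() would still warn, which is a useful signal that the placeholder list is incomplete.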

dim(Co2_emission)

[1] 5445    3

There are 5445 rows and 3 columns.

6. Analysis and Conclusions

Total CO2 emissions per year, summed across all countries in the dataset, are shown in a horizontal bar chart.

overall_Co2_emission <-
  Co2_emission |>
  group_by(year) |>
  summarize(total_emission = sum(co2_emission, na.rm = TRUE))

head(overall_Co2_emission)

# A tibble: 6 × 2
   year total_emission
  <dbl>          <dbl>
1  1990           805.
2  1991           821.
3  1992           964.
4  1993           976.
5  1994           971.
6  1995           969.
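Because rows with missing values were dropped, each yearly total may be summed over a different number of countries. The sketch below, on hypothetical data, adds a country count and a mean next to the total so that years can be compared more fairly. The column names n_countries and mean_emission are illustrative.

```r
library(dplyr)

# Hypothetical long table: country "C" has no value for 1990.
toy_long <- tibble::tibble(
  country = c("A", "B", "A", "B", "C"),
  year = c(1990, 1990, 1992, 1992, 1992),
  co2_emission = c(1.0, 2.0, 1.0, 2.0, 3.0)
)

by_year <- toy_long |>
  group_by(year) |>
  summarize(
    n_countries = n_distinct(country),   # how many countries contribute
    total_emission = sum(co2_emission),
    mean_emission = mean(co2_emission)
  )

print(by_year)
```

In this toy case the total doubles between the two years, but part of that comes from one extra country reporting, which the count column makes visible.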

ggplot(
  overall_Co2_emission,
  aes(x = factor(year), y = total_emission, fill = total_emission)
) +
  geom_col() +
  geom_text(
    aes(label = round(total_emission, 1)),
    hjust = -0.1,
    size = 3
  ) +
  coord_flip() +
  labs(
    title = "CO2 Emissions from 1990 to 2021",
    x = "Year",
    y = "CO2 Emission"
  ) +
  theme_minimal() +
  scale_fill_gradient(low = "skyblue", high = "darkblue") +
  expand_limits(y = max(overall_Co2_emission$total_emission) * 1.1)

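The chart shows that total CO2 emissions remained relatively stable from 1990 to 2015, with most years between 800 and 1,100. The 2021 total is 13,821.3, far higher than any other year. This jump should be read with caution: the 2021 column appears to be on a different scale from the earlier years (in the rows shown above, Algeria goes from 3.9 in 2018 to 173 in 2021, while Antigua and Barbuda falls from 6.2 to 0.78), so it may reflect a change in measurement rather than a real increase. In addition, the dataset has no columns for 2016, 2017, 2019, or 2020, and its last year is 2021, so the recent trend cannot be determined accurately.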
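The gaps in the year coverage can be listed directly. This short sketch takes the years from the column names printed by names(data_2) earlier and compares them with the full 1990 to 2021 range.

```r
# Years present in the file, taken from the column names shown earlier:
# x1990 through x2015, then x2018 and x2021.
years_present <- c(1990:2015, 2018, 2021)

# Years in the 1990-2021 range that have no column.
missing_years <- setdiff(1990:2021, years_present)

print(missing_years)
```

On the cleaned table the same check could be written as setdiff(1990:2021, unique(Co2_emission$year)).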