TidyTuesay-Emmissions.knit

# read  file from our current directory
edata <- read.csv("emissions_medium_granularity.csv")

# display csv file
head(edata)

##   year                            parent_entity        parent_type
## 1 2005 Shaanxi Coal and Chemical Industry Group State-owned Entity
## 2 2006 Shaanxi Coal and Chemical Industry Group State-owned Entity
## 3 2007 Shaanxi Coal and Chemical Industry Group State-owned Entity
## 4 2008 Shaanxi Coal and Chemical Industry Group State-owned Entity
## 5 2009 Shaanxi Coal and Chemical Industry Group State-owned Entity
## 6 2010 Shaanxi Coal and Chemical Industry Group State-owned Entity
##         commodity production_value   production_unit total_emissions_MtCO2e
## 1 Bituminous Coal           24.656 Million tonnes/yr               66.91981
## 2 Bituminous Coal           30.920 Million tonnes/yr               83.92118
## 3 Bituminous Coal           40.208 Million tonnes/yr              109.13010
## 4 Bituminous Coal           48.320 Million tonnes/yr              131.14720
## 5 Bituminous Coal           56.800 Million tonnes/yr              154.16310
## 6 Bituminous Coal           80.312 Million tonnes/yr              217.97793

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tinytex)
library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(hrbrthemes)
library(viridis)

## Loading required package: viridisLite

# print total number of columns
cat("Total Columns: ", ncol(edata))

## Total Columns:  7

# print total number of rows
cat("Total Rows:", nrow(edata))

## Total Rows: 14891

#do both at once
dim(edata)

## [1] 14891     7

names(edata)

## [1] "year"                   "parent_entity"          "parent_type"           
## [4] "commodity"              "production_value"       "production_unit"       
## [7] "total_emissions_MtCO2e"

The phrase ‘MtCO2e’ is the unit ‘Metric tons of carbon dioxide equivalent’ where the effect of a greenhouse gas has been standardized to the unit of carbon dioxide.

#Renaming emissions column into something I can remember
data <- edata %>%
  rename(total_emissions = total_emissions_MtCO2e)

str(data)

## 'data.frame':    14891 obs. of  7 variables:
##  $ year            : int  2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 ...
##  $ parent_entity   : chr  "Shaanxi Coal and Chemical Industry Group" "Shaanxi Coal and Chemical Industry Group" "Shaanxi Coal and Chemical Industry Group" "Shaanxi Coal and Chemical Industry Group" ...
##  $ parent_type     : chr  "State-owned Entity" "State-owned Entity" "State-owned Entity" "State-owned Entity" ...
##  $ commodity       : chr  "Bituminous Coal" "Bituminous Coal" "Bituminous Coal" "Bituminous Coal" ...
##  $ production_value: num  24.7 30.9 40.2 48.3 56.8 ...
##  $ production_unit : chr  "Million tonnes/yr" "Million tonnes/yr" "Million tonnes/yr" "Million tonnes/yr" ...
##  $ total_emissions : num  66.9 83.9 109.1 131.1 154.2 ...

summary(data)

##       year      parent_entity      parent_type         commodity        
##  Min.   :1854   Length:14891       Length:14891       Length:14891      
##  1st Qu.:1977   Class :character   Class :character   Class :character  
##  Median :2000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1992                                                           
##  3rd Qu.:2013                                                           
##  Max.   :2023                                                           
##  production_value    production_unit    total_emissions    
##  Min.   :4.400e-03   Length:14891       Min.   :   0.0003  
##  1st Qu.:8.162e+00   Class :character   1st Qu.:   9.0211  
##  Median :4.062e+01   Mode  :character   Median :  30.5957  
##  Mean   :3.575e+02                      Mean   :  93.2057  
##  3rd Qu.:2.337e+02                      3rd Qu.:  91.2782  
##  Max.   :2.719e+04                      Max.   :3797.9593

(What do these stats mean and imply) (Note the difference between emissions mean and 3rd quartile and max) (Note the same with producion value. How do we visualize these) (what is the unit of the production value? currency?)

library(vtable)

## Loading required package: kableExtra

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

st(data)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
year	14891	1992	29	1854	1977	2013	2023
parent_type	14891
… Investor-owned Company	7751	52%
… Nation State	1400	9%
… State-owned Entity	5740	39%
production_value	14891	358	1263	0.0044	8.2	234	27192
production_unit	14891
… Bcf/yr	3639	24%
… Million bbl/yr	3923	26%
… Million Tonnes CO2	290	2%
… Million tonnes/yr	7039	47%
total_emissions	14891	93	200	0.00032	9	91	3798

#Check for missing values
sum(is.na(data))

## [1] 0

No missing values so this is a high quality dataset

t = unique(data) #new dataset with only the unique rows
dim(t) #no. of cols and rows in the unique dataset

## [1] 14891     7

Checking for duplicates returns a dataset with the same dimensions so we have confimed that the original has no duplicates. (A dataset with columns like these will be hard-pressed to have any duplicates)

 min_data <- min(data$production_value)  
 max_data <- max(data$production_value) 
 
 print(min_data)

## [1] 0.004398386

 print(max_data)

## [1] 27192

 min_data1 <- min(data$total_emissions)  
 max_data1 <- max(data$total_emissions)
 
 print(min_data1)

## [1] 0.0003206

 print(max_data1)

## [1] 3797.959

The Categorical Data Types all have distinct entries. ‘year’ is treated as a numeric type by the program but is categorical as it is not a continuous value but rather distinct records of occurrences within the dataset

Companies <- data %>% distinct(parent_entity) %>% arrange(parent_entity)
print(Companies)

paste("There are", n_distinct(data$parent_entity), "companies in this data set")

## [1] "There are 180 companies in this data set"

(Talk briefly about the companies)

Company_Types <- data %>% distinct(parent_type) %>% arrange(parent_type)
print(Company_Types)

##              parent_type
## 1 Investor-owned Company
## 2           Nation State
## 3     State-owned Entity

(What does each entity type mean and do for this set; what does it mean for emmissions, production value, unit, commodity and area of operation)

Commodities <- data %>% distinct(commodity) %>% arrange(commodity)
print(Commodities)

##             commodity
## 1     Anthracite Coal
## 2     Bituminous Coal
## 3              Cement
## 4        Lignite Coal
## 5  Metallurgical Coal
## 6         Natural Gas
## 7           Oil & NGL
## 8 Sub-Bituminous Coal
## 9        Thermal Coal

(one line summary fo each commodity what it does and its contribution to emissions an the climate crisis)

Unit <- data %>% distinct(production_unit) %>% arrange(production_unit)
print(Unit)

##      production_unit
## 1             Bcf/yr
## 2 Million Tonnes CO2
## 3     Million bbl/yr
## 4  Million tonnes/yr

(Explain units)

Year <- data %>% distinct(year) %>% arrange(year)

(170 years on record)

#Looking for likely Nigerian Companies
NG_Companies <- data[grep("Nigeria", data$parent_entity), ]
print(NG_Companies)

#(Other Nigerian-owned companies not just companies with a subsidiary in the nation)

#Graphing
# Install and load necessary packages
library(ggplot2)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Testing the graph first with the year I was born

filter(data, year == 1999) %>%
  ggplot(aes(production_value, total_emissions, color = commodity)) + 
  geom_point()

filter(data, year%in%c(1854, 2023)) %>%
  ggplot(aes(production_value, total_emissions, color = commodity)) + 
  geom_point() +
  facet_grid(. ~ year)

ggplot(data, aes(production_value, total_emissions)) +
  geom_point(aes(colour = commodity))

 data %>%
  ggplot(aes(commodity)) + 
  coord_flip() + 
  geom_bar(color = "darkorange")

df <- data.frame('PV' = data$production_value,
                 'TE' = data$total_emissions)

scaled_df <- scale(df)

head(scaled_df)

##              PV          TE
## [1,] -0.2634585 -0.13144665
## [2,] -0.2585004 -0.04642884
## [3,] -0.2511487  0.07963206
## [4,] -0.2447279  0.18973176
## [5,] -0.2380158  0.30482612
## [6,] -0.2194055  0.62394151

ggplot(scaled_df, aes(x=PV, y = TE)) +
  geom_bin2d(aes(fill = ..count..), binwidth = c(0.5,0.2)) +
  scale_fill_gradient(name = "Density", low = "darkseagreen1", high = "darkseagreen4") +
  labs(title = "Production Value as Correlates to Emissions", x = "Production Value of the Commodity", y = "Emission Recorded") +
  facet_wrap(~data$commodity) +
  theme_minimal()

## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplot(data=data, aes(x=year, fill=total_emissions)) +
  geom_histogram(binwidth = 10, alpha=0.5, position="identity", color = 'darkblue')

## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

ggplot(data, aes(x = total_emissions, fill = parent_type)) + 
  geom_histogram(binwidth = 70, position = "dodge")

(Do Nation States and State Entities have less because they are younger or do Investor-Owned Companies pump out that much)

data %>%
  ggplot(aes(total_emissions, ..density..)) +
  geom_histogram(binwidth = 30, color = "red") + 
  facet_grid(parent_type~.)

(Remember maximum value from total_emissions = 3797.959 MtCO2e How many values are above 3,000)

filter(data, total_emissions > 3000)

##   year parent_entity  parent_type       commodity production_value
## 1 2004  China (Coal) Nation State Bituminous Coal         1399.324
## 2 2003  China (Coal) Nation State Bituminous Coal         1209.651
##     production_unit total_emissions
## 1 Million tonnes/yr        3797.959
## 2 Million tonnes/yr        3283.160

ggplot(data) + 
  aes(x = commodity, fill = production_value, y = total_emissions) +
  coord_flip() +
  geom_col()

ggplot(data) + 
  aes(x = commodity, fill = total_emissions, y = production_value) +
  coord_flip() +
  geom_col()

data_years <- c(data$year)

data_years1 <- discretize(data_years, method='frequency', breaks=10, labels=NULL, include.lowest=TRUE, right=FALSE)

ggplot(data,aes(x=year,y=production_value)) + geom_line()

ggplot(data,aes(x=year,y=total_emissions)) + geom_line()

ggplot(data, aes(x= data_years1, y = total_emissions, color = commodity)) + 
    geom_point() + 
    coord_flip() +
    geom_jitter() +
    geom_smooth(method = 'lm')

## `geom_smooth()` using formula = 'y ~ x'

ggplot(data, aes(x= data_years1, y = production_value, color = commodity)) + 
    geom_point() + 
    coord_flip() +
    geom_jitter() +
    geom_smooth(method = 'lm')

## `geom_smooth()` using formula = 'y ~ x'

ggplot(data, aes(x= data_years1, y = total_emissions, color = parent_type)) + 
    geom_point() + 
    coord_flip() +
    geom_jitter() +
    geom_smooth(method = 'lm')

## `geom_smooth()` using formula = 'y ~ x'

ggplot(data, aes(x= data_years1, y = production_value, color = parent_type)) + 
    geom_point() + 
    coord_flip() +
    geom_jitter() +
    geom_smooth(method = 'lm')

## `geom_smooth()` using formula = 'y ~ x'

ggplot(NG_Companies, aes(x = total_emissions, y = year, color = commodity)) + 
    geom_point() + 
    coord_flip() +
    geom_jitter() +
    geom_smooth(method = 'lm')

## `geom_smooth()` using formula = 'y ~ x'

ggplot(data=data, aes(x=year, fill=production_value)) +
  geom_histogram(binwidth = 10, alpha=0.5, position="identity", color = 'darkgreen')

## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

ggplot(data, aes (x = year, y = total_emissions)) +
  geom_line()

#Mean 

#copy, section out for year, commodity, type, entity

select(data, year, production_value, total_emissions)
#print(mean_years)

mean_years =
  data %>%
  group_by(year) %>%
  summarize(mean_pv = mean(production_value),
            mean_te = mean(total_emissions))

print(mean_years)

## # A tibble: 170 × 3
##     year mean_pv mean_te
##    <int>   <dbl>   <dbl>
##  1  1854  0.0170  0.0331
##  2  1855  0.0221  0.0430
##  3  1856  0.0272  0.0529
##  4  1857  0.0316  0.0615
##  5  1858  0.0361  0.0701
##  6  1859  0.0405  0.0787
##  7  1860  0.0449  0.0873
##  8  1861  0.0493  0.0959
##  9  1862  0.0537  0.105 
## 10  1863  0.0582  0.113 
## # ℹ 160 more rows

ggplot(mean_years, aes (x = year, y = mean_te)) +
  geom_line()

ggplot(mean_years, aes (x = year, y = mean_pv)) +
  geom_line()

#line stacked

# Create an empty plot
plot(mean_years$year, mean_years$mean_pv, type = "n", 
     xlab = "X", ylab = "Y", main = "Production Value and Emissions Over the Years")

# Plot each line one by one
lines(mean_years$year, mean_years$mean_pv, type = "l", col = "black")
lines(mean_years$year, mean_years$mean_te, type = "l", col = "red")


# Add a legend
legend("topright", legend = c("Mean Production Value", "Mean Emissions"), 
       col = c("black", "red"), lty = 1)

library(treemapify)

mean_comm =
  data %>%
  group_by(commodity) %>%
  summarize(mean_pvc = mean(production_value),
            mean_tec = mean(total_emissions))

print(mean_comm)

## # A tibble: 9 × 3
##   commodity           mean_pvc mean_tec
##   <chr>                  <dbl>    <dbl>
## 1 Anthracite Coal         19.2     56.1
## 2 Bituminous Coal         54.2    147. 
## 3 Cement                 206.     109. 
## 4 Lignite Coal            23.0     30.8
## 5 Metallurgical Coal      16.0     47.3
## 6 Natural Gas           1013.      73.8
## 7 Oil & NGL              346.     138. 
## 8 Sub-Bituminous Coal     40.7     82.2
## 9 Thermal Coal            26.1     61.8

ggplot2::ggplot(mean_comm,aes(area=mean_pvc,fill=commodity,label=commodity))+
  treemapify::geom_treemap(layout="squarified")+
  geom_treemap_text(place = "centre",size = 12)+
  labs(title="Production Value Tree Plot of Commodities")

ggplot2::ggplot(mean_comm,aes(area=mean_tec,fill=commodity,label=commodity))+
  treemapify::geom_treemap(layout="squarified")+
  geom_treemap_text(place = "centre",size = 12)+
  labs(title="Emissions per Commodity Tree Plot")

mean_type =
  data %>%
  group_by(parent_type) %>%
  summarize(mean_pvt = mean(production_value),
            mean_tet = mean(total_emissions))

print(mean_type)

## # A tibble: 3 × 3
##   parent_type            mean_pvt mean_tet
##   <chr>                     <dbl>    <dbl>
## 1 Investor-owned Company     272.     62.2
## 2 Nation State               403.    215. 
## 3 State-owned Entity         462.    105.

#Doughnut Chart

# Compute percentages
mean_type$fraction = mean_type$mean_tet / sum(mean_type$mean_tet)

# Compute the cumulative percentages (top of each rectangle)
mean_type$ymax = cumsum(mean_type$fraction)

# Compute the bottom of each rectangle
mean_type$ymin = c(0, head(mean_type$ymax, n=-1))

ggplot(mean_type, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=parent_type)) +
     geom_rect() +
     coord_polar(theta="y") + # Try to remove that to understand how the chart is built initially
     xlim(c(2, 4)) # Try to remove that to see how to make a pie chart

(Where there more nation states? how many companies for each type) (have the non-IO simply been around longer, are they even still here, is it because of commodity type?)

mean_company =
  data %>%
  group_by(parent_entity) %>%
  summarize(company_mean_pv = mean(production_value),
            company_mean_te = mean(total_emissions))

print(mean_company)

## # A tibble: 180 × 3
##    parent_entity                           company_mean_pv company_mean_te
##    <chr>                                             <dbl>           <dbl>
##  1 APA Corporation                                   207.             25.9
##  2 ARC Resources                                      95.3            10.5
##  3 Abu Dhabi National Oil Company                    729.            146. 
##  4 Adani Enterprises                                  18.3            31.8
##  5 Adaro Energy                                       27.0            55.1
##  6 Alliance Resource Partners                         18.8            49.0
##  7 Alpha Metallurgical Resources                      25.3            70.1
##  8 American Consolidated Natural Resources            24.2            65.6
##  9 Anglo American                                     20.3            49.8
## 10 Antero                                            310.             28.9
## # ℹ 170 more rows

min_row_index <- which.min(mean_company$company_mean_te)

# Print the row
print(mean_company[min_row_index, ])

## # A tibble: 1 × 3
##   parent_entity       company_mean_pv company_mean_te
##   <chr>                         <dbl>           <dbl>
## 1 Westmoreland Mining            2.37            4.62

max_row_index <- which.max(mean_company$company_mean_te)

# Print the index
#print(max_row_index)
print(mean_company[max_row_index, ])

## # A tibble: 1 × 3
##   parent_entity company_mean_pv company_mean_te
##   <chr>                   <dbl>           <dbl>
## 1 Gazprom                 9470.            740.

#Lollipop chart of emission per company

#ggplot(mean_company, aes(x=parent_entity, y=company_mean_te)) +
  #geom_segment( aes(x=parent_entity, xend=parent_entity, y=0, yend=company_mean_te), color="skyblue") +
  #geom_point( color="blue", size=4, alpha=0.6) +
  #theme_light() +
  #coord_flip() +
  #theme(
    #panel.grid.major.y = element_blank(),
    #panel.border = element_blank(),
   # axis.ticks.y = element_blank()
  #)

Only some companies produced over 200 emissions in the recording.

#Heatmap showing how things expanded post Industrial revolution
#Time series graphs
#Correlation heatmpaps of emmissions vs production value

cols <- c(
  colorRampPalette(c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e", "#eec73a", "#e29421"))(20),
  colorRampPalette(c("#e29421", "#f05336", "#ce472e"))(80))



ggplot(data, aes(x = year, y = commodity, fill = total_emissions)) + 
geom_tile(colour = "white", linewidth = 0.5, 
            width = .9, height = .9) + 
theme_minimal() +
scale_fill_gradientn(colours = cols,
                    limits = c(0, 500),
                    breaks = seq(0, 500, by = 100),
                    labels = c("0k", "0.5k", "1k", "2k", "3k", "4k"),
                    na.value = rgb(246/255, 246/255, 246/255),
                    guide = guide_colourbar(ticks = TRUE, 
                                          nbin = 10,
                                          barheight = .5, 
                                          label = TRUE, 
                                          barwidth = 10,
                                          title = "Emissions in MtCo2",
                                          title.position = "top",
                                          title.hjust = 0.5)) +
scale_x_continuous(expand = c(0,0), 
                  breaks = seq(1854, 2023, by = 40),
                  limits = c(1854, 2023)) +
geom_vline(xintercept = 1997, color = "black", size = 0.5) +
  theme(legend.position.inside = c(.5, -.13),
      legend.direction = "horizontal",
      legend.text = element_text(colour = "grey20"),
      plot.margin = grid::unit(c(.5,.5,1.5,.5), "cm"),
      axis.text.y = element_text(size = 6, family = "Roboto", 
                                hjust = 1),
      axis.text.x = element_text(size = 8),
      axis.ticks.y = element_blank(),
      panel.grid = element_blank(),
      title = element_text(hjust = -.07, face = "bold", vjust = 1, 
                          family = "Roboto"),
      text = element_text(family = "Roboto")) +
ggtitle("Emissions in Unit, 1854-2023") +
annotate("text", label = "Kyoto Protocol", x = 1997, y = 2, 
         vjust = 0.6, hjust = 0.7, size = I(3), family = "Roboto") +
  xlab("") + ylab("")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Removed 351 rows containing missing values or values outside the scale range
## (`geom_tile()`).

## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
## not found in Windows font database
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
## not found in Windows font database
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
## not found in Windows font database
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
## not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
## not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

The range of colours matches the quartiles of emissions