Part B1. Sampling Data

Import Data

If your R Markdown file is NOT in the same folder as your data, set your working directory with setwd() first. For example: setwd("//medusa/StudentWork/(Your UTOR ID)/GGR276/Lab1"). Change the path to reflect your own directory (in R strings, use forward slashes or doubled backslashes, because a single backslash starts an escape sequence). Then load your data. Note that some records have NA for kilotonnes of carbon dioxide equivalents (kt CO2 eq) emissions and some have missing locations (latitude and longitude both recorded as 0), so we first filter these out of the data.

setwd("/Users/amroopbains/Downloads") 
GHG_emission <- read.csv("greenhousegasemissions.csv", sep = ',', header = TRUE)
filtered_GHG_emission <- subset(GHG_emission, !(Latitude == 0 & Longitude == 0) & !(is.na(CO2_eq)))
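
To see what this filter keeps and drops, here is a minimal sketch on a small made-up data frame. The object names toy, toy_filtered, and n_removed and all values are hypothetical; only the filter condition is taken from the code above.

```r
# Toy data (made-up values) standing in for the real CSV
toy <- data.frame(
  CO2_eq    = c(50.0, NA, 18.9, 73.1),
  Latitude  = c(50.1, 52.3, 0, 43.0),
  Longitude = c(-110.8, -113.8, 0, -81.6)
)

# Same condition as above: drop (0, 0) coordinates and NA emissions
toy_filtered <- subset(toy, !(Latitude == 0 & Longitude == 0) & !is.na(CO2_eq))

# Number of rows removed by the filter
n_removed <- nrow(toy) - nrow(toy_filtered)

print(toy_filtered)
print(n_removed)
```

Comparing nrow() before and after filtering is a quick way to confirm how many records were excluded from the real data.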

Data exploration

Before analyzing the data, we will explore it to become familiar with its structure and content, which will help you answer the questions more effectively. We will examine the column names, review the first few records, and identify the types of facilities included in the dataset. Try to run the code and examine the outputs. You can add a ? in front of a function to learn about its functionality. For example, type ?colnames in the R Console.

colnames(filtered_GHG_emission)
head(filtered_GHG_emission)
unique(filtered_GHG_emission$E_NAIC_Name)
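
Besides listing the unique facility types, it can help to count how many records each type has, since those counts limit how large a per-group sample can be. The sketch below uses a made-up data frame (toy is hypothetical) with the same column name as the lab data.

```r
# Toy data (made-up) with the same column name as the lab dataset
toy <- data.frame(
  E_NAIC_Name = c("Forging", "Petroleum Refineries", "Forging",
                  "Hog slaughtering", "Forging"),
  stringsAsFactors = FALSE
)

# Count records per facility type, largest group first
type_counts <- sort(table(toy$E_NAIC_Name), decreasing = TRUE)
print(type_counts)

# str() gives a compact overview of column types
str(toy)
```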

Simple Random Sampling

First, we will randomly sample 100 records.

random_sample <- filtered_GHG_emission[sample(1:nrow(filtered_GHG_emission), size = 100, replace = FALSE), ]
head(random_sample)
##      CO2_eq
## 2191  50.01
## 1389  18.88
## 2565 739.95
## 1010  17.28
## 1739  43.15
## 2247  73.14
##                                                                                                                 E_DetailPageURL
## 2191 <a href=https://indicators-map.canada.ca/App/Detail?id=0111593&GoCTemplateCulture=en-CA target=_blank>More information</a>
## 1389 <a href=https://indicators-map.canada.ca/App/Detail?id=0111335&GoCTemplateCulture=en-CA target=_blank>More information</a>
## 2565 <a href=https://indicators-map.canada.ca/App/Detail?id=0110254&GoCTemplateCulture=en-CA target=_blank>More information</a>
## 1010 <a href=https://indicators-map.canada.ca/App/Detail?id=0111978&GoCTemplateCulture=en-CA target=_blank>More information</a>
## 1739 <a href=https://indicators-map.canada.ca/App/Detail?id=0110782&GoCTemplateCulture=en-CA target=_blank>More information</a>
## 2247 <a href=https://indicators-map.canada.ca/App/Detail?id=0110769&GoCTemplateCulture=en-CA target=_blank>More information</a>
##                                                   E_Units Report_Year
## 2191 kilotonnes of carbon dioxide equivalents (kt CO2 eq)        2022
## 1389 kilotonnes of carbon dioxide equivalents (kt CO2 eq)        2022
## 2565 kilotonnes of carbon dioxide equivalents (kt CO2 eq)        2022
## 1010 kilotonnes of carbon dioxide equivalents (kt CO2 eq)        2022
## 1739 kilotonnes of carbon dioxide equivalents (kt CO2 eq)        2022
## 2247 kilotonnes of carbon dioxide equivalents (kt CO2 eq)        2022
##                                      CompanyName               City E_Province
## 2191 Redcliff Cypress Waste Management Authority     Cypress County    Alberta
## 1389                      Olymel S.E.C. Red Deer           Red Deer    Alberta
## 2565          Suncor Energy Products Partnership             Sarnia    Ontario
## 1010                           Linde Canada Inc.  Fort Saskatchewan    Alberta
## 1739                     Les Forges de Sorel Cie St Joseph De Sorel     Quebec
## 2247                  Meridian Technologies Inc.          Strathroy    Ontario
##                            E_NAIC_Name Latitude Longitude Symbol
## 2191      Waste Treatment and Disposal 50.09602 -110.8483      3
## 1389                  Hog slaughtering 52.30342 -113.7914      2
## 2565              Petroleum Refineries 42.93060  -82.4433      5
## 1010      Industrial Gas Manufacturing 53.72126 -113.1795      2
## 1739                           Forging 46.04500  -73.1239      2
## 2247 Non-Ferrous Die-Casting Foundries 42.99070  -81.6208      3
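
The sampling line above works by drawing row indices and then subsetting the data frame with them, which is why the row names in the output are the original row numbers. A minimal sketch of the same pattern on made-up data follows; toy, idx, and the seed value are assumptions for illustration only.

```r
# Toy data (made-up): 10 records
toy <- data.frame(id = 1:10, value = (1:10)^2)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Draw 4 distinct row indices, then keep those rows
idx <- sample(seq_len(nrow(toy)), size = 4, replace = FALSE)
toy_sample <- toy[idx, ]

# With replace = FALSE no row can be drawn twice
no_duplicates <- anyDuplicated(idx) == 0

print(toy_sample)
print(no_duplicates)
```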

Stratified Sampling

Next, we will try stratified sampling using the dplyr package. In this exercise, we are only interested in Oil and Gas Extraction (E_NAIC_Name == 'Oil and gas extraction (except oil sands)') and Fossil-Fuel Electric Power Generation (E_NAIC_Name == 'Fossil-Fuel Electric Power Generation') facilities. Therefore, we create a subset first.

#Create a subset containing only Oil and Gas Extraction and Fossil-Fuel Electric Power Generation facilities
subset_facilities <- subset(filtered_GHG_emission, (E_NAIC_Name == 'Oil and gas extraction (except oil sands)'| E_NAIC_Name == 'Fossil-Fuel Electric Power Generation'))

#Stratified Sampling
library(dplyr)
stratified_sample <- subset_facilities %>%
  group_by(E_NAIC_Name) %>%
  sample_n(size = 50, replace = FALSE) %>%
  ungroup()
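
For comparison, the same idea (split by group, sample within each group, recombine) can be sketched in base R without dplyr. The data frame toy, the group labels, and the per-group size are made up for illustration. Recent dplyr versions also document slice_sample(n = ...) as the successor to sample_n(), which may be worth checking in your installed version.

```r
# Toy data (made-up): two groups of unequal size
toy <- data.frame(
  group = rep(c("A", "B"), times = c(8, 5)),
  value = 1:13
)

set.seed(42)  # arbitrary seed for repeatability
per_group <- 3

# Split into one data frame per group and sample rows within each
pieces <- lapply(split(toy, toy$group), function(g) {
  g[sample(seq_len(nrow(g)), size = per_group), , drop = FALSE]
})

# Stack the per-group samples back into one data frame
toy_stratified <- do.call(rbind, pieces)

print(table(toy_stratified$group))
```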

(4 marks) Now, it is your turn. Please create a stratified sample consisting of Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal (E_NAIC_Name == 'Waste Treatment and Disposal') facilities. For each category, make the sample size equal to 60. To make the result reproducible, call set.seed(123) before sampling.

In programming, especially in statistical simulations, setting a seed ensures that the random number generator produces the same sequence of random numbers every time the code is run. This is important for reproducibility and consistency in research and debugging.
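
A minimal sketch of this behaviour: resetting the same seed before two draws gives identical results, while a further draw without resetting continues the random sequence. The variable names are hypothetical.

```r
# Same seed, same draw
set.seed(123)
first_draw <- sample(1:100, size = 5)

set.seed(123)
second_draw <- sample(1:100, size = 5)

# TRUE: both draws started from the same generator state
same_result <- identical(first_draw, second_draw)
print(same_result)

# No reset here, so this draw continues the sequence
# instead of restarting it
third_draw <- sample(1:100, size = 5)
print(third_draw)
```

This is why set.seed() should be called immediately before the sampling step that needs to be reproducible.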

#TODO

library(dplyr)

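# Note: this rebuilds filtered_GHG_emission from the raw GHG_emission data, so records
# with NA CO2_eq or (0, 0) coordinates are no longer filtered out (see the NA values in the output)
filtered_GHG_emission <- GHG_emission %>% filter(!is.na(E_NAIC_Name))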

set.seed(123)

subset_facilities <- filtered_GHG_emission %>%
  filter(E_NAIC_Name %in% c('Oil and gas extraction (except oil sands)', 'Fossil-Fuel Electric Power Generation', 'Waste Treatment and Disposal'))

stratified_sample <- subset_facilities %>%
  group_by(E_NAIC_Name) %>%
  sample_n(size = min(60, n()), replace = FALSE) %>%
  ungroup()

head(stratified_sample)
## # A tibble: 6 × 11
##   CO2_eq E_DetailPageURL        E_Units Report_Year CompanyName City  E_Province
##    <dbl> <chr>                  <chr>         <int> <chr>       <chr> <chr>     
## 1    NA  <a href=https://indic… kiloto…        2022 Ontario Po… ntic… Ontario   
## 2    NA  <a href=https://indic… kiloto…        2022 NWT Power … Tuli… Northwest…
## 3   370. <a href=https://indic… kiloto…        2022 Grande Pra… Sexs… Alberta   
## 4    NA  <a href=https://indic… kiloto…        2022 Genalta GP… Peac… Alberta   
## 5  1065. <a href=https://indic… kiloto…        2022 Nova Scoti… Dart… Nova Scot…
## 6  1110. <a href=https://indic… kiloto…        2022 TransAlta … Sarn… Ontario   
## # ℹ 4 more variables: E_NAIC_Name <chr>, Latitude <dbl>, Longitude <dbl>,
## #   Symbol <int>

Compare Sampling Techniques

(4 marks) What is the difference between simple random sampling and stratified sampling? In this case, would you use simple random sampling or stratified sampling?

Type your response here: In simple random sampling, every unit has an equal chance of being selected, and distinct groups within the data are not considered. In stratified sampling, the data are separated into groups (strata) based on a feature, which in our example is the facility type, and a sample is then drawn from each group. In this case I would use stratified sampling, since we are only interested in oil and gas extraction, fossil-fuel electric power generation, and waste treatment and disposal facilities. Stratified sampling ensures that each group is equally represented, whereas simple random sampling may over-represent one group.
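
The difference can be illustrated with a small sketch on a made-up, deliberately unbalanced population. The labels "common" and "rare", the group sizes, and the sample sizes are all assumptions chosen for illustration.

```r
set.seed(123)  # arbitrary seed for repeatability

# Toy population (made-up): 950 "common" and 50 "rare" records
pop <- data.frame(
  id   = 1:1000,
  type = factor(rep(c("common", "rare"), times = c(950, 50)))
)

# Simple random sample of 40 rows: group counts are left to chance
srs_idx <- sample(seq_len(nrow(pop)), size = 40)

# Stratified sample: 20 row indices drawn within each type
strat_idx <- unlist(lapply(split(seq_len(nrow(pop)), pop$type),
                           sample, size = 20))

srs_counts   <- table(pop$type[srs_idx])
strat_counts <- table(pop$type[strat_idx])

print(srs_counts)
print(strat_counts)
```

The stratified counts are fixed at 20 per group by construction, while the simple random sample tends to mirror the population proportions, so the rare group may contribute only a few records.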

Part B2. Inferential Statistics

Two-sample difference of mean test

(2 marks) I wonder if the mean carbon dioxide equivalents (kt CO2 eq) emissions differ between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities. Write the null hypothesis and the alternative hypothesis.

Type your response here: The null hypothesis is that there is no difference in the mean carbon dioxide equivalent emissions between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities. The alternative hypothesis is that there is a difference in the mean carbon dioxide equivalent emissions between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities.

(4 marks) Please select an appropriate statistical test, and provide justifications by visualizing the distribution.

Type your response here: Based on the distributions in the histograms, I would use the Wilcoxon rank-sum test, because the data are highly right-skewed and contain outliers. The Fossil-Fuel Electric Power Generation data also have a much larger range of values.

#Visualize the Distribution
#TODO (2 marks)
library(dplyr)

oil_gas <- filtered_GHG_emission %>% filter(E_NAIC_Name == 'Oil and gas extraction (except oil sands)')
fossil_fuel <- filtered_GHG_emission %>% filter(E_NAIC_Name == 'Fossil-Fuel Electric Power Generation')

hist(oil_gas$CO2_eq, main = "CO2 Emissions - Oil & Gas", col = "blue", xlab = "kt CO2 eq")

hist(fossil_fuel$CO2_eq, main = "CO2 Emissions - Fossil Fuel", col = "red", xlab = "kt CO2 eq")
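
When a histogram is dominated by a few very large values, two quick checks can make right skew easier to see: comparing the mean with the median, and plotting on a log scale. The sketch below uses simulated log-normal values (toy_emissions is made up and is not the lab data).

```r
set.seed(1)  # arbitrary seed for repeatability

# Simulated right-skewed values standing in for emissions (not real data)
toy_emissions <- rlnorm(200, meanlog = 3, sdlog = 1.5)

# In a right-skewed sample the mean is pulled above the median
mean_above_median <- mean(toy_emissions) > median(toy_emissions)
print(c(mean = mean(toy_emissions), median = median(toy_emissions)))

# A log10 scale spreads out the bulk of the data that the
# raw-scale histogram squeezes into the first bin
hist(log10(toy_emissions),
     main = "Toy emissions (log10 scale)",
     xlab = "log10(kt CO2 eq)", col = "grey")
```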

(4 marks) Please write an R script to conduct the statistical test and interpret the results.

#TODO (2 marks)
shapiro.test(oil_gas$CO2_eq)  
## 
##  Shapiro-Wilk normality test
## 
## data:  oil_gas$CO2_eq
## W = 0.47406, p-value < 2.2e-16
shapiro.test(fossil_fuel$CO2_eq)
## 
##  Shapiro-Wilk normality test
## 
## data:  fossil_fuel$CO2_eq
## W = 0.51818, p-value < 2.2e-16
test <- wilcox.test(oil_gas$CO2_eq, fossil_fuel$CO2_eq)
print(test)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  oil_gas$CO2_eq and fossil_fuel$CO2_eq
## W = 17652, p-value = 7.158e-15
## alternative hypothesis: true location shift is not equal to 0

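Type your response here: Both Shapiro-Wilk tests give p < 2.2e-16, so we reject normality for both groups, which supports using the Wilcoxon rank-sum test. Its p-value (7.158e-15) is far below 0.05, so the result is statistically significant and we reject the null hypothesis that there is no difference in carbon dioxide equivalent emissions between Oil and Gas Extraction and Fossil-Fuel Electric Power Generation facilities.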
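
A p-value alone does not say which group is larger or by how much. One possible follow-up, sketched here on simulated data (group_a and group_b are made up), is to request a confidence interval from wilcox.test() with conf.int = TRUE and to report the group medians alongside it.

```r
set.seed(2)  # arbitrary seed for repeatability

# Simulated right-skewed groups (not the lab data)
group_a <- rlnorm(60, meanlog = 3, sdlog = 1)
group_b <- rlnorm(60, meanlog = 4, sdlog = 1)

# conf.int = TRUE adds an estimate of the location shift and its interval
res <- wilcox.test(group_a, group_b, conf.int = TRUE)

print(res$p.value)
# The estimate is the median of the pairwise differences between
# the two samples, not the difference between the two medians
print(res$estimate)
print(res$conf.int)

# Group medians help show the direction of the difference
print(c(median_a = median(group_a), median_b = median(group_b)))
```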

Three-sample Mean Difference Test

(8 marks) Please follow the same procedure above to test whether the means of CO2 eq emissions for Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal facilities are the same.

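Type your response here: The Shapiro-Wilk tests reject normality for all three facility types (p < 2.2e-16 each), so I used the Kruskal-Wallis test. Its p-value (chi-squared = 86.451, df = 2, p < 2.2e-16) is below 0.05, so we reject the null hypothesis, meaning that CO2 eq emissions differ for at least one of the three facility types.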

#TODO 
#Visualize Distribution (1 mark)
#Statistical Test (2 marks)
subset_facilities <- filtered_GHG_emission %>%
  filter(E_NAIC_Name %in% c('Oil and gas extraction (except oil sands)','Fossil-Fuel Electric Power Generation', 'Waste Treatment and Disposal'))

boxplot(CO2_eq ~ E_NAIC_Name, data = subset_facilities,
        main = "CO2 Emissions Across Facility Types",
        ylab = "kt CO2 eq", col = c("blue", "red", "green"))

shapiro.test(subset_facilities$CO2_eq[subset_facilities$E_NAIC_Name == 'Oil and gas extraction (except oil sands)'])
## 
##  Shapiro-Wilk normality test
## 
## data:  subset_facilities$CO2_eq[subset_facilities$E_NAIC_Name == "Oil and gas extraction (except oil sands)"]
## W = 0.47406, p-value < 2.2e-16
shapiro.test(subset_facilities$CO2_eq[subset_facilities$E_NAIC_Name == 'Fossil-Fuel Electric Power Generation'])
## 
##  Shapiro-Wilk normality test
## 
## data:  subset_facilities$CO2_eq[subset_facilities$E_NAIC_Name == "Fossil-Fuel Electric Power Generation"]
## W = 0.51818, p-value < 2.2e-16
shapiro.test(subset_facilities$CO2_eq[subset_facilities$E_NAIC_Name == 'Waste Treatment and Disposal'])
## 
##  Shapiro-Wilk normality test
## 
## data:  subset_facilities$CO2_eq[subset_facilities$E_NAIC_Name == "Waste Treatment and Disposal"]
## W = 0.57345, p-value < 2.2e-16
k_test <- kruskal.test(CO2_eq ~ E_NAIC_Name, data = subset_facilities)
print(k_test)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  CO2_eq by E_NAIC_Name
## Kruskal-Wallis chi-squared = 86.451, df = 2, p-value < 2.2e-16
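
The Kruskal-Wallis test indicates that at least one group differs, but not which one. A common follow-up, sketched here on simulated data rather than the lab data, is a set of pairwise Wilcoxon rank-sum tests with a multiple-comparison correction. The group labels "A", "B", "C" and the choice of the Bonferroni method are assumptions for illustration.

```r
set.seed(3)  # arbitrary seed for repeatability

# Simulated right-skewed emissions for three made-up facility types
toy <- data.frame(
  emissions = c(rlnorm(40, meanlog = 3,   sdlog = 1),
                rlnorm(40, meanlog = 4,   sdlog = 1),
                rlnorm(40, meanlog = 3.2, sdlog = 1)),
  type = rep(c("A", "B", "C"), each = 40)
)

# Pairwise Wilcoxon rank-sum tests, p-values adjusted for the
# three comparisons with the Bonferroni method
posthoc <- pairwise.wilcox.test(toy$emissions, toy$type,
                                p.adjust.method = "bonferroni")
print(posthoc)
```

The printed table has one adjusted p-value per pair of groups, which shows which pairs drive the overall result.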