rm(list=ls()); gc()
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 528519 28.3    1175327 62.8         NA   669454 35.8
## Vcells 974453  7.5    8388608 64.0      16384  1851679 14.2

Part B1. Sampling Data

Import Data

If your R Markdown file is NOT in the same folder as your data, first set your working directory using setwd(). Here is an example: setwd("//medusa/StudentWork/(Your UTOR ID)/GGR276/Lab1"). In R, write paths with forward slashes or doubled backslashes, and change the path to reflect your personal directory. Then load your data. Note that some records have NA values for emissions in kilotons of carbon dioxide equivalents (kt CO2 eq), and some have missing locations (latitude and longitude both recorded as 0), so we will first filter these records out of the data.

GHG_emission <- read.csv("GHG_Emission.csv", sep = ',', header = TRUE)
filtered_GHG_emission <- subset(GHG_emission, !(Latitude == 0 & Longitude == 0) & !(is.na(CO2_eq)))
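To see what the filter does, it can help to run the same condition on a tiny data frame. The sketch below uses a made-up `toy` data frame (hypothetical values, not the lab data) that only borrows the column names Latitude, Longitude, and CO2_eq from the code above.

```r
# Toy data frame (hypothetical values) with the same columns used by the filter
toy <- data.frame(
  Latitude  = c(43.7, 0, 51.0, 45.4),
  Longitude = c(-79.4, 0, -114.1, -75.7),
  CO2_eq    = c(120.5, 80.0, NA, 310.2)
)

# Drop rows whose coordinates are both 0 (missing location) or whose CO2_eq is NA
toy_filtered <- subset(toy, !(Latitude == 0 & Longitude == 0) & !is.na(CO2_eq))

nrow(toy)           # 4 rows before filtering
nrow(toy_filtered)  # 2 rows after filtering
```

Row 2 is removed because both coordinates are 0, and row 3 is removed because its emission value is NA.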

Data exploration

Before analyzing the data, we will explore it to become familiar with its structure and content, which will help you answer the questions more effectively. We will examine the column names, review the first few records, and identify the types of facilities included in the dataset. Try to run the code and examine the outputs. You can add a ? in front of a function to learn about its functionality. For example, type ?colnames in the R Console.

colnames(filtered_GHG_emission)
head(filtered_GHG_emission)
unique(filtered_GHG_emission$E_NAIC_Name)

Simple Random Sampling

First, we will randomly sample 100 records.

random_sample <- filtered_GHG_emission[sample(1:nrow(filtered_GHG_emission), size = 100, replace = FALSE), ]
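As a rough illustration of the same indexing pattern, the sketch below draws 100 rows from a made-up data frame and checks that no row was selected twice. The `toy` data frame, its columns, and the seed value are assumptions for illustration only.

```r
set.seed(42)  # arbitrary seed, used only so this sketch is repeatable

# Hypothetical data frame with 500 rows
toy <- data.frame(id = 1:500, value = rnorm(500))

# Draw 100 distinct row indices, then use them to subset the data frame
idx <- sample(seq_len(nrow(toy)), size = 100, replace = FALSE)
toy_sample <- toy[idx, ]

nrow(toy_sample)              # 100
anyDuplicated(toy_sample$id)  # 0, because sampling was without replacement
```

Using seq_len(nrow(toy)) instead of 1:nrow(toy) is a small defensive choice: it returns an empty index when the data frame has no rows.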

Stratified Sampling

Next, we will try stratified sampling using the dplyr package. In this exercise, we are only interested in Oil and Gas Extraction (E_NAIC_Name == 'Oil and gas extraction (except oil sands)') and Fossil-Fuel Electric Power Generation (E_NAIC_Name == 'Fossil-Fuel Electric Power Generation') facilities. Therefore, we create a subset first.

#Create a subset containing only Oil and Gas Extraction and Fossil-Fuel Electric Power Generation facilities
subset_facilities <- subset(filtered_GHG_emission, (E_NAIC_Name == 'Oil and gas extraction (except oil sands)'| E_NAIC_Name == 'Fossil-Fuel Electric Power Generation'))

#Stratified Sampling
library(dplyr)
stratified_sample <- subset_facilities %>%
  group_by(E_NAIC_Name) %>%
  sample_n(size = 50, replace = FALSE) %>%
  ungroup()

(4 marks) Now, it is your turn. Please create a stratified sample consisting of Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal (E_NAIC_Name == 'Waste Treatment and Disposal') facilities. For each category, make the sample size equal to 60. To make the result reproducible, please add set.seed(123) before your sampling code.

In programming, especially in statistical simulations, setting a seed ensures that the random number generator produces the same sequence of random numbers every time the code is run. This is important for reproducibility and consistency in research and debugging.
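The effect of a seed can be checked directly. In this minimal sketch, the variable names and the range 1:1000 are arbitrary choices for illustration.

```r
# Same seed, same call: the two draws are identical
set.seed(123)
first_draw <- sample(1:1000, size = 5)

set.seed(123)
second_draw <- sample(1:1000, size = 5)

identical(first_draw, second_draw)  # TRUE

# Without resetting the seed, the generator simply continues its sequence
third_draw <- sample(1:1000, size = 5)
```

Because the generator continues from wherever it left off, set.seed() needs to be called again before any sampling step that should be repeatable on its own.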

set.seed(123)
subset_facilities <- subset(filtered_GHG_emission, (E_NAIC_Name == 'Oil and gas extraction (except oil sands)'| E_NAIC_Name == 'Fossil-Fuel Electric Power Generation'|E_NAIC_Name == 'Waste Treatment and Disposal'))

library(dplyr)
stratified_sample <- subset_facilities %>%
  group_by(E_NAIC_Name) %>%
  sample_n(size = 60, replace = FALSE) %>%
  ungroup()
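A quick way to confirm that a stratified sample has the intended size per group is to tabulate the grouping column. The sketch below reproduces the idea in base R on made-up data; the group labels "Type A" to "Type C", the group sizes, and the simulated CO2_eq values are all assumptions for illustration.

```r
set.seed(123)

# Hypothetical data: three facility types with unequal numbers of records
toy <- data.frame(
  E_NAIC_Name = rep(c("Type A", "Type B", "Type C"), times = c(200, 80, 70)),
  CO2_eq      = rexp(350, rate = 1 / 100)
)

per_group <- 60

# Split by stratum, sample rows within each stratum, then recombine
groups  <- split(toy, toy$E_NAIC_Name)
sampled <- lapply(groups, function(g) {
  g[sample(seq_len(nrow(g)), size = per_group, replace = FALSE), ]
})
toy_stratified <- do.call(rbind, sampled)

table(toy_stratified$E_NAIC_Name)  # 60 rows in each of the three groups
```

Sampling without replacement only works when every stratum has at least as many rows as the requested sample size, so it is worth checking the group counts of the real data before sampling.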

Compare Sampling Technique

(4 marks) What is the difference between simple random sampling and stratified sampling? In this case, would you use simple random sampling or stratified sampling?

Type your response here:
In simple random sampling, every record in the dataset has the same probability of being selected. In stratified sampling, the population is first divided into categories (strata), and a random sample is then drawn from each stratum. Stratified sampling is more complex than simple random sampling, but I would use it here. Since I am studying greenhouse gas emissions from different types of large facilities, stratified sampling ensures that every facility type of interest is represented. With simple random sampling, a facility type with few records may be under-represented or missing from the sample entirely, especially when the sample size is small. Stratified sampling is therefore preferred for comparing facility types.

Part B2. Inferential Statistics

Two-sample difference of mean test

(2 marks) I wonder if the mean carbon dioxide equivalents (kt CO2 eq) emissions differ between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities. Write the null hypothesis and alternate hypothesis.

Type your response here: Null hypothesis: there is no difference in mean carbon dioxide equivalents (kt CO2 eq) emissions between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities. Alternate hypothesis: mean carbon dioxide equivalents (kt CO2 eq) emissions differ between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities.

(4 marks) Please select an appropriate statistical test, and provide justifications by visualizing the distribution.

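Type your response here: According to the visualization, both histograms show strongly skewed distributions, so neither group is normally distributed. Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities are two independent samples, so a two-sample t-test is used to compare their means. Because the t-test assumes approximately normal data, this choice relies on the samples being large enough for the sample means to be approximately normally distributed. With skew this strong, a nonparametric test such as the Wilcoxon rank-sum test would be a reasonable alternative.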

#Visualize the Distribution
set.seed(123)
oil_and_gas <- subset(filtered_GHG_emission, E_NAIC_Name == 'Oil and gas extraction (except oil sands)')
fossil_fuel <- subset(filtered_GHG_emission, E_NAIC_Name == 'Fossil-Fuel Electric Power Generation')

hist(oil_and_gas$CO2_eq, 
     main = "Distribution of CO2 eq Emissions: Oil and Gas Extraction",
     xlab = "kt CO2 eq emissions")

hist(fossil_fuel$CO2_eq, 
     main = "Distribution of CO2 eq Emissions: Fossil-Fuel Electric Power Generation",
     xlab = "kt CO2 eq emissions")
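Skew can also be checked numerically and with a Q-Q plot, in addition to a histogram. The sketch below uses simulated right-skewed values (a log-normal draw with arbitrary parameters) as a stand-in for an emissions column; none of these numbers come from the lab data.

```r
set.seed(123)

# Simulated right-skewed values standing in for an emissions column
sim_emissions <- rlnorm(200, meanlog = 3, sdlog = 1.5)

# Right skew pulls the mean above the median
mean(sim_emissions)
median(sim_emissions)

# Normal Q-Q plot: points bending away from the line indicate non-normality
qqnorm(sim_emissions)
qqline(sim_emissions)

# A log10 scale often makes the shape of skewed positive data easier to read
hist(log10(sim_emissions),
     main = "Simulated emissions (log10 scale)",
     xlab = "log10(value)")
```

On real data, log10() requires strictly positive values, so zeros would need to be handled before applying it.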

(4 marks) Please write an R script to conduct the statistical test and interpret the results.

t_test_result <- t.test(oil_and_gas$CO2_eq, fossil_fuel$CO2_eq, var.equal = FALSE)

print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  oil_and_gas$CO2_eq and fossil_fuel$CO2_eq
## t = -4.7939, df = 98.144, p-value = 5.822e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -755.0944 -312.9731
## sample estimates:
## mean of x mean of y 
##  48.14184 582.17556

Type your response here: According to the result, the t-value is -4.7939, the degrees of freedom are 98.144, and the p-value is 5.822e-06. Since p < 0.05, the null hypothesis is rejected. The alternative hypothesis, “true difference in means is not equal to 0”, states that the mean emissions of the two facility types are not equal. The 95 percent confidence interval for the difference in means (-755.0944 to -312.9731) does not contain 0, which is consistent with this conclusion. The mean for Oil and Gas Extraction facilities is 48.14184 kt CO2 eq, while the mean for Fossil-Fuel Electric Power Generation facilities is 582.17556 kt CO2 eq. Therefore, mean carbon dioxide equivalents (kt CO2 eq) emissions differ between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities, with the latter significantly higher than the former.
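The values quoted in an interpretation can be pulled out of the test object instead of being copied from the printout. The sketch below runs Welch's t-test on two simulated skewed groups (arbitrary log-normal parameters, not the lab data) and, as a comparison, the Wilcoxon rank-sum test, which does not assume normality.

```r
set.seed(123)

# Two simulated right-skewed groups (hypothetical stand-ins for two facility types)
group_x <- rlnorm(80, meanlog = 3, sdlog = 1)
group_y <- rlnorm(60, meanlog = 5, sdlog = 1)

# Welch two-sample t-test (does not assume equal variances)
res <- t.test(group_x, group_y, var.equal = FALSE)

res$statistic  # t value
res$parameter  # Welch-adjusted degrees of freedom
res$p.value    # p-value
res$conf.int   # 95% confidence interval for mean(x) - mean(y)
res$estimate   # the two sample means

# Nonparametric alternative for skewed data
wilcox_res <- wilcox.test(group_x, group_y)
wilcox_res$p.value
```

If both tests lead to the same decision, the conclusion is less dependent on the normality assumption.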

Three-sample Mean Difference Test

(8 marks) Please follow the same procedure above to test whether the means of CO2 eq emissions for Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal facilities are the same.

Type your response here: Null hypothesis: the means of CO2 eq emissions for Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal facilities are the same. Alternate hypothesis: the means of CO2 eq emissions for Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal facilities are not all the same.

#Visualize the Distribution
set.seed(123)
oil_and_gas <- subset(filtered_GHG_emission, E_NAIC_Name == 'Oil and gas extraction (except oil sands)')
fossil_fuel <- subset(filtered_GHG_emission, E_NAIC_Name == 'Fossil-Fuel Electric Power Generation')
waste_treatment <- subset(filtered_GHG_emission, E_NAIC_Name == 'Waste Treatment and Disposal')

hist(oil_and_gas$CO2_eq, 
     main = "Distribution of CO2 eq Emissions: Oil and Gas Extraction",
     xlab = "kt CO2 eq emissions")

hist(fossil_fuel$CO2_eq, 
     main = "Distribution of CO2 eq Emissions: Fossil-Fuel Electric Power Generation",
     xlab = "kt CO2 eq emissions")

hist(waste_treatment$CO2_eq, 
     main = "Distribution of CO2 eq Emissions: Waste Treatment and Disposal",
     xlab = "kt CO2 eq emissions")

According to the visualization, all three histograms show skewed distributions, so none of the groups is normally distributed. A one-way ANOVA is used because it assesses whether there are statistically significant differences among the means of two or more independent groups, which is the case here. Because ANOVA assumes approximately normal residuals, the result should be interpreted with caution given the skew; the Kruskal-Wallis test is a nonparametric alternative.

anova_result <- aov(CO2_eq ~ E_NAIC_Name, data = filtered_GHG_emission)

summary(anova_result)
##               Df    Sum Sq Mean Sq F value Pr(>F)    
## E_NAIC_Name  153 293399893 1917646   8.304 <2e-16 ***
## Residuals   1648 380573319  230930                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

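According to the result, the degrees of freedom are 153 for E_NAIC_Name and 1648 for the residuals. The sum of squares is 293399893 for E_NAIC_Name and 380573319 for the residuals. The mean square is 1917646 for E_NAIC_Name and 230930 for the residuals. The F value is 8.304 and the p-value is <2e-16. Since p < 0.05, the null hypothesis of equal means is rejected. Note, however, that 153 degrees of freedom for E_NAIC_Name means the model was fit to all 154 facility types in filtered_GHG_emission, not only the three of interest. The result therefore shows that mean CO2 eq emissions differ for at least one facility type across the whole dataset; it does not by itself establish a difference among Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal facilities. To answer the question as posed, the ANOVA should be fit to a subset containing only those three facility types. A significant F-test also indicates only that at least one group mean differs, not that every pair of groups differs.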
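One way to restrict a one-way ANOVA to a chosen set of groups is to subset the data before fitting. The sketch below shows this on made-up data with four hypothetical groups ("Type A" to "Type D") and simulated CO2_eq values; with three groups of 40 records, the expected degrees of freedom are 2 and 117.

```r
set.seed(123)

# Hypothetical data: four facility types, 40 simulated records each
toy <- data.frame(
  E_NAIC_Name = rep(c("Type A", "Type B", "Type C", "Type D"), each = 40),
  CO2_eq      = c(rlnorm(40, 3, 1), rlnorm(40, 5, 1),
                  rlnorm(40, 4, 1), rlnorm(40, 2, 1))
)

# Keep only the three types of interest and make the grouping column a factor
three_types <- c("Type A", "Type B", "Type C")
toy_three <- subset(toy, E_NAIC_Name %in% three_types)
toy_three$E_NAIC_Name <- factor(toy_three$E_NAIC_Name)

# One-way ANOVA on the three-group subset
fit <- aov(CO2_eq ~ E_NAIC_Name, data = toy_three)
anova_table <- summary(fit)[[1]]
anova_table$Df  # 2 (groups - 1) and 117 (records - groups)

# Pairwise comparisons, to see which groups differ
TukeyHSD(fit)

# Nonparametric alternative for skewed data
kruskal.test(CO2_eq ~ E_NAIC_Name, data = toy_three)
```

Checking that the group degrees of freedom equal the number of groups minus one is a quick sanity test that the model was fit to the intended subset.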