Part B1. Sampling Data

Import Data

If your R Markdown file is NOT in the same folder as your data, set your working directory with setwd() first, for example setwd("//medusa/StudentWork/(Your UTOR ID)/GGR276/Lab1"). Use forward slashes (or doubled backslashes) in the path, because R treats a single backslash inside a string as an escape character. You will need to change the path to reflect your personal directory. Then load your data. Note that some records have NA values for emissions, measured in kilotons of carbon dioxide equivalents (kt CO2 eq), and some have missing locations (latitude and longitude both recorded as 0). We will first filter these records out of the data.

library(readr)

## Warning: package 'readr' was built under R version 4.4.3

# install.packages("readr")  # run once beforehand, only if readr is not yet installed

setwd("C:/Users/shunhok/Desktop")
GHG_emissions <- read_csv("GGR276P2.csv")

## Rows: 2629 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): E_DetailPageURL, E_Units, CompanyName, City, E_Province, E_NAIC_Name
## dbl (5): CO2_eq, Report_Year, Latitude, Longitude, Symbol
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

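# Drop records with (0, 0) coordinates or NA emissions
filtered_GHG_emission <- subset(GHG_emissions, !(Latitude == 0 & Longitude == 0) & !(is.na(CO2_eq)))
# Print the original (unfiltered) data for reference
GHG_emissions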

## # A tibble: 2,629 × 11
##    CO2_eq E_DetailPageURL       E_Units Report_Year CompanyName City  E_Province
##     <dbl> <chr>                 <chr>         <dbl> <chr>       <chr> <chr>     
##  1     NA "<a href=\"\"https:/… kiloto…        2022 Teine Ener… Oakd… Saskatche…
##  2     NA "<a href=\"\"https:/… kiloto…        2022 Torxen Ene… <NA>  Alberta   
##  3     NA "<a href=\"\"https:/… kiloto…        2022 NuVista En… <NA>  Alberta   
##  4     NA "<a href=\"\"https:/… kiloto…        2022 Société en… Dolb… Quebec    
##  5     NA "<a href=\"\"https:/… kiloto…        2022 Black Swan… Not … British C…
##  6     NA "<a href=\"\"https:/… kiloto…        2022 Produits f… Baie… Quebec    
##  7     NA "<a href=\"\"https:/… kiloto…        2022 ARC RESOUR… Cutb… Alberta   
##  8     NA "<a href=\"\"https:/… kiloto…        2022 Canlin Ene… <NA>  Alberta   
##  9     NA "<a href=\"\"https:/… kiloto…        2022 Baytex Ene… <NA>  Saskatche…
## 10     NA "<a href=\"\"https:/… kiloto…        2022 Tourmaline… Gran… Alberta   
## # ℹ 2,619 more rows
## # ℹ 4 more variables: E_NAIC_Name <chr>, Latitude <dbl>, Longitude <dbl>,
## #   Symbol <dbl>
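Before relying on the filtered data, it can help to count how many records each condition removes. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical stand-ins, not the lab data) and applies the same two conditions as the filter above.

```r
# Toy stand-in for the GHG data (values are made up for illustration)
toy <- data.frame(
  CO2_eq    = c(12.5, NA, 40.1, 7.3, NA),
  Latitude  = c(51.0, 53.5, 0, 49.2, 0),
  Longitude = c(-114.1, -113.5, 0, -123.1, 0)
)

# Flag each reason for exclusion separately
missing_emission <- is.na(toy$CO2_eq)
missing_location <- toy$Latitude == 0 & toy$Longitude == 0

sum(missing_emission)   # number of rows with NA emissions
sum(missing_location)   # number of rows with (0, 0) coordinates

# Same logic as the lab filter: keep rows that pass both checks
toy_filtered <- subset(toy, !missing_location & !missing_emission)
nrow(toy) - nrow(toy_filtered)   # total number of rows removed
```

Running the same counts on GHG_emissions would show how much of the original data the filter discards.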

Data exploration

Before analyzing the data, we will explore it to become familiar with its structure and content, which will help you answer the questions more effectively. We will examine the column names, review the first few records, and identify the types of facilities included in the dataset. Try to run the code and examine the outputs. You can add a ? in front of a function to learn about its functionality. For example, type ?colnames in the R Console.

colnames(filtered_GHG_emission)
head(filtered_GHG_emission)
unique(filtered_GHG_emission$E_NAIC_Name)

Simple Random Sampling

First, we will randomly sample 100 records.

random_sample <- filtered_GHG_emission[sample(1:nrow(filtered_GHG_emission), size = 100, replace = FALSE), ]
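As a minimal sketch of the same idea on made-up data, the example below draws 100 distinct rows from a hypothetical population of 500 facilities and confirms that no row is selected twice. The data frame `population`, its values, and the seed are assumptions for illustration only.

```r
set.seed(42)  # arbitrary seed, chosen only for this illustration

# Toy population of 500 facilities (made-up emission values)
population <- data.frame(id = 1:500, CO2_eq = rexp(500, rate = 1 / 50))

# Draw 100 distinct row indices, then keep those rows
idx <- sample(seq_len(nrow(population)), size = 100, replace = FALSE)
srs <- population[idx, ]

nrow(srs)              # 100 rows in the sample
anyDuplicated(srs$id)  # 0 means no facility was drawn twice
```

Using seq_len(nrow(x)) instead of 1:nrow(x) behaves the same here, but it also handles an empty data frame safely.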

Stratified Sampling

Next, we will try stratified sampling using the dplyr package. In this exercise, we are only interested in Oil and Gas Extraction (E_NAIC_Name == 'Oil and gas extraction (except oil sands)') and Fossil-Fuel Electric Power Generation (E_NAIC_Name == 'Fossil-Fuel Electric Power Generation') facilities. Therefore, we create a subset first.

#Create a subset containing only Oil and Gas Extraction and Fossil-Fuel Electric Power Generation facilities
subset_facilities <- subset(filtered_GHG_emission, (E_NAIC_Name == 'Oil and gas extraction (except oil sands)'| E_NAIC_Name == 'Fossil-Fuel Electric Power Generation'))

#Stratified Sampling
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.4.3

stratified_sample <- subset_facilities %>%
  group_by(E_NAIC_Name) %>%
  sample_n(size = 50, replace = FALSE) %>%
  ungroup()

(4 marks) Now, it is your turn. Please create a stratified sample consisting of Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal (E_NAIC_Name == 'Waste Treatment and Disposal') facilities. For each category, make the sample size equal to 60. To make the result reproducible, please add the code set.seed(123) prior to your sampling.

In programming, especially in statistical simulations, setting a seed ensures that the random number generator produces the same sequence of random numbers every time the code is run. This is important for reproducibility and consistency in research and debugging.
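The effect of a seed can be seen in a short sketch: resetting the seed to the same value before each draw reproduces the draw exactly. The range 1:1000 and the sample size of 5 are arbitrary choices for illustration.

```r
# Two draws made right after setting the same seed are identical
set.seed(123)
first_draw <- sample(1:1000, size = 5)

set.seed(123)
second_draw <- sample(1:1000, size = 5)

identical(first_draw, second_draw)  # TRUE

# Without resetting the seed, the generator continues its sequence,
# so this draw is not tied to the two above
third_draw <- sample(1:1000, size = 5)
```

This is why set.seed() must run before the sampling step each time the document is knitted.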

#TODO
set.seed(123) #Fix the seed so the sample is reproducible
#Create a subset containing only the three facility types of interest
subset_facilities1 <- subset(filtered_GHG_emission, (E_NAIC_Name == 'Oil and gas extraction (except oil sands)'| E_NAIC_Name == 'Fossil-Fuel Electric Power Generation'| E_NAIC_Name == 'Waste Treatment and Disposal'))

#Stratified Sampling
library(dplyr)
#Sample 60 facilities from each of the three facility types
stratified_sample_1 <- subset_facilities1 %>%
  group_by(E_NAIC_Name) %>%
  sample_n(size = 60, replace = FALSE) %>%
  ungroup()

Compare Sampling Technique

(4 marks) What is the difference between simple random sampling and stratified sampling? In this case, would you use simple random sampling or stratified sampling?

Simple random sampling selects a sample from the population entirely at random, so that every member of the population has an equal chance of being selected. The selection is usually done with a mechanism such as a computer's random number generator. Stratified sampling first divides the population into separate, internally homogeneous groups (strata) based on a chosen characteristic, and then draws a random sample from each group. When the groups differ in size, the sample from each group may need to be drawn in proportion to the group's size to avoid oversampling or undersampling. In this case, stratified sampling should be used because the population is very diverse: facilities come from different cities and provinces and use different energy generation methods. With simple random sampling, some groups could end up underrepresented. Dividing the facilities by E_NAIC_Name and sampling within each group ensures that every group is represented in the sample.
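The proportional allocation mentioned in the response can be sketched in base R as follows. The population `pop`, its three strata, the 10% sampling fraction, and the seed are all hypothetical and chosen only to make the arithmetic easy to follow.

```r
set.seed(1)  # arbitrary seed for illustration

# Toy population with unequal strata (made-up data)
pop <- data.frame(
  sector = rep(c("A", "B", "C"), times = c(600, 300, 100)),
  CO2_eq = rexp(1000, rate = 1 / 50)
)

# Proportional allocation: sample 10% of each stratum
frac <- 0.10
by_sector <- split(pop, pop$sector)
prop_sample <- do.call(rbind, lapply(by_sector, function(g) {
  g[sample(seq_len(nrow(g)), size = round(frac * nrow(g))), ]
}))

table(prop_sample$sector)  # A: 60, B: 30, C: 10
```

The lab code instead takes the same number of facilities from every group (equal allocation), which makes comparing groups easier but does not mirror the population's composition.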

Type your response here:

Part B2. Inferential Statistics

Two-sample difference of mean test

(2 marks) I wonder if the mean carbon dioxide equivalent (kt CO2 eq) emissions differ between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities. Write the null hypothesis and alternative hypothesis.

Type your response here:

H0: The mean CO2 eq emissions do not differ between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities.

H1: The mean CO2 eq emissions differ between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities.

(4 marks) Please select an appropriate statistical test, and provide justifications by visualizing the distribution.

To compare the means of two independent samples, we use either a two-sample t-test or the Wilcoxon rank-sum test (W-test). The t-test assumes that the data are approximately normally distributed; if they are not, the Wilcoxon rank-sum test is used instead. The null hypothesis of the Shapiro-Wilk test is that the data are normally distributed, so a p-value below the 0.05 significance level means that normality is rejected. Here the Shapiro-Wilk test gives p = 1.716e-15, which is far below 0.05, so the data are not normally distributed. Therefore the Wilcoxon rank-sum test is used.

Type your response here:

#Visualize the Distribution
#TODO (2 marks)
qqnorm(stratified_sample$CO2_eq)
qqline(stratified_sample$CO2_eq)

hist(stratified_sample$CO2_eq)

shapiro.test(stratified_sample$CO2_eq)

## 
##  Shapiro-Wilk normality test
## 
## data:  stratified_sample$CO2_eq
## W = 0.57856, p-value = 1.716e-15
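The check above pools both facility types into one vector. A possible complement, sketched below on made-up data, is to run the Shapiro-Wilk test on each group separately and then choose the test. The data frame `toy`, the group labels, the log-normal parameters, and the seed are assumptions for illustration. Because the toy values are drawn from a right-skewed distribution, small p-values are expected.

```r
set.seed(7)  # arbitrary seed for illustration

# Toy right-skewed emissions for two facility types (made-up values)
toy <- data.frame(
  group  = rep(c("oil_gas", "power"), each = 50),
  CO2_eq = c(rlnorm(50, meanlog = 3, sdlog = 1),
             rlnorm(50, meanlog = 4, sdlog = 1))
)

# Shapiro-Wilk p-value for each group separately
p_values <- tapply(toy$CO2_eq, toy$group,
                   function(x) shapiro.test(x)$p.value)
p_values

# Use the t-test only if no group rejects normality at the 0.05 level
use_t_test <- all(p_values > 0.05)
use_t_test
```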

(4 marks) Please write R script to conduct the statistic test and interpret the results.

#TODO (2 marks)
#Split the stratified sample into the two facility types
oil_subset<-subset(stratified_sample, (E_NAIC_Name == 'Oil and gas extraction (except oil sands)'))
Fuel_subset<-subset(stratified_sample, (E_NAIC_Name == 'Fossil-Fuel Electric Power Generation'))
#Two-sided Wilcoxon rank-sum test on independent (unpaired) samples
result<-wilcox.test(oil_subset$CO2_eq, Fuel_subset$CO2_eq, alternative = "two.sided", mu = 0, paired = FALSE, correct = TRUE)
print(result)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  oil_subset$CO2_eq and Fuel_subset$CO2_eq
## W = 679, p-value = 8.392e-05
## alternative hypothesis: true location shift is not equal to 0

Type your response here:

The Wilcoxon rank-sum test gives a p-value of 8.392e-05, which is lower than the 0.05 significance level, so the null hypothesis is rejected. Therefore, CO2 eq emissions differ between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities.
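A significant two-sided result says that the groups differ but not in which direction. Since the Wilcoxon test works on ranks, group medians are a natural summary to report alongside it. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical) to show one way to compute them.

```r
# Toy emissions with one large outlier per group (made-up values)
toy <- data.frame(
  group  = rep(c("oil_gas", "power"), each = 4),
  CO2_eq = c(10, 12, 15, 200, 40, 55, 60, 900)
)

# Median emission per group: 13.5 for oil_gas, 57.5 for power
group_medians <- aggregate(CO2_eq ~ group, data = toy, FUN = median)
group_medians

# Means for comparison: 59.25 and 263.75, pulled up by the outliers
group_means <- aggregate(CO2_eq ~ group, data = toy, FUN = mean)
group_means
```

In skewed data like this, the median describes a typical facility better than the mean does.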

Three-sample Mean Difference Test

(8 marks) Please follow the same procedure above to test whether the means of CO2 eq emissions for Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal facilities are the same.

H0: The mean CO2 eq emissions are the same for Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal facilities.

H1: At least one of the mean CO2 eq emissions for Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal facilities differs from the others.

Type your response here:

To compare the means of three samples, we use either a one-way ANOVA or the Kruskal-Wallis test. For the one-way ANOVA to be valid, four conditions must hold: the data come from an unbiased, randomly selected sample; the dependent variable is approximately normally distributed; the groups are independent of each other; and the groups have equal variances. The first condition is met because the sample was drawn by computer using stratified random sampling. For the second condition, we use the Shapiro-Wilk test to check whether the data are normally distributed. For the fourth condition, we use Levene's test to check the equal-variance assumption; a p-value below 0.05 indicates that the variances are not equal.

Both the Shapiro-Wilk test (p < 2.2e-16) and Levene's test (p = 0.0004206) give p-values below the 0.05 significance level, so the null hypotheses of normality and of equal variances are rejected: the data are not normally distributed and the variances are not equal. Therefore the Kruskal-Wallis test is used.

The Kruskal-Wallis test gives a p-value of 1.172e-06, which is lower than the 0.05 significance level, so the null hypothesis is rejected. Therefore, at least one of the Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal facility types differs from the others in CO2 eq emissions.

#TODO 
library(car)

## Warning: package 'car' was built under R version 4.4.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.4.3

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

#Visualize Distribution (1 mark)
qqnorm(stratified_sample_1$CO2_eq)
qqline(stratified_sample_1$CO2_eq)

hist(stratified_sample_1$CO2_eq)

shapiro.test(subset_facilities1$CO2_eq)

## 
##  Shapiro-Wilk normality test
## 
## data:  subset_facilities1$CO2_eq
## W = 0.21486, p-value < 2.2e-16

levene<-leveneTest(CO2_eq~E_NAIC_Name,data=stratified_sample_1)

## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.

print(levene)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   1  13.176 0.0004206 ***
##       118                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Statistical Test (2 marks)
kruskal<-kruskal.test(CO2_eq~E_NAIC_Name,data=stratified_sample_1)
print(kruskal)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  CO2_eq by E_NAIC_Name
## Kruskal-Wallis chi-squared = 23.622, df = 1, p-value = 1.172e-06
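A significant Kruskal-Wallis result only indicates that at least one group differs. One common follow-up, which the lab does not require, is a set of pairwise Wilcoxon rank-sum tests with a multiple-comparison correction. The sketch below runs it on made-up data; the data frame `toy`, the group labels, the log-normal parameters, the seed, and the choice of the Holm correction are assumptions for illustration.

```r
set.seed(99)  # arbitrary seed for illustration

# Toy right-skewed emissions for three facility types (made-up values)
toy <- data.frame(
  group  = rep(c("oil_gas", "power", "waste"), each = 30),
  CO2_eq = c(rlnorm(30, meanlog = 3, sdlog = 1),
             rlnorm(30, meanlog = 4, sdlog = 1),
             rlnorm(30, meanlog = 3, sdlog = 1))
)

# Pairwise Wilcoxon rank-sum tests with Holm-adjusted p-values
posthoc <- pairwise.wilcox.test(toy$CO2_eq, toy$group,
                                p.adjust.method = "holm")

# Lower-triangular matrix of adjusted p-values, one per pair of groups
posthoc$p.value
```

Pairs with an adjusted p-value below 0.05 would be the ones reported as differing.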

GGR276 Lab 2 Part 2 Sampling Data and Inferential Statistics

Shun Hok Lun Stu Number:1009913827

2025-02-28
