If your R Markdown file is NOT in the same folder as your
data, please set your working directory using setwd()
first. Here is an example:
setwd("//medusa/StudentWork/(Your UTOR ID)/GGR276/Lab1")
You will need to change the code to reflect your personal directory.
Then load your data. Note that some records have NA values for emissions,
measured in kilotonnes of carbon dioxide equivalents (kt CO2 eq), and some
have missing locations (latitude and longitude both recorded as 0).
We will first filter these records out of the data.
setwd("/Users/amroopbains/Downloads")
GHG_emission <- read.csv("greenhousegasemissions.csv", sep = ',', header = TRUE)
filtered_GHG_emission <- subset(GHG_emission, !(Latitude == 0 & Longitude == 0) & !(is.na(CO2_eq)))
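As a quick check of what the filter does, the same condition can be applied to a small made-up data frame. This is only a sketch: the `toy` data frame and its values are invented for illustration and are not part of the lab data.

```r
# Toy data standing in for the real file (values are made up for illustration)
toy <- data.frame(
  CO2_eq    = c(50.0, NA, 18.9, 73.1),
  Latitude  = c(50.1, 52.3, 0, 43.0),
  Longitude = c(-110.8, -113.8, 0, -81.6)
)

# Same condition as above: drop (0, 0) coordinates and missing emissions
toy_filtered <- subset(toy, !(Latitude == 0 & Longitude == 0) & !is.na(CO2_eq))

# How many rows were removed, and is anything still missing?
rows_removed <- nrow(toy) - nrow(toy_filtered)
print(rows_removed)
print(colSums(is.na(toy_filtered)))
```

Running the same two checks on GHG_emission and filtered_GHG_emission shows how many records the filter dropped from the real data.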
Before analyzing the data, we will explore it to become familiar with its structure and content, which will help you answer the questions more effectively. We will examine the column names, review the first few records, and identify the types of facilities included in the dataset. Try to run the code and examine the outputs. You can add a ? in front of a function to learn about its functionality. For example, type ?colnames in the R Console.
colnames(filtered_GHG_emission)
head(filtered_GHG_emission)
unique(filtered_GHG_emission$E_NAIC_Name)
First, we will randomly sample 100 records.
random_sample <- filtered_GHG_emission[sample(1:nrow(filtered_GHG_emission), size = 100, replace = FALSE), ]
head(random_sample)
## CO2_eq
## 2191 50.01
## 1389 18.88
## 2565 739.95
## 1010 17.28
## 1739 43.15
## 2247 73.14
## E_DetailPageURL
## 2191 <a href=https://indicators-map.canada.ca/App/Detail?id=0111593&GoCTemplateCulture=en-CA target=_blank>More information</a>
## 1389 <a href=https://indicators-map.canada.ca/App/Detail?id=0111335&GoCTemplateCulture=en-CA target=_blank>More information</a>
## 2565 <a href=https://indicators-map.canada.ca/App/Detail?id=0110254&GoCTemplateCulture=en-CA target=_blank>More information</a>
## 1010 <a href=https://indicators-map.canada.ca/App/Detail?id=0111978&GoCTemplateCulture=en-CA target=_blank>More information</a>
## 1739 <a href=https://indicators-map.canada.ca/App/Detail?id=0110782&GoCTemplateCulture=en-CA target=_blank>More information</a>
## 2247 <a href=https://indicators-map.canada.ca/App/Detail?id=0110769&GoCTemplateCulture=en-CA target=_blank>More information</a>
## E_Units Report_Year
## 2191 kilotonnes of carbon dioxide equivalents (kt CO2 eq) 2022
## 1389 kilotonnes of carbon dioxide equivalents (kt CO2 eq) 2022
## 2565 kilotonnes of carbon dioxide equivalents (kt CO2 eq) 2022
## 1010 kilotonnes of carbon dioxide equivalents (kt CO2 eq) 2022
## 1739 kilotonnes of carbon dioxide equivalents (kt CO2 eq) 2022
## 2247 kilotonnes of carbon dioxide equivalents (kt CO2 eq) 2022
## CompanyName City E_Province
## 2191 Redcliff Cypress Waste Management Authority Cypress County Alberta
## 1389 Olymel S.E.C. Red Deer Red Deer Alberta
## 2565 Suncor Energy Products Partnership Sarnia Ontario
## 1010 Linde Canada Inc. Fort Saskatchewan Alberta
## 1739 Les Forges de Sorel Cie St Joseph De Sorel Quebec
## 2247 Meridian Technologies Inc. Strathroy Ontario
## E_NAIC_Name Latitude Longitude Symbol
## 2191 Waste Treatment and Disposal 50.09602 -110.8483 3
## 1389 Hog slaughtering 52.30342 -113.7914 2
## 2565 Petroleum Refineries 42.93060 -82.4433 5
## 1010 Industrial Gas Manufacturing 53.72126 -113.1795 2
## 1739 Forging 46.04500 -73.1239 2
## 2247 Non-Ferrous Die-Casting Foundries 42.99070 -81.6208 3
Next, we will try stratified sampling using the dplyr package. In this exercise, we are only interested in Oil and Gas Extraction (E_NAIC_Name == 'Oil and gas extraction (except oil sands)') and Fossil-Fuel Electric Power Generation (E_NAIC_Name == 'Fossil-Fuel Electric Power Generation') facilities. Therefore, we create a subset first.
#Create a subset containing only Oil and Gas Extraction and Fossil-Fuel Electric Power Generation facilities
subset_facilities <- subset(filtered_GHG_emission, (E_NAIC_Name == 'Oil and gas extraction (except oil sands)'| E_NAIC_Name == 'Fossil-Fuel Electric Power Generation'))
#Stratified Sampling
library(dplyr)
stratified_sample <- subset_facilities %>%
  group_by(E_NAIC_Name) %>%
  sample_n(size = 50, replace = FALSE) %>%
  ungroup()
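sample_n() stops with an error if a group has fewer rows than the requested size when sampling without replacement, so it can help to count the rows in each stratum first. The sketch below uses made-up data (the `toy` data frame and the "Type A"/"Type B" labels are invented) and dplyr's slice_sample(), which supersedes sample_n() and truncates the sample to the group size instead of failing.

```r
library(dplyr)

# Toy data (made up): two facility types with unequal counts
toy <- data.frame(
  E_NAIC_Name = rep(c("Type A", "Type B"), times = c(80, 30)),
  CO2_eq      = seq_len(110)
)

# Rows available per stratum before sampling
print(count(toy, E_NAIC_Name))

set.seed(123)
toy_sample <- toy %>%
  group_by(E_NAIC_Name) %>%
  slice_sample(n = 50) %>%  # truncated to the group size when a group has < 50 rows
  ungroup()

# Rows actually drawn per stratum
print(count(toy_sample, E_NAIC_Name))
```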
(4 marks) Now, it is your turn. Please create a stratified sample
consisting of Oil and Gas Extraction, Fossil-Fuel Electric Power
Generation, and Waste Treatment and Disposal (E_NAIC_Name == 'Waste
Treatment and Disposal') facilities. For each category, make the sample
size equal to 60. To make the result reproducible, please add the code
set.seed(123)
prior to your sampling.
In programming, especially in statistical simulations, setting a seed ensures that the random number generator produces the same sequence of random numbers every time the code is run. This is important for reproducibility and consistency in research and debugging.
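A minimal sketch of this behaviour, using made-up variable names: resetting the same seed before each draw reproduces the draw exactly, while a draw made without resetting the seed continues the random sequence.

```r
# Draw five numbers after setting the seed
set.seed(123)
first_draw <- sample(1:1000, size = 5)

# Reset the same seed and draw again
set.seed(123)
second_draw <- sample(1:1000, size = 5)

# The two draws are identical because the generator restarted from the same state
print(identical(first_draw, second_draw))

# Without resetting the seed, the generator simply continues its sequence
third_draw <- sample(1:1000, size = 5)
print(third_draw)
```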
#TODO
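library(dplyr)
# Note: this rebuilds filtered_GHG_emission from the raw data and keeps only
# rows with a facility type, so rows with NA CO2_eq are retained (see output)
filtered_GHG_emission <- GHG_emission %>% filter(!is.na(E_NAIC_Name))
set.seed(123)
# Keep the three facility types of interest
subset_facilities <- filtered_GHG_emission %>%
  filter(E_NAIC_Name %in% c('Oil and gas extraction (except oil sands)', 'Fossil-Fuel Electric Power Generation', 'Waste Treatment and Disposal'))
# Draw up to 60 rows per facility type without replacement
stratified_sample <- subset_facilities %>%
  group_by(E_NAIC_Name) %>%
  sample_n(size = min(60, n()), replace = FALSE) %>%
  ungroup()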
head(stratified_sample)
## # A tibble: 6 × 11
## CO2_eq E_DetailPageURL E_Units Report_Year CompanyName City E_Province
## <dbl> <chr> <chr> <int> <chr> <chr> <chr>
## 1 NA <a href=https://indic… kiloto… 2022 Ontario Po… ntic… Ontario
## 2 NA <a href=https://indic… kiloto… 2022 NWT Power … Tuli… Northwest…
## 3 370. <a href=https://indic… kiloto… 2022 Grande Pra… Sexs… Alberta
## 4 NA <a href=https://indic… kiloto… 2022 Genalta GP… Peac… Alberta
## 5 1065. <a href=https://indic… kiloto… 2022 Nova Scoti… Dart… Nova Scot…
## 6 1110. <a href=https://indic… kiloto… 2022 TransAlta … Sarn… Ontario
## # ℹ 4 more variables: E_NAIC_Name <chr>, Latitude <dbl>, Longitude <dbl>,
## # Symbol <int>
(4 marks) What is the difference between simple random sampling and stratified sampling? In this case, would you use simple random sampling or stratified sampling?
Type your response here: In simple random sampling, every unit has an equal chance of being selected, and distinct groups are not considered. In stratified sampling, the data are separated into groups based on a feature, which in our example is the type of facility. Samples are then taken from each of these groups. In this case I would use stratified sampling, since we are only interested in oil and gas extraction, fossil-fuel electric power generation, and waste treatment facilities. This ensures that each group is equally represented, whereas simple random sampling may over-represent one group.
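The contrast can be illustrated with a made-up population in which one group is much larger than the other. This is a base R sketch; the `population` data frame and the "Common"/"Rare" labels are invented for illustration.

```r
set.seed(123)

# Toy population (made up): one common and one rare facility type
population <- data.frame(
  id   = 1:1000,
  type = rep(c("Common", "Rare"), times = c(900, 100))
)

# Simple random sample: every row has the same chance of selection,
# so the sample tends to mirror the 9:1 imbalance in the population
srs <- population[sample(nrow(population), size = 100), ]
print(table(srs$type))

# Stratified sample: draw 50 rows from each type separately
strata <- split(population, population$type)
stratified <- do.call(
  rbind,
  lapply(strata, function(g) g[sample(nrow(g), size = 50), ])
)
print(table(stratified$type))
```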
(2 marks) I wonder if the mean carbon dioxide equivalents (kt CO2 eq) emissions differ between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities. Write the null hypothesis and alternate hypothesis.
Type your response here: The null hypothesis is that there is no difference in the mean carbon dioxide equivalent emissions between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities. The alternate hypothesis is that there is a difference in the mean carbon dioxide equivalent emissions between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities.
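In symbols, the two hypotheses can be written as follows, where $\mu$ denotes the population mean emissions in kt CO2 eq. The subscripts are labels chosen here for illustration.

```latex
H_0:\ \mu_{\text{oil and gas}} = \mu_{\text{fossil fuel}}
\qquad
H_A:\ \mu_{\text{oil and gas}} \neq \mu_{\text{fossil fuel}}
```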
(4 marks) Please select an appropriate statistical test, and provide justifications by visualizing the distribution.
Type your response here: Looking at the distributions in the histograms, I would use the Wilcoxon rank-sum test because the data are highly skewed to the right and outliers are present. The fossil fuel data also have a much larger range of values.
#Visualize the Distribution
#TODO (2 marks)
library(dplyr)
oil_gas <- filtered_GHG_emission %>% filter(E_NAIC_Name == 'Oil and gas extraction (except oil sands)')
fossil_fuel <- filtered_GHG_emission %>% filter(E_NAIC_Name == 'Fossil-Fuel Electric Power Generation')
hist(oil_gas$CO2_eq, main = "CO2 Emissions - Oil & Gas", col = "blue", xlab = "kt CO2 eq")
hist(fossil_fuel$CO2_eq, main = "CO2 Emissions - Fossil Fuel", col = "red", xlab = "kt CO2 eq")
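Right skew can also be checked numerically and on a transformed scale. The sketch below uses simulated log-normal values in place of the real emissions (the `emissions` vector is made up): a mean well above the median, and a normal Q-Q plot that bends away from the reference line, both point to right skew.

```r
set.seed(123)

# Simulated right-skewed "emissions" (made up; not the lab data)
emissions <- rlnorm(200, meanlog = 3, sdlog = 1.5)

# In right-skewed data the mean is pulled above the median
print(c(mean = mean(emissions), median = median(emissions)))

# A histogram on the log10 scale makes the bulk of the data easier to see
hist(log10(emissions), main = "Simulated emissions (log10 scale)",
     xlab = "log10(kt CO2 eq)", col = "grey")

# Normal Q-Q plot: points curving away from the line suggest non-normality
qqnorm(emissions)
qqline(emissions)
```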
(4 marks) Please write an R script to conduct the statistical test and interpret the results.
#TODO (2 marks)
shapiro.test(oil_gas$CO2_eq)
##
## Shapiro-Wilk normality test
##
## data: oil_gas$CO2_eq
## W = 0.47406, p-value < 2.2e-16
shapiro.test(fossil_fuel$CO2_eq)
##
## Shapiro-Wilk normality test
##
## data: fossil_fuel$CO2_eq
## W = 0.51818, p-value < 2.2e-16
test <- wilcox.test(oil_gas$CO2_eq, fossil_fuel$CO2_eq)
print(test)
##
## Wilcoxon rank sum test with continuity correction
##
## data: oil_gas$CO2_eq and fossil_fuel$CO2_eq
## W = 17652, p-value = 7.158e-15
## alternative hypothesis: true location shift is not equal to 0
Type your response here: Since the p-value (7.158e-15) is much lower than 0.05, the result is statistically significant, and we reject the null hypothesis that there is no difference in carbon dioxide equivalent emissions between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities.
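The decision rule can be sketched on simulated data. Note that, as the test output states, the rank-sum test addresses a location shift between the two distributions rather than the means directly, so reporting the group medians alongside the p-value helps show the direction of the difference. The `group_a` and `group_b` vectors and the 0.05 threshold stored in `alpha` are assumptions made for this example.

```r
set.seed(123)

# Two simulated right-skewed groups (made up; not the lab data)
group_a <- rlnorm(100, meanlog = 2, sdlog = 1)
group_b <- rlnorm(100, meanlog = 4, sdlog = 1)

# Two-sided Wilcoxon rank-sum test
result <- wilcox.test(group_a, group_b)
print(result$p.value)

# Medians show which group tends to have larger values
print(c(median_a = median(group_a), median_b = median(group_b)))

# Compare the p-value with the chosen significance level
alpha <- 0.05
if (result$p.value < alpha) {
  message("Reject the null hypothesis at the ", alpha, " level")
} else {
  message("Fail to reject the null hypothesis at the ", alpha, " level")
}
```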
(8 marks) Please follow the same procedure above to test whether the means of CO2 eq emissions for Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal facilities are the same.
Type your response here: The p-value obtained using the Kruskal-Wallis test (< 2.2e-16) is below 0.05, so we reject the null hypothesis, meaning that CO2 eq emissions differ for at least one of the three facility types.
#TODO
#Visualize Distribution (1 mark)
#Statistical Test (2 marks)
subset_facilities <- filtered_GHG_emission %>%
filter(E_NAIC_Name %in% c('Oil and gas extraction (except oil sands)','Fossil-Fuel Electric Power Generation', 'Waste Treatment and Disposal'))
boxplot(CO2_eq ~ E_NAIC_Name, data = subset_facilities,
main = "CO2 Emissions Across Facility Types",
ylab = "kt CO2 eq", col = c("blue", "red", "green"))
shapiro.test(subset_facilities$CO2_eq[subset_facilities$E_NAIC_Name == 'Oil and gas extraction (except oil sands)'])
##
## Shapiro-Wilk normality test
##
## data: subset_facilities$CO2_eq[subset_facilities$E_NAIC_Name == "Oil and gas extraction (except oil sands)"]
## W = 0.47406, p-value < 2.2e-16
shapiro.test(subset_facilities$CO2_eq[subset_facilities$E_NAIC_Name == 'Fossil-Fuel Electric Power Generation'])
##
## Shapiro-Wilk normality test
##
## data: subset_facilities$CO2_eq[subset_facilities$E_NAIC_Name == "Fossil-Fuel Electric Power Generation"]
## W = 0.51818, p-value < 2.2e-16
shapiro.test(subset_facilities$CO2_eq[subset_facilities$E_NAIC_Name == 'Waste Treatment and Disposal'])
##
## Shapiro-Wilk normality test
##
## data: subset_facilities$CO2_eq[subset_facilities$E_NAIC_Name == "Waste Treatment and Disposal"]
## W = 0.57345, p-value < 2.2e-16
k_test <- kruskal.test(CO2_eq ~ E_NAIC_Name, data = subset_facilities)
print(k_test)
##
## Kruskal-Wallis rank sum test
##
## data: CO2_eq by E_NAIC_Name
## Kruskal-Wallis chi-squared = 86.451, df = 2, p-value < 2.2e-16
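A significant Kruskal-Wallis result indicates that at least one group differs, but not which one. One common follow-up, shown here only as a sketch on simulated data, is a set of pairwise Wilcoxon rank-sum tests with a multiple-comparison adjustment. The `toy` data frame, the group labels, and the choice of the Holm adjustment are assumptions for this example and are not part of the lab instructions.

```r
set.seed(123)

# Three simulated right-skewed groups (made up; not the lab data)
toy <- data.frame(
  CO2_eq = c(rlnorm(60, meanlog = 2, sdlog = 1),
             rlnorm(60, meanlog = 4, sdlog = 1),
             rlnorm(60, meanlog = 3, sdlog = 1)),
  type   = rep(c("A", "B", "C"), each = 60)
)

# Overall test across the three groups
print(kruskal.test(CO2_eq ~ type, data = toy))

# Pairwise rank-sum tests with Holm-adjusted p-values
posthoc <- pairwise.wilcox.test(toy$CO2_eq, toy$type, p.adjust.method = "holm")
print(posthoc)
```

The adjusted p-values form a lower-triangular table with one entry per pair of groups.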