rm(list=ls()); gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 528519 28.3 1175327 62.8 NA 669454 35.8
## Vcells 974453 7.5 8388608 64.0 16384 1851679 14.2
If your R Markdown file is NOT in the same folder as your
data, set your working directory with setwd()
first. Here is an example (forward slashes are used because R treats
a backslash inside a string as an escape character):
setwd("//medusa/StudentWork/(Your UTOR ID)/GGR276/Lab1")
You will need to change the path to reflect your personal directory.
Then load your data. Note that some records have NA values for
emissions in kilotons of carbon dioxide equivalents (kt CO2 eq), and
some have missing locations (latitude and longitude both recorded
as 0), so we will first filter these records out of the data.
GHG_emission <- read.csv("GHG_Emission.csv", sep = ',', header = TRUE)
filtered_GHG_emission <- subset(GHG_emission, !(Latitude == 0 & Longitude == 0) & !(is.na(CO2_eq)))
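The same filter can be tried on a small made-up table before running it on the lab data. The sketch below uses a hypothetical `toy` data frame (not the real file) that only borrows the column names `Latitude`, `Longitude`, and `CO2_eq`.

```r
# Hypothetical toy data with the same column names as the lab file
toy <- data.frame(
  Latitude  = c(43.66, 0, 51.05, 45.42),
  Longitude = c(-79.40, 0, -114.07, -75.70),
  CO2_eq    = c(120.5, 80.0, NA, 15.2)
)

# Drop rows located at (0, 0) and rows with missing emissions
toy_filtered <- subset(toy, !(Latitude == 0 & Longitude == 0) & !is.na(CO2_eq))

print(toy_filtered)
nrow(toy_filtered)  # rows 1 and 4 remain, so this is 2
```

Row 2 is removed because both coordinates are 0, and row 3 is removed because its emission value is NA.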
Before analyzing the data, we will explore it to become familiar with its structure and content, which will help you answer the questions more effectively. We will examine the column names, review the first few records, and identify the types of facilities included in the dataset. Try to run the code and examine the outputs. You can add a ? in front of a function to learn about its functionality. For example, type ?colnames in the R Console.
colnames(filtered_GHG_emission)
head(filtered_GHG_emission)
unique(filtered_GHG_emission$E_NAIC_Name)
First, we will randomly sample 100 records.
random_sample <- filtered_GHG_emission[sample(1:nrow(filtered_GHG_emission), size = 100, replace = FALSE), ]
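To see what the indexing above does, here is a minimal sketch on a hypothetical data frame called `records` (an assumption for illustration, not the lab data). `sample()` draws row numbers without replacement, and those row numbers are then used to pick rows.

```r
# Hypothetical data frame with 500 rows
records <- data.frame(id = 1:500, value = seq(0.5, 250, by = 0.5))

set.seed(123)

# Draw 100 distinct row numbers, then keep those rows
row_ids <- sample(seq_len(nrow(records)), size = 100, replace = FALSE)
srs <- records[row_ids, ]

nrow(srs)                # 100
anyDuplicated(srs$id)    # 0, because replace = FALSE never repeats a row
```

`seq_len(nrow(records))` is used here instead of `1:nrow(records)` only because it also behaves sensibly when a data frame has zero rows.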
Next, we will try stratified sampling using the dplyr package. In this exercise, we are only interested in Oil and Gas Extraction (E_NAIC_Name == 'Oil and gas extraction (except oil sands)') and Fossil-Fuel Electric Power Generation (E_NAIC_Name == 'Fossil-Fuel Electric Power Generation') facilities. Therefore, we create a subset first.
#Create a subset containing only Oil and Gas Extraction and Fossil-Fuel Electric Power Generation facilities
subset_facilities <- subset(filtered_GHG_emission, (E_NAIC_Name == 'Oil and gas extraction (except oil sands)'| E_NAIC_Name == 'Fossil-Fuel Electric Power Generation'))
#Stratified Sampling
library(dplyr)
stratified_sample <- subset_facilities %>%
  group_by(E_NAIC_Name) %>%
  sample_n(size = 50, replace = FALSE) %>%
  ungroup()
(4 marks) Now, it is your turn. Please create a stratified sample
consisting of Oil and Gas Extraction, Fossil-Fuel Electric Power
Generation, and Waste Treatment and Disposal (E_NAIC_Name == 'Waste
Treatment and Disposal') facilities. For each category, make the sample
size equal to 60. To make the result reproducible, please add
set.seed(123) before your sampling code.
In programming, especially in statistical simulations, setting a seed ensures that the random number generator produces the same sequence of random numbers every time the code is run. This is important for reproducibility and consistency in research and debugging.
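The effect of a seed can be checked directly. In this minimal sketch (the variable names are made up for illustration), the same seed is set before two separate draws, and the draws come out identical.

```r
# Two draws made right after setting the same seed are identical
set.seed(123)
first_draw <- sample(1:1000, size = 5)

set.seed(123)
second_draw <- sample(1:1000, size = 5)

identical(first_draw, second_draw)  # TRUE

# Without resetting the seed, the next draw continues the random sequence
third_draw <- sample(1:1000, size = 5)
```

Note that the seed must be set immediately before the sampling step; any other random call in between shifts the sequence.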
set.seed(123)
subset_facilities <- subset(filtered_GHG_emission, (E_NAIC_Name == 'Oil and gas extraction (except oil sands)'| E_NAIC_Name == 'Fossil-Fuel Electric Power Generation'|E_NAIC_Name == 'Waste Treatment and Disposal'))
library(dplyr)
stratified_sample <- subset_facilities %>%
  group_by(E_NAIC_Name) %>%
  sample_n(size = 60, replace = FALSE) %>%
  ungroup()
(4 marks) What is the difference between simple random sampling and stratified sampling? In this case, would you use simple random sampling or stratified sampling?
Type your response here:
In simple random sampling, every record in the dataset has the same
probability of being selected. In stratified sampling, the data or
population is divided into categories (strata), and a random sample is
drawn from each stratum. Stratified sampling is more complex than simple
random sampling. I would use stratified sampling. Since I am studying
greenhouse gas emissions from different types of large facilities,
stratified sampling ensures that every facility type of interest is
represented. With simple random sampling, a facility type with few
records may be missing from the sample, or represented by only a few
records, especially when the sample size is small. Stratified sampling
is therefore preferred here.
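The contrast can be illustrated with a minimal base R sketch. The `population` data frame and its categories A, B, and C are hypothetical; the stratified draw uses `split()` and `lapply()` rather than dplyr so that it runs without extra packages.

```r
set.seed(123)

# Hypothetical population: one large category and two small ones
population <- data.frame(
  category = rep(c("A", "B", "C"), times = c(900, 80, 20)),
  value    = rnorm(1000)
)

# Simple random sample of 30 rows: counts per category vary by chance
srs <- population[sample(seq_len(nrow(population)), size = 30), ]
table(srs$category)

# Stratified sample: 10 rows drawn at random within each category
strata <- split(population, population$category)
stratified <- do.call(rbind, lapply(strata, function(g) {
  g[sample(seq_len(nrow(g)), size = 10), ]
}))
table(stratified$category)  # exactly 10 per category
```

In the simple random sample the small categories tend to contribute very few rows, while the stratified sample fixes the number drawn from each category in advance.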
(2 marks) I wonder if the mean carbon dioxide equivalents (kt CO2 eq) emissions differ between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities. Write the null hypothesis and alternative hypothesis.
Type your response here: Null hypothesis: there is no difference in mean carbon dioxide equivalents (kt CO2 eq) emissions between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities. Alternative hypothesis: the mean carbon dioxide equivalents (kt CO2 eq) emissions differ between Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities.
(4 marks) Please select an appropriate statistical test, and provide justifications by visualizing the distribution.
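Type your response here: According to the visualization, both histograms show strongly right-skewed distributions rather than normal distributions. Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities are two independent samples, and the question is about a difference in means, so a two-sample t-test is used; the Welch version is chosen because it does not assume equal variances. Because the raw data are skewed, the t-test here relies on the samples being large enough for the sampling distribution of the mean to be approximately normal; a Wilcoxon rank-sum test would be a nonparametric alternative.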
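As a complement to the histograms, normality can also be checked numerically, and a rank-based test can be run alongside the t-test. The sketch below uses simulated right-skewed data (`group_a` and `group_b` are hypothetical stand-ins, not the lab data), so the printed values say nothing about the real emissions.

```r
set.seed(123)

# Hypothetical right-skewed samples standing in for two facility types
group_a <- rlnorm(80, meanlog = 3, sdlog = 1)
group_b <- rlnorm(60, meanlog = 4, sdlog = 1)

# Shapiro-Wilk test: a small p-value is evidence against normality
shapiro_a <- shapiro.test(group_a)
shapiro_a$p.value

# Welch t-test: compares means without assuming equal variances
welch <- t.test(group_a, group_b, var.equal = FALSE)

# Wilcoxon rank-sum test: rank-based, does not assume normality
rank_sum <- wilcox.test(group_a, group_b)

c(welch = welch$p.value, wilcoxon = rank_sum$p.value)
```

The two tests answer slightly different questions: the t-test compares means, while the rank-sum test compares the distributions through ranks, so they need not agree on skewed data.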
#Visualize the Distribution
set.seed(123)
oil_and_gas <- subset(filtered_GHG_emission, E_NAIC_Name == 'Oil and gas extraction (except oil sands)')
fossil_fuel <- subset(filtered_GHG_emission, E_NAIC_Name == 'Fossil-Fuel Electric Power Generation')
hist(oil_and_gas$CO2_eq,
     main = "Distribution of CO2 eq Emissions: Oil and Gas Extraction",
     xlab = "kt CO2 eq emissions")
hist(fossil_fuel$CO2_eq,
     main = "Distribution of CO2 eq Emissions: Fossil-Fuel Electric Power Generation",
     xlab = "kt CO2 eq emissions")
(4 marks) Please write an R script to conduct the statistical test and interpret the results.
t_test_result <- t.test(oil_and_gas$CO2_eq, fossil_fuel$CO2_eq, var.equal = FALSE)
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: oil_and_gas$CO2_eq and fossil_fuel$CO2_eq
## t = -4.7939, df = 98.144, p-value = 5.822e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -755.0944 -312.9731
## sample estimates:
## mean of x mean of y
## 48.14184 582.17556
Type your response here: According to the result, the t-value is -4.7939, the degrees of freedom are 98.144, and the p-value is 5.822e-06. Since p < 0.05, the null hypothesis is rejected. The alternative hypothesis, "true difference in means is not equal to 0", states that the mean emissions of Oil and Gas Extraction facilities and Fossil-Fuel Electric Power Generation facilities are not equal. The 95 percent confidence interval for the difference in means (Oil and Gas Extraction minus Fossil-Fuel Electric Power Generation) is -755.0944 to -312.9731 kt CO2 eq, which does not include 0. The mean for Oil and Gas Extraction facilities is 48.14184 kt CO2 eq, while the mean for Fossil-Fuel Electric Power Generation facilities is 582.17556 kt CO2 eq. Therefore, the mean carbon dioxide equivalents (kt CO2 eq) emissions differ between the two facility types, and the latter is significantly higher than the former.
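The t-value, degrees of freedom, and p-value reported by `t.test()` can be reproduced by hand, which shows why the Welch degrees of freedom are usually not a whole number. This is a minimal sketch on simulated data (`x` and `y` are hypothetical), using the standard Welch statistic and the Welch-Satterthwaite approximation.

```r
set.seed(123)

# Hypothetical samples, not the lab data
x <- rlnorm(70, meanlog = 3, sdlog = 1)
y <- rlnorm(50, meanlog = 4, sdlog = 1)

# Squared standard error of each sample mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite degrees of freedom
df_manual <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Compare with the built-in test
res <- t.test(x, y, var.equal = FALSE)
all.equal(unname(res$statistic), t_manual)
all.equal(unname(res$parameter), df_manual)
all.equal(res$p.value, p_manual)
```

The sign of the t-value depends only on the order of the two arguments: a negative value means the first sample has the smaller mean.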
(8 marks) Please follow the same procedure above to test whether the means of CO2 eq emissions for Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal facilities are the same.
Type your response here: Null hypothesis: the means of CO2 eq emissions for Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal facilities are the same. Alternative hypothesis: the means of CO2 eq emissions for Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal facilities are not all the same.
#Visualize the Distribution
set.seed(123)
oil_and_gas <- subset(filtered_GHG_emission, E_NAIC_Name == 'Oil and gas extraction (except oil sands)')
fossil_fuel <- subset(filtered_GHG_emission, E_NAIC_Name == 'Fossil-Fuel Electric Power Generation')
waste_treatment <- subset(filtered_GHG_emission, E_NAIC_Name == 'Waste Treatment and Disposal')
hist(oil_and_gas$CO2_eq,
     main = "Distribution of CO2 eq Emissions: Oil and Gas Extraction",
     xlab = "kt CO2 eq emissions")
hist(fossil_fuel$CO2_eq,
     main = "Distribution of CO2 eq Emissions: Fossil-Fuel Electric Power Generation",
     xlab = "kt CO2 eq emissions")
hist(waste_treatment$CO2_eq,
     main = "Distribution of CO2 eq Emissions: Waste Treatment and Disposal",
     xlab = "kt CO2 eq emissions")
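According to the visualization, all three histograms show strongly right-skewed distributions rather than normal distributions. The three facility types are independent groups, and the question is whether their means are equal, so a one-way ANOVA is used: it assesses whether there is a statistically significant difference among the means of three or more independent groups. Because the raw data are skewed, the result relies on the group sizes being large enough for the F test to be approximately valid; a Kruskal-Wallis test would be a nonparametric alternative.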
anova_result <- aov(CO2_eq ~ E_NAIC_Name, data = filtered_GHG_emission)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## E_NAIC_Name 153 293399893 1917646 8.304 <2e-16 ***
## Residuals 1648 380573319 230930
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
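According to the result, the degrees of freedom are 153 for E_NAIC_Name and 1648 for the residuals. The sum of squares is 293399893 for E_NAIC_Name and 380573319 for the residuals. The mean square is 1917646 for E_NAIC_Name and 230930 for the residuals. The F value is 8.304 and the p-value is < 2e-16. Since p < 0.05, the null hypothesis of equal means is rejected. Note, however, that the model was fitted to filtered_GHG_emission, which contains every facility category (Df = 153 corresponds to 154 categories), so this result shows that mean CO2 eq emissions differ across facility categories overall. To test only Oil and Gas Extraction, Fossil-Fuel Electric Power Generation, and Waste Treatment and Disposal facilities, the model should be fitted to the three-category subset (subset_facilities), which would give 2 degrees of freedom for E_NAIC_Name. A significant F test also indicates only that at least one group mean differs from the others, not that every group differs from every other group.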
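The Df column is a quick check on which groups entered the model: k groups give k - 1 degrees of freedom for the grouping variable. The sketch below uses simulated data with made-up categories ("Type A" to "Type E"), borrowing only the column names from the lab, to show how the degrees of freedom change when the data are restricted to three categories before fitting. A Kruskal-Wallis test is included as a rank-based alternative for skewed data.

```r
set.seed(123)

# Hypothetical data: five categories, of which three are of interest
all_facilities <- data.frame(
  E_NAIC_Name = rep(c("Type A", "Type B", "Type C", "Type D", "Type E"), each = 40),
  CO2_eq      = rlnorm(200, meanlog = rep(c(3, 4, 3.5, 2, 5), each = 40), sdlog = 1)
)

# Keep only the three categories being compared
three_types <- subset(all_facilities,
                      E_NAIC_Name %in% c("Type A", "Type B", "Type C"))

# One-way ANOVA on all categories versus on the three of interest
fit_all   <- aov(CO2_eq ~ E_NAIC_Name, data = all_facilities)
fit_three <- aov(CO2_eq ~ E_NAIC_Name, data = three_types)

summary(fit_all)[[1]]$Df    # 4 and 195: five groups, 200 rows
summary(fit_three)[[1]]$Df  # 2 and 117: three groups, 120 rows

# Rank-based alternative that does not assume normality
kw <- kruskal.test(CO2_eq ~ E_NAIC_Name, data = three_types)
kw$p.value
```

If the overall test is significant, a follow-up comparison of pairs of groups would be needed to say which categories differ.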