Clear the environment and load packages.
rm(list = ls()) #remove all objects from the workspace (files on disk are not touched)
gc() #clear the unused memory
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 523261 28.0 1164838 62.3 660385 35.3
## Vcells 952741 7.3 8388608 64.0 1769491 13.6
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
setwd("/Users/Ryan/OneDrive/Documents/Data Analysis- Sharma")
I chose the Apple + Emissions data set from last week’s Dropbox folders.
The quantitative variables I will be using from the data sets are revenue and emissions. The data needs some manipulation before we are ready to run a regression. The goal is to see whether total emissions can be predicted by revenue.
#Load in the dataset
library(readr)
normalizing_factors <- read_csv("normalizing_factors.csv")
## Rows: 8 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): Fiscal Year, Revenue, Market Capitalization, Employees
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
greenhouse_gas_emissions <- read_csv("greenhouse_gas_emissions.csv")
## Rows: 136 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Category, Type, Scope, Description
## dbl (2): Fiscal Year, Emissions
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Rename variables - R was not liking the name "Fiscal Year" possibly due to the space
names(greenhouse_gas_emissions)[names(greenhouse_gas_emissions) == "Fiscal Year"] <- "fy"
names(normalizing_factors)[names(normalizing_factors) == "Fiscal Year"] <- "fy"
#replace NAs in the emissions column with 0s - was messing with the calculations
greenhouse_gas_emissions <- replace_na(greenhouse_gas_emissions, replace = list(Emissions = 0))
#Create a new variable in the emissions data set that will be the total emissions for that given year (gross - removals)
#sum() only recognizes na.rm; na.omit = TRUE would be treated as an extra value (TRUE = 1) and added to each total
emissions <- greenhouse_gas_emissions %>%
  group_by(fy) %>%
  summarize(total_emissions = sum(Emissions, na.rm = TRUE))
#merge the datasets together by fiscal year
merged_data <- merge(normalizing_factors, emissions, by = "fy", all = TRUE)
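Before fitting anything, it can help to sanity-check the aggregation and the merge. The sketch below uses made-up toy data (not Apple's figures) and base R only; the names `toy_emissions`, `toy_factors`, `toy_totals`, `toy_merged`, and `unmatched` are hypothetical. It illustrates two things: `sum()` only recognizes `na.rm`, so any other named argument is added to the total as a value, and `merge(..., all = TRUE)` keeps years that appear in only one table, filling the missing side with NA.

```r
# Toy data, made up for illustration only
toy_emissions <- data.frame(
  fy = c(2020, 2020, 2021, 2021),
  Emissions = c(100, -20, 90, NA)  # negative value stands in for a removal
)
toy_factors <- data.frame(fy = c(2020, 2021, 2022), Revenue = c(10, 12, 15))

# sum() only has the na.rm argument; other named arguments are summed as values
sum_with_na_rm <- sum(c(1, 2), na.rm = TRUE)      # 3
sum_with_na_omit <- sum(c(1, 2), na.omit = TRUE)  # 4, because TRUE is added as 1

# Replace missing emissions with 0, then total them per fiscal year
toy_emissions$Emissions[is.na(toy_emissions$Emissions)] <- 0
toy_totals <- aggregate(Emissions ~ fy, data = toy_emissions, FUN = sum)
names(toy_totals)[names(toy_totals) == "Emissions"] <- "total_emissions"

# Full outer join: fiscal years present in only one table get NA on the other side
toy_merged <- merge(toy_factors, toy_totals, by = "fy", all = TRUE)

# Rows with any NA are years that did not match in both tables
unmatched <- toy_merged[!complete.cases(toy_merged), ]
print(unmatched)
```

Here fiscal year 2022 has revenue but no emissions total, so it shows up in `unmatched`. Rows like that would need to be dropped or investigated, since `cov()` and `mean()` return NA by default when any input value is missing.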
My independent variable (x) is going to be revenue, and my dependent variable (y) is going to be total greenhouse gas emissions for that fiscal year.
X: Revenue (in millions of dollars).
Y: Total Emissions (in metric tons of CO2).
\(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)
#R linear model function stored in model
model <- lm(data = merged_data, total_emissions ~ Revenue)
print(model)
##
## Call:
## lm(formula = total_emissions ~ Revenue, data = merged_data)
##
## Coefficients:
## (Intercept) Revenue
## 4.298e+07 -5.938e+01
Intercept: about 42,980,000 metric tons of CO2. This is the amount of CO2 that Apple is expected to emit when revenue is 0, according to this linear model.
Slope: -59.38 metric tons of CO2 per million dollars of revenue. For each additional 1 million dollars in revenue, Apple's CO2 emissions are predicted to drop by about 59.38 metric tons.
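With X in millions of dollars and Y in metric tons of CO2, the slope carries units of metric tons per million dollars. As a small worked example, the sketch below plugs the rounded coefficients from the `print(model)` output into the fitted line. The function name `predicted_emissions` and the two revenue values are hypothetical, chosen only to show the arithmetic.

```r
# Rounded coefficients copied from the print(model) output
b0 <- 4.298e7  # intercept, metric tons CO2
b1 <- -59.38   # slope, metric tons CO2 per million dollars of revenue

# Fitted line: predicted emissions for a given revenue (in millions of dollars)
predicted_emissions <- function(revenue_millions) {
  b0 + b1 * revenue_millions
}

# Hypothetical revenues that differ by 1,000 million dollars
change <- predicted_emissions(251000) - predicted_emissions(250000)
print(change)  # approximately -59380, i.e. b1 * 1000
```

A revenue increase of 1,000 million dollars therefore corresponds to a predicted drop of roughly 59,380 metric tons under this model.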
This is not the relationship I expected. I expected emissions to rise with revenue (a positive relationship), but the fitted line shows a negative one: as revenue increases, emissions decrease. It is important to remember that correlation does not imply causation, and there are likely other things going on here. Revenue usually increases over time (as it does in our data), and so does technology. Even as the company's revenue grows, it could be reinvesting some of that money into cleaner energy initiatives. There could also be high baseline operating emissions, with a law of diminishing returns in terms of emissions in play. To fully understand this system, we would need an in-depth study with much more data than the 8 fiscal years available here.
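One way to gauge how much weight a fitted slope can bear is to look at its standard error, p-value, and confidence interval, along with R-squared. The sketch below shows how those can be pulled from an `lm` fit. The data frame `toy` is made up (it only mirrors the sample size of 8), so its numbers say nothing about Apple; the same calls could be applied to `model` from this report.

```r
# Made-up data with 8 observations, for illustration only
toy <- data.frame(
  Revenue = c(180, 215, 230, 265, 260, 275, 365, 395),
  total_emissions = c(38, 30, 28, 25, 26, 23, 23, 20)
)

toy_fit <- lm(total_emissions ~ Revenue, data = toy)
fit_summary <- summary(toy_fit)

# Coefficient table row for the slope: Estimate, Std. Error, t value, Pr(>|t|)
slope_row <- coef(fit_summary)["Revenue", ]
print(slope_row)

# Share of the variance in y explained by the fitted line
r_squared <- fit_summary$r.squared
print(r_squared)

# 95% confidence interval for the slope
slope_ci <- confint(toy_fit, "Revenue", level = 0.95)
print(slope_ci)
```

With so few observations the confidence interval tends to be wide, which is another reason to treat the fitted relationship as descriptive rather than causal.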
#Slope = Cov(y,x)/Var(x)
cov_yx <- cov(merged_data$Revenue, merged_data$total_emissions)
var_x <- var(merged_data$Revenue)
B1 <- cov_yx / var_x
B1
## [1] -59.37907
#Intercept = mean of y - B1 * mean of x
mean_y <- mean(merged_data$total_emissions)
mean_x <- mean(merged_data$Revenue)
B0 <- mean_y - B1*mean_x
B0
## [1] 42977941
As we can see, the manual calculations give the same slope and intercept as the built-in lm() function.
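The agreement can also be checked programmatically rather than by eye. The sketch below repeats the covariance-over-variance calculation on a small made-up data set and compares it with `coef()` from `lm()` using `all.equal()`, which allows for tiny floating-point differences. The vectors `x` and `y` and the names `b1_manual`, `b0_manual`, `fit`, and `matches` are hypothetical.

```r
# Made-up data, for illustration only
x <- c(2, 4, 6, 8, 10)
y <- c(9, 7, 6, 4, 1)

# Manual least-squares estimates
b1_manual <- cov(x, y) / var(x)             # slope = Cov(x, y) / Var(x)
b0_manual <- mean(y) - b1_manual * mean(x)  # intercept = mean(y) - slope * mean(x)

# Built-in fit
fit <- lm(y ~ x)

# Compare (intercept, slope) from lm() with the manual values
matches <- all.equal(unname(coef(fit)), c(b0_manual, b1_manual))
print(matches)  # TRUE
```

For this toy data the slope is -0.95 and the intercept is 11.1 either way.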
#plot the model
plot(merged_data$Revenue, merged_data$total_emissions, main = "Emissions vs Revenue", xlab = "Revenue (millions $)", ylab = "Emissions (metric tons CO2)")
abline(model, col = 'blue')