Introduction

I used the temp_carbon dataset from the dslabs package to create a scatter plot. The dataset records global, land, and ocean annual mean temperature anomalies from 1880 to 2018, and annual carbon emissions in millions of metric tons (MMT) from 1751 to 2014. When the data are plotted, the temperature variables show a roughly linear, positive relationship with carbon emissions. Because the temperature variables have no values before 1880, I use linear regression to make two predictions. The first is the ocean temperature anomaly at the carbon emission levels recorded before 1880. The second is the ocean temperature anomaly at carbon emission levels from 10,000 MMT to 15,000 MMT.

Libraries

Load the required libraries:

library(dslabs)
## Warning: package 'dslabs' was built under R version 4.0.4
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'stringr' was built under R version 4.0.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(broom)
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.0.4

Dataset

The dataset consists of 5 numerical variables and 268 observations.

#Display a preview of six observations in the dataset

head(temp_carbon)
##   year temp_anomaly land_anomaly ocean_anomaly carbon_emissions
## 1 1880        -0.11        -0.48         -0.01              236
## 2 1881        -0.08        -0.40          0.01              243
## 3 1882        -0.10        -0.48          0.00              256
## 4 1883        -0.18        -0.66         -0.04              272
## 5 1884        -0.26        -0.69         -0.14              275
## 6 1885        -0.25        -0.56         -0.17              277
#Display the structure of the dataset 

str(temp_carbon)
## 'data.frame':    268 obs. of  5 variables:
##  $ year            : num  1880 1881 1882 1883 1884 ...
##  $ temp_anomaly    : num  -0.11 -0.08 -0.1 -0.18 -0.26 -0.25 -0.24 -0.28 -0.13 -0.09 ...
##  $ land_anomaly    : num  -0.48 -0.4 -0.48 -0.66 -0.69 -0.56 -0.51 -0.47 -0.41 -0.31 ...
##  $ ocean_anomaly   : num  -0.01 0.01 0 -0.04 -0.14 -0.17 -0.17 -0.23 -0.05 -0.02 ...
##  $ carbon_emissions: num  236 243 256 272 275 277 281 295 327 327 ...

Variable Descriptions

Variable descriptions were copied from the temp_carbon dataset details section.

year: The year an observation was recorded. The years range from 1751 to 2018 in common era (CE).

temp_anomaly: Global annual mean temperature anomaly in degrees Celsius relative to the 20th century mean temperature. Temperatures were recorded from 1880 to 2018 and range from -0.45°C to 0.98°C.

land_anomaly: Annual mean temperature anomaly on land in degrees Celsius relative to the 20th century mean temperature. Temperatures were recorded from 1880 to 2018 and range from -0.69°C to 1.50°C.

ocean_anomaly: Annual mean temperature anomaly over the ocean in degrees Celsius relative to the 20th century mean temperature. Temperatures were recorded from 1880 to 2018 and range from -0.46°C to 0.79°C.

carbon_emissions: Annual carbon emissions in millions of metric tons (MMT) of carbon. Emissions were recorded from 1751 to 2014 and range from 3 MMT to just under 10,000 MMT.

Linear Regression Predictions

Correlation:

The correlation between carbon emissions and ocean temperature is 0.86. This signifies that the two variables have a strong positive correlation.

cor(temp_carbon$carbon_emissions, temp_carbon$ocean_anomaly, use = "complete.obs")
## [1] 0.8648896
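
The `use = "complete.obs"` argument matters here because many rows are missing one of the two variables. As a minimal sketch of what that argument does, the block below drops the incomplete rows by hand and compares the result with the built-in handling. The names `both_present`, `complete_rows`, `r_manual`, and `r_builtin` are illustrative and not part of the original analysis.

```r
library(dslabs)

# Rows where both variables are observed (the rows "complete.obs" keeps)
both_present <- complete.cases(temp_carbon$carbon_emissions,
                               temp_carbon$ocean_anomaly)
complete_rows <- temp_carbon[both_present, ]

# Correlation computed on the pre-filtered rows
r_manual <- cor(complete_rows$carbon_emissions, complete_rows$ocean_anomaly)

# Correlation computed with the built-in NA handling
r_builtin <- cor(temp_carbon$carbon_emissions, temp_carbon$ocean_anomaly,
                 use = "complete.obs")

# Number of rows actually used, and a check that the two approaches agree
sum(both_present)
all.equal(r_manual, r_builtin)
```

Printing `sum(both_present)` shows how many year-level observations the reported correlation actually rests on.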

Visualization of positive linear correlation:

temp_carbon %>%
  ggplot(aes(x = carbon_emissions, y = ocean_anomaly)) +
  #Create scatter plot
  geom_point() +
  #Create linear regression line
  geom_smooth(method = "lm", se = FALSE) +
  #Label X-axis
  xlab("Carbon Emissions (MMT)") +
  #Label Y-axis
  ylab("Ocean Temperature (°C)") +
  #Create title
  ggtitle("Effect of Carbon Emissions on Ocean Annual Mean Temperature") +
  #Apply theme_minimal first so that it does not override the title centering
  theme_minimal() +
  #Center title
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 133 rows containing non-finite values (stat_smooth).
## Warning: Removed 133 rows containing missing values (geom_point).

Linear Regression Model:

Since the data displays a linear correlation, a linear regression model will be helpful to create predictions.

#Create a dataset with observations to remove from the model dataset to improve model metrics

temp_carbon5 <- temp_carbon %>%
  filter(year %in% c(1974,1976, 1975,1903,1971, 1904, 1908, 1909, 1910, 1911, 1940, 1941, 1942,1943,1944,1945))
  

#Conduct anti join to remove the observations from the temp_carbon dataset that match the temp_carbon5 dataset

temp_carbon3 <- temp_carbon %>%
anti_join(temp_carbon5) 
## Joining, by = c("year", "temp_anomaly", "land_anomaly", "ocean_anomaly", "carbon_emissions")
#Create a linear regression model using ocean_anomaly as the response variable and carbon emission as the explanatory variable to predict ocean_anomaly temperatures from carbon emission levels

mdl<-lm(ocean_anomaly ~ carbon_emissions, data = temp_carbon3)

#Plot linear model metrics

autoplot(mdl, 1:3, nrow=3, ncol=1)
## Warning: `arrange_()` was deprecated in dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
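
The removal step above builds a helper dataset and then anti-joins on every column. As a hedged alternative sketch, the same rows can be dropped by joining on `year` alone, or with a single `filter()` call. The names `excluded_years`, `rows_to_drop`, `via_anti_join`, and `via_filter` are illustrative and do not appear in the original code.

```r
library(dslabs)
library(dplyr)

# Years excluded from the model data (same list as in the anti-join above)
excluded_years <- c(1903, 1904, 1908, 1909, 1910, 1911,
                    1940, 1941, 1942, 1943, 1944, 1945,
                    1971, 1974, 1975, 1976)

# Two-step version: collect the rows, then anti-join on the key column only
rows_to_drop <- temp_carbon %>% filter(year %in% excluded_years)
via_anti_join <- temp_carbon %>% anti_join(rows_to_drop, by = "year")

# One-step version: drop the years directly
via_filter <- temp_carbon %>% filter(!year %in% excluded_years)

# Both versions should keep the same rows in the same order
nrow(via_filter)
identical(via_anti_join$year, via_filter$year)
```

Naming the join column explicitly also suppresses the "Joining, by = ..." message and makes it clear that the match is by year.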

Variance explained:

The r.squared value was extracted with the glance function to measure how much of the variance the model explains. About 88 percent of the variation in the ocean_anomaly variable can be explained by variation in carbon emissions.

#Coefficient of determination (R-squared) of the model
mdl%>%
glance()%>%
pull(r.squared)
## [1] 0.8774718
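
To make the meaning of r.squared concrete, the sketch below recomputes it as one minus the ratio of the residual sum of squares to the total sum of squares. It refits the model so that it runs on its own, and it uses `subset()` in place of the anti-join, which is an assumption about equivalence rather than the original code. The names `excluded_years`, `model_data`, `used`, `rss`, `tss`, and `r_squared_manual` are illustrative.

```r
library(dslabs)

# Same excluded years as in the model section (assumed equivalent to the anti-join)
excluded_years <- c(1903, 1904, 1908, 1909, 1910, 1911,
                    1940, 1941, 1942, 1943, 1944, 1945,
                    1971, 1974, 1975, 1976)
model_data <- subset(temp_carbon, !(year %in% excluded_years))
mdl <- lm(ocean_anomaly ~ carbon_emissions, data = model_data)

# Rows lm() actually used (rows with missing values are dropped)
used <- model.frame(mdl)

# Residual and total sums of squares
rss <- sum(residuals(mdl)^2)
tss <- sum((used$ocean_anomaly - mean(used$ocean_anomaly))^2)

# R-squared by hand, compared with the value reported by summary()
r_squared_manual <- 1 - rss / tss
all.equal(r_squared_manual, summary(mdl)$r.squared)

# With a single predictor, R-squared equals the squared correlation
all.equal(r_squared_manual,
          cor(used$carbon_emissions, used$ocean_anomaly)^2)
```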

Tibbles used to conduct predictions:

Three tibbles were created. Two of them hold the carbon emission values for which predictions will be made. The third is used to plot the linear regression line on the graph.

#Create a dataframe with NA values for ocean_anomaly to isolate carbon emission levels without temperature readings.  
t1<- temp_carbon %>%
  #filter for observations with NA values for ocean_anomaly
  filter(is.na(ocean_anomaly)) %>%
  #select the carbon_emissions column
  select(carbon_emissions) %>%
  #isolate unique values
  unique()

#Create a tibble to conduct the first prediction on the carbon_emissions values of the t1 dataframe

explanatory_data1 <-tibble(
  carbon_emissions = t1$carbon_emissions
)

#Create a tibble to conduct the second prediction on a sequence of carbon emission values ranging from 10,000 to 15,000 in steps of 2,500

explanatory_data2 <-tibble(
  carbon_emissions = seq(10000,15000, 2500)
)

#Create a tibble to create linear regression line in plot by subsetting the carbon emission data from the temp_carbon dataset into a separate tibble

explanatory_data3 <-tibble(
  carbon_emissions = temp_carbon$carbon_emissions)

Execute predictions:

The mutate and predict functions are used to generate a column of ocean_anomaly values in the tibbles generated in the previous step. Carbon emission values in the tibbles are being used as an input in the linear regression model to generate the ocean_anomaly values.

#Conduct prediction of ocean_anomaly temperatures using the values in explanatory_data1 with linear regression model

prediction1 <- explanatory_data1  %>%
  mutate(ocean_anomaly = predict(mdl,explanatory_data1))
  
#Conduct prediction of ocean_anomaly temperatures using the values in explanatory_data2 with the linear regression model

prediction2 <- explanatory_data2 %>%
  mutate(ocean_anomaly = predict(mdl,explanatory_data2))


#Conduct prediction of ocean_anomaly temperatures using the values in explanatory_data3 with the linear regression model to plot on the graph

prediction3 <- explanatory_data3 %>%
  mutate(ocean_anomaly = predict(mdl,explanatory_data3))
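
For a model with one explanatory variable, `predict()` evaluates intercept + slope × carbon_emissions. The sketch below checks this by hand for the second set of emission values. It refits the model with `subset()` so that it is self-contained, and the names `excluded_years`, `model_data`, `new_data`, `by_hand`, and `by_predict` are illustrative rather than taken from the original code.

```r
library(dslabs)

# Refit the model on the data without the excluded years
excluded_years <- c(1903, 1904, 1908, 1909, 1910, 1911,
                    1940, 1941, 1942, 1943, 1944, 1945,
                    1971, 1974, 1975, 1976)
model_data <- subset(temp_carbon, !(year %in% excluded_years))
mdl <- lm(ocean_anomaly ~ carbon_emissions, data = model_data)

# Emission levels for the second prediction
new_data <- data.frame(carbon_emissions = seq(10000, 15000, 2500))

# Prediction by hand from the fitted coefficients
intercept <- coef(mdl)[["(Intercept)"]]
slope <- coef(mdl)[["carbon_emissions"]]
by_hand <- intercept + slope * new_data$carbon_emissions

# Prediction with predict(), stripped of its names for comparison
by_predict <- unname(predict(mdl, new_data))
all.equal(by_hand, by_predict)
```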

Graphs

I created a scatter plot using the carbon_emissions variable as the explanatory variable and the ocean_anomaly variable as the response variable. To depict global temperature, the temp_anomaly variable colors the points on a continuous gradient using the inferno palette from viridis. The fitted regression model is drawn as a line, and the predicted points are marked with asterisks.

Prediction 1:

The graph displays the predicted ocean annual mean temperature for the carbon emission levels that have no recorded ocean temperature.

temp_carbon %>%
  ggplot(aes(x = carbon_emissions, y = ocean_anomaly)) +
  #Color the points based on temp_anomaly
  geom_point(aes(color = temp_anomaly)) +
  #Plot the ocean_anomaly points predicted in prediction1
  geom_point(data = prediction1, col = "red", shape = 8, size = 3, alpha = .5) +
  #Plot the model's linear regression line
  geom_smooth(data = prediction3, method = "lm", se = FALSE, size = 0.3) +
  #Display the graph using theme_minimal
  theme_minimal() +
  #Use the inferno palette for temp_anomaly and label the legend
  scale_color_viridis_c(name = "Global Temperature (°C)", option = "inferno") +
  #Label X-axis
  xlab("Carbon Emissions (MMT)") +
  #Label Y-axis
  ylab("Ocean Temperature (°C)") +
  #Label title and caption
  labs(title = "Prediction of Ocean Annual Mean Temperature\nby Annual Carbon Emissions below 300 MMT",
       caption = "Source: NOAA and Boden, T.A., G. Marland, and R.J. Andres (2017) via CDIAC") +
  #Center title
  theme(plot.title = element_text(hjust = 0.5)) +
  #Position the legend at the bottom of the graph
  theme(legend.position = "bottom") +
  #Limit X-axis
  xlim(0, 300) +
  #Limit Y-axis
  ylim(-0.5, 0)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 131 rows containing non-finite values (stat_smooth).
## Warning: Removed 261 rows containing missing values (geom_point).

Prediction 2:

The graph displays the predicted ocean annual mean temperature if carbon emission levels rose to between 10,000 MMT and 15,000 MMT.

temp_carbon %>%
  ggplot(aes(x = carbon_emissions, y = ocean_anomaly)) +
  #Color the points based on temp_anomaly
  geom_point(aes(color = temp_anomaly)) +
  #Plot the ocean_anomaly points predicted in prediction2
  geom_point(data = prediction2, col = "red", shape = 8, size = 3) +
  #Plot the model's linear regression line
  geom_smooth(data = prediction3, method = "lm", se = FALSE) +
  #Display the graph using theme_minimal
  theme_minimal() +
  #Use the inferno palette for temp_anomaly and label the legend
  scale_color_viridis_c(name = "Global Temperature (°C)", option = "inferno") +
  #Label X-axis
  xlab("Carbon Emissions (MMT)") +
  #Label Y-axis
  ylab("Ocean Temperature (°C)") +
  #Label title and caption
  labs(title = "Prediction of Ocean Annual Mean Temperature\nby the Rise of Annual Carbon Emissions",
       caption = "Source: NOAA and Boden, T.A., G. Marland, and R.J. Andres (2017) via CDIAC") +
  #Center title
  theme(plot.title = element_text(hjust = 0.5)) +
  #Position the legend at the bottom of the graph
  theme(legend.position = "bottom")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 133 rows containing missing values (geom_point).
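
The emission levels in this second prediction lie beyond the range the model was fitted on, so the point estimates are extrapolations and the linear relationship may not hold there. One way to attach a rough uncertainty band is the `interval = "prediction"` argument of `predict()`, sketched below. The model is refit with `subset()` so the block runs on its own, and `excluded_years`, `model_data`, `new_data`, and `with_interval` are illustrative names that are not in the original code.

```r
library(dslabs)

# Refit the model on the data without the excluded years
excluded_years <- c(1903, 1904, 1908, 1909, 1910, 1911,
                    1940, 1941, 1942, 1943, 1944, 1945,
                    1971, 1974, 1975, 1976)
model_data <- subset(temp_carbon, !(year %in% excluded_years))
mdl <- lm(ocean_anomaly ~ carbon_emissions, data = model_data)

# Emission levels for the second prediction
new_data <- data.frame(carbon_emissions = seq(10000, 15000, 2500))

# Point predictions plus 95% prediction intervals (columns fit, lwr, upr)
with_interval <- predict(mdl, new_data, interval = "prediction", level = 0.95)
cbind(new_data, with_interval)
```

These intervals assume the fitted straight line remains valid at the new emission levels, so they should be read as a lower bound on the real uncertainty.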

Source

NOAA and Boden, T.A., G. Marland, and R.J. Andres (2017) via CDIAC