Module 8: Regression and the Line of Best Fit

Lab 8: Modeling Carbon Footprints in Japan

Click here to run the code for this lab yourself on RStudio.Cloud!

This lab investigates trends in carbon emissions in Japanese municipalities, from 2005 to 2017. Japan held the landmark Kyoto Protocol in 1997, which started international commitments to reduce greenhouse gas emissions. How much have Japanese cities reduced their carbon footprint, and which ones have succeeded the most? In this lab, we hypothesize that each passing year has led to a statistically significant reduction in cities’ carbon footprints. Let’s test that hypothesis!

0. Import Data

Load Data

In this dataset, each row is a city-year! We’ll calculate the carbon footprint for each city, in kilotons of emissions per 1000 residents, and we’ll zoom into just municipalities with at least 1 resident.

# Load packages
library(tidyverse)
library(viridis)
library(broom)
library(moderndive)

# Import data
jp <- read_csv("jp_emissions.csv") %>%
     # to get the carbon footprint of each city,
     mutate(footprint = emissions / pop * 1000) %>%
     # Zoom into just cities where anyone lives
     filter(pop >= 1) %>%
     # Grab just these variables
     select(muni, footprint, year, pop)

View Data

# View first 3 rows of dataset
jp %>% head(3)
muni footprint year pop
Hokkaido Sapporo-shi 7.829075 2017 1952356
Hokkaido Sapporo-shi 7.774304 2016 1952356
Hokkaido Sapporo-shi 7.765927 2015 1952356

Codebook

In this dataset, our variables mean:

  • muni: Name of Municipality, preceded by the prefecture where it is located (eg. Hokkaido [Prefecture] Sapporo-shi [City]).

  • footprint: city’s carbon footprint, adjusted for population size. The average kilotons of carbon emissions 1000 residents produce in a year.

  • year: year of observation (numeric).

  • pop: number of residents in that municipality, as of most recent census.

Descriptives

  • The average Japanese city between 2005 and 2017 had a carbon footprint of 11 kilotons of emissions per 1000 residents, plus or minus a standard deviation of 20.7.
  • The smallest carbon footprint was 2.9 kilotons per 1000 residents.
  • The median city had a carbon footprint of 8.2 kilotons per 1000 residents.
  • The largest carbon footprint observed was 931.3 kilotons per 1000 residents.


Task 1: Average Carbon Footprints

Question

First, please calculate the average carbon footprint for Japanese cities in each year, using group_by(), and save it as a data.frame called avg.

Now visualize the average carbon footprints for Japanese cities over time, using a scatterplot of your data.frame avg. Plot the line of best fit! What trend do we observe, on average?

Answer

avg <- jp %>%
    group_by(year) %>%
    summarize(footprint = mean (footprint, na.rm = TRUE))
# View outcome
avg
year footprint
2005 10.13
2007 10.63
2008 10.06
2009 9.54
2010 9.93
2011 11.79
2012 12.28
2013 12.30
2014 11.82
2015 11.47
2016 11.38
2017 11.18
avg %>%
  ggplot(mapping = aes(x = year, y = footprint)) +
  geom_line(size = 1) + # adding the geom_line is optional!
  geom_point(size = 5, shape = 21, fill = "white", color = "black") +
  geom_smooth(method = "lm", se = FALSE) + # adds line of best fit
  theme_classic(base_size = 28) +
  labs(y = "Average Emissions\nper 1000 residents",
       x = "Year", subtitle = "Effect of Time on Emissions")


Task 2: Modeling

Model Equation

Question

Now, calculate the model equation for that line of best fit, using your dataframe avg and the lm() function. What rate of emissions was an average Japanese city projected to produce in the year 0 CE? How much does the average carbon footprint increase with every passing year, according to your model’s beta coeficient?

Answer

m <- avg %>%
  lm(formula = footprint ~ year)

m
## 
## Call:
## lm(formula = footprint ~ year, data = .)
## 
## Coefficients:
## (Intercept)         year  
##   -307.2731       0.1583

The average Japanese city was projected to produce -307.27 kilotons of carbon emissions per 1000 residents in the year CE. With every passing year, the average carbon footprint increases by 0.16 kilotons per 1000 residents.

Model Table

Question

Examine the model table, using summary() or get_regression_table(). How likely is it that these two statistics were just that extreme by chance?

Answer

m %>%
  get_regression_table()
term estimate std_error statistic p_value lower_ci upper_ci
intercept -307.273 127.221 -2.415 0.036 -590.740 -23.806
year 0.158 0.063 2.502 0.031 0.017 0.299

The alpha coefficient, -307.27, was so extreme that there is just a 0.036 probability (3.6% chance) that this statistic occurred due to chance. The effect of each year (beta = 0.16) was so extreme that there is just a 0.031 probability (3.1% chance) that this statistic occurred due to chance. Both effects are therefore statistical significant at the p < 0.05 level.

Model Fit

Question

Evaluate this model, using summary() or glance(). What percentage of variation in average carbon footprints does it explain? Does it fit better than an intercept model? Is this model’s F statistic statistically significant? How do you know?

Answer

glance(m)
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.3850064 0.3235071 0.7872393 6.260333 0.0313357 1 -13.06266 32.12532 33.58003 6.197458 10 12

This model explained 39% of the variation in average carbon footprints (\(R^{2}\) = 0.39). It fits much better than an intercept model, according to its F-statistic value of ``6.26. In fact, there’s just a 0.031 probability (3.1% chance) that this statistic occurred due to chance, so we can be at least 96.9% confident that this statistic did not occur due to chance. This model has a statistically significant fit.




Task 3: Top Cities

Question

Using filter() and arrange(), find the names of the 3 municipalities with the largest population in Japan in 2017.

Then, for those 3 largest municipalities in Japan, visualize the relationship between year and carbon footprints, adding a separate line of best fit for each city. You can do this using facet_wrap(), aes(color = ...), or both! How do these trends compare with the average city in Japan?

Answer

jp %>%
  filter(year == 2017) %>%
  arrange(desc(pop)) %>%
  head(3)
# write down names of top 3 cities
muni footprint year pop
Kanagawa-ken Yokohama-shi 5.9 2017 3724844
Osaka-fu Osaka-shi 7.3 2017 2691185
Aichi-ken Nagoya-shi 6.7 2017 2295638
jp %>%
  # Filter to top 3 cities
  filter(muni %in% c("Kanagawa-ken Yokohama-shi", 
                     "Osaka-fu Osaka-shi",
                     "Aichi-ken Nagoya-shi") ) %>%
  # plot them
  ggplot(mapping = aes(x = year, y = footprint,
                       color = muni)) +
  geom_point() +
  facet_wrap(~muni) +
  # Add line of best fit!
  geom_smooth(method = "lm", se = FALSE) +
  # if you every want to get rid of a legend, 
  # just say 'attribute' = "none"
  guides(color = "none") +
  theme_classic() +
  labs(x = "Year", y = "Carbon Emissions\nper 1000 residents")

Looks like Nagoya has sharply reduced their emissions over time, while Yokohama and Osaka have seen much variation. Though Yokohama and Osaka reduced their emissions recently in 2015, it’s just a recent trend; the line of best fit still predicts increases in emissions based on their past.



Task 4: Compare

Question

Now, filtering to each of these 3 municipalities separately, please calculate the model equation for your lines of best fit. (You’ll need to use filter() and lm() 3 times. You can’t use group_by() for this!) Report the city name, alpha coefficient, beta coefficients, and R-squared for each model in the table below. For Alpha, Beta, and the F-statistic, please report the p-value too in parentheses.

What story does this table tell? In which city did each passing year decrease its carbon footprint the most? (See beta). Which city’s emissions are best predicted by these models? Which city’s emissions are poorly captured by these models?

Answer

m1 <- jp %>%
  filter(muni == "Kanagawa-ken Yokohama-shi") %>%
  lm(formula = footprint ~ year)

m1 %>% summary()
m2 <- jp %>%
  filter(muni == "Osaka-fu Osaka-shi") %>%
  lm(formula = footprint ~ year)

m2 %>% summary()
m3 <- jp %>%
  filter(muni == "Aichi-ken Nagoya-shi") %>%
  lm(formula = footprint ~ year)
m3 %>% summary()
Measure Yokohama Osaka Nagoya
Alpha: (Intercept) -54.57
(0.286)
-175.59
(0.102)
140.50
(0.009)
Beta: year 0.03
(0.240)
0.09
(0.090)
-0.07
(0.009)
R2 0.14 0.26 0.51
F statistic 1.56
(0.239)
3.51
(0.090)
10.33
(0.009)

Nagoya decreased its emissions the most each year, at -0.07 fewer kilotons of emissions per 1000 residents each year. That effect was statistically significant (p < 0.01). In contrast, Osaka increased its carbon footprint at a rate of 0.09 kilotons of emissions per 1000 residents each year. That effect was somewhat statistically significant (p < 0.10). Finally, Yokohama increased its emissions slightly over time, at 0.03 kilotons per 1000 residents a year. However, that effect was not statistically significant (p > 0.240). The best fitting model was for Nagoya, as it explained 51% of the variation in emissions rates.


Conclusion

That’s it! You made it! We have seen that though cities on average saw increases in emissions since 2005, recent changes make it hard to fit a perfect model. In fact, some cities like Nagoya have robustly decreased their emissions, while others like Yokohama and Osaka have seen great increases in emissions. Hopefully this was a good demonstration of the power and accessibility of using the line of best fit to answer our questions about cities over time.