Module 8: Regression and the Line of Best Fit
Lab 8: Modeling Carbon Footprints in Japan
This lab investigates trends in carbon emissions in Japanese municipalities, from 2005 to 2017. Japan held the landmark Kyoto Protocol in 1997, which started international commitments to reduce greenhouse gas emissions. How much have Japanese cities reduced their carbon footprint, and which ones have succeeded the most? In this lab, we hypothesize that each passing year has led to a statistically significant reduction in cities’ carbon footprints. Let’s test that hypothesis!
0. Import Data
Load Data
In this dataset, each row is a city-year! We’ll calculate the carbon footprint for each city, in kilotons of emissions per 1000 residents, and we’ll zoom into just municipalities with at least 1 resident.
# Load packages
library(tidyverse)
library(viridis)
library(broom)
library(moderndive)
# Import data
jp <- read_csv("jp_emissions.csv") %>%
# to get the carbon footprint of each city,
mutate(footprint = emissions / pop * 1000) %>%
# Zoom into just cities where anyone lives
filter(pop >= 1) %>%
# Grab just these variables
select(muni, footprint, year, pop)View Data
# View first 3 rows of dataset
jp %>% head(3)| muni | footprint | year | pop |
|---|---|---|---|
| Hokkaido Sapporo-shi | 7.829075 | 2017 | 1952356 |
| Hokkaido Sapporo-shi | 7.774304 | 2016 | 1952356 |
| Hokkaido Sapporo-shi | 7.765927 | 2015 | 1952356 |
Codebook
In this dataset, our variables mean:
muni: Name of Municipality, preceded by the prefecture where it is located (eg. Hokkaido [Prefecture] Sapporo-shi [City]).footprint: city’s carbon footprint, adjusted for population size. The average kilotons of carbon emissions 1000 residents produce in a year.year: year of observation (numeric).pop: number of residents in that municipality, as of most recent census.
Descriptives
- The average Japanese city between 2005 and 2017 had a carbon footprint of 11 kilotons of emissions per 1000 residents, plus or minus a standard deviation of 20.7.
- The smallest carbon footprint was 2.9 kilotons per 1000 residents.
- The median city had a carbon footprint of 8.2 kilotons per 1000 residents.
- The largest carbon footprint observed was 931.3 kilotons per 1000 residents.
Task 1: Average Carbon Footprints
Question
First, please calculate the average carbon footprint for Japanese cities in each year, using group_by(), and save it as a data.frame called avg.
Now visualize the average carbon footprints for Japanese cities over time, using a scatterplot of your data.frame avg. Plot the line of best fit! What trend do we observe, on average?
Answer
avg <- jp %>%
group_by(year) %>%
summarize(footprint = mean (footprint, na.rm = TRUE))# View outcome
avg| year | footprint |
|---|---|
| 2005 | 10.13 |
| 2007 | 10.63 |
| 2008 | 10.06 |
| 2009 | 9.54 |
| 2010 | 9.93 |
| 2011 | 11.79 |
| 2012 | 12.28 |
| 2013 | 12.30 |
| 2014 | 11.82 |
| 2015 | 11.47 |
| 2016 | 11.38 |
| 2017 | 11.18 |
avg %>%
ggplot(mapping = aes(x = year, y = footprint)) +
geom_line(size = 1) + # adding the geom_line is optional!
geom_point(size = 5, shape = 21, fill = "white", color = "black") +
geom_smooth(method = "lm", se = FALSE) + # adds line of best fit
theme_classic(base_size = 28) +
labs(y = "Average Emissions\nper 1000 residents",
x = "Year", subtitle = "Effect of Time on Emissions")Task 2: Modeling
Model Equation
Question
Now, calculate the model equation for that line of best fit, using your dataframe avg and the lm() function. What rate of emissions was an average Japanese city projected to produce in the year 0 CE? How much does the average carbon footprint increase with every passing year, according to your model’s beta coeficient?
Answer
m <- avg %>%
lm(formula = footprint ~ year)
m##
## Call:
## lm(formula = footprint ~ year, data = .)
##
## Coefficients:
## (Intercept) year
## -307.2731 0.1583
The average Japanese city was projected to produce -307.27 kilotons of carbon emissions per 1000 residents in the year CE. With every passing year, the average carbon footprint increases by 0.16 kilotons per 1000 residents.
Model Table
Question
Examine the model table, using summary() or get_regression_table(). How likely is it that these two statistics were just that extreme by chance?
Answer
m %>%
get_regression_table()| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | -307.273 | 127.221 | -2.415 | 0.036 | -590.740 | -23.806 |
| year | 0.158 | 0.063 | 2.502 | 0.031 | 0.017 | 0.299 |
The alpha coefficient, -307.27, was so extreme that there is just a 0.036 probability (3.6% chance) that this statistic occurred due to chance. The effect of each year (beta = 0.16) was so extreme that there is just a 0.031 probability (3.1% chance) that this statistic occurred due to chance. Both effects are therefore statistical significant at the p < 0.05 level.
Model Fit
Question
Evaluate this model, using summary() or glance(). What percentage of variation in average carbon footprints does it explain? Does it fit better than an intercept model? Is this model’s F statistic statistically significant? How do you know?
Answer
glance(m)| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.3850064 | 0.3235071 | 0.7872393 | 6.260333 | 0.0313357 | 1 | -13.06266 | 32.12532 | 33.58003 | 6.197458 | 10 | 12 |
This model explained 39% of the variation in average carbon footprints (\(R^{2}\) = 0.39). It fits much better than an intercept model, according to its F-statistic value of ``6.26. In fact, there’s just a 0.031 probability (3.1% chance) that this statistic occurred due to chance, so we can be at least 96.9% confident that this statistic did not occur due to chance. This model has a statistically significant fit.
Task 3: Top Cities
Question
Using filter() and arrange(), find the names of the 3 municipalities with the largest population in Japan in 2017.
Then, for those 3 largest municipalities in Japan, visualize the relationship between year and carbon footprints, adding a separate line of best fit for each city. You can do this using facet_wrap(), aes(color = ...), or both! How do these trends compare with the average city in Japan?
Answer
jp %>%
filter(year == 2017) %>%
arrange(desc(pop)) %>%
head(3)
# write down names of top 3 cities| muni | footprint | year | pop |
|---|---|---|---|
| Kanagawa-ken Yokohama-shi | 5.9 | 2017 | 3724844 |
| Osaka-fu Osaka-shi | 7.3 | 2017 | 2691185 |
| Aichi-ken Nagoya-shi | 6.7 | 2017 | 2295638 |
jp %>%
# Filter to top 3 cities
filter(muni %in% c("Kanagawa-ken Yokohama-shi",
"Osaka-fu Osaka-shi",
"Aichi-ken Nagoya-shi") ) %>%
# plot them
ggplot(mapping = aes(x = year, y = footprint,
color = muni)) +
geom_point() +
facet_wrap(~muni) +
# Add line of best fit!
geom_smooth(method = "lm", se = FALSE) +
# if you every want to get rid of a legend,
# just say 'attribute' = "none"
guides(color = "none") +
theme_classic() +
labs(x = "Year", y = "Carbon Emissions\nper 1000 residents")Looks like Nagoya has sharply reduced their emissions over time, while Yokohama and Osaka have seen much variation. Though Yokohama and Osaka reduced their emissions recently in 2015, it’s just a recent trend; the line of best fit still predicts increases in emissions based on their past.
Task 4: Compare
Question
Now, filtering to each of these 3 municipalities separately, please calculate the model equation for your lines of best fit. (You’ll need to use filter() and lm() 3 times. You can’t use group_by() for this!) Report the city name, alpha coefficient, beta coefficients, and R-squared for each model in the table below. For Alpha, Beta, and the F-statistic, please report the p-value too in parentheses.
What story does this table tell? In which city did each passing year decrease its carbon footprint the most? (See beta). Which city’s emissions are best predicted by these models? Which city’s emissions are poorly captured by these models?
Answer
m1 <- jp %>%
filter(muni == "Kanagawa-ken Yokohama-shi") %>%
lm(formula = footprint ~ year)
m1 %>% summary()m2 <- jp %>%
filter(muni == "Osaka-fu Osaka-shi") %>%
lm(formula = footprint ~ year)
m2 %>% summary()m3 <- jp %>%
filter(muni == "Aichi-ken Nagoya-shi") %>%
lm(formula = footprint ~ year)
m3 %>% summary()| Measure | Yokohama | Osaka | Nagoya |
|---|---|---|---|
| Alpha: (Intercept) | -54.57 (0.286) |
-175.59 (0.102) |
140.50 (0.009) |
| Beta: year | 0.03 (0.240) |
0.09 (0.090) |
-0.07 (0.009) |
| R2 | 0.14 | 0.26 | 0.51 |
| F statistic | 1.56 (0.239) |
3.51 (0.090) |
10.33 (0.009) |
Nagoya decreased its emissions the most each year, at -0.07 fewer kilotons of emissions per 1000 residents each year. That effect was statistically significant (p < 0.01). In contrast, Osaka increased its carbon footprint at a rate of 0.09 kilotons of emissions per 1000 residents each year. That effect was somewhat statistically significant (p < 0.10). Finally, Yokohama increased its emissions slightly over time, at 0.03 kilotons per 1000 residents a year. However, that effect was not statistically significant (p > 0.240). The best fitting model was for Nagoya, as it explained 51% of the variation in emissions rates.
Conclusion
That’s it! You made it! We have seen that though cities on average saw increases in emissions since 2005, recent changes make it hard to fit a perfect model. In fact, some cities like Nagoya have robustly decreased their emissions, while others like Yokohama and Osaka have seen great increases in emissions. Hopefully this was a good demonstration of the power and accessibility of using the line of best fit to answer our questions about cities over time.