For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.
Your group will have three goals:
To begin, choose a previous lab notebook from weeks 6 through 11.
First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.
You do not need to solve the problem; you only need to define it.
Your scenario should include the following:
<some variable> will be identified.”
The primary audience for this analysis is international environmental policy organizations and government climate agencies. These stakeholders rely on data-driven insights to guide environmental regulations, sustainability investments, and emissions reduction strategies.
Governments and environmental agencies need to determine whether increasing renewable energy adoption meaningfully reduces CO2 emissions per capita, in order to justify large-scale investments in renewable infrastructure. Specifically, the goal is to evaluate whether renewable energy usage and related environmental variables can reliably predict CO2 emissions levels across countries.
We will use CO2 emissions (tons per capita), renewable energy percentage, forest area, sea level rise (mm), and rainfall (mm).
We’ll fit a linear regression, check for multicollinearity, and evaluate model diagnostics.
Assumptions: relationships between predictors and response are approximately linear, observations are independent across countries, data is measured consistently across all entries, environmental variables may indirectly relate to emissions.
The objective of this analysis is to determine whether environmental and energy-related variables significantly explain variation in CO2 emissions per capita. Success is defined by identifying statistically significant predictors and evaluating whether the model provides reliable explanatory power for emissions trends.
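One way to make these success criteria concrete is to read them directly off a fitted `lm` object. The sketch below is illustrative only: `df_sim`, `x1`, `x2`, and `y` are simulated stand-ins (not columns from the lab dataset), and the 0.05 cutoff is an assumed convention rather than a requirement of the assignment.

```r
# Simulated stand-in data (hypothetical, not the lab's dataset)
set.seed(2024)
n <- 150
df_sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df_sim$y <- 1 + 0.8 * df_sim$x1 + rnorm(n)  # only x1 is built into y

fit <- lm(y ~ x1 + x2, data = df_sim)
fit_summary <- summary(fit)

# Per-predictor p-values, dropping the intercept row
coef_table <- coef(fit_summary)
p_values <- coef_table[-1, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]  # assumed 5% threshold

# Overall F-test p-value, rebuilt from the stored F statistic
f <- fit_summary$fstatistic
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Collect the "success" measures in one place
scorecard <- list(
  significant = significant,
  adj_r2      = fit_summary$adj.r.squared,
  overall_p   = overall_p
)
print(scorecard)
```

To apply the same idea to the lab data, the formula and `data` argument would be swapped for the lab's own model and `df_main`.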
Since this is a class, and not a workplace, we need to be careful not to present information to you too quickly or all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … Now that you have a more informed knowledge of statistics, your goal is to:
Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)
In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).
You’ll want to consider the following:
Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(performance)
df_main <- read.csv("climate_change_dataset.csv")
#recreate base model
base_model <- df_main |>
lm(`CO2.Emissions..Tons.Capita.` ~ `Renewable.Energy....`, data = _)
summary(base_model)
##
## Call:
## lm(formula = CO2.Emissions..Tons.Capita. ~ Renewable.Energy....,
## data = df_main)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1280 -4.8633 0.2287 4.9185 9.7632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.70180 0.41400 25.850 <2e-16 ***
## Renewable.Energy.... -0.01011 0.01370 -0.738 0.461
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.616 on 998 degrees of freedom
## Multiple R-squared: 0.0005455, Adjusted R-squared: -0.000456
## F-statistic: 0.5447 on 1 and 998 DF, p-value: 0.4607
#visualize relationship
df_main |>
ggplot(aes(x = Renewable.Energy....,
y = `CO2.Emissions..Tons.Capita.`)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
the model uses only one predictor, limiting explanatory power (R-squared ≈ 0.0005)
ignores potential confounding variables
no validation of model assumptions
the renewable energy coefficient is not statistically significant (p = 0.461), which limits practical usefulness
Improved analysis 1: multiple regression with additional predictors
#improved model with additional predictors
improved_model <- df_main |>
lm(`CO2.Emissions..Tons.Capita.` ~
`Renewable.Energy....` +
`Forest.Area....` +
`Sea.Level.Rise..mm.` +
Rainfall..mm.,
data = _)
summary(improved_model)
##
## Call:
## lm(formula = CO2.Emissions..Tons.Capita. ~ Renewable.Energy.... +
## Forest.Area.... + Sea.Level.Rise..mm. + Rainfall..mm., data = df_main)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3538 -4.8915 0.1864 4.9212 10.0051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.6822007 0.8683426 12.302 <2e-16 ***
## Renewable.Energy.... -0.0097459 0.0137048 -0.711 0.477
## Forest.Area.... 0.0095411 0.0102227 0.933 0.351
## Sea.Level.Rise..mm. -0.1870571 0.1551714 -1.205 0.228
## Rainfall..mm. 0.0001067 0.0002508 0.425 0.671
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.617 on 995 degrees of freedom
## Multiple R-squared: 0.003115, Adjusted R-squared: -0.0008924
## F-statistic: 0.7773 on 4 and 995 DF, p-value: 0.54
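INSIGHT:
all predictors have high p-values (> 0.05)
no predictor shows a statistically significant relationship with CO2 emissions, and the overall F-test is also non-significant (p = 0.54)
adjusted R-squared is slightly negative (-0.0009), which suggests the model may be underspecified or inappropriate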
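A complementary way to judge whether the added predictors earn their place is a partial F-test between the nested models, alongside a comparison of adjusted R-squared. The following is a minimal sketch on simulated data; `df_sim` and its columns (`renewable`, `forest`, `rainfall`, `co2`) are hypothetical stand-ins, and the response is deliberately generated independently of the predictors.

```r
# Simulated stand-in for df_main (hypothetical data, not the lab's dataset)
set.seed(42)
n <- 200
df_sim <- data.frame(
  renewable = runif(n, 5, 50),
  forest    = runif(n, 10, 70),
  rainfall  = runif(n, 500, 3000)
)
# Response drawn independently of every predictor
df_sim$co2 <- runif(n, 0.5, 20)

small_fit <- lm(co2 ~ renewable, data = df_sim)
large_fit <- lm(co2 ~ renewable + forest + rainfall, data = df_sim)

# Partial F-test: do the extra terms reduce residual error more than chance would?
nested_test <- anova(small_fit, large_fit)
print(nested_test)

# Adjusted R-squared penalizes the extra terms, unlike plain R-squared
adj_r2 <- c(
  small = summary(small_fit)$adj.r.squared,
  large = summary(large_fit)$adj.r.squared
)
print(adj_r2)
```

With the lab objects, the equivalent call would be `anova(base_model, improved_model)`, since the base model is nested inside the improved one.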
Improved analysis 2: multicollinearity check
#check multicollinearity
improved_model |>
check_collinearity()
## # Check for Multicollinearity
##
## Low Correlation
##
## Term VIF VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
## Renewable.Energy.... 1.00 [1.00, Inf] 1.00 1.00 [0.00, 1.00]
## Forest.Area.... 1.00 [1.00, 4.37e+13] 1.00 1.00 [0.00, 1.00]
## Sea.Level.Rise..mm. 1.00 [1.00, Inf] 1.00 1.00 [0.00, 1.00]
## Rainfall..mm. 1.00 [1.00, Inf] 1.00 1.00 [0.00, 1.00]
INSIGHT:
all VIF values ≈ 1
confirms low multicollinearity
predictors are not redundant, but they are still not informative
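To see where a VIF near 1 comes from, the classical definition can be computed by hand: regress each predictor on the others and take 1 / (1 - R²). The sketch below uses simulated predictors (`x1`, `x2`, `x3` are hypothetical, with `x3` deliberately built from `x1`) and base R only; it is meant to illustrate the definition, not to reproduce the internals of `check_collinearity()`.

```r
# Simulated predictors (hypothetical); x3 is deliberately correlated with x1
set.seed(1)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + rnorm(n, sd = 0.3)
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all remaining predictors
vif_by_hand <- sapply(names(predictors), function(term) {
  others  <- setdiff(names(predictors), term)
  aux_fit <- lm(reformulate(others, response = term), data = predictors)
  1 / (1 - summary(aux_fit)$r.squared)
})
print(round(vif_by_hand, 2))
```

Here `x1` and `x3` should show inflated values while `x2` stays close to 1, which is the pattern the lab's predictors all share.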
Improved analysis 3: model diagnostics
#diagnostic plots
improved_model |>
check_model()
residuals vs. fitted values look approximately linear, with slight curvature
variance appears mostly constant → only weak evidence of heteroscedasticity
Q-Q plot shows non-normal tails → mild violation of the normality assumption
no strong outliers or influential points
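The visual readings above could be backed up with numeric checks. The following is a rough sketch in base R on simulated data (`df_sim`, `x`, and `y` are hypothetical): a Shapiro-Wilk test for residual normality, a studentized Breusch-Pagan-style statistic built from an auxiliary regression of squared residuals, and Cook's distance with the common 4/n rule of thumb (an assumed cutoff, not a strict rule).

```r
# Simulated stand-in data (hypothetical, not the lab's dataset)
set.seed(7)
n <- 250
df_sim <- data.frame(x = runif(n, 0, 10))
df_sim$y <- 2 + 0.5 * df_sim$x + rnorm(n)
fit <- lm(y ~ x, data = df_sim)

# Normality of residuals (shapiro.test accepts 3 to 5000 observations)
normality <- shapiro.test(residuals(fit))

# Heteroscedasticity: regress squared residuals on the predictor;
# n * R^2 is compared to a chi-square with df = number of predictors
df_sim$sq_resid <- residuals(fit)^2
aux_fit <- lm(sq_resid ~ x, data = df_sim)
bp_stat <- n * summary(aux_fit)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# Influence: flag observations whose Cook's distance exceeds 4 / n
cooks   <- cooks.distance(fit)
flagged <- which(cooks > 4 / n)

print(normality$p.value)
print(bp_p)
print(length(flagged))
```

For the lab's four-predictor model, the auxiliary regression would include all four predictors and the chi-square degrees of freedom would be 4.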
Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes lecture slides, the lecture video, or the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:
For example, in Weeks 10-11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:
One major concern with this analysis is the presence of bias and data limitations. The dataset may overrepresent certain regions while underrepresenting others, and different countries may follow inconsistent standards when measuring environmental variables. This lack of uniformity can introduce bias and reduce the reliability of conclusions drawn from the model.
Another critical issue is the absence of important variables. CO2 emissions are heavily influenced by factors such as industrialization, GDP, population density, and government policy. Since these are not included in the dataset, the model provides an incomplete picture, limiting its explanatory power and practical usefulness.
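The omitted-variable concern can be illustrated with a small simulation. Everything here is hypothetical: `gdp` stands in for an unobserved confounder, and the coefficients (0.6, -0.5, 1.5) are invented purely to show the mechanism, not estimated from any real data.

```r
# Hypothetical simulation of omitted-variable bias
set.seed(123)
n <- 2000
gdp       <- rnorm(n)                              # unobserved confounder (standardized)
renewable <- 0.6 * gdp + rnorm(n)                  # renewable share rises with GDP
co2       <- -0.5 * renewable + 1.5 * gdp + rnorm(n)  # built-in renewable effect: -0.5

# Model that omits the confounder vs. model that includes it
naive_fit    <- lm(co2 ~ renewable)
adjusted_fit <- lm(co2 ~ renewable + gdp)

coef_naive    <- coef(naive_fit)["renewable"]
coef_adjusted <- coef(adjusted_fit)["renewable"]
print(c(naive = unname(coef_naive), adjusted = unname(coef_adjusted)))
```

Because `gdp` pushes both `renewable` and `co2` upward in this setup, leaving it out biases the renewable coefficient upward, while the adjusted model should recover a value near the built-in -0.5. A dataset lacking GDP or industrialization measures could distort the renewable energy estimate in a similar way.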
From a societal perspective, flawed or incomplete models could lead to poor decision-making. Policies based on weak evidence might result in misallocation of resources, unfair targeting of specific countries, or delays in implementing effective climate strategies. These consequences could have real-world environmental and economic impacts.
Finally, there is an epistemological concern regarding the assumptions underlying this analysis. The model assumes that environmental indicators alone can explain CO2 emissions, which reflects an incomplete understanding of the broader system. In reality, emissions are driven by a complex interaction of economic, political, and technological factors. Without incorporating these dimensions, the conclusions remain limited and potentially misleading.
Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.