DATA 316: Final Project

Predicting thermostat setpoints in the Midwest during winter.

This project investigates the influence of household-level and climate-level factors on winter thermostat settings in the Midwestern United States, employing predictive modeling to improve understanding of residential energy demand. Given that space and water heating are the leading sources of greenhouse gas emissions in the residential sector, identifying key predictors of thermostat preferences is critical. The findings of this analysis can inform utility planning, public education campaigns on energy conservation, and the development of more effective grid management strategies.

The analysis is based on data from the 2020 Residential Energy Consumption Survey (RECS), conducted by the U.S. Energy Information Administration (EIA). RECS is a nationally representative survey that collects detailed information on the energy-related characteristics of primary residences and the households that occupy them. The 2020 dataset includes responses from approximately 18,500 households, representing an estimated 123.5 million occupied housing units across the United States.

Defining variables

The RECS dataset contains over 800 variables, including both directly collected and derived data. It provides a dense foundation for multilevel modeling by incorporating thermostat behaviors, household demographics, housing characteristics, and weather patterns. This project focuses on variables related to home characteristics and climate to uncover how both physical and contextual factors shape indoor temperature preferences.

Response variable: thermostat_setting

Values for thermostat_setting range from 50°F to 90°F, with -2 indicating “Not applicable.”
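Because -2 is a code rather than a temperature, it has to be treated as missing before any summary or model is computed. A minimal sketch of that step, using a small made-up data frame (`toy` and its values are illustrative assumptions, not RECS records):

```r
# Toy data standing in for the RECS extract (values are made up)
toy <- data.frame(thermostat_setting = c(68, 70, -2, 72, 65))

# Recode the "Not applicable" code (-2) to NA so it is not read as a temperature
toy$thermostat_setting[toy$thermostat_setting == -2] <- NA

# Summaries must then skip the missing value explicitly
mean_setting <- mean(toy$thermostat_setting, na.rm = TRUE)
mean_setting
```

Left as -2, the coded value would pull the mean down and appear as an extreme low outlier in the histogram and boxplot.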

Predictor variables:

To predict thermostat setpoints, this analysis incorporates a combination of variables representing housing characteristics and regional weather conditions. Household demographics and physical features of the home are key to understanding temperature preferences at the individual level, while broader climate patterns help account for regional variation. For instance, differences in average winter temperatures between southern Missouri and northern Minnesota may lead to distinct thermostat behaviors, even among similar household types.

Level 1: Household and Housing Characteristics

household_race: Categorical. Householder (respondent) race.
- Possible values: (1) White Alone, (2) Black or African/American Alone, (3) American Indian or Alaska Native Alone, (4) Asian Alone, (5) Native Hawaiian or Other Pacific Islander Alone, (6) 2 or More Races Selected
annual_income: Categorical. Annual gross household income for the past year.
- Possible values: (1) Less than $5,000, (2) $5,000 - $7,499, (3) $7,500 - $9,999, (4) $10,000 - $12,499, (5) $12,500 - $14,999, (6) $15,000 - $19,999, (7) $20,000 - $24,999, (8) $25,000 - $29,999, (9) $30,000 - $34,999, (10) $35,000 - $39,999, (11) $40,000 - $49,999, (12) $50,000 - $59,999, (13) $60,000 - $74,999, (14) $75,000 - $99,999, (15) $100,000 - $149,999, (16) $150,000 or more
education_level: Categorical. Education level of respondent.
- Possible values: (1) Less than high school diploma or GED, (2) High school diploma or GED, (3) Some college or Associate’s degree, (4) Bachelor’s degree, (5) Master’s, Professional, or Doctoral degree
heated_square_footage: Discrete. Square footage of the housing unit that is heated by space heating equipment. A derived variable rounded to the nearest 10.
- Possible values: 240-15000
total_rooms: Discrete. Total number of rooms in the housing unit, excluding bathrooms; a derived variable.
- Possible values: 1-15
insulation_level: Categorical. Level of insulation; respondent-reported.
- Possible values: (1) Well insulated, (2) Adequately insulated, (3) Poorly insulated, (4) Not insulated
year_build_range: Categorical. Range when housing unit was built.
- Possible values: (1) Before 1950, (2) 1950 to 1959, (3) 1960 to 1969, (4) 1970 to 1979, (5) 1980 to 1989, (6) 1990 to 1999, (7) 2000 to 2009, (8) 2010 to 2015, (9) 2016 to 2020

Level 2: Climate context

heating_degree_days & cooling_degree_days: Discrete. These variables are cumulative measures that track how much the outdoor temperature deviates from a base temperature of 65°F over the course of 2020, helping to estimate energy needs for heating and cooling. Both values are calculated using weighted data from nearby weather stations and provide context about the external climate pressures on household energy use.
- Heating Degree Days (HDD) indicate how often and by how much outdoor temperatures fell below 65°F, suggesting the demand for heating. The higher the HDD, the colder the location and the greater the need for heating. Values range from 0-17383.
- Cooling Degree Days (CDD), on the other hand, capture how often and by how much outdoor temperatures rose above 65°F, reflecting the demand for air conditioning. The higher the CDD, the hotter the location and the greater the need for cooling. Values range from 0-5534.
BA_climate: Categorical. This variable represents the Building America Climate Zone where the household is located. These zones are used to group regions with similar weather conditions, like temperature and humidity, to help design energy-efficient homes.
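The degree-day definitions above can be illustrated with a short calculation. This is only a sketch of the common daily-mean definition with a 65°F base; the RECS values are derived from weighted weather-station data, and the four temperatures below are made up.

```r
# Made-up daily mean outdoor temperatures in °F (illustrative only)
daily_mean_temp <- c(30, 50, 65, 80)
base_temp <- 65

# Each day contributes only the part of its deviation on the relevant side of 65°F
hdd <- sum(pmax(base_temp - daily_mean_temp, 0))  # 35 + 15 + 0 + 0
cdd <- sum(pmax(daily_mean_temp - base_temp, 0))  # 0 + 0 + 0 + 15

c(HDD = hdd, CDD = cdd)
```

A day exactly at the base temperature adds nothing to either total, which is why a location can accumulate large values of both measures over a year.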

Building America categories and map

Statistical analysis

Response Variable: thermostat_setting

The thermostat setting ranges from 50°F to 85°F, with the mean approximately 0.22 degrees lower than the median, suggesting a roughly symmetric distribution. Most values fall between 68°F (first quartile) and 72°F (third quartile). The histogram and boxplot confirm that the data are fairly evenly centered, though a few outliers are present, especially at lower values.

Independent Variables

Categorical Variables

Across all categorical household and housing categories analyzed, average thermostat settings were consistently centered around 70°F. Regardless of factors such as education level, employment status, race, income, insulation quality, or year built, the mean thermostat setting showed minimal variation between groups. Additionally, the spread of thermostat settings within each category, as reflected by the interquartile range, appeared relatively similar across variables. These initial findings suggest that thermostat preferences may be fairly stable across different types of households and building characteristics, with no strong differences emerging at the descriptive level. This pattern holds for most categories of each variable, regardless of how many observations fall in each category.

Quantitative Variables

The quantitative variables related to heated square footage and cooling/heating degree days in the Midwest generally follow an approximately normal distribution, with slight right skewness. Heated square footage shows the clearest right skew and several high outliers, likely reflecting especially large homes in the region. Despite this, it is unimodal, and its mean and median are close to each other, reflecting a clear center.

Check Assumptions

Thermostat setting & Cooling degree days

Cooling degree days represent cooling demand throughout the year. While there appears to be a weak positive trend between higher CDD values and higher thermostat settings in winter, the relationship is not statistically significant (p = 0.06036). Linearity appears reasonable based on the scatterplot, and the residuals vs fitted plot shows no strong pattern, suggesting constant variance. Residuals appear approximately normal, indicating that model assumptions are reasonably met.

Thermostat setting & Heating degree days

Heating degree days measure heating demand across 2020. A small but statistically significant negative relationship was found between heating degree days and thermostat settings (β = –0.00015, p < 0.001). Linearity appears reasonable, and the Breusch-Pagan test (p = 0.40) indicates no evidence of heteroscedasticity. The residuals are normally distributed. Although statistically significant, the effect size is very small and likely not practically meaningful, which aligns with expectations: winter thermostat settings do not necessarily reflect year-round heating demand patterns.
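The Breusch-Pagan check mentioned above can be sketched in base R. The data here are simulated and the variable names are hypothetical, so the numbers will not match the reported p = 0.40. The statistic shown is the studentized (Koenker) form, which is also what `lmtest::bptest()` reports by default.

```r
set.seed(316)

# Simulated stand-in data: names and values are hypothetical, not RECS records
n   <- 500
hdd <- runif(n, 4000, 9000)
thermostat <- 71 - 0.00015 * hdd + rnorm(n, sd = 3)

fit <- lm(thermostat ~ hdd)

# Auxiliary regression: do the squared residuals vary with the predictor?
aux <- lm(resid(fit)^2 ~ hdd)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary fit,
# compared against a chi-squared distribution with one df per predictor
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_p)
```

A large p-value means the squared residuals show no detectable relationship with the predictor, which is how the constant-variance conclusion above would be reached.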

Thermostat setting & Heated square footage

Heated square footage shows a weak negative linear relationship with thermostat settings (r = –0.04), and the correlation is statistically significant (p = 8.502e-05). The scatterplot suggests a slight funnel shape, with greater variability in smaller homes, while residuals display a mild reverse funnel. A Breusch-Pagan test indicated strong heteroscedasticity (p < 0.001), so robust standard errors were applied. After correction, the relationship remained statistically significant (β = –0.00012, p = 0.0051). Residuals are approximately normal, and no strong patterns suggest model misspecification. Overall, larger homes tend to have slightly lower winter thermostat settings, although the effect size is small.
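The text does not say which robust estimator was used, so the following is a sketch under the assumption of an HC1 heteroscedasticity-consistent estimator, written out in base R on simulated data. In practice `sandwich::vcovHC(fit, type = "HC1")` together with `lmtest::coeftest()` would give the same standard errors.

```r
set.seed(316)

# Simulated stand-in data whose error variance shrinks as home size grows
n    <- 500
sqft <- runif(n, 500, 5000)
thermostat <- 70.5 - 0.00012 * sqft + rnorm(n, sd = 4 - sqft / 2000)

fit <- lm(thermostat ~ sqft)
X   <- model.matrix(fit)
e   <- resid(fit)

# Sandwich estimator: (X'X)^-1  X' diag(e^2) X  (X'X)^-1
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)            # equals X' diag(e^2) X

# HC1 applies the small-sample factor n / (n - number of coefficients)
vcov_hc1  <- bread %*% meat %*% bread * n / (n - ncol(X))
robust_se <- sqrt(diag(vcov_hc1))

cbind(estimate     = coef(fit),
      classical_se = sqrt(diag(vcov(fit))),
      robust_se    = robust_se)
```

The coefficient estimates do not change; only the standard errors, and therefore the p-values, are adjusted.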

Thermostat setting & total rooms

The number of rooms is negatively associated with thermostat settings (r = –0.03), and the original correlation was statistically significant (p = 1.145e-05). The scatterplot reveals a narrow-to-wide funnel, particularly concentrated among households with fewer rooms. A Breusch-Pagan test showed strong evidence of heteroscedasticity (p < 0.001), necessitating robust standard errors. After correction, the negative relationship became marginally non-significant (β = –0.041, p = 0.059). Residuals appear normally distributed and no concerning patterns were observed. Although the relationship is weak and not conventionally significant after correction, there is a consistent trend suggesting that households with more rooms may maintain slightly lower thermostat settings.

Multilevel Model

These findings set the foundation for moving into multilevel modeling. Individually, most predictors show weak but consistent relationships with winter thermostat settings. Significant heteroscedasticity was observed in two key housing variables—heated square footage and total rooms—indicating that variability in thermostat settings may differ depending on household characteristics. However, overall, the assumptions of linearity, equal variance (after corrections), and normal residuals are reasonably satisfied.

Given that households are located in different climate zones across the Midwest, it is reasonable to expect that not all of the variation in thermostat settings can be explained purely by individual household or housing factors. Climate zone-level factors likely contribute additional variability. Therefore, a multilevel modeling approach is appropriate, where individual households (Level 1) are nested within broader Building America climate zones (Level 2).

Our multilevel modeling strategy will begin with random intercepts models to account for different average thermostat settings across climate zones. We will then consider random slopes for key predictors, such as heating degree days or square footage, if the data suggest that the effect of these variables varies by climate zone. This strategy allows us to simultaneously capture household-level influences and broader regional effects, providing a richer and more accurate understanding of what drives winter thermostat settings in Midwestern households.

Model 1

This baseline model uses a standard linear regression approach without incorporating any multilevel structure. Each regional climate zone (BA_climate) is treated as a fixed effect, meaning the model estimates a separate intercept for each group. All other predictors are included with fixed slopes. This setup assumes that the effect of each predictor is consistent across regions, allowing us to establish a benchmark for comparison with more complex multilevel models.
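The fitting code for Model 1 is not shown, so the following sketch only illustrates the structure described above: climate zone entered as a factor alongside a predictor with a single shared slope. The data are simulated, and only one predictor is included for brevity.

```r
set.seed(316)

# Simulated stand-in data: variable names mirror the text, values are made up
n <- 300
toy <- data.frame(
  BA_climate = factor(sample(c("Cold", "Very-Cold", "Mixed-Humid"), n, replace = TRUE)),
  heated_square_footage = runif(n, 500, 5000)
)
toy$thermostat_setting <- 70 - 0.0001 * toy$heated_square_footage + rnorm(n, sd = 3)

# Climate zone enters as a factor, so each zone gets its own intercept shift,
# while the square-footage slope is shared by all zones
model1_sketch <- lm(thermostat_setting ~ BA_climate + heated_square_footage, data = toy)

coef(model1_sketch)
```

With three zones, the output contains a reference intercept, two zone offsets, and one slope.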

This model’s coefficients reveal several predictors of winter thermostat settings across Midwestern households. Black or African American households set temperatures about 1.67°F higher than White households, while households earning $5k–$7.5k set them significantly lower. Higher education levels are associated with lower thermostat settings: those with a college degree or higher set temperatures roughly 1.2–1.3°F lower than those without a diploma. Poor insulation is linked to higher setpoints, suggesting compensation for heat loss. Regions with more cooling degree days and the Mixed-Humid climate zone also show distinct patterns. Although the model explains only ~4.5% of the variance, it highlights meaningful social and structural influences on home heating behavior.

recs20_mdw$predicted <- predict(model1)

ggplot(recs20_mdw, aes(x = predicted, y = thermostat_setting)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "blue") +
  labs(
    title = "Actual vs Predicted Thermostat Setting",
    x = "Predicted Value",
    y = "Observed Value"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The actual vs predicted plot shows that the points generally scatter around the line, suggesting the model has some slight predictive power but still struggles with precision. The spread of points is concentrated around the center of the data where most of the observations lie, indicating possible heteroscedasticity at lower and higher values of the distribution. A few outliers are visible, possibly reflecting limitations in the model or unusual household conditions.

Model 2

Model 2 introduces a multilevel approach by allowing the intercept to vary across climate zones (BA_climate), treating each as a random effect drawn from a common normal distribution. This accounts for unobserved regional differences that may influence thermostat settings, such as cultural norms or building standards specific to each climate zone.
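A random-intercept model of this kind can be sketched with `lme4::lmer()`, assuming that is the package behind the models here (the fitting code is not shown). The data are simulated, and eight hypothetical zones are used because a between-group variance is hard to estimate from only a handful of groups.

```r
library(lme4)
set.seed(316)

# Simulated stand-in data: zone labels and values are made up
n_zones  <- 8
n_per    <- 60
zone     <- factor(rep(paste0("zone_", 1:n_zones), each = n_per))
zone_dev <- rnorm(n_zones, sd = 1)    # zone-level intercept shifts
toy <- data.frame(
  zone = zone,
  thermostat_setting = 70 + zone_dev[as.integer(zone)] +
    rnorm(n_zones * n_per, sd = 3)
)

# Intercept-only model: one overall mean plus a normally distributed zone deviation
null_fit <- lmer(thermostat_setting ~ 1 + (1 | zone), data = toy)

# Share of total variance that lies between zones (intraclass correlation)
vc  <- as.data.frame(VarCorr(null_fit))
icc <- vc$vcov[vc$grp == "zone"] / sum(vc$vcov)
icc
```

`ranef(null_fit)$zone` lists the estimated deviation of each zone from the overall mean, which is the quantity behind statements that one zone runs slightly warmer or cooler than average.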

Model 2 contains no household-level predictors, which is why its marginal R² in the comparison table is 0.000; it isolates how much of the variation in thermostat settings lies between climate zones. By allowing the intercept to vary across BA_climate zones, the model captures modest regional differences in baseline thermostat behavior. The estimated random intercepts suggest that households in the Mixed-Humid zone tend to set slightly higher temperatures than average, while those in Very-Cold zones set them slightly lower.

recs20_mdw$predicted_model2 <- predict(model2)

ggplot(recs20_mdw, aes(x = predicted_model2, y = thermostat_setting)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Actual vs Predicted Thermostat Settings (Random Intercept Model)",
    x = "Predicted Thermostat Setting",
    y = "Observed Thermostat Setting"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

This model serves as a baseline for measuring how much of the variance in thermostat settings is due to differences between climate zones. Each vertical grouping of points represents the observations within a specific BA_climate zone. The model assigns a unique intercept to each zone but applies no individual-level predictors. The single sloped line across all points approximates the central trend, suggesting that at least part of the variation in thermostat settings is explained at the group level.

Model 3

Model 3 builds on the baseline multilevel structure by incorporating household- and building-level predictors as fixed effects, while still allowing the intercept to vary by climate zone (BA_climate). This structure accounts for both between-group (climate zone) and within-group (household) variation in thermostat settings.

recs20_mdw$predicted_model3 <- predict(model3)

ggplot(recs20_mdw, aes(x = predicted_model3, y = thermostat_setting)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "blue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Actual vs Predicted Thermostat Setting (Model 3)",
    x = "Predicted Thermostat Setting",
    y = "Observed Thermostat Setting"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The points remain scattered around the red identity line, indicating that the model modestly improves prediction compared to the baseline. Significant fixed effects include race (e.g., Black households set thermostats about 1.77°F higher than White households), low income (e.g., households earning $5k–$7.5k set thermostats 1.56°F lower), and education (those with a graduate degree set thermostats about 1.29°F lower). Poor insulation is also strongly associated with higher thermostat settings (+0.74°F). The model intercept (~71.28°F) represents the average setpoint after accounting for predictors, and the estimated variance across climate zones is minimal—conditional means are effectively centered at zero—suggesting that most variation is now explained by within-zone factors. In terms of predictive power, Model 3 provides a noticeably better fit than the random-intercept-only model, capturing more of the variability in thermostat settings through meaningful fixed effects.

Model 4

This model builds on the previous models by letting the effect of home size (heated square footage) change depending on the climate zone. In other words, instead of assuming that home size always affects thermostat settings the same way, this model allows that relationship to be stronger, weaker, or even reversed depending on the region. This helps us see if people in different climates respond differently to living in larger homes, while still accounting for the usual household factors like race, income, education, and insulation.
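A random slope of this kind might be specified as follows. This is a sketch on simulated data, again assuming `lme4::lmer()`; square footage is expressed in thousands (a hypothetical `sqft_k` variable) so that the slope is on a scale the optimizer handles comfortably.

```r
library(lme4)
set.seed(316)

# Simulated stand-in data: eight hypothetical zones, each with its own slope
n_zones <- 8
n_per   <- 80
zone    <- factor(rep(paste0("zone_", 1:n_zones), each = n_per))
z       <- as.integer(zone)
sqft_k  <- runif(n_zones * n_per, 0.5, 5)      # heated area in 1,000 sq ft

zone_int   <- rnorm(n_zones, sd = 1)
zone_slope <- rnorm(n_zones, mean = -0.1, sd = 0.3)

toy <- data.frame(
  zone, sqft_k,
  thermostat_setting = 70 + zone_int[z] + zone_slope[z] * sqft_k +
    rnorm(n_zones * n_per, sd = 3)
)

# (1 + sqft_k | zone): both the intercept and the sqft slope vary by zone
slope_fit <- lmer(thermostat_setting ~ sqft_k + (1 + sqft_k | zone), data = toy)

# Zone-specific coefficients = fixed effect + that zone's random deviation
coef(slope_fit)$zone
```

Each row of the output gives one zone's intercept and slope, so a slope that is negative in some rows and positive in others corresponds to an effect that reverses by region.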

recs20_mdw$predicted_model4 <- predict(model4)

ggplot(recs20_mdw, aes(x = predicted_model4, y = thermostat_setting)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "blue") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Actual vs Predicted Thermostat Settings (Model 4)",
    x = "Predicted Thermostat Setting",
    y = "Observed Thermostat Setting"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

This plot shows how well Model 4 predicts thermostat settings. The points are still scattered around the red identity line, meaning that prediction error remains. As in earlier models, Black households tend to set their thermostats about 1.77°F higher, and people with lower incomes or more education tend to set lower temperatures. Homes with poor insulation continue to show higher thermostat settings, by about 0.74°F.

The overall effect of home size is small and negative, meaning that on average, bigger homes tend to have slightly lower thermostat settings. However, this effect is not very strong or statistically significant. The added random slope shows that the impact of square footage changes a bit by climate zone: it’s slightly negative in Cold and Very-Cold areas, but slightly positive in the Mixed-Humid region. So, the connection between home size and thermostat setting isn’t the same everywhere.

Model 5

This model allows the relationship between how cold a region gets and the household’s thermostat setting to differ depending on the climate zone. While fixed effects still capture the influence of race, income, education, and housing features, the random slope for HDD helps test whether colder regions behave differently in how weather impacts heating behavior. This model is useful for understanding how external climate severity interacts with internal heating preferences, especially in different regions.

recs20_mdw$predicted_model5 <- predict(model5)

ggplot(recs20_mdw, aes(x = predicted_model5, y = thermostat_setting)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "blue") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Actual vs Predicted Thermostat Settings (Model 5)",
    x = "Predicted Thermostat Setting",
    y = "Observed Thermostat Setting"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

This plot shows how well Model 5 predicts thermostat settings. The points follow the red line closely, meaning the model does a good job predicting values. The fixed effects are consistent with earlier models: Black households set their thermostats about 1.70°F higher, and people with low incomes or higher education levels generally keep their homes cooler. Homes with poor insulation show a significantly higher setpoint (about +0.76°F).

The average effect of heating degree days (HDD) across all households is not directly shown in the fixed effects here, but the random slopes show interesting regional differences. In the Cold zone, the effect of HDD is stronger (−0.000177°F per degree day), suggesting that households set thermostats lower where annual heating degree days are higher. In the Mixed-Humid and Very-Cold zones, the HDD effect is smaller (−0.000085 and −0.000035, respectively), meaning that differences in heating degree days affect thermostat settings less in those areas.
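Because these slopes are expressed per single degree day, while annual HDD totals run into the thousands, the magnitudes are easier to read after rescaling. A small worked conversion, using the slopes quoted above:

```r
# Zone-specific HDD slopes as reported in the text (°F per heating degree day)
hdd_slope <- c(Cold = -0.000177, `Mixed-Humid` = -0.000085, `Very-Cold` = -0.000035)

# Express each slope per 1,000 HDD so the magnitudes are easier to compare
per_1000_hdd <- hdd_slope * 1000
per_1000_hdd
```

On this scale, a difference of 1,000 HDD corresponds to roughly 0.18°F in the Cold zone and well under 0.1°F in the other two zones.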

The random intercepts show that baseline thermostat settings also vary slightly by region, with Cold zones starting the highest (about +1.25°F above average), Mixed-Humid at +0.60°F, and Very-Cold at +0.25°F.

Model 6

Model 6 adds another layer of complexity by allowing both heating degree days (HDD) and cooling degree days (CDD) to have random slopes by climate zone (BA_climate). This means the model doesn’t assume a single, fixed effect for outdoor temperature; instead, it allows the relationship between weather and thermostat settings to vary across regions. This approach helps capture whether people in different climates respond differently to cold and heat when setting their indoor temperatures. The model still includes all earlier fixed effects like race, income, education, and home characteristics.

recs20_mdw$predicted_model6 <- predict(model6)

ggplot(recs20_mdw, aes(x = predicted_model6, y = thermostat_setting)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "blue") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Actual vs Predicted Thermostat Settings (Model 6)",
    x = "Predicted Thermostat Setting",
    y = "Observed Thermostat Setting"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

From the random effects, we see that responses to cold and heat vary meaningfully:
- In the Very-Cold zone, the thermostat setting increases more with warmer days (CDD effect = +0.00127) and is least sensitive to cold days (HDD effect = +0.000006).
- In the Cold zone, households respond slightly more to cold (HDD effect = −0.000007) but still adjust upward in warmer weather (CDD = +0.00114).
- The Mixed-Humid zone shows more muted responses overall.

These patterns suggest that climate zones shape how strongly households adjust indoor temperatures based on outdoor conditions, with households in colder zones raising their winter thermostat settings more as cooling degree days increase.

The model’s ability to allow both weather variables to vary by region offers a more realistic and flexible representation of energy use behavior. Visually, predicted values from Model 6 align well with observed thermostat settings, indicating solid predictive accuracy.

Overall Assessment

model_comparison <- data.frame(
  Model = paste("Model", 1:6),
  AIC = round(aic_values, 1),
  BIC = round(bic_values, 1),
  Marginal_R2 = round(c(
    r2_model1,
    r2_model2,
    r2_model3,
    r2_model4,
    r2_model5,
    r2_model6
  ), 3)
)

print(model_comparison)

##     Model     AIC     BIC Marginal_R2
## 1 Model 1 19579.1 19847.7       0.045
## 2 Model 2 19680.8 19699.5       0.000
## 3 Model 3 19655.5 19905.4       0.040
## 4 Model 4 19659.5 19921.8       0.040
## 5 Model 5 19652.9 19915.3       0.036
## 6 Model 6 19665.3 19946.4       0.039
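The objects `aic_values` and `bic_values` used to build the table are not defined in the code shown. One way such vectors might be assembled is sketched below, with two small `lm` fits on simulated data standing in for the six project models.

```r
set.seed(316)

# Simulated stand-in data and two small fits (hypothetical, for illustration)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 70 + 0.5 * toy$x1 + rnorm(200, sd = 3)

fits <- list(
  "Model A" = lm(y ~ x1, data = toy),
  "Model B" = lm(y ~ x1 + x2, data = toy)
)

# AIC() and BIC() are generics that work on most fitted-model classes,
# so sapply() can collect one value per model
aic_values <- sapply(fits, AIC)
bic_values <- sapply(fits, BIC)

data.frame(Model = names(fits),
           AIC   = round(aic_values, 1),
           BIC   = round(bic_values, 1),
           row.names = NULL)
```

One caveat when applying this to mixed models: likelihoods from REML fits are not comparable across models with different fixed effects, so models fitted with `lmer()` would normally be refitted with `REML = FALSE` before their AIC or BIC values are compared.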

The comparison table does not show a steady improvement from Model 1 through Model 6. Model 1, a fixed-effects-only regression, has the lowest AIC (19579.1) and the highest marginal R² (0.045), although it still explains little of the variation in thermostat settings. Model 2, the random-intercept-only model, has the lowest BIC (19699.5), reflecting its small number of parameters, but its marginal R² is 0.000 because it contains no fixed-effect predictors. Adding household and housing predictors in Model 3 lowers AIC relative to Model 2 (19655.5 versus 19680.8) and raises marginal R² to 0.040.

Models 4 and 5 extend the structure by including random slopes for square footage and heating degree days, respectively. Model 4 does not improve on Model 3 (AIC 19659.5, marginal R² 0.040), while Model 5 has the lowest AIC among the multilevel models (19652.9) with a marginal R² of 0.036. These small differences suggest that regional differences in how home size or cold weather influence thermostat settings are modest at best.

Model 6, which includes random slopes for both heating and cooling degree days, has a higher AIC (19665.3) than Models 3 to 5 and the highest BIC (19946.4) of the six models, with a marginal R² of 0.039. Allowing both weather variables to vary by region therefore adds complexity without improving fit. Overall, the criteria favor the simpler specifications: Model 1 by AIC and marginal R², and, among the multilevel models with predictors, Model 5 by AIC and Model 3 by BIC.

Conclusion

This analysis examined how winter thermostat setpoints vary among Midwestern households and how these settings are influenced by both household characteristics and regional weather conditions. Drawing on data from the 2020 Residential Energy Consumption Survey (RECS) and a series of multilevel linear models, the study quantified the effects of both individual-level and climate zone-level factors on indoor temperature preferences.

Results showed that added model complexity did not improve fit: marginal R² stayed between 0.036 and 0.045 for every model with predictors. Household characteristics such as income, race, education level, and insulation quality were significantly associated with thermostat settings. In particular, lower-income and more highly educated households tended to set lower indoor temperatures, while households with poor insulation were more likely to maintain higher settings, highlighting the importance of structural energy efficiency.

The inclusion of climate zone-specific effects did not lower AIC or raise marginal R² relative to the fixed-effects regression, which suggests that regional weather conditions play only a modest role in shaping thermostat behavior. The final model, which allowed the effects of heating and cooling degree days to vary by region, did not provide the best overall fit, but its random slopes offer more detailed insight into geographic variation in energy use.

Overall, this project contributes to a more nuanced understanding of residential energy demand in the Midwest. These findings are valuable for improving energy demand forecasting and can inform utility programs and public policies aimed at reducing emissions, promoting energy efficiency, and supporting more sustainable heating practices.

DATA:

U.S. Energy Information Administration. Residential Energy Consumption Survey (RECS), 2020 Survey Data. U.S. Department of Energy, 2022, https://www.eia.gov/consumption/residential/data/2020/.