Executive Summary

This analysis examines how reported electricity payments vary by housing type in Oregon, focusing on payment differences between detached single-unit homes and attached or multi-unit residences. Using microdata from the 2015 American Community Survey (ACS), a log-linear regression model was used to estimate the association between housing type and monthly electricity payments, controlling for household size, number of bedrooms, and their interaction.

Households in attached or multi-unit housing were associated with approximately 18.4% lower monthly electricity payments than those in detached homes. This estimate remained consistent in a sensitivity analysis that excluded large apartment buildings, with a 95% confidence interval between 15.4% and 21.4%.

While this result may appear intuitive—given shared walls and lower external exposure—this study quantifies the difference using population-level data and a replicable statistical approach. However, the analysis reflects reported household payments, not total energy usage. In some multi-unit buildings, shared utility costs may be bundled into rent or paid by landlords, potentially lowering reported figures. As such, findings should be interpreted as associations with household-reported payments—not as direct measures of efficiency or consumption.

The findings are relevant for urban planners, housing advocates, and energy policymakers. If similar cost differences persist across regions and years, they may support efforts to increase housing density or revise public messaging around the financial and energy-related advantages of multi-unit living.

Introduction

This analysis examines whether monthly electricity payments differ by housing type in Oregon, specifically comparing detached single-unit housing to attached or multi-unit housing. The goal is to estimate the extent to which housing structure is associated with household electricity expenses after adjusting for household size and number of bedrooms.

The analysis uses data from the 2015 American Community Survey (ACS), a nationally representative dataset that provides detailed information on housing and demographics. While energy efficiency is not measured directly, the model evaluates whether structural housing type is linked to electricity payment patterns after accounting for occupancy and space.

The analysis focuses on the following research questions:

  1. Are Oregon households living in attached or multi-unit housing associated with lower electricity payments compared to those in detached single-unit housing?
  2. If so, what is the estimated magnitude of the difference in monthly electricity payments?

These questions are increasingly relevant as policymakers and planners explore ways to manage energy demand, increase housing density, and improve infrastructure efficiency. Understanding how housing type relates to utility costs may inform further analysis and future development strategies.

Data Description

The dataset used in this analysis comes from the 2015 1-Year Public Use Microdata Sample (PUMS) of the American Community Survey (ACS), limited to households in Oregon. The ACS, conducted by the U.S. Census Bureau, collects information continuously throughout the calendar year. As such, the 2015 1-Year Estimates represent household and housing characteristics reported between January 1 and December 31, 2015.

PUMS data offers de-identified individual-level responses and uses a stratified random sampling method, making it suitable for population-level analysis when weighted appropriately. The original dataset contained 15,166 household records, including details on building structure, number of household members, bedrooms, tenure status, utility expenditures, and other housing characteristics.

For this study, the analysis focuses specifically on occupied housing units in Oregon that reported monthly electricity payments, excluding vacant units, group quarters, and households with missing electricity payment data.

Data Overview

Each observation in the dataset corresponds to a single household and includes variables describing household size, structural characteristics, and utility payments. A subset of these variables was selected for modeling electricity payments.

Table 1 provides a summary of these key variables.

Table 1. Variables Used in the Analysis
Variable Description
NP Number of people in the household
BDSP Number of bedrooms in the housing unit
ELEP Monthly electricity payment (in US dollars)
BLD Type of housing structure

Classification of Housing Type

To focus the analysis on conventional residential housing, the dataset was filtered to exclude mobile homes, boats, RVs, and similar non-residential structures. The ACS variable BLD was then used to construct a binary housing classification variable, SFDA (Single Family Detached or Attached):

  • SFDA = 0 identifies detached single-family housing (e.g., standalone houses)
  • SFDA = 1 identifies attached housing, (e.g., duplexes, triplexes and apartments)

Table 2 summarizes how original ACS housing types were grouped into the SFDA classification. Non-residential categories were excluded prior to analysis.

Table 2. ACS Building Type (BLD) Categories and Use in Analysis
ACS.Code ACS.Label Used.in.Analysis. SFDA.Value
01 Mobile home or trailer Excluded NA
02 One-family house detached Included 0
03 One-family house attached Included 1
04 2 apartments Included 1
05 3–4 apartments Included 1
06 5–9 apartments Included 1
07 10–19 apartments Included 1
08 20–49 apartments Included 1
09 50 or more apartments Included 1
10 Boat, RV, van, etc. Excluded NA
# Data preparation and cleaning
# Remove non-residential entries
rm_ix <- which(OR_acs_house_occ$BLD=="Boat, RV, van, etc." | 
               OR_acs_house_occ$BLD=="Mobile home or trailer")
OR_acs_house_occ <- OR_acs_house_occ[-rm_ix,]

# Create binary housing type variable (0=detached home, 1=attached home)
OR_acs_house_occ$SFDA <- ifelse(OR_acs_house_occ$BLD %in% 
                                c("One-family house detached"), 0, 1)

After filtering out non-residential and mobile housing types, 13,774 observations remained for analysis.

Source: U.S. Census Bureau, 2015 American Community Survey Public Use Microdata Sample (ACS PUMS).
https://www.census.gov/programs-surveys/acs/microdata.html

Exploratory Data Analysis

Before modeling, the cleaned dataset was examined to understand the distribution of housing types and assess potential imbalances that could affect analysis.

Table 3 summarizes the frequency of each housing type based on the original ACS BLD classifications. Detached single-family houses are the most common, comprising over three-quarters of the sample. In contrast, attached or multi-unit housing types—including duplexes, triplexes, fourplexes and apartment buildings—make up the remainder.

Table 3. Frequency of Housing Types in Cleaned Dataset (ACS BLD codes)
Housing Type Frequency
10-19 Apartments 418
2 Apartments 357
20-49 Apartments 334
3-4 Apartments 541
5-9 Apartments 509
50 or more apartments 401
One-family house attached 626
One-family house detached 10588

Although the ACS PUMS dataset is based on a stratified sample, the stratification is primarily designed around geography and demographic characteristics—not housing type. As a result, the distribution of housing structures in the unweighted sample may not perfectly reflect the population-level distribution. In particular, less common or concentrated housing types, such as large apartment buildings or mobile homes, may appear underrepresented due to the nature of the sample design.

A closer look at the cleaned dataset in Table 3 reveals an imbalanced distribution in the key variable of interest ‘SFDA’.

A further breakdown of the SFDA classification shows:

  • 76.9% of households live in detached single-family housing (SFDA = 0)
  • 23.1% live in attached housing structures (SFDA = 1)

This imbalance is relevant for analysis, as unequal group sizes can affect model estimates or introduce bias in comparisons.

Stratified Subsample

To examine whether this imbalance affects model results, a stratified subsample was created by randomly selecting an equal number of detached households to match the number of attached households. The resulting dataset includes 3,186 households in each group, forming a balanced 50/50 sample. Table 4 summarizes the distribution.

# Stratified sampling by SFDA group
detached <- OR_acs_house_occ[OR_acs_house_occ$SFDA == 0, ]
attached <- OR_acs_house_occ[OR_acs_house_occ$SFDA == 1, ]

# Only proceed if both groups exist
if (nrow(detached) > 0 & nrow(attached) > 0) {
  set.seed(1)
  detached_sample <- detached[sample(nrow(detached), size = nrow(attached)), ]

  # Combine into balanced dataset
  OR_balanced <- rbind(detached_sample, attached)

  # Calculate count and percent
  sfda_counts <- table(OR_balanced$SFDA)
  sfda_percent <- prop.table(sfda_counts) * 100

  # Create housing type labels based on SFDA values
  housing_labels <- if ("0" %in% names(sfda_counts) & "1" %in% names(sfda_counts)) {
    c("Detached Single-Family", "Attached Housing")
  } else {
    rep("Unknown", length(sfda_counts))
  }

  # Clean table
  sfda_table <- data.frame(
    `SFDA Value` = names(sfda_counts),
    `Housing Type` = housing_labels,
    Count = as.integer(sfda_counts),
    Percent = round(as.numeric(sfda_percent), 1)
  )

  # Display kableExtra table
  kable(sfda_table,
        caption = "Table 4. Distribution of Balanced Subsample",
        align = "lclc") %>%
    kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
} else {
  cat("Unable to generate balanced subsample: One or both housing types are missing from the data.")
}
Table 4. Distribution of Balanced Subsample
SFDA.Value Housing.Type Count Percent
0 Detached Single-Family 3186 50
1 Attached Housing 3186 50

The balanced sample is reserved for supplemental sensitivity testing in later sections of the report. Primary modeling continues with the full dataset (13,774 observations) to maintain statistical power and preserve the original population distribution. In contrast, the balanced version (6,372 observations) reduces statistical power and may exclude important subgroups, particularly among less common combinations of household size and bedroom count.

The full dataset also aligns more closely with the underlying ACS sampling design, ensuring results remain applicable to Oregon’s actual housing composition. Modeling exclusively on a 50/50 sample would overrepresent attached housing and risk mischaracterizing broader population trends.

By retaining both versions, the analysis benefits from the statistical power and real-world representation of the full dataset while using the balanced subset to verify whether observed effects hold when group sizes are equal.

Electricity Payment Distributions

To further explore electricity payment patterns prior to modeling, exploratory visualizations were created to examine the distribution of reported electricity payments and assess differences across housing types. A histogram of monthly electricity payments (ELEP) reveals a strong right skew, which is common in utility payment data. A boxplot comparing electricity payments by building type shows that households in attached housing typically report lower electricity payments than those in detached homes.

Figure 1. Distribution of Monthly Electricity Payments by Housing Type

Figure 1. Distribution of Monthly Electricity Payments by Housing Type

Due to the skewed distribution of ELEP, a natural log transformation was applied. This transformation improves the linearity of relationships between variables, stabilizes variance, and supports key assumptions for regression modeling—particularly normality of residuals and homoscedasticity.

Figure 2 compares the original distribution of electricity payments to the log-transformed version. The log scale provides a more symmetric distribution, making it more suitable for modeling continuous outcomes such as electricity payments.

Figure 2. Comparison of Original and Log-Transformed Electricity Payments

Figure 2. Comparison of Original and Log-Transformed Electricity Payments

Stratified Data Split

To support model generalization and prevent overfitting (learning noise instead of true patterns), the cleaned dataset was split into three subsets using stratified sampling based on the SFDA variable. This ensured that each subset retained the same proportion of detached (SFDA = 0) and attached (SFDA = 1) housing as the full dataset.

  • 50% for training model parameters
  • 25% for validating model selection
  • 25% for testing predictive performance

Using stratified partitioning helps maintain balance between housing types across all modeling stages. Table 5 summarizes the results.

# Create a stratified 50% training set based on SFDA
trainIndex <- createDataPartition(OR_acs_house_occ$SFDA, p = 0.5, list = FALSE)
OR_TRAIN <- OR_acs_house_occ[trainIndex, ]
remaining <- OR_acs_house_occ[-trainIndex, ]

# Stratify the remaining 50% into validation and test sets (25% each)
validIndex <- createDataPartition(remaining$SFDA, p = 0.5, list = FALSE)
OR_VALID <- remaining[validIndex, ]
OR_TEST <- remaining[-validIndex, ]

# Create a summary table of split sizes and SFDA proportions
split_summary <- data.frame(
  Dataset = c("Training", "Validation", "Test"),
  Size = c(nrow(OR_TRAIN), nrow(OR_VALID), nrow(OR_TEST)),
  SFDA_Proportion = c(
    mean(OR_TRAIN$SFDA),
    mean(OR_VALID$SFDA),
    mean(OR_TEST$SFDA)
  )
)

# Display the table
if (require(kableExtra)) {
  kable(split_summary,
        caption = "Table 5. Stratified Data Split Summary (Caret)",
        digits = c(0, 0, 2),
        align = "lcc") %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
} else {
  split_df
}
Table 5. Stratified Data Split Summary (Caret)
Dataset Size SFDA_Proportion
Training 6887 0.24
Validation 3444 0.22
Test 3443 0.23

The stratified method preserves the housing type composition (approximately 77% detached and 23% attached) in each subset. This approach improves model training consistency while ensuring downstream evaluation steps reflect the true population structure.

Model Development

Statistical Modeling

Two linear regression models were developed to estimate the relationship between housing type and household electricity payments, using log-transformed monthly payments (log(ELEP)) as the outcome variable.

  • MOD1 (Additive Model): Includes housing type (SFDA), number of people (NP), and number of bedrooms (BDSP). Assumes these factors influence electricity payments independently.
  • MOD2 (Interaction Model): Builds on MOD1 by adding an interaction term (NP × BDSP) to capture whether the effect of available space on electricity use varies by household size.
# Define the two models (additive and interaction)
MOD1 <- lm(log(ELEP) ~ SFDA + NP + BDSP, data = OR_TRAIN)
MOD2 <- lm(log(ELEP) ~ SFDA + NP + BDSP + NP:BDSP, data = OR_TRAIN)

# Compare models
anova(MOD1, MOD2)
AIC(MOD1, MOD2)
BIC(MOD1, MOD2)

# Extract diagnostics
OR_TRAIN_diag <- augment(MOD2, OR_TRAIN)

Model Validation

# Predict on validation set with both models
val_pred_MOD1 <- predict(MOD1, OR_VALID)
val_pred_MOD2 <- predict(MOD2, OR_VALID)

# Calculate validation MSE for both models
val_MSE_MOD1 <- mean((val_pred_MOD1 - log(OR_VALID$ELEP))^2)
val_MSE_MOD2 <- mean((val_pred_MOD2 - log(OR_VALID$ELEP))^2)

# Compare validation MSE
val_results <- data.frame(
  Model = c("MOD1 (Additive)", "MOD2 (Interaction)"),
  `Validation MSE` = c(val_MSE_MOD1, val_MSE_MOD2)
)

# Display comparison
if(require(kableExtra)) {
  kable(val_results, 
        caption = "Validation Set Performance",
        digits = 4) %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
} else {
  val_results
}

To validate the model selection, both candidate models were evaluated on the validation set. The additive model (MOD1) produced a validation MSE of 0.2852, while the interaction model (MOD2) achieved a better validation MSE of 0.2829, confirming the selection based on AIC and BIC comparisons. This validation step ensures that the improvement from adding the interaction term generalizes beyond the training data.

Model Selection

To evaluate model performance, several criteria were used:

  • Adjusted R-squared: Measures explained variance while adjusting for model complexity.
  • AIC and BIC: Penalize overfitting by balancing model fit and simplicity.
  • ANOVA: Tests whether adding the interaction term significantly improves model fit.

Table 6 below shows the ANOVA test confirmed that the interaction model (MOD2) provided a significant improvement over the additive model (MOD1), with a p-value of 2.8397e-10. Both AIC and BIC also favored MOD2, suggesting that the model’s added complexity was justified by a better overall fit.

# Compare models using ANOVA, AIC and BIC
anova_result <- anova(MOD1, MOD2)
aic_result <- AIC(MOD1, MOD2)
bic_result <- BIC(MOD1, MOD2)

# Comparison table
model_comp <- data.frame(
  Model = c("MOD1 (Additive)", "MOD2 (With Interaction)"),
  DF = c(anova_result$Df[1], anova_result$Df[2]),
  AIC = c(aic_result[1], aic_result[2]),
  BIC = c(bic_result[1], bic_result[2]),
  `p-value` = c(NA, formatC(anova_result$`Pr(>F)`[2], format = "e", digits = 4))
)

if(require(kableExtra)) {
  kable(model_comp, 
        caption = "Table 6. Model Comparison Statistics",
        digits = c(0, 0, 2, 2, 8)) %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
} else {
  model_comp
}
Table 6. Model Comparison Statistics
Model DF AIC.df AIC.AIC BIC.df BIC.BIC p.value
MOD1 (Additive) NA 5 11300.30 5 11334 NA
MOD2 (With Interaction) 1 6 11262.49 6 11304 2.8397e-10

Model Assumptions

Model assumptions were evaluated using diagnostic plots for MOD2, the interaction model (Figure 3). These plots assess linearity, normality of residuals, variance homogeneity, and influence.

Figure 3. Diagnostic Plots for MOD2 (Interaction Model)

Figure 3. Diagnostic Plots for MOD2 (Interaction Model)

Figure 3 Interpretation:

  • Residuals vs Fitted: Residuals appear roughly centered around zero, but the spread narrows slightly at higher fitted values, suggesting potential heteroscedasticity.
  • Q-Q Plot: Residuals mostly follow the theoretical line, though some deviations appear in both tails. Cases 5471, 13088, and 14562 are flagged as extreme outliers.
  • Scale-Location: The red LOESS line trends upward, supporting concerns of non-constant variance across fitted values.
  • Residuals vs Leverage: Several points show high leverage (cases 11, 5710, and 5830). The curved dashed lines represent Cook’s distance contours, which measure how much model coefficients would change if an observation were removed. While these points remain within the 0.5 Cook’s distance boundary shown here, the supplementary analysis in Figure 3b (below) reveals they exceed the more conservative 4/n threshold.
Figure 3b. Cook's Distance Plot for Influential Observations

Figure 3b. Cook’s Distance Plot for Influential Observations

Cook’s distance is a key diagnostic that combines leverage and residual size to identify observations that disproportionately influence regression results. Points with high Cook’s distance values can significantly alter coefficient estimates and potentially bias interpretation. While no assumptions are severely violated, the plots indicate moderate issues with residual spread and influential observations that justify further investigation in later sensitivity checks. The presence of several observations exceeding the standard 4/n Cook’s distance threshold suggests some caution when interpreting model coefficients.

Additional Residual Plots

To explore residual behavior in more depth, Figure 4 plots residuals against each predictor in MOD2.

Figure 4. Residuals vs Key Predictors for MOD2

Figure 4. Residuals vs Key Predictors for MOD2

Figure 4 Interpretation:

  • Fitted Values: Diagonal striping pattern with narrowing spread at higher values confirms mild heteroscedasticity.
  • log(ELEP): Expected linear relationship with outcome variable confirms appropriate log stabilized variance transformation.
  • Housing Type (SFDA): Residuals well-centered for both categories with comparable spreads, validating the binary classification.
  • Number of People (NP): Downward LOESS curve for larger households supports the interaction term’s inclusion and suggests diminishing per-person energy payments.
  • Number of Bedrooms (BDSP): Stable residuals across common bedroom counts with minor variation at extremes, indicating adequate modeling of this predictor.

Overall, these diagnostics support the validity of the model while highlighting areas where performance could be improved. The patterns informed the decision to stratify the data and perform a targeted sensitivity analysis.

Several influential observations merit particular attention for their leverage or outlier status. Cases 11, 5710, and 5830 exhibited notably high Cook’s distance values, well above the standard 4/n threshold, indicating these points have substantial influence on the model estimates. Case 3688 was flagged as an extreme residual outlier. Though these observations represent less than 0.2% of the sample, their influence on model estimates warrants consideration in future analyses.

The stratified three-way data split (training, validation, test) ensured that model development and evaluation were not biased by housing type imbalance. This structure supports generalization while guarding against overfitting.

Based on statistical fit, theoretical plausibility, and diagnostic performance, MOD2 was selected as the final model. It explains more variation in electricity payments and better reflects how household size and space interact across housing types.

Results

This section presents the modeling results and interprets the relationship between housing type and electricity payments. It begins with the main regression findings and follows with a sensitivity analysis to test the robustness of results under more conservative assumptions.

Main Model Results

The final model (MOD2) included housing type, number of people (NP), number of bedrooms (BDSP), and an interaction term between NP and BDSP. This structure accounts for how household size and available space interact to influence electricity payments.

The log-linear model allows each coefficient to be interpreted as the approximate percentage change in electricity payment for a one-unit increase in the predictor, holding other variables constant.

Table 7a presents the raw coefficient estimates with 95% confidence intervals.

# Extract coefficient estimates and confidence intervals
coef_summary <- summary(MOD2)$coefficients
conf_int <- confint(MOD2)
# Combine into a table with updated variable labels
coef_table <- data.frame(
  Variable = c("Intercept", 
               "Attached or Multi-Unit Housing (SFDA)", 
               "Number of People (NP)", 
               "Number of Bedrooms (Bedrooms)", 
               "Interaction (NP × Bedrooms)"),
  `Point Estimate` = round(c(coef_summary[1,1], coef_summary[2,1], coef_summary[3,1], 
                       coef_summary[4,1], coef_summary[5,1]), 2),
  `Lower CI` = round(conf_int[,1], 2),
  `Upper CI` = round(conf_int[,2], 2),
  `p-value` = format(c(coef_summary[1,4], coef_summary[2,4], coef_summary[3,4], 
                coef_summary[4,4], coef_summary[5,4]), scientific = TRUE)
)
if(require(kableExtra)) {
  kable(coef_table, 
        caption = "Table 7a. Model Coefficients with 95% Confidence Intervals") %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
} else {
  coef_table
}
Table 7a. Model Coefficients with 95% Confidence Intervals
Variable Point.Estimate Lower.CI Upper.CI p.value
(Intercept) Intercept 3.94 3.86 4.02 0.000000e+00
SFDA Attached or Multi-Unit Housing (SFDA) -0.20 -0.24 -0.17 1.226747e-27
NP Number of People (NP) 0.20 0.17 0.22 1.980180e-41
BDSP Number of Bedrooms (Bedrooms) 0.14 0.11 0.16 2.747337e-26
NP:BDSP Interaction (NP × Bedrooms) -0.03 -0.04 -0.02 2.839716e-10

Table 7b transforms the coefficients into percentage terms for easier interpretation.

# Calculate transformed estimates for interpretation
transformed_table <- data.frame(
  Effect = c("Attached/Multi-Unit Housing (SFDA=1)", 
             "Each Additional Person (NP)",
             "Each Additional Bedroom (BDSP)",
             "Interaction Effect (NP × BDSP)"),
  
  `Multiplier` = c(
    round(exp(coef_summary[2,1]), 2),
    round(exp(coef_summary[3,1]), 2),
    round(exp(coef_summary[4,1]), 2),
    round(exp(coef_summary[5,1]), 2)
  ),
  
  `Mult. 95% CI` = c(
    paste0(round(exp(conf_int[2,1]), 2), " to ", round(exp(conf_int[2,2]), 2)),
    paste0(round(exp(conf_int[3,1]), 2), " to ", round(exp(conf_int[3,2]), 2)),
    paste0(round(exp(conf_int[4,1]), 2), " to ", round(exp(conf_int[4,2]), 2)),
    paste0(round(exp(conf_int[5,1]), 2), " to ", round(exp(conf_int[5,2]), 2))
  ),
  
  `% Change` = c(
    paste0(format((exp(coef_summary[2,1])-1)*100, digits=1, nsmall=1), "%"),
    paste0(format((exp(coef_summary[3,1])-1)*100, digits=1, nsmall=1), "%"),
    paste0(format((exp(coef_summary[4,1])-1)*100, digits=1, nsmall=1), "%"),
    paste0(format((exp(coef_summary[5,1])-1)*100, digits=1, nsmall=1), "%")
  ),
  
  `% 95% CI` = c(
    paste0(format((exp(conf_int[2,1])-1)*100, digits=1, nsmall=1), "% to ", 
           format((exp(conf_int[2,2])-1)*100, digits=1, nsmall=1), "%"),
    paste0(format((exp(conf_int[3,1])-1)*100, digits=1, nsmall=1), "% to ", 
           format((exp(conf_int[3,2])-1)*100, digits=1, nsmall=1), "%"),
    paste0(format((exp(conf_int[4,1])-1)*100, digits=1, nsmall=1), "% to ", 
           format((exp(conf_int[4,2])-1)*100, digits=1, nsmall=1), "%"),
    paste0(format((exp(conf_int[5,1])-1)*100, digits=1, nsmall=1), "% to ", 
           format((exp(conf_int[5,2])-1)*100, digits=1, nsmall=1), "%")
  )
)

if(require(kableExtra)) {
  kable(transformed_table, 
        caption = "Table 7b. Transformed Model Coefficients for Interpretation",
        align = c('l', 'c', 'c', 'c', 'c')) %>%
    kable_styling(bootstrap_options = c("striped", "hover")) %>%
    column_spec(1, bold = FALSE) %>%
    add_header_above(c(" " = 1, "Multiplicative Effects" = 2, "Percentage Effects" = 2))
} else {
  transformed_table
}
Table 7b. Transformed Model Coefficients for Interpretation
Multiplicative Effects
Percentage Effects
Effect Multiplier Mult..95..CI X..Change X..95..CI
Attached/Multi-Unit Housing (SFDA=1) 0.82 0.79 to 0.85 -18.4% -21.4% to -15.4%
Each Additional Person (NP) 1.22 1.18 to 1.25 21.7% 18.3% to 25.2%
Each Additional Bedroom (BDSP) 1.15 1.12 to 1.18 14.9% 12.0% to 17.8%
Interaction Effect (NP × BDSP) 0.97 0.97 to 0.98 -2.6% -3.4% to -1.8%

Interpretation of Results

  • Households in attached or multi-unit housing (SFDA = 1) were associated with 18.4% lower monthly electricity payments than detached households (SFDA = 0), after adjusting for household size and number of bedrooms. This estimate is fairly precise, with a 95% confidence interval of falling between 15.4% and 21.4%.
  • Electricity payments increased by 21.7% per additional household member and 14.9% per additional bedroom, on average.
  • The negative interaction term (−2.64%) indicates that when both people and bedrooms increase together, electricity payments rise at a slower rate—consistent with shared space efficiencies in larger households.

While the model does not directly measure energy systems, structural design, or occupancy behavior, the results suggest that attached housing is associated with lower monthly electricity payments—even after adjusting for household size and number of bedrooms. This may reflect a combination of structural factors (e.g., shared walls that reduce heat loss), demographic factors (e.g., smaller or lower-income households), and spatial efficiency (e.g., less underutilized space).

However, it’s important to note that reported electricity payments may not fully capture household energy use. In some attached or multi-unit buildings, electricity for shared spaces such as hallways, elevators, or central HVAC systems may be billed to property managers or bundled into rent, resulting in lower reported household electricity payments. As such, the observed payment differences should be interpreted as associations with reported utility payments, not direct measurements of total electricity consumption or efficiency.

Interpreting the Housing Type Effect in Dollar Terms

In a log-linear model, coefficients can be interpreted as multiplicative effects on the dependent variable. For the SFDA variable, this means that households in attached or multi-unit housing (SFDA = 1) are associated with an 18.4% lower monthly electricity payment, on average, compared to those in detached homes (SFDA = 0). This estimate accounts for household size, number of bedrooms, and their interaction, isolating the association with housing type itself.

Based on model assumptions, where a typical detached household reports an average monthly electricity payment of $122.62, an otherwise similar household in an attached or multi-unit residence would be expected to pay approximately $100.00—or $22.62 less per month.

Sensitivity Check

Excluding Large Multi-Unit Buildings

A key concern during model development was that households in large apartment buildings—particularly those with 50 or more units—may follow different utility billing practices. Electricity for common spaces like elevators, lighting, or HVAC systems is often paid by landlords or bundled into rent, potentially making household-reported electricity payments appear lower than actual consumption.

To test whether these high-density buildings were skewing results, the final interaction model (MOD2) was re-estimated after excluding all training observations classified as “50 or more apartments.”

# Exclude large apartment buildings from the training set
OR_TRAIN_filtered <- OR_TRAIN[!grepl("50 or more", OR_TRAIN$BLD), ]

# Refit model
MOD2_restricted <- lm(log(ELEP) ~ SFDA + NP + BDSP + NP:BDSP, data = OR_TRAIN_filtered)
Table 8. Sensitivity Check: Coefficients After Excluding Large Apartment Buildings
Variable Point.Estimate Lower.CI Upper.CI p.value
(Intercept) Intercept 3.9701 3.8889 4.0513 0.0e+00
SFDA Attached or Multi-Unit Housing (SFDA) -0.1998 -0.2373 -0.1622 3.0e-25
NP Number of People (NP) 0.1874 0.1583 0.2164 3.3e-36
BDSP Number of Bedrooms (Bedrooms) 0.1281 0.1019 0.1544 1.4e-21
NP:BDSP Interaction (NP × Bedrooms) -0.0242 -0.0327 -0.0157 2.3e-08

Table 8 shows the updated model coefficients. The SFDA variable remains reliable (p < 0.0001), with attached or multi-unit homes associated with approximately 16.5% lower electricity payments than detached homes. The 95% confidence interval—14.3% to 18.7%—is slightly narrower than in the full model. The other variables (NP, BDSP, and their interaction) remain significant and consistent, reinforcing the interpretation that electricity use becomes more efficient as space and occupancy scale together.

Actual vs. Predicted log(ELEP), excluding large apartment buildings

Actual vs. Predicted log(ELEP), excluding large apartment buildings

Figure 5 displays predicted versus actual log electricity payments (log(ELEP)) for the restricted model, evaluated on the filtered training data. Model performance remains stable, with predictions tracking observed values across both housing types, even after excluding large apartment buildings.

This sensitivity check reduces a plausible source of bias: centralized billing in large apartment buildings. By excluding these buildings, the analysis focuses on more typical low- and mid-rise multi-unit structures, where billing is more likely to reflect actual household-level electricity use. The slightly reduced but still-strong payment difference supports the interpretation that housing type is meaningfully associated with reported electricity expenses—independent of billing anomalies in large complexes.

While this refinement improves the validity of the comparison, it does not resolve all diagnostic concerns—particularly the moderate heteroskedasticity identified earlier. Still, by isolating and removing a known reporting issue, it increases confidence in the core finding.

Key Findings

The significant interaction between number of people (NP) and number of bedrooms (BDSP) indicates that the relationship between bedroom count and electricity payments depends on household size. The negative coefficient for the interaction term suggests that as household size increases, the marginal effect of additional bedrooms on electricity payments decreases—possibly reflecting more efficient space utilization in larger households.

While the model shows that housing type remains a strong predictor of electricity payments even after accounting for household size and space, this difference may not solely reflect improved energy efficiency. In some attached or multi-unit housing, certain electricity costs—such as lighting for common areas or central HVAC systems—may be paid by the landlord or included in rent rather than reported by tenants as utility payments. As a result, the observed lower electricity payments in these households may partially reflect billing practices rather than reduced consumption. This distinction is important when interpreting the magnitude of the difference as a proxy for efficiency, and suggests that further research including more detailed utility billing structures would be needed to draw stronger conclusions about energy use itself.

Model results estimate that households in attached or multi-unit housing pay approximately 18.4% less for monthly electricity payments compared to those in detached single-family homes, after adjusting for other factors. The 95% confidence interval for this reduction ranges from 15.4% to 21.4%, reinforcing both statistical and practical significance.

For many households, an 18% reduction in electricity expenses could translate into meaningful annual savings. These insights may help inform energy efficiency programs, housing affordability initiatives, and urban development strategies.

The final model was refit on the full training dataset, formed by combining the original training and validation sets. This model was then used for final performance evaluation on the held-out test set.

# Combine original training and validation sets to form a final training dataset
# This model will be evaluated on the untouched test set for final performance
final_training_set <- rbind(OR_TRAIN, OR_VALID)

# Refit the selected model structure on the full training dataset
MOD_FINAL <- lm(log(ELEP) ~ SFDA + NP + BDSP + NP:BDSP, data = final_training_set)

Final Model Performance

Evaluation on Log Scale

Figure 6. Actual vs. Predicted log(ELEP) on Test Set

Figure 6. Actual vs. Predicted log(ELEP) on Test Set

The final predictive model, which included housing type, number of occupants, number of bedrooms, and their interaction, was evaluated on the held-out test data to assess generalization performance. As shown in Figure 6, which plots predicted versus actual log(ELEP) on the test set, the model captures broad patterns in the data, but prediction errors widen as actual values increase—particularly at the higher end of the distribution. This increasing variance in residuals is reflected in the model’s overall error rate and helps frame the limitations that emerge when translating predictions back into dollar terms.

On the test set, the model produced a mean squared error (MSE) of 0.2944, indicating that predicted electricity payments typically differ from actual values by about 29% on the log scale. While this reflects some unexplained variation, the model captures meaningful differences in housing types and household structure.

## calculate root mean squared error on log scale
log_mse <- MSE  
log_rmse <- sqrt(log_mse)  

## Back-transform to multiplicative error factor
error_factor <- exp(log_rmse)  ## gives ratio of predicted to actual

## Retrieve detached household average (from training set)
avg_detached_cost <- mean(OR_TRAIN$ELEP[OR_TRAIN$SFDA == 0], na.rm = TRUE)

## Calculate prediction interval in dollar terms
lower_bound <- avg_detached_cost / error_factor
upper_bound <- avg_detached_cost * error_factor

## Return results rounded for interpretation
list(
  Avg_Detached_Cost = round(avg_detached_cost, 2),
  RMSE_Multiplier = round(error_factor, 2),
  Predicted_Cost_Lower_Bound = round(lower_bound, 0),
  Predicted_Cost_Upper_Bound = round(upper_bound, 0)
)

Back-Transformed Interpretation

While the model’s MSE is calculated on the log scale, it’s helpful to translate this into dollar terms for better interpretability. When back-transformed, an MSE of 0.2944 implies that predicted electricity payments typically differ from actual values by an average multiplicative error factor of about 1.72. For example, if the model predicts a typical detached household will spend $122.62 on electricity, the actual payment could reasonably fall between $71 and $211. This equates to an approximate ±72% margin of error in dollar terms.

This wide range is not a model error but rather a natural byproduct of using a log transformation on a highly skewed outcome variable. Log transformations stabilize variance and improve model fit, but they tend to compress larger values, which can lead to underprediction when results are back-transformed. This is a known tradeoff and should be made transparent to readers evaluating model accuracy in practical settings.

Figure 7. Actual vs. Predicted Electricity payments (Back-Transformed $ Scale)

Figure 7. Actual vs. Predicted Electricity payments (Back-Transformed $ Scale)

As seen in Figure 7, which presents the same predictions back-transformed to the dollar scale on the test set, values tend to cluster below the 1:1 reference line—particularly for households with high actual payments. This suggests that while the model performs reasonably well for most typical households, it tends to underestimate larger electricity bills—likely due to either the log transformation or omitted variables that would explain unusually high usage (e.g., square footage, heating type, or local climate).

This pattern doesn’t invalidate the model—it still identifies general structural relationships—but highlights an important limitation: the model is best used for estimating typical costs across a population, rather than precisely predicting high-cost outliers.

Discussion and Conclusion

This analysis examined how structural housing type relates to monthly electricity payments in Oregon using 2015 American Community Survey data. Households were grouped into a binary classification: detached single-unit housing versus attached or multi-unit housing, based on distinctions relevant to density, space configuration, and potential energy efficiency.

After adjusting for household size, number of bedrooms, and their interaction, households in attached or multi-unit housing were associated with approximately 18.4% lower monthly electricity payments than those in detached homes.

Interpreting the Payment Difference

While this difference may seem intuitive—given shared walls and reduced exposure to exterior conditions—this analysis quantifies it in practical terms: An average detached Oregonian household in 2015 pays approximately $122.62 per month in electricity, whereas a household in an attached or multi-unit home pays about $100.00. This results in an estimated $22.62 monthly savings per household.

At scale, modest shifts in housing composition could yield substantial savings. For example, 1,000 attached households save an estimated $271,000 per year in electricity payments compared to 1,000 detached homes. However, it’s important to note that when comparing similar households, the model has approximately ±72% margin of error. This level of uncertainty reflects the compression that results from modeling on a log scale and then back-transforming to dollars. It reinforces that these estimates should be viewed as indicative patterns, not precise forecasts. Still, the analysis demonstrates a meaningful and consistent association between housing structure and reported electricity costs.

Limitations

A central limitation is that the model estimates reported household electricity payments, not total energy use. In many attached buildings, electricity for common areas—hallways, elevators, HVAC—may be paid by landlords or bundled into rent, and not reflected in reported payments. As a result, lower household bills may not indicate lower consumption.

Other limitations include:

  • The use of cross-sectional data from a single year (2015)
  • Exclusion of variables like income, appliance efficiency, or building age
  • The observational nature of the data, which prevents causal inference

These findings should be interpreted as associations with reported electricity payments, not direct evidence of energy efficiency.

Model Considerations

The model captures broad structural patterns, but shows increasing variability and underestimation for households with high electricity payments. This is partly a byproduct of the log transformation used to manage skewness and heteroscedasticity—useful for improving model fit, but prone to compressing high values after back-transformation. This tradeoff is common in predictive modeling but should be acknowledged when interpreting the results.

To better model nonlinear or threshold effects—especially for high-cost outliers—future work could explore alternative approaches, including:

  • Random forest or gradient boosting models (e.g., XGBoost), which can capture complex interactions without the need for transformation
  • Generalized additive models (GAMs), which allow for smooth, nonlinear relationships between predictors and outcomes

These methods may improve accuracy and interpretability, particularly when modeling electricity costs across a broader range of household types and usage behaviors.

Policy Relevance

While the analysis does not confirm greater energy efficiency in multi-unit housing, it identifies a consistent pattern: even after adjusting for size and occupancy, housing structure is associated with notable payment differences that may reflect physical form, billing structures, or both.

These findings have relevance for:

  • Planners and developers: Infill development and shared-wall housing may reduce per-household electricity expenses
  • Policy designers: Incentives for mid-density or attached units may align with energy cost savings
  • Public messaging: Multi-unit housing can be framed as cost-conscious and energy-conscious—not just affordable

These benefits complement other policy goals, including sustainable land use and reduced infrastructure strain.

Future Research

Further work could strengthen and extend these findings by:

  • Using multi-year data to assess trend stability
  • Including explanatory variables like square footage, HVAC type, or climate zone
  • Expanding beyond Oregon to test generalizability
  • Comparing actual energy use (e.g., kilowatt hours) to reported payments

These steps would clarify whether the observed differences reflect true efficiency, behavior, or billing design—and help inform more targeted energy and housing strategies.


Appendix

A. Methodological Notes: Interpreting Log-Linear Models

This analysis uses a log-linear regression model where the dependent variable (electricity payments) is log-transformed. This approach is common for modeling skewed data and offers several advantages:

  1. Interpretation: In a log-linear model, coefficients represent the approximate percentage change in the dependent variable for a one-unit change in the predictor. For example, a coefficient of -0.2 for a binary variable implies approximately a 20% decrease when that variable equals 1.

  2. Exact Interpretation: For more precise interpretation, using the transformation: \((e^{\beta} - 1) \times 100\%\), where \(\beta\) is the coefficient. This gives the exact percentage change in the original (non-log) scale.

  3. Confidence Intervals: Similarly, confidence intervals in log scale can be transformed to percentage terms using the same exponential transformation.

This transformation enables intuitive interpretation of effects in percentage terms rather than log units, making results more accessible to non-technical audiences.

B. R Code

Data Prep & EDA

# Load required libraries
library(Sleuth3)   # For statistical analyses
library(ggplot2)   # For data visualization
library(MASS)      # For statistical modeling
library(broom)     # For tidying model outputs
library(tidyr)     # For data reshaping
library(leaps)     # For subset selection methods
library(nlme)      # For mixed-effects models
library(dplyr)     # For data manipulation
library(knitr)     # For report formatting
library(kableExtra)# For enhanced tables
library(gridExtra) # For arranging multiple plots
library(caret)
library(car)

# Set random seed for reproducibility
set.seed(1)

# Import dataset
OR_acs_house_occ <- read.csv("../data/ElectricityAnalysis/OR_acs_house_occ.csv")
OR_acs_raw <- read.csv("../data/ElectricityAnalysis/OR_acs_house_occ.csv")

# Define variable descriptions and levels
var_descriptions <- data.frame(
  Variable = c("NP", "BDSP", "ELEP", "BLD"),
  Description = c(
    "Number of people in the household",
    "Number of bedrooms in the housing unit",
    "Monthly electricity payment (in US dollars)",
    "Type of housing structure"
  )
)

# Display formatted table
kable(var_descriptions, caption = "Table 1. Variables Used in the Analysis", align = "l") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

# Define original ACS BLD categories and binary classification
bld_mapping <- data.frame(
  `ACS Code` = c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10"),
  `ACS Label` = c(
    "Mobile home or trailer",
    "One-family house detached",
    "One-family house attached",
    "2 apartments",
    "3–4 apartments",
    "5–9 apartments",
    "10–19 apartments",
    "20–49 apartments",
    "50 or more apartments",
    "Boat, RV, van, etc."
  ),
  `Used in Analysis?` = c("Excluded", "Included", "Included", "Included", "Included", 
                          "Included", "Included", "Included", "Included", "Excluded"),
  `SFDA Value` = c(NA, 0, 1, 1, 1, 1, 1, 1, 1, NA)
)

kable(bld_mapping, caption = "Table 2. ACS Building Type (`BLD`) Categories and Use in Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, position = "left")


# Data preparation and cleaning
# Remove non-residential entries
rm_ix <- which(OR_acs_house_occ$BLD=="Boat, RV, van, etc." | 
               OR_acs_house_occ$BLD=="Mobile home or trailer")
OR_acs_house_occ <- OR_acs_house_occ[-rm_ix,]

# Create binary housing type variable (0=detached home, 1=attached home)
OR_acs_house_occ$SFDA <- ifelse(OR_acs_house_occ$BLD %in% 
                                c("One-family house detached"), 0, 1)

# Create frequency table as data frame
housing_table <- as.data.frame(table(OR_acs_house_occ$BLD))

if (ncol(housing_table) == 2) {
  colnames(housing_table) <- c("Housing Type", "Frequency")
}

if (require(kableExtra)) {
  kable(housing_table, caption = "Table 3. Frequency of Housing Types in Cleaned Dataset (ACS `BLD` codes)") %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
} else {
  print(housing_table)
}


# Stratified sampling by SFDA group
detached <- OR_acs_house_occ[OR_acs_house_occ$SFDA == 0, ]
attached <- OR_acs_house_occ[OR_acs_house_occ$SFDA == 1, ]

# Debug - only proceed if both groups exist
if (nrow(detached) > 0 & nrow(attached) > 0) {
  set.seed(1)
  detached_sample <- detached[sample(nrow(detached), size = nrow(attached)), ]

  # Combine into balanced dataset
  OR_balanced <- rbind(detached_sample, attached)

  # Calculate count and percent
  sfda_counts <- table(OR_balanced$SFDA)
  sfda_percent <- prop.table(sfda_counts) * 100

  # Create housing type labels based on SFDA values
  housing_labels <- if ("0" %in% names(sfda_counts) & "1" %in% names(sfda_counts)) {
    c("Detached Single-Family", "Attached Housing")
  } else {
    rep("Unknown", length(sfda_counts))
  }

  # Clean table
  sfda_table <- data.frame(
    `SFDA Value` = names(sfda_counts),
    `Housing Type` = housing_labels,
    Count = as.integer(sfda_counts),
    Percent = round(as.numeric(sfda_percent), 1)
  )

  # Display kableExtra table
  kable(sfda_table,
        caption = "Table 4. Distribution of Balanced Subsample",
        align = "lclc") %>%
    kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
} else {
  cat("Unable to generate balanced subsample: One or both housing types are missing from the data.")
}

# Exploratory visualizations

# Histogram of electricity payments
p1 <- ggplot(OR_acs_house_occ, aes(x = ELEP)) + 
  geom_histogram(bins = 30, fill = "darkred", color = "white") +
  labs(title = "Figure 1a. Histogram of Electricity Payments (ELEP)",
       x = "Monthly Electricity Payment ($)",
       y = "Frequency") +
  theme_minimal()

# Boxplot of electricity payments by housing type
p2 <- ggplot(OR_acs_house_occ, aes(x = ELEP, y = BLD, fill = BLD)) +
  geom_boxplot() +
  labs(title = "Figure 1b. Electricity Payments by Housing Type",
       x = "Monthly Electricity Payment ($)",
       y = "Housing Type") +
  theme_minimal() +
  theme(axis.text.y = element_text(angle = 0, hjust = 1),
        legend.position = "none")

# Arrange plots in grid
if(require(gridExtra)) {
  grid.arrange(p1, p2, ncol = 1)
} else {
  print(p1)
  print(p2)
}


# Demonstrate log transformation effect
# Create side-by-side plots showing original vs log-transformed data
p3 <- ggplot(OR_acs_house_occ, aes(x = ELEP)) + 
  geom_histogram(bins = 30, fill = "darkred", color = "white") +
  labs(title = "Figure 2a. Original Scale",
       x = "Electricity Payment ($)",
       y = "Frequency") +
  theme_minimal()

p4 <- ggplot(OR_acs_house_occ, aes(x = log(ELEP))) + 
  geom_histogram(bins = 30, fill = "darkgreen", color = "white") +
  labs(title = "Figure 2b. Log-Transformed Scale",
       x = "log(Electricity Payment)",
       y = "Frequency") +
  theme_minimal()

if(require(gridExtra)) {
  grid.arrange(p3, p4, ncol = 2)
} else {
  print(p3)
  print(p4)
}


# Create a stratified 50% training set based on SFDA
trainIndex <- createDataPartition(OR_acs_house_occ$SFDA, p = 0.5, list = FALSE)
OR_TRAIN <- OR_acs_house_occ[trainIndex, ]
remaining <- OR_acs_house_occ[-trainIndex, ]

# Stratify the remaining 50% into validation and test sets (25% each)
validIndex <- createDataPartition(remaining$SFDA, p = 0.5, list = FALSE)
OR_VALID <- remaining[validIndex, ]
OR_TEST <- remaining[-validIndex, ]

# Create a summary table of split sizes and SFDA proportions
split_summary <- data.frame(
  Dataset = c("Training", "Validation", "Test"),
  Size = c(nrow(OR_TRAIN), nrow(OR_VALID), nrow(OR_TEST)),
  SFDA_Proportion = c(
    mean(OR_TRAIN$SFDA),
    mean(OR_VALID$SFDA),
    mean(OR_TEST$SFDA)
  )
)

# Display the table
if (require(kableExtra)) {
  kable(split_summary,
        caption = "Table 5. Stratified Data Split Summary (Caret)",
        digits = c(0, 0, 2),
        align = "lcc") %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
} else {
  split_df
}

Model Fitting & Evaluation

# Define the two models (additive and interaction)
MOD1 <- lm(log(ELEP) ~ SFDA + NP + BDSP, data = OR_TRAIN)
MOD2 <- lm(log(ELEP) ~ SFDA + NP + BDSP + NP:BDSP, data = OR_TRAIN)

# Predict on validation set with both models
val_pred_MOD1 <- predict(MOD1, OR_VALID)
val_pred_MOD2 <- predict(MOD2, OR_VALID)

# Calculate validation MSE for both models
val_MSE_MOD1 <- mean((val_pred_MOD1 - log(OR_VALID$ELEP))^2)
val_MSE_MOD2 <- mean((val_pred_MOD2 - log(OR_VALID$ELEP))^2)

# Compare validation MSE
val_results <- data.frame(
  Model = c("MOD1 (Additive)", "MOD2 (Interaction)"),
  `Validation MSE` = c(val_MSE_MOD1, val_MSE_MOD2)
)

# Display comparison
if(require(kableExtra)) {
  kable(val_results, 
        caption = "Validation Set Performance",
        digits = 4) %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
} else {
  val_results
}


# Compare models using ANOVA, AIC and BIC
anova_result <- anova(MOD1, MOD2)
aic_result <- AIC(MOD1, MOD2)
bic_result <- BIC(MOD1, MOD2)

# Comparison table
model_comp <- data.frame(
  Model = c("MOD1 (Additive)", "MOD2 (With Interaction)"),
  DF = c(anova_result$Df[1], anova_result$Df[2]),
  AIC = c(aic_result[1], aic_result[2]),
  BIC = c(bic_result[1], bic_result[2]),
  `p-value` = c(NA, formatC(anova_result$`Pr(>F)`[2], format = "e", digits = 4))
)

if(require(kableExtra)) {
  kable(model_comp, 
        caption = "Table 6. Model Comparison Statistics",
        digits = c(0, 0, 2, 2, 8)) %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
} else {
  model_comp
}


# Diagnostic plots for the chosen model
par(mfrow=c(2,2))
plot(MOD2)


# Calculate Cook's distances
cooks_d <- cooks.distance(MOD2)

# Identify the top influential points
influential_ix <- order(cooks_d, decreasing = TRUE)[1:10]
influential_labels <- rownames(OR_TRAIN)[influential_ix]

# Plot Cook's distances
plot(cooks_d, 
     type = "h", 
     lwd = 2,
     col = "steelblue",
     ylab = "Cook's Distance",
     xlab = "Observation Index",
     main = "Cook's Distance for Model Observations")

# Add threshold line
cook_cutoff <- 4 / nrow(OR_TRAIN)
abline(h = cook_cutoff, col = "red", lty = 2, lwd = 2)

# Add text label for the threshold
text(x = nrow(OR_TRAIN) * 0.9, 
     y = cook_cutoff * 1.2, 
     labels = paste0("Threshold (4/n): ", round(cook_cutoff, 4)),
     col = "red")

# Highlight top influential points
points(influential_ix, cooks_d[influential_ix], col = "red", pch = 19)

# Label top 3 influential points
top3_ix <- order(cooks_d, decreasing = TRUE)[1:3]
text(top3_ix, cooks_d[top3_ix], 
     labels = rownames(OR_TRAIN)[top3_ix], 
     pos = 3,
     col = "darkred")


# Residual vs Fitted
plot1 <- ggplot(data = data.frame(
  Fitted = fitted(MOD2),
  Residuals = resid(MOD2)
), aes(x = Fitted, y = Residuals)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted Values", x = "Fitted Values", y = "Residuals") +
  theme_minimal()


# Residual vs log(ELEP)
plot2 <- ggplot(data = data.frame(
  ELEP = OR_TRAIN$ELEP,
  Residuals = resid(MOD2)
), aes(x = log(ELEP), y = Residuals)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs log(ELEP)", x = "log(Electricity Payments)", y = "Residuals") +
  theme_minimal()

# Residual vs SFDA
plot3 <- ggplot(data = data.frame(
  SFDA = OR_TRAIN$SFDA,
  Residuals = resid(MOD2)
), aes(x = as.factor(SFDA), y = Residuals)) +
  geom_boxplot(fill = c("lightblue", "lightpink")) +
  labs(title = "Residuals vs Housing Type", x = "Housing Type (0 = Detached, 1 = Attached)", y = "Residuals") +
  theme_minimal()

# Residual vs NP
plot4 <- ggplot(data = data.frame(
  NP = OR_TRAIN$NP,
  Residuals = resid(MOD2)
), aes(x = NP, y = Residuals)) +
  geom_jitter(alpha = 0.4, width = 0.2) +
  geom_smooth(method = "loess", se = FALSE, color = "darkred") +
  labs(title = "Residuals vs Number of People", x = "Number of People (NP)", y = "Residuals") +
  theme_minimal()

# Residual vs BDSP
plot5 <- ggplot(data = data.frame(
  BDSP = OR_TRAIN$BDSP,
  Residuals = resid(MOD2)
), aes(x = BDSP, y = Residuals)) +
  geom_jitter(alpha = 0.4, width = 0.2) +
  geom_smooth(method = "loess", se = FALSE, color = "darkgreen") +
  labs(title = "Residuals vs Number of Bedrooms", x = "Bedrooms (BDSP)", y = "Residuals") +
  theme_minimal()

if(require(gridExtra)) {
  grid.arrange(plot1, plot2, plot3, plot4, plot5, ncol = 2)
} else {
  print(plot1); print(plot2); print(plot3); print(plot4); print(plot5)
}


# Identify key influence metrics
infl <- influence.measures(MOD2)

# Cook’s distance cutoff
cook_cutoff <- 4 / nrow(OR_TRAIN)

# Flags
cook_flags <- infl$is.inf[, "cook.d"]
hat_flags <- infl$is.inf[, "hat"]
residuals <- rstandard(MOD2)

# Combine flags and get row numbers
flagged_cases <- which(cook_flags | hat_flags | abs(residuals) > 3)
flagged_summary <- data.frame(
  Row = flagged_cases,
  Residual = round(residuals[flagged_cases], 2),
  CookD = round(cooks.distance(MOD2)[flagged_cases], 4),
  Leverage = round(hatvalues(MOD2)[flagged_cases], 4)
)

# Output only the top 10 rows
flagged_summary <- flagged_summary[order(-flagged_summary$CookD), ]

if (require(kableExtra)) {
  kable(head(flagged_summary, 10),
        caption = "Top Flagged Observations by Influence Measures") %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
}

Final Model Figures and Output

# Predict on test set
predictions <- predict(MOD_FINAL, OR_TEST)

# Calculate MSE
MSE <- mean((predictions - log(OR_TEST$ELEP))^2)

# Create actual vs. predicted plot
pred_data <- data.frame(
  Actual = log(OR_TEST$ELEP),
  Predicted = predictions,
  Housing_Type = as.factor(OR_TEST$SFDA)
)

# Plot of predicted vs actual log(ELEP)
ggplot(pred_data, aes(x = Actual, y = Predicted, color = Housing_Type)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  annotate("text", x = 5.0, y = 3.97, label = "Dashed line: 1:1 reference line", hjust = 1, size = 3.2, color = "black") +
  labs(
    title = "Figure 6. Actual vs. Predicted Electricity Payments (log scale)",
    x = "log(Actual ELEP)",
    y = "log(Predicted ELEP)",
    color = "Housing Type"
  ) +
  scale_color_manual(
    values = c("blue", "red"),
    labels = c("Detached Home", "Attached Home")
  ) +
  theme_minimal()


## calculate root mean squared error on log scale
log_mse <- MSE  
log_rmse <- sqrt(log_mse)  

## Back-transform to multiplicative error factor
error_factor <- exp(log_rmse)  ## gives ratio of predicted to actual
# This multiplier represents the average prediction error range (e.g., ±72%)

## Retrieve detached household average (from training set)
avg_detached_cost <- mean(OR_TRAIN$ELEP[OR_TRAIN$SFDA == 0], na.rm = TRUE)

## Calculate prediction interval in dollar terms
lower_bound <- avg_detached_cost / error_factor
upper_bound <- avg_detached_cost * error_factor

## Return results rounded for interpretation
list(
  Avg_Detached_Cost = round(avg_detached_cost, 2),
  RMSE_Multiplier = round(error_factor, 2),
  Predicted_Cost_Lower_Bound = round(lower_bound, 0),
  Predicted_Cost_Upper_Bound = round(upper_bound, 0)
)


## Back-transform log-scale predictions and actuals to dollar scale
pred_data_dollar <- data.frame(
  Actual = exp(log(OR_TEST$ELEP)),           ## ensures consistency with log() usage in model
  Predicted = exp(predictions),
  Housing_Type = as.factor(OR_TEST$SFDA)
)

# Generate ggplot comparing predicted vs. actual dollar values
ggplot(pred_data_dollar, aes(x = Actual, y = Predicted, color = Housing_Type)) +
  geom_point(alpha = 0.5, size = 1.5) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  annotate("text", x = 400, y = 194, label = "Dashed line: 1:1 reference line (perfect prediction)", hjust = 1, size = 3.2, color = "black") +
  labs(
    title = "Figure 7. Actual vs. Predicted (Back-Transformed $ Scale)",
    subtitle = "Model predictions of payments plotted on the original dollar scale",
    x = "Actual Electricity payment ($)",
    y = "Predicted Electricity payment ($)",
    color = "Housing Type"
  ) +
  scale_color_manual(
    values = c("blue", "red"),
    labels = c("Detached Home", "Attached Home")
  ) +
  theme_minimal()

C. Data Source and Availability

Dataset Source

The dataset used in this analysis comes from the American Community Survey (ACS) 2015 1-Year Public Use Microdata Sample (PUMS), conducted by the U.S. Census Bureau. The ACS PUMS files contain individual records of the characteristics for a sample of housing units and the people in those housing units.

Data Access

The ACS PUMS data can be accessed through the U.S. Census Bureau website: https://www.census.gov/programs-surveys/acs/microdata.html

For this analysis, the Oregon state subset was used, specifically focusing on housing characteristics including:

  • Housing structure type (BLD)
  • Number of people in household (NP)
  • Number of bedrooms (BDSP)
  • Monthly electricity payments (ELEP)

Data Preparation

The raw PUMS data requires some preparation before analysis. The key preprocessing steps were:

  1. Filtering to Oregon state records
  2. Excluding non-residential units (mobile homes, boats, RVs)
  3. Creating the binary housing classification variable (SFDA)
  4. Log-transforming the electricity payment variable (ELEP)

The complete data preparation code is available in the “Data Preparation & Cleaning” section of this appendix.

Reproducibility Notes

To reproduce this analysis:

  1. Download the 2015 ACS 1-Year PUMS housing data for Oregon
  2. Apply the filtering and preparation steps as documented in this report
  3. Follow the modeling approach described in the methodology section

The full R code provided in this appendix should enable complete reproducibility of all results presented.