Part 0

A. Load Libraries

Load necessary packages

library(purrr)
library(lmtest)
library(sandwich)

Part 1

A. Task 1.A

# Setting up data
set.seed(2024)
N <- 1000
tau <- 5

df <- data.frame(
  hID = 1:N,
  type = as.integer(rbernoulli(N, 0.3))
)

df$consWithout <- rpois(N, 20 + 8 * df$type)
df$consWith <- rpois(N, 20 - tau + 8 * df$type)


library(wooldridge)
library(knitr)

Question 1.A

# (a) Number of columns in the data
ncol(df)
## [1] 4

# (b) Average of consWith for type == 1 and type == 0
mean(df$consWith[df$type == 1])
## [1] 22.79365
mean(df$consWith[df$type == 0])
## [1] 15.00584

# (c) Average difference between consWith and consWithout
mean(df$consWith - df$consWithout)
## [1] -4.971

Answers to Question 1.A

A - 4 columns

B - 22.79365 (type 1) and 15.00584 (type 0)

C - The average difference is -4.971
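As a sanity check on these answers, the Poisson rates in the setup imply E[consWith | type] = 20 - tau + 8 * type, that is 15 for type 0 and 23 for type 1, and an average difference of -tau = -5. The sketch below re-simulates the design in base R and compares sample means with those values. It assumes rbinom() as a stand-in for purrr's rbernoulli(), so the draws and the exact numbers will differ from the ones reported above.

```r
# Minimal re-simulation of the setup. rbinom() replaces rbernoulli(),
# so the random draws differ from those used in the assignment.
set.seed(2024)
N   <- 1000
tau <- 5

type        <- rbinom(N, size = 1, prob = 0.3)
consWithout <- rpois(N, 20 + 8 * type)
consWith    <- rpois(N, 20 - tau + 8 * type)

# Theoretical means of consWith implied by the Poisson rates
theory <- c(type0 = 20 - tau, type1 = 20 - tau + 8)

# Sample counterparts
sample_means <- c(
  type0 = mean(consWith[type == 0]),
  type1 = mean(consWith[type == 1])
)

print(rbind(theory, sample_means))

# Average individual difference; should be near -tau
avg_diff <- mean(consWith - consWithout)
print(avg_diff)
```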

Part 1

A. Task 1.B

# (a) Scatter plot of consWith vs consWithout, colored by type
plot(df$consWithout, df$consWith, col = df$type + 1,
     xlab = "consWithout", ylab = "consWith", main = "Scatter plot of consWith vs consWithout")


# (b) Histogram of difference between consWithout and consWith
df$diff <- df$consWithout - df$consWith
hist(df$diff[df$type == 0], main = "Histogram of diff for type 0", xlab = "Difference")

hist(df$diff[df$type == 1], main = "Histogram of diff for type 1", xlab = "Difference")

Question 1.B

Answers to Question 1.B

A - The color split shows that type has a clear effect on energy consumption: type 1 (high-consumption) households cluster at higher values of both consWith and consWithout in the scatter plot.

B - Yes, it is feasible. Although the program is designed to reduce consumption, some households may consume more with the program than without it. In the simulated data this happens because consWith and consWithout are independent Poisson draws, so consWith can exceed consWithout for an individual household even though the average effect is a reduction of tau = 5. In a real program, one possible cause would be a rebound effect, where households feel they can afford to use more energy because they believe the conservation program is already reducing their overall consumption.
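To illustrate why some simulated households show an increase, the sketch below looks at a type 0 household, whose potential outcomes are independent Poisson(20) and Poisson(15) draws. It estimates the share of households with consWith greater than consWithout by simulation and compares it with the exact probability. The seed, the number of draws, and the variable names are arbitrary choices for this illustration.

```r
# Potential outcomes for a type 0 household:
# consWithout ~ Poisson(20), consWith ~ Poisson(15), drawn independently.
set.seed(1)  # arbitrary seed for this illustration
n_draws      <- 100000
cons_without <- rpois(n_draws, 20)
cons_with    <- rpois(n_draws, 15)

# Simulated share of households whose consumption rises under the program
share_increase <- mean(cons_with > cons_without)
print(share_increase)

# Exact probability: sum over k of P(consWithout = k) * P(consWith > k).
# The sum is truncated at 200, far in the tail of a Poisson(20).
k     <- 0:200
exact <- sum(dpois(k, 20) * ppois(k, 15, lower.tail = FALSE))
print(exact)
```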

Part 1

A. Task 1.C

# (a) Assign treatment probability based on type
set.seed(8675309)
df$tmtProb <- ifelse(df$type == 0, 0.20, 0.50)

# (b) Create a treatment variable tmt
df$tmt <- as.integer(rbernoulli(N, df$tmtProb))

# (c) Create the observed consumption variable
df$cons <- ifelse(df$tmt == 1, df$consWith, df$consWithout)

# (d) Histogram of cons for each value of tmt
hist(df$cons[df$tmt == 0], main = "Histogram of cons for tmt == 0", xlab = "cons")

hist(df$cons[df$tmt == 1], main = "Histogram of cons for tmt == 1", xlab = "cons")


# (e) Create observed dataset
dfObs <- df[, c("cons", "tmt")]

Question 1.C

# (d) Estimate of E[Y0|tmt = 0]
mean(df$consWithout[df$tmt == 0])
## [1] 21.74326

# (e) Estimate of E[Y0|tmt = 1]
mean(df$consWithout[df$tmt == 1])
## [1] 24.07119

Answers to Question 1.C

A - The histograms of cons suggest that the treated group (tmt == 1) has somewhat lower energy consumption on average than the untreated group (tmt == 0). This is consistent with the treatment (participating in the conservation program) reducing consumption.

B - Yes, treatment is correlated with the outcome variable (energy consumption) beyond the program's own effect. The probability of receiving treatment (tmtProb) is higher for high-consumption households (type == 1), and these households generally consume more energy, so treatment status (tmt) is correlated with observed consumption (cons). We can infer this because assignment to treatment depends on type, which directly affects energy consumption.

C - Y0 is the potential outcome without treatment, that is, the energy consumption a household would have if it did not participate in the conservation program. Y0 is included in our simulated data as consWithout, but it is not fully observed in the real world or in our observed dataset (dfObs), where we only have the outcome cons. D is the treatment indicator (tmt): D = 1 indicates that a household participated in the conservation program, and D = 0 indicates that it did not.

D - 21.74326

E - 24.07119

F - No, we would not be able to calculate a sample estimate of E[Y0 | tmt = 1] if we had not constructed the data ourselves using consWith and consWithout. In real-world data, we only observe the outcome that corresponds to the treatment status. For treated households, we observe their consumption with treatment (Y1), but we cannot observe what their consumption would have been without it (Y0). This is the fundamental problem of causal inference: only one potential outcome is observed for each unit.
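The two Y0 means in D and E can be checked against what the assignment rule implies. The sketch below applies Bayes' rule to the probabilities from the setup (30% type 1, treatment probability 0.50 for type 1 and 0.20 for type 0) and uses E[Y0 | type] = 20 + 8 * type. The variable names are illustrative only.

```r
# Design parameters taken from the data-generating code
p_type1     <- 0.3
p_tmt_type1 <- 0.50
p_tmt_type0 <- 0.20

# Overall probability of treatment
p_tmt <- p_type1 * p_tmt_type1 + (1 - p_type1) * p_tmt_type0

# Share of type 1 households among treated and untreated (Bayes' rule)
share_type1_treated   <- p_type1 * p_tmt_type1 / p_tmt
share_type1_untreated <- p_type1 * (1 - p_tmt_type1) / (1 - p_tmt)

# Expected untreated consumption in each group: E[Y0 | type] = 20 + 8 * type
ey0_treated   <- 20 + 8 * share_type1_treated
ey0_untreated <- 20 + 8 * share_type1_untreated

print(c(ey0_treated = ey0_treated, ey0_untreated = ey0_untreated))
print(ey0_treated - ey0_untreated)  # expected gap in Y0 between the groups
```

The implied values, about 24.14 and 21.69, are close to the sample estimates of 24.07119 and 21.74326.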

Part 2

A. Task 2.A

# Run the naive regression
naive_model <- lm(cons ~ tmt, data = dfObs)
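# Note: summary.lm() has no robust argument; robust = TRUE is silently ignored,
# so the standard errors printed below are the classical OLS ones.
summary(naive_model, robust = TRUE)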
## 
## Call:
## lm(formula = cons ~ tmt, data = dfObs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.7433  -3.8956  -0.7433   3.6475  20.2567 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  21.7433     0.2166 100.387  < 2e-16 ***
## tmt          -2.3907     0.3988  -5.995 2.84e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.751 on 998 degrees of freedom
## Multiple R-squared:  0.03476,    Adjusted R-squared:  0.03379 
## F-statistic: 35.94 on 1 and 998 DF,  p-value: 2.84e-09
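The call above passes robust = TRUE, but summary() on an lm object does not use it. The sketch below shows one way heteroskedasticity-consistent (HC1) standard errors could be computed by hand in base R on a small toy dataset. The toy data, seed, and variable names are assumptions for illustration and are not the assignment's data. With the packages loaded in Part 0, the usual equivalent would be coeftest(naive_model, vcov. = vcovHC(naive_model, type = "HC1")).

```r
# Toy data, not the assignment's: a binary treatment and a Poisson outcome
# whose variance differs between groups (heteroskedasticity).
set.seed(42)
n   <- 500
d   <- rbinom(n, 1, 0.3)
y   <- rpois(n, 20 - 5 * d)
fit <- lm(y ~ d)

# HC1 sandwich estimator: (X'X)^-1 [sum e_i^2 x_i x_i'] (X'X)^-1 * n / (n - k)
X     <- model.matrix(fit)
e     <- residuals(fit)
k     <- ncol(X)
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)  # each row of X is scaled by its residual
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

robust_se    <- sqrt(diag(vcov_hc1))
classical_se <- sqrt(diag(vcov(fit)))

print(cbind(estimate = coef(fit), classical_se, robust_se))
```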

Question 2.A

A - The naive regression in Task 2.A gives a coefficient on tmt of -2.3907 with a standard error of 0.3988.

B - The true treatment effect set in the data generation process is tau = 5. This means the program reduces expected energy consumption by 5 units for treated households, so the true coefficient is -5. The estimated coefficient of -2.3907 is only about half that size, so the naive estimate is not accurate. No, the true value does not fall within the 95% confidence interval, which runs from roughly -3.17 to -1.61.

C - Selection bias: Treatment was not randomly assigned. High-consumption households (type == 1) were treated with probability 0.50, versus 0.20 for type == 0, so the treated group would have consumed more even without the program (24.07119 versus 21.74326). This systematic difference pushes the naive estimate toward zero.

Confounding variables: The regression does not control for type, which is correlated with both the treatment and the outcome and is not available in dfObs. Any such omitted variable confounds the estimated effect.

Measurement error: In real data, inaccuracies in the measurement of energy consumption or treatment assignment could also bias the estimate, although the simulated data contain none.
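The selection bias can be made concrete with the standard decomposition: the naive difference in means equals the average effect on the treated plus the gap in Y0 between treated and untreated groups. Using the numbers above, the Y0 gap is 24.07119 - 21.74326 = 2.32793, so the naive estimate of -2.3907 corresponds to a sample effect on the treated of about -4.72. The sketch below verifies the identity on a re-simulated dataset. It assumes rbinom() in place of rbernoulli(), so its numbers will not match the assignment's.

```r
# Re-simulated data with type-dependent treatment probability.
# rbinom() replaces rbernoulli(), so the draws differ from the assignment's.
set.seed(8675309)
N   <- 1000
tau <- 5

type <- rbinom(N, 1, 0.3)
y0   <- rpois(N, 20 + 8 * type)        # consumption without the program
y1   <- rpois(N, 20 - tau + 8 * type)  # consumption with the program
tmt  <- rbinom(N, 1, ifelse(type == 1, 0.50, 0.20))
y    <- ifelse(tmt == 1, y1, y0)       # observed consumption

# Naive comparison of observed means
naive <- mean(y[tmt == 1]) - mean(y[tmt == 0])

# Average effect on the treated (needs both potential outcomes)
att <- mean(y1[tmt == 1] - y0[tmt == 1])

# Selection bias: gap in untreated outcomes between the two groups
selection_bias <- mean(y0[tmt == 1]) - mean(y0[tmt == 0])

print(c(naive = naive, att = att, selection_bias = selection_bias))
print(all.equal(naive, att + selection_bias))
```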

Part 2

A. Task 2.B

# Simulate randomization
set.seed(2026)
df$rtmt <- as.integer(rbernoulli(N, 0.3))

df$rcons <- ifelse(df$rtmt == 1, df$consWith, df$consWithout)

Question 2.B

# (a) Estimate of E[Y0|rtmt == 1]
mean(df$consWithout[df$rtmt == 1])
## [1] 23.09272

# (b) Estimate of E[Y0|rtmt == 0]
mean(df$consWithout[df$rtmt == 0])
## [1] 22.14327

Answers to Question 2.B

A - 23.09272

B - 22.14327

C - 0.94945 (the difference, 23.09272 - 22.14327)
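Randomization equalizes Y0 across groups only in expectation, so a single draw can still show a gap. The sketch below holds a simulated population fixed and repeats the random assignment many times to show how the Y0 gap is distributed. The number of repetitions, the helper name y0_gap, and the use of rbinom() are assumptions for illustration, so the population is not the assignment's.

```r
# Fixed simulated population (rbinom() stands in for rbernoulli())
set.seed(2026)
N    <- 1000
type <- rbinom(N, 1, 0.3)
y0   <- rpois(N, 20 + 8 * type)

# Hypothetical helper: one random assignment and the resulting gap in mean Y0
y0_gap <- function() {
  rtmt <- rbinom(N, 1, 0.3)
  mean(y0[rtmt == 1]) - mean(y0[rtmt == 0])
}

gaps <- replicate(2000, y0_gap())

print(mean(gaps))  # centered near zero
print(sd(gaps))    # typical size of chance imbalance

# Share of assignments with a gap at least as large as the one reported above
print(mean(abs(gaps) >= 0.94945))
```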

Part 2

A. Task 2.C

# Run regression with randomized treatment
randomized_model <- lm(rcons ~ rtmt, data = df)
# Note: robust = TRUE is ignored by summary.lm(); the standard errors below are classical.
summary(randomized_model, robust = TRUE)
## 
## Call:
## lm(formula = rcons ~ rtmt, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6689  -4.1433  -0.6689   3.8567  20.8567 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.1433     0.2201  100.61   <2e-16 ***
## rtmt         -4.4744     0.4005  -11.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.814 on 998 degrees of freedom
## Multiple R-squared:  0.1112, Adjusted R-squared:  0.1103 
## F-statistic: 124.8 on 1 and 998 DF,  p-value: < 2.2e-16

Question 2.C

  1. The regression of rcons (the outcome under randomized treatment) on rtmt (the randomized treatment variable) gives a coefficient on rtmt of -4.4744 with a standard error of 0.4005. This coefficient is the estimated treatment effect under randomization: on average, households randomly assigned to the treatment group consumed about 4.47 units less energy than those in the control group. Note that summary() ignores robust = TRUE, so the reported standard errors are the classical OLS ones rather than heteroskedasticity-consistent ones.

  2. Without randomization, the naive regression in Task 2.A estimated -2.3907. With randomization, the estimate is -4.4744, which is much closer to the true value of -5 set in the data generation process. The gap between the two indicates that the naive estimate was biased by selection effects, while the randomized estimate is not.

  3. Under randomization the difference in mean Y0 between the treated and untreated groups should be close to zero, because randomization makes the two groups similar in expectation regarding their consumption without the treatment. Here the difference is 0.94945, compared with 2.32793 (24.07119 - 21.74326) under the non-random assignment in Task 1.C. The remaining gap reflects chance imbalance in this single draw, and it pushes the randomized estimate toward zero relative to -5.

  4. Yes, randomization worked. The randomized estimate of -4.4744 is close to the true effect of -5 (tau = 5 in the data generation process), and its 95% confidence interval, roughly -5.26 to -3.69, contains -5. Randomization removed the selection bias that distorted the naive estimate.
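To compare the two estimates with the true effect, the sketch below builds 95% confidence intervals from the coefficients and standard errors printed in the regression outputs, using the t distribution with 998 residual degrees of freedom. These are the classical standard errors as reported. The helper names ci and covers are illustrative.

```r
# Estimates and standard errors copied from the regression outputs
df_resid    <- 998
crit        <- qt(0.975, df_resid)
true_effect <- -5  # -tau from the data-generating process

# Hypothetical helper: 95% interval from an estimate and its standard error
ci <- function(est, se) c(lower = est - crit * se, upper = est + crit * se)

naive_ci      <- ci(-2.3907, 0.3988)
randomized_ci <- ci(-4.4744, 0.4005)

# Hypothetical helper: does the interval contain a given value?
covers <- function(interval, value) {
  unname(interval["lower"] <= value & value <= interval["upper"])
}

print(round(rbind(naive_ci, randomized_ci), 3))
print(c(naive      = covers(naive_ci, true_effect),
        randomized = covers(randomized_ci, true_effect)))
```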

Section 3

I spent roughly an hour and a half on this assignment