library(readxl)
## Warning: package 'readxl' was built under R version 4.3.1
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
reddit_data <- read.csv("C:/Users/ameya/OneDrive/Desktop/Fall/ECON/HW1/data_Q1.csv")
head(reddit_data)
acad_data <- read.csv("C:/Users/ameya/OneDrive/Desktop/Fall/ECON/HW1/data_Q2.csv")
reddit_data %>% group_by(treated) %>% count()
  1. If the control and treatment groups are similar across tenure, premium_user, and num_posts_before metrics.
print(t.test(tenure ~ treated, reddit_data))
## 
##  Welch Two Sample t-test
## 
## data:  tenure by treated
## t = 1.373, df = 1789.6, p-value = 0.1699
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -19.09774 108.23144
## sample estimates:
## mean in group 0 mean in group 1 
##        572.1680        527.6011
print(t.test(premium_user ~ treated, reddit_data))
## 
##  Welch Two Sample t-test
## 
## data:  premium_user by treated
## t = 0.95906, df = 1769.9, p-value = 0.3377
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.006928414  0.020188082
## sample estimates:
## mean in group 0 mean in group 1 
##      0.02541436      0.01878453
print(t.test(num_post_before ~ treated, reddit_data))
## 
##  Welch Two Sample t-test
## 
## data:  num_post_before by treated
## t = 0.56253, df = 1796.1, p-value = 0.5738
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.2307971  0.4164325
## sample estimates:
## mean in group 0 mean in group 1 
##        1.643094        1.550276

For all cases p-value> 0.05. So we can consider the means to be the same. So, its examined that the control and treatment groups are similar across tenure, premium_user, and num_posts_before metrics.

  1. Does getting reddit gold increase likelihood that the user will post (use the posted metric as the dependent variable and treated as the independent variable)? Use a simple linear model (not a logit) for the analysis.
summary(lm(posted ~ treated, data = reddit_data))
## 
## Call:
## lm(formula = posted ~ treated, data = reddit_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6232 -0.5602  0.3768  0.4398  0.4398 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.56022    0.01631   34.34   <2e-16 ***
## treated      0.06298    0.02307    2.73   0.0064 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4908 on 1808 degrees of freedom
## Multiple R-squared:  0.004105,   Adjusted R-squared:  0.003554 
## F-statistic: 7.452 on 1 and 1808 DF,  p-value: 0.006396

Since this p-value < 0.05, it suggests that the treated variable (receiving Reddit Gold) is statistically significant in explaining the variation in the likelihood of posting. Linear model developed: posted = 0.56 + 0.0639 * (treated) In conclusion, reddit gold will increase likelihood that the user will post.

  1. What sorts of users are more likely to increase their contribution? (use the tenure and the first_timer variables)
summary(lm(posted ~ treated * first_timer, data = reddit_data))
## 
## Call:
## lm(formula = posted ~ treated * first_timer, data = reddit_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6370 -0.6120  0.3630  0.3880  0.5031 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.630841   0.023621  26.706  < 2e-16 ***
## treated              0.006196   0.033877   0.183   0.8549    
## first_timer         -0.133986   0.032536  -4.118 3.99e-05 ***
## treated:first_timer  0.108949   0.046107   2.363   0.0182 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4887 on 1806 degrees of freedom
## Multiple R-squared:  0.01369,    Adjusted R-squared:  0.01205 
## F-statistic: 8.354 on 3 and 1806 DF,  p-value: 1.623e-05

-The coefficient for the interaction term treated:first_timer is approximately 0.108949 with a p-value of 0.0182, indicating that it is statistically significant. -The positive coefficient suggests that the interaction between receiving Reddit Gold (treated) and being a first-time Reddit user (first_timer) has a positive effect on the likelihood of increasing contribution. -In other words, first-time users who received Reddit Gold are more likely to increase posts compared to first-time users who did not receive Reddit Gold.

summary(lm(posted ~ treated * tenure, data = reddit_data))
## 
## Call:
## lm(formula = posted ~ treated * tenure, data = reddit_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6548 -0.5599  0.3477  0.4382  0.5396 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.585e-01  2.077e-02  26.893  < 2e-16 ***
## treated         9.622e-02  2.949e-02   3.263  0.00112 ** 
## tenure          2.934e-06  2.250e-05   0.130  0.89627    
## treated:tenure -6.275e-05  3.357e-05  -1.869  0.06174 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4903 on 1806 degrees of freedom
## Multiple R-squared:  0.007284,   Adjusted R-squared:  0.005635 
## F-statistic: 4.417 on 3 and 1806 DF,  p-value: 0.004211

-The coefficient for the interaction term treated:tenure is approximately -6.275e-05 with a p-value of 0.06174, indicating that it is statistically insignificant (p-value > 0.05) aasuming level of significance is 0.05. -The negative coefficient suggests that the interaction between receiving Reddit Gold (treated) and the tenure has a negative effect on the likelihood of increasing contribution. -However, the effect is insignifacant, meaning it should be interpreted as per the accepted level of significance.

  1. Is the SUTVA assumption likely to be violated in the experiment

In the experiment described, researchers randomly selected 905 users to receive Reddit Gold and included data for a similar number of users in the control group who did not receive gold. The SUTVA assumption assumes that the treatment (receiving Reddit Gold) for one user does not affect the potential outcomes or treatments of other users in the experiment.

However, there are a few factors that may impact the SUTVA assumption:

  1. Reddit is a social platform, and user interactions can be influenced by the behavior of others. This behavior change can influence or interact with the behavior of other users, then the SUTVA assumption may be violated. Users who did not receive Reddit Gold may react differently to posts from users who did receive it.
  2. Reddit has various features, such as karma points, badges, and visibility of posts, that can change as a result of user behavior and interactions. For example, users might be more likely to engage with posts from users who have received Reddit Gold.
  3. If users are aware of who received Reddit Gold and who did not, it could lead to differences in how they engage with posts or interact with other users. This awareness could potentially encourage or discourage the users in both groups to post more.

Q2. Academic Data

head(acad_data)
acad_data %>% group_by(bal) %>% count()

Q2. (a) Use a t-test to see if there is a statistical difference in the pre-period between schools in the treatment (bal = 1) and control (bal = 0). This will check if randomization has been done correctly. To do this, calculate the average normalized test score(norm) for the pre period (pre = 1) for math (test_type = 0). Is there a statistical difference between students who got the Balsakhi program and did not get the program? Perform the same test for language (test_type = 1).

# Perform t-test for math to compare treatment and control groups
pre_period_math_data = acad_data %>% filter(pre == 1 & test_type == 0)
t.test(norm ~ bal, pre_period_math_data)
## 
##  Welch Two Sample t-test
## 
## data:  norm by bal
## t = 0.34151, df = 10159, p-value = 0.7327
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.03235140  0.04600232
## sample estimates:
## mean in group 0 mean in group 1 
##   -6.854839e-09   -6.825465e-03
pre_period_lang_data = acad_data %>% filter(pre == 1 & test_type == 1)
t.test(norm ~ bal, pre_period_lang_data)
## 
##  Welch Two Sample t-test
## 
## data:  norm by bal
## t = -1.2176, df = 10140, p-value = 0.2234
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.06397876  0.01495078
## sample estimates:
## mean in group 0 mean in group 1 
##   -1.313364e-08    2.451397e-02

Q2. (b) Calculate the average test scores for the post period (post = 1) for math for treatment and control. Is there a statistical difference between students in the two groups of schools? Use a ttest model to test the increase. Perform the same analysis for language test scores.

# Perform t-test for math to compare treatment and control groups
post_period_math_data = acad_data %>% filter(post == 1 & test_type == 0)
t.test(norm ~ bal, post_period_math_data)
## 
##  Welch Two Sample t-test
## 
## data:  norm by bal
## t = -7.2622, df = 8386.9, p-value = 4.149e-13
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.2247119 -0.1291860
## sample estimates:
## mean in group 0 mean in group 1 
##       0.1706217       0.3475706
post_period_lang_data = acad_data %>% filter(post == 1 & test_type == 1)
t.test(norm ~ bal, post_period_lang_data)
## 
##  Welch Two Sample t-test
## 
## data:  norm by bal
## t = -4.9514, df = 8395.6, p-value = 7.51e-07
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.17763725 -0.07687679
## sample estimates:
## mean in group 0 mean in group 1 
##       0.6667658       0.7940228

Q2. (c) Can you conclude if the Balsakhi program increase test scores in reading and mathematics?

summary(lm(norm ~ bal, data = post_period_math_data))
## 
## Call:
## lm(formula = norm ~ bal, data = post_period_math_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.30110 -0.95059 -0.05057  0.81885  3.11400 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.17062    0.01696  10.060  < 2e-16 ***
## bal          0.17695    0.02436   7.263 4.11e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.118 on 8424 degrees of freedom
##   (1772 observations deleted due to missingness)
## Multiple R-squared:  0.006224,   Adjusted R-squared:  0.006106 
## F-statistic: 52.76 on 1 and 8424 DF,  p-value: 4.113e-13

Since this p-value < 0.05, it suggests that effect of the bal variable (Balsakhi program) statistically significant in explaining the variation in the Math scores. Linear model developed: norm = 0.17062 + 0.17695 * (bal) - Program has a positive impact on math scores.

summary(lm(norm ~ bal, data = post_period_lang_data))
## 
## Call:
## lm(formula = norm ~ bal, data = post_period_lang_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4354 -0.9608 -0.1538  0.9440  3.2953 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.66677    0.01789  37.261  < 2e-16 ***
## bal          0.12726    0.02570   4.951 7.53e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.179 on 8424 degrees of freedom
##   (1772 observations deleted due to missingness)
## Multiple R-squared:  0.002901,   Adjusted R-squared:  0.002783 
## F-statistic: 24.51 on 1 and 8424 DF,  p-value: 7.527e-07

Since this p-value < 0.05, it suggests that effect of the bal variable (Balsakhi program) statistically significant in explaining the variation in the lang scores. Linear model developed: norm = 0.66677 + 0.12726 * (bal) - Program has a positive impact on lang scores.

Q2. (d) Is the SUTVA assumption violated in the example?