rm(list=ls())
library(tidyverse)
library(ggplot2)
library(cowplot)
library(purrr)
In this problem, we will use R to simulate the distributions of the sample mean as the sample size increases and to illustrate the central limit theorem.
The simulations will help us answer, or verify our answers to, the questions below; strictly speaking, however, all questions except the first one can be answered even without R.
First, do the following steps in R:
1. Draw a sample from the exponential distribution with parameter lambda = 2 of size n=5
#X_5_1
n_5 = 5
lambda = 2
nosim = 10000
set.seed(321)
df5_1 <- data.frame(rexp(n_5, lambda))
df5_1
rexp.n_5..lambda.
1 0.08256092
2 0.35672086
3 0.62759753
4 0.52893331
5 0.44705665
2. Compute the mean of the sample. Denote the sample mean of a sample of size n by bar_X_n; hence, we have just obtained a realization of bar_X_5.
# Calculating Sample Mean
df5_1 <- unlist(df5_1)
bar_X_n <- mean(df5_1)
bar_X_n
[1] 0.4085739
# Calculating the Theoretical Mean
theo_mean <- 1/lambda
theo_mean
[1] 0.5
# Putting them together
data.frame(bar_X_n, theo_mean)
bar_X_n theo_mean
1 0.4085739 0.5
Based on the outcome, we can see that the true mean is 22.38% higher than this particular sample mean. What happens when we repeat this process 10000 times, and when we vary the sample size? Let’s check.
3. Repeat this procedure 10000 times, and record the sample mean of each iteration (so there should be 10000 sample means in total, each being the mean of a sample of size 5). We have now simulated the distribution of bar_X_5. When we need to repeat a process many times and record the results, the best approach is to write an appropriate function or loop. Here, the easiest way for me is a for loop that generates samples of size 5 over and over until 10000 iterations have been completed. Here’s how I do it:
#X_5
n_5 = 5 # Sample Size
lambda = 2 # lambda is the essential parameter in exponential distribution
nosim = 10000 # Number of iterations
set.seed(321) # for reproducibility
df5 = data.frame() # empty data frame to store the simulated samples
for (i in 1:10000){ df5<-rbind(df5,rexp(n_5,lambda))} # generates 10000 samples of size 5
tail(df5) #checking if it worked
X0.0825609171197042 X0.356720861026502 X0.627597526160485
9995 0.002788028 0.19910490 0.2469435
9996 0.069378955 0.48966865 1.6182932
9997 1.358695258 1.23842787 0.4008296
9998 1.932344827 0.32406711 0.3147533
9999 0.410072513 0.07661582 0.4136712
10000 0.287830784 0.16447746 0.5968369
X0.528933308997607 X0.447056648894528
9995 1.05443553 0.21282525
9996 2.14219332 0.36046127
9997 0.02617753 0.20110714
9998 0.39803684 0.57822052
9999 0.14114070 0.27934351
10000 0.11987528 0.06581991
# Because I want to create a combined plot later, I make a copy of the df5 data set
df5_copy <- df5
df5_copy$mean<-rowMeans(df5_copy,na.rm=TRUE,dims=1)
mean(df5_copy$mean)
[1] 0.4969809
round(mean(df5_copy$mean),1)
[1] 0.5
Everything has worked so far! We can see the last 6 rows of the data frame, which has 5 columns; as stated earlier, one row represents one iteration. Thus, to get the mean of each iteration, we have to calculate the row means.
# Calculating Row Means
df5$mean<-rowMeans(df5,na.rm=TRUE,dims=1) # It will add a new column called 'mean' to the existing data file
tail(df5)
X0.0825609171197042 X0.356720861026502 X0.627597526160485
9995 0.002788028 0.19910490 0.2469435
9996 0.069378955 0.48966865 1.6182932
9997 1.358695258 1.23842787 0.4008296
9998 1.932344827 0.32406711 0.3147533
9999 0.410072513 0.07661582 0.4136712
10000 0.287830784 0.16447746 0.5968369
X0.528933308997607 X0.447056648894528 mean
9995 1.05443553 0.21282525 0.3432194
9996 2.14219332 0.36046127 0.9359991
9997 0.02617753 0.20110714 0.6450475
9998 0.39803684 0.57822052 0.7094845
9999 0.14114070 0.27934351 0.2641688
10000 0.11987528 0.06581991 0.2469681
I have added a new column named ‘mean’ to my df5 data frame. Now I want to calculate the column-wise means:
df5 %>%
map_dbl(mean)
X0.0825609171197042 X0.356720861026502 X0.627597526160485 X0.528933308997607
0.4906613 0.4914589 0.4960988 0.5000257
X0.447056648894528 mean
0.5066597 0.4969809
# Maximum and minimum values in the column mean
min(df5$mean)
[1] 0.04577098
max(df5$mean)
[1] 1.665608
The row means range from 0.04577098 to 1.665608. That’s a big range! What is the mean of all of these row means? Let’s check the summary.
summary(df5$mean) # Summarizing the results
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.04577 0.33358 0.46410 0.49698 0.62350 1.66561
The mean of all of the row means turns out to be 0.49698, which is much closer to the theoretical mean of 0.5 than the single sample mean we computed earlier (0.4085739). Averaging over many repetitions brings us very close to the true mean.
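As an aside, the same simulation can be written without growing a data frame inside a loop, which is slow in R. Here is a minimal sketch of a vectorized alternative (the names sim_matrix_5 and sample_means_5 are my own and are not used elsewhere); it produces an equivalent set of 10000 sample means, though not the identical draws, since the values are laid out in a different order:
# Draw all 10000 x 5 values at once, arrange them as a matrix, and take row means
set.seed(321)
sim_matrix_5 <- matrix(rexp(nosim * n_5, lambda), nrow = nosim, ncol = n_5)
sample_means_5 <- rowMeans(sim_matrix_5)  # one sample mean per simulated sample
mean(sample_means_5)                      # should again be close to 1/lambda = 0.5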
4. Plot a histogram of the pdf of bar_X_5 using a bin-width of 0.01. We recommend you use ggplot with the option aes(y=..density..) . You are also welcome to plot a kernel density plot with a suitable bandwidth if preferred.
# Plotting histogram
x_5_hist <- ggplot()+
geom_histogram(data=df5,aes(mean, ..density..), binwidth = 0.01, fill="lightblue", color="darkred")+
geom_density(kernel="gaussian", aes(df5$mean))+
xlab("Sample Size=5, 10000 Iterations")
x_5_hist
The underlying data are exponential, and with a sample size of only 5 the distribution of the sample mean is still noticeably right-skewed; even so, a roughly Gaussian bell shape is already emerging.
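To see how close this shape already is to the CLT limit, we can overlay the normal density that the CLT predicts for bar_X_5, namely a normal with mean 1/lambda and standard deviation 1/(lambda*sqrt(5)). A minimal sketch (the layer choices and labels are my own):
# Histogram of the 10000 sample means with the CLT normal approximation overlaid
ggplot(df5, aes(x = mean)) +
  geom_histogram(aes(y = ..density..), binwidth = 0.01, fill = "lightblue", color = "darkred") +
  stat_function(fun = dnorm, args = list(mean = 1/lambda, sd = 1/(lambda * sqrt(n_5))), color = "blue") +
  xlab("Sample means (n = 5) with normal approximation")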
Now let’s work through the rest of the quantities from the previous question. Although we already saw the mean of the sample means, I compute it again and round it to one decimal place; I will then calculate the variance and the standard deviation of the simulated data to get a better grasp of the distribution.
# Calculating Mean of Mean
meanofmean <- mean(df5$mean)
meanofmean
[1] 0.4969809
# Rounding the Value of Mean to One Digit Decimal Point
round(mean(df5$mean),1)
[1] 0.5
# Unlisting the data frame (var() on a data frame returns a covariance matrix, so we pool all simulated values, including the mean column, into a single vector)
df5 <- unlist(df5)
# Calculating variance and rounding it to one decimal point
var_5 <- round(var(df5),1)
var_5 #Checking
[1] 0.2
# Calculating Standard Deviation and rounding to one decimal point
sd_5 <- round(sd(df5),1)
sd_5
[1] 0.5
In probability, the exponential distribution is said to possess the memoryless property: knowledge of what has happened in the past does not change future probabilities. It is often used to model the lifetime of electrical or mechanical devices.
Here are some properties of the exponential distribution with rate lambda: its mean is 1/lambda, its standard deviation is 1/lambda, and its variance is 1/lambda^2. With lambda = 2, the theoretical mean and standard deviation are both 0.5 and the variance is 0.25.
The data we generated seem to follow these basic properties!
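As a quick sanity check, we can set the theoretical moments of Exp(lambda = 2) next to their simulated counterparts computed from the raw draws only. A minimal sketch using df5_copy, which still has the five raw columns plus the mean column (output not shown):
# Theoretical vs simulated moments of the raw exponential draws (excluding the 'mean' column)
pooled_draws <- unlist(df5_copy[, 1:n_5])   # 10000 x 5 = 50000 raw draws
c(theoretical_mean = 1/lambda, simulated_mean = mean(pooled_draws))
c(theoretical_sd = 1/lambda, simulated_sd = sd(pooled_draws))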
A. n=1
#X_1
n_1 = 1
lambda = 2
nosim = 10000
set.seed(321)
df1=data.frame()
for (i in 1:nosim){ df1<-rbind(df1,rexp(n_1,lambda))}
tail(df1)
X0.0825609171197042
9995 0.0974078
9996 0.5127457
9997 0.5860184
9998 1.2658391
9999 1.4962940
10000 1.5561905
df1$mean<-rowMeans(df1,na.rm=TRUE,dims=1)
mean(df1$mean)
[1] 0.4951067
round(mean(df1$mean),1)
[1] 0.5
# Because I want to create a combined plot later, I make a copy of the df1 data set
df1_copy <- df1
df1_copy$mean<-rowMeans(df1_copy,na.rm=TRUE,dims=1)
mean(df1_copy$mean)
[1] 0.4951067
round(mean(df1_copy$mean),1)
[1] 0.5
# Creating Histogram
x_1_hist <- ggplot()+
geom_histogram(data = df1,aes(mean, ..density..), binwidth = 0.01, fill = "green", color = "red")+
geom_density(kernel = "gaussian", aes(df1$mean))+
xlab("Sample Size = 1, 10000 Iterations")
x_1_hist
df1 <- unlist(df1)
var_1 <- round(var(df1),1)
sd_1 <- round(sd(df1),1)
sd_1
[1] 0.5
B. n = 10
#X_10
n_10 = 10
lambda = 2
nosim = 10000
set.seed(321)
df10=data.frame()
for (i in 1:nosim){ df10<-rbind(df10,rexp(n_10,lambda))}
df10$mean<-rowMeans(df10,na.rm=TRUE,dims=1)
mean(df10$mean)
[1] 0.4985819
round(mean(df10$mean),1)
[1] 0.5
# Because I want to create a combined plot later, I make a copy of the df10 data set
df10_copy <- df10
df10_copy$mean<-rowMeans(df10_copy,na.rm=TRUE,dims=1)
mean(df10_copy$mean)
[1] 0.4985819
round(mean(df10_copy$mean),1)
[1] 0.5
x_10_hist <- ggplot()+
geom_histogram(data=df10,aes(mean, ..density..), binwidth = 0.01, fill="white", color="darkred")+
geom_density(kernel="gaussian", aes(df10$mean))+
xlab("Sample Size=10, 10000 Iterations")
x_10_hist
df10 <- unlist(df10)
var_10 <- round(var(df10),1)
sd_10 <- round(sd(df10),1)
sd_10
[1] 0.5
C. n=30
AND
Compare the shapes of the 4 histograms of bar_X_1, bar_X_5, bar_X_10, and bar_X_30 by plotting them on the same graph.
#X_30
n_30 = 30
lambda = 2
nosim = 10000
set.seed(321)
df30=data.frame()
for (i in 1:nosim){ df30<-rbind(df30,rexp(n_30,lambda))}
df30$mean<-round(rowMeans(df30,na.rm=TRUE,dims=1),1)
mean(df30$mean)
[1] 0.50039
round(mean(df30$mean),1)
[1] 0.5
# Because I want to create a combined plot later, I make a copy of the df30 data set
df30_copy <- df30
df30_copy$mean<-rowMeans(df30_copy,na.rm=TRUE,dims=1)
mean(df30_copy$mean)
[1] 0.5001816
round(mean(df30_copy$mean),1)
[1] 0.5
x_30_hist <- ggplot()+
geom_histogram(data=df30,aes(mean, ..density..), binwidth = 0.01, fill="yellow", color="darkred")+
geom_density(kernel="gaussian", aes(df30$mean))+
xlab("Sample Size=30, 10000 Iterations")
x_30_hist
df30 <- unlist(df30)
var_30 <- round(var(df30),1)
sd_30 <- round(sd(df30),1)
sd_30
[1] 0.5
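Note that the variances and standard deviations computed above pool all of the simulated values (including the mean column), which is why they stay near 0.5 for every sample size. The quantity the CLT is really about is the spread of the sample means themselves, which should shrink like 1/(lambda*sqrt(n)). A minimal sketch comparing the two, using the *_copy data frames created above (output not shown):
# Standard deviation of the simulated sample means vs the CLT prediction 1/(lambda*sqrt(n))
n_values <- c(1, 5, 10, 30)
simulated_sd <- c(sd(df1_copy$mean), sd(df5_copy$mean), sd(df10_copy$mean), sd(df30_copy$mean))
data.frame(n = n_values, simulated_sd = simulated_sd, theoretical_sd = 1/(lambda * sqrt(n_values)))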
Can I put them together in a single graph? Let’s try:
ggplot()+
geom_histogram(data = df1_copy, aes(x = mean, y = ..density..), binwidth = 1/1000, color = "orange", fill = "orange", alpha = 0.2)+
geom_histogram(data = df5_copy, aes(x = mean, y = ..density..),binwidth = 1/1000, color = "blue",fill = "blue", alpha = 0.2)+
geom_histogram(data = df10_copy, aes(x = mean, y = ..density..), binwidth = 1/1000, color = "red", fill = "red", alpha = 0.1)+
geom_histogram(data = df30_copy, aes(x = mean, y = ..density..),binwidth = 1/1000,color = "purple",fill = "purple",alpha = 0.01)+
xlim(c(0,1.5))+
xlab("Orange:Sample Size 1, Blue:Sample Size 5, Red: Sample Size 10, & Purple: Sample Size 30")
Now the comparison is clearly visible. The distribution of the sample mean becomes more and more Gaussian as the sample size increases: it grows more symmetric and more tightly concentrated around 0.5.
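A tidier way to build the same comparison is to stack the four sets of sample means into one long data frame and draw one density curve per sample size, so ggplot produces a proper legend instead of relying on hard-coded colours. A minimal sketch (the name means_long is my own):
# One density curve per sample size, with an automatic legend
means_long <- bind_rows(
  data.frame(n = "n = 1",  sample_mean = df1_copy$mean),
  data.frame(n = "n = 5",  sample_mean = df5_copy$mean),
  data.frame(n = "n = 10", sample_mean = df10_copy$mean),
  data.frame(n = "n = 30", sample_mean = df30_copy$mean)
)
ggplot(means_long, aes(x = sample_mean, colour = n)) +
  geom_density() +
  xlim(c(0, 1.5)) +
  xlab("Sample mean")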
Finally, I am going to compare the simulated distribution of the sample mean for n = 30 with the normal distribution that the CLT predicts.
# Simulating 10000 draws from a normal distribution with mean 1/2 and sd 1/(2*sqrt(30)) = 1/sqrt(120)
df30_2 <- data.frame()
for(i in 1:10000){
df30_2 <- rbind(df30_2, rnorm(1, 1/2, 1/sqrt(120)))
}
# Calculating Mean
df30_2$mean <- rowMeans(df30_2, na.rm=TRUE, dims=1)
# Plotting both distributions as frequency polygons on the density scale
ggplot()+
geom_freqpoly(data=df30_copy, aes(x=mean, y=..density..), fill="blue", color="darkblue")+
geom_freqpoly(data=df30_2, aes(x=mean, y=..density..),color="red", fill="darkred")+
xlab("Red:Normal Distribution, Blue:Exponential Distribution")
The two curves nearly coincide; in the absence of the red (normal) curve, we could easily mistake the blue (simulated) distribution for a true Gaussian.
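A more direct check of normality for bar_X_30 is a normal quantile-quantile plot: if the simulated means are approximately Gaussian, the points should lie close to the reference line. A minimal sketch:
# Normal Q-Q plot of the simulated sample means for n = 30
ggplot(df30_copy, aes(sample = mean)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  xlab("Theoretical normal quantiles") +
  ylab("Simulated sample means (n = 30)")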
In this problem, we will examine a famous study on corruption conducted by the economist Benjamin Olken. The study used a number of experiments to evaluate the effects of increased monitoring on levels of misappropriation by village officials during the construction of roads in Indonesia (the full paper is available here: https://economics.mit.edu/files/2913 ). In one experiment, villages were randomly assigned to receive invitations to public ‘accountability meetings’ i.e. public meetings where community residents would hear village officials explain how they spent the project funds. The idea behind this experiment was that if village officials know in advance that more community members will be coming to accountability meetings, they might be less likely to misappropriate funding.
Load the dataset olken.csv.
simul_stat <- as.vector(NULL)
olken_data <- read.csv("C:/Users/nirma/Documents/EDX courses/MicroMaster MIT/Assessing 14.310x/rcodes/datasets_olken.csv")
head(olken_data)
X.1 X treat_invite pct_missing head_edu mosques pct_poor total_budget
1 1 1 0 0.38527447 6 0.9083831 0.4001222 40.56500
2 2 2 1 -0.09574836 14 1.0666667 0.1856149 69.32150
3 3 3 1 0.14771932 12 0.7117438 0.4000000 41.10650
4 4 4 1 -0.18259122 9 0.9489917 0.4379366 17.06200
5 5 5 0 -0.29304767 9 1.6233766 0.3126954 72.08600
6 6 6 1 -0.09736358 9 0.7381890 0.4076087 69.83308
The dataset contains 472 villages. treat_invite is a binary indicator of whether or not the village was “treated” with invitations to accountability meetings: treat_invite = 1 denotes that the village is in the “treatment group”, and treat_invite = 0 denotes that it is in the “control group”. The remaining variables are:
pct_missing : the amount of funding misappropriated during the construction project.
head_edu : an ordinal variable indicating the level of education of the head village official.
mosques : the number of mosques per 1000 residents in the village.
pct_poor : the percentage of households in the village below the poverty line.
total_budget : the total budget allocated to the village for the road project.
library(tidyverse)
olken_data <- select(olken_data, treat_invite, pct_missing, head_edu, mosques,
pct_poor, total_budget)
olken_data <- as.data.frame(olken_data)
head(olken_data)
treat_invite pct_missing head_edu mosques pct_poor total_budget
1 0 0.38527447 6 0.9083831 0.4001222 40.56500
2 1 -0.09574836 14 1.0666667 0.1856149 69.32150
3 1 0.14771932 12 0.7117438 0.4000000 41.10650
4 1 -0.18259122 9 0.9489917 0.4379366 17.06200
5 0 -0.29304767 9 1.6233766 0.3126954 72.08600
6 1 -0.09736358 9 0.7381890 0.4076087 69.83308
summary(olken_data)
treat_invite pct_missing head_edu mosques
Min. :0.0000 Min. :-1.10282 Min. : 6.00 Min. :0.0000
1st Qu.:0.0000 1st Qu.: 0.01996 1st Qu.: 9.00 1st Qu.:0.8602
Median :1.0000 Median : 0.22717 Median :12.00 Median :1.2629
Mean :0.6589 Mean : 0.23633 Mean :11.56 Mean :1.4377
3rd Qu.:1.0000 3rd Qu.: 0.41690 3rd Qu.:12.00 3rd Qu.:1.8841
Max. :1.0000 Max. : 1.67431 Max. :20.00 Max. :6.8891
pct_poor total_budget
Min. :0.01867 Min. : 8.758
1st Qu.:0.23317 1st Qu.: 53.770
Median :0.38596 Median : 73.177
Mean :0.40713 Mean : 83.229
3rd Qu.:0.56572 3rd Qu.:103.448
Max. :0.94462 Max. :890.242
# Later, I want to plot density plots in a single graph, so I create a copy of the Olken data.
olken_data_copy <- olken_data
head(olken_data_copy)
treat_invite pct_missing head_edu mosques pct_poor total_budget
1 0 0.38527447 6 0.9083831 0.4001222 40.56500
2 1 -0.09574836 14 1.0666667 0.1856149 69.32150
3 1 0.14771932 12 0.7117438 0.4000000 41.10650
4 1 -0.18259122 9 0.9489917 0.4379366 17.06200
5 0 -0.29304767 9 1.6233766 0.3126954 72.08600
6 1 -0.09736358 9 0.7381890 0.4076087 69.83308
We will look at the average treatment effect on the amount of misappropriated funding, pct_missing. In the following, use R to help answer the questions; use the lm function if you wish.
Separating the given dataset into treatment and control groups:
# Treatment Group
treatment_group <- filter(olken_data, treat_invite==1)
tail(treatment_group)
treat_invite pct_missing head_edu mosques pct_poor total_budget
306 1 -0.20104955 12 1.9786308 0.6354749 138.7556
307 1 -0.19787270 12 2.1459227 0.5958816 98.7985
308 1 -0.09951322 12 0.9589567 0.5052515 106.9569
309 1 0.10769557 16 1.0868384 0.7793258 84.4882
310 1 0.24107966 14 1.0576415 0.5776627 123.4696
311 1 0.20130354 16 0.9394965 0.3533671 140.2379
dim(treatment_group)
[1] 311 6
#Control Group
control_group <- filter(olken_data, treat_invite==0)
tail(control_group)
treat_invite pct_missing head_edu mosques pct_poor total_budget
156 0 0.1532032 9 0.7312614 0.3879004 98.40993
157 0 -0.1434073 16 1.6062413 0.2776141 141.59979
158 0 0.2352902 16 1.1398176 0.3291284 104.98269
159 0 0.2973622 12 2.5575447 0.6444111 48.56191
160 0 0.2484317 9 2.3897059 0.3427230 50.28449
161 0 0.1516978 16 0.1238237 0.9391992 137.72064
str(control_group)
'data.frame': 161 obs. of 6 variables:
$ treat_invite: int 0 0 0 0 0 0 0 0 0 0 ...
$ pct_missing : num 0.3853 -0.293 -0.1215 -0.0551 0.1782 ...
$ head_edu : int 6 9 12 12 9 6 12 12 9 12 ...
$ mosques : num 0.908 1.623 1.28 2.055 1.094 ...
$ pct_poor : num 0.4 0.313 0.585 0.326 0.228 ...
$ total_budget: num 40.6 72.1 82.1 55.9 87.1 ...
The treatment group has 311 data points, while the control group has 161.
1. Make two kernel density plots of the variable head_edu, one for the treatment group and one for the control group. Place these on the same graph. Do the two distributions look similar for treatment and control groups?
# Loading Required Libraries
library(ggplot2)
library(cowplot)
# Treatment + head_edu
kernel_treatment_head <- ggplot(treatment_group)+
geom_histogram(aes(x=head_edu,y=..density..), binwidth = 1.5, fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(head_edu), bw=2, color="red")
# Control + head_edu
kernel_control_head <- ggplot(control_group)+
geom_histogram(aes(x=head_edu,y = ..density..), binwidth = 1.5, fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(x=head_edu),bw=2, color="red")
# Putting them Together
plot_grid(kernel_treatment_head, kernel_control_head)
The two distributions look very similar; I don’t see any meaningful difference between the treatment and control groups.
Do the same for the other 3 variables: mosques, pct_poor, and total_budget. Do the distributions of each variable look similar for the treatment and control groups?
MOSQUES Variable
# Treatment + mosques
kernel_treatment_mosques <- ggplot(treatment_group)+
geom_histogram(aes(x=mosques,y=..density..), fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(mosques), color="red")
# Control + mosques
kernel_control_mosques <- ggplot(control_group)+
geom_histogram(aes(x=mosques,y = ..density..), fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(x=mosques), color="red")
# Putting them Together
plot_grid(kernel_treatment_mosques, kernel_control_mosques)
PCT_POOR Variable
# Treatment + pct_poor
kernel_treatment_pct_poor <- ggplot(treatment_group)+
geom_histogram(aes(x=pct_poor,y=..density..), fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(pct_poor), color="red")
# Control + pct_poor
kernel_control_pct_poor <- ggplot(control_group)+
geom_histogram(aes(x=pct_poor,y = ..density..), fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(x=pct_poor), color="red")
# Putting them Together
plot_grid(kernel_treatment_pct_poor, kernel_control_pct_poor)
TOTAL_BUDGET Variable
# Treatment + total_budget
kernel_treatment_total_budget <- ggplot(treatment_group)+
geom_histogram(aes(x=total_budget,y=..density..), fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(total_budget), color="red")
# Control + total_budget
kernel_control_total_budget <- ggplot(control_group)+
geom_histogram(aes(x=total_budget,y = ..density..), fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(x=total_budget), color="red")
# Putting them Together
plot_grid(kernel_treatment_total_budget, kernel_control_total_budget)
Alternatively, I can use olken_data_copy and plot the treatment and control densities together for each variable, using just the density curves for a clearer comparison.
# Inspecting the shapes of covariates
#A. head_edu
ggplot(olken_data_copy, aes(x = head_edu, colour =(treat_invite == 0))) + geom_density()
#B. mosques
ggplot(olken_data_copy, aes(x = mosques, colour =(treat_invite == 0))) + geom_density()
#C. pct_poor
ggplot(olken_data_copy, aes(x = pct_poor, colour =(treat_invite == 0))) + geom_density()
#D. total_budget
ggplot(olken_data_copy, aes(x = total_budget, colour =(treat_invite == 0))) + geom_density()
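Beyond eyeballing the densities, a quick numerical balance check is to compare covariate means between the treatment and control groups, and, if desired, run a two-sample t-test per covariate. A minimal sketch (output not shown):
# Covariate means by treatment status
olken_data_copy %>%
  group_by(treat_invite) %>%
  summarise(across(c(head_edu, mosques, pct_poor, total_budget), mean))
# Example: two-sample t-test for one covariate (repeat for the others as needed)
t.test(head_edu ~ treat_invite, data = olken_data_copy)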
A. Checking the Number of Treatment and Control Cases
sum(olken_data$treat_invite)
[1] 311
count(olken_data)
n
1 472
It worked: there are 311 treatment cases out of 472 villages in total, leaving 161 control cases.
B. Repeatedly re-assigning treatment at random and calculating the difference in means between the simulated treatment and control groups.
set.seed(123)
for(i in 1:472){
olken_data$rand <- runif(472, min=0, max=1)# Creating 472 random uniform observations between 0 and 1
olken_data$treatment_rand <- as.numeric(rank(olken_data$rand)<= 311) #Randomly selecting total of 311 placeholders for treatment cases
olken_data$control_rand <- 1 - olken_data$treatment_rand # The remaining 161 villages form the simulated control group
# Appending the simulated difference in means (treatment minus control) to the simul_stat vector
simul_stat <- append(simul_stat, sum(olken_data$treatment_rand * olken_data$pct_missing)/sum(olken_data$treatment_rand)- sum(olken_data$control_rand * olken_data$pct_missing)/sum(olken_data$control_rand)
)
}
# Creating explicit treatment and control indicators from the real assignment treat_invite
olken_data$treatment <- olken_data$treat_invite
olken_data$control <- 1 - olken_data$treatment
olken_data$control
[1] 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0
[38] 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 1 0 0 1 1 0 1 1
[75] 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
[112] 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 1 0 1 0 1 0 0 1 0 0
[149] 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0
[186] 1 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 1 1 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1
[223] 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 1
[260] 1 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 1 0 0 0 0
[297] 0 1 1 0 1 1 1 1 0 0 0 1 1 0 0 0 1 0 0 1 0 1 1 0 1 1 1 0 1 0 0 1 0 0 0 0 1
[334] 0 0 1 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 1
[371] 1 0 1 1 1 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0
[408] 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 0 0 1 0 0
[445] 0 0 1 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 0 1 1 0
# Calculating the actual statistic from the given treatment assignment
act_stat <- sum(olken_data$treatment * olken_data$pct_missing)/sum(olken_data$treatment) - sum(olken_data$control * olken_data$pct_missing)/sum(olken_data$control)
act_stat
[1] -0.01428412
## Calculating Average Treatment Effect (ATE)
control_v <- filter(olken_data_copy, treat_invite == 0)#subsetting control data
treatment_v <- filter(olken_data_copy, treat_invite == 1)#subsetting the treatment data
average_treatment_effect <- mean(treatment_v$pct_missing) - mean(control_v$pct_missing)# Calculating ATE
average_treatment_effect
[1] -0.02494953
The results show that the actual statistic is -0.01428412, while the average treatment effect is -0.02494953. In a way, the actual statistic is just a hand-calculated ATE: both are the difference between the treatment-group and control-group means of pct_missing, so they should agree exactly when both are computed from the real treat_invite assignment; the gap between the two values above suggests the indicator used for act_stat did not line up with treat_invite. Now, let's check how the simulated values compare with the actual statistic.
sum(abs(simul_stat) >= act_stat)/NROW(simul_stat)
[1] 1
The value returned is the proportion (not a count) of simulated statistics whose absolute value is at least act_stat. Because act_stat is negative, every |simul_stat| clears that threshold, so the proportion is 1; a meaningful two-sided comparison should use the absolute value of the observed statistic, as in the sketch below.
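The corrected two-sided comparison, measuring the simulated differences against the absolute value of the observed statistic, is a one-liner (a sketch; output not shown):
# Share of simulated |differences| at least as large as the observed |difference|
perm_p_value <- mean(abs(simul_stat) >= abs(act_stat))
perm_p_value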
Calculating the average of pct_missing in the control villages.
control_mean <- sum(olken_data$control * olken_data$pct_missing)/sum(olken_data$control)
Calculating the average of pct_missing in the treatment villages.
treatment_mean <- sum(olken_data$treatment * olken_data$pct_missing)/sum(olken_data$treatment)
# Putting them Together
data.frame(control_mean, treatment_mean)
control_mean treatment_mean
1 0.2457397 0.2314556
We can see that the control mean is higher than the treatment mean (by 0.0142841), suggesting that the amount of fund misappropriation is higher in control villages than in treatment villages.
Now let’s attach a measure of uncertainty to the difference in sample averages between the treatment and control groups, which is an unbiased estimate of the average treatment effect. The sample variance of the control group is denoted by S_C and that of the treatment group by S_T.
Sample Variance of the Control Cases
S_C <- (1/(sum(olken_data$control)-1)) * sum(((olken_data$pct_missing - control_mean) * olken_data$control)^2)
# Second method: the standard error of the difference in means computed directly from the group variances
standard_error <- sqrt(var(treatment_v$pct_missing)/length(treatment_v$pct_missing) + var(control_v$pct_missing)/length(control_v$pct_missing))
standard_error
[1] 0.03310019
# t-statistic
T_value <- average_treatment_effect/standard_error
T_value
[1] -0.7537578
# Overall two-sided p-value (normal approximation)
p_value <- 2*pnorm(T_value)
p_value
[1] 0.4509946
# Two-sided p-value using the t distribution with 471 degrees of freedom
p_treatment <- 2*pt(T_value, 471)
p_treatment
[1] 0.4513713
## Comparing these values with the results from a linear model
summary(lm(pct_missing ~ treat_invite, data = olken_data_copy))
Call:
lm(formula = pct_missing ~ treat_invite, data = olken_data_copy)
Residuals:
Min 1Q Median 3Q Max
-1.33064 -0.21249 -0.01284 0.18281 1.42154
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.25277 0.02716 9.306 <2e-16 ***
treat_invite -0.02495 0.03346 -0.746 0.456
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3447 on 470 degrees of freedom
Multiple R-squared: 0.001181, Adjusted R-squared: -0.0009438
F-statistic: 0.5559 on 1 and 470 DF, p-value: 0.4563
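The same comparison is also available directly from a Welch two-sample t-test, which uses the same difference in means and the same variance formula (var_T/n_T + var_C/n_C) as the hand calculation above. A minimal sketch (output not shown):
# Welch two-sample t-test of pct_missing by treatment status
t.test(pct_missing ~ treat_invite, data = olken_data_copy)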
Sample Variance of the Treatment Cases
S_T <- (1/(sum(olken_data$treatment)-1)) * sum(((olken_data$pct_missing - treatment_mean) * olken_data$treatment)^2)
# Putting them Together
data.frame(S_C,S_T)
S_C S_T
1 0.1099838 0.1234812
Now that I have calculated the variance for both the control and treatment villages, I am going to estimate the sampling variance of the point estimate using Neyman's conservative variance estimator; its square root is a conservative (upper-bound) estimate of the standard error.
# Conservative (Neyman) variance estimate; sqrt(Vneyman) is the standard error
Vneyman <- S_T/sum(olken_data$treatment)+(S_C/sum(olken_data$control))
print(sqrt(Vneyman))
[1] 0.03286601
We can calculate the t-statistic using the Neyman variance estimate and the ATE:
t_value <- (average_treatment_effect/sqrt(Vneyman))
t_value
[1] -0.7591286
Based on the outcome, the treatment is associated with slightly lower misappropriation of the budget, but the effect is not statistically significant (p ≈ 0.45). This suggests, at most, a small impact of the accountability-meeting invitations.
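To summarize the estimate together with its uncertainty, we can also report an approximate 95% confidence interval for the ATE using the conservative Neyman standard error. A minimal sketch (output not shown):
# Approximate 95% confidence interval for the average treatment effect
c(lower = average_treatment_effect - 1.96 * sqrt(Vneyman),
  upper = average_treatment_effect + 1.96 * sqrt(Vneyman))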
In this problem, we will investigate transformations to the linear regression model. Download the demo.csv dataset. The dataset contains information from a sample of countries in the year 2000, taken from the Democracy Time-Series Dataset. It includes the following variables:
Nation : country name
GDP : Gross Domestic Product (GDP) per capita in constant US dollars
FHouse : Freedom House rating–a measure of the level of political and civil liberties in a country, on a scale from 1.0 (most free) to 7.0 (least free).
OECD : a dummy variable indicating OECD status.
regime : a categorical variable coded from the Freedom House rating that indicates whether a country is free (1.0-1.5), partly free (2.0-3.0), or not free (3.5-7.0).
Load the dataset demo.csv
asset_data <- read.csv("C:/Users/nirma/Documents/EDX courses/MicroMaster MIT/Assessing 14.310x/rcodes/asset_demo.csv")
asset_data <- select(asset_data,Nation, GDP, FHouse, OECD, regime)
asset_data <- as.data.frame(asset_data)
head(asset_data)
Nation GDP FHouse OECD regime
1 Albania 3719 4.5 0 not free
2 Algeria 5327 5.5 0 not free
3 Angola 1462 6.0 0 not free
4 Argentina 12095 1.5 0 free
5 Armenia 2421 4.0 0 not free
6 Australia 27390 1.0 1 free
summary(asset_data)
Nation GDP FHouse OECD
Length:161 Min. : 463 Min. :1.00 Min. :0.0000
Class :character 1st Qu.: 1857 1st Qu.:1.50 1st Qu.:0.0000
Mode :character Median : 4555 Median :3.50 Median :0.0000
Mean : 8469 Mean :3.46 Mean :0.1863
3rd Qu.: 9677 3rd Qu.:5.50 3rd Qu.:0.0000
Max. :50564 Max. :7.00 Max. :1.0000
regime
Length:161
Class :character
Mode :character
make a scatter plot with FHouse on the X-axis and GDP on the Y-axis
ggplot(asset_data)+
geom_point(aes(x=FHouse, y=GDP), color = "red")
run a linear regression using lm with GDP as the dependent variable, and FHouse as the explanatory variable
lm(formula = GDP ~ FHouse, data = asset_data)
Call:
lm(formula = GDP ~ FHouse, data = asset_data)
Coefficients:
(Intercept) FHouse
18214 -2817
Plot the regression line on your existing scatter plot (e.g. by adding + geom_smooth(method='lm') to your ggplot command); double-check that the correct line is plotted by comparing it with the lm results.
g = ggplot(asset_data, aes(x=FHouse, y=GDP))
g = g + geom_point(size=3,colour="red",alpha=0.4)
g = g + geom_smooth(method='lm', colour="black")
g
Compute the conditional sample mean of the GDP at each value of the FHouse score and store these in a vector. (This vector should contain 1 GDP sample mean for each possible FHouse score.) These are estimates of the conditional expectation of the GDP at different values of FHouse scores;
group_1 <- filter(asset_data, FHouse==1)
head(group_1)
Nation GDP FHouse OECD regime
1 Australia 27390 1 1 free
2 Austria 28481 1 1 free
3 Bahamas 17055 1 0 free
4 Belize 5682 1 0 free
5 Canada 27503 1 1 free
6 Cyprus 19175 1 0 free
group_2 <- filter(asset_data, FHouse==2)
group_3 <- filter(asset_data, FHouse==3)
group_4 <- filter(asset_data, FHouse==4)
group_5 <- filter(asset_data, FHouse==5)
group_6 <- filter(asset_data, FHouse==6)
group_7 <- filter(asset_data, FHouse==7)
summary(group_1$GDP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4545 17922 26987 23713 29105 50564
summary(group_2$GDP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
975 3678 6144 8123 9122 23015
summary(group_3$GDP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
583 1230 3002 2987 3604 7154
summary(group_4$GDP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
521 986 2421 2960 4162 6568
summary(group_5$GDP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
815 1206 1718 6303 7791 23594
summary(group_6$GDP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
650.0 968.5 1546.5 2239.0 2498.2 5806.0
summary(group_7$GDP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1506 3162 3416 5477 5707 13593
# OR - the easy method: calculating the conditional means directly
mean_GDP <- data.frame(matrix(ncol = 2, nrow = 0)) # Create an empty matrix with 2-columns and 0-rows
x = c(2:14)/2
# looping over each possible FHouse score (1, 1.5, ..., 7)
for(i in x){
mean_GDP <- rbind(c(i, mean(filter(asset_data, FHouse == i)$GDP)))
}
colnames(mean_GDP) <- c("FHouse", "mean_GDP")
mean_GDP
FHouse mean_GDP
[1,] 7 5476.8
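Note that, as written, the loop assigns rbind(c(i, ...)) to mean_GDP on every pass, so only the last FHouse score (7) is kept; appending with rbind(mean_GDP, c(i, ...)) would retain all thirteen rows. A more direct way to get every conditional mean, including the half-point scores (1.5, 2.5, ...) that the integer-only group_1, ..., group_7 filters skip, is a grouped summary. A minimal sketch (output not shown):
# Conditional sample mean of GDP at each FHouse score via a grouped summary
asset_data %>%
  group_by(FHouse) %>%
  summarise(mean_GDP = mean(GDP))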
Create a new variable, FHouse_sq, defined as the square of FHouse.
asset_data <- mutate(asset_data, FHouse_sq = FHouse^2, .keep = "all")
head(asset_data)
Nation GDP FHouse OECD regime FHouse_sq
1 Albania 3719 4.5 0 not free 20.25
2 Algeria 5327 5.5 0 not free 30.25
3 Angola 1462 6.0 0 not free 36.00
4 Argentina 12095 1.5 0 free 2.25
5 Armenia 2421 4.0 0 not free 16.00
6 Australia 27390 1.0 1 free 1.00
# Changing FHouse to a Factor Variable
asset_data$FHouse <- as.factor(asset_data$FHouse)
Fit a linear regression model with GDP as the dependent variable, and 2 explanatory variables FHouse and FHouse_sq.
fit <- lm(GDP ~ FHouse + FHouse_sq, data = asset_data)
summary(fit)
Call:
lm(formula = GDP ~ FHouse + FHouse_sq, data = asset_data)
Residuals:
Min 1Q Median 3Q Max
-19169 -2957 -1149 2215 26851
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23714 1385 17.123 < 2e-16 ***
FHouse1.5 -9188 1937 -4.743 4.91e-06 ***
FHouse2 -15590 2221 -7.020 7.44e-11 ***
FHouse2.5 -19200 2331 -8.236 8.62e-14 ***
FHouse3 -20727 2682 -7.729 1.53e-12 ***
FHouse3.5 -21022 2477 -8.485 2.05e-14 ***
FHouse4 -20753 2272 -9.133 4.65e-16 ***
FHouse4.5 -18900 2477 -7.629 2.66e-12 ***
FHouse5 -17411 2819 -6.177 6.01e-09 ***
FHouse5.5 -19650 2007 -9.791 < 2e-16 ***
FHouse6 -21475 2682 -8.007 3.18e-13 ***
FHouse6.5 -20198 2570 -7.858 7.38e-13 ***
FHouse7 -18237 3218 -5.667 7.37e-08 ***
FHouse_sq NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6496 on 148 degrees of freedom
Multiple R-squared: 0.569, Adjusted R-squared: 0.5341
F-statistic: 16.28 on 12 and 148 DF, p-value: < 2.2e-16
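The FHouse_sq row is NA ('not defined because of singularities') because FHouse was converted to a factor just before the fit: the factor dummies already allow a separate mean at every FHouse score, so any additional function of FHouse, including its square, is perfectly collinear and is dropped. If the intention was a genuine quadratic fit in the numeric score (one possible reading of the exercise, not something shown above), a minimal sketch would be:
# Quadratic regression in the numeric FHouse score (converting the factor back to numeric first)
asset_data$FHouse_num <- as.numeric(as.character(asset_data$FHouse))
fit_quad <- lm(GDP ~ FHouse_num + I(FHouse_num^2), data = asset_data)
summary(fit_quad)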
Now do a different regression, with FHouse as the single explanatory variable and the log of GDP as the dependent variable (i.e. a regression of log(GDP) on FHouse). How can we now interpret the coefficients for FHouse?
library(AER)
fit1 <- lm(log(GDP) ~ FHouse, data = asset_data)
summary(fit1)
Call:
lm(formula = log(GDP) ~ FHouse, data = asset_data)
Residuals:
Min 1Q Median 3Q Max
-1.8175 -0.5175 -0.0182 0.5326 2.0864
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.9189 0.1758 56.419 < 2e-16 ***
FHouse1.5 -0.4947 0.2459 -2.012 0.04605 *
FHouse2 -1.2191 0.2819 -4.324 2.80e-05 ***
FHouse2.5 -1.7354 0.2959 -5.864 2.83e-08 ***
FHouse3 -2.2004 0.3405 -6.463 1.40e-09 ***
FHouse3.5 -2.2146 0.3145 -7.042 6.62e-11 ***
FHouse4 -2.2260 0.2885 -7.717 1.63e-12 ***
FHouse4.5 -1.9857 0.3145 -6.314 3.00e-09 ***
FHouse5 -1.8752 0.3578 -5.240 5.43e-07 ***
FHouse5.5 -2.0306 0.2548 -7.970 3.92e-13 ***
FHouse6 -2.4958 0.3405 -7.331 1.38e-11 ***
FHouse6.5 -2.2325 0.3263 -6.842 1.92e-10 ***
FHouse7 -1.5831 0.4085 -3.875 0.00016 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8246 on 148 degrees of freedom
Multiple R-squared: 0.527, Adjusted R-squared: 0.4887
F-statistic: 13.74 on 12 and 148 DF, p-value: < 2.2e-16
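With log(GDP) as the outcome and FHouse entering as a factor, each coefficient is the difference in average log GDP between that FHouse score and the baseline score of 1.0; exponentiating a coefficient therefore gives the ratio of (geometric-mean) GDP relative to the baseline. A quick way to read the results on that multiplicative scale (a sketch; output not shown):
# FHouse coefficients expressed as GDP ratios relative to the FHouse = 1 baseline
exp(coef(fit1))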
IV Regression
iv_health <- read.csv("C:/Users/nirma/Documents/EDX courses/MicroMaster MIT/Assessing 14.310x/rcodes/data_iv_health2.csv")
iv_health <- as.data.frame(iv_health)
head(iv_health)
healthinsu age female illnesses ssiratio marry logmedexpense logincome
1 1 74 1 0 0.1498765 1 6.388561 4.549085
2 1 73 0 3 0.3958560 1 7.486053 3.577847
3 0 80 1 1 1.0000000 0 5.170484 2.261763
4 1 70 0 5 0.2066395 1 7.798523 3.656221
5 0 91 0 3 0.5371920 1 5.799093 2.166193
6 1 82 1 2 0.3263104 0 6.826545 3.775077
summary(iv_health)
healthinsu age female illnesses
Min. :0.0000 Min. :65.00 Min. :0.0000 Min. :0.000
1st Qu.:0.0000 1st Qu.:70.00 1st Qu.:0.0000 1st Qu.:1.000
Median :0.0000 Median :74.00 Median :1.0000 Median :2.000
Mean :0.3822 Mean :75.05 Mean :0.5771 Mean :1.861
3rd Qu.:1.0000 3rd Qu.:80.00 3rd Qu.:1.0000 3rd Qu.:3.000
Max. :1.0000 Max. :91.00 Max. :1.0000 Max. :9.000
ssiratio marry logmedexpense logincome
Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. :-6.908
1st Qu.:0.2381 1st Qu.:0.0000 1st Qu.: 5.740 1st Qu.: 2.233
Median :0.5045 Median :1.0000 Median : 6.678 Median : 2.743
Mean :0.5365 Mean :0.5636 Mean : 6.481 Mean : 2.743
3rd Qu.:0.9091 3rd Qu.:1.0000 3rd Qu.: 7.430 3rd Qu.: 3.315
Max. :9.2506 Max. :1.0000 Max. :10.180 Max. : 5.744
reg1 <- ivreg(logmedexpense ~ age+female+illnesses+ssiratio+marry+logincome | age+female+illnesses+ssiratio+marry+logincome+healthinsu, data = iv_health)
summary(reg1)
Call:
ivreg(formula = logmedexpense ~ age + female + illnesses + ssiratio +
marry + logincome | age + female + illnesses + ssiratio +
marry + logincome + healthinsu, data = iv_health)
Residuals:
Min 1Q Median 3Q Max
-6.3289 -0.6684 0.1547 0.8491 3.6795
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.684271 0.162049 35.078 < 2e-16 ***
age -0.004143 0.001931 -2.145 0.031954 *
female 0.055787 0.026459 2.108 0.035018 *
illnesses 0.438568 0.009594 45.712 < 2e-16 ***
ssiratio 0.171970 0.037670 4.565 5.05e-06 ***
marry 0.031815 0.027322 1.164 0.244272
logincome 0.054499 0.014965 3.642 0.000272 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.237 on 10082 degrees of freedom
Multiple R-Squared: 0.1763, Adjusted R-squared: 0.1758
Wald test: 359.7 on 6 and 10082 DF, p-value: < 2.2e-16
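As specified above, healthinsu appears only in the instrument list and no right-hand-side regressor is actually instrumented, so the fit is numerically the same as OLS of logmedexpense on the listed controls. If the goal is to estimate the effect of health insurance on medical spending using ssiratio as the instrument (an assumed reading of the exercise, not something stated in the data description), a minimal sketch of that specification would be:
# Instrumenting healthinsu with ssiratio; the other covariates act as their own instruments
reg2 <- ivreg(logmedexpense ~ healthinsu + age + female + illnesses + marry + logincome |
                ssiratio + age + female + illnesses + marry + logincome,
              data = iv_health)
summary(reg2, diagnostics = TRUE)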
Thank you!