rm(list=ls())
library(tidyverse)
library(ggplot2)
library(cowplot)
library(purrr)
In this problem, we will use R to simulate the distributions of the sample mean as the sample size increases and to illustrate the central limit theorem.
The simulations will help us answer, or verify our answers to, the questions below; strictly speaking, however, all questions except the first one can be answered even without R.
First, do the following steps in R:
1. Draw a sample from the exponential distribution with parameter lambda = 2 of size n=5
#X_5_1
n_5 = 5
lambda = 2
nosim = 10000
set.seed(321)
df5_1 <- data.frame(rexp(n_5, lambda))
df5_1
rexp.n_5..lambda.
1 0.08256092
2 0.35672086
3 0.62759753
4 0.52893331
5 0.44705665
2. Compute the mean of the sample. Denote the sample mean of a sample of size n by bar_X_n; hence, we have just obtained a realization of bar_X_5.
# Calculating Sample Mean
df5_1 <- unlist(df5_1)
bar_X_n <- mean(df5_1)
bar_X_n
[1] 0.4085739
# Calculating the Theoretical Mean
theo_mean <- 1/lambda
theo_mean
[1] 0.5
# Putting them together
data.frame(bar_X_n, theo_mean)
bar_X_n theo_mean
1 0.4085739 0.5
Based on the outcome, we can see that the true mean is 22.38% higher than this particular sample mean. What happens when we repeat this process 10000 times, and when we vary the sample size? Let’s check.
3. Repeat this procedure 10000 times, and record the sample mean of each iteration (so there should be 10000 sample means in total, each being the mean of a sample of size 5). We have now simulated the distribution of bar_X_5. When we need to repeat a process many times and record the results, the best approach is to write an appropriate function or loop. Here, the easiest way for me is a for loop that generates samples of size 5 over and over until 10000 iterations have been completed. Here’s how I do it:
#X_5
n_5 = 5 # Sample Size
lambda = 2 # lambda is the essential parameter in exponential distribution
nosim = 10000 # Number of iterations
set.seed(321) # for reproducibility
df5 = data.frame() # empty data frame to store the simulated samples
for (i in 1:10000){ df5<-rbind(df5,rexp(n_5,lambda))} # generates 10000 samples of size 5
tail(df5) #checking if it worked
X0.0825609171197042 X0.356720861026502 X0.627597526160485
9995 0.002788028 0.19910490 0.2469435
9996 0.069378955 0.48966865 1.6182932
9997 1.358695258 1.23842787 0.4008296
9998 1.932344827 0.32406711 0.3147533
9999 0.410072513 0.07661582 0.4136712
10000 0.287830784 0.16447746 0.5968369
X0.528933308997607 X0.447056648894528
9995 1.05443553 0.21282525
9996 2.14219332 0.36046127
9997 0.02617753 0.20110714
9998 0.39803684 0.57822052
9999 0.14114070 0.27934351
10000 0.11987528 0.06581991
# Because I want to create a combined plot later, I make a copy of the df5 data set
df5_copy <- df5
df5_copy$mean<-rowMeans(df5_copy,na.rm=TRUE,dims=1)
mean(df5_copy$mean)
[1] 0.4969809
round(mean(df5_copy$mean),1)
[1] 0.5
Everything has worked so far! We can see the last 6 rows of the data frame, which has 5 columns; as stated earlier, one row represents one iteration. Thus, to get the mean of each iteration, we have to calculate the row means.
# Calculating Row Means
df5$mean<-rowMeans(df5,na.rm=TRUE,dims=1) # It will add a new column called 'mean' to the existing data file
tail(df5)
X0.0825609171197042 X0.356720861026502 X0.627597526160485
9995 0.002788028 0.19910490 0.2469435
9996 0.069378955 0.48966865 1.6182932
9997 1.358695258 1.23842787 0.4008296
9998 1.932344827 0.32406711 0.3147533
9999 0.410072513 0.07661582 0.4136712
10000 0.287830784 0.16447746 0.5968369
X0.528933308997607 X0.447056648894528 mean
9995 1.05443553 0.21282525 0.3432194
9996 2.14219332 0.36046127 0.9359991
9997 0.02617753 0.20110714 0.6450475
9998 0.39803684 0.57822052 0.7094845
9999 0.14114070 0.27934351 0.2641688
10000 0.11987528 0.06581991 0.2469681
I have added a new column named ‘mean’ to my df5 data frame. Now I want to calculate the column-wise means:
df5 %>%
map_dbl(mean)
X0.0825609171197042 X0.356720861026502 X0.627597526160485 X0.528933308997607
0.4906613 0.4914589 0.4960988 0.5000257
X0.447056648894528 mean
0.5066597 0.4969809
# Maximum and minimum values in the column mean
min(df5$mean)
[1] 0.04577098
max(df5$mean)
[1] 1.665608
The row means range from 0.04577098 to 1.665608. That’s a big range! What is the mean of all of these row means? Let’s check the summary.
summary(df5$mean) # Summarizing the results
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.04577 0.33358 0.46410 0.49698 0.62350 1.66561
The mean of all of the row means turns out to be 0.49698, which is much closer to the theoretical mean of 0.5 than the single sample mean we computed earlier (0.4085739). Averaging over many repetitions brings us very close to the true mean.
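As an aside, the same simulation can be written without growing a data frame inside a loop, which is slow in R. Here is a minimal sketch of a vectorized alternative (the names sim_matrix_5 and sample_means_5 are my own and are not used elsewhere); it produces an equivalent set of 10000 sample means, though not the identical draws, since the values are laid out in a different order:
# Draw all 10000 x 5 values at once, arrange them as a matrix, and take row means
set.seed(321)
sim_matrix_5 <- matrix(rexp(nosim * n_5, lambda), nrow = nosim, ncol = n_5)
sample_means_5 <- rowMeans(sim_matrix_5)  # one sample mean per simulated sample
mean(sample_means_5)                      # should again be close to 1/lambda = 0.5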
4. Plot a histogram of the pdf of bar_X_5 using a bin-width of 0.01. We recommend you use ggplot with the option aes(y=..density..) . You are also welcome to plot a kernel density plot with a suitable bandwidth if preferred.
# Plotting histogram
x_5_hist <- ggplot()+
geom_histogram(data=df5,aes(mean, ..density..), binwidth = 0.01, fill="lightblue", color="darkred")+
geom_density(kernel="gaussian", aes(df5$mean))+
xlab("Sample Size=5, 10000 Iterations")
x_5_hist
The underlying data are exponential, and with a sample size of only 5 the distribution of the sample mean is still noticeably right-skewed; even so, a roughly Gaussian bell shape is already emerging.
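To see how close this shape already is to the CLT limit, we can overlay the normal density that the CLT predicts for bar_X_5, namely a normal with mean 1/lambda and standard deviation 1/(lambda*sqrt(5)). A minimal sketch (the layer choices and labels are my own):
# Histogram of the 10000 sample means with the CLT normal approximation overlaid
ggplot(df5, aes(x = mean)) +
  geom_histogram(aes(y = ..density..), binwidth = 0.01, fill = "lightblue", color = "darkred") +
  stat_function(fun = dnorm, args = list(mean = 1/lambda, sd = 1/(lambda * sqrt(n_5))), color = "blue") +
  xlab("Sample means (n = 5) with normal approximation")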
Now let’s work through the rest of the quantities from the previous question. Although we already saw the mean of the sample means, I compute it again and round it to one decimal place; I will then calculate the variance and the standard deviation of the simulated data to get a better grasp of the distribution.
# Calculating Mean of Mean
meanofmean <- mean(df5$mean)
meanofmean
[1] 0.4969809
# Rounding the Value of Mean to One Digit Decimal Point
round(mean(df5$mean),1)
[1] 0.5
# Unlisting the data frame (var() on a data frame returns a covariance matrix, so we pool all simulated values, including the mean column, into a single vector)
df5 <- unlist(df5)
# Calculating variance and rounding it to one decimal point
var_5 <- round(var(df5),1)
var_5 #Checking
[1] 0.2
# Calculating Standard Deviation and rounding to one decimal point
sd_5 <- round(sd(df5),1)
sd_5
[1] 0.5
In probability, the exponential distribution is said to possess the memoryless property: knowledge of what has happened in the past does not change future probabilities. It is often used to model the lifetime of electrical or mechanical devices.
Here are some properties of the exponential distribution with rate lambda: its mean is 1/lambda, its standard deviation is 1/lambda, and its variance is 1/lambda^2. With lambda = 2, the theoretical mean and standard deviation are both 0.5 and the variance is 0.25.
The data we generated seem to follow these basic properties!
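As a quick sanity check, we can set the theoretical moments of Exp(lambda = 2) next to their simulated counterparts computed from the raw draws only. A minimal sketch using df5_copy, which still has the five raw columns plus the mean column (output not shown):
# Theoretical vs simulated moments of the raw exponential draws (excluding the 'mean' column)
pooled_draws <- unlist(df5_copy[, 1:n_5])   # 10000 x 5 = 50000 raw draws
c(theoretical_mean = 1/lambda, simulated_mean = mean(pooled_draws))
c(theoretical_sd = 1/lambda, simulated_sd = sd(pooled_draws))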
A. n=1
#X_1
n_1 = 1
lambda = 2
nosim = 10000
set.seed(321)
df1=data.frame()
for (i in 1:nosim){ df1<-rbind(df1,rexp(n_1,lambda))}
tail(df1)
X0.0825609171197042
9995 0.0974078
9996 0.5127457
9997 0.5860184
9998 1.2658391
9999 1.4962940
10000 1.5561905
df1$mean<-rowMeans(df1,na.rm=TRUE,dims=1)
mean(df1$mean)
[1] 0.4951067
round(mean(df1$mean),1)
[1] 0.5
# Because I want to create a combined plot later, I make a copy of the df1 data set
df1_copy <- df1
df1_copy$mean<-rowMeans(df1_copy,na.rm=TRUE,dims=1)
mean(df1_copy$mean)
[1] 0.4951067
round(mean(df1_copy$mean),1)
[1] 0.5
# Creating Histogram
x_1_hist <- ggplot()+
geom_histogram(data = df1,aes(mean, ..density..), binwidth = 0.01, fill = "green", color = "red")+
geom_density(kernel = "gaussian", aes(df1$mean))+
xlab("Sample Size = 1, 10000 Iterations")
x_1_hist
df1 <- unlist(df1)
var_1 <- round(var(df1),1)
sd_1 <- round(sd(df1),1)
sd_1
[1] 0.5
B. n = 10
#X_10
n_10 = 10
lambda = 2
nosim = 10000
set.seed(321)
df10=data.frame()
for (i in 1:nosim){ df10<-rbind(df10,rexp(n_10,lambda))}
df10$mean<-rowMeans(df10,na.rm=TRUE,dims=1)
mean(df10$mean)
[1] 0.4985819
round(mean(df10$mean),1)
[1] 0.5
# Because I want to create a combined plot later, I make a copy of the df10 data set
df10_copy <- df10
df10_copy$mean<-rowMeans(df10_copy,na.rm=TRUE,dims=1)
mean(df10_copy$mean)
[1] 0.4985819
round(mean(df10_copy$mean),1)
[1] 0.5
x_10_hist <- ggplot()+
geom_histogram(data=df10,aes(mean, ..density..), binwidth = 0.01, fill="white", color="darkred")+
geom_density(kernel="gaussian", aes(df10$mean))+
xlab("Sample Size=10, 10000 Iterations")
x_10_hist
df10 <- unlist(df10)
var_10 <- round(var(df10),1)
sd_10 <- round(sd(df10),1)
sd_10
[1] 0.5
C. n=30
AND
Compare the shapes of the 4 histograms of bar_X_1, bar_X_5, bar_X_10, and bar_X_30 by plotting them on the same graph.
#X_30
n_30 = 30
lambda = 2
nosim = 10000
set.seed(321)
df30=data.frame()
for (i in 1:nosim){ df30<-rbind(df30,rexp(n_30,lambda))}
df30$mean<-round(rowMeans(df30,na.rm=TRUE,dims=1),1)
mean(df30$mean)
[1] 0.50039
round(mean(df30$mean),1)
[1] 0.5
# Because I want to create a combined plot later, I make a copy of the df30 data set
df30_copy <- df30
df30_copy$mean<-rowMeans(df30_copy,na.rm=TRUE,dims=1)
mean(df30_copy$mean)
[1] 0.5001816
round(mean(df30_copy$mean),1)
[1] 0.5
x_30_hist <- ggplot()+
geom_histogram(data=df30,aes(mean, ..density..), binwidth = 0.01, fill="yellow", color="darkred")+
geom_density(kernel="gaussian", aes(df30$mean))+
xlab("Sample Size=30, 10000 Iterations")
x_30_hist
df30 <- unlist(df30)
var_30 <- round(var(df30),1)
sd_30 <- round(sd(df30),1)
sd_30
[1] 0.5
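Note that the variances and standard deviations computed above pool all of the simulated values (including the mean column), which is why they stay near 0.5 for every sample size. The quantity the CLT is really about is the spread of the sample means themselves, which should shrink like 1/(lambda*sqrt(n)). A minimal sketch comparing the two, using the *_copy data frames created above (output not shown):
# Standard deviation of the simulated sample means vs the CLT prediction 1/(lambda*sqrt(n))
n_values <- c(1, 5, 10, 30)
simulated_sd <- c(sd(df1_copy$mean), sd(df5_copy$mean), sd(df10_copy$mean), sd(df30_copy$mean))
data.frame(n = n_values, simulated_sd = simulated_sd, theoretical_sd = 1/(lambda * sqrt(n_values)))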
Can I put them together in a single graph? Let’s try:
ggplot()+
geom_histogram(data = df1_copy, aes(x = mean, y = ..density..), binwidth = 1/1000, color = "orange", fill = "orange", alpha = 0.2)+
geom_histogram(data = df5_copy, aes(x = mean, y = ..density..),binwidth = 1/1000, color = "blue",fill = "blue", alpha = 0.2)+
geom_histogram(data = df10_copy, aes(x = mean, y = ..density..), binwidth = 1/1000, color = "red", fill = "red", alpha = 0.1)+
geom_histogram(data = df30_copy, aes(x = mean, y = ..density..),binwidth = 1/1000,color = "purple",fill = "purple",alpha = 0.01)+
xlim(c(0,1.5))+
xlab("Orange:Sample Size 1, Blue:Sample Size 5, Red: Sample Size 10, & Purple: Sample Size 30")
Now the comparison is clearly visible. The distribution of the sample mean becomes more and more Gaussian as the sample size increases: it grows more symmetric and more tightly concentrated around 0.5.
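A tidier way to build the same comparison is to stack the four sets of sample means into one long data frame and draw one density curve per sample size, so ggplot produces a proper legend instead of relying on hard-coded colours. A minimal sketch (the name means_long is my own):
# One density curve per sample size, with an automatic legend
means_long <- bind_rows(
  data.frame(n = "n = 1",  sample_mean = df1_copy$mean),
  data.frame(n = "n = 5",  sample_mean = df5_copy$mean),
  data.frame(n = "n = 10", sample_mean = df10_copy$mean),
  data.frame(n = "n = 30", sample_mean = df30_copy$mean)
)
ggplot(means_long, aes(x = sample_mean, colour = n)) +
  geom_density() +
  xlim(c(0, 1.5)) +
  xlab("Sample mean")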
Finally, I am going to compare the simulated distribution of the sample mean for n = 30 with the normal distribution that the CLT predicts.
# Simulating 10000 draws from a normal distribution with mean 1/2 and sd 1/(2*sqrt(30)) = 1/sqrt(120)
df30_2 <- data.frame()
for(i in 1:10000){
df30_2 <- rbind(df30_2, rnorm(1, 1/2, 1/sqrt(120)))
}
# Calculating Mean
df30_2$mean <- rowMeans(df30_2, na.rm=TRUE, dims=1)
# Plotting both distributions as frequency polygons on the density scale
ggplot()+
geom_freqpoly(data=df30_copy, aes(x=mean, y=..density..), fill="blue", color="darkblue")+
geom_freqpoly(data=df30_2, aes(x=mean, y=..density..),color="red", fill="darkred")+
xlab("Red:Normal Distribution, Blue:Exponential Distribution")
The two curves nearly coincide; in the absence of the red (normal) curve, we could easily mistake the blue (simulated) distribution for a true Gaussian.
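A more direct check of normality for bar_X_30 is a normal quantile-quantile plot: if the simulated means are approximately Gaussian, the points should lie close to the reference line. A minimal sketch:
# Normal Q-Q plot of the simulated sample means for n = 30
ggplot(df30_copy, aes(sample = mean)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  xlab("Theoretical normal quantiles") +
  ylab("Simulated sample means (n = 30)")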
In this problem, we will examine a famous study on corruption conducted by the economist Benjamin Olken. The study used a number of experiments to evaluate the effects of increased monitoring on levels of misappropriation by village officials during the construction of roads in Indonesia (the full paper is available here: https://economics.mit.edu/files/2913 ). In one experiment, villages were randomly assigned to receive invitations to public ‘accountability meetings’ i.e. public meetings where community residents would hear village officials explain how they spent the project funds. The idea behind this experiment was that if village officials know in advance that more community members will be coming to accountability meetings, they might be less likely to misappropriate funding.
Load the dataset olken.csv.
simul_stat <- as.vector(NULL)
olken_data <- read.csv("C:/Users/nirma/Documents/EDX courses/MicroMaster MIT/Assessing 14.310x/rcodes/datasets_olken.csv")
head(olken_data)
X.1 X treat_invite pct_missing head_edu mosques pct_poor total_budget
1 1 1 0 0.38527447 6 0.9083831 0.4001222 40.56500
2 2 2 1 -0.09574836 14 1.0666667 0.1856149 69.32150
3 3 3 1 0.14771932 12 0.7117438 0.4000000 41.10650
4 4 4 1 -0.18259122 9 0.9489917 0.4379366 17.06200
5 5 5 0 -0.29304767 9 1.6233766 0.3126954 72.08600
6 6 6 1 -0.09736358 9 0.7381890 0.4076087 69.83308
The dataset contains 472 villages. treat_invite is a binary indicator of whether or not the village was “treated” with invitations to accountability meetings: treat_invite = 1 denotes that the village is in the “treatment group”, and treat_invite = 0 denotes that it is in the “control group”. The remaining variables are:
pct_missing : the amount of funding misappropriated during the construction project.
head_edu : an ordinal variable indicating the level of education of the head village official.
mosques : the number of mosques per 1000 residents in the village.
pct_poor : the percentage of households in the village below the poverty line.
total_budget : the total budget allocated to the village for the road project.
library(tidyverse)
olken_data <- select(olken_data, treat_invite, pct_missing, head_edu, mosques,
pct_poor, total_budget)
olken_data <- as.data.frame(olken_data)
head(olken_data)
treat_invite pct_missing head_edu mosques pct_poor total_budget
1 0 0.38527447 6 0.9083831 0.4001222 40.56500
2 1 -0.09574836 14 1.0666667 0.1856149 69.32150
3 1 0.14771932 12 0.7117438 0.4000000 41.10650
4 1 -0.18259122 9 0.9489917 0.4379366 17.06200
5 0 -0.29304767 9 1.6233766 0.3126954 72.08600
6 1 -0.09736358 9 0.7381890 0.4076087 69.83308
summary(olken_data)
treat_invite pct_missing head_edu mosques
Min. :0.0000 Min. :-1.10282 Min. : 6.00 Min. :0.0000
1st Qu.:0.0000 1st Qu.: 0.01996 1st Qu.: 9.00 1st Qu.:0.8602
Median :1.0000 Median : 0.22717 Median :12.00 Median :1.2629
Mean :0.6589 Mean : 0.23633 Mean :11.56 Mean :1.4377
3rd Qu.:1.0000 3rd Qu.: 0.41690 3rd Qu.:12.00 3rd Qu.:1.8841
Max. :1.0000 Max. : 1.67431 Max. :20.00 Max. :6.8891
pct_poor total_budget
Min. :0.01867 Min. : 8.758
1st Qu.:0.23317 1st Qu.: 53.770
Median :0.38596 Median : 73.177
Mean :0.40713 Mean : 83.229
3rd Qu.:0.56572 3rd Qu.:103.448
Max. :0.94462 Max. :890.242
# Later, I want to plot density plots in a single graph, so I create a copy of the Olken data.
olken_data_copy <- olken_data
head(olken_data_copy)
treat_invite pct_missing head_edu mosques pct_poor total_budget
1 0 0.38527447 6 0.9083831 0.4001222 40.56500
2 1 -0.09574836 14 1.0666667 0.1856149 69.32150
3 1 0.14771932 12 0.7117438 0.4000000 41.10650
4 1 -0.18259122 9 0.9489917 0.4379366 17.06200
5 0 -0.29304767 9 1.6233766 0.3126954 72.08600
6 1 -0.09736358 9 0.7381890 0.4076087 69.83308
We will look at the average treatment effect on the amount of misappropriated funding, pct_missing. In the following, use R to help answer the questions; use the lm function if you wish.
Separating the given dataset into treatment and control groups:
# Treatment Group
treatment_group <- filter(olken_data, treat_invite==1)
tail(treatment_group)
treat_invite pct_missing head_edu mosques pct_poor total_budget
306 1 -0.20104955 12 1.9786308 0.6354749 138.7556
307 1 -0.19787270 12 2.1459227 0.5958816 98.7985
308 1 -0.09951322 12 0.9589567 0.5052515 106.9569
309 1 0.10769557 16 1.0868384 0.7793258 84.4882
310 1 0.24107966 14 1.0576415 0.5776627 123.4696
311 1 0.20130354 16 0.9394965 0.3533671 140.2379
dim(treatment_group)
[1] 311 6
#Control Group
control_group <- filter(olken_data, treat_invite==0)
tail(control_group)
treat_invite pct_missing head_edu mosques pct_poor total_budget
156 0 0.1532032 9 0.7312614 0.3879004 98.40993
157 0 -0.1434073 16 1.6062413 0.2776141 141.59979
158 0 0.2352902 16 1.1398176 0.3291284 104.98269
159 0 0.2973622 12 2.5575447 0.6444111 48.56191
160 0 0.2484317 9 2.3897059 0.3427230 50.28449
161 0 0.1516978 16 0.1238237 0.9391992 137.72064
str(control_group)
'data.frame': 161 obs. of 6 variables:
$ treat_invite: int 0 0 0 0 0 0 0 0 0 0 ...
$ pct_missing : num 0.3853 -0.293 -0.1215 -0.0551 0.1782 ...
$ head_edu : int 6 9 12 12 9 6 12 12 9 12 ...
$ mosques : num 0.908 1.623 1.28 2.055 1.094 ...
$ pct_poor : num 0.4 0.313 0.585 0.326 0.228 ...
$ total_budget: num 40.6 72.1 82.1 55.9 87.1 ...
The treatment group has 311 data points, while the control group has 161.
1. Make two kernel density plots of the variable head_edu, one for the treatment group and one for the control group. Place these on the same graph. Do the two distributions look similar for treatment and control groups?
# Loading Required Libraries
library(ggplot2)
library(cowplot)
# Treatment + head_edu
kernel_treatment_head <- ggplot(treatment_group)+
geom_histogram(aes(x=head_edu,y=..density..), binwidth = 1.5, fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(head_edu), bw=2, color="red")
# Control + head_edu
kernel_control_head <- ggplot(control_group)+
geom_histogram(aes(x=head_edu,y = ..density..), binwidth = 1.5, fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(x=head_edu),bw=2, color="red")
# Putting them Together
plot_grid(kernel_treatment_head, kernel_control_head)
The two distributions look very similar; I don’t see any meaningful difference between the treatment and control groups.
Do the same for the other 3 variables: mosques, pct_poor, and total_budget. Do the distributions of each variable look similar for the treatment and control groups?
MOSQUES Variable
# Treatment + mosques
kernel_treatment_mosques <- ggplot(treatment_group)+
geom_histogram(aes(x=mosques,y=..density..), fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(mosques), color="red")
# Control + mosques
kernel_control_mosques <- ggplot(control_group)+
geom_histogram(aes(x=mosques,y = ..density..), fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(x=mosques), color="red")
# Putting them Together
plot_grid(kernel_treatment_mosques, kernel_control_mosques)
PCT_POOR Variable
# Treatment + pct_poor
kernel_treatment_pct_poor <- ggplot(treatment_group)+
geom_histogram(aes(x=pct_poor,y=..density..), fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(pct_poor), color="red")
# Control + pct_poor
kernel_control_pct_poor <- ggplot(control_group)+
geom_histogram(aes(x=pct_poor,y = ..density..), fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(x=pct_poor), color="red")
# Putting them Together
plot_grid(kernel_treatment_pct_poor, kernel_control_pct_poor)
TOTAL_BUDGET Variable
# Treatment + total_budget
kernel_treatment_total_budget <- ggplot(treatment_group)+
geom_histogram(aes(x=total_budget,y=..density..), fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(total_budget), color="red")
# Control + total_budget
kernel_control_total_budget <- ggplot(control_group)+
geom_histogram(aes(x=total_budget,y = ..density..), fill="grey", color="darkblue")+
geom_density(kernel="gaussian", aes(x=total_budget), color="red")
# Putting them Together
plot_grid(kernel_treatment_total_budget, kernel_control_total_budget)
Alternatively, I can use olken_data_copy and plot the treatment and control densities together for each variable, using just the density curves for a clearer comparison.
# Inspecting the shapes of covariates
#A. head_edu
ggplot(olken_data_copy, aes(x = head_edu, colour =(treat_invite == 0))) + geom_density()
#B. mosques
ggplot(olken_data_copy, aes(x = mosques, colour =(treat_invite == 0))) + geom_density()
#C. pct_poor
ggplot(olken_data_copy, aes(x = pct_poor, colour =(treat_invite == 0))) + geom_density()
#D. total_budget
ggplot(olken_data_copy, aes(x = total_budget, colour =(treat_invite == 0))) + geom_density()
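Beyond eyeballing the densities, a quick numerical balance check is to compare covariate means between the treatment and control groups, and, if desired, run a two-sample t-test per covariate. A minimal sketch (output not shown):
# Covariate means by treatment status
olken_data_copy %>%
  group_by(treat_invite) %>%
  summarise(across(c(head_edu, mosques, pct_poor, total_budget), mean))
# Example: two-sample t-test for one covariate (repeat for the others as needed)
t.test(head_edu ~ treat_invite, data = olken_data_copy)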
A. Checking the Number of Treatment and Control Cases
sum(olken_data$treat_invite)
[1] 311
count(olken_data)
n
1 472
It worked: there are 311 treatment cases out of 472 villages in total, leaving 161 control cases.
B. Repeatedly re-assigning treatment at random and calculating the difference in means between the simulated treatment and control groups.
set.seed(123)
for(i in 1:472){
olken_data$rand <- runif(472, min=0, max=1)# Creating 472 random uniform observations between 0 and 1
olken_data$treatment_rand <- as.numeric(rank(olken_data$rand)<= 311) #Randomly selecting total of 311 placeholders for treatment cases
olken_data$control_rand <- 1 - olken_data$treatment_rand # The remaining 161 villages form the simulated control group
# Appending the simulated difference in means (treatment minus control) to the simul_stat vector
simul_stat <- append(simul_stat, sum(olken_data$treatment_rand * olken_data$pct_missing)/sum(olken_data$treatment_rand)- sum(olken_data$control_rand * olken_data$pct_missing)/sum(olken_data$control_rand)
)
}
# Creating explicit treatment and control indicators from the real assignment treat_invite
olken_data$treatment <- olken_data$treat_invite
olken_data$control <- 1 - olken_data$treatment
olken_data$control
[1] 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0
[38] 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 1 0 0 1 1 0 1 1
[75] 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
[112] 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 1 0 1 0 1 0 0 1 0 0
[149] 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0
[186] 1 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 1 1 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1
[223] 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 1
[260] 1 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 1 0 0 0 0
[297] 0 1 1 0 1 1 1 1 0 0 0 1 1 0 0 0 1 0 0 1 0 1 1 0 1 1 1 0 1 0 0 1 0 0 0 0 1
[334] 0 0 1 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 1
[371] 1 0 1 1 1 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0
[408] 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 0 0 1 0 0
[445] 0 0 1 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 0 1 1 0
# Calculating the actual statistic from the given treatment assignment
act_stat <- sum(olken_data$treatment * olken_data$pct_missing)/sum(olken_data$treatment) - sum(olken_data$control * olken_data$pct_missing)/sum(olken_data$control)
act_stat
[1] -0.01428412
## Calculating Average Treatment Effect (ATE)
control_v <- filter(olken_data_copy, treat_invite == 0)#subsetting control data
treatment_v <- filter(olken_data_copy, treat_invite == 1)#subsetting the treatment data
average_treatment_effect <- mean(treatment_v$pct_missing) - mean(control_v$pct_missing)# Calculating ATE
average_treatment_effect
[1] -0.02494953
The results show that the actual statistic is -0.01428412, while the average treatment effect is -0.02494953. In a way, the actual statistic is just a hand-calculated ATE: both are the difference between the treatment-group and control-group means of pct_missing, so they should agree exactly when both are computed from the real treat_invite assignment; the gap between the two values above suggests the indicator used for act_stat did not line up with treat_invite. Now, let's check how the simulated values compare with the actual statistic.
sum(abs(simul_stat) >= act_stat)/NROW(simul_stat)
[1] 1
The value returned is the proportion (not a count) of simulated statistics whose absolute value is at least act_stat. Because act_stat is negative, every |simul_stat| clears that threshold, so the proportion is 1; a meaningful two-sided comparison should use the absolute value of the observed statistic, as in the sketch below.
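The corrected two-sided comparison, measuring the simulated differences against the absolute value of the observed statistic, is a one-liner (a sketch; output not shown):
# Share of simulated |differences| at least as large as the observed |difference|
perm_p_value <- mean(abs(simul_stat) >= abs(act_stat))
perm_p_value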
Calculating the average of pct_missing in the control villages.
control_mean <- sum(olken_data$control * olken_data$pct_missing)/sum(olken_data$control)
Calculating the average of pct_missing in the treatment villages.
treatment_mean <- sum(olken_data$treatment * olken_data$pct_missing)/sum(olken_data$treatment)
# Putting them Together
data.frame(control_mean, treatment_mean)
control_mean treatment_mean
1 0.2457397 0.2314556
We can see that the control mean is higher than the treatment mean (by 0.0142841), suggesting that the amount of fund misappropriation is higher in control villages than in treatment villages.
Now let’s attach a measure of uncertainty to the difference in sample averages between the treatment and control groups, which is an unbiased estimate of the average treatment effect. The sample variance of the control group is denoted by S_C and that of the treatment group by S_T.
Sample Variance of the Control Cases
S_C <- (1/(sum(olken_data$control)-1)) * sum(((olken_data$pct_missing - control_mean) * olken_data$control)^2)
# Second method: the standard error of the difference in means computed directly from the group variances
standard_error <- sqrt(var(treatment_v$pct_missing)/length(treatment_v$pct_missing) + var(control_v$pct_missing)/length(control_v$pct_missing))
standard_error
[1] 0.03310019
# t-statistic
T_value <- average_treatment_effect/standard_error
T_value
[1] -0.7537578
# Overall two-sided p-value (normal approximation)
p_value <- 2*pnorm(T_value)
p_value
[1] 0.4509946
# Two-sided p-value using the t distribution with 471 degrees of freedom
p_treatment <- 2*pt(T_value, 471)
p_treatment
[1] 0.4513713
## Comparing these values with the results from a linear model
summary(lm(pct_missing ~ treat_invite, data = olken_data_copy))
Call:
lm(formula = pct_missing ~ treat_invite, data = olken_data_copy)
Residuals:
Min 1Q Median 3Q Max
-1.33064 -0.21249 -0.01284 0.18281 1.42154
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.25277 0.02716 9.306 <2e-16 ***
treat_invite -0.02495 0.03346 -0.746 0.456
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3447 on 470 degrees of freedom
Multiple R-squared: 0.001181, Adjusted R-squared: -0.0009438
F-statistic: 0.5559 on 1 and 470 DF, p-value: 0.4563
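The same comparison is also available directly from a Welch two-sample t-test, which uses the same difference in means and the same variance formula (var_T/n_T + var_C/n_C) as the hand calculation above. A minimal sketch (output not shown):
# Welch two-sample t-test of pct_missing by treatment status
t.test(pct_missing ~ treat_invite, data = olken_data_copy)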
Sample Variance of the Treatment Cases
S_T <- (1/(sum(olken_data$treatment)-1)) * sum(((olken_data$pct_missing - treatment_mean) * olken_data$treatment)^2)
# Putting them Together
data.frame(S_C,S_T)
S_C S_T
1 0.1099838 0.1234812
Now that I have calculated the variance for both the control and treatment villages, I am going to estimate the sampling variance of the point estimate using Neyman's conservative variance estimator; its square root is a conservative (upper-bound) estimate of the standard error.
# Conservative (Neyman) variance estimate; sqrt(Vneyman) is the standard error
Vneyman <- S_T/sum(olken_data$treatment)+(S_C/sum(olken_data$control))
print(sqrt(Vneyman))
[1] 0.03286601
We can calculate the t-statistic using the Neyman variance estimate and the ATE:
t_value <- (average_treatment_effect/sqrt(Vneyman))
t_value
[1] -0.7591286
Based on the outcome, the treatment is associated with slightly lower misappropriation of the budget, but the effect is not statistically significant (p ≈ 0.45). This suggests, at most, a small impact of the accountability-meeting invitations.
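To summarize the estimate together with its uncertainty, we can also report an approximate 95% confidence interval for the ATE using the conservative Neyman standard error. A minimal sketch (output not shown):
# Approximate 95% confidence interval for the average treatment effect
c(lower = average_treatment_effect - 1.96 * sqrt(Vneyman),
  upper = average_treatment_effect + 1.96 * sqrt(Vneyman))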
In this problem, we will investigate transformations to the linear regression model. Download the demo.csv dataset. The dataset contains information from a sample of countries in the year 2000, taken from the Democracy Time-Series Dataset. It includes the following variables:
Nation : country name
GDP : Gross Domestic Product (GDP) per capita in constant US dollars
FHouse : Freedom House rating–a measure of the level of political and civil liberties in a country, on a scale from 1.0 (most free) to 7.0 (least free).
OECD : a dummy variable indicating OECD status.
regime : a categorical variable coded from the Freedom House rating that indicates whether a country is free (1.0-1.5), partly free (2.0-3.0), or not free (3.5-7.0).
Load the dataset demo.csv
asset_data <- read.csv("C:/Users/nirma/Documents/EDX courses/MicroMaster MIT/Assessing 14.310x/rcodes/asset_demo.csv")
asset_data <- select(asset_data,Nation, GDP, FHouse, OECD, regime)
asset_data <- as.data.frame(asset_data)
head(asset_data)
Nation GDP FHouse OECD regime
1 Albania 3719 4.5 0 not free
2 Algeria 5327 5.5 0 not free
3 Angola 1462 6.0 0 not free
4 Argentina 12095 1.5 0 free
5 Armenia 2421 4.0 0 not free
6 Australia 27390 1.0 1 free
summary(asset_data)
Nation GDP FHouse OECD
Length:161 Min. : 463 Min. :1.00 Min. :0.0000
Class :character 1st Qu.: 1857 1st Qu.:1.50 1st Qu.:0.0000
Mode :character Median : 4555 Median :3.50 Median :0.0000
Mean : 8469 Mean :3.46 Mean :0.1863
3rd Qu.: 9677 3rd Qu.:5.50 3rd Qu.:0.0000
Max. :50564 Max. :7.00 Max. :1.0000
regime
Length:161
Class :character
Mode :character
make a scatter plot with FHouse on the X-axis and GDP on the Y-axis
ggplot(asset_data)+
geom_point(aes(x=FHouse, y=GDP), color = "red")
run a linear regression using lm with GDP as the dependent variable, and FHouse as the explanatory variable
lm(formula = GDP ~ FHouse, data = asset_data)
Call:
lm(formula = GDP ~ FHouse, data = asset_data)
Coefficients:
(Intercept) FHouse
18214 -2817
Plot the regression line on your existing scatter plot (e.g. by adding + geom_smooth(method='lm') to your ggplot command); double-check that the correct line is plotted by comparing it with the lm results.
g = ggplot(asset_data, aes(x=FHouse, y=GDP))
g = g + geom_point(size=3,colour="red",alpha=0.4)
g = g + geom_smooth(method='lm', colour="black")
g
Compute the conditional sample mean of the GDP at each value of the FHouse score and store these in a vector. (This vector should contain 1 GDP sample mean for each possible FHouse score.) These are estimates of the conditional expectation of the GDP at different values of FHouse scores;
group_1 <- filter(asset_data, FHouse==1)
head(group_1)
Nation GDP FHouse OECD regime
1 Australia 27390 1 1 free
2 Austria 28481 1 1 free
3 Bahamas 17055 1 0 free
4 Belize 5682 1 0 free
5 Canada 27503 1 1 free
6 Cyprus 19175 1 0 free
group_2 <- filter(asset_data, FHouse==2)
group_3 <- filter(asset_data, FHouse==3)
group_4 <- filter(asset_data, FHouse==4)
group_5 <- filter(asset_data, FHouse==5)
group_6 <- filter(asset_data, FHouse==6)
group_7 <- filter(asset_data, FHouse==7)
summary(group_1$GDP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4545 17922 26987 23713 29105 50564
summary(group_2$GDP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
975 3678 6144 8123 9122 23015
summary(group_3$GDP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
583 1230 3002 2987 3604 7154
summary(group_4$GDP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
521 986 2421 2960 4162 6568
summary(group_5$GDP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
815 1206 1718 6303 7791 23594
summary(group_6$GDP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
650.0 968.5 1546.5 2239.0 2498.2 5806.0
summary(group_7$GDP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1506 3162 3416 5477 5707 13593
# OR - the easy method: calculating the conditional means directly
mean_GDP <- data.frame(matrix(ncol = 2, nrow = 0)) # Create an empty matrix with 2-columns and 0-rows
x = c(2:14)/2
# looping over each possible FHouse score (1, 1.5, ..., 7)
for(i in x){
mean_GDP <- rbind(c(i, mean(filter(asset_data, FHouse == i)$GDP)))
}
colnames(mean_GDP) <- c("FHouse", "mean_GDP")
mean_GDP
FHouse mean_GDP
[1,] 7 5476.8
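Note that, as written, the loop assigns rbind(c(i, ...)) to mean_GDP on every pass, so only the last FHouse score (7) is kept; appending with rbind(mean_GDP, c(i, ...)) would retain all thirteen rows. A more direct way to get every conditional mean, including the half-point scores (1.5, 2.5, ...) that the integer-only group_1, ..., group_7 filters skip, is a grouped summary. A minimal sketch (output not shown):
# Conditional sample mean of GDP at each FHouse score via a grouped summary
asset_data %>%
  group_by(FHouse) %>%
  summarise(mean_GDP = mean(GDP))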
Create a new variable, FHouse_sq, defined as the square of FHouse.
asset_data <- mutate(asset_data, FHouse_sq = FHouse^2, .keep = "all")
head(asset_data)
Nation GDP FHouse OECD regime FHouse_sq
1 Albania 3719 4.5 0 not free 20.25
2 Algeria 5327 5.5 0 not free 30.25
3 Angola 1462 6.0 0 not free 36.00
4 Argentina 12095 1.5 0 free 2.25
5 Armenia 2421 4.0 0 not free 16.00
6 Australia 27390 1.0 1 free 1.00
# Changing FHouse to a Factor Variable
asset_data$FHouse <- as.factor(asset_data$FHouse)
Fit a linear regression model with GDP as the dependent variable, and 2 explanatory variables FHouse and FHouse_sq.
fit <- lm(GDP ~ FHouse + FHouse_sq, data = asset_data)
summary(fit)
Call:
lm(formula = GDP ~ FHouse + FHouse_sq, data = asset_data)
Residuals:
Min 1Q Median 3Q Max
-19169 -2957 -1149 2215 26851
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23714 1385 17.123 < 2e-16 ***
FHouse1.5 -9188 1937 -4.743 4.91e-06 ***
FHouse2 -15590 2221 -7.020 7.44e-11 ***
FHouse2.5 -19200 2331 -8.236 8.62e-14 ***
FHouse3 -20727 2682 -7.729 1.53e-12 ***
FHouse3.5 -21022 2477 -8.485 2.05e-14 ***
FHouse4 -20753 2272 -9.133 4.65e-16 ***
FHouse4.5 -18900 2477 -7.629 2.66e-12 ***
FHouse5 -17411 2819 -6.177 6.01e-09 ***
FHouse5.5 -19650 2007 -9.791 < 2e-16 ***
FHouse6 -21475 2682 -8.007 3.18e-13 ***
FHouse6.5 -20198 2570 -7.858 7.38e-13 ***
FHouse7 -18237 3218 -5.667 7.37e-08 ***
FHouse_sq NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6496 on 148 degrees of freedom
Multiple R-squared: 0.569, Adjusted R-squared: 0.5341
F-statistic: 16.28 on 12 and 148 DF, p-value: < 2.2e-16
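The FHouse_sq row is NA ('not defined because of singularities') because FHouse was converted to a factor just before the fit: the factor dummies already allow a separate mean at every FHouse score, so any additional function of FHouse, including its square, is perfectly collinear and is dropped. If the intention was a genuine quadratic fit in the numeric score (one possible reading of the exercise, not something shown above), a minimal sketch would be:
# Quadratic regression in the numeric FHouse score (converting the factor back to numeric first)
asset_data$FHouse_num <- as.numeric(as.character(asset_data$FHouse))
fit_quad <- lm(GDP ~ FHouse_num + I(FHouse_num^2), data = asset_data)
summary(fit_quad)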
Now do a different regression, with FHouse as the single explanatory variable and the log of GDP as the dependent variable (i.e. a regression of log(GDP) on FHouse). How can we now interpret the coefficients for FHouse?
library(AER)
fit1 <- lm(log(GDP) ~ FHouse, data = asset_data)
summary(fit1)
Call:
lm(formula = log(GDP) ~ FHouse, data = asset_data)
Residuals:
Min 1Q Median 3Q Max
-1.8175 -0.5175 -0.0182 0.5326 2.0864
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.9189 0.1758 56.419 < 2e-16 ***
FHouse1.5 -0.4947 0.2459 -2.012 0.04605 *
FHouse2 -1.2191 0.2819 -4.324 2.80e-05 ***
FHouse2.5 -1.7354 0.2959 -5.864 2.83e-08 ***
FHouse3 -2.2004 0.3405 -6.463 1.40e-09 ***
FHouse3.5 -2.2146 0.3145 -7.042 6.62e-11 ***
FHouse4 -2.2260 0.2885 -7.717 1.63e-12 ***
FHouse4.5 -1.9857 0.3145 -6.314 3.00e-09 ***
FHouse5 -1.8752 0.3578 -5.240 5.43e-07 ***
FHouse5.5 -2.0306 0.2548 -7.970 3.92e-13 ***
FHouse6 -2.4958 0.3405 -7.331 1.38e-11 ***
FHouse6.5 -2.2325 0.3263 -6.842 1.92e-10 ***
FHouse7 -1.5831 0.4085 -3.875 0.00016 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8246 on 148 degrees of freedom
Multiple R-squared: 0.527, Adjusted R-squared: 0.4887
F-statistic: 13.74 on 12 and 148 DF, p-value: < 2.2e-16
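With log(GDP) as the outcome and FHouse entering as a factor, each coefficient is the difference in average log GDP between that FHouse score and the baseline score of 1.0; exponentiating a coefficient therefore gives the ratio of (geometric-mean) GDP relative to the baseline. A quick way to read the results on that multiplicative scale (a sketch; output not shown):
# FHouse coefficients expressed as GDP ratios relative to the FHouse = 1 baseline
exp(coef(fit1))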
IV Regression
iv_health <- read.csv("C:/Users/nirma/Documents/EDX courses/MicroMaster MIT/Assessing 14.310x/rcodes/data_iv_health2.csv")
iv_health <- as.data.frame(iv_health)
head(iv_health)
healthinsu age female illnesses ssiratio marry logmedexpense logincome
1 1 74 1 0 0.1498765 1 6.388561 4.549085
2 1 73 0 3 0.3958560 1 7.486053 3.577847
3 0 80 1 1 1.0000000 0 5.170484 2.261763
4 1 70 0 5 0.2066395 1 7.798523 3.656221
5 0 91 0 3 0.5371920 1 5.799093 2.166193
6 1 82 1 2 0.3263104 0 6.826545 3.775077
summary(iv_health)
healthinsu age female illnesses
Min. :0.0000 Min. :65.00 Min. :0.0000 Min. :0.000
1st Qu.:0.0000 1st Qu.:70.00 1st Qu.:0.0000 1st Qu.:1.000
Median :0.0000 Median :74.00 Median :1.0000 Median :2.000
Mean :0.3822 Mean :75.05 Mean :0.5771 Mean :1.861
3rd Qu.:1.0000 3rd Qu.:80.00 3rd Qu.:1.0000 3rd Qu.:3.000
Max. :1.0000 Max. :91.00 Max. :1.0000 Max. :9.000
ssiratio marry logmedexpense logincome
Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. :-6.908
1st Qu.:0.2381 1st Qu.:0.0000 1st Qu.: 5.740 1st Qu.: 2.233
Median :0.5045 Median :1.0000 Median : 6.678 Median : 2.743
Mean :0.5365 Mean :0.5636 Mean : 6.481 Mean : 2.743
3rd Qu.:0.9091 3rd Qu.:1.0000 3rd Qu.: 7.430 3rd Qu.: 3.315
Max. :9.2506 Max. :1.0000 Max. :10.180 Max. : 5.744
reg1 <- ivreg(logmedexpense ~ age+female+illnesses+ssiratio+marry+logincome | age+female+illnesses+ssiratio+marry+logincome+healthinsu, data = iv_health)
summary(reg1)
Call:
ivreg(formula = logmedexpense ~ age + female + illnesses + ssiratio +
marry + logincome | age + female + illnesses + ssiratio +
marry + logincome + healthinsu, data = iv_health)
Residuals:
Min 1Q Median 3Q Max
-6.3289 -0.6684 0.1547 0.8491 3.6795
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.684271 0.162049 35.078 < 2e-16 ***
age -0.004143 0.001931 -2.145 0.031954 *
female 0.055787 0.026459 2.108 0.035018 *
illnesses 0.438568 0.009594 45.712 < 2e-16 ***
ssiratio 0.171970 0.037670 4.565 5.05e-06 ***
marry 0.031815 0.027322 1.164 0.244272
logincome 0.054499 0.014965 3.642 0.000272 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.237 on 10082 degrees of freedom
Multiple R-Squared: 0.1763, Adjusted R-squared: 0.1758
Wald test: 359.7 on 6 and 10082 DF, p-value: < 2.2e-16
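As specified above, healthinsu appears only in the instrument list and no right-hand-side regressor is actually instrumented, so the fit is numerically the same as OLS of logmedexpense on the listed controls. If the goal is to estimate the effect of health insurance on medical spending using ssiratio as the instrument (an assumed reading of the exercise, not something stated in the data description), a minimal sketch of that specification would be:
# Instrumenting healthinsu with ssiratio; the other covariates act as their own instruments
reg2 <- ivreg(logmedexpense ~ healthinsu + age + female + illnesses + marry + logincome |
                ssiratio + age + female + illnesses + marry + logincome,
              data = iv_health)
summary(reg2, diagnostics = TRUE)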
Thank you!