Use this .rmd template to answer the 3 questions below. Once you have
done this, you can knit
the document to HTML
and upload this document to moodle/e-class. I will only
accept HTML documents, no exceptions.
Please read all of the assignment before attempting it.
This assignment is meant to be a tutorial on using R
markdown to “knit” together text, and statistical input and output.
There will be a short video that accompanies this document to help get
you started.
A basic ability to work with data, perform statistical analysis, and interpret statistical output is part of the learning objectives of this course. R is how we will perform the statistical analysis, and R Markdown is how we will present our analysis. Markdown is a markup language used to produce documents with a simple text syntax. Just some basics:
- Section headers are made with the pound sign (#).
- Lists can be enumerated (numbered) or itemized (bulleted).
- The spacing is important.
We can write in math by using the dollar sign notation. This is an inline equation \(Y_i = \beta_0 + \beta_1x_i + u_i\) and the double dollar sign notation for regular equations:
\[ Y_i = \beta_0 + \beta_1x_i + u_i \]
That is all the basic markdown syntax required for this course. The cheat sheet is a good reference.
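For reference, here is roughly what that syntax looks like in the raw .Rmd file (a small illustrative sketch):

```
# A section header
## A subsection header

1. An enumerated list item
2. Another enumerated item

- An itemized list item
- Another itemized item

An inline equation: $Y_i = \beta_0 + \beta_1 x_i + u_i$

A display equation:
$$ Y_i = \beta_0 + \beta_1 x_i + u_i $$
```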
There are two ways to input statistical commands in R Markdown. The main one is called a code chunk (the other is inline code). Code chunks are started with three ticks (beside the one key) followed by {r}, and then closed with three ticks. You can put them anywhere in the document. A code chunk is set up properly when it turns shaded gray in the markdown editor. Inside a code chunk, you are no longer writing markdown, but R. I will always use the tidyverse syntax in this course.
In terms of data management, there are three main commands to know: filter(), select(), and mutate(). The filter() command subsets rows with a logical argument. select() subsets columns by selecting variables; there will not be much need for this. Finally, we can create or alter variables using mutate(). I will demonstrate below. For the first couple of assignments, I will provide the code, so this is just a reference so that you can learn how the code works.
R is an object-oriented programming language. This means that we will create objects and manipulate them. Data is an object and, as we will see, so are regression models. Here is some basic usage:
# the pound sign is used to comment within the code; R ignores anything after a pound sign on a line
# Load the packages we need: AER for the data, tidyverse for the data management syntax
library(AER)
library(tidyverse)
# Load data
data('CPS1985')
# Create a data frame called df from CPS1985, and do some basic data management
df <- CPS1985 %>%
  filter(education > 7, wage < 30) %>%        # keep rows with education > 7 and wage < 30
  mutate(age_squared = age * age,             # create age squared
         BA = ifelse(education >= 16, 1, 0))  # dummy = 1 for 16+ years of education
In the above code, I did the following:

- Loaded the data set CPS1985 from the package AER.
- Created a new data frame called df from CPS1985 using the assignment operator <-; this is called the "create and assign" approach.
- Used filter() to keep only observations with more than 7 years of education and a wage below 30.
- Used mutate() to create two new variables: age_squared and a dummy variable BA equal to one for those with at least 16 years of education.

These are nearly all of the data management commands we will require for this course. I will introduce a few more commands as we go along, but these will come up a lot.
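select() is not used above because we rarely need it, but for completeness, here is a minimal sketch (using the df just created; the column names are from CPS1985):

```r
# Keep only the wage, education, and age columns of df
df_small <- df %>%
  select(wage, education, age)
```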
Let's start with the basic question of examining the wages of those with a Bachelor's degree (BA = 1). We have to know some basic statistical commands and how to tell R what data to work with. The most basic commands are mean(), var(), sd(), and sum(). Suppose we want to know the average wage of degree holders; this is a conditional mean, the mean wage of those with a bachelor's degree. We can start a code chunk and work with the data frame called df we created above:
mean(df$wage[df$BA == 1], na.rm = T)
## [1] 12.02183
Let's break down this command. We called the function mean(). This function will return NA (no answer) if there are any missing data. This can occur, for example, if a respondent refuses to answer a survey question. Since this happens from time to time, we will use the option na.rm = T, which tells R to remove the missing (NA) values before computing (NA remove = TRUE). I always use it for mean() and the other basic functions.
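A tiny illustration of this behaviour, using a made-up vector (purely for demonstration):

```r
# A toy vector with one missing value
x <- c(10, 12, NA, 14)
mean(x)               # returns NA because of the missing value
mean(x, na.rm = TRUE) # drops the NA and returns 12
```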
Inside the function mean() we have to tell R what data to use. We tell it to use the data frame called df and the column called wage with the syntax df$wage.

Since we want a conditional mean, we also have to tell R what rows to use. We want to use all observations who have a bachelor's degree. These are denoted by the variable BA being one. Thus, we use the syntax [df$BA == 1]. Put together, the expression df$wage[df$BA == 1] says: use the column wage in the data frame df, and use only the rows where the variable BA is equal to 1.
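For contrast, the same syntax with the condition changed (or dropped) gives the mean wage of non-degree holders, or of everyone; a quick sketch:

```r
# Conditional mean for those without a BA, and the unconditional mean for everyone
mean(df$wage[df$BA == 0], na.rm = TRUE)
mean(df$wage, na.rm = TRUE)
```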
There are alternatives that you may use to achieve the same thing. It is not necessary to know them, but here they are for reference.
with(subset(df, BA == 1), mean(wage, na.rm = T))
## [1] 12.02183
We could also just subset the data frame to include only those with bachelor’s degrees using the “create and assign” approach above.
df_BA <- df %>%
  filter(BA == 1)
mean(df_BA$wage)
## [1] 12.02183
This creates a new data frame called df_BA.

When we just run the command mean(), it prints the answer to the console. However, we can choose to save the answer as an object. For example,
my_mean <- mean(df$wage[df$BA == 1], na.rm = T)
This code will print nothing to the screen, but running it will create an object, my_mean, in the environment. We can then simply print it to the console or use it in some other calculation.
my_mean # print to console
## [1] 12.02183
my_mean / (sd(df$wage[df$BA==1])/sqrt(sum(df$BA==1))) # t-ratio calculation
## [1] 22.63607
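The t-ratio line above packs several steps into one expression. A sketch that stores each piece as its own object (the names ba_sd, ba_n, and ba_se are mine, purely illustrative) may be easier to read:

```r
# Break the t-ratio calculation into named pieces
ba_sd <- sd(df$wage[df$BA == 1])  # standard deviation of wage among degree holders
ba_n  <- sum(df$BA == 1)          # number of degree holders
ba_se <- ba_sd / sqrt(ba_n)       # standard error of the sample mean
my_mean / ba_se                   # same t-ratio as above
```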
I want to demo some of these concepts in the context of the textbook example using the California schools data. We'll end up using this data a lot. It comes with the package AER, so it is also easy to load. The basic idea behind the data set is that we want to examine whether smaller class sizes improve students' performance. We want to think about this in a causal way, with an econometric model. But for now, we will just focus on a descriptive question: "Do we observe that test scores are higher in schools with smaller class sizes?"
We will also study a binary case, where classes can either be small or not:
\[
D_i = \left\{ \begin{array}{ll} 1 & \mbox{if class is small (treatment)};\\ 0 & \mbox{if class is not small (control)}.\end{array} \right.
\]
Let \(score\) denote a school's average performance on reading and math standardized exams. If average outcomes depend on class size, then we can write:
\[ E(score|D) = \mu_0 + \mu_1 D \]
and it follows that the expected difference in outcomes is just:
\[ E(score|D=1) - E(score|D=0) = \mu_1. \]
However, \(\mu_1\) is unknown to us because it is a population parameter. We can hope to estimate \(\mu_1\) by using sample means in place of the expectations in the equations above. Let's take a look at the data:
# load the textbook data
data('CASchools')
# data management
df <- CASchools %>%
  filter(complete.cases(.)) %>%          # keep only rows with no missing values
  mutate(score = (math + read) / 2,      # average test score
         str = students / teachers,      # student-teacher ratio
         D = ifelse(str < 19, 1, 0))     # dummy variable for small class
In the above code chunk, I have loaded the California schools data and filtered to include only rows with complete data (some schools did not provide test scores, so we remove them from the data). Then I created three variables: \(score\) which is just the average of the school’s reading and math test scores, \(str\) or the student-teacher ratio, and then a dummy variable called \(D\) that is equal to one if a school has less than 19 students per teacher (a “small class school”).
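Before computing anything, it can help to see how many schools fall into each group. A quick check (a sketch; output not shown here):

```r
# Count schools classified as small-class (D = 1) versus not (D = 0)
table(df$D)
```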
We want to use this data to estimate \(\hat{\mu}_1\). In order to do so, we need to compute two conditional sample means: the mean of \(score\) for small (\(D=1\)) and large (\(D=0\)) classes. Your job is to uncomment the lines (i.e., remove the # sign and fill in the ...), and then run the code chunk (press the 'play' button).
# Compute the sample mean of score for small (D=1) and large (D=0) classes
Y_1 <- mean(df$score[df$D == 1])  # mean score in small classes
Y_0 <- mean(df$score[df$D == 0])  # mean score in large classes
# Take their difference
mu_1 <- Y_1 - Y_0
mu_1
## [1] 8.7969
This calculation suggests that schools with small classes perform better; on average they score about 8.8 points higher on the standardized exams. However, we used sample means to compute this number, and we know that sample means have a sampling distribution because they are computed from random data. Thus, we want to ask: is 8.8 statistically different from zero? To answer this, we can compute the standard error of the difference in sample means. The formula is equation 3.19 of the textbook:
\[ SE(Y_1 - Y_0) = \sqrt{\frac{\sigma^2_{Y_1}}{N_{Y_1}} + \frac{\sigma^2_{Y_0}}{N_{Y_0}}} \]
This is calculated below:
# Compute the variance and standard error of the difference
var_1 <- var(df$score[df$D == 1]) / sum(df$D == 1)
var_0 <- var(df$score[df$D == 0]) / sum(df$D == 0)
se_difference <- sqrt(var_1 + var_0)
se_difference
## [1] 2.03339
Finally, if we want to ask whether we are confident that there really is a difference between the groups, taking into account sampling variation, we can compute a \(t\)-statistic for the null hypothesis
\[ H_0: \mu_1 = 0, \quad H_1: \mu_1 \neq 0. \]
# Compute t-statistic for H0: mu_1 = 0
t <- mu_1 / se_difference
t
## [1] 4.3262
Interpret the \(t\)-statistic above:
answer here …
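As an aside, base R's t.test() performs the same difference-in-means comparison, test statistic, and confidence interval in one call; a sketch for reference:

```r
# Welch two-sample t-test of mean score in small vs. large classes
t.test(score ~ D, data = df)
```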
The above calculations are a bit of a hassle to type. Fortunately, we can use a simple regression to get the answer for us. When you regress an outcome on a dummy variable, the coefficient on the dummy is simply the difference-in-means comparison.
We can do all the above steps in the following code:
lm(score ~ D, data = df) %>%
  coeftest(., sandwich)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 651.2452 1.0482 621.3163 < 2.2e-16 ***
## D 8.7969 2.0271 4.3397 1.791e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first line of code uses the function lm() to run a regression. The syntax is y_variable ~ x_variable, data = the data you want to use. The second line just outputs the results with the correct standard errors for inference. We'll use this code a lot in the course, but we'll leave the details for later. Examining the output, the coefficient on \(D\) is 8.79 as before, and the \(t\)-value is 4.33. We get all that in two lines of code. Plus, as we'll see, regression offers some additional flexibility that we'll use down the road.
In the above example, we calculated the difference in means between two groups. This is simply descriptive: we look at the data, compute the means, and report the results. But what if you were a decision maker in charge of a school board? You know from this data that schools with smaller class sizes do better. Should you spend a lot of money to reduce your jurisdiction's class sizes?
This is different from simply describing the data or making a prediction. It is a causal question: if you lower class sizes, will it improve performance? The answer depends on the interpretation of the comparison in means.
To answer this question, we need an econometric model. Say, \[ score_i = \beta_0 + \beta_1 D_i + u_i \] where \(score_i\) is the average of math and reading scores in school \(i\), and \(u_i\) is a term that captures unobserved features of schools that also influence scores (for example, whether the school is in a high-income area, its overall quality, or other things that could contribute to test scores). \(\beta_1\) is the causal effect of having smaller class sizes. It is the object of interest, because it answers an 'if-then' question about lowering class sizes.
The comparison of means analysis we just did gives us:
\[
\begin{align}
E(score|D=1) - E(score|D=0) &= \beta_0 + \beta_1 + E(u|D=1) - \left( \beta_0 + E(u|D=0)\right) \\
&= \beta_1 + \left(E(u|D=1) - E(u|D=0)\right)
\end{align}
\]
This says that the difference in means is equal to the causal effect that we want, \(\beta_1\), plus another term in brackets. This term has to do with the difference in average unobserved characteristics between schools that have large and small classes. We do not observe the \(u\) term, so we do not know whether this difference is zero or not.
State the assumption required about the relationship between \(u\) and \(D\) to interpret the difference in means as causal.

To interpret the difference in means as causal, we must assume that the unobserved factors (\(u_i\)) affecting test scores are unrelated, on average, to whether a school has smaller or larger class sizes (\(D_i\)).
Formally: \[ E(u_i | D_i = 1) = E(u_i | D_i = 0) \]
This means that schools with smaller and larger classes must be similar in all other respects (such as teacher quality, student background, resources, or community wealth). If this assumption is violated, for example if smaller classes are found in wealthier districts, then the estimated difference in means will be biased and cannot be interpreted causally.
In the California Schools data set, there is another variable called income, which is the median family income in the school district. Using the same methods as above, test whether schools with smaller class sizes are in higher-income districts. What do you think this tells you about the causal interpretation of the difference-in-means comparison in test scores between schools with smaller and larger classes above?
The regression of median family income on the small-class dummy shows that schools with smaller class sizes are located in significantly higher-income districts. The coefficient on D is positive and statistically significant, indicating that smaller-class districts have higher median family incomes on average. This finding suggests that the earlier observed difference in mean test scores between small- and large-class schools cannot be interpreted as purely causal. It likely reflects underlying socioeconomic differences—wealthier districts both have smaller class sizes and higher student performance—so class size alone may not explain the difference in test scores.
# Load packages
library(lmtest)
library(sandwich)
# Regress income on D (small class dummy)
model_income <- lm(income ~ D, data = df)
# Output results with robust standard errors
coeftest(model_income, vcov = vcovHC(model_income, type = "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.24516 0.29412 48.4325 < 2.2e-16 ***
## D 3.23741 0.90554 3.5751 0.000391 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1