Use this .rmd template to answer the 3 questions below. Once you have
done this, you can knit
the document to HTML
and upload this document to moodle/e-class. I will only
accept HTML documents, no exceptions.
Please read all of the assignment before attempting it.
This assignment is meant to be a tutorial on using R
markdown to “knit” together text, and statistical input and output.
There will be a short video that accompanies this document to help get
you started.
A basic ability to work with data, perform statistical analysis, and interpret statistical output is part of the learning objectives of this course. R is how we will perform the statistical analysis, and R Markdown is how we will present our analysis. Markdown is a markup language used to produce documents with a simple text syntax. Just some basics:
- Section headers are made with the pound sign (#).
- Lists can be enumerated (numbered) or itemized (bulleted).
- The spacing is important.
We can write in math by using the dollar sign notation. This is an inline equation \(Y_i = \beta_0 + \beta_1x_i + u_i\) and the double dollar sign notation for regular equations:
\[ Y_i = \beta_0 + \beta_1x_i + u_i \]
That is all the basic markdown syntax required for this course. The cheat sheet is a good reference.
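For reference, here is roughly what that syntax looks like in the raw .Rmd file (a small illustrative sketch):

```
# A section header
## A subsection header

1. An enumerated list item
2. Another enumerated item

- An itemized list item
- Another itemized item

An inline equation: $Y_i = \beta_0 + \beta_1 x_i + u_i$

A display equation:
$$ Y_i = \beta_0 + \beta_1 x_i + u_i $$
```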
There are two ways to input statistical commands in R Markdown. The main one is called a code chunk (the other is inline code). Code chunks are started with three ticks (beside the one key) followed by {r}, and then closed with three ticks. You can put them anywhere in the document. A code chunk is set up properly when it turns shaded gray in the markdown editor. Inside a code chunk, you are no longer writing markdown, but R. I will always use the tidyverse syntax in this course.
In terms of data management, there are three main commands to know: filter(), select(), and mutate(). The filter() command subsets rows with a logical argument. select() subsets columns by selecting variables; there will not be much need for this. Finally, we can create or alter variables using mutate(). I will demonstrate below. For the first couple of assignments, I will provide the code, so this is just a reference so that you can learn how the code works.
R is an object-oriented programming language. This means that we will create objects and manipulate them. Data is an object and, as we will see, so are regression models. Here is some basic usage:
# the pound sign is used to comment within the code; R ignores anything after a pound sign on a line
# Load the packages we need: AER for the data, tidyverse for the data management syntax
library(AER)
library(tidyverse)
# Load data
data('CPS1985')
# Create a data frame called df from CPS1985, and do some basic data management
df <- CPS1985 %>%
  filter(education > 7, wage < 30) %>%        # keep rows with education > 7 and wage < 30
  mutate(age_squared = age * age,             # create age squared
         BA = ifelse(education >= 16, 1, 0))  # dummy = 1 for 16+ years of education
In the above code, I did the following:

- Loaded the data set CPS1985 from the package AER.
- Created a new data frame called df from CPS1985 using the assignment operator <-; this is called the "create and assign" approach.
- Used filter() to keep only observations with more than 7 years of education and a wage below 30.
- Used mutate() to create two new variables: age_squared and a dummy variable BA equal to one for those with at least 16 years of education.

These are nearly all of the data management commands we will require for this course. I will introduce a few more commands as we go along, but these will come up a lot.
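select() is not used above because we rarely need it, but for completeness, here is a minimal sketch (using the df just created; the column names are from CPS1985):

```r
# Keep only the wage, education, and age columns of df
df_small <- df %>%
  select(wage, education, age)
```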
Let's start with the basic question of examining the wages of those with a Bachelor's degree (BA = 1). We have to know some basic statistical commands and how to tell R what data to work with. The most basic commands are mean(), var(), sd(), and sum(). Suppose we want to know the average wage of degree holders; this is a conditional mean, the mean wage of those with a bachelor's degree. We can start a code chunk and work with the data frame called df we created above:
mean(df$wage[df$BA == 1], na.rm = T)
## [1] 12.02183
Let's break down this command. We called the function mean(). This function will return NA (no answer) if there are any missing data. This can occur, for example, if a respondent refuses to answer a survey question. Since this happens from time to time, we will use the option na.rm = T, which tells R to remove the missing (NA) values before computing (NA remove = TRUE). I always use it for mean() and the other basic functions.
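A tiny illustration of this behaviour, using a made-up vector (purely for demonstration):

```r
# A toy vector with one missing value
x <- c(10, 12, NA, 14)
mean(x)               # returns NA because of the missing value
mean(x, na.rm = TRUE) # drops the NA and returns 12
```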
Inside the function mean() we have to tell R what data to use. We tell it to use the data frame called df and the column called wage with the syntax df$wage.

Since we want a conditional mean, we also have to tell R what rows to use. We want to use all observations who have a bachelor's degree. These are denoted by the variable BA being one. Thus, we use the syntax [df$BA == 1]. Put together, the expression df$wage[df$BA == 1] says: use the column wage in the data frame df, and use only the rows where the variable BA is equal to 1.
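For contrast, the same syntax with the condition changed (or dropped) gives the mean wage of non-degree holders, or of everyone; a quick sketch:

```r
# Conditional mean for those without a BA, and the unconditional mean for everyone
mean(df$wage[df$BA == 0], na.rm = TRUE)
mean(df$wage, na.rm = TRUE)
```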
There are alternatives that you may use to achieve the same thing. It is not necessary to know them, but here they are for reference.
with(subset(df, BA == 1), mean(wage, na.rm = T))
## [1] 12.02183
We could also just subset the data frame to include only those with bachelor’s degrees using the “create and assign” approach above.
df_BA <- df %>%
  filter(BA == 1)
mean(df_BA$wage)
## [1] 12.02183
This creates a new data frame called df_BA.

When we just run the command mean(), it prints the answer to the console. However, we can choose to save the answer as an object. For example,
my_mean <- mean(df$wage[df$BA == 1], na.rm = T)
This code will print nothing to the screen, but running it will create an object, my_mean, in the environment. We can then simply print it to the console or use it in some other calculation.
my_mean # print to console
## [1] 12.02183
my_mean / (sd(df$wage[df$BA==1])/sqrt(sum(df$BA==1))) # t-ratio calculation
## [1] 22.63607
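The t-ratio line above packs several steps into one expression. A sketch that stores each piece as its own object (the names ba_sd, ba_n, and ba_se are mine, purely illustrative) may be easier to read:

```r
# Break the t-ratio calculation into named pieces
ba_sd <- sd(df$wage[df$BA == 1])  # standard deviation of wage among degree holders
ba_n  <- sum(df$BA == 1)          # number of degree holders
ba_se <- ba_sd / sqrt(ba_n)       # standard error of the sample mean
my_mean / ba_se                   # same t-ratio as above
```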
I want to demo some of these concepts in the context of the textbook example using the California schools data. We'll end up using this data a lot. It comes with the package AER, so it is also easy to load. The basic idea behind the data set is that we want to examine whether smaller class sizes improve students' performance. We want to think about this in a causal way, with an econometric model. But for now, we will just focus on a descriptive question: "Do we observe that test scores are higher in schools with smaller class sizes?"
We will also study a binary case, where classes can either be small or not:
\[
D_i = \left\{ \begin{array}{ll} 1 & \mbox{if class is small (treatment)};\\ 0 & \mbox{if class is not small (control)}.\end{array} \right.
\]
Let \(score\) denote a school's average performance on reading and math standardized exams. If average outcomes depend on class size, then we can write:
\[ E(score|D) = \mu_0 + \mu_1 D \]
and it follows that the expected difference in outcomes is just:
\[ E(score|D=1) - E(score|D=0) = \mu_1. \]
However, \(\mu_1\) is unknown to us because it is a population parameter. We can hope to estimate \(\mu_1\) by using sample means in place of the expectations in the equations above. Let's take a look at the data:
# load the textbook data
data('CASchools')
# data management
df <- CASchools %>%
  filter(complete.cases(.)) %>%          # keep only rows with no missing values
  mutate(score = (math + read) / 2,      # average test score
         str = students / teachers,      # student-teacher ratio
         D = ifelse(str < 19, 1, 0))     # dummy variable for small class
In the above code chunk, I have loaded the California schools data and filtered to include only rows with complete data (some schools did not provide test scores, so we remove them from the data). Then I created three variables: \(score\) which is just the average of the school’s reading and math test scores, \(str\) or the student-teacher ratio, and then a dummy variable called \(D\) that is equal to one if a school has less than 19 students per teacher (a “small class school”).
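Before computing anything, it can help to see how many schools fall into each group. A quick check (a sketch; output not shown here):

```r
# Count schools classified as small-class (D = 1) versus not (D = 0)
table(df$D)
```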
We want to use this data to estimate \(\hat{\mu}_1\). In order to do so, we need to compute two conditional sample means: the mean of \(score\) for small (\(D=1\)) and large (\(D=0\)) classes. Your job is to uncomment the lines (i.e., remove the # sign and fill in the ...), and then run the code chunk (press the 'play' button).
# Compute the sample mean of score for small (D=1) and large (D=0) classes
Y_1 <- mean(df$score[df$D == 1])  # mean score in small classes
Y_0 <- mean(df$score[df$D == 0])  # mean score in large classes
# Take their difference
mu_1 <- Y_1 - Y_0
mu_1
## [1] 8.7969
This calculation suggests that schools with small classes perform better; on average they score about 8.8 points higher on the standardized exams. However, we used sample means to compute this number, and we know that sample means have a sampling distribution because they are computed from random data. Thus, we want to ask: is 8.8 statistically different from zero? To answer this, we can compute the standard error of the difference in sample means. The formula is equation 3.19 of the textbook:
\[ SE(Y_1 - Y_0) = \sqrt{\frac{\sigma^2_{Y_1}}{N_{Y_1}} + \frac{\sigma^2_{Y_0}}{N_{Y_0}}} \]
This is calculated below:
# Compute the variance and standard error of the difference
var_1 <- var(df$score[df$D == 1]) / sum(df$D == 1)
var_0 <- var(df$score[df$D == 0]) / sum(df$D == 0)
se_difference <- sqrt(var_1 + var_0)
se_difference
## [1] 2.03339
Finally, if we want to ask whether we are confident that there really is a difference between the groups, taking into account sampling variation, we can compute a \(t\)-statistic for the null hypothesis
\[ H_0: \mu_1 = 0, \quad H_1: \mu_1 \neq 0. \]
# Compute t-statistic for H0: mu_1 = 0
t <- mu_1 / se_difference
t
## [1] 4.3262
Interpret the \(t\)-statistic above:
answer here …
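As an aside, base R's t.test() performs the same difference-in-means comparison, test statistic, and confidence interval in one call; a sketch for reference:

```r
# Welch two-sample t-test of mean score in small vs. large classes
t.test(score ~ D, data = df)
```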
The above calculations are a bit of a hassle to type. Fortunately, we can use a simple regression to get the answer for us. When you regress an outcome on a dummy variable, the coefficient on the dummy is simply the difference-in-means comparison.
We can do all the above steps in the following code:
lm(score ~ D, data = df) %>%
  coeftest(., sandwich)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 651.2452 1.0482 621.3163 < 2.2e-16 ***
## D 8.7969 2.0271 4.3397 1.791e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first line of code uses the function lm() to run a regression. The syntax is y_variable ~ x_variable, data = the data you want to use. The second line just outputs the results with the correct standard errors for inference. We'll use this code a lot in the course, but we'll leave the details for later. Examining the output, the coefficient on \(D\) is 8.79 as before, and the \(t\)-value is 4.33. We get all that in two lines of code. Plus, as we'll see, regression offers some additional flexibility that we'll use down the road.
In the above example, we calculated the difference in means between two groups. This is simply descriptive: we look at the data, compute the means, and report the results. But what if you were a decision maker in charge of a school board? You know from this data that schools with smaller class sizes do better. Should you spend a lot of money to reduce your jurisdiction's class sizes?
This is different from simply describing the data or making a prediction. It is a causal question: if you lower class sizes, will it improve performance? The answer depends on the interpretation of the comparison in means.
To answer this question, we need an econometric model. Say, \[ score_i = \beta_0 + \beta_1 D_i + u_i \] where \(score_i\) is the average of math and reading scores in school \(i\), and \(u_i\) is a term that captures unobserved features of schools that also influence scores (for example, whether the school is in a high-income area, its overall quality, or other things that could contribute to test scores). \(\beta_1\) is the causal effect of having smaller class sizes. It is the object of interest, because it answers an 'if-then' question about lowering class sizes.
The comparison of means analysis we just did gives us:
\[
\begin{align}
E(score|D=1) - E(score|D=0) &= \beta_0 + \beta_1 + E(u|D=1) - \left( \beta_0 + E(u|D=0)\right) \\
&= \beta_1 + \left(E(u|D=1) - E(u|D=0)\right)
\end{align}
\]
This says that the difference in means is equal to the causal effect that we want, \(\beta_1\), plus another term in brackets. This term has to do with the difference in average unobserved characteristics between schools that have large and small classes. We do not observe the \(u\) term, so we do not know whether this difference is zero or not.
State the assumption required about the relationship between \(u\) and \(D\) to interpret the difference in means as causal.

To interpret the difference in means as causal, we must assume that the unobserved factors (\(u_i\)) affecting test scores are unrelated, on average, to whether a school has smaller or larger class sizes (\(D_i\)).
Formally: \[ E(u_i | D_i = 1) = E(u_i | D_i = 0) \]
This means that schools with smaller and larger classes must be similar in all other respects (such as teacher quality, student background, resources, or community wealth). If this assumption is violated, for example if smaller classes are found in wealthier districts, then the estimated difference in means will be biased and cannot be interpreted causally.
In the California Schools data set, there is another variable called income, which is the median family income in the school district. Using the same methods as above, test whether schools with smaller class sizes are in higher-income districts. What do you think this tells you about the causal interpretation of the difference-in-means comparison in test scores between schools with smaller and larger classes above?
The regression of median family income on the small-class dummy shows that schools with smaller class sizes are located in significantly higher-income districts. The coefficient on D is positive and statistically significant, indicating that smaller-class districts have higher median family incomes on average. This finding suggests that the earlier observed difference in mean test scores between small- and large-class schools cannot be interpreted as purely causal. It likely reflects underlying socioeconomic differences—wealthier districts both have smaller class sizes and higher student performance—so class size alone may not explain the difference in test scores.
# Load packages
library(lmtest)
library(sandwich)
# Regress income on D (small class dummy)
model_income <- lm(income ~ D, data = df)
# Output results with robust standard errors
coeftest(model_income, vcov = vcovHC(model_income, type = "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.24516 0.29412 48.4325 < 2.2e-16 ***
## D 3.23741 0.90554 3.5751 0.000391 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1