Getting Started

library(pacman)
p_load(tidyverse)

The first thing we need to do is read the CSV file into R. We can use the read_csv() function from the readr package (loaded with the tidyverse) to do this. Make sure to put your own file path inside the quotation marks, and remember to assign the result to a name so you can use the data frame later!

job_df <- read_csv("/Users/jenniputz/Downloads/001-data.csv")
head(job_df)
## # A tibble: 6 x 13
##   i_callback n_jobs n_expr i_military i_computer first_name sex   i_female
##        <dbl>  <dbl>  <dbl>      <dbl>      <dbl> <chr>      <chr>    <dbl>
## 1          0      3      7          0          0 Jamal      m            0
## 2          0      2     11          0          1 Tanisha    f            1
## 3          0      4     11          0          1 Carrie     f            1
## 4          0      4      8          0          1 Anne       f            1
## 5          0      4      8          0          1 Tremayne   m            0
## 6          0      3     12          0          1 Laurie     f            1
## # … with 5 more variables: i_male <dbl>, race <chr>, i_black <dbl>,
## #   i_white <dbl>, i_secretary <dbl>

How would we find the dimensions of the data frame? The dim() function gives us the number of rows and columns.

dim(job_df)
## [1] 4626   13

Let’s take a look at the variables. Using names() will print the names of the variables. We can also see the first few observations of a variable using the head() function. Writing dataframe$variable picks out a specific variable from the data frame.

names(job_df)
##  [1] "i_callback"  "n_jobs"      "n_expr"      "i_military"  "i_computer" 
##  [6] "first_name"  "sex"         "i_female"    "i_male"      "race"       
## [11] "i_black"     "i_white"     "i_secretary"
head(job_df$race, 10)
##  [1] "b" "b" "w" "w" "b" "w" "w" "w" "b" "w"

Analysis

What percentage of resumes received a callback? Since i_callback is a dummy variable, its mean gives us the share of observations where the variable equals 1. Multiply by 100 to read it as a percentage.

mean(job_df$i_callback)
## [1] 0.08084738
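
To see why the mean of a dummy variable is a share, here is a minimal sketch on a made-up vector. The name toy_callback and its values are hypothetical and are not taken from the data set.

```r
# Hypothetical 0/1 indicator: 1 = callback, 0 = no callback (toy values).
toy_callback <- c(0, 1, 0, 0, 1, 0, 0, 0, 0, 0)

# Share of ones, computed by counting the ones and dividing by the length...
share_by_count <- sum(toy_callback == 1) / length(toy_callback)

# ...and computed as the mean of the indicator. The two should agree.
share_by_mean <- mean(toy_callback)

share_by_count
share_by_mean
100 * share_by_mean  # the same share expressed as a percentage
```

Because every value is either 0 or 1, adding them up counts the ones, so dividing by the number of observations gives the share.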

Now, let’s calculate the percentage of callbacks by race. We need to filter by race then take the mean for each group.

filter(job_df, race == 'b')$i_callback # callback values for the rows where race equals 'b'
filter(job_df, race == 'w')$i_callback # callback values for the rows where race equals 'w'

To find the group means, we can nest the above inside the mean() function.

mean_b <- mean(filter(job_df, race == 'b')$i_callback)
mean_b
## [1] 0.06471096
mean_w <- mean(filter(job_df, race == 'w')$i_callback)
mean_w
## [1] 0.09705373
#the difference in means:
mean_b - mean_w
## [1] -0.03234277
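
As a side note, the group means can also be computed in one step. The sketch below uses base R's tapply() on a small made-up data frame. This is an alternative to the filter() approach above, and toy_df and its values are hypothetical.

```r
# Hypothetical toy data: four resumes per group (values are made up).
toy_df <- data.frame(
  race       = c("b", "b", "b", "b", "w", "w", "w", "w"),
  i_callback = c(0, 0, 1, 0, 1, 0, 1, 0)
)

# tapply() splits i_callback by race and applies mean() to each piece.
group_means <- tapply(toy_df$i_callback, toy_df$race, mean)
group_means

# Difference in means, in the same order as in the text (b minus w).
diff_means <- unname(group_means["b"] - group_means["w"])
diff_means

# Same number as subsetting each group by hand.
by_hand <- mean(toy_df$i_callback[toy_df$race == "b"]) -
  mean(toy_df$i_callback[toy_df$race == "w"])
by_hand
```

Computing all groups at once avoids repeating the same line for each group, which helps when there are more than two groups.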

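Now we need to perform a test to see if the group means are different. Recall from previous courses the formula for a difference-in-means z-test: \[ Z = \frac{\mu_b - \mu_w}{\sqrt{\mu_{all}(1-\mu_{all})\left(\frac{1}{n_b} + \frac{1}{n_w}\right)}} \] Here \(\mu_b\) and \(\mu_w\) are the callback rates in each group, \(\mu_{all}\) is the callback rate in the full sample, and \(n_b\) and \(n_w\) are the number of resumes in each group.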

Let’s calculate each piece. We already have \(\mu_b\) and \(\mu_w\), the group means, from the last step.

mean_all <- mean(job_df$i_callback)

# need the total number in each group. nrow() gives us the number of rows in the data frame
n_b <- filter(job_df, race == 'b') %>% nrow() 
n_w <- filter(job_df, race == 'w') %>% nrow()

# now build the Z-stat
z <- (mean_b - mean_w)/sqrt(mean_all*(1-mean_all)*(1/n_b + 1/n_w))

z
## [1] -4.034802

Our z-stat is -4.03. How do we find the p-value?

2*pnorm(abs(z), lower.tail = FALSE) # multiply by 2 because it is a two-sided test; take the absolute value of z so we always measure the upper tail
## [1] 5.464843e-05
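
One way to sanity-check a hand-built z-stat is to compare it with prop.test() from base R. Without the continuity correction, its chi-squared statistic for two groups should equal the square of the pooled z-stat, and the p-values should match. The counts below are hypothetical and are not taken from the data set.

```r
# Hypothetical counts (made up): callbacks and resumes sent in each group.
callbacks  <- c(b = 30, w = 50)
group_size <- c(b = 400, w = 400)

# Group rates and the pooled (overall) rate.
rate     <- callbacks / group_size
rate_all <- sum(callbacks) / sum(group_size)

# Same z formula as in the text, written out by hand.
z_toy <- (rate[["b"]] - rate[["w"]]) /
  sqrt(rate_all * (1 - rate_all) * sum(1 / group_size))
p_toy <- 2 * pnorm(abs(z_toy), lower.tail = FALSE)

# Built-in two-sample test of equal proportions, no continuity correction.
check <- prop.test(x = callbacks, n = group_size, correct = FALSE)

z_toy^2                   # squared z-stat
unname(check$statistic)   # chi-squared statistic from prop.test()
p_toy                     # two-sided p-value by hand
check$p.value             # p-value from prop.test()
```

If the two pairs of numbers disagree, the usual suspects are a missing absolute value, a one-sided tail, or leaving the continuity correction switched on.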

What are the null and alternative hypotheses? What can we conclude?

Next, let’s run some OLS regressions. First, let’s regress i_callback on i_black.

reg1 <- lm(i_callback ~ i_black, data = job_df)
summary(reg1)
## 
## Call:
## lm(formula = i_callback ~ i_black, data = job_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.09705 -0.09705 -0.06471 -0.06471  0.93529 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.097054   0.005665  17.131  < 2e-16 ***
## i_black     -0.032343   0.008004  -4.041 5.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2722 on 4624 degrees of freedom
## Multiple R-squared:  0.003519,   Adjusted R-squared:  0.003304 
## F-statistic: 16.33 on 1 and 4624 DF,  p-value: 5.408e-05

Our null hypothesis for this t-test is \(H_0: \beta_1 = 0\) and the alternative is \(H_a: \beta_1 \neq 0\). Our coefficient on i_black is -0.032. Notice that this is the difference in means we calculated earlier, and that the intercept is the mean for the group with i_black equal to 0. This means that, on average, resumes whose implied race is African American had a callback rate about 3 percentage points lower than resumes whose implied race is White. The t-stat is -4.041 with a corresponding p-value that is less than .05, so we can reject our null hypothesis at the 5% significance level.
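
The link between a regression on a single dummy variable and the group means can be checked on simulated data. In this minimal sketch, sim_df, d, and y are hypothetical names, and the probabilities are made up for illustration.

```r
set.seed(1)

# Hypothetical data: d is a 0/1 group dummy, y is a 0/1 outcome.
sim_df <- data.frame(d = rep(c(0, 1), each = 50))
sim_df$y <- rbinom(nrow(sim_df), size = 1,
                   prob = ifelse(sim_df$d == 1, 0.2, 0.4))

# OLS of the outcome on the dummy.
fit <- lm(y ~ d, data = sim_df)
coef(fit)

# Group means computed directly.
mean0 <- mean(sim_df$y[sim_df$d == 0])
mean1 <- mean(sim_df$y[sim_df$d == 1])

mean0           # matches the intercept
mean1 - mean0   # matches the coefficient on d
```

With only one dummy on the right-hand side, OLS has nothing to fit except the two group means, so the match is exact up to rounding.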

Let’s try a regression with an interaction term. We can use : to make an interaction term in the regression equation.

reg2 <- lm(i_callback ~ i_black + i_military + i_black:i_military, data = job_df)
summary(reg2)
## 
## Call:
## lm(formula = i_callback ~ i_black + i_military + i_black:i_military, 
##     data = job_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.09933 -0.09933 -0.06721 -0.06721  0.95745 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         9.933e-02  5.947e-03  16.703  < 2e-16 ***
## i_black            -3.212e-02  8.422e-03  -3.814 0.000138 ***
## i_military         -2.457e-02  1.953e-02  -1.258 0.208538    
## i_black:i_military -9.249e-05  2.706e-02  -0.003 0.997273    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2721 on 4622 degrees of freedom
## Multiple R-squared:  0.004233,   Adjusted R-squared:  0.003587 
## F-statistic:  6.55 on 3 and 4622 DF,  p-value: 0.0002044
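
For reference, R's formula syntax also has a * shorthand: x1 * x2 expands to x1 + x2 + x1:x2. The sketch below checks this on simulated data and shows one way to read the interaction coefficient when both variables are dummies. All names and values here are hypothetical.

```r
set.seed(2)

# Hypothetical data: two 0/1 dummies with 50 rows in each of the 4 cells.
cells  <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))
sim_df <- cells[rep(1:4, each = 50), ]
sim_df$y <- rbinom(nrow(sim_df), size = 1, prob = 0.3)

# Long form with ":" and short form with "*" describe the same model.
fit_long  <- lm(y ~ x1 + x2 + x1:x2, data = sim_df)
fit_short <- lm(y ~ x1 * x2, data = sim_df)
coef(fit_long)
coef(fit_short)

# Mean of y in each (x1, x2) cell; rows index x1, columns index x2.
cell <- tapply(sim_df$y, list(x1 = sim_df$x1, x2 = sim_df$x2), mean)

# With two dummies, the interaction coefficient is the difference between
# the x1 gap when x2 = 1 and the x1 gap when x2 = 0.
gap_diff <- (cell["1", "1"] - cell["0", "1"]) - (cell["1", "0"] - cell["0", "0"])
gap_diff
unname(coef(fit_long)["x1:x2"])
```

Read this way, an interaction coefficient close to zero says the gap between the x1 groups is about the same whether x2 is 0 or 1.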

That’s everything! Good luck - please come to office hours or email me if you have questions!