Introduction: Reminders about R and Rmarkdown

Please make sure you have downloaded this file (pset2.rmd) to your computer and opened it in RStudio. By download, we do not mean you just clicked on it in your browser – we mean you have saved the actual file to a directory on your computer, and then opened it with RStudio. You should now be looking at the “raw” text of the .rmd file.

If you need to re-orient yourself, please review the introductory material that pset 1 began with describing how to include R code “chunks” into this .rmd file. Remember that when you “knit” the rmd file, only the code written into code “chunks” will be executed and have its results integrated into the output html file. For example, the code chunk below provides a summary of the built-in dataset called cars. Take a look at the .rmd code that produces it, then click “knit” and see how it shows up in the outputted html file:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Please also remember that you will want to use the console to “try out” code to get it working. Once you get it working, copy the code that worked (not the results) over into a code chunk in your rmd. Remember that the code within your rmd file has to be self-contained and include all the steps – your rmd file will not “remember” what you did on your own in the console. When you click knit, it can only execute the code that was present in the rmd.

Finally, it is best to work with small amounts of code at a time: get some code working, copy it into the rmd as a code chunk, write your text answer (outside the code chunk) if needed, and check that the file will still knit properly. Do not proceed to answer more questions until you get the first bit working. This will save you huge headaches. While we were generous in grading the first pset for people whose rmd files would not knit, we will not be so generous this time.

Question 1

We first want to install the package “RCurl” on your system. We just want to do this once from the console rather than in your .rmd file, because it downloads files and adds them to your system. So go to your console and type install.packages("RCurl").

Now we want to load this package and use it by adding a code chunk to your rmd with code: library("RCurl"). You need to add this code in a proper code chunk, but it will not work right until you have installed the RCurl library from the console, as instructed above!

Question 1.1. Explain the difference between installing a package (using install.packages) and loading it (using library).

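The command install.packages downloads a package and installs it on your system; this only needs to be done once. The command library loads an already-installed package into the current R session (or into the rmd when it knits), so that its functions are available to use. It has to be run in every new session.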
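
A minimal sketch of the two steps, assuming it is run interactively. The helper name `is_installed` is hypothetical (it is not part of R or RCurl); `requireNamespace()`, `install.packages()`, and `library()` are base R functions. The install and load lines are left commented out so the sketch runs without an internet connection.

```r
# Hypothetical helper: TRUE if the named package is already installed.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so it is always installed.
stats_ready <- is_installed("stats")
print(stats_ready)

# One-time step, run from the console (needs an internet connection):
# if (!is_installed("RCurl")) install.packages("RCurl")

# Every-session step, placed in a code chunk of the .rmd:
# library("RCurl")
```

Keeping the install step out of the .rmd means knitting never tries to download anything; only the `library()` call needs to live in a chunk.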

Next we are going to load some data via the internet. I will give you the code:

library("RCurl")

## Loading required package: bitops

myurl=getURL("https://cdn.rawgit.com/kosukeimai/qss/master/CAUSALITY/resume.csv")
dat=read.csv(text=myurl)

Question 1.2. What variable name did the code above give to the data you loaded?
Variable name: dat

Question 1.3. Explore the data you just loaded. Specifically, (a) use the head function to get a look at the data. (b) Use the names or ls function to list the different variables; (c) use the dim function to determine how many observations there are.

head(dat)

##   firstname    sex  race call
## 1   Allison female white    0
## 2   Kristen female white    0
## 3   Lakisha female black    0
## 4   Latonya female black    0
## 5    Carrie female white    0
## 6       Jay   male white    0

names(dat)

## [1] "firstname" "sex"       "race"      "call"

dim(dat)

## [1] 4870    4

Question 2: More with this dataset

For this section, you may want to read and work through Chapter 2 of the textbook. Everything you need to know is there!

We will work with data from a famous experiment by Marianne Bertrand and Sendhil Mullainathan (2004)¹. The textbook describes the paper as follows:

Does racial discrimination exist in the labor market? Or, should racial disparities in the unemployment rate be attributed to other factors such as racial gaps in educational attainment? To answer this question, two social scientists conducted the following experiment. In response to newspaper ads, the researchers sent out resumes of fictitious job candidates to potential employers. They varied only the names of the job applicants while leaving the other information of the resumes unchanged. For some resumes, stereotypically black-sounding names such as Lakisha Washington or Jamal Jones were used, whereas other resumes contained stereotypically white-sounding names such as Emily Walsh or Greg Baker. The researchers then compared the call back rates between these two groups and found that the resumes with typical African American names received fewer callbacks than those with typical white names. The positions, to which the applications were sent, were in the sales, administrative support, clerical, and customer services job categories.

Look at Table 2.1 in the textbook to make sure you understand what each variable means.

Question 2.1 Construct a two-way table that categorizes the data by whether a resume had a black-sounding or white-sounding name, and by whether it got a call back or not from the employer. (Hint: look at Chapter 2, and use the table() command.)

table(dat$race, dat$call)

##        
##            0    1
##   black 2278  157
##   white 2200  235

race.call.tab<-table(dat$race, dat$call)

Question 2.2 (a) What is the overall proportion of resumes that received a call back from the employer?

sum(race.call.tab[, 2]) / nrow(dat)  # overall callback proportion

## [1] 0.08049281

(b) What is the proportion among resumes with black-sounding names?

(race.call.tab[1, 2]) / sum(race.call.tab[1, ])  # Black Proportion

## [1] 0.06447639

(c) What is the proportion among resumes with white-sounding names? There are many ways to determine this in R – you may use any method you like (but don’t compute it outside of R!).

(race.call.tab[2, 2]) / sum(race.call.tab[2, ])  # White Proportion

## [1] 0.09650924
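
As one of the "many ways" mentioned above, here is a minimal sketch using `prop.table()`. To stay self-contained it rebuilds the two-way table from the counts printed in Question 2.1 rather than downloading the data; the name `row.props` is an illustrative choice, not something from the assignment.

```r
# Rebuild the 2 x 2 table from the counts shown in Question 2.1.
race.call.tab <- matrix(
  c(2278, 157,
    2200, 235),
  nrow = 2, byrow = TRUE,
  dimnames = list(race = c("black", "white"), call = c("0", "1"))
)

# Row proportions: each row sums to 1, so column "1" is the callback rate by group.
row.props <- prop.table(race.call.tab, margin = 1)
print(row.props[, "1"])

# Overall callback rate: total callbacks over total resumes.
overall.prop <- sum(race.call.tab[, "1"]) / sum(race.call.tab)
print(overall.prop)
```

With the full data frame loaded, `tapply(dat$call, dat$race, mean)` should give the same two group proportions, because the mean of a 0/1 variable is the proportion of ones.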

Question 2.3 Conduct a z-test for the difference in sample means to determine whether these two proportions are really different or not. Use the four step process outlined during lecture. Be sure to show your work, including how you get the z-test, how you determine whether the difference is significant or not, and compute an actual p-value.

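# Callback proportion in each group
black.prop<-(race.call.tab[1, 2]) / sum(race.call.tab[1, ])
white.prop<-(race.call.tab[2, 2]) / sum(race.call.tab[2, ])
# Variance of the difference in proportions: p(1-p)/n for each group, summed
bw.var<-((black.prop*(1-black.prop))/2435)+((white.prop*(1-white.prop))/2435)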

pnorm(2.90134)

## [1] 0.9981421

1-pnorm(2.90134)

## [1] 0.001857852

2*(1-pnorm(2.90134))

## [1] 0.003715705

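Because the z-statistic lies beyond the critical value (1.96 at the 5% level), we reject the null hypothesis H0 that the proportion of resumes with black-sounding names that received a callback is equal to the proportion of resumes with white-sounding names that received a callback. The two-sided p-value of about 0.0037 is below 0.05, so there is a statistically significant difference between the two groups.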
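
A minimal sketch of the calculation, rebuilt from the counts in the two-way table of Question 2.1 (157 of 2,435 and 235 of 2,435). It assumes the unpooled standard error, which reproduces the variance 6.058086e-05 that appears in Question 2.5. With that standard error the z-statistic comes out near 4.1 rather than the 2.90134 passed to pnorm above; either value lies beyond 1.96 and leads to the same rejection. The variable names are illustrative.

```r
# Counts taken from the two-way table in Question 2.1.
n.black <- 2435; calls.black <- 157
n.white <- 2435; calls.white <- 235

black.prop <- calls.black / n.black
white.prop <- calls.white / n.white

# Unpooled variance of the difference in proportions.
bw.var <- black.prop * (1 - black.prop) / n.black +
  white.prop * (1 - white.prop) / n.white

# z-statistic: estimated difference divided by its standard error.
z.stat <- (white.prop - black.prop) / sqrt(bw.var)

# Two-sided p-value from the standard normal distribution.
p.value <- 2 * (1 - pnorm(abs(z.stat)))

print(z.stat)
print(p.value)
```

A pooled standard error (using the overall callback rate for both groups) is another common choice under the null; with groups of equal size it gives a very similar z here.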

Question 2.3
Suppose for a moment that there was no real difference between the two groups in their probability of getting a call. Under that assumption, what is the expected distribution of the z-statistic you computed? That is, if you could repeat the experiment over and over again, how would you expect the z-statistics you compute to vary? In your answer, specify (a) the shape of the distribution; (b) the center of the distribution; and (c) the variance and standard error of the distribution. (Be sure you answer these questions with respect to the z-statistic, not with respect to the difference-in-proportions itself.)

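If there were no real difference between the two groups, the z-statistic would be centered at zero. (a) Shape: approximately normal (bell-shaped), by the central limit theorem. (b) Center: 0. (c) Variance: 1, so the standard error is also 1; that is, the z-statistic approximately follows a standard normal distribution. For a discrete variable, the shape of a distribution is given by its probability mass function. For example, for a variable X that takes the values 1 and 2 with probability 0.5 each, the expected value is: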

(1*.5)+(2*.5)

## [1] 1.5

The variance of this is

(((1-1.5)^2)*.5)+(((2-1.5)^2)*.5)

## [1] 0.25

The SD(X) is

sqrt(.25)

## [1] 0.5
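
The "repeat the experiment over and over" idea in this question can be sketched with a small simulation. This is an illustration under stated assumptions, not part of the assignment: the common callback probability `p.null = 0.08`, the number of repetitions `reps`, and the seed are all arbitrary choices; the group size 2,435 matches the data.

```r
set.seed(1)          # fixed seed so the sketch is reproducible
reps <- 2000         # number of simulated experiments (assumed)
n <- 2435            # resumes per group, as in the data
p.null <- 0.08       # assumed common callback probability under the null

# Simulate callback proportions for both groups with the same true probability.
p1 <- rbinom(reps, size = n, prob = p.null) / n
p2 <- rbinom(reps, size = n, prob = p.null) / n

# z-statistic for each simulated experiment.
se <- sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
z.sim <- (p1 - p2) / se

print(mean(z.sim))   # should be close to 0
print(var(z.sim))    # should be close to 1
```

Calling `hist(z.sim, breaks = 50)` on the result should show the familiar bell shape.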

Question 2.4 A friend asks you to explain the z-statistic. Explain (a) what the numerator of the z-statistic corresponds to, and (b) what the denominator corresponds to. Also, (c) do the best you can to explain why you use the denominator that you do. That is, what effect does dividing by that denominator have on the resulting statistic?

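(a) Numerator: the difference in proportions between the two groups.

(b) Denominator: the standard error of the difference in proportions.

(c) Dividing by the standard error expresses the quantity in the numerator as a number of standard errors away from zero. This puts the difference on a unit-free scale that can be compared with the standard normal distribution.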
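
A small numeric illustration of part (c) of the question, offered as a sketch: rescaling the difference (here from proportions to percentage points) rescales the standard error by the same factor, so the ratio is unchanged. The variance 6.058086e-05 is the value used in Question 2.5; the variable names are illustrative.

```r
diff.prop <- 235 / 2435 - 157 / 2435   # difference in proportions
se.prop <- sqrt(6.058086e-05)          # standard error of that difference

# Same quantities expressed in percentage points instead of proportions.
diff.pct <- 100 * diff.prop
se.pct <- 100 * se.prop

z.prop <- diff.prop / se.prop
z.pct <- diff.pct / se.pct
print(c(z.prop, z.pct))   # the two ratios agree
```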

Question 2.5 Construct the 95% confidence interval for (a) the proportion of resumes with black-sounding names that received a call;

(157/2435)+(1.96*(sqrt((157/2435)*(1-(157/2435))/2435)))

## [1] 0.07423154

(157/2435)-(1.96*(sqrt((157/2435)*(1-(157/2435))/2435)))

## [1] 0.05472123

(b) the proportion of resumes with white-sounding names that received a call;

(235/2435)+(1.96*(sqrt((235/2435)*(1-(235/2435))/2435)))

## [1] 0.108238

(235/2435)-(1.96*(sqrt((235/2435)*(1-(235/2435))/2435)))

## [1] 0.08478046

and (c) the difference-in-means. Be sure to use the appropriate standard error for each of these.

bw.var<-6.058086e-05
((157-235)/2435)-(1.96*(sqrt(bw.var)))

## [1] -0.04728826

((157-235)/2435)+(1.96*(sqrt(bw.var)))

## [1] -0.01677745

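The 95% confidence interval for the difference in proportions (black minus white) runs from about -4.7 to -1.7 percentage points. Because the interval does not contain zero, it is consistent with rejecting the null hypothesis of no difference.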
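
The two single-proportion intervals above repeat the same formula four times; a minimal sketch of wrapping it in a function follows. The helper name `prop_ci` is hypothetical, and it uses the normal-approximation interval with `qnorm()` supplying the critical value (about 1.96 for 95%), so results can differ from the hand-typed 1.96 in the seventh decimal place.

```r
# Hypothetical helper: normal-approximation interval for one proportion.
prop_ci <- function(successes, n, level = 0.95) {
  p.hat <- successes / n
  se <- sqrt(p.hat * (1 - p.hat) / n)
  z.crit <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  c(lower = p.hat - z.crit * se, upper = p.hat + z.crit * se)
}

ci.black <- prop_ci(157, 2435)
ci.white <- prop_ci(235, 2435)
print(ci.black)
print(ci.white)
```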

Question 3

Recall that before we learned to test hypotheses using the z-statistic, we learned how we could do a two-sample test for a difference in means using randomization or “permutation” inference.

To review, begin with the difference-in-proportions: the proportion of resumes with black-sounding names that got a call, minus the proportion of resumes with white-sounding names that got a call. We know from above what this is for the actual data. However, if we were to “scramble” up the “race” variable so that resumes were randomly relabeled, then take a new difference-in-proportions using this incorrect version of the race variable, we would get a different outcome. It would be near \(0\), since we know that there can’t be any real relationship between “race” and “call” now that we messed up the “race” variable. But it would not be exactly \(0\), and this illustrates how large a difference from \(0\) we might get just by chance even when we know there is no real difference.

If you repeat this process thousands of times, re-scrambling the order of the “race” variable each time, you could see a distribution of possible difference in proportions that you would get under the null hypothesis of no real difference.

Borrowing code from the lecture slides or the textbook as you see fit, use this approach to compute a p-value for this example. Use at least 5000 iterations to get the distribution of difference-in-proportion estimates under the null.

Hint: you can “scramble” the order of the “race” variable with code such as:

race_scrambled=sample(x=dat$race, size=nrow(dat), replace=FALSE)

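getfakeDIM=function(call, race) {
  # Scramble the race labels that were passed in, so they are unrelated to call
  race_scrambled=sample(x=race, size=length(race), replace=FALSE)
  # Difference in callback proportions using the scrambled labels
  fakeDIM=(mean(call[race_scrambled=="black"]))-(mean(call[race_scrambled=="white"]))
  return(fakeDIM)
}

iters=5000
# Repeat the scrambling 5000 times to build the null distribution
fakeDIMs=replicate(n=iters, getfakeDIM(dat$call, dat$race))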

hist(fakeDIMs, breaks=50)

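The histogram represents the distribution of difference-in-proportions estimates under the null hypothesis of no real difference.

(By the way, note that the hint code above looks like a code chunk when you knit it, but it is not a code chunk because it does not have the “{r}” bit, and so no code is actually run.)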
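
The question asks for a p-value, which takes one more step after building the null distribution: compare the scrambled differences with the observed one. Below is a minimal self-contained sketch. It rebuilds the race and callback vectors from the counts in the two-way table of Question 2.1 instead of downloading the data; the names `getDIM`, `callback`, and `obsDIM`, and the seed, are illustrative assumptions.

```r
set.seed(2016)  # fixed seed so the sketch is reproducible

# Rebuild race and callback from the counts in the two-way table of Question 2.1.
race <- rep(c("black", "white"), each = 2435)
callback <- c(rep(1, 157), rep(0, 2278), rep(1, 235), rep(0, 2200))

# Difference in callback proportions, black minus white.
getDIM <- function(callback, race) {
  mean(callback[race == "black"]) - mean(callback[race == "white"])
}

obsDIM <- getDIM(callback, race)

# Null distribution: recompute the difference after scrambling the labels.
iters <- 5000
fakeDIMs <- replicate(iters, getDIM(callback, sample(race)))

# Two-sided p-value: share of scrambled differences at least as extreme as observed.
p.value <- mean(abs(fakeDIMs) >= abs(obsDIM))
print(obsDIM)
print(p.value)
```

With only 5000 iterations the estimated p-value may be exactly 0 if no scrambled difference is as extreme as the observed one; in that case it is safer to report it as less than 1/5000.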

Question 4 ²

According to the Centers for Disease Control and Prevention (CDC), 49.4% of American adults 18 years old and over meet their Physical Activity Guidelines for aerobic physical activity (see CDC). The owners of a gym want to know if the proportion of their members who meet these requirements is different from the national average. They took a random sample of 40 members, and 28 met the CDC’s guidelines. Use the four-step hypothesis testing approach described in class to answer their question, using the z-statistic. Be sure you construct your null hypothesis correctly for this problem. Also be sure to document each of those four steps clearly for credit.

Step 1: Check any necessary assumptions and write null and alternative hypotheses.

np: 40 × 0.494 = 19.76 > 10
nq: 40 × 0.506 = 20.24 > 10
H0: p = 0.494
H1: p ≠ 0.494

Step 2: Calculate the appropriate test statistic.

 ((28/40-.494)/sqrt((.494*(1-.494))/40))

## [1] 2.605904

Step 3: Check against the critical values, determine the p-value.

The z-statistic is 2.606, which is greater than the critical values of both 1.96 and 1.64; therefore it is significant at both the 0.05 and the 0.10 levels.

2*(1-pnorm(2.605904))

## [1] 0.009163214

Step 4 Decide whether you can accept the null or not, and state a meaningful conclusion.

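Therefore we reject the null hypothesis that 49.4% of the gym’s members meet the CDC’s guideline for aerobic physical activity, because the p-value of .009 is less than .05. The proportion of members who meet the guideline (70% in the sample) appears to differ from the national average.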
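
The four steps can be collected into one small function, sketched below. The helper name `one_prop_ztest` and the names of the list elements are hypothetical; the calculation is the same one-sample z-test used above, with the standard error computed under the null value `p0`.

```r
# Hypothetical helper: two-sided one-sample z-test for a proportion.
one_prop_ztest <- function(successes, n, p0) {
  p.hat <- successes / n
  se0 <- sqrt(p0 * (1 - p0) / n)       # standard error under the null
  z <- (p.hat - p0) / se0
  p.value <- 2 * (1 - pnorm(abs(z)))
  # np0 and nq0 are returned so the sample-size assumption can be checked.
  list(z = z, p.value = p.value, np0 = n * p0, nq0 = n * (1 - p0))
}

res <- one_prop_ztest(successes = 28, n = 40, p0 = 0.494)
print(res$z)
print(res$p.value)
```

Returning `np0` and `nq0` alongside the test result keeps the Step 1 assumption check (both above 10) next to the numbers it justifies.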

¹ Bertrand, Marianne and Sendhil Mullainathan (2004). “Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination.” American Economic Review, Vol. 94, No. 4, pp. 991–1013.
² Adapted from https://onlinecourses.science.psu.edu/

Problem Set 2 (Due 3 Feb 2016)

Nidirah Stephens, PS6

February 3, 2016
