This exam consists of 4 questions and 110 points total. You may either print this exam out, write your answers by hand, and submit a scan or picture, or type your answers, in which case you may use as much space as necessary. If your answer to a question needs more than the space provided, make sure the question number and part being answered are clearly designated. To be eligible for partial credit, you must show all of your work. If you don’t understand a question, think it is ambiguous, or think it contains a mistake, ask me for clarification.

You have up to 24 hours to submit your answers to me electronically once I email you the exam. This exam is designed to be completed in 1 hour and 30 minutes and you may use our normal meeting time if necessary. Please try to pace yourself by answering questions you are surer of first.

This exam contains 7 total pages. Please make sure that you have all the pages before you begin.

This exam is open-book and open-note, but you may not collude with anyone else once you begin.

Question 1

The website youth.gov describes diversion programs for juveniles involved in the justice system as follows: “Diversion programs are alternatives to initial or continued formal processing of youth in the juvenile delinquency system. The purpose of diversion programs is to redirect youthful offenders from the justice system through programming, supervision, and supports. Diverting youth who have committed minor offenses away from the system and towards community-based treatment and support options is a more appropriate response than confinement, and a more productive way of addressing and preventing future delinquency.” Consider some recidivism data from the California juvenile justice system, which shows the proportion of cases that were rearrested within one year of adjudication (OneYrRearrest) for adjudicated youth, conditional on either no diversion (i.e., probation or formal processing; diverted = 0) or diversion (diverted = 1).

Table 1


a)

(5 points) Calculate the probability of diversion.

The probability of diversion is the number of individuals in the diversion program divided by the total sample size:

# The number of individuals in diversion program:
diverted <-  11167

# The overall sample size:
sample <- 11167 + 6517

# Calculating the probability of diversion:
pro <- diverted / sample
pro
## [1] 0.6314748

Therefore, the probability of diversion is 0.6314748.


b)

(10 points) Suppose we randomly selected five individuals from the sample to do in-depth qualitative interviews about their experiences with the justice system. What is the probability that we get all five diverted cases from this draw?

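In this case, we assume that the draws are independent of each other (a reasonable approximation, since five draws barely change the composition of a sample of 17,684), allowing us to apply the special multiplication rule:

\[ P(\text{five diverted}) = P(\text{diverted})^5 \]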

pro^5
## [1] 0.1004107

From the above calculation, the answer is 0.1004107, meaning that the probability of getting all five diverted cases is about 0.1.
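The independence assumption can be checked against the exact probability of drawing five diverted youth without replacement. The following is a minimal sketch using the hypergeometric distribution; the variable names `p_exact` and `p_indep` are illustrative and not part of the original solution.

```r
# Exact probability when the five youths are drawn WITHOUT replacement.
# dhyper(x, m, n, k): probability of x "successes" in k draws from a pool
# containing m successes (diverted) and n failures (non-diverted).
p_exact <- dhyper(5, m = 11167, n = 6517, k = 5)

# Probability under the independence approximation.
p_indep <- (11167 / (11167 + 6517))^5

c(exact = p_exact, independent = p_indep)
```

Because the sample is so large relative to the five draws, the two values should agree to about three decimal places.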


c)

(15 points) Calculate the probability that an individual was diverted given they were rearrested within one year.

Here we are asking about: P(diverted|rearrest)

We know:

P(rearrest|diverted) = .201397
P(rearrest|non_diverted) = .3793156
P(diverted) = .6314748
P(non_diverted) = 1 - P(diverted) = .3685252

p4 <- 1 - pro
p4
## [1] 0.3685252

Applying the chain (general multiplication) rule, we get:

P(rearrest & diverted) = P(rearrest|diverted) * P(diverted) = .1271771
P(rearrest & non_diverted) = P(rearrest|non_diverted) * P(non_diverted) = .1397874

p5 <- .201397 * pro 
p5
## [1] 0.1271771
p6 <- .3793156 * (1-pro)
p6
## [1] 0.1397874

Since diverted and non-diverted are mutually exclusive and exhaustive, i.e., P(diverted) + P(non_diverted) = 1, the law of total probability gives:

P(rearrest) = P(rearrest & diverted) + P(rearrest & non_diverted) = .2669645

p7 <- p5 + p6
p7
## [1] 0.2669645

For this question, P(rearrest) can also be calculated directly as the sample-size-weighted average of the two conditional rearrest rates:

\[ \frac{11167 \times 0.201397 + 6517 \times 0.3793156} {6517 + 11167} \]

However, the chain rule is a more general way of solving the problem.

Finally, applying Bayes’ theorem:

\[ P(\text{diverted} \mid \text{rearrest}) = \frac{P(\text{rearrest} \mid \text{diverted}) \times P(\text{diverted})} {P(\text{rearrest})} \]

p8 <- .201397 * pro / p7
p8
## [1] 0.4763822

Therefore, the probability that an individual was diverted given they were rearrested within one year is: 0.4763822.
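The steps above can be collected into one small function. This is only a sketch: `posterior()` and its argument names are hypothetical helpers introduced for illustration, not part of the original solution.

```r
# Hypothetical helper: P(A | event) from the prior P(A) and the two
# conditional probabilities P(event | A) and P(event | not A).
posterior <- function(prior, p_event_given_a, p_event_given_not_a) {
  joint_a <- p_event_given_a * prior                # P(event & A)
  joint_not_a <- p_event_given_not_a * (1 - prior)  # P(event & not A)
  joint_a / (joint_a + joint_not_a)                 # Bayes' theorem
}

# A = diverted, event = rearrested within one year.
p_div_given_rearrest <- posterior(11167 / 17684, 0.201397, 0.3793156)
p_div_given_rearrest
```

The denominator inside the function is the law of total probability, so the result should match the value obtained step by step.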


d)

(10 points) Calculate the probability of rearrest in the sample.

We know:

P(rearrest|diverted) = .201397
P(rearrest|non_diverted) = .3793156

Since P(diverted) + P(non_diverted) = 1, the law of total probability (the same calculation used in part c) gives:

P(rearrest) = P(rearrest & diverted) + P(rearrest & non_diverted) = 0.2669645

Therefore, the sample probability of rearrest is: 0.2669645


e)

(15 points) Test the hypothesis that one-year recidivism is lower for diverted youth. Be sure to clearly state the null and alternative hypotheses you are testing. Use alpha=.05.

Null Hypothesis: One-year recidivism is the same for diverted and non-diverted youth (p_diverted = p_non_diverted).
Alternative Hypothesis: One-year recidivism is lower for diverted youth (p_diverted < p_non_diverted).

Now we know that:

# Sample Statistics for diverted youths
n_diverted <- 11167
mean_diverted <- .201397
sd_diverted <- .4010619

# Sample Statistics for non-diverted youths
n_nondiverted <- 6517
mean_nondiverted <- .3793156
sd_nondiverted <- .485254

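We know that the test statistic for a difference in two proportions (strictly a z statistic; with samples this large, the t and standard normal distributions are practically identical) is calculated by:

\[ t = \frac{\hat{p}_1 - \hat{p}_2} {\sqrt{\hat{p}(1-\hat{p}) \left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} \] where \(\hat{p}_1\) and \(n_1\) refer to non-diverted youth, \(\hat{p}_2\) and \(n_2\) refer to diverted youth, and \(\hat{p}\) is the pooled proportion:

\[ \hat{p} = \frac{\hat{p}_1 n_1+\hat{p}_2 n_2}{n_1+n_2} \]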

Therefore:

x_diff <- mean_nondiverted - mean_diverted
x_diff
## [1] 0.1779186
p_hat <- (mean_nondiverted*n_nondiverted + mean_diverted*n_diverted)  / (n_diverted + n_nondiverted) 
p_hat
## [1] 0.2669645
denominator <- sqrt(p_hat * (1-p_hat) * ((1/n_diverted) + (1/n_nondiverted)))
denominator
## [1] 0.006895843
t <- x_diff / denominator
t
## [1] 25.80085

Now that we know our t score is 25.80085, we can let R help us find our p-value:

p_value = pt(q=25.80085, df=6517+11167-2, lower.tail = FALSE)
p_value
## [1] 1.994405e-144

Our p-value is practically 0, far smaller than 0.05, so we reject the null hypothesis at the alpha = .05 level and conclude that one-year recidivism is significantly lower for diverted youth.
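As a cross-check, the same comparison can be run through R's built-in `prop.test()`. This is only a sketch: the rearrest counts are not given in the exam, so they are reconstructed here by rounding n times the reported proportion, which is an assumption.

```r
# Group sizes, non-diverted first so that "greater" matches our alternative.
n_group <- c(nondiverted = 6517, diverted = 11167)

# Assumption: rounding n * proportion recovers the original rearrest counts.
rearrested <- round(n_group * c(0.3793156, 0.201397))

# One-sided test of p_nondiverted > p_diverted, without continuity
# correction so that sqrt(X-squared) equals the pooled test statistic.
check <- prop.test(x = rearrested, n = n_group,
                   alternative = "greater", correct = FALSE)
z_check <- sqrt(unname(check$statistic))
z_check
check$p.value
```

The square root of the reported chi-squared statistic should agree closely with the hand-computed statistic, and the p-value leads to the same decision.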


f)

(10 points) Based on the results of the above test, can you argue that diversion caused a reduction in re-arrest among these youth as the website asserts? Explain your reasoning.

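No, we can’t. A two-sample test only tells us whether the two groups differ. In our case, we can only say there is a significant difference in re-arrest rates between diverted and non-diverted individuals. Youth were not randomly assigned to diversion: as the website’s own description notes, diversion is aimed at youth who have committed minor offenses, so diverted youth may have had a lower risk of re-arrest to begin with. Without random assignment or controls for such differences, we have no evidence to infer a causal relationship.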


g)

(10 points) Define a Type I error in the context of this problem. What are the potential consequences of making a Type I error? Be specific.

A Type I error for this problem is rejecting a true null hypothesis: in reality there is no difference in re-arrest rates between diverted and non-diverted youth, but we mistakenly conclude that diverted youth are re-arrested less often.

The consequence would be that we adopt or expand the diversion program and spend money on a policy that in fact does nothing to reduce the re-arrest rate.


h)

(10 points) Define a Type II error in the context of this problem. What are the potential consequences of making a Type II error? Be specific.

A Type II error for this problem is failing to reject a false null hypothesis: in reality diverted youth are re-arrested less often than non-diverted youth, but we mistakenly conclude that there is no difference.

The consequence is that we miss the chance to adopt a policy that could effectively reduce the re-arrest rate, which in turn could save a large amount of money spent on the prison and court system.



Question 2

Recall the NLSY data on intensive employment during high school we have used on several prior occasions. Another argument proponents of a ‘limit’ law cite is that intensive working may increase the rate of school absence among those who work too much, either by making them too tired to come to school, or by deemphasizing the importance of school attendance altogether. Suppose that on average, a typical student misses 3 days of school per year. Consider some output for the variable nabsnt, which is a count of the number of school days missed due to absence for those youth employed at least one week during the school year at age 16 and who worked more than 20 hours per week on average (i.e., int1wksc = 1):

Table 2


a)

(10 points) Is there sufficient evidence to support that intensive work actually is associated with increased absences? Clearly state your null and alternative hypotheses as well as your conclusion. Test for alpha=.02.

Null Hypothesis: Youths employed at least one week during the school year at age 16 and who worked more than 20 hours per week on average do NOT have more absent days than typical students (mu <= 3).

Alternative Hypothesis: Youths employed at least one week during the school year at age 16 and who worked more than 20 hours per week on average have more absent days than typical students (mu > 3).

To test our hypothesis, we employ the one-sample t-test:

We know:

n = 248
mean = 4.076475
sd = 4.893529

We also know the formula:

\[ t = \frac {\bar x - \mu} {\frac{s}{\sqrt{n}} } \]

Therefore: t = 3.464237

t <- (mean - 3) / (sd / sqrt(n))
t
## [1] 3.464237

The corresponding p value is: 0.0003133577, much smaller than .02.

p_value = pt(q=3.464237, df=247, lower.tail = FALSE)
p_value
## [1] 0.0003133577

Therefore, we reject the null hypothesis: we have sufficient evidence to say that intensive work is associated with increased absences at the alpha = .02 significance level.
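The same decision can be reached with the critical-value approach instead of the p-value. A minimal sketch, using illustrative variable names (`xbar`, `s`, `mu0`, `t_crit`) that are not part of the original solution:

```r
# Summary statistics from Table 2 and the hypothesized mean.
n <- 248
xbar <- 4.076475
s <- 4.893529
mu0 <- 3
alpha <- 0.02

# One-sample t statistic.
t_stat <- (xbar - mu0) / (s / sqrt(n))

# Upper-tail critical value for a one-sided test at level alpha.
t_crit <- qt(1 - alpha, df = n - 1)

# TRUE means we reject the null hypothesis.
t_stat > t_crit
```

Rejecting when the statistic exceeds the critical value is equivalent to rejecting when the one-sided p-value is below alpha.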


b)

(5 points) Define a p-value for the above test.

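The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. Equivalently, it is the smallest significance level at which the observed sample statistic would lead us to reject the null hypothesis.

In our case, the p-value can be described as follows: if intensive workers truly missed no more school than the typical student (3 days per year on average), we would observe a sample mean at least as large as 4.08 days in only about 0.03% of samples due to random sampling error.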



Question 3

A recent story published by the Associated Press (AP) highlighted the use of child labor (i.e., workers under the age of 18) in the agriculture industry. The author implicated a recent Department of Agriculture survey of a random sample of 2,500 agricultural workers drawn from the population of 1.2 million agricultural workers. The survey found that 236 of the sampled workers were under age 18.


a)

(10 points) Compute a 95% confidence interval for the proportion of agricultural workers who are under age 18. Interpret this interval.

We know that the confidence interval for proportions is given by:

\[ CI = \hat p \pm Z \cdot \sqrt{\frac{\hat p \cdot (1 - \hat p)} {n}} \]

Given:

p_hat <- 236/2500
z <- 1.96 # given 95% confidence interval
n <- 2500

Plugging these values into the formula (note that the standard error must be multiplied by z), we have:

upper_bound <- p_hat + z * sqrt(p_hat * (1-p_hat) / n)
upper_bound
## [1] 0.1058615
lower_bound <- p_hat - z * sqrt(p_hat * (1-p_hat) / n)
lower_bound
## [1] 0.08293853

Therefore, the 95% CI is: (0.08293853, 0.1058615). We can be 95% confident that the proportion of agricultural workers who are under age 18 is between about 8.3% and 10.6%.


b)

(5 points) Besides the proportion, we would also like an estimate of the number of workers under age 18. Based on the above calculations, provide a 95% confidence interval for this number.

Based on the proportion we calculated previously, the estimate of the number of workers under age 18 is given by:

number <- 1200000 *  p_hat
number
## [1] 113280

So our estimate based on the sample proportion is: 113280.

The 95% CI is obtained by scaling both bounds of the proportion interval by the population size:

bottom <- 1200000 * lower_bound
cap <- 1200000 * upper_bound 
bottom
## [1] 99526.24
cap
## [1] 127033.8

So the 95% confidence interval for this number is: (99526.24, 127033.8), or roughly 99,500 to 127,000 workers.
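The interval calculation for a proportion can also be wrapped in a small reusable function. This is a sketch only: `wald_ci()` is a hypothetical helper, and it uses `qnorm()` rather than the rounded value 1.96, so the last digits may differ slightly from a hand calculation.

```r
# Hypothetical helper: Wald confidence interval for a proportion.
wald_ci <- function(successes, n, conf_level = 0.95) {
  p_hat <- successes / n
  z <- qnorm(1 - (1 - conf_level) / 2)         # two-sided critical value
  margin <- z * sqrt(p_hat * (1 - p_hat) / n)  # z times the standard error
  c(lower = p_hat - margin, upper = p_hat + margin)
}

# Interval for the proportion of workers under 18.
ci_prop <- wald_ci(236, 2500)

# Scale to the population of 1.2 million workers.
ci_count <- 1200000 * ci_prop

ci_prop
ci_count
```

Scaling both bounds by the population size works because the count is just the proportion multiplied by a known constant.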