General instructions

The .Rmd source for this document will be the template for your homework submission. You must submit your completed assignment as a single html document, uploaded to eLC by 11:59pm on August 27 using the filename sectiontime_lastname_hw1.html (e.g. 935_geddes_hw1.html). I will answer questions by email through Friday, August 25 at 5 pm.

Notes:

Part A: Bayes’ Rule

A rare disease affects approximately 1 in 10,000 people in a specific population. There is a diagnostic test available that is quite accurate, but not perfect. The test has the following characteristics:

  1. Write down the definition of Bayes’ Rule.
  2. If someone tests positive for this disease, what is the probability they actually have the disease?
  3. If someone tests negative for this disease, what is the probability they don’t have the disease?

Answers

  1. P(A∣B) = (P(B∣A)*P(A)) / P(B)

P(A) is the probability of A occurring. P(B) is the probability of B occurring. P(A∣B) is a conditional probability: the probability of A occurring given that B is known to have occurred. P(B∣A) is a conditional probability: the probability of B occurring given that A is known to have occurred.

For example, if A: has rare disease and B: tests positive, then P(A) is the probability of having the rare disease. P(B) is the probability of testing positive. P(A∣B) is the probability of having the rare disease given that you have tested positive. P(B∣A) is the probability of testing positive given that you have the rare disease.

  2. ≈ 0.0192

Assumption: A: has rare disease, B: tests positive

P(B) = (.98 × .0001) + (.005 × .9999) = .0050975

P(A∣B) = P(B∣A)P(A) / P(B) = (.98)(.0001) / .0050975 ≈ .0192

  3. ≈ 0.999998

Assumption: A: doesn’t have rare disease, B: tests negative

P(B) = (.995 × .9999) + (.02 × .0001) = .99490205

P(A∣B) = P(B∣A)P(A) / P(B) = (.995)(.9999) / .99490205 ≈ .999998
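
The two posterior probabilities can be checked numerically. The sketch below is only an illustration: it assumes the test characteristics implied by the numbers used in the answers (98% sensitivity and a 0.5% false-positive rate), and the variable names are made up for this example. It keeps P(B) unrounded; rounding P(B) to .005 before dividing would give .0196 instead of roughly .0192.

```r
# Assumed inputs, taken from the numbers used in the answers above
prevalence  <- 1 / 10000        # P(disease)
sensitivity <- 0.98             # P(positive | disease)
false_pos   <- 0.005            # P(positive | no disease)
specificity <- 1 - false_pos    # P(negative | no disease)

# Law of total probability: overall chance of each test result
p_positive <- sensitivity * prevalence + false_pos * (1 - prevalence)
p_negative <- 1 - p_positive

# Bayes' rule for each question
p_disease_given_pos <- sensitivity * prevalence / p_positive
p_healthy_given_neg <- specificity * (1 - prevalence) / p_negative

round(p_positive, 7)            # 0.0050975
round(p_disease_given_pos, 4)   # 0.0192
round(p_healthy_given_neg, 6)   # 0.999998
```

Even with an accurate test, a positive result leaves the probability of disease below 2% because the disease is so rare that false positives far outnumber true positives.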

Part B: Conditional Expectations

Suppose you are interested in the relationship between sales and your advertising budget. You know that \(E[Sales \mid Advertising] = \alpha + \beta \cdot Advertising\) (there is a linear relationship).

  1. Write down the Law of Iterated Expectations.

  2. If \(E[Advertising] = 100\), what is \(E[Sales]\)?

Answers

  1. E[E[Y∣X]] = E[Y]

This states that the expected value of the conditional expectation of Y given X equals the unconditional expected value of Y: averaging E[Y∣X] over the distribution of X recovers E[Y].

  2. α + 100β

E[Sales] = E[E[Sales∣Advertising]] = E[α + β·Advertising] = α + β·E[Advertising] = α + 100β
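
A small simulation can illustrate the result. This is a sketch under stated assumptions, not part of the assignment: the values of `alpha` and `beta`, the Normal distribution of advertising, and the noise term are all hypothetical choices made only so that E[Advertising] = 100 and the conditional mean is linear.

```r
set.seed(1)                      # arbitrary seed for reproducibility

alpha <- 50                      # hypothetical intercept
beta  <- 2                       # hypothetical slope
n     <- 100000                  # number of simulated observations

# Advertising has mean 100 (the distribution itself is an assumption)
advertising <- rnorm(n, mean = 100, sd = 20)

# Sales has conditional mean alpha + beta * advertising, plus mean-zero noise
sales <- alpha + beta * advertising + rnorm(n, mean = 0, sd = 10)

simulated_mean   <- mean(sales)        # estimate of E[Sales]
theoretical_mean <- alpha + beta * 100 # alpha + 100 * beta = 250

c(simulated = simulated_mean, theoretical = theoretical_mean)
```

The simulated average of sales should land very close to α + 100β, whatever distribution advertising follows, as long as its mean is 100.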

Part C: Sampling Experiments

This problem asks you to conduct a series of sampling experiments and was adapted from the Mastering Metrics online materials. You should modify the below code.

First, draw 500 random samples each with a sample size of 8 from a random number generator for a standard Normal distribution. Then, increase the sample size to 32 then increase the sample size to 128.

  1. Plot histograms of the sampling distributions of (i) the sample mean and (ii) the sample variance, for each of these three sample sizes.

  2. Your experiments produce “samples of sample means.” What would you expect the mean and the variance of the sample means to be based on statistical theory?

  3. Compute the mean and variance of the sample means generated by each experiment and compare them to what you predicted in part 2.

## [1] 0.001030307
## [1] 0.1278419

## [1] -0.01102805
## [1] 0.02861983

## [1] -0.003069978
## [1] 0.006837496

Answers

  1. See corresponding sections of code above

  2. According to statistical theory, the mean of the sample means should be close to the population mean. Here we are sampling from a standard Normal distribution (mean = 0), so the mean of the sample means should be close to 0. The variance of the sample means should be smaller than the variance of the original distribution: it equals the population variance divided by the sample size. For a standard Normal distribution the population variance is 1, so the variance of the sample means should fall as the sample size increases from 8 to 32 to 128. The predicted variances of the sample means are therefore 1/8, 1/32, and 1/128.

  3. The means of the sample means are all close to 0, as expected: 0.0010, -0.0110, and -0.0031 for sample sizes 8, 32, and 128. The variances of the sample means fall as the sample size rises (0.1278, 0.0286, and 0.0068) and track the theoretical values 1/8 = 0.125, 1/32 ≈ 0.0313, and 1/128 ≈ 0.0078. Both results are consistent with my expectations.
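
The code that produced the output above is not shown in this document. The following is a minimal sketch of how such an experiment could be run, not the original code: the seed, the object names (`results`, `summary_table`), and the plot layout are assumptions, so the exact numbers will differ from those printed above.

```r
set.seed(42)                         # arbitrary seed, not the one used above

n_reps       <- 500                  # number of samples per experiment
sample_sizes <- c(8, 32, 128)        # sample size for each experiment

# For each sample size, draw n_reps samples (one per row) from N(0, 1)
# and record each sample's mean and variance.
results <- lapply(sample_sizes, function(n) {
  draws <- matrix(rnorm(n_reps * n), nrow = n_reps, ncol = n)
  list(means = rowMeans(draws), vars = apply(draws, 1, var))
})
names(results) <- paste0("n", sample_sizes)

# Compare the simulated moments of the sample means with theory (0 and 1/n)
summary_table <- data.frame(
  n             = sample_sizes,
  mean_of_means = sapply(results, function(r) mean(r$means)),
  var_of_means  = sapply(results, function(r) var(r$means)),
  theory_var    = 1 / sample_sizes
)
print(summary_table)

# Histograms: sample means on the top row, sample variances on the bottom row
op <- par(mfrow = c(2, 3))
for (name in names(results)) {
  hist(results[[name]]$means, main = paste("Sample means,", name),
       xlab = "Sample mean")
}
for (name in names(results)) {
  hist(results[[name]]$vars, main = paste("Sample variances,", name),
       xlab = "Sample variance")
}
par(op)
```

Storing one sample per matrix row keeps the mean and variance calculations vectorized, and the `theory_var` column puts the 1/n prediction next to the simulated variance for each sample size.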

Part D: Hypothesis Testing

This question explores hypothesis testing. It uses data from Project STAR, a randomized control trial that evaluated the effects of class size on student outcomes. We will return to Project STAR later in the course.

  1. Check out the contents of the data set using the glimpse function. How many observations and variables does it contain? What is the gender of the third child in the sample? What is their kindergarten math test score and to what sort of “STAR” class were they assigned while in kindergarten?

  2. Create a total test score variable (the sum of the reading and math score) for each grade, K-3. Plot the scorek distributions for the small and regular classes using ggplot. Write a sentence that summarizes what they indicate.

  3. Use the st function to construct a table of test-score means and standard deviations for each class type (“small”, “regular” and “regular+aide”) by grade, based on the kindergarten assignment.

  4. Write a sentence that states the difference between small and regular-class average scores for kindergartners. Write a sentence that expresses this difference in terms of the standard deviation of regular-class scores.

  5. Using a t-test, evaluate whether these two means are statistically different from one another. Define what the null and alternative hypotheses that you are testing are. Write a sentence that interprets the p-value from this test.

## Rows: 11,598
## Columns: 47
## $ gender      <fct> female, female, female, male, male, male, male, female, ma…
## $ ethnicity   <fct> afam, cauc, afam, cauc, afam, cauc, afam, cauc, cauc, cauc…
## $ birth       <yearqtr> 1979 Q3, 1980 Q1, 1979 Q4, 1979 Q4, 1980 Q1, 1979 Q3, …
## $ stark       <fct> NA, small, small, NA, regular+aide, NA, NA, NA, NA, NA, re…
## $ star1       <fct> NA, small, small, NA, NA, NA, NA, regular+aide, regular, r…
## $ star2       <fct> NA, small, regular+aide, NA, NA, regular, NA, regular+aide…
## $ star3       <fct> regular, small, regular+aide, small, NA, regular, regular+…
## $ readk       <int> NA, 447, 450, NA, 439, NA, NA, NA, NA, NA, 448, 447, 431, …
## $ read1       <int> NA, 507, 579, NA, NA, NA, NA, 475, NA, 651, 651, 533, 558,…
## $ read2       <int> NA, 568, 588, NA, NA, NA, NA, 573, NA, 596, 614, 608, 608,…
## $ read3       <int> 580, 587, 644, 686, NA, 644, NA, 599, NA, 626, 641, 665, 5…
## $ mathk       <int> NA, 473, 536, NA, 463, NA, NA, NA, NA, NA, 559, 489, 454, …
## $ math1       <int> NA, 538, 592, NA, NA, NA, NA, 512, NA, 532, 584, 545, 553,…
## $ math2       <int> NA, 579, 579, NA, NA, NA, NA, 550, NA, 590, 639, 603, 579,…
## $ math3       <int> 564, 593, 639, 667, NA, 648, NA, 583, NA, 618, 684, 648, 5…
## $ lunchk      <fct> NA, non-free, non-free, NA, free, NA, NA, NA, NA, NA, non-…
## $ lunch1      <fct> NA, free, NA, NA, NA, NA, NA, non-free, non-free, non-free…
## $ lunch2      <fct> NA, non-free, non-free, NA, NA, non-free, NA, non-free, no…
## $ lunch3      <fct> free, free, non-free, non-free, NA, non-free, free, non-fr…
## $ schoolk     <fct> NA, rural, suburban, NA, inner-city, NA, NA, NA, NA, NA, r…
## $ school1     <fct> NA, rural, suburban, NA, NA, NA, NA, rural, rural, rural, …
## $ school2     <fct> NA, rural, suburban, NA, NA, rural, NA, rural, rural, rura…
## $ school3     <fct> suburban, rural, suburban, rural, NA, rural, inner-city, r…
## $ degreek     <fct> NA, bachelor, bachelor, NA, bachelor, NA, NA, NA, NA, NA, …
## $ degree1     <fct> NA, bachelor, master, NA, NA, NA, NA, master, master, bach…
## $ degree2     <fct> NA, bachelor, bachelor, NA, NA, bachelor, NA, master, bach…
## $ degree3     <fct> bachelor, bachelor, bachelor, bachelor, NA, bachelor, bach…
## $ ladderk     <fct> NA, level1, level1, NA, probation, NA, NA, NA, NA, NA, lev…
## $ ladder1     <fct> NA, level1, probation, NA, NA, NA, NA, apprentice, level1,…
## $ ladder2     <fct> NA, apprentice, level1, NA, NA, notladder, NA, level1, lev…
## $ ladder3     <fct> level1, apprentice, level1, level1, NA, level1, notladder,…
## $ experiencek <int> NA, 7, 21, NA, 0, NA, NA, NA, NA, NA, 16, 5, 8, 17, NA, NA…
## $ experience1 <int> NA, 7, 32, NA, NA, NA, NA, 8, 13, 7, 11, 15, 0, 5, NA, 17,…
## $ experience2 <int> NA, 3, 4, NA, NA, 13, NA, 13, 6, 8, 31, 14, 9, NA, 4, 28, …
## $ experience3 <int> 30, 1, 4, 10, NA, 15, 17, 23, 8, 8, 7, 14, 8, NA, 19, 13, …
## $ tethnicityk <fct> NA, cauc, cauc, NA, cauc, NA, NA, NA, NA, NA, cauc, cauc, …
## $ tethnicity1 <fct> NA, cauc, afam, NA, NA, NA, NA, cauc, cauc, cauc, cauc, ca…
## $ tethnicity2 <fct> NA, cauc, afam, NA, NA, cauc, NA, afam, cauc, cauc, cauc, …
## $ tethnicity3 <fct> cauc, cauc, cauc, cauc, NA, cauc, afam, cauc, cauc, cauc, …
## $ systemk     <fct> NA, 30, 11, NA, 11, NA, NA, NA, NA, NA, 35, 41, 4, 11, NA,…
## $ system1     <fct> NA, 30, 11, NA, NA, NA, NA, 4, 40, 21, 35, 41, 4, 11, NA, …
## $ system2     <fct> NA, 30, 11, NA, NA, 6, NA, 4, 40, 21, 35, 41, 4, NA, 17, 2…
## $ system3     <fct> 22, 30, 11, 6, NA, 6, 11, 4, 40, 21, 35, 41, 4, NA, 17, 20…
## $ schoolidk   <fct> NA, 63, 20, NA, 19, NA, NA, NA, NA, NA, 69, 79, 5, 16, NA,…
## $ schoolid1   <fct> NA, 63, 20, NA, NA, NA, NA, 5, 77, 50, 69, 79, 5, 16, NA, …
## $ schoolid2   <fct> NA, 63, 20, NA, NA, 8, NA, 5, 77, 50, 69, 79, 5, NA, 41, 4…
## $ schoolid3   <fct> 54, 63, 20, 8, NA, 8, 31, 5, 77, 50, 69, 79, 5, NA, 41, 48…
## [1] 47
## [1] 11598
## [1] female
## Levels: male female
## [1] 536
## [1] small
## Levels: regular small regular+aide
## Warning: Removed 351 rows containing non-finite values (`stat_density()`).

## Warning: Removed 1299 rows containing non-finite values (`stat_density()`).

## Warning: Removed 1813 rows containing non-finite values (`stat_density()`).

## Warning: Removed 2110 rows containing non-finite values (`stat_density()`).
Test scores by class type (grouped by stark)

            regular              small                regular+aide
Variable    N     Mean   SD      N     Mean   SD      N     Mean   SD
scorek      2005   918   73      1738   932   76      2043   918   71
score1      1456  1057   91      1339  1076   95      1503  1054   91
score2      1201  1179   83      1080  1189   85      1183  1175   83
score3      1047  1247   70       937  1258   73      1021  1247   73
## 
##  Welch Two Sample t-test
## 
## data:  scorek by stark
## t = -5.6635, df = 3616, p-value = 1.598e-08
## alternative hypothesis: true difference in means between group regular and group small is not equal to 0
## 95 percent confidence interval:
##  -18.710595  -9.087394
## sample estimates:
## mean in group regular   mean in group small 
##              918.0429              931.9419

Answers

  1. The data set contains 47 variables and 11,598 observations. The third student is female, scored 536 on the kindergarten math test, and was assigned to a small class in kindergarten.

  2. The plots show the distribution of total test scores for students in small and regular classes at each grade level. At every grade level, students in small classes tend to have slightly higher scores than students in regular classes. The plots alone do not establish whether the difference is causal. After all, correlation is not causation.

  3. Please see table above.

  4. Among kindergartners, the average total score in small classes (932) is about 14 points higher than in regular classes (918). This difference is roughly 0.19 standard deviations of regular-class scores (14 / 73 ≈ 0.19).

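  5. Ho: The mean kindergarten total score is the same in small and regular classes (the difference in means is 0).

Ha: The mean kindergarten total score differs between small and regular classes (the difference in means is not 0).

The p-value of about 1.6e-08 means that, if the two population means were truly equal, a difference at least as large as the one observed (about 13.9 points) would occur with probability of roughly 1.6 in 100 million. The 95 percent confidence interval (-18.7 to -9.1) also excludes 0, so we reject the null and conclude that mean kindergarten scores in small and regular classes are statistically different.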
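
The effect size and the Welch test statistic can be approximated from the summary statistics alone. The sketch below is an illustration, not the code behind the output above: it uses the group means from the t-test output and the rounded standard deviations from the table, so its results are close to but not identical to the reported t = -5.6635 and df = 3616. All variable names are made up for this example.

```r
# Summary statistics for scorek, copied from the output above
# (means from the t-test output, N and rounded SD from the table)
n_reg <- 2005; mean_reg <- 918.0429; sd_reg <- 73   # regular classes
n_sml <- 1738; mean_sml <- 931.9419; sd_sml <- 76   # small classes

# Difference in means, in points and in regular-class standard deviations
score_gap   <- mean_sml - mean_reg
effect_size <- score_gap / sd_reg

# Welch two-sample t statistic (regular minus small, as in the output above)
var_term_reg <- sd_reg^2 / n_reg
var_term_sml <- sd_sml^2 / n_sml
std_error    <- sqrt(var_term_reg + var_term_sml)
t_stat       <- (mean_reg - mean_sml) / std_error

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df <- (var_term_reg + var_term_sml)^2 /
  (var_term_reg^2 / (n_reg - 1) + var_term_sml^2 / (n_sml - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = welch_df)

round(c(gap = score_gap, effect_size = effect_size,
        t = t_stat, df = welch_df), 3)
p_value
```

The small discrepancy from the reported test statistic comes from using standard deviations rounded to whole points. The conclusion is unchanged: the p-value is far below any conventional significance level.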