General instructions
The .Rmd source for this document will be the template
for your homework submission. You must submit your completed assignment
as a single html document, uploaded to eLC by 11:59pm on August
27 using the filename
sectiontime_lastname_hw1.html
(e.g. 935_geddes_hw1.html). I will answer questions by
email through Friday, August 25 at 5 pm.
Notes:
The setup chunk above indicates the packages required for this assignment.
Set eval to TRUE in the global options command to execute code chunks.
A rare disease affects approximately 1 in 10,000 people in a specific population. There is a diagnostic test available that is quite accurate, but not perfect. The test has the following characteristics:
Answers
P(A) is the probability of A occurring. P(B) is the probability of B occurring. P(A∣B) is a conditional probability, specifically the likelihood of A occurring given B is known/has occurred. P(B∣A) is a conditional probability, specifically the likelihood of B occurring given that A is known/has occurred.
For example, let A: has rare disease and B: tests positive. Then P(A) is the probability of having the rare disease, P(B) is the probability of testing positive, P(A∣B) is the probability of having the rare disease given that you have tested positive, and P(B∣A) is the probability of testing positive given that you have the rare disease.
Assumption: A: has rare disease, B: tests positive
P(B) = (.98 × .0001) + (.005 × .9999) = .0050975
P(A∣B) = P(B∣A)P(A) / P(B) = (.98)(.0001) / .0050975 ≈ .0192
Assumption: A: doesn’t have rare disease, B: tests negative
P(B) = 1 − [(.98 × .0001) + (.005 × .9999)] = .9949025
P(A∣B) = P(B∣A)P(A) / P(B) = (.995)(.9999) / .9949025 ≈ .999998
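The Bayes' rule arithmetic can be checked numerically. The following is a minimal sketch, assuming the figures used in the calculations above (prevalence 1/10000, P(positive | disease) = .98, P(positive | no disease) = .005); the variable names are illustrative and not part of the assignment template. It keeps P(B) unrounded.

```r
# Bayes' rule for the rare-disease test (figures taken from the answers above)
prevalence  <- 1 / 10000   # P(disease)
sensitivity <- 0.98        # P(positive | disease)
false_pos   <- 0.005       # P(positive | no disease)
specificity <- 1 - false_pos

# Law of total probability: P(positive)
p_positive <- sensitivity * prevalence + false_pos * (1 - prevalence)
p_negative <- 1 - p_positive

# P(disease | positive)
p_disease_given_pos <- sensitivity * prevalence / p_positive

# P(no disease | negative)
p_healthy_given_neg <- specificity * (1 - prevalence) / p_negative

print(c(p_positive = p_positive,
        p_disease_given_pos = p_disease_given_pos,
        p_healthy_given_neg = p_healthy_given_neg))
```

Even with an accurate test, a positive result implies only about a 2% chance of disease, because true positives are vastly outnumbered by false positives when the disease is this rare.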
Suppose you are interested in the relationship between sales and your advertising budget. You know that \(E[Sales \mid Advertising] = \alpha + \beta \cdot Advertising\) (there is a linear relationship).
Write down the Law of Iterated Expectations.
If \(E[Advertising] = 100\), what is \(E[Sales]\)?
Answers
The Law of Iterated Expectations states that the expected value of the conditional expectation of Y given X equals the unconditional expected value of Y: E[E[Y∣X]] = E[Y]. In other words, averaging the conditional expectation over the distribution of X recovers the overall expected value of Y.
E[Sales] = E[E[Sales∣Advertising]] = E[α + β*Advertising] = α + β*E[Advertising] = α + 100β
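A small simulation can illustrate this result. This is only a sketch: the values of alpha and beta, the Normal distribution for Advertising, and the noise term are hypothetical choices made for illustration. The problem itself specifies only the linear conditional mean and E[Advertising] = 100.

```r
# Simulation check of E[Sales] = alpha + beta * E[Advertising]
set.seed(1)                # arbitrary seed for reproducibility
alpha <- 5                 # hypothetical intercept
beta  <- 2                 # hypothetical slope
n     <- 100000            # number of simulated observations

# Assumed distribution with E[Advertising] = 100
advertising <- rnorm(n, mean = 100, sd = 10)

# Sales has conditional mean alpha + beta * Advertising plus mean-zero noise
sales <- alpha + beta * advertising + rnorm(n, mean = 0, sd = 5)

simulated_mean   <- mean(sales)         # estimate of E[Sales]
theoretical_mean <- alpha + beta * 100  # alpha + 100 * beta

print(c(simulated = simulated_mean, theoretical = theoretical_mean))
```

The simulated mean should land close to alpha + 100 * beta for any choice of alpha and beta, since only the mean of Advertising enters the result.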
This problem asks you to conduct a series of sampling experiments and was adapted from the Mastering Metrics online materials. You should modify the below code.
First, draw 500 random samples, each with a sample size of 8, from a random number generator for a standard Normal distribution. Then increase the sample size to 32, and then to 128.
Plot histograms of the sampling distributions of (i) the sample mean and (ii) the sample variance, for each of these three sample sizes.
Your experiments produce “samples of sample means.” What would you expect the mean and the variance of the sample means to be based on statistical theory?
Compute the mean and variance of the sample means generated by each experiment and compare them to what you predicted in part (b).
## [1] 0.001030307
## [1] 0.1278419
## [1] -0.01102805
## [1] 0.02861983
## [1] -0.003069978
## [1] 0.006837496
Answers
See corresponding sections of code above
According to statistical theory, the mean of the sample means should be close to the population mean. In this case, we are sampling from a standard Normal distribution (mean = 0), so the mean of the sample means should be close to 0. The variance of the sample means should be smaller than the variance of the original distribution: it equals the population variance divided by the sample size. For a standard Normal distribution, the population variance is 1. Because the sample size is in the denominator, the variance of the sample means should decrease as the sample size increases from 8 to 32 to 128. The variances of the sample means should therefore be about 1/8, 1/32, and 1/128.
The means of the sample means are all close to 0, as expected (.001 at sample size 8, -.011 at 32, and -.003 at 128). The variance of the sample means decreases as the sample size increases (.128 at sample size 8, .029 at 32, and .007 at 128), close to the theoretical values of 1/8 = .125, 1/32 ≈ .031, and 1/128 ≈ .008. All of this was consistent with my expectations.
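The experiment described in this problem can be sketched in base R as follows. This is not the assignment's template code, which is not shown here; the seed, object names, and use of base graphics are assumptions, so the numbers will differ from the output printed above.

```r
# Sampling experiment: 500 samples from N(0, 1) at each of three sample sizes
set.seed(42)                       # arbitrary seed
n_reps       <- 500
sample_sizes <- c(8, 32, 128)

# For each sample size, store the 500 sample means and sample variances
results <- lapply(sample_sizes, function(n) {
  draws <- matrix(rnorm(n_reps * n), nrow = n_reps, ncol = n)  # one sample per row
  data.frame(n = n,
             sample_mean = rowMeans(draws),
             sample_var  = apply(draws, 1, var))
})

# Compare the mean and variance of the sample means with theory (0 and 1/n)
summary_table <- do.call(rbind, lapply(results, function(df) {
  data.frame(n             = df$n[1],
             mean_of_means = mean(df$sample_mean),
             var_of_means  = var(df$sample_mean),
             theory_var    = 1 / df$n[1])
}))
print(summary_table)

# Histograms of the sampling distributions: means on top, variances below
op <- par(mfrow = c(2, 3))
for (df in results) {
  hist(df$sample_mean, main = paste("Sample mean, n =", df$n[1]), xlab = "Sample mean")
}
for (df in results) {
  hist(df$sample_var, main = paste("Sample variance, n =", df$n[1]), xlab = "Sample variance")
}
par(op)
```

Storing each experiment as a data frame keeps the sample means and sample variances together, so the same objects feed both the histograms and the comparison table.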
This question explores hypothesis testing. It uses data from Project STAR, a randomized controlled trial that evaluated the effects of class size on student outcomes. We will return to Project STAR later in the course.
Check out the contents of the data set using the
glimpse function. How many observations and variables does
it contain? What is the gender of the third child in the sample? What is
their kindergarten math test score and to what sort of “STAR” class were
they assigned while in kindergarten?
Create a total test score variable (the sum of the reading and
math score) for each grade, K-3. Plot the scorek
distributions for the small and regular classes using
ggplot. Write a sentence that summarizes what they
indicate.
Use the st function to construct a table of
test-score means and standard deviations for each class type (“small”,
“regular” and “regular+aide”) by grade, based on the kindergarten
assignment.
Write a sentence that states the difference between small and regular-class average scores for kindergartners. Write a sentence that expresses this difference in terms of the standard deviation of regular-class scores.
Using a t-test, evaluate whether these two means are statistically different from one another. State the null and alternative hypotheses that you are testing. Write a sentence that interprets the p-value from this test.
## Rows: 11,598
## Columns: 47
## $ gender <fct> female, female, female, male, male, male, male, female, ma…
## $ ethnicity <fct> afam, cauc, afam, cauc, afam, cauc, afam, cauc, cauc, cauc…
## $ birth <yearqtr> 1979 Q3, 1980 Q1, 1979 Q4, 1979 Q4, 1980 Q1, 1979 Q3, …
## $ stark <fct> NA, small, small, NA, regular+aide, NA, NA, NA, NA, NA, re…
## $ star1 <fct> NA, small, small, NA, NA, NA, NA, regular+aide, regular, r…
## $ star2 <fct> NA, small, regular+aide, NA, NA, regular, NA, regular+aide…
## $ star3 <fct> regular, small, regular+aide, small, NA, regular, regular+…
## $ readk <int> NA, 447, 450, NA, 439, NA, NA, NA, NA, NA, 448, 447, 431, …
## $ read1 <int> NA, 507, 579, NA, NA, NA, NA, 475, NA, 651, 651, 533, 558,…
## $ read2 <int> NA, 568, 588, NA, NA, NA, NA, 573, NA, 596, 614, 608, 608,…
## $ read3 <int> 580, 587, 644, 686, NA, 644, NA, 599, NA, 626, 641, 665, 5…
## $ mathk <int> NA, 473, 536, NA, 463, NA, NA, NA, NA, NA, 559, 489, 454, …
## $ math1 <int> NA, 538, 592, NA, NA, NA, NA, 512, NA, 532, 584, 545, 553,…
## $ math2 <int> NA, 579, 579, NA, NA, NA, NA, 550, NA, 590, 639, 603, 579,…
## $ math3 <int> 564, 593, 639, 667, NA, 648, NA, 583, NA, 618, 684, 648, 5…
## $ lunchk <fct> NA, non-free, non-free, NA, free, NA, NA, NA, NA, NA, non-…
## $ lunch1 <fct> NA, free, NA, NA, NA, NA, NA, non-free, non-free, non-free…
## $ lunch2 <fct> NA, non-free, non-free, NA, NA, non-free, NA, non-free, no…
## $ lunch3 <fct> free, free, non-free, non-free, NA, non-free, free, non-fr…
## $ schoolk <fct> NA, rural, suburban, NA, inner-city, NA, NA, NA, NA, NA, r…
## $ school1 <fct> NA, rural, suburban, NA, NA, NA, NA, rural, rural, rural, …
## $ school2 <fct> NA, rural, suburban, NA, NA, rural, NA, rural, rural, rura…
## $ school3 <fct> suburban, rural, suburban, rural, NA, rural, inner-city, r…
## $ degreek <fct> NA, bachelor, bachelor, NA, bachelor, NA, NA, NA, NA, NA, …
## $ degree1 <fct> NA, bachelor, master, NA, NA, NA, NA, master, master, bach…
## $ degree2 <fct> NA, bachelor, bachelor, NA, NA, bachelor, NA, master, bach…
## $ degree3 <fct> bachelor, bachelor, bachelor, bachelor, NA, bachelor, bach…
## $ ladderk <fct> NA, level1, level1, NA, probation, NA, NA, NA, NA, NA, lev…
## $ ladder1 <fct> NA, level1, probation, NA, NA, NA, NA, apprentice, level1,…
## $ ladder2 <fct> NA, apprentice, level1, NA, NA, notladder, NA, level1, lev…
## $ ladder3 <fct> level1, apprentice, level1, level1, NA, level1, notladder,…
## $ experiencek <int> NA, 7, 21, NA, 0, NA, NA, NA, NA, NA, 16, 5, 8, 17, NA, NA…
## $ experience1 <int> NA, 7, 32, NA, NA, NA, NA, 8, 13, 7, 11, 15, 0, 5, NA, 17,…
## $ experience2 <int> NA, 3, 4, NA, NA, 13, NA, 13, 6, 8, 31, 14, 9, NA, 4, 28, …
## $ experience3 <int> 30, 1, 4, 10, NA, 15, 17, 23, 8, 8, 7, 14, 8, NA, 19, 13, …
## $ tethnicityk <fct> NA, cauc, cauc, NA, cauc, NA, NA, NA, NA, NA, cauc, cauc, …
## $ tethnicity1 <fct> NA, cauc, afam, NA, NA, NA, NA, cauc, cauc, cauc, cauc, ca…
## $ tethnicity2 <fct> NA, cauc, afam, NA, NA, cauc, NA, afam, cauc, cauc, cauc, …
## $ tethnicity3 <fct> cauc, cauc, cauc, cauc, NA, cauc, afam, cauc, cauc, cauc, …
## $ systemk <fct> NA, 30, 11, NA, 11, NA, NA, NA, NA, NA, 35, 41, 4, 11, NA,…
## $ system1 <fct> NA, 30, 11, NA, NA, NA, NA, 4, 40, 21, 35, 41, 4, 11, NA, …
## $ system2 <fct> NA, 30, 11, NA, NA, 6, NA, 4, 40, 21, 35, 41, 4, NA, 17, 2…
## $ system3 <fct> 22, 30, 11, 6, NA, 6, 11, 4, 40, 21, 35, 41, 4, NA, 17, 20…
## $ schoolidk <fct> NA, 63, 20, NA, 19, NA, NA, NA, NA, NA, 69, 79, 5, 16, NA,…
## $ schoolid1 <fct> NA, 63, 20, NA, NA, NA, NA, 5, 77, 50, 69, 79, 5, 16, NA, …
## $ schoolid2 <fct> NA, 63, 20, NA, NA, 8, NA, 5, 77, 50, 69, 79, 5, NA, 41, 4…
## $ schoolid3 <fct> 54, 63, 20, 8, NA, 8, 31, 5, 77, 50, 69, 79, 5, NA, 41, 48…
## [1] 47
## [1] 11598
## [1] female
## Levels: male female
## [1] 536
## [1] small
## Levels: regular small regular+aide
## Warning: Removed 351 rows containing non-finite values (`stat_density()`).
## Warning: Removed 1299 rows containing non-finite values (`stat_density()`).
## Warning: Removed 1813 rows containing non-finite values (`stat_density()`).
## Warning: Removed 2110 rows containing non-finite values (`stat_density()`).
| Variable | N (regular) | Mean (regular) | SD (regular) | N (small) | Mean (small) | SD (small) | N (regular+aide) | Mean (regular+aide) | SD (regular+aide) |
|---|---|---|---|---|---|---|---|---|---|
| scorek | 2005 | 918 | 73 | 1738 | 932 | 76 | 2043 | 918 | 71 |
| score1 | 1456 | 1057 | 91 | 1339 | 1076 | 95 | 1503 | 1054 | 91 |
| score2 | 1201 | 1179 | 83 | 1080 | 1189 | 85 | 1183 | 1175 | 83 |
| score3 | 1047 | 1247 | 70 | 937 | 1258 | 73 | 1021 | 1247 | 73 |
##
## Welch Two Sample t-test
##
## data: scorek by stark
## t = -5.6635, df = 3616, p-value = 1.598e-08
## alternative hypothesis: true difference in means between group regular and group small is not equal to 0
## 95 percent confidence interval:
## -18.710595 -9.087394
## sample estimates:
## mean in group regular mean in group small
## 918.0429 931.9419
Answers
The data set contains 47 variables and 11,598 observations. The third student is female, scored 536 on the kindergarten math test, and was assigned to a small class in kindergarten.
The plots show the distribution of total test scores for students in small and regular classes at each grade level. At every grade level, students in small classes tend to have slightly higher scores than students in regular classes. Whether or not there is causation in this case, I am not quite sure. After all, correlation is not causation.
Please see table above.
For kindergartners, the average score for small classes (932) is about 14 points higher than that for regular classes (918). This difference is about 0.19 standard deviations of regular-class scores (14 / 73 ≈ 0.19).
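Ho: The mean kindergarten scores of the small and regular classes are equal (the true difference in means is 0)
Ha: The mean kindergarten scores of the two groups differ (the true difference in means is not 0)
According to the results (the 95% confidence interval does not include 0 and the p-value is 1.598e-08), we reject the null and conclude that the mean kindergarten scores for small and regular classes are significantly different. The p-value means that, if the true means were equal, a difference at least as large as the one observed would occur with a probability of about 1.6 in 100 million.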
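The standardized difference and the Welch t-test can be approximately reproduced from the summary statistics alone. This sketch assumes the first two column groups of the table are the regular and small classes (inferred from the factor level order and the t-test group means), and it uses the rounded standard deviations from the table, so the results differ slightly from the t-test output above. Variable names are illustrative.

```r
# Kindergarten summary statistics from the table and t-test output above
mean_regular <- 918.0429; sd_regular <- 73; n_regular <- 2005
mean_small   <- 931.9419; sd_small   <- 76; n_small   <- 1738

# Difference in means, in points and in regular-class standard deviations
diff_means  <- mean_small - mean_regular
effect_size <- diff_means / sd_regular

# Welch t statistic (regular minus small, matching the sign of the output above)
var_term_regular <- sd_regular^2 / n_regular
var_term_small   <- sd_small^2 / n_small
se_diff <- sqrt(var_term_regular + var_term_small)
t_stat  <- (mean_regular - mean_small) / se_diff

# Welch-Satterthwaite degrees of freedom and two-sided p-value
df_welch <- (var_term_regular + var_term_small)^2 /
  (var_term_regular^2 / (n_regular - 1) + var_term_small^2 / (n_small - 1))
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

print(c(diff_means = diff_means, effect_size = effect_size,
        t_stat = t_stat, df_welch = df_welch, p_value = p_value))
```

Working from summary statistics makes the link between the table and the test explicit: the t statistic is the difference in means divided by its standard error.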