The following are the R commands and answers for the questions on the in-class handout.
Load the data:
drp<-read.file("/home/emesekennedy/Data/Ch7/drp.txt")
## Reading data with read.table()
Load the aplpack package and create a back-to-back stemplot:
require(aplpack)
## Loading required package: aplpack
## Loading required package: tcltk
stem.leaf.backback(subset(drp, group=="Treat")$drp, subset(drp, group=="Control")$drp, style="bare", depths=F)
## _______________________
## 1 | 2: represents 12, leaf unit: 1
## subset(drp, group == "Treat")$drp
## subset(drp, group == "Control")$drp
## _______________________
## | 1 |0
## | 1 |79
## 4| 2 |0
## | 2 |68
## 3| 3 |3
## | 3 |77
## 4333| 4 |12223
## 996| 4 |68
## 432| 5 |34
## 98776| 5 |55
## 21| 6 |02
## 7| 6 |
## 1| 7 |
## | 7 |
## | 8 |
## _______________________
## HI: 85
## n: 21 23
## _______________________
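The stemplot flags one high value, 85, as a possible outlier. The leaf counts (21 on the left for the treatment group, 22 on the right for the control group, compared with sample sizes of 21 and 23) show that this value belongs to the control group.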
Create histograms, boxplots, and Normal quantile plots for the two samples.
histogram(~drp|group, data=drp)
bwplot(~drp|group, data=drp)
xqqmath(~drp|group, data=drp)
Neither of the data sets is strongly skewed, and there are no extreme outliers. The Normal quantile plots indicate that the distributions of both samples are close to Normal. Thus, it is appropriate to use the t procedures for these data.
Find the mean and standard deviation for the two samples:
mean(~drp|group, data=drp)
## Control Treat
## 41.52174 51.47619
sd(~drp|group, data=drp)
## Control Treat
## 17.14873 11.00736
\(H_0: \mu_{\text{Treat}}=\mu_{\text{Control}}\) (or \(\mu_{\text{Treat}}-\mu_{\text{Control}}=0\))
\(H_a: \mu_{\text{Treat}}>\mu_{\text{Control}}\) (or \(\mu_{\text{Treat}}-\mu_{\text{Control}}>0\), or \(\mu_{\text{Control}}-\mu_{\text{Treat}}<0\))
Use the t.test command to conduct the test. Note that R puts the difference in the population means (\(\mu_{\text{Group 1}}-\mu_{\text{Group 2}}\)) on the left-hand side of the alternative, where Group 1 is the sample whose category name comes first alphabetically or whose group number is lower. This means that the correct value of the alternative option depends on which variable you use to group the samples.
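The ordering rule can be checked directly in base R. The sketch below uses short made-up vectors (group and g here are illustrative stand-ins, not the handout's data) and assumes the grouping variable is converted to a factor with the default level order, which is what the formula interface of base R's t.test does.

```r
# Illustrative labels only; these stand in for the two grouping columns.
group <- c("Treat", "Control", "Treat", "Control")
g     <- c(0, 1, 0, 1)

# factor() sorts its levels, and the first level plays the role of Group 1.
levels(factor(group))  # "Control" is listed before "Treat"
levels(factor(g))      # "0" is listed before "1"
```

With the variable group, Group 1 is therefore the control group; with the variable g, Group 1 is the group coded 0.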
If we group our samples by the variable called group, which has value “Treat” or “Control”, then R will use the alternative hypothesis \(\mu_{\text{Control}}-\mu_{\text{Treat}}<0\). This means that the value of the alternative option should be “less”:
t.test(~drp|group, data=drp, alternative="less")
##
## Welch Two Sample t-test
##
## data: drp by group
## t = -2.3109, df = 37.855, p-value = 0.01319
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -2.691293
## sample estimates:
## mean in group Control mean in group Treat
## 41.52174 51.47619
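As a cross-check that does not need the data file, the individual scores can be read off the back-to-back stemplot above and passed to base R's t.test. This is only a sketch: the data frame name scores is made up for this example, and the formula is written in base R form (response ~ group) rather than the one-sided mosaic form used in the handout.

```r
# Scores read off the back-to-back stemplot (leaf unit 1); 85 is the "HI" value.
treat   <- c(24, 33, 43, 43, 43, 44, 46, 49, 49, 52, 53, 54,
             56, 57, 57, 58, 59, 61, 62, 67, 71)
control <- c(10, 17, 19, 20, 26, 28, 33, 37, 37, 41, 42, 42, 42, 43,
             46, 48, 53, 54, 55, 55, 60, 62, 85)

# Hypothetical data frame mirroring the drp and group columns.
scores <- data.frame(
  drp   = c(treat, control),
  group = rep(c("Treat", "Control"), times = c(length(treat), length(control)))
)

# Base R formula interface; the Welch test is the default (var.equal = FALSE).
res <- t.test(drp ~ group, data = scores, alternative = "less")
res$statistic  # compare with t = -2.3109 above
res$parameter  # compare with df = 37.855 above
res$p.value    # compare with p-value = 0.01319 above
```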
If we group our samples by the variable called g, which has value 0 (for the treatment group) or 1 (for the control group), then R will use the alternative hypothesis \(\mu_{\text{Treat}}-\mu_{\text{Control}}>0\). This means that the value of the alternative option should be “greater”:
t.test(~drp|g, data=drp, alternative="greater")
##
## Welch Two Sample t-test
##
## data: drp by g
## t = 2.3109, df = 37.855, p-value = 0.01319
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 2.691293 Inf
## sample estimates:
## mean in group 0 mean in group 1
## 51.47619 41.52174
Depending on which alternative hypothesis you used, the value of the test statistic \(t\) is either \(2.3109\) or \(-2.3109\), and it has \(37.855\) degrees of freedom. The P-value in either case is \(.01319\).
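The test statistic and degrees of freedom can also be reproduced by hand from the sample means and standard deviations found earlier. The sketch below uses the standard Welch formulas; the variable names are made up for this example, and the inputs are the rounded values printed by mean() and sd(), so the results agree with the t.test output only up to rounding.

```r
# Summary statistics copied from the mean() and sd() output above.
m_control <- 41.52174; s_control <- 17.14873; n_control <- 23
m_treat   <- 51.47619; s_treat   <- 11.00736; n_treat   <- 21

# Squared standard error contributed by each sample.
v_control <- s_control^2 / n_control
v_treat   <- s_treat^2 / n_treat

# Test statistic for the difference Control minus Treat.
se     <- sqrt(v_control + v_treat)
t_stat <- (m_control - m_treat) / se

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (v_control + v_treat)^2 /
  (v_control^2 / (n_control - 1) + v_treat^2 / (n_treat - 1))

# One-sided P-value matching alternative = "less".
p_less <- pt(t_stat, df_welch)

c(t = t_stat, df = df_welch, p = p_less)
```

Swapping the order of the two groups changes only the sign of t_stat, and the P-value is then found from the upper tail instead.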
To find the 95% confidence interval for the difference in means, we have to re-run the test with the default two-sided alternative:
t.test(~drp|group, data=drp)
##
## Welch Two Sample t-test
##
## data: drp by group
## t = -2.3109, df = 37.855, p-value = 0.02638
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -18.67588 -1.23302
## sample estimates:
## mean in group Control mean in group Treat
## 41.52174 51.47619
If we define the difference as \(\mu_{\text{Control}}-\mu_{\text{Treat}}\), then the 95% confidence interval is \((-18.68, -1.23)\).
t.test(~drp|g, data=drp)
##
## Welch Two Sample t-test
##
## data: drp by g
## t = 2.3109, df = 37.855, p-value = 0.02638
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.23302 18.67588
## sample estimates:
## mean in group 0 mean in group 1
## 51.47619 41.52174
If we define the difference as \(\mu_{\text{Treat}}-\mu_{\text{Control}}\), then the 95% confidence interval is \((1.23, 18.68)\).
We can also find the 99% confidence interval using the optional argument conf.level=.99:
t.test(~drp|group, data=drp, conf.level=.99)
##
## Welch Two Sample t-test
##
## data: drp by group
## t = -2.3109, df = 37.855, p-value = 0.02638
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
## -21.637175 1.728272
## sample estimates:
## mean in group Control mean in group Treat
## 41.52174 51.47619
If we define the difference as \(\mu_{\text{Control}}-\mu_{\text{Treat}}\), then the 99% confidence interval is \((-21.64, 1.73)\).
t.test(~drp|g, data=drp, conf.level=.99)
##
## Welch Two Sample t-test
##
## data: drp by g
## t = 2.3109, df = 37.855, p-value = 0.02638
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
## -1.728272 21.637175
## sample estimates:
## mean in group 0 mean in group 1
## 51.47619 41.52174
If we define the difference as \(\mu_{\text{Treat}}-\mu_{\text{Control}}\), then the 99% confidence interval is \((-1.73, 21.64)\).
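Both intervals have the form estimate plus or minus t* times the standard error, where t* is the critical value of the t distribution with the Welch degrees of freedom. The sketch below rebuilds them from the summary statistics; the function name welch_ci and the variable names are made up for this example, and the degrees of freedom are copied from the t.test output rather than recomputed.

```r
# Summary statistics and degrees of freedom copied from the output above.
m_control <- 41.52174; s_control <- 17.14873; n_control <- 23
m_treat   <- 51.47619; s_treat   <- 11.00736; n_treat   <- 21
df_welch  <- 37.855

est <- m_control - m_treat                                  # Control minus Treat
se  <- sqrt(s_control^2 / n_control + s_treat^2 / n_treat)  # standard error

# Hypothetical helper: two-sided interval at the given confidence level.
welch_ci <- function(level) {
  t_star <- qt(1 - (1 - level) / 2, df_welch)  # critical value t*
  est + c(-1, 1) * t_star * se
}

ci95 <- welch_ci(0.95)  # compare with (-18.67588, -1.23302)
ci99 <- welch_ci(0.99)  # compare with (-21.637175, 1.728272)
```

The 99% interval is wider because its critical value is larger, which is why it contains 0 while the 95% interval does not.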
At the 5% significance level (\(P=.01319<.05\)), the data provide evidence that directed reading activities improve students' DRP scores.
speaker<-read.file("/home/emesekennedy/Data/Ch7/speakerclarity.txt")
## Reading data with read.table()
Create a two-way table. Depending on what order we enter the two variables, we can make the table horizontal or vertical:
tally(~Gender|Rating, data=speaker)
## Rating
## Gender 1 2 3 4 5
## F 5 44 48 183 188
## M 5 29 30 91 66
tally(~Rating|Gender, data=speaker)
## Gender
## Rating F M
## 1 5 5
## 2 44 29
## 3 48 30
## 4 183 91
## 5 188 66
We can use the margins=T option to display the column sums:
tally(~Gender|Rating, data=speaker, margins=T)
## Rating
## Gender 1 2 3 4 5
## F 5 44 48 183 188
## M 5 29 30 91 66
## Total 10 73 78 274 254
If we would also like to display the row sums, then we have to use & instead of | between the two variables:
tally(~Gender&Rating, data=speaker, margins=T)
## Rating
## Gender 1 2 3 4 5 Total
## F 5 44 48 183 188 468
## M 5 29 30 91 66 221
## Total 10 73 78 274 254 689
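The same table with both margins can be built in base R, which is a handy check when only the counts are available. This is a sketch under that assumption: the matrix counts is typed in from the tally() output above, and base R labels the margins "Sum" rather than "Total".

```r
# Counts copied from the tally() output above.
counts <- matrix(
  c(5, 44, 48, 183, 188,
    5, 29, 30,  91,  66),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("F", "M"), Rating = as.character(1:5))
)

# addmargins() appends a "Sum" row and a "Sum" column.
with_totals <- addmargins(as.table(counts))
with_totals
```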
Find the sample means and standard deviations of the ratings for the male and female participants separately:
mean(~Rating|Gender, data=speaker)
## F M
## 4.079060 3.832579
sd(~Rating|Gender, data=speaker)
## F M
## 0.9860643 1.0677194
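These means and standard deviations can be recovered from the two-way table alone by expanding each rating according to its count. The sketch below assumes only the counts shown in the tally() output; the variable names are made up for this example.

```r
rating   <- 1:5
counts_f <- c(5, 44, 48, 183, 188)  # female counts for ratings 1 to 5
counts_m <- c(5, 29, 30, 91, 66)    # male counts for ratings 1 to 5

# rep() repeats each rating as many times as it was given.
ratings_f <- rep(rating, times = counts_f)
ratings_m <- rep(rating, times = counts_m)

c(F = mean(ratings_f), M = mean(ratings_m))
c(F = sd(ratings_f),   M = sd(ratings_m))
```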
histogram(~Rating|Gender, data=speaker)
Both samples are skewed to the left, but the sample sizes are large (468 and 221), and there are no outliers because every response must be one of the five possible ratings. It is therefore appropriate to use the t procedures for these data.
\(H_0: \mu_{\text{female}}=\mu_{\text{male}}\)
\(H_a: \mu_{\text{female}}\ne\mu_{\text{male}}\)
t.test(~Rating|Gender, data=speaker)
##
## Welch Two Sample t-test
##
## data: Rating by Gender
## t = 2.8975, df = 402.17, p-value = 0.003967
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.07925257 0.41370872
## sample estimates:
## mean in group F mean in group M
## 4.079060 3.832579
\(t=2.8975\) with \(402.17\) degrees of freedom
\(P=.003967<.05\)
The data provide strong evidence that there is a difference in average rating between females and males.
The 95% confidence interval for the difference in average rating \(\mu_{\text{female}}-\mu_{\text{male}}\) is \((.0793, .4137)\).
If we assume that the variances of the ratings for females and males are equal, then we can conduct a pooled two-sample t test. To do this, we have to add the option var.equal=T:
t.test(~Rating|Gender, data=speaker, var.equal=T)
##
## Two Sample t-test
##
## data: Rating by Gender
## t = 2.9814, df = 687, p-value = 0.002971
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.08415627 0.40880501
## sample estimates:
## mean in group F mean in group M
## 4.079060 3.832579
\(t=2.9814\) with \(687\) degrees of freedom
\(P=.002971<.05\)
Even if we assume that the variance of ratings for females and males is the same, the data provide strong evidence that there is a difference in average rating between females and males.
The 95% confidence interval for the difference in average rating \(\mu_{\text{female}}-\mu_{\text{male}}\) is \((.0842, .4088)\).
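The pooled results can be reproduced by hand from the summary statistics. The sketch below uses the standard pooled two-sample formulas; the variable names are made up for this example, the sample sizes are the row totals from the two-way table, and the inputs are rounded, so agreement with the t.test output is only up to rounding.

```r
# Means, standard deviations, and sample sizes copied from the output above.
m_f <- 4.079060; s_f <- 0.9860643; n_f <- 468
m_m <- 3.832579; s_m <- 1.0677194; n_m <- 221

# Pooled standard deviation: variances weighted by their degrees of freedom.
df_pooled <- n_f + n_m - 2
sp <- sqrt(((n_f - 1) * s_f^2 + (n_m - 1) * s_m^2) / df_pooled)

# Standard error, test statistic, and two-sided P-value.
se          <- sp * sqrt(1 / n_f + 1 / n_m)
t_stat      <- (m_f - m_m) / se
p_two_sided <- 2 * pt(-abs(t_stat), df_pooled)

# 95% confidence interval for the female minus male difference.
ci95 <- (m_f - m_m) + c(-1, 1) * qt(0.975, df_pooled) * se
```

The pooled degrees of freedom are simply n_f + n_m - 2 = 687, which is why they are a whole number here while the Welch value (402.17) is not.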
We can tell the owner that there is strong evidence to suggest that there is a difference in average ratings, but we cannot be sure that the difference is greater than \(.25\) since the 95% confidence interval includes many values below \(.25\).