For the Chapter 1 R code we went into considerable detail on using R, such as how to install packages, load packages, and use various functions. Note that the AMCP package first needs to be installed (see the Chapter 1 page). Further note that the data sets are named using the chapter number and table number, separated by underscores (e.g., chapter_3_table_1; chapter_11_table_20). For a list of the available data sets, use data() with package="AMCP" specified, as in data(package="AMCP").
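For example, the installation (needed only once per system) and the listing of data sets can be done as follows.
# install.packages("AMCP") # Only necessary once on current system.
data(package="AMCP") # Lists the data sets contained in the AMCP package.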
We begin with Chapter 3, Table 3, which contains the data upon which Figure 3.1 is based.
library(AMCP)
data(chapter_3_table_3)
To return the data (i.e., to see what it looks like), you can simply call upon it.
chapter_3_table_3
## Condition Rating
## 1 1 6
## 2 1 5
## 3 1 4
## 4 1 7
## 5 1 7
## 6 1 5
## 7 1 5
## 8 1 7
## 9 1 7
## 10 1 7
## 11 2 5
## 12 2 4
## 13 2 4
## 14 2 3
## 15 2 4
## 16 2 3
## 17 2 4
## 18 2 4
## 19 2 4
## 20 2 5
## 21 3 3
## 22 3 3
## 23 3 4
## 24 3 4
## 25 3 4
## 26 3 3
## 27 3 1
## 28 3 2
## 29 3 2
## 30 3 4
To learn the structure of the data, the str() function can be used. Importantly, here we see that Condition is stored as an integer (a number).
str(chapter_3_table_3)
## 'data.frame': 30 obs. of 2 variables:
## $ Condition: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Rating : int 6 5 4 7 7 5 5 7 7 7 ...
The above can also be seen by using the class() function, which shows that Condition is an integer variable.
class(chapter_3_table_3$Condition)
## [1] "integer"
We can convert Condition into a factor, so that it is explicitly recognized as a grouping variable, using the as.factor() function and redefining the Condition variable.
chapter_3_table_3$Condition <- as.factor(chapter_3_table_3$Condition)
class(chapter_3_table_3$Condition)
## [1] "factor"
A simple box-and-whiskers plot can be produced using the plot() function (now that Condition is treated as a factor).
plot(chapter_3_table_3)
More descriptive labels and/or a title can be added.
plot(chapter_3_table_3, ylab="Global Affect Rating", xlab="Group",
main="Box and Whiskers Plot of Affect Rating by Group")
Equivalently, the boxplot() function can be used with the model specified. The “model” is simply DV~IV (here with color added to the boxes).
boxplot(Rating~Condition, data=chapter_3_table_3, ylab="Global Affect Rating", xlab="Group",
main="Box and Whiskers Plot of Affect Rating by Group", col="bisque")
Now with names for the grouping variable.
boxplot(Rating~Condition, data=chapter_3_table_3, ylab="Global Affect Rating", xlab="Group",
main="Box and Whiskers Plot of Affect Rating by Group", col="bisque",
names=c("Pleasant", "Neutral", "Unpleasant"))
Now with points.
boxplot(Rating~Condition, data=chapter_3_table_3, ylab="Global Affect Rating", xlab="Group",
main="Box and Whiskers Plot of Affect Rating by Group", col="bisque",
names=c("Pleasant", "Neutral", "Unpleasant"))
points(Rating~Condition, data=chapter_3_table_3)
Because there are multiple points at the same value, they cannot all be seen. “Jittering” the points (literally, changing their values slightly) can be useful for more effectively understanding the distribution of values within each group. Even so, there can be some uncertainty about whether all of the scores are represented (some may still lie essentially atop one another). This figure is functionally R’s version of Figure 3.1.
boxplot(Rating~Condition, data=chapter_3_table_3, ylab="Global Affect Rating", xlab="Group",
main="Box and Whiskers Plot of Affect Rating by Group", col="bisque",
names=c("Pleasant", "Neutral", "Unpleasant"))
points(jitter(Rating)~Condition, data=chapter_3_table_3)
Another way we could plot the data is to jitter on the grouping variable by treating it as numeric.
boxplot(Rating~as.numeric(Condition), data=chapter_3_table_3, ylab="Global Affect Rating",
xlab="Group", main="Box and Whiskers Plot of Affect Rating by Group", col="bisque",
names=c("Pleasant", "Neutral", "Unpleasant"))
points(Rating~jitter(as.numeric(Condition)), data=chapter_3_table_3)
Because the default level of jittering seemed too much, we reduce it here to half the default amount (via the factor of .5).
boxplot(Rating~as.numeric(Condition), data=chapter_3_table_3, ylab="Global Affect Rating",
xlab="Group", main="Box and Whiskers Plot of Affect Rating by Group", col="bisque",
names=c("Pleasant", "Neutral", "Unpleasant"))
points(Rating~jitter(as.numeric(Condition), .5), data=chapter_3_table_3)
Numeric summaries can be obtained with the summary() function. One way to do this is to select the rows of interest (i.e., the groups) by using logical statements. Later we explore other ways of doing this. Note, however, that the percentiles (or quantiles) given are from only one of several ways of calculating percentiles (or quantiles). See ?quantile for more information on how R makes such computations (there are, in fact, nine different algorithms that R can use, selected via the appropriate option).
summary(chapter_3_table_3[chapter_3_table_3[,"Condition"]==1, "Rating"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 5.0 6.5 6.0 7.0 7.0
summary(chapter_3_table_3[chapter_3_table_3[,"Condition"]==2, "Rating"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 4 4 4 4 5
summary(chapter_3_table_3[chapter_3_table_3[,"Condition"]==3, "Rating"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.25 3.00 3.00 4.00 4.00
sd(chapter_3_table_3[chapter_3_table_3[,"Condition"]==1, "Rating"])
## [1] 1.154701
sd(chapter_3_table_3[chapter_3_table_3[,"Condition"]==2, "Rating"])
## [1] 0.6666667
sd(chapter_3_table_3[chapter_3_table_3[,"Condition"]==3, "Rating"])
## [1] 1.054093
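As a brief illustration of the point about multiple algorithms, the type argument to quantile() selects among R’s nine methods. Here is a minimal sketch comparing R’s default (type=7) with another type for the third group; the values in the comments follow from the definitions in ?quantile.
quantile(chapter_3_table_3[chapter_3_table_3[,"Condition"]==3, "Rating"],
probs=c(.25, .75), type=7) # R's default: 2.25 and 4.00, matching summary() above.
quantile(chapter_3_table_3[chapter_3_table_3[,"Condition"]==3, "Rating"],
probs=c(.25, .75), type=6) # A different algorithm: here 2.00 and 4.00.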
Load Chapter 3, Table 1 with the following command.
data(chapter_3_table_1)
One could call upon the summary statistics of interest individually, as we do below. Note that chapter_3_table_1 is a data frame, which typically means there is more than a single variable (as above). Because of the more flexible nature of a data frame, we need to identify the variable of interest for which we want the summary statistics. This seems unnecessary here, but consider the more general case of multiple variables, where identifying the variable of interest is important. The functions below are commonly used for describing data; there are many others. For information on any of these functions, use ?function.name, help("function.name"), or help.search("topic of interest").
mean(chapter_3_table_1$IQ)
## [1] 104
median(chapter_3_table_1$IQ)
## [1] 104
quantile(chapter_3_table_1$IQ, probs=c(.05, .25, .5, .75, .95))
## 5% 25% 50% 75% 95%
## 97.5 102.5 104.0 107.0 109.5
sd(chapter_3_table_1$IQ)
## [1] 4.898979
length(chapter_3_table_1$IQ)
## [1] 6
Here we reproduce Table 3.1a from the book. Note that MARGIN=2 indicates that the function is applied to each column (MARGIN=1 would apply it to each row).
Table_3.1a <- cbind(Y=chapter_3_table_1$IQ, Parameter.Term=rep(104,
length(chapter_3_table_1$IQ)), Error.Scores=chapter_3_table_1$IQ-104,
Squared.Term=(chapter_3_table_1$IQ-104)^2)
# From the table, take the sum of all but the second column ("-2" drops the second column).
apply(Table_3.1a[,-2], MARGIN=2, FUN=sum)
## Y Error.Scores Squared.Term
## 624 0 120
An alternative way to obtain the sum of squared deviations is use of the summarize() function from the dplyr package. The dplyr package can be installed with the install.packages() function as install.packages("dplyr"), in the same way the AMCP package was installed. After dplyr is installed, it needs to be loaded into the R session using the library() function. Note that below we use %>%, which is known as a “pipe.” In words, the pipe feeds the current output to a new function. It would be very valuable to read about using pipes and the dplyr package because it, along with its companion package tidyr, has changed the way many interact with data in R. These and other packages make some of the more obtuse programming in R unnecessary in that they provide ways of programming in R that are, arguably, more conceptually appealing and easier to master.
# install.packages("dplyr") # Only necessary once on current system.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
chapter_3_table_1 %>%
summarize(
Sum.Y = sum(IQ),
n = n(),
Sum.Squared.Errors = sum((IQ-104)^2))
## Sum.Y n Sum.Squared.Errors
## 1 624 6 120
Here we use the above code to find F as done in Table 3.1.
Values <- chapter_3_table_1 %>%
summarize(
Error_F = sum((IQ-mean(IQ))^2),
df_F = n()-1,
Error_R = sum((IQ-98)^2),
df_R = n(),
F = ((Error_R - Error_F)/(df_R - df_F))/(Error_F/df_F)
)
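The computed values can be displayed by calling the object; these match the anova() output shown later (E_R = 336, E_F = 120, F = 9).
Values
##   Error_F df_F Error_R df_R F
## 1     120    5     336    6 9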
Linear models, as discussed in the text, can be implemented in R using the lm() function. Here we use the lm() function in a non-standard way to illustrate the idea of model comparisons. Additionally, we use the anova() function in order to compare two lm() fits. Note that the anova() function requires that the dependent variable be named the same in the two competing models, which is why we form IQ.1 and IQ.2 as we do below.
# Define two data frames, IQ.1 and IQ.2, which are simply copies of the data from Chapter 3, Table 1.
IQ.1 = IQ.2 <- chapter_3_table_1
# Reassign `IQ` from the IQ.2 data set as the IQ scores minus the value of 98, which is
# the null.
IQ.2$IQ <- IQ.2$IQ-98
# Fit a linear model in which only the intercept (i.e., the 1) is estimated, which is the
# mean in this context, using the IQ.1 data set.
Full <- lm(IQ ~ 1, data=IQ.1)
# Fit a linear model in which there is no intercept term (i.e., the -1).
Restricted <- lm(IQ ~ -1, data=IQ.2)
# Now, perform the model comparisons. It is important that the two data sets use the same
# name for the dependent variable.
anova(Restricted, Full)
## Analysis of Variance Table
##
## Model 1: IQ ~ -1
## Model 2: IQ ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 6 336
## 2 5 120 1 216 9 0.0301 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The critical value from an F-distribution can be found using the qf() function, in which the quantile of interest is specified (e.g., .95) and the numerator and denominator degrees of freedom. Generically, qf() is used as qf(1-alpha, df1, df2). For our example:
qf(.95, 1, 5)
## [1] 6.607891
The p-value from an F-test can be found using the pf() function, in which the observed F-value and degrees of freedom are supplied as arguments. Generically, pf() is used as 1-pf(F, df1, df2) (to obtain the area greater than the observed F-value). For our example:
1-pf(9, 1, 5)
## [1] 0.03009925
# Alternatively
pf(9, 1, 5, lower.tail = FALSE)
## [1] 0.03009925
As noted, a t-test can be done. In R, the t.test() function can be used with the null value specified as mu=98.
t.test(chapter_3_table_1$IQ, mu=98)
##
## One Sample t-test
##
## data: chapter_3_table_1$IQ
## t = 3, df = 5, p-value = 0.0301
## alternative hypothesis: true mean is not equal to 98
## 95 percent confidence interval:
## 98.85884 109.14116
## sample estimates:
## mean of x
## 104
The t.test() function is very general. It can be used for single-sample t-tests, paired-samples t-tests, and independent-samples t-tests, based on the way in which the arguments are specified. To learn more about the various options of a function, use the help() function. Equivalently, a question mark immediately followed by the function name for which you want to see the help file can be used.
help(t.test)
?t.test
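In brief, the three usages follow these patterns, shown here as comments (x and y stand for numeric vectors of scores; these are sketches, not calls tied to our data).
# t.test(x, mu=98) # Single-sample t-test against a null value.
# t.test(x, y, paired=TRUE) # Paired-samples t-test.
# t.test(x, y, var.equal=TRUE) # Independent-samples t-test assuming equal variances.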
A t-test can also be done step-by-step, as follows, where we first define a variable called IQ to make the code easier to follow.
IQ <- chapter_3_table_1$IQ
Mean.IQ <- mean(IQ)
s.IQ <- sd(IQ)
n <- length(IQ)
(Mean.IQ - 98)/(s.IQ/sqrt(n))
## [1] 3
The critical value or values from a t-distribution can be found using the qt() function, in which the quantiles of interest are specified (e.g., .025 and .975 for a two-sided test; .95 for a one-sided test). For example:
# For a one-sided test using alpha=.05
qt(.95, 5)
## [1] 2.015048
# For a two-sided test using alpha=.05
qt(c(.025, .975), 5)
## [1] -2.570582 2.570582
The p-value from a t-test can be found using the pt() function, in which the observed t-value and degrees of freedom are supplied as arguments. For a two-sided p-value, the pt() function can be used generically as 2*(1-pt(abs(t), df)), where the absolute value of the observed t-value is used.
# For a two-sided p-value
2*(1-pt(abs(3), 5))
## [1] 0.03009925
For a one-sided p-value, the pt() function is used differently based on the alternative hypothesis that is specified. For an alternative hypothesis of “<”, only the area to the left of the observed t-value is needed, so pt() is used as pt(t, df), which returns the cumulative probability. It is important that the absolute value function (i.e., abs()) is not used for one-sided tests.
pt(3, 5)
## [1] 0.9849504
# The above differs considerably from, say, an observed negative t-value and a "<"
# alternative hypothesis.
pt(-3, 5)
## [1] 0.01504962
Had a “>” alternative hypothesis been specified, the pt() function would be used as 1-pt(t, df), which returns the area more extreme than the observed t-value. Again, it is important that the absolute value function (i.e., abs()) is not used for one-sided tests.
1-pt(3, 5)
## [1] 0.01504962
# The above differs considerably from, say, an observed negative t-value and a ">"
# alternative hypothesis.
1-pt(-3, 5)
## [1] 0.9849504
We used the summarize() function from the dplyr package earlier. A wonderful companion within dplyr is the group_by() function, which does the summarizing as demonstrated before, but does so by a grouping factor; for us, the grouping factor is Condition, which identifies the group levels. We can very simply replicate Table 3.4 (something that takes some work in other programs).
Summary.Mood <-
chapter_3_table_3 %>%
group_by(Condition) %>%
summarize(
Sum.Y = sum(Rating),
Y.bar = mean(Rating),
Sum.Squared.Errors = sum((Rating-mean(Rating))^2),
n = n())
Summary.Mood
## # A tibble: 3 × 5
## Condition Sum.Y Y.bar Sum.Squared.Errors n
## <fctr> <int> <dbl> <dbl> <int>
## 1 1 60 6 12 10
## 2 2 40 4 4 10
## 3 3 30 3 10 10
(a <- length(unique(chapter_3_table_3$Condition)))
## [1] 3
(N <- length(chapter_3_table_3$Rating))
## [1] 30
(E.F <- sum(Summary.Mood$Sum.Squared.Errors))
## [1] 26
(E.R <- sum((chapter_3_table_3$Rating - mean(chapter_3_table_3$Rating))^2))
## [1] 72.66667
(SS.Between <- E.R - E.F)
## [1] 46.66667
(MS.Between <- SS.Between/(a-1))
## [1] 23.33333
(MS.Within <- E.F/(N-a))
## [1] 0.962963
(df.F <- df.Within <- N-a)
## [1] 27
(df.R <- N-1)
## [1] 29
(df.Between <- df.R-df.F)
## [1] 2
(F <- ((E.R - E.F)/(df.R-df.F))/(E.F/df.F))
## [1] 24.23077
# Or
((E.R - E.F)/df.Between)/(E.F/df.Within)
## [1] 24.23077
# Or,
MS.Between/MS.Within
## [1] 24.23077
# Critical value
qf(.95, df.Between, df.Within)
## [1] 3.354131
# p-value
1-pf(F, df.Between, df.Within)
## [1] 9.421379e-07
The above showed step-by-step computations that mimic the text. In R, a one-way ANOVA can be done as follows (recalling that we already defined Condition as a factor). In the code below, the aov() function fits an ANOVA model and summary() provides a summary of an aov() (and many other) objects, as we will see.
summary(aov(Rating ~ Condition, data=chapter_3_table_3))
## Df Sum Sq Mean Sq F value Pr(>F)
## Condition 2 46.67 23.333 24.23 9.42e-07 ***
## Residuals 27 26.00 0.963
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that, had we not already defined Condition as a factor, it could be done within the aov() function as follows. Specification of the factor is important; otherwise, the aov() and summary() functions will produce results for a different type of analysis.
summary(aov(Rating ~ as.factor(Condition), data=chapter_3_table_3))
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Condition) 2 46.67 23.333 24.23 9.42e-07 ***
## Residuals 27 26.00 0.963
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To compute a confidence interval in a step-by-step fashion, we can create objects holding the relevant values, where MOE is the margin of error.
alpha <- .05
n <- 6
Y.bar <- 104
s <- sqrt(24)
s_Y.bar <- s/sqrt(n)
MOE <- qt(1-.05/2, n-1)*s_Y.bar
Y.bar - MOE
## [1] 98.85884
Y.bar + MOE
## [1] 109.1412
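If desired, these steps can be wrapped into a small helper function. The following is a minimal sketch (ci.mean is our own name, not a built-in R function).
ci.mean <- function(x, conf=.95) {
  # Margin of error based on the t critical value, the sd, and n.
  n <- length(x)
  MOE <- qt(1-(1-conf)/2, n-1)*sd(x)/sqrt(n)
  c(Lower=mean(x)-MOE, Upper=mean(x)+MOE)
}
ci.mean(chapter_3_table_1$IQ) # Returns the same limits as above: 98.85884 and 109.14116.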
Recall that earlier we used the t.test() function for the test of the null hypothesis that IQ had a mean of 98. In particular, we did the following, which, as part of the output, also provides the 95% confidence interval.
t.test(chapter_3_table_1$IQ, mu=98)
##
## One Sample t-test
##
## data: chapter_3_table_1$IQ
## t = 3, df = 5, p-value = 0.0301
## alternative hypothesis: true mean is not equal to 98
## 95 percent confidence interval:
## 98.85884 109.14116
## sample estimates:
## mean of x
## 104
Note that use of the t.test() function for purposes of obtaining a confidence interval does not depend, in any way, on the null value that is specified. For example, had the null value been 100, the confidence interval would be unchanged.
t.test(chapter_3_table_1$IQ, mu=100)
##
## One Sample t-test
##
## data: chapter_3_table_1$IQ
## t = 2, df = 5, p-value = 0.1019
## alternative hypothesis: true mean is not equal to 100
## 95 percent confidence interval:
## 98.85884 109.14116
## sample estimates:
## mean of x
## 104
Further, note that a different confidence level can be specified with the additional argument conf.level (the default is 0.95). For example, a 99% confidence interval can be obtained as
t.test(chapter_3_table_1$IQ, mu=98, conf.level = 0.99)
##
## One Sample t-test
##
## data: chapter_3_table_1$IQ
## t = 3, df = 5, p-value = 0.0301
## alternative hypothesis: true mean is not equal to 98
## 99 percent confidence interval:
## 95.93571 112.06429
## sample estimates:
## mean of x
## 104
Also note that a one-sided confidence interval, based on the alternative hypothesis of interest, can be obtained as follows if the alternative hypothesis is “less than”.
t.test(chapter_3_table_1$IQ, mu=98, alternative="less")
##
## One Sample t-test
##
## data: chapter_3_table_1$IQ
## t = 3, df = 5, p-value = 0.985
## alternative hypothesis: true mean is less than 98
## 95 percent confidence interval:
## -Inf 108.0301
## sample estimates:
## mean of x
## 104
A one-sided confidence interval for an alternative hypothesis of “greater than” can be obtained as
t.test(chapter_3_table_1$IQ, mu=98, alternative="greater")
##
## One Sample t-test
##
## data: chapter_3_table_1$IQ
## t = 3, df = 5, p-value = 0.01505
## alternative hypothesis: true mean is greater than 98
## 95 percent confidence interval:
## 99.9699 Inf
## sample estimates:
## mean of x
## 104
In the text, the confidence interval for the two-independent-groups t-test is done using summary statistics (i.e., the Brown and Miller, 1993 example); the raw data are not presented in the text, as they were not presented in the article. Suppose, however, that one wanted to test the hypothesis that two independent groups have the same mean and/or to form a confidence interval for the population mean difference. Here we create a data set directly in R (rather than reading one from AMCP) to show how one might do so. The scores below are actually the scores from Groups 1 and 2 of Chapter 3, Table 3. Below is just one of many ways data can be entered into R.
Group.1 <- cbind(Group=c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), Score=c(6, 5, 4, 7, 7, 5, 5, 7, 7, 7))
Group.2 <- cbind(Group=c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2), Score=c(5, 4, 4, 3, 4, 3, 4, 4, 4, 5))
Data <- as.data.frame(rbind(Group.1, Group.2))
Data$Group <- as.factor(Data$Group)
Another way to enter the data above is as follows.
Data2 <- cbind(Group=as.factor(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)),
Score=c(6, 5, 4, 7, 7, 5, 5, 7, 7, 7, 5, 4, 4, 3, 4, 3, 4, 4, 4, 5))
Yet another way uses the rep() function, which replicates a value the specified number of times.
Data3 <- cbind(Group=as.factor(c(rep(1, 10), rep(2, 10))),
Score=c(6, 5, 4, 7, 7, 5, 5, 7, 7, 7, 5, 4, 4, 3, 4, 3, 4, 4, 4, 5))
Now that we have data from two independent groups, we can perform a two-group t-test and obtain the confidence interval as follows. Note that, by default, this t-test does not assume homogeneity of variance; we show the version that assumes it momentarily.
t.test(Score~Group, data=Data)
##
## Welch Two Sample t-test
##
## data: Score by Group
## t = 4.7434, df = 14.4, p-value = 0.0002915
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.09803 2.90197
## sample estimates:
## mean in group 1 mean in group 2
## 6 4
Alternatively, the above t-test can be modified so that it assumes homogeneity of variance, as we have done thus far in the book for comparing means among some number of groups. This is accomplished in R using the additional specification of var.equal=TRUE.
t.test(Score~Group, data=Data, var.equal=TRUE)
##
## Two Sample t-test
##
## data: Score by Group
## t = 4.7434, df = 18, p-value = 0.0001623
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.114173 2.885827
## sample estimates:
## mean in group 1 mean in group 2
## 6 4
The above t-test can also be specified without the use of the data argument, but by using the two matrices of scores created earlier (recall that we defined Group.1 and Group.2 when we first created this data set). Note that you must select only the Score column (otherwise the group identifier will also be used).
t.test(Group.1[,"Score"], Group.2[,"Score"], var.equal=TRUE)
##
## Two Sample t-test
##
## data: Group.1[, "Score"] and Group.2[, "Score"]
## t = 4.7434, df = 18, p-value = 0.0001623
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.114173 2.885827
## sample estimates:
## mean of x mean of y
## 6 4
Had the data for Group 1 and Group 2 been specified simply as vectors of scores, we would implement the t-test as
group.1 <- c(6, 5, 4, 7, 7, 5, 5, 7, 7, 7)
group.2 <- c(5, 4, 4, 3, 4, 3, 4, 4, 4, 5)
t.test(group.1, group.2, var.equal=TRUE)
##
## Two Sample t-test
##
## data: group.1 and group.2
## t = 4.7434, df = 18, p-value = 0.0001623
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.114173 2.885827
## sample estimates:
## mean of x mean of y
## 6 4
As the book explains, confidence intervals for the population standardized mean difference are not straightforward computationally, but the MBESS package makes them very simple.
Here we use the Brown and Miller data as reported in the book in order to find the standardized mean difference.
# install.packages("MBESS") # Only necessary once on current system.
library(MBESS)
smd(Mean.1=60.9, Mean.2=18.5, s.1=sqrt(778.41), s.2=sqrt(2756.25), n.1=13, n.2=13)
## [1] 1.008572
Had we wanted to form the standardized mean difference using the control group’s (Group 2) standard deviation, the smd.c() function could be used.
smd.c(Mean.T=60.9, Mean.C=18.5, s.C=sqrt(2756.25), n.C=13)
## [1] 0.807619
If one wanted to obtain the unbiased versions of these effect sizes, the Unbiased=TRUE argument can be used.
smd(Mean.1=60.9, Mean.2=18.5, s.1=sqrt(778.41), s.2=sqrt(2756.25), n.1=13, n.2=13,
Unbiased=TRUE)
## [1] 0.9766663
smd.c(Mean.T=60.9, Mean.C=18.5, s.C=sqrt(2756.25), n.C=13, Unbiased=TRUE)
## [1] 0.7558844
In order to find the 95% confidence interval for the population standardized mean difference, we can use the ci.smd() function, which itself uses the conf.limits.nct() function, as noted in the book.
# Using the smd() function:
ci.smd(smd=smd(Mean.1=60.9, Mean.2=18.5, s.1=sqrt(778.41), s.2=sqrt(2756.25), n.1=13,
n.2=13), n.1=13, n.2=13)
## $Lower.Conf.Limit.smd
## [1] 0.1798709
##
## $smd
## [1] 1.008572
##
## $Upper.Conf.Limit.smd
## [1] 1.818814
# Or using the standardized mean difference directly:
ci.smd(smd=1.008572, n.1=13, n.2=13)
## $Lower.Conf.Limit.smd
## [1] 0.1798707
##
## $smd
## [1] 1.008572
##
## $Upper.Conf.Limit.smd
## [1] 1.818814
In order to find the 95% confidence interval for the population standardized mean difference when standardizing by the control group, we can use the ci.smd.c() function, which likewise uses the conf.limits.nct() function.
# Using the smd.c() function:
ci.smd.c(smd.c=smd.c(Mean.T=60.9, Mean.C=18.5, s.C=sqrt(2756.25), n.C=13), n.C=13, n.E=13)
## $Lower.Conf.Limit.smd.c
## [1] -0.03910679
##
## $smd.c
## [1] 0.807619
##
## $Upper.Conf.Limit.smd.c
## [1] 1.625974
# Or using the standardized mean difference directly:
ci.smd.c(smd.c=0.807619, n.C=13, n.E=13)
## $Lower.Conf.Limit.smd.c
## [1] -0.03910683
##
## $smd.c
## [1] 0.807619
##
## $Upper.Conf.Limit.smd.c
## [1] 1.625974
To obtain a confidence interval for the square root of the signal to noise ratio:
ci.srsnr(F.value=24.23, df.1=2, df.2=27, N=30)
## $Lower.Limit.of.the.Square.Root.of.the.Signal.to.Noise.Ratio
## [1] 0.753884
##
## $Upper.Limit.of.the.Square.Root.of.the.Signal.to.Noise.Ratio
## [1] 1.746008
To obtain a confidence interval for the signal to noise ratio:
ci.snr(F.value=24.23, df.1=2, df.2=27, N=30)
## $Lower.Limit.Signal.to.Noise.Ratio
## [1] 0.5683411
##
## $Upper.Limit.Signal.to.Noise.Ratio
## [1] 3.048545
Here we estimate the median of the square root of the signal to noise ratio by using a nearly-zero confidence level, which excludes almost all of the left and right sides of the distribution and collapses the interval onto the median.
ci.srsnr(F.value=24.23, df.1=2, df.2=27, N=30, conf.level=.00001)
## $Lower.Limit.of.the.Square.Root.of.the.Signal.to.Noise.Ratio
## [1] 1.243926
##
## $Upper.Limit.of.the.Square.Root.of.the.Signal.to.Noise.Ratio
## [1] 1.243932
# Or here, where the limits are the same (within rounding)
ci.srsnr(F.value=24.23, df.1=2, df.2=27, N=30, conf.level=.000001)
## $Lower.Limit.of.the.Square.Root.of.the.Signal.to.Noise.Ratio
## [1] 1.243929
##
## $Upper.Limit.of.the.Square.Root.of.the.Signal.to.Noise.Ratio
## [1] 1.243929
One of the advantages of an object-oriented language, such as R, is that objects can contain other objects, and these can be used within other functions. Below we perform an ANOVA using the aov() function and then examine the structure of the summarized result using the str() function.
Result <- aov(Rating ~ as.factor(Condition), data=chapter_3_table_3)
Summary.Output <- summary(Result)
#View the structure of what is in the Summary.Output object.
str(Summary.Output)
## List of 1
## $ :Classes 'anova' and 'data.frame': 2 obs. of 5 variables:
## ..$ Df : num [1:2] 2 27
## ..$ Sum Sq : num [1:2] 46.7 26
## ..$ Mean Sq: num [1:2] 23.333 0.963
## ..$ F value: num [1:2] 24.2 NA
## ..$ Pr(>F) : num [1:2] 9.42e-07 NA
## - attr(*, "class")= chr [1:2] "summary.aov" "listof"
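Relatedly, the names() function can be applied to the fitted object itself to list its components (e.g., coefficients, residuals, fitted.values), any of which can be extracted with $ or [[ ]].
names(Result)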
# Extract the various attributes of interest (first using double square brackets).
Summary.Output[[1]]["Df"]
## Df
## as.factor(Condition) 2
## Residuals 27
Summary.Output[[1]]["Mean Sq"]
## Mean Sq
## as.factor(Condition) 23.333
## Residuals 0.963
Summary.Output[[1]]["Sum Sq"]
## Sum Sq
## as.factor(Condition) 46.667
## Residuals 26.000
Summary.Output[[1]]["F value"]
## F value
## as.factor(Condition) 24.231
## Residuals
Summary.Output[[1]]["Pr(>F)"]
## Pr(>F)
## as.factor(Condition) 9.421e-07 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Further, objects within the above objects can themselves be extracted:
Summary.Output[[1]]["Df"][1,1]
## [1] 2
Summary.Output[[1]]["Df"][2,1]
## [1] 27
Summary.Output[[1]]["Mean Sq"][1,1]
## [1] 23.33333
Summary.Output[[1]]["Mean Sq"][2,1]
## [1] 0.962963
Given the above information, Omega^2 can be obtained by extracting the necessary objects.
SS.B <- Summary.Output[[1]]["Sum Sq"][1,1]
SS.W <- Summary.Output[[1]]["Sum Sq"][2,1]
df.1 <- Summary.Output[[1]]["Df"][1,1]
df.2 <- Summary.Output[[1]]["Df"][2,1]
MS.W <- Summary.Output[[1]]["Mean Sq"][2,1]
SS.T <- SS.B + SS.W
(Omega.Sq <- (SS.B-df.1*MS.W)/(SS.T + MS.W))
## [1] 0.6076459
#Or, using F as follows:
F.Val <- Summary.Output[[1]]["F value"][1,1]
df.2 <- Summary.Output[[1]]["Df"][2,1]
(df.1*(F.Val-1))/(df.1*(F.Val-1)+(df.1+df.2+1))
## [1] 0.6076459
Similarly, for R^2
E.R <- SS.T
E.F <- SS.W
(R2 <- (E.R - E.F)/E.R)
## [1] 0.6422018
Or for Adjusted R^2
N <- df.1 + df.2 + 1
a <- df.1+1
1 - ((N-1)/(N-a))*(1-R2)
## [1] 0.6156983
To obtain a confidence interval for the population proportion of variance accounted for (omega^2)
ci.pvaf(F.value=24.23, df.1=2, df.2=27, N=30)
## $Lower.Limit.Proportion.of.Variance.Accounted.for
## [1] 0.3623836
##
## $Probability.Less.Lower.Limit
## [1] 0.025
##
## $Upper.Limit.Proportion.of.Variance.Accounted.for
## [1] 0.7529977
##
## $Probability.Greater.Upper.Limit
## [1] 0.025
##
## $Actual.Coverage
## [1] 0.95
Here is a binomial effect size display (using the example from Rosenthal and Rubin, 1982).
r <- .32 # Taken from Rosenthal and Rubin (1982).
rbind(c(.5 + r/2, .5 - r/2)*100,
c(.5 - r/2, .5 + r/2)*100)
## [,1] [,2]
## [1,] 66 34
## [2,] 34 66
Common language (CL) effect size. Recall the example above where we have two groups (taken from Table 3 of Chapter 3); we repeat the data here. We then use the expand.grid() function (as used in the Chapter 1 R code) in order to find all possible pairings of the scores.
group.1 <- c(6, 5, 4, 7, 7, 5, 5, 7, 7, 7)
group.2 <- c(5, 4, 4, 3, 4, 3, 4, 4, 4, 5)
Grid <- expand.grid(group.1=group.1, group.2=group.2)
Difs <- Grid[,1] - Grid[,2]
# Note that this is not corrected for ties.
(CL.Uncorrected <- mean(Difs > 0))
## [1] 0.86
# Proportion of ties
(Ties <- mean(Difs==0))
## [1] 0.12
(CL.Corrected <- CL.Uncorrected + .5*Ties)
## [1] 0.92
Cliff’s measure
# Note that `Difs` is defined above.
mean(Difs > 0) - mean(Difs <0)
## [1] 0.84
Here we create a Q-Q plot of the chapter_3_table_7_raw data, ignoring groups.
data(chapter_3_table_7_raw)
qqnorm(chapter_3_table_7_raw$Drinks, xlab="Theoretical Quantiles",
ylab="Sample Quantiles")
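A reference line through the first and third quartiles can be added with qqline(), which often makes departures from normality easier to see (this line is not part of the book’s figure).
qqline(chapter_3_table_7_raw$Drinks)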
Here we create a histogram using hist(). Notice that the y-axis is in terms of frequency.
hist(chapter_3_table_7_raw$Drinks, main="Histogram")
Here we create the histogram with probability=TRUE, so that the y-axis is in terms of the “probability density” rather than frequency. After the histogram is plotted, we add the estimated density line using the lines() function, which uses within it the density() function. As above, groups are ignored.
hist(chapter_3_table_7_raw$Drinks, probability=TRUE, ylim=c(0, .03),
main="Histogram and Estimated Density Line")
lines(density(chapter_3_table_7_raw$Drinks))
Using the nclass option we can specify the number of bins or classes (i.e., the number of bars).
hist(chapter_3_table_7_raw$Drinks, probability=TRUE, ylim=c(0, .03),
main="Histogram and Estimated Density Line", nclass=15)
lines(density(chapter_3_table_7_raw$Drinks))
Had we wanted to include the normal density, one approach is the following, where we find the bounds of X, generate values between the bounds, and then find the density of a normal distribution at those values using the observed mean and standard deviation.
hist(chapter_3_table_7_raw$Drinks, probability=TRUE, ylim=c(0, .03), nclass=15,
main="Histogram and Normal Density Line")
Min.Drinks <- min(chapter_3_table_7_raw$Drinks)
Max.Drinks <- max(chapter_3_table_7_raw$Drinks)
Mean.Drinks <- mean(chapter_3_table_7_raw$Drinks)
SD.Drinks <- sd(chapter_3_table_7_raw$Drinks)
X <- seq(Min.Drinks, Max.Drinks, .01)
lines(X, dnorm(x=X, mean=Mean.Drinks, sd=SD.Drinks))
Here we include both the normal density line (blue, via the col argument) and the estimated density line (black, the default).
hist(chapter_3_table_7_raw$Drinks, probability=TRUE, ylim=c(0, .03),
nclass=15, main="Histogram and Normal Density Line")
lines(density(chapter_3_table_7_raw$Drinks))
lines(X, dnorm(x=X, mean=Mean.Drinks, sd=SD.Drinks), col="blue")
To obtain measures of skewness and kurtosis, one option is to use the moments package.
# install.packages("moments") # Only necessary once on current system.
library(moments)
skewness(chapter_3_table_7_raw$Drinks)
## [1] 4.770605
kurtosis(chapter_3_table_7_raw$Drinks)
## [1] 32.00413
Here we plot the confidence interval limits. One way to do this is step-by-step. We begin by separating the data by group (not strictly necessary) and then use the t.test() function as before, though not literally for the hypothesis test; recall that this function also provides a confidence interval for the population mean. We then extract the conf.int results from the t.test() output. Note that using the names() function on the result of a t.test() will show the various objects it contains, each of which can be extracted, as we do here for conf.int. We then form a data frame that includes the mean, the lower confidence interval limit, and the upper confidence interval limit. Next we create a plot without numbers on the x-axis and add the group names at the positions given by the numeric placeholder X defined below. Finally, we use the segments() function, which draws a line segment from one (x, y) coordinate to another. There are other ways to do this, but the step-by-step nature of our approach attempts to map the way one would think about this problem. Horizontal limit bars are added by further uses of the segments() function.
T1_CRA.minus.D <- chapter_3_table_7_raw[chapter_3_table_7_raw$Group==1,
"Drinks"]
T1_CRA.plus.D <- chapter_3_table_7_raw[chapter_3_table_7_raw$Group==2,
"Drinks"]
T1_Std <- chapter_3_table_7_raw[chapter_3_table_7_raw$Group==3, "Drinks"]
T2_CRA.minus.D <- chapter_3_table_7_raw[chapter_3_table_7_raw$Group==4,
"Drinks"]
T2_Std <- chapter_3_table_7_raw[chapter_3_table_7_raw$Group==5,
"Drinks"]
# Store each confidence interval under a new name so the raw data are not overwritten.
T1_CRA.minus.D_CI <- t.test(T1_CRA.minus.D)$conf.int[1:2]
T1_CRA.plus.D_CI <- t.test(T1_CRA.plus.D)$conf.int[1:2]
T1_Std_CI <- t.test(T1_Std)$conf.int[1:2]
T2_CRA.minus.D_CI <- t.test(T2_CRA.minus.D)$conf.int[1:2]
T2_Std_CI <- t.test(T2_Std)$conf.int[1:2]
Results <- as.data.frame(rbind(c(mean(T1_CRA.minus.D), T1_CRA.minus.D_CI),
c(mean(T1_CRA.plus.D), T1_CRA.plus.D_CI),
c(mean(T1_Std), T1_Std_CI),
c(mean(T2_CRA.minus.D), T2_CRA.minus.D_CI),
c(mean(T2_Std), T2_Std_CI)))
Group.Names <- c("T1 CRA-D", "T1 CRA+D", "T1 Std", "T2 CRA-D", "T2 Std")
dimnames(Results) <- list(Group.Names, c("Mean", "Lower.Limit", "Upper.Limit"))
# Numeric place-holder for groups
X <- 1:5
# Note that pch=0 is a square plotting character.
plot(1:5, Results$Mean, main="Confidence Interval Plot for Each Group", xaxt='n',
xlab="Groups", ylab="95% CI for Number of Drinks", ylim=c(-15, 150), pch=0)
axis(1, at=X, Group.Names)
segments(X[1], Results[1, 2], X[1], Results[1,3 ])
segments(X[2], Results[2, 2], X[2], Results[2,3 ])
segments(X[3], Results[3, 2], X[3], Results[3,3 ])
segments(X[4], Results[4, 2], X[4], Results[4,3 ])
segments(X[5], Results[5, 2], X[5], Results[5,3 ])
#Length of horizontal segment for confidence interval limit.
Length <- .2
#Horizontal confidence interval limits.
segments(X[1]-Length, Results[1, 3], X[1]+Length, Results[1,3 ])
segments(X[1]-Length, Results[1, 2], X[1]+Length, Results[1,2 ])
segments(X[2]-Length, Results[2, 3], X[2]+Length, Results[2,3 ])
segments(X[2]-Length, Results[2, 2], X[2]+Length, Results[2,2 ])
segments(X[3]-Length, Results[3, 3], X[3]+Length, Results[3,3 ])
segments(X[3]-Length, Results[3, 2], X[3]+Length, Results[3,2 ])
segments(X[4]-Length, Results[4, 3], X[4]+Length, Results[4,3 ])
segments(X[4]-Length, Results[4, 2], X[4]+Length, Results[4,2 ])
segments(X[5]-Length, Results[5, 3], X[5]+Length, Results[5, 3])
segments(X[5]-Length, Results[5, 2], X[5]+Length, Results[5, 2])
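As mentioned above, the components available from a t.test() result can be listed with names(); conf.int is the component we extracted.
names(t.test(T1_CRA.minus.D)) # Includes "conf.int" among the components.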
Above we went step-by-step through one approach to forming a confidence interval plot. Here we use the gplots package and its plotmeans() function. Clearly, this makes things much simpler!
# install.packages("gplots") # Only necessary once on current system.
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
plotmeans(Drinks ~ Group, data=chapter_3_table_7_raw, connect=FALSE,
main="Confidence Interval Plot")
Suppose one wanted to include a horizontal line at zero. This can be done using the abline() function with the h argument representing the value where a horizontal line will be drawn.
plotmeans(Drinks ~ Group, data=chapter_3_table_7_raw, connect=FALSE,
main="Confidence Interval Plot")
abline(h=0)
We can use the plotmeans() function but define the x-axis labels using the group names, as before.
plotmeans(Drinks ~ Group, data=chapter_3_table_7_raw,
connect=FALSE, main="Confidence Interval Plot",
xaxt='n')
axis(1, at=X, Group.Names)
We can adjust the y-axis limits, modify the y-axis label, and modify the x-axis label, if desired.
plotmeans(Drinks ~ Group, data=chapter_3_table_7_raw,
connect=FALSE, main="Confidence Interval Plot", xaxt='n',
ylab="95% CI for Number of Drinks", xlab="Group",
ylim=c(-50, 150))
axis(1, at=X, Group.Names)
Here we compute O’Brien’s test step-by-step.
Y.i1 <- chapter_3_table_7_raw[chapter_3_table_7_raw$Group==1,"Drinks"]
Ybar.1 <- mean(Y.i1)
s2.1 <- var(Y.i1)
n.1 <- length(Y.i1)
Y.i2 <- chapter_3_table_7_raw[chapter_3_table_7_raw$Group==2,"Drinks"]
Ybar.2 <- mean(Y.i2)
s2.2 <- var(Y.i2)
n.2 <- length(Y.i2)
Y.i3 <- chapter_3_table_7_raw[chapter_3_table_7_raw$Group==3,"Drinks"]
Ybar.3 <- mean(Y.i3)
s2.3 <- var(Y.i3)
n.3 <- length(Y.i3)
Y.i4 <- chapter_3_table_7_raw[chapter_3_table_7_raw$Group==4,"Drinks"]
Ybar.4 <- mean(Y.i4)
s2.4 <- var(Y.i4)
n.4 <- length(Y.i4)
Y.i5 <- chapter_3_table_7_raw[chapter_3_table_7_raw$Group==5,"Drinks"]
Ybar.5 <- mean(Y.i5)
s2.5 <- var(Y.i5)
n.5 <- length(Y.i5)
r.1 <- (((n.1-1.5)*n.1*((Y.i1-Ybar.1)^2))-(.5*s2.1*(n.1-1)))/((n.1-1)*(n.1-2))
r.2 <- (((n.2-1.5)*n.2*((Y.i2-Ybar.2)^2))-(.5*s2.2*(n.2-1)))/((n.2-1)*(n.2-2))
r.3 <- (((n.3-1.5)*n.3*((Y.i3-Ybar.3)^2))-(.5*s2.3*(n.3-1)))/((n.3-1)*(n.3-2))
r.4 <- (((n.4-1.5)*n.4*((Y.i4-Ybar.4)^2))-(.5*s2.4*(n.4-1)))/((n.4-1)*(n.4-2))
r.5 <- (((n.5-1.5)*n.5*((Y.i5-Ybar.5)^2))-(.5*s2.5*(n.5-1)))/((n.5-1)*(n.5-2))
# Notice the following, where the first column is the mean of the r.j values and the second
# column is the variance.
as.matrix(rbind(
c(mean(r.1), s2.1),
c(mean(r.2), s2.2),
c(mean(r.3), s2.3),
c(mean(r.4), s2.4),
c(mean(r.5), s2.5)))
## [,1] [,2]
## [1,] 2439.836 2439.836
## [2,] 3713.122 3713.122
## [3,] 22162.041 22162.041
## [4,] 3952.290 3952.290
## [5,] 2008.812 2008.812
# Add a column of the r values.
chapter_3_table_7_raw$r <- c(r.1, r.2, r.3, r.4, r.5)
Now, using the new column of data in the data set, r, we perform the ANOVA as before.
summary(aov(r ~ as.factor(Group), data=chapter_3_table_7_raw))
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 4 5.074e+09 1.269e+09 0.953 0.438
## Residuals 83 1.105e+11 1.331e+09
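As an aside, the per-group computations above can be written much more compactly with dplyr. The following is a sketch (r2 is simply a name we choose; unlike the column added above, this version does not rely on the rows being sorted by group, and we assign to a new object to leave the original data frame untouched).
Drinks.r <- chapter_3_table_7_raw %>%
  group_by(Group) %>%
  mutate(r2 = (((n()-1.5)*n()*((Drinks-mean(Drinks))^2)) -
               (.5*var(Drinks)*(n()-1)))/((n()-1)*(n()-2))) %>%
  ungroup()
summary(aov(r2 ~ as.factor(Group), data=Drinks.r))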
The Levene test is implemented in the car package with the leveneTest() function. Note that below we use center=mean, though the default is center=median.
# install.packages("car") # Only necessary once on current system.
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(Drinks ~ as.factor(Group), data=chapter_3_table_7_raw, center=mean)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 4 1.7558 0.1456
## 83
The psych package has a nice descriptive statistics function, describe(), that can easily be applied to a data set. To obtain a useful set of descriptives for all variables in a data set, describe() can be used as follows.
# install.packages("psych") # Only necessary once on current system.
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
describe(chapter_3_table_7_raw)
## vars n mean sd median trimmed mad min max
## Group 1 88 3.09 1.42 3.00 3.11 1.48 1.00 5.00
## Drinks 2 88 36.88 81.96 3.85 19.64 5.70 0.00 624.62
## LgDrinks 3 88 0.84 0.85 0.68 0.77 1.01 0.00 2.80
## r 4 88 6718.94 36447.29 649.23 1217.95 595.96 -106.15 335136.99
## range skew kurtosis se
## Group 4.00 -0.04 -1.35 0.15
## Drinks 624.62 4.69 28.28 8.74
## LgDrinks 2.80 0.40 -1.34 0.09
## r 335243.14 8.37 71.99 3885.29
Note that we obtained skewness and kurtosis earlier for the full data using skewness() and kurtosis() from the moments package. Further, notice that, although the skewness values are similar across approaches, kurtosis is noticeably different. The describe() function has three different approaches for calculating skewness and kurtosis, and each of these approaches differs from what the kurtosis() function produces. Because there are different ways to compute certain statistics, the “correct” value depends on exactly what you want. Arguably, because we use skewness and kurtosis as ways of trying to understand the distribution shape, any one of them is useful. Here is a comparison of the four approaches.
kurtosis(chapter_3_table_7_raw$Drinks)
## [1] 32.00413
describe(chapter_3_table_7_raw$Drinks, type=1)$kurtosis
## [1] 29.00413
describe(chapter_3_table_7_raw$Drinks, type=2)$kurtosis
## [1] 30.79357
describe(chapter_3_table_7_raw$Drinks, type=3)$kurtosis
## [1] 28.2809
The book uses what describe() calls type=2. Thus, to reproduce the values in the book's descriptive statistics table, we use the code below, which also adds the group name to the output.
rbind(c(Group=Group.Names[1],
describe(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==1,"Drinks"], type=2)),
c(Group=Group.Names[2],
describe(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==2,"Drinks"], type=2)),
c(Group=Group.Names[3],
describe(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==3,"Drinks"], type=2)),
c(Group=Group.Names[4],
describe(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==4,"Drinks"], type=2)),
c(Group=Group.Names[5],
describe(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==5,"Drinks"], type=2)))
## Group vars n mean sd median trimmed mad min
## [1,] "T1 CRA-D" 1 15 33.80431 49.3947 10.97846 25.98249 16.27667 0
## [2,] "T1 CRA+D" 1 19 26.4196 60.93539 0 15.57014 0 0
## [3,] "T1 Std" 1 17 71.51416 148.8692 28.8 39.40836 40.91976 0
## [4,] "T2 CRA-D" 1 17 20.31077 62.86724 0 5.864205 0 0
## [5,] "T2 Std" 1 20 33.76562 44.81977 13.47692 25.65317 19.98089 0
## max range skew kurtosis se
## [1,] 169.2923 169.2923 1.836401 3.100115 12.75366
## [2,] 237.28 237.28 2.782492 8.104232 13.97954
## [3,] 624.6154 624.6154 3.587491 13.73201 36.10609
## [4,] 257.32 257.32 3.778988 14.76502 15.24755
## [5,] 164.0615 164.0615 1.634646 2.432478 10.02201
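The same per-group descriptives can be obtained more compactly with the describeBy() function from the psych package, which applies describe() within each level of a grouping variable (a sketch; mat=TRUE requests a matrix-style layout, and we pass type=2 through to describe()).
describeBy(chapter_3_table_7_raw$Drinks, group=chapter_3_table_7_raw$Group,
mat=TRUE, type=2)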
Turning now to the log of the number of drinks, we can use the describe() function as follows, replacing Drinks with LgDrinks (note that LgDrinks is already in the data set).
rbind(c(Group=Group.Names[1],
describe(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==1,"LgDrinks"], type=2)),
c(Group=Group.Names[2],
describe(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==2,"LgDrinks"], type=2)),
c(Group=Group.Names[3],
describe(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==3,"LgDrinks"], type=2)),
c(Group=Group.Names[4],
describe(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==4,"LgDrinks"], type=2)),
c(Group=Group.Names[5],
describe(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==5,"LgDrinks"], type=2)))
## Group vars n mean sd median trimmed mad
## [1,] "T1 CRA-D" 1 15 1.007188 0.8033638 1.078401 0.9905094 1.147596
## [2,] "T1 CRA+D" 1 19 0.5400679 0.8274548 0 0.4637766 0
## [3,] "T1 Std" 1 17 1.228967 0.8689398 1.474216 1.206408 0.8106354
## [4,] "T2 CRA-D" 1 17 0.3912861 0.7442338 0 0.282647 0
## [5,] "T2 Std" 1 20 1.053933 0.7811781 1.150223 1.05333 1.063293
## min max range skew kurtosis se
## [1,] 0 2.231195 2.231195 -0.03217937 -1.457266 0.2074276
## [2,] 0 2.377088 2.377088 1.373427 0.3182206 0.1898312
## [3,] 0 2.796307 2.796307 -0.2371426 -0.8679882 0.2107489
## [4,] 0 2.412158 2.412158 1.943825 2.782116 0.1805032
## [5,] 0 2.217646 2.217646 -0.2085922 -1.387944 0.1746767
Note that, had we been interested in applying a different transformation to the data, it could be done directly within the describe() call, as we do below using the sqrt() function (for the square root transformation).
rbind(c(Group=Group.Names[1],
describe(sqrt(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==1,"Drinks"]), type=2)),
c(Group=Group.Names[2],
describe(sqrt(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==2,"Drinks"]), type=2)),
c(Group=Group.Names[3],
describe(sqrt(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==3,"Drinks"]), type=2)),
c(Group=Group.Names[4],
describe(sqrt(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==4,"Drinks"]), type=2)),
c(Group=Group.Names[5],
describe(sqrt(chapter_3_table_7_raw[chapter_3_table_7_raw$Group ==5,"Drinks"]), type=2)))
## Group vars n mean sd median trimmed mad min
## [1,] "T1 CRA-D" 1 15 4.229386 4.129589 3.313376 3.879196 4.912411 0
## [2,] "T1 CRA+D" 1 19 2.596713 4.557393 0 1.996097 0 0
## [3,] "T1 Std" 1 17 5.937236 6.207242 5.366563 5.062714 6.33236 0
## [4,] "T2 CRA-D" 1 17 1.886724 4.218765 0 1.068875 0 0
## [5,] "T2 Std" 1 20 4.374796 3.923851 3.645773 4.040459 5.317046 0
## max range skew kurtosis se
## [1,] 13.01124 13.01124 0.8036426 -0.2710347 1.066255
## [2,] 15.4039 15.4039 1.873749 2.622742 1.045538
## [3,] 24.99231 24.99231 1.877095 4.900944 1.505477
## [4,] 16.0412 16.0412 2.810644 8.264705 1.023201
## [5,] 12.80865 12.80865 0.5849369 -0.6513498 0.8773998
Using the descriptive statistics previously calculated on the number of drinks by group, here we fit a regression model with the log standard deviations conditional on the log of the means for the five groups. First, we create a data frame of the log transformed means and standard deviations. Then we use the lm() function to fit a linear model, here a simple regression model.
Log_Descriptives <- data.frame(LogMeans=log(c(Ybar.1, Ybar.2, Ybar.3, Ybar.4, Ybar.5)),
LogSD=log(sqrt(c(s2.1, s2.2, s2.3, s2.4, s2.5))))
summary(lm(LogSD~LogMeans, data=Log_Descriptives))
##
## Call:
## lm(formula = LogSD ~ LogMeans, data = Log_Descriptives)
##
## Residuals:
## 1 2 3 4 5
## -0.2926 0.1003 0.2545 0.3267 -0.3889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5795 1.4096 1.121 0.344
## LogMeans 0.7422 0.3977 1.866 0.159
##
## Residual standard error: 0.3735 on 3 degrees of freedom
## Multiple R-squared: 0.5372, Adjusted R-squared: 0.3829
## F-statistic: 3.482 on 1 and 3 DF, p-value: 0.1589
Now we plot the means based on the log transformed scores. Notice here the use of \n in the title to move to a second line.
plotmeans(LgDrinks ~ Group , data=chapter_3_table_7_raw, connect=FALSE,
main="Confidence Interval Plot\n(For Log Transformed Number of Drinks)",
xaxt='n', ylab="95% CI for Log Number of Drinks", xlab="Group")
axis(1, at=X, Group.Names)
Here is the ANOVA on the log transformed number of drinks.
summary(aov(LgDrinks ~ as.factor(Group), data=chapter_3_table_7_raw))
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 4 9.04 2.2597 3.48 0.0112 *
## Residuals 83 53.90 0.6494
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Computing the noncentrality parameter of the noncentral F-distribution based on a set of supposed population means (for a hypothetical situation):
Supposed.Within.Group.SD <- 100
Supposed.Group.Means <- c(400, 450, 500)
Grand.Mean <- mean(Supposed.Group.Means)
Treatment.Effects <- Supposed.Group.Means-Grand.Mean
n <- 10
a <- 3
N <- n*a
(lambda <- n*sum(Treatment.Effects^2)/Supposed.Within.Group.SD^2)
## [1] 5
(f2 <- lambda/N)
## [1] 0.1666667
(f <- sqrt(f2))
## [1] 0.4082483
Using the above supposed means, the f effect size can be computed as follows, where multiplying the variance by (length(Supposed.Group.Means)-1)/length(Supposed.Group.Means) converts var()'s default n-1 denominator into an n denominator (here, n is the number of group means).
sigma.m <- sqrt(var(Supposed.Group.Means)*(length(Supposed.Group.Means)-1)/length(Supposed.Group.Means))
f <- sigma.m/Supposed.Within.Group.SD
In R, we can find the exact power for the specified degrees of freedom by determining the area of the particular noncentral F-distribution that lies beyond the critical value from the central F-distribution. We do so below for different sample sizes (some of this was calculated above but is repeated here for completeness).
# Constant (parameters)
(Supposed.sigma_e <- 100)
## [1] 100
(Supposed.Group.Means <- c(400, 450, 500))
## [1] 400 450 500
(a <- length(Supposed.Group.Means))
## [1] 3
(Grand.Mean <- mean(Supposed.Group.Means))
## [1] 450
(alpha <- Supposed.Group.Means-Grand.Mean)
## [1] -50 0 50
(Supposed.sigma_m <- sqrt(sum(alpha^2)/a))
## [1] 40.82483
(f <- Supposed.sigma_m/Supposed.sigma_e)
## [1] 0.4082483
# Variables
(n <- 10)
## [1] 10
# Constant conditional on the situation (i.e., conditional on n)
(lambda <- n*sum(Treatment.Effects^2)/Supposed.Within.Group.SD^2)
## [1] 5
(CV <- qf(.95, df1=a-1, df2=a*n-a))
## [1] 3.354131
# Statistical power.
(power <- pf(CV, df1=a-1, df2=a*n-a, ncp=lambda, lower.tail=FALSE))
## [1] 0.4579923
Here we use the above constant values but specify a different sample size (and thus a new lambda results).
# Variables
(n <- 25)
## [1] 25
# Constant conditional on the situation (i.e., conditional on n)
(lambda <- n*sum(Treatment.Effects^2)/Supposed.Within.Group.SD^2)
## [1] 12.5
(CV <- qf(.95, df1=a-1, df2=a*n-a))
## [1] 3.123907
(power <- pf(CV, df1=a-1, df2=a*n-a, ncp=lambda, lower.tail=FALSE))
## [1] 0.8827749
Here we use the above constant values but specify yet another sample size (and thus a new lambda results).
# Variables
(n <- 20)
## [1] 20
# Constant conditional on the situation (i.e., conditional on n)
(lambda <- n*sum(Treatment.Effects^2)/Supposed.Within.Group.SD^2)
## [1] 10
(CV <- qf(.95, df1=a-1, df2=a*n-a))
## [1] 3.158843
(power <- pf(CV, df1=a-1, df2=a*n-a, ncp=lambda, lower.tail=FALSE))
## [1] 0.7933118
Trying to home in on power of .80, we try n=21, which turns out to be the smallest sample size that satisfies the desired power.
# Variables
(n <- 21)
## [1] 21
# Constant conditional on the situation (i.e., conditional on n)
(lambda <- n*sum(Treatment.Effects^2)/Supposed.Within.Group.SD^2)
## [1] 10.5
(CV <- qf(.95, df1=a-1, df2=a*n-a))
## [1] 3.150411
(power <- pf(CV, df1=a-1, df2=a*n-a, ncp=lambda, lower.tail=FALSE))
## [1] 0.8147697
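The trial-and-error search over n above can be automated with a simple loop; the following is a minimal sketch using the quantities already defined.
# Scan per-group sample sizes until power first reaches .80.
n <- 2
power <- 0
while(power < .80) {
  n <- n + 1
  lambda <- n*sum(Treatment.Effects^2)/Supposed.Within.Group.SD^2
  power <- pf(qf(.95, df1=a-1, df2=a*n-a), df1=a-1, df2=a*n-a,
              ncp=lambda, lower.tail=FALSE)
}
n # Smallest n meeting the desired power (21, as found above).
power # Power at that n (.8148, as above).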
Rather than planning sample size as above based on the definitional forms and noncentral distributions directly, the pwr package can be used as follows (after installing and loading it).
# install.packages("pwr") # Only necessary once on current system.
library(pwr)
# Returns the exact (possibly fractional) sample size for the requested power.
pwr.anova.test(k=3, f=f, sig.level=.05, power=.80)
##
## Balanced one-way analysis of variance power calculation
##
## k = 3
## n = 20.30205
## f = 0.4082483
## sig.level = 0.05
## power = 0.8
##
## NOTE: n is number in each group
# Here we "round up" by using the ceiling function to get
# the necessary per-group sample size.
ceiling(pwr.anova.test(k=3, f=f, sig.level=.05, power=.80)$n)
## [1] 21
Here we use the ci.srsnr() function to estimate the population median of the square root of the signal to noise ratio and then form a one-sided confidence interval to find the lowest plausible value with 80% confidence. Additionally, we form a one-sided 66.7% confidence interval.
ci.srsnr(F.value=6.611, df.1=1, df.2=24, N=26, conf.level=.00001)
## $Lower.Limit.of.the.Square.Root.of.the.Signal.to.Noise.Ratio
## [1] 0.4988097
##
## $Upper.Limit.of.the.Square.Root.of.the.Signal.to.Noise.Ratio
## [1] 0.498815
ci.srsnr(F.value=6.611, df.1=1, df.2=24, N=26, alpha.lower=.20, alpha.upper=0)
## $Lower.Limit.of.the.Square.Root.of.the.Signal.to.Noise.Ratio
## [1] 0.3229929
##
## $Upper.Limit.of.the.Square.Root.of.the.Signal.to.Noise.Ratio
## numeric(0)
ci.srsnr(F.value=6.611, df.1=1, df.2=24, N=26, alpha.lower=.3333, alpha.upper=0)
## $Lower.Limit.of.the.Square.Root.of.the.Signal.to.Noise.Ratio
## [1] 0.4087853
##
## $Upper.Limit.of.the.Square.Root.of.the.Signal.to.Noise.Ratio
## numeric(0)
# For the estimated median effect size (i.e., using conf.level=.00001), necessary sample
# size is
pwr.anova.test(k=2, f=.4988, sig.level=.05, power=.80)
##
## Balanced one-way analysis of variance power calculation
##
## k = 2
## n = 16.79006
## f = 0.4988
## sig.level = 0.05
## power = 0.8
##
## NOTE: n is number in each group
# Using the 80% single-sided confidence interval limits.
pwr.anova.test(k=2, f=.3230, sig.level=.05, power=.80)
##
## Balanced one-way analysis of variance power calculation
##
## k = 2
## n = 38.60018
## f = 0.323
## sig.level = 0.05
## power = 0.8
##
## NOTE: n is number in each group
# Using the 66.66% single-sided confidence interval limits.
pwr.anova.test(k=2, f=.4088, sig.level=.05, power=.80)
##
## Balanced one-way analysis of variance power calculation
##
## k = 2
## n = 24.48153
## f = 0.4088
## sig.level = 0.05
## power = 0.8
##
## NOTE: n is number in each group
To evaluate power at a given sample size for a supposed population value, we can also use the pwr.anova.test() function, but now specifying the sample size, n, instead of the (desired) power. For example, with the per-group sample size of 39 (i.e., 38.60018 rounded up), the call below returns the achieved power, which is slightly above .80.
pwr.anova.test(k=2, f=.3230, sig.level=.05, n=39)