I. Data
There are two main data types: qualitative and quantitative. The previous module on Proportion Testing dealt with qualitative data sets. This module will deal with quantitative data.
We have two main tools when working with quantitative data:
- \(t\)-procedures
- ANOVA
The \(t\)-test works well but is limited to single sample tests and situations where the grouping variable has only two levels, like marital status: married vs. not married. If we have a grouping variable like class rank in college which has four levels, we need to use ANOVA (Analysis of Variance).
Initializing RStudio
The primary data set for this module is Data3350, which was produced in 2015 during an undergraduate research project about personality and humor. The VarsData3350 PDF file describes each variable in the Data3350 file. Both are available for download in D2L. Be sure to put the Data3350 file in your R folder in Documents, and make sure your working directory is set to that same folder (Session menu). The code block below uses the library function to load the mosaic and readxl packages, then imports the data frame used in this module: Data3350.
library(mosaic)
library(readxl)
Data3350 = read_excel("Data3350.xlsx")
II. Assumptions for \(t\)-tests and ANOVA
The \(t\)-tests and ANOVA are built on three main assumptions about the structure of the data:
- Normality: The data points were drawn at random from a normal distribution.
- Independence: The scores of each data point were independent of all the other observations.
- Homogeneity of Variances: Standard deviations of all populations and sub-populations are equal.
To check the normality assumption, we can inspect a histogram (shape) and box plot (check for outliers). Independence cannot be verified by statistical analysis; the researcher who collects the data must do the job properly. We can check homogeneity with statistical procedures. Yet, in many cases, the procedures we use to verify homogeneity are less accurate than the \(t\)-tests and ANOVAs themselves. Instead, we guard against problems with homogeneity by avoiding sharply unequal group sizes.
1. Checking Normality
The \(t\)-statistic is robust which means that minor violations of its assumptions don’t cause inaccuracies in the resulting \(p\)-values. The \(F\)-statistic (for ANOVA) is even more robust than \(t\). The bottom line: for moderate sample sizes, no data checks are needed. The \(t\)- and \(F\)-statistics hold up well under stress. Only for small sample sizes are data checks needed.
Verifying Normality Assumption: \(t\)-Procedures
\[\begin{array}{ccl} \textbf{Sample Size} && \textbf{Data Check}\\ \hline
n \geq 40 && \text{No data checks needed due to robustness of }t\\
15 \leq n < 40 && \text{Check box plot, verify no outliers}\\
n < 15 && \text{Also check histogram (need approximate normality)}
\end{array}\]
For independent samples \(t\)-tests, we use the overall sample size. If \(n_1\) and \(n_2\) are the sample size of groups 1 and 2 respectively, then for the purposes of the chart above \[n=n_1+n_2\]
Verifying Normality Assumption: ANOVA
\[\begin{array}{ccl} \textbf{Sample Size} && \textbf{Data Check}\\ \hline
n \geq 20 && \text{No data checks needed due to robustness of }F\\
n < 20 && \text{Do not run an ANOVA}
\end{array}\]
As with the \(t\)-test, the \(n\) in the chart for ANOVA refers to overall sample size: the total number of observations including all samples. In either case, if the data checks fail, the data are not appropriate for these procedures, and the researcher should stop immediately. Statistical apps will produce \(p\)-values with garbage data just as easily as with good data. The researcher conducting the quantitative analysis is responsible for checking the data for appropriateness.
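The two charts above can be condensed into a small lookup helper. This is only a sketch: `normality_check` is a hypothetical name, not a function from mosaic or base R, and it does nothing more than encode the thresholds stated in the charts.

```r
# Hypothetical helper: returns the data check recommended by the charts above.
# n is the overall sample size (all groups combined).
normality_check <- function(n, procedure = c("t", "anova")) {
  procedure <- match.arg(procedure)
  if (procedure == "t") {
    if (n >= 40) {
      "No data checks needed"
    } else if (n >= 15) {
      "Check box plot for outliers"
    } else {
      "Check box plot and histogram"
    }
  } else {
    if (n >= 20) "No data checks needed" else "Do not run an ANOVA"
  }
}

normality_check(25, "t")
normality_check(12, "anova")
```

The helper only reports which check applies; inspecting the plots and deciding whether the data pass is still up to the researcher.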
2. Checking Homogeneity
The robustness of \(t\) and \(F\) will ensure reasonable accuracy of the resulting \(p\)-values even when the standard deviations are different, so long as the group sizes are not sharply unequal. What is sharply unequal? We would not wish to see the following data set for a \(t\)-test.
\[\begin{array}{ccccc} &&&\textbf{Polar Bears} && \textbf{Grizzly Bears}\\ \hline
\overline{x} &&& 800 && 900\\
s &&& 125 && 275\\
n &&& 8 && 28
\end{array}\]
We require the ratio of largest-to-smallest group size to be no more than \(2:1\). In the above example, the standard deviations are clearly not equal, and the group size ratio is \(3.5:1\), a classic example of sharply unequal groups.
The same recommendation holds true for ANOVAs, where we require that the largest group have no more than twice as many observations as the smallest group. When the standard deviations differ even slightly and the group sizes are sharply unequal, the data are not appropriate for either \(t\)-procedures or ANOVA. In a more advanced statistics class, we have options for investigation. At this level, however, just don’t run the ANOVA or \(t\)-test under suspicious circumstances.
Hypothesis Testing Steps
- Identify correct procedure
- Set up null \((H_0)\) and alternate \((H_a)\) hypotheses
- Verification: are data appropriate for procedure?
- Set \(\alpha\) (\(\alpha = .05\) is default value)
- Run Stats App to determine \(p\)-value
- Statistical Conclusion: “Reject \(H_0\)” or “Fail to reject \(H_0\)”
- Research conclusion
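The group-size rule can be checked with simple arithmetic. A minimal sketch using the sample sizes from the polar bear and grizzly bear table in this section (the variable names are made up for illustration):

```r
# Group sizes from the polar bear / grizzly bear table
n_polar   <- 8
n_grizzly <- 28

# Ratio of largest to smallest group size
size_ratio <- max(n_polar, n_grizzly) / min(n_polar, n_grizzly)
size_ratio

# Rule of thumb in this module: the ratio should be no more than 2
sharply_unequal <- size_ratio > 2
sharply_unequal
```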
IV. Matched Pairs \(t\)-test
The most useful application of the one sample \(t\)-test is the matched pairs design. The matched pairs \(t\)-test (sometimes called a dependent samples \(t\)-test) is most often used to analyze data from a pretest - posttest research design. The matched pairs \(t\)-test reduces the variance between subjects by creating a Gain score that ignores the starting and ending points for each subject and focuses only on the difference between pretest and posttest scores.
Does Stress Increase During Midterms?
The variables Stress1 and Stress2 in Data3350 were identical measures of stress, with Stress1 administered in the second week of the semester, and Stress2 administered during midterms when most students had several tests, projects or papers due. Test for an increase in Stress during midterms at \(0.01\) level of significance.
Hypothesis Test
We will use a matched pairs \(t\)-test, which is a one sample \(t\)-test applied to the Gain scores. Even though we start with two samples, Pre and Post, we are using a one sample test. The hypothesis is that the average Stress score will be higher during midterms and, specifically, that the average Gain score (\(\mu_g\)) will be greater than zero.
\[\begin{align*}H_0 &: \mu_g = 0\\ H_a &: \mu_g > 0\end{align*}\] We will run a quick data analysis, but the sample size is large enough that the robustness of \(t\) will ensure accurate \(p\)-values.
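The commands in this section refer to vectors named Gain, Pre, and Post, but the step that creates them is not shown. A minimal sketch of that step, assuming Pre and Post are simply the Stress1 and Stress2 columns and Gain is Post minus Pre. The data frame `toy` and its values are made up so the block runs on its own; with the course data, Data3350 would take its place.

```r
# Made-up stand-in for Data3350 (values are illustrative only)
toy <- data.frame(
  Stress1 = c(12, 18, 15, 10, 14),
  Stress2 = c(13, 18, 17, 12, 13)
)

# Assumed definitions: posttest minus pretest gives the Gain score
Pre  <- toy$Stress1
Post <- toy$Stress2
Gain <- Post - Pre

Gain
mean(Gain)
```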
favstats(Gain)
On average, students experienced about three-quarters of a point higher Stress during midterms. With \(n = 164\), no further data checks are needed. These data are appropriate for \(t\) procedures. Even though it is not required for verification, most good researchers inspect the histogram and box plot anyway.
histogram(Gain, width = 2)

boxplot(Gain, horizontal = TRUE)

In a large data set like this one, outliers will not decrease the accuracy of the \(t\)-test \(p\)-values. Still, note that there is one outlier to each side balancing each other. The histogram provides quite convincing evidence the data were drawn from a bell-shaped population.
Two Ways to Conduct Matched Pairs using RStudio
We can produce a one sample \(t\)-test on the Gain scores using the following code. The option \(\fbox{alternative = greater}\) specifies the one-tailed test where we expect positive gains.
t.test(Gain, alternative = "greater")
One Sample t-test
data: Gain
t = 3.0839, df = 163, p-value = 0.0012
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
0.3561614 Inf
sample estimates:
mean of x
0.7682927
We can also use the two sample \(t\)-test format, but include the option \(\fbox{paired = TRUE}\) to let RStudio know to use the matched pairs settings, not the independent samples settings. Here, we enter the variables Pre and Post, but the t.test function will create gain scores before running the test.
t.test(Post , Pre, data = Data3350,
paired = TRUE,
alternative = "greater")
Paired t-test
data: Post and Pre
t = 3.0839, df = 163, p-value = 0.0012
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.3561614 Inf
sample estimates:
mean of the differences
0.7682927
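Both calls reduce to the same arithmetic: the mean of the gain scores divided by its standard error. A small sketch with made-up numbers (not from Data3350) that computes the statistic by hand and compares it with both forms of t.test from base R:

```r
# Made-up pretest and posttest scores for eight participants (illustrative only)
pre  <- c(12, 18, 15, 22, 9, 14, 20, 17)
post <- c(13, 20, 15, 23, 11, 15, 20, 18)
gain <- post - pre

# One-sample t statistic on the gains: mean / (sd / sqrt(n))
n      <- length(gain)
t_hand <- mean(gain) / (sd(gain) / sqrt(n))
p_hand <- pt(t_hand, df = n - 1, lower.tail = FALSE)  # one-tailed, "greater"

# The two t.test forms described above
one_sample <- t.test(gain, alternative = "greater")
paired     <- t.test(post, pre, paired = TRUE, alternative = "greater")

c(by_hand    = t_hand,
  one_sample = unname(one_sample$statistic),
  paired     = unname(paired$statistic))
```

All three values agree, which is why the two outputs above report the same \(t\), degrees of freedom and \(p\)-value.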
Results
Either way, we find that \(p=0.0012 < 0.01 =\alpha\), and we reject the null. With the \(p\)-value so close to zero, we have met a more stringent level of proof that the null is likely false. That means we can strengthen our research conclusion accordingly. Evidence strongly suggests that Stress increases during midterms.
Level of Significance
Why did we not use the default \(\alpha = 0.05\)? The matched pairs research design has the advantage of being very sensitive to changes in Gain scores. We only measure the differences between the starting points and ending points. A participant who has Pre = 12 and Post = 18 and one who has Pre = 18 and Post = 24 both have Gain scores of 6. Since the variation among individuals is erased when we calculate the Gains, the test becomes much more sensitive.
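The two participants described above make the point concrete. A minimal sketch using only those two pairs of scores:

```r
# The two participants described above
pre  <- c(12, 18)
post <- c(18, 24)
gain <- post - pre

gain      # both participants gain 6 points
sd(pre)   # the starting points differ between participants
sd(gain)  # the gains do not differ at all
```

The spread among starting points never enters the gain scores, so it cannot inflate the standard error of the test.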
What if we just ran the comparison like an independent samples (two sample) \(t\)-test? How much difference would it make?
t.test(Post, Pre,
paired = FALSE,
alternative = "greater")
Welch Two Sample t-test
data: Post and Pre
t = 2.1058, df = 324.33, p-value = 0.018
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.1664461 Inf
sample estimates:
mean of x mean of y
13.78659 13.01829
Notice that the \(p\)-value is more than one order of magnitude larger than with the paired approach. We would still reject the null at the \(0.05\) level of significance, but not at the \(0.01\) level. The reason we used a smaller \(\alpha\) was due to this feature of the matched pairs design. Because the matched pairs design ignores the substantial human variation at the start, we can reduce \(\alpha\) to create a more stringent statistical test while retaining adequate power, a concept we will discuss later in the course.
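The effect of ignoring the pairing can be reproduced on a small scale. A sketch with made-up scores (not from Data3350) in which participants differ widely at the start but gain by similar amounts:

```r
# Made-up scores: wide spread between participants, small consistent gains
pre  <- c(5, 11, 16, 22, 9, 14, 27, 18)
post <- pre + c(1, 2, 0, 1, 2, 1, 0, 1)

# Same data, analyzed with and without the pairing
paired_p <- t.test(post, pre, paired = TRUE,  alternative = "greater")$p.value
welch_p  <- t.test(post, pre, paired = FALSE, alternative = "greater")$p.value

c(paired = paired_p, welch = welch_p)
```

In this toy example the paired test rejects at the \(0.05\) level while the Welch test does not, because the Welch standard error is dominated by the differences between participants.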
V. Independent Samples \(t\)-Test
The independent samples \(t\)-test compares samples drawn from two different sub-populations directly to each other. For this investigation, let’s consider the differences in the sleep patterns and caffeine consumption between those who are 21 years old or older compared to those who are twenty and younger. We’ll run two hypothesis tests at once.
Do Older College Students Consume More Caffeine and Get Less Sleep?
The grouping variable G21 has two levels, Y for those answering yes to whether their current age was 21 years old or older, and N for those answering no. We will perform 1-tailed \(t\)-tests on two variables: Sleep, the number of hours of sleep in the last 48 hours including naps (divided by 2, to give an average per night), and Caff, the number of 12 ounce servings of caffeine in the last 24 hours. Our hypotheses are that older students will get less sleep and consume more caffeine. Test at the default \(\alpha = 0.05\) level of significance.
Using a grouping variable G21 with two levels, we need the independent samples \(t\)-test. To clarify our hypothesis setup, let “Y” indicate students who are not yet 21 (Younger), and “D” indicate students who are 21 or olDer.
\[\begin{array}{ccccc} \textbf{Caffeine} &&&\textbf{Sleep}\\ H_0 : \mu_Y = \mu_D &&& H_0 : \mu_Y = \mu_D\\ H_a : \mu_Y < \mu_D &&& H_a : \mu_Y > \mu_D\end{array}\]
Both null hypotheses are that the samples were drawn from identical distributions. The \(\fbox{tally}\) function works well to show the frequency table comparisons. The tallies for the Caff data:
tally(Caff ~ G21, data = Data3350)
G21
Caff N Y
0 39 11
1 14 12
2 21 13
3 13 7
4 8 6
5 5 2
6 3 3
7 1 0
8 1 3
10 1 2
Next come the summary statistics. Notice that using a statistical model as the object for the favstats function creates a stats summary for both groups: those not yet 21 (N) and those 21 or older (Y).
favstats(Caff ~ G21, data = Data3350)
The summary statistics for Sleep:
favstats(Sleep ~ G21, data = Data3350)
Let’s take a quick glance at the histograms and box plots for the two variables.
Graphical Comparison of Caffeine: Younger vs. Older Students
histogram (~ Caff | G21 , data = Data3350, layout = c(1,2))

boxplot(Caff ~ G21, data = Data3350, horizontal = TRUE)

Graphical Comparison of Sleep: Younger vs. Older Students
histogram (~ Sleep | G21 , data = Data3350, layout = c(1,2))

boxplot(Sleep ~ G21, data = Data3350, horizontal = TRUE)

Note that none of the graphical analysis or tallying was required for verification. Still, the visualizations show exactly what we would expect: approximately bell-shaped distributions in all subgroups with medians slightly higher for Caff and slightly lower for Sleep in the group who are 21 years old or older.
Due to the robustness of \(t\) and combined sample sizes of far more than 40, no data checks are necessary for the normality assumption. The ratio of the larger group size to smaller is \(106:59\) or about \(1.80 : 1\). Since the ratio is less than \(2:1\), we do not have sharply unequal group sizes and, hence, we have no issues with the homogeneity assumption. (A quick scan of the summary statistics suggests the standard deviations are not very different.) These data are very much appropriate for \(t\)-procedures.
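The group sizes and their ratio can be recovered from the frequency table. A short sketch that re-enters the counts from the tally(Caff ~ G21) output by hand (the vector names are made up for illustration):

```r
# Counts copied from the tally(Caff ~ G21) output above
caff_N <- c(39, 14, 21, 13, 8, 5, 3, 1, 1, 1)  # not yet 21
caff_Y <- c(11, 12, 13, 7, 6, 2, 3, 0, 3, 2)   # 21 or older

# Group sizes are the column totals
n_N <- sum(caff_N)
n_Y <- sum(caff_Y)
c(N = n_N, Y = n_Y)

# Largest-to-smallest group size ratio; the rule of thumb asks for at most 2
size_ratio <- max(n_N, n_Y) / min(n_N, n_Y)
round(size_ratio, 2)
```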
Results: Caff Comparison
t.test(Caff ~ G21, data = Data3350,
alternative = "less")
Welch Two Sample t-test
data: Caff by G21
t = -2.0641, df = 100.29, p-value = 0.02079
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -0.1541073
sample estimates:
mean in group N mean in group Y
1.839623 2.627119
Because \(p = 0.02079 < 0.05 = \alpha\), we reject the null. Evidence suggests that older students consume more caffeine than younger students do \((p=0.02)\).
Results: Sleep Comparison
t.test(Sleep ~ G21, data = Data3350,
alternative = "greater")
Welch Two Sample t-test
data: Sleep by G21
t = 2.1533, df = 126.96, p-value = 0.01659
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.1626102 Inf
sample estimates:
mean in group N mean in group Y
6.603774 5.898305
Because \(p = 0.01659 < 0.05 = \alpha\), we reject the null. Evidence suggests that older students get fewer hours of sleep on average per night than younger students do \((p=0.02)\).
VI. Randomization Techniques for Independent Samples \(t\)-tests
How can we empirically test that the means are significantly different using machine-based randomization? Why not shuffle all the Sleep observations? That way, each person will be paired with a randomly permuted observation from the sample, regardless of whether they are younger than 21 (“N” for G21 variable) or 21 or older (“Y” for G21 variable). With randomly permuted observations, the group means should be approximately equal, so the difference between them should be close to zero.
1. Difference in Group Means
mean(Sleep ~ G21, data = Data3350 )
N Y
6.603774 5.898305
We can have RStudio calculate the difference. The brackets reference the specific items from the output. (Don’t worry about learning how to do this - you can just subtract the two values in R or with a calculator.)
means = mean(Sleep ~ G21, data = Data3350)
means[["N"]] - means[["Y"]]
[1] 0.7054685
2. Shuffle Observations of Dependent Variable
We’re using a permutation test on the difference in means by shuffling the observations of Sleep, randomly reassigning them to the younger group or older. Let’s use tally to create a frequency table for one possible permutation. As always, re-execute the code block several times to see what’s going on.
tally(shuffle(Sleep) ~ G21 , data = Data3350)
G21
shuffle(Sleep) N Y
0.5 1 0
1 1 0
2.5 1 3
3 8 4
3.5 3 5
4 4 2
4.5 2 0
5 9 3
5.5 4 5
6 14 5
6.5 9 4
7 13 5
7.5 9 9
8 11 2
8.5 6 6
9 2 2
9.5 4 2
10 3 2
10.5 1 0
11 1 0
The first line creates a data frame with the means from the shuffled data for both the “N” column and the “Y” column (“younger than 21” and “21 or older” groups respectively). The second line adds a third column to the shuf data frame called “diff” to indicate “difference in means.”
shuf = do(1000) * mean(shuffle(Sleep) ~ G21 , data = Data3350)
shuf$diff = shuf$N - shuf$Y
shuf
3. Estimate \(p\)-value
We need to count the number of shuffled mean differences that were greater than .705, because we are testing the hypothesis \[H_0 : \mu_Y = \mu_D\] \[H_a : \mu_Y > \mu_D\]
We can use the sum function to perform counts by sending it a logical expression.
sum(shuf$diff > .705)
[1] 13
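Since we had 1,000 shuffled mean differences, and 13 of them were greater than the observed mean difference in the run shown above, we estimate that the \(p\)-value is \[p=\frac{13}{1000}=0.013\] Because the shuffles are random, the count changes slightly each time the code is re-executed.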
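The same shuffle-and-count logic can be written without mosaic. The sketch below swaps in base R's sample and replicate for shuffle and do, and runs on made-up sleep values (not from Data3350) so it is self-contained; the seed is arbitrary and only makes the run repeatable.

```r
set.seed(3350)  # arbitrary seed so the sketch is reproducible

# Made-up hours of sleep for two groups (illustrative only)
sleep <- c(7.5, 6, 8, 7, 6.5, 9, 7, 8.5, 5, 6, 5.5, 7, 4.5, 6, 5)
group <- c(rep("N", 8), rep("Y", 7))

# Observed difference in group means (N minus Y)
obs_diff <- mean(sleep[group == "N"]) - mean(sleep[group == "Y"])

# Shuffle the sleep values 1000 times, recomputing the difference each time
shuffled_diffs <- replicate(1000, {
  s <- sample(sleep)  # random permutation of the observations
  mean(s[group == "N"]) - mean(s[group == "Y"])
})

# Estimated one-tailed p-value: share of shuffles exceeding the observed difference
p_est <- sum(shuffled_diffs > obs_diff) / length(shuffled_diffs)
p_est
```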
VII. Quick Hits: Independent Samples \(t\)-tests
There are dozens of possibilities within the Data3350 to run independent samples \(t\)-tests using 2-level grouping variables such as biological sex or the G21 variable. Not all hypothesis tests find a difference between the variables. The data set is large enough that the verifications will all work out. You should try several. I have included a few different examples below.
Do females and males experience different levels of OCD?
t.test(OCD ~ Sex, data = Data3350)
Welch Two Sample t-test
data: OCD by Sex
t = 0.74989, df = 125.11, p-value = 0.4547
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.9136994 2.0285250
sample estimates:
mean in group F mean in group M
9.651163 9.093750
We use a two-tailed test here because there is no reason to expect higher levels in either females or males. A two-tailed test is less sensitive, but it can detect a significant difference in either direction, which suits more exploratory situations. We fail to reject the null because \(p=0.45\), a value far too large to suggest the null is incorrect. There is no evidence to suggest a difference in OCD indicators based upon biological sex.
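The loss of sensitivity comes from counting both tails of the \(t\) distribution. A short sketch that recomputes the \(p\)-value from the \(t\) and df values printed in the OCD output, using base R's pt function:

```r
# Test statistic and degrees of freedom copied from the OCD output above
t_stat <- 0.74989
df     <- 125.11

# Two-tailed p-value: area in both tails beyond |t|
p_two <- 2 * pt(-abs(t_stat), df)

# One-tailed p-value (alternative = "greater"): area in the upper tail only
p_one <- pt(t_stat, df, lower.tail = FALSE)

round(c(two_tailed = p_two, one_tailed = p_one), 4)
```

The two-tailed value is exactly double the one-tailed value, so the same data need a larger \(|t|\) to reach significance in a two-tailed test.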
Are younger students more playful?
t.test(Play ~ G21, data = Data3350,
alternative = "greater")
Welch Two Sample t-test
data: Play by G21
t = 0.55889, df = 116.77, p-value = 0.2887
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-3.343707 Inf
sample estimates:
mean in group N mean in group Y
136.9681 135.2679
Even using the more sensitive one-tailed hypothesis, we find no evidence that younger students are more playful \((p=0.29)\).
Do males at North Georgia use more coping humor than females?
t.test(CHS ~ Sex, data = Data3350,
alternative = "less")
Welch Two Sample t-test
data: CHS by Sex
t = -3.534, df = 160.6, p-value = 0.0002674
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -1.31639
sample estimates:
mean in group F mean in group M
23.86316 26.33824
We find a stark difference in the use of Coping Humor at North Georgia based upon biological sex, with males scoring higher than females \((p = 0.0003)\).
VIII. ANOVA
For any grouping variable with three or more levels, we must use ANOVA. The \(t\)-test comparisons, while robust and useful, can only be used for the one-sample and two-sample cases. Ronald Fisher, along with pioneering the use of the null hypothesis, also developed ANOVA, which describes a group of methods, not a single test.
ANOVA is our first two-step procedure. If we reject the null in an ANOVA, we must follow it up with a post hoc test to ferret out the differences between the groups.
Do Levels of Neuroticism Depend upon Primary Humor Style?
Primary Humor Styles are defined based upon results from the Humor Style Questionnaire. The four primary humor styles in our data frame Data3350 are Affiliative (HSAF), Aggressive (HSAG), Self-Enhancing (HSSE) and Self-Defeating (HSSD). The variable Neuro measures the intensity of neurotic personalities. Test whether the variable Neuro differs based upon Primary Humor Style.
For ANOVA the null hypothesis is that all four samples were drawn from identical distributions, so the group means would be equal. \[\begin{align*}H_0 &: \mu_{AF}=\mu_{AG}=\mu_{SE}=\mu_{SD}\\ H_a &: \text{At least one is different.}\end{align*}\] The alternative hypothesis is not that all the group means are unequal. The logical opposite of all group means being equal is that at least one is different than the rest. Saying this alternative in words is fine, though some do prefer to use symbols: \[H_a : \mu_i \neq \mu_j \hspace{2mm}\text{for some }1 \leq i,j \leq 4\]
We will create a linear model using the function lm. ANOVA allows the standard statistical modeling notation we have been using: \[\text{Neuro} \sim \text{PHS}\] which indicates that Neuro is the dependent variable we are analyzing using PHS as the grouping variable (read “Neuro by PHS”). We will save the model as mod (short for model) so we can reuse the model output for ANOVA and any post hoc testing needed.
Let’s scan the summary statistics for the groups as we consider verification.
favstats(Neuro ~ PHS, data = Data3350)
The data are appropriate for ANOVA procedures because there are certainly far more than 20 total observations, and the largest to smallest group size ratio is \(43 : 31\) which reduces to \(1.39 : 1\). The standard deviations do appear to pair up into two different ranges, but since we do not have sharply unequal group sizes, we should have no problem with lack of homogeneity of the variances.
Using an \(\alpha = 0.05\) level of significance, we run the ANOVA app.
mod = lm(Neuro ~ PHS , data = Data3350)
anova(mod)
Analysis of Variance Table
Response: Neuro
Df Sum Sq Mean Sq F value Pr(>F)
PHS 3 5145.1 1715.03 10.73 2.158e-06 ***
Residuals 140 22375.9 159.83
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Since \(p = 0.000002158 < 0.05 = \alpha\), we reject the null. We have very strong evidence for a difference in at least some of the group means, but we don’t know for sure where those differences lie.
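The \(F\) value and \(p\)-value in the table follow directly from the sums of squares and degrees of freedom. A brief sketch that redoes the arithmetic from the printed (rounded) table entries, using base R's pf function; the variable names are made up for illustration:

```r
# Values copied from the ANOVA table above
ss_phs   <- 5145.1
df_phs   <- 3
ss_resid <- 22375.9
df_resid <- 140

# Mean squares are sums of squares divided by their degrees of freedom
ms_phs   <- ss_phs / df_phs
ms_resid <- ss_resid / df_resid

# F is the ratio of the mean squares; the p-value is the upper tail of F
f_stat <- ms_phs / ms_resid
p_val  <- pf(f_stat, df_phs, df_resid, lower.tail = FALSE)

c(F = f_stat, p = p_val)
```

Because the inputs are rounded, the results match the table only to the digits shown there.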
Tukey Post Hoc Testing
Since we have rejected the null, we need to run a post hoc test. In this class, we will use the most common of several post hoc options, Tukey’s HSD. HSD stands for “honestly significant difference.”
TukeyHSD(mod, conf.level = 0.95)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = x)
$PHS
diff lwr upr p adj
AG-AF 4.978245 -2.7668457 12.723335 0.3427512
SD-AF 12.000872 3.9970309 20.004713 0.0008530
SE-AF -4.282502 -12.5045100 3.939505 0.5301145
SD-AG 7.022627 -0.3485287 14.393783 0.0679484
SE-AG -9.260747 -16.8682354 -1.653259 0.0101553
SE-SD -16.283374 -24.1541382 -8.412610 0.0000018
The Tukey post hoc test does all pairwise comparisons of group means and determines which pairs are significantly different. Here we have four choose two comparisons to make: \(\binom{4}{2}=6\).
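The count of comparisons, and the pairs themselves, can be listed with base R's choose and combn. A minimal sketch using the group labels from the Tukey output:

```r
# Group labels as they appear in the Tukey output
styles <- c("AF", "AG", "SD", "SE")

# Number of pairwise comparisons: 4 choose 2
choose(length(styles), 2)

# The pairs themselves, one per column
pairs <- combn(styles, 2)
pairs
```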
There are two different ways to tell if the group means differ significantly. We can check the final column and look for \(p\)-values that are less than \(\alpha\). Or we can inspect the confidence intervals which are estimating the difference between group means. If the groups are significantly different, then the confidence interval will not include zero. We can inspect the Tukey HSD table for confidence intervals whose “lwr” and “upr” endpoints are both the same sign, but an mplot of the TukeyHSD output will produce a graph that shows everything quite clearly.
mplot(TukeyHSD(mod, conf.level = 0.95))

Either way we find the following three significant differences:
\[\begin{array}{ccclcc}
\textbf{Comparison} && \textbf{Significance} && \textbf{Pattern}\\ \hline
\text{HSSD vs. HSAF} && p = 0.0008530 && \text{SD} > \text{AF}\\
\text{HSSE vs. HSAG} && p = 0.0101553 && \text{SE} < \text{AG}\\
\text{HSSE vs. HSSD} && p = 0.0000018 && \text{SE} < \text{SD}
\end{array}\]
Evidence from the confidence intervals suggests that subjects with self-defeating primary humor styles scored significantly higher on the neuroticism measure than those with affiliative. Subjects with self-enhancing primary humor styles scored significantly lower on neuroticism than those with aggressive or self-defeating primary humor styles.
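The two ways of reading the Tukey table can be compared side by side. In this sketch the rows of the TukeyHSD output for Neuro are re-entered by hand into a data frame (named `tukey` here for illustration), and each comparison is flagged by both criteria:

```r
# Rows copied from the TukeyHSD output for Neuro ~ PHS
tukey <- data.frame(
  pair = c("AG-AF", "SD-AF", "SE-AF", "SD-AG", "SE-AG", "SE-SD"),
  lwr  = c(-2.7668457, 3.9970309, -12.5045100,
           -0.3485287, -16.8682354, -24.1541382),
  upr  = c(12.723335, 20.004713, 3.939505,
           14.393783, -1.653259, -8.412610),
  p    = c(0.3427512, 0.0008530, 0.5301145,
           0.0679484, 0.0101553, 0.0000018)
)

alpha <- 0.05

# Criterion 1: adjusted p-value below alpha
tukey$sig_by_p <- tukey$p < alpha

# Criterion 2: confidence interval excludes zero (endpoints share a sign)
tukey$sig_by_ci <- tukey$lwr * tukey$upr > 0

# Comparisons flagged as significant
tukey[tukey$sig_by_p, "pair"]
```

Both criteria flag the same three comparisons, since the intervals use a 95% family-wise confidence level that matches \(\alpha = 0.05\).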
IX. ANOVA Quick Hits
This module describes methods that are at the heart of quantitative inquiry. As with the \(t\)-tests, a group of examples with minimal details are included to provide more breadth of experience with these powerful and important analytic tools.
Does GPA affect Seating Preference in Class?
The null hypothesis is that the average GPA is identical in all three seating preference groups: Front, Middle and Back. The pattern of means seen in the summary stats shows a slightly higher GPA for students preferring Front (3.33) compared to Back (3.25) and Middle (3.20). Yet, the differences are not statistically significant at the default \(\alpha = 0.05\) level \((p = 0.09)\). The data are appropriate for ANOVA methods, and the sample size and statistical design are sufficiently sensitive. We thus find no evidence that GPA and seating preference are related. Since we fail to reject the null, no post hoc Tukey is needed.
mod2 = lm(GPA ~ SitClass, data = Data3350)
favstats(GPA ~ SitClass, data = Data3350)
anova(mod2)
Analysis of Variance Table
Response: GPA
Df Sum Sq Mean Sq F value Pr(>F)
SitClass 2 1.228 0.61405 2.4348 0.09081 .
Residuals 162 40.856 0.25220
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Do Levels of Optimism Differ Based on Primary Humor Style?
Because Self-Enhancing humor, one of the four humor styles, is related to coping humor, we have reason to believe a connection might exist. The data are appropriate for ANOVA, and we reject the null \((p=0.00185)\) indicating a post hoc Tukey HSD procedure is needed.
mod3 = lm(Opt ~ PHS, data = Data3350)
favstats(Opt ~ PHS, data = Data3350)
anova(mod3)
Analysis of Variance Table
Response: Opt
Df Sum Sq Mean Sq F value Pr(>F)
PHS 3 281.37 93.790 5.2439 0.00185 **
Residuals 140 2503.96 17.885
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
TukeyHSD(mod3)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = x)
$PHS
diff lwr upr p adj
AG-AF -1.8349587 -4.4258527 0.7559352 0.2584250
SD-AF -2.6608544 -5.3383058 0.0165970 0.0520958
SE-AF 0.8944282 -1.8560045 3.6448608 0.8325577
SD-AG -0.8258957 -3.2917008 1.6399095 0.8198789
SE-AG 2.7293869 0.1845237 5.2742501 0.0303588
SE-SD 3.5552826 0.9223482 6.1882169 0.0033328
mplot(TukeyHSD(mod3))

We find two of the pairwise differences significant: those with self-enhancing primary humor styles score higher for optimism compared to those with aggressive or self-defeating primary humor styles.
No Clue what this Means
While the \(p\)-value is just slightly larger than the default \(\alpha\), the near-relationship between these variables is odd. Is there a pattern lurking here? I don’t know, but it would be interesting to see a follow-on study to see if a pattern emerges.
mod4 = lm(TxRel ~ Friends, data = Data3350)
favstats(TxRel ~ Friends, data = Data3350)
anova(mod4)
Analysis of Variance Table
Response: TxRel
Df Sum Sq Mean Sq F value Pr(>F)
Friends 2 135.6 67.788 3.0059 0.05228 .
Residuals 161 3630.8 22.551
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
mplot(TukeyHSD(mod4, conf.level = .9))

Using a very liberal \(\alpha =0.1\), we find that participants who said they made friends most easily with either sex (no difference) scored higher on the Toxic Relationship Beliefs scale than did those who said they made friends most easily with those of the same biological sex.
To repeat, there is no reason to use such a high value for the level of significance, so there is no significant pattern in this data, only a vague hint of a pattern that may possibly exist. This is how science works. Could an exploratory study be created to explore the connections between how we make friends and our relationship beliefs? Yes, most certainly. Would it yield results? Yes, because even if we do the study and fail to find any significant connections (fail to reject the null), we then have some confirmation that, perhaps, no connections between these variables are actually there.
Pondering a scientific approach to exploring this question is a perfect mental exercise to explore what it means to do statistics-based inquiry. As Fisher pointed out, we can never prove the null hypothesis, only falsify it. Thus, even two or three validation studies where we failed to reject the null may not provide enough evidence that no relationship exists.
Do Narcissism Levels Vary Based upon Primary Humor Style?
The short answer is “Yes, they do,” \((p=0.008735)\).
mod5 = lm(Narc ~ PHS, data = Data3350)
favstats(Narc ~ PHS, data = Data3350)
anova(mod5)
Analysis of Variance Table
Response: Narc
Df Sum Sq Mean Sq F value Pr(>F)
PHS 3 85.83 28.6109 4.0432 0.008735 **
Residuals 129 912.84 7.0763
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
To find the pairwise differences, a TukeyHSD is required.
TukeyHSD(mod5)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = x)
$PHS
diff lwr upr p adj
AG-AF 1.7286751 0.02142619 3.4359241 0.0459844
SD-AF -0.2630542 -2.00168438 1.4755760 0.9792027
SE-AF 0.7719689 -1.01676625 2.5607040 0.6758555
SD-AG -1.9917293 -3.61386008 -0.3695986 0.0093827
SE-AG -0.9567063 -2.63242897 0.7190164 0.4488839
SE-SD 1.0350230 -0.67266033 2.7427064 0.3950070
mplot(TukeyHSD(mod5))

The most striking difference is that those with aggressive primary humor styles have significantly higher levels of narcissism than those with self-defeating primary humor styles \((p = 0.009)\) and those with affiliative primary humor styles \((p=0.046)\).
