Part 1
I will analyze two variables: Anx (Anxiety) and Opt (Optimism).
Histogram for Anx(Anxiety):
histogram(Data3350$Anx,
width = 1,
type = "count",
main = "Histogram: Students' Anxiety Level",
xlab = "Anxiety Level",
ylab = "Number of Students")

Box plot for Anx(Anxiety):
boxplot(Data3350$Anx, horizontal = TRUE,
main = "Box plot: Students' Anxiety Level",
xlab = "Anxiety Level")

Density plot for Anx(Anxiety):
densityplot(Data3350$Anx,
#type = "count",
main = "Density plot: Students' Anxiety Levels",
xlab = "Anxiety Levels")

The distribution of students’ anxiety levels is roughly bell-shaped with a skew to the right. The histogram and density plot both show this shape well: there is a bell-shaped curve with a longer tail on the right-hand side. Part of that longer tail is due to an outlier in the data set, which the box plot clearly shows beyond the upper whisker. Even apart from the outlier, the right whisker is longer than the left whisker, which also indicates a right-skewed data set. The outlier tells us that one particular student has a much higher level of anxiety than the typical student. The graphs indicate that the average student has an anxiety level in the mid-30s.
Histogram for Opt(Optimism):
histogram(Data3350$Opt,
width = 1,
type = "count",
main = "Histogram: Students' Optimism Level",
xlab = "Optimism Level",
ylab = "Number of Students")

Box plot for Opt(Optimism):
boxplot(Data3350$Opt, horizontal = TRUE,
main = "Box plot: Students' Optimism Level",
xlab = "Optimism Level")

Density plot for Opt(Optimism):
densityplot(Data3350$Opt,
main = "Density plot: Students' Optimism Levels",
xlab = "Optimism Levels")

The distribution of students’ optimism levels is roughly bell-shaped, as seen in the histogram and density plot, with a skew to the left. The box plot shows an outlier on the left-hand side of the data, below the lower whisker. The left whisker is also a bit longer than the right, indicating a left skew. The outlier tells us that one specific student is much less optimistic than the typical student. The graphs indicate that the average student’s optimism level is around 20.
Part 2
Independent Samples t-test of Anx (Anxiety) by age group, older vs. younger students (G21). The G21 variable indicates whether the student is 21 years old or older. In the hypotheses below, y denotes students aged 21 or older (G21 = Y) and u denotes students under 21 (G21 = N).
H0: μy = μu
Ha: μy ≠ μu
Verification:
favstats(~ Anx, data = Data3350)
So, n (the number of students with a recorded anxiety score) is 144, and 144 ≥ 40. Because the sample size is large, no further data checks are required. The t-test is robust at this sample size, so its p-values should be reliable.
The table below shows the frequency table comparisons.
tally(Anx ~ G21, data = Data3350)
G21
Anx N Y
23 1 0
24 2 1
25 3 2
26 4 1
27 4 3
28 5 5
29 2 1
30 4 4
31 0 2
32 6 2
33 6 2
34 1 1
35 4 2
36 5 4
37 3 0
38 1 1
39 4 0
40 2 2
41 5 2
42 2 1
43 3 1
44 3 2
45 6 1
46 4 2
47 2 0
48 1 0
50 3 0
51 0 2
52 1 1
53 3 2
54 1 0
57 1 0
58 0 1
59 2 1
69 1 0
<NA> 11 10
Here are the summary statistics.
favstats(Anx ~ G21, data = Data3350)
Homogeneity Check: 95 : 49 –> 1.94 : 1. The ratio between the two group sizes is less than 2 : 1. We do not have sharply unequal group sizes, so we have no issues with the homogeneity assumption. We may proceed.
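The group sizes behind this ratio can be recovered from the frequency table itself. The sketch below re-enters the N and Y columns of the tally by hand (excluding the `<NA>` row) and recomputes the group sizes, their ratio, and the group means. The vector names are illustrative and not part of the original analysis.

```r
# Anxiety scores that appear as rows in the tally (the <NA> row is excluded).
scores <- c(23:48, 50:54, 57:59, 69)

# Counts copied by hand from the tally(Anx ~ G21) table.
n_under21 <- c(1, 2, 3, 4, 4, 5, 2, 4, 0, 6, 6, 1, 4, 5, 3, 1, 4, 2, 5, 2,
               3, 3, 6, 4, 2, 1, 3, 0, 1, 3, 1, 1, 0, 2, 1)   # column N
n_21plus  <- c(0, 1, 2, 1, 3, 5, 1, 4, 2, 2, 2, 1, 2, 4, 0, 1, 0, 2, 2, 1,
               1, 2, 1, 2, 0, 0, 0, 2, 1, 2, 0, 0, 1, 1, 0)   # column Y

# Group sizes and the larger-to-smaller ratio used in the homogeneity check.
group_sizes <- c(N = sum(n_under21), Y = sum(n_21plus))
size_ratio  <- max(group_sizes) / min(group_sizes)

# Group means as frequency-weighted averages of the scores.
mean_under21 <- weighted.mean(scores, n_under21)
mean_21plus  <- weighted.mean(scores, n_21plus)

print(group_sizes)           # N = 95, Y = 49
print(round(size_ratio, 2))  # 1.94
print(round(c(mean_under21, mean_21plus), 2))
```

The recomputed means should agree with the sample estimates reported by `t.test()`.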
Run Test: (α = 0.05)
t.test(Anx ~ G21, data = Data3350,
alternative = "two.sided")
Welch Two Sample t-test
data: Anx by G21
t = 0.80338, df = 97.922, p-value = 0.4237
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.950232 4.603293
sample estimates:
mean in group N mean in group Y
38.00000 36.67347
p = 0.4237 which is greater than α = 0.05. We fail to reject the null.
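As a check on the decision, the two-sided p-value can be recomputed from the reported t statistic and Welch degrees of freedom alone. This is a minimal sketch using base R's `pt()`; the variable names are illustrative.

```r
# Values copied from the Welch Two Sample t-test output above.
t_stat   <- 0.80338
df_welch <- 97.922
alpha    <- 0.05

# Two-sided p-value: twice the lower-tail area below -|t|.
p_two_sided <- 2 * pt(-abs(t_stat), df = df_welch)

# Reject the null only when the p-value falls below alpha.
reject_null <- p_two_sided < alpha

print(round(p_two_sided, 4))
print(reject_null)
```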
Conclusion:
We do not have sufficient evidence that mean anxiety levels differ between students under 21 and students 21 or older in this population.
Independent Samples t-test of Opt(Optimism) and biological sex. m = male, f = female
H0: μm = μf
Ha: μm ≠ μf
Verification:
favstats(~ Opt, data = Data3350)
So, n (the number of students with a recorded optimism score) is 146, and 146 ≥ 40. Because the sample size is large, no further data checks are required. The t-test is robust at this sample size, so its p-values should be reliable.
The table below shows the frequency table comparisons.
tally(Opt ~ Sex, data = Data3350)
Sex
Opt F M
6 1 0
8 1 0
9 1 0
10 1 2
12 2 0
13 5 1
14 2 2
15 2 4
16 5 4
17 7 6
18 7 6
19 9 2
20 7 5
21 6 8
22 5 8
23 7 4
24 5 1
25 4 0
26 3 3
27 1 3
28 1 2
29 2 1
<NA> 12 7
Here are the summary statistics.
favstats(Opt ~ Sex, data = Data3350)
Homogeneity Check: 84 : 62 –> 1.35 : 1. The ratio between the two groups is less than 2 : 1. We do not have sharply unequal group sizes, therefore we have no issues with the homogeneity assumption. We may proceed.
Run Test: (α = 0.05)
t.test(Opt ~ Sex, data = Data3350,
alternative = "two.sided")
Welch Two Sample t-test
data: Opt by Sex
t = -0.82488, df = 137.9, p-value = 0.4109
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.1003546 0.8637954
sample estimates:
mean in group F mean in group M
19.33333 19.95161
p = 0.4109 which is greater than α = 0.05. We fail to reject the null.
Conclusion:
We do not have sufficient evidence that mean optimism levels differ between male and female students in this population.
Part 3
Chi-Square test of independence comparing the variable SitClass (a person’s seating preference in class) to whether they are older students or younger ones (G21).
H0: Seating preference is independent of students’ age
Ha: Seating preference is dependent on students’ age
Verification:
We require that there be no more than 20% low Expected cell counts where a “low cell count” is defined to be strictly less than 5.
xSqr = xchisq.test(SitClass ~ G21, data = Data3350)
xSqr$expected
G21
SitClass N Y
B 22.48485 12.51515
F 37.26061 20.73939
M 46.25455 25.74545
Since none of the expected cell counts is less than 5, it is OK to proceed.
Mosaic Plot
mosaicplot(SitClass ~ G21 , data = Data3350,
color = TRUE,
main = "Mosaic Plot: SitClass by Age")

B, F, and M refer to the Back, Front, and Middle of the classroom. The Back band is narrower than the Front band, which in turn is narrower than the Middle band, so overall more participants choose to sit in the Middle than in the Front, and the Back has the fewest responses of the three choices. In the 21-or-older category (Y), the responses appear to be spread approximately evenly among B, F, and M. In the under-21 group (N), however, a larger portion of students seems to choose M and F over B. The plot suggests that an age-based difference may exist, but a statistical analysis (Chi-squared) is needed to test whether it is statistically significant.
Run Test: α = 0.05
xSqr = xchisq.test(SitClass ~ G21, data = Data3350)
Pearson's Chi-squared test
data: x
X-squared = 1.2288, df = 2, p-value = 0.541
20 15
(22.48) (12.52)
[0.2746] [0.4934]
<-0.524> < 0.702>
37 21
(37.26) (20.74)
[0.0018] [0.0033]
<-0.043> < 0.057>
49 23
(46.25) (25.75)
[0.1630] [0.2928]
< 0.404> <-0.541>
key:
observed
(expected)
[contribution to X-squared]
<Pearson residual>
The Chi-Squared value is 1.2288, and the p-value is 0.541, which is greater than α = 0.05. Therefore we fail to reject the null.
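The test can be reproduced from the observed counts printed in the output, without the original data file. The sketch below re-enters those six counts by hand and uses base R's `chisq.test()` in place of `xchisq.test()`. It also checks the expected-count rule (no more than 20% of expected counts below 5). Object names are illustrative.

```r
# Observed counts copied from the xchisq.test output (rows B, F, M; columns N, Y).
observed <- matrix(c(20, 15,
                     37, 21,
                     49, 23),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(SitClass = c("B", "F", "M"),
                                   G21 = c("N", "Y")))

# Pearson chi-squared test of independence on the 3 x 2 table.
result <- chisq.test(observed)

# Share of expected counts that are strictly less than 5.
share_low <- mean(result$expected < 5)

# Decision at alpha = 0.05: reject only if the p-value is below alpha.
alpha <- 0.05
reject_null <- result$p.value < alpha

print(round(result$expected, 2))
print(share_low)
print(round(c(statistic = unname(result$statistic), p = result$p.value), 4))
print(reject_null)
```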
Conclusion:
We do not have sufficient evidence that where a student sits in class (SitClass) depends on the student’s age group (G21).
Part 4
ANOVA Test
I will be conducting an ANOVA test using the variables Play(adult playfulness) and the grouping variable PHS (primary humor style).
H0: μAF = μAG = μSE = μSD
Ha: At least one is different
Verification
favstats(Play ~ PHS, data = Data3350)
The overall sample size is 135, which is more than 20. The ratio of the largest to smallest group size should also be 2 : 1 or less; here it is 39 : 29 –> 1.34 : 1, so we do not have sharply unequal group sizes. The data are appropriate for ANOVA procedures, the p-values should be reliable, and we may proceed.
Run Test α = 0.05
ANOVA
mod = lm(Play ~ PHS , data = Data3350)
anova(mod)
Analysis of Variance Table
Response: Play
Df Sum Sq Mean Sq F value Pr(>F)
PHS 3 5242 1747.30 5.6889 0.00108 **
Residuals 131 40236 307.14
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Since p = 0.00108 is less than α = 0.05, we reject the null. We have strong evidence for a difference in at least some of the group means, but we don’t know for sure where those differences lie. Since we have rejected the null, we need to run a post hoc Tukey HSD test.
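The F value and p-value in the ANOVA table follow directly from the two mean squares and their degrees of freedom. A minimal sketch, using only numbers copied from the table above and base R's `pf()`, with illustrative variable names:

```r
# Values copied from the Analysis of Variance Table.
ms_phs   <- 1747.30   # Mean Sq for PHS
ms_resid <- 307.14    # Mean Sq for Residuals
df_phs   <- 3
df_resid <- 131

# F is the ratio of between-group to within-group mean squares.
f_stat <- ms_phs / ms_resid

# Upper-tail area of the F distribution gives the p-value.
p_value <- pf(f_stat, df1 = df_phs, df2 = df_resid, lower.tail = FALSE)

print(round(f_stat, 4))
print(signif(p_value, 3))
```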
post hoc Tukey HSD
TukeyHSD(mod, conf.level = 0.95)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = x)
$PHS
diff lwr upr p adj
AG-AF -14.5022104 -25.6850758 -3.319345 0.0053138
SD-AF -14.1475096 -25.5273552 -2.767664 0.0082794
SE-AF -4.0489433 -15.8311251 7.733239 0.8078253
SD-AG 0.3547009 -10.1861828 10.895584 0.9997582
SE-AG 10.4532672 -0.5207544 21.427289 0.0679879
SE-SD 10.0985663 -1.0761175 21.273250 0.0918200
mplot
mplot(TukeyHSD(mod, conf.level = 0.95))

The mplot makes it easier to inspect the confidence intervals. We find two significant intervals:
AG-AF –> Significance: p = 0.0053138 –> AG < AF
SD-AF –> Significance: p = 0.0082794 –> SD < AF
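The same two comparisons can be picked out programmatically: a Tukey interval is significant at the 95% family-wise level exactly when it excludes zero. The sketch below re-enters the `TukeyHSD` table by hand as a data frame (the object and column names are illustrative) and flags those rows.

```r
# Values copied from the $PHS table in the TukeyHSD output.
tukey <- data.frame(
  pair  = c("AG-AF", "SD-AF", "SE-AF", "SD-AG", "SE-AG", "SE-SD"),
  diff  = c(-14.5022104, -14.1475096, -4.0489433, 0.3547009, 10.4532672, 10.0985663),
  lwr   = c(-25.6850758, -25.5273552, -15.8311251, -10.1861828, -0.5207544, -1.0761175),
  upr   = c(-3.319345, -2.767664, 7.733239, 10.895584, 21.427289, 21.273250),
  p_adj = c(0.0053138, 0.0082794, 0.8078253, 0.9997582, 0.0679879, 0.0918200),
  stringsAsFactors = FALSE
)

# An interval that lies entirely below or entirely above zero excludes zero.
tukey$excludes_zero <- tukey$upr < 0 | tukey$lwr > 0

# The adjusted p-value should tell the same story at alpha = 0.05.
tukey$significant <- tukey$p_adj < 0.05

print(tukey[tukey$excludes_zero, c("pair", "diff", "p_adj")])
```

A negative `diff` for a pair such as AG-AF means the first group's mean is lower than the second's.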
Conclusion
Evidence suggests that subjects with aggressive humor traits (AG) scored significantly lower on the adult playfulness measure than those with affiliative humor traits (AF). Subjects with self-defeating humor traits (SD) also scored significantly lower on the adult playfulness measure than those with affiliative humor traits (AF).
Part 5
Verifying Linearity Assumption
xyplot of Neuroticism vs. Thrill Seeking
xyplot(Thrill ~ Neuro, data = Data3350 , type = c("p","r"),
main = "Neuroticism vs. Thrill Seeking",
xlab = "Neuroticism",
ylab = "Thrill Seeking")

The pattern in the scatter plot provides evidence of a linear pattern between the variables Neuroticism and Thrill Seeking. As the Neuroticism score increases, generally the Thrill Seeking scores decrease.
Linear model for Neuroticism vs. Thrill Seeking
lm(Thrill ~ Neuro, data = Data3350)
Call:
lm(formula = Thrill ~ Neuro, data = Data3350)
Coefficients:
(Intercept) Neuro
24.2539 -0.1102
Created an object called mod
mod = lm(Thrill ~ Neuro, data = Data3350)
Verifying Normality Assumption
We will evaluate the normality assumption which says that the residuals should be normally distributed.
Histogram of the residuals.
histogram (~ resid (mod))

The histogram of the residuals is roughly bell-shaped with a slight skew to the left.
Q-Q plot to test normality assumption
qqmath( ~ resid(mod))

With the Q-Q plot, we should see a roughly straight line if the residuals meet the model assumptions. There don’t seem to be any sharp drop-offs or any points that jump too far out of line. Therefore, the data appear to meet the normality assumption.
Analysis Statements for Linear Regression
summary(mod)
Call:
lm(formula = Thrill ~ Neuro, data = Data3350)
Residuals:
Min 1Q Median 3Q Max
-11.7262 -3.8521 0.5888 3.8250 11.2581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.25385 1.44049 16.837 < 2e-16 ***
Neuro -0.11024 0.03201 -3.444 0.000753 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.304 on 143 degrees of freedom
(20 observations deleted due to missingness)
Multiple R-squared: 0.07657, Adjusted R-squared: 0.07012
F-statistic: 11.86 on 1 and 143 DF, p-value: 0.0007535
The \(R^{2}\) value is 0.07657, and \(\sqrt{0.07657}\) ≈ 0.2767. Because the slope is negative, r ≈ -0.2767.
In this example, we find that there is a weak, negative correlation between Neuroticism and Thrill Seeking.
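In simple linear regression, the correlation has the same sign as the slope and a magnitude equal to the square root of \(R^{2}\). A short sketch of that calculation, using values copied from the `summary(mod)` output and illustrative variable names:

```r
# Values copied from summary(mod).
r_squared <- 0.07657
slope     <- -0.11024

# r takes the sign of the slope and the magnitude sqrt(R^2).
r <- sign(slope) * sqrt(r_squared)

print(round(r, 4))
```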
Analyzing \(R^{2}\)
\(R^{2}\) is the coefficient of determination. Since \(R^{2}\) = 0.07657, about 7.66% of the variance in Thrill Seeking scores is accounted for by Neuroticism scores.
Prediction Equation
f = makeFun(mod)
f
function (Neuro, ..., transformation = function (x)
x)
return(transformation(predict(model, newdata = data.frame(Neuro = Neuro),
...)))
<environment: 0x7f943c9725e0>
attr(,"coefficients")
(Intercept) Neuro
24.2538528 -0.1102396
Equation = f(x) = 24.254 - 0.1102x
We can also work backward from the equation: to find the Neuroticism score that corresponds to a predicted Thrill Seeking score of 21, set f(x) = 21 and solve for x.
If someone has a predicted Thrill Seeking score of 21, we expect their Neuroticism score to be about 29.5 ≈ 30.
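The arithmetic behind that value can be sketched as follows, using the coefficients printed by `makeFun(mod)`. The function `thrill_hat` is a hand-written, illustrative stand-in for `f`. Note that the model was fit to predict Thrill Seeking from Neuroticism, so solving for x is an algebraic inversion of the fitted line, not a separate regression of Neuroticism on Thrill Seeking.

```r
# Coefficients copied from the makeFun(mod) output.
b0 <- 24.2538528   # intercept
b1 <- -0.1102396   # slope for Neuro

# Fitted line: predicted Thrill Seeking for a given Neuroticism score.
thrill_hat <- function(neuro) b0 + b1 * neuro

# Solve 21 = b0 + b1 * x for x.
target_thrill <- 21
neuro_for_target <- (target_thrill - b0) / b1

print(round(neuro_for_target, 1))      # about 29.5
print(thrill_hat(neuro_for_target))    # returns the target, 21
```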
Conclusion
We verified the linearity and normality assumptions, constructed a linear model, and analyzed three of the model outputs: the correlation, \(R^{2}\), and the slope coefficient. Evidence suggests that there is a weak, negative correlation and that Neuroticism accounts for about 7.66% of the variance in Thrill Seeking. The slope also tells us how changes in the predictor variable relate to the dependent variable: each one-point increase in Neuroticism is associated with a decrease of about 0.11 points in predicted Thrill Seeking.
