library(mosaic)

Part 1

I will analyze two variables: Anx (Anxiety) and Opt (Optimism).

Histogram for Anx(Anxiety):

histogram(Data3350$Anx,
          width = 1,
          type = "count",
          main = "Histogram: Students' Anxiety Level",
          xlab = "Anxiety Level",
          ylab = "Number of Students")

Box plot for Anx(Anxiety):

boxplot(Data3350$Anx, horizontal = TRUE,
     main = "Box plot: Students' Anxiety Level",
     xlab = "Anxiety Level")

Density plot for Anx(Anxiety):

densityplot(Data3350$Anx,
      #type = "count",
      main = "Density plot: Students' Anxiety Levels",
      xlab = "Anxiety Levels")

The students’ anxiety levels are roughly bell-shaped but skewed to the right. The histogram and density plot both show this shape well: there is a single central peak with a longer tail on the right-hand side. The box plot clearly flags an outlier beyond the upper whisker, and that outlier stretches the right tail further. The right whisker is also longer than the left one, which is consistent with a right-skewed data set. The outlier tells us that this particular student has a much higher level of anxiety than the rest of the sample. The graphs indicate that the typical student has an anxiety level in the mid-30s.

Histogram for Opt(Optimism):

histogram(Data3350$Opt,
          width = 1,
          type = "count",
          main = "Histogram: Students' Optimism Level",
          xlab = "Optimism Level",
          ylab = "Number of Students")

Box plot for Opt(Optimism):

boxplot(Data3350$Opt, horizontal = TRUE,
     main = "Box plot: Students' Optimism Level",
     xlab = "Optimism Level")

Density plot for Opt(Optimism):

densityplot(Data3350$Opt,
      main = "Density plot: Students' Optimism Levels",
      xlab = "Optimism Levels")

The students’ optimism levels are roughly bell-shaped but skewed to the left, as seen in the histogram and density plot. The box plot shows an outlier on the left-hand side of the data, below the lower whisker. The left whisker is a bit longer than the right, which is consistent with a left skew. The outlier tells us that this specific student is much less optimistic than the rest of the sample. The graphs indicate that the typical student’s optimism level is around 20.


Part 2

Independent Samples t-test of Anx (Anxiety) by age group, older vs. younger students (G21). The G21 variable indicates whether the student is 21 years old or older. So y = 21 or older and u = under 21. In the R output, these groups are labelled Y and N.

H0: μy = μu

Ha: μy ≠ μu

Verification:

favstats(~ Anx, data = Data3350)

So n, the number of students with a recorded anxiety score, is 144, and 144 ≥ 40. Because the sample is this large, no further distribution checks are required: the t-test is robust to departures from normality at this sample size, so its p-values should be accurate.

The table below shows the frequency table comparisons.

tally(Anx ~ G21, data = Data3350)
      G21
Anx     N  Y
  23    1  0
  24    2  1
  25    3  2
  26    4  1
  27    4  3
  28    5  5
  29    2  1
  30    4  4
  31    0  2
  32    6  2
  33    6  2
  34    1  1
  35    4  2
  36    5  4
  37    3  0
  38    1  1
  39    4  0
  40    2  2
  41    5  2
  42    2  1
  43    3  1
  44    3  2
  45    6  1
  46    4  2
  47    2  0
  48    1  0
  50    3  0
  51    0  2
  52    1  1
  53    3  2
  54    1  0
  57    1  0
  58    0  1
  59    2  1
  69    1  0
  <NA> 11 10

Here are the summary statistics.

favstats(Anx ~ G21, data = Data3350)

Homogeneity Check: 95 : 49 –> 1.94 : 1. The ratio between the two group sizes is less than 2 : 1. We do not have sharply unequal group sizes, so we have no issues with the homogeneity assumption. We may proceed.
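The group-size check can be reproduced with a short sketch. It assumes the two group sizes are the sums of the non-missing rows of the tally above (95 students under 21 and 49 students aged 21 or older, which together give the n of 144); the variable names are illustrative and are not part of the original analysis.

```r
# Non-missing group sizes, summed from the tally above (assumed inputs)
n_under21 <- 95   # G21 = N
n_21plus  <- 49   # G21 = Y

# Ratio of the larger group to the smaller group
size_ratio <- max(n_under21, n_21plus) / min(n_under21, n_21plus)
round(size_ratio, 2)

# Rule of thumb used in this report: the ratio should be 2 : 1 or less
size_ratio <= 2
```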

Run Test: (α = 0.05)

t.test(Anx ~ G21, data = Data3350,
       alternative  = "two.sided")

    Welch Two Sample t-test

data:  Anx by G21
t = 0.80338, df = 97.922, p-value = 0.4237
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.950232  4.603293
sample estimates:
mean in group N mean in group Y 
       38.00000        36.67347 

p = 0.4237 which is greater than α = 0.05. We fail to reject the null.
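As a cross-check, the two-sided p-value can be recovered from the reported t statistic and degrees of freedom alone. This is a minimal sketch that copies those two numbers from the output above; the variable names are illustrative.

```r
# Values copied from the Welch t-test output above
t_stat   <- 0.80338
df_welch <- 97.922
alpha    <- 0.05

# Two-sided p-value: area in both tails beyond |t|
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
round(p_value, 4)

# TRUE means we fail to reject the null at this alpha
p_value > alpha
```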

Conclusion:

The data do not provide sufficient evidence that age (whether a student is younger or older than 21) is a factor in levels of anxiety for this population of students.


Independent Samples t-test of Opt (Optimism) by biological sex. m = male, f = female.

H0: μm = μf

Ha: μm ≠ μf

Verification:

favstats(~ Opt, data = Data3350)

So n, the number of students with a recorded optimism score, is 146, and 146 ≥ 40. Because the sample is this large, no further distribution checks are required: the t-test is robust to departures from normality at this sample size, so its p-values should be accurate.

The table below shows the frequency table comparisons.

tally(Opt ~ Sex, data = Data3350)
      Sex
Opt     F  M
  6     1  0
  8     1  0
  9     1  0
  10    1  2
  12    2  0
  13    5  1
  14    2  2
  15    2  4
  16    5  4
  17    7  6
  18    7  6
  19    9  2
  20    7  5
  21    6  8
  22    5  8
  23    7  4
  24    5  1
  25    4  0
  26    3  3
  27    1  3
  28    1  2
  29    2  1
  <NA> 12  7

Here are the summary statistics.

favstats(Opt ~ Sex, data = Data3350)

Homogeneity Check: 84 : 62 –> 1.35 : 1. The ratio between the two groups is less than 2 : 1. We do not have sharply unequal group sizes, therefore we have no issues with the homogeneity assumption. We may proceed.

Run Test: (α = 0.05)

t.test(Opt ~ Sex, data = Data3350,
       alternative  = "two.sided")

    Welch Two Sample t-test

data:  Opt by Sex
t = -0.82488, df = 137.9, p-value = 0.4109
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.1003546  0.8637954
sample estimates:
mean in group F mean in group M 
       19.33333        19.95161 

p = 0.4109 which is greater than α = 0.05. We fail to reject the null.
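The 95 percent confidence interval in the output can be rebuilt from the two group means, the t statistic, and the degrees of freedom. The sketch below assumes only those reported numbers and uses the identity t = difference / standard error to back out the standard error; the variable names are illustrative.

```r
# Values copied from the Welch t-test output above
mean_f   <- 19.33333
mean_m   <- 19.95161
t_stat   <- -0.82488
df_welch <- 137.9

# R reports the difference as first group minus second group (F minus M)
diff_means <- mean_f - mean_m

# Back out the standard error from t = difference / standard error
std_error <- diff_means / t_stat

# 95% interval: difference plus or minus the critical t value times the standard error
margin <- qt(0.975, df = df_welch) * std_error
ci <- diff_means + c(-1, 1) * margin
round(ci, 3)

# The interval contains 0, which agrees with failing to reject the null
ci[1] < 0 && ci[2] > 0
```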

Conclusion:

The data do not provide sufficient evidence that biological sex is a factor in levels of optimism for this population of students.


Part 3

Chi-Square test of independence comparing the variable SitClass (a person’s seating preference in class) to whether they are older students or younger ones (G21).

H0: Seating preference is independent of students’ age

Ha: Seating preference is dependent on students’ age

Verification:

We require that there be no more than 20% low Expected cell counts where a “low cell count” is defined to be strictly less than 5.

xSqr = xchisq.test(SitClass ~ G21, data = Data3350)
xSqr$expected
        G21
SitClass        N        Y
       B 22.48485 12.51515
       F 37.26061 20.73939
       M 46.25455 25.74545

Since none of the expected cell counts are less than 5, it is OK to proceed.
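The expected counts come from the usual formula (row total × column total) / grand total. The sketch below recomputes them in base R from the observed counts that appear in the chi-squared output of this part (20, 15, 37, 21, 49, 23); the object names are illustrative.

```r
# Observed counts, copied from the chi-squared output in this part
observed <- matrix(c(20, 15,
                     37, 21,
                     49, 23),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(SitClass = c("B", "F", "M"),
                                   G21 = c("N", "Y")))

# Expected count for each cell: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 5)

# Share of cells with an expected count strictly below 5 (must be at most 20%)
low_share <- mean(expected < 5)
low_share <= 0.20
```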

Mosaic Plot

mosaicplot(SitClass ~ G21 , data = Data3350, 
           color = TRUE,
           main = "Mosaic Plot: SitClass by Age")

B, F, and M refer to the Back, Front, and Middle of the classroom. Since the Back is narrower than the Front, which is narrower than the Middle, more participants overall choose to sit in the Middle than in the Front, and the Back has the fewest responses of the three choices. In the 21 or older category (Y), the responses seem to be distributed approximately evenly among B, F, and M. However, for the under 21 group (N), a larger portion of students seem to choose M and F over B. This plot indicates that an age-based difference may exist. However, a statistical analysis (Chi-squared) is needed to test whether it is statistically significant.

Run Test: α = 0.05

xSqr = xchisq.test(SitClass ~ G21, data = Data3350)

    Pearson's Chi-squared test

data:  x
X-squared = 1.2288, df = 2, p-value = 0.541

   20       15   
(22.48)  (12.52) 
[0.2746] [0.4934]
<-0.524> < 0.702>
   
   37       21   
(37.26)  (20.74) 
[0.0018] [0.0033]
<-0.043> < 0.057>
   
   49       23   
(46.25)  (25.75) 
[0.1630] [0.2928]
< 0.404> <-0.541>
   
key:
    observed
    (expected)
    [contribution to X-squared]
    <Pearson residual>

The Chi-Squared value is 1.2288, and the p-value is 0.541, which is greater than α = 0.05. Therefore we fail to reject the null.
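The same test can be run in base R directly on the observed counts, which is a convenient way to confirm the statistic and the comparison with α. This is a sketch using `chisq.test` from the stats package rather than the `xchisq.test` call used above; the object names are illustrative.

```r
# Observed counts, copied from the output above
observed <- matrix(c(20, 15,
                     37, 21,
                     49, 23),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(SitClass = c("B", "F", "M"),
                                   G21 = c("N", "Y")))
alpha <- 0.05

# Pearson chi-squared test of independence (no continuity correction for a 3 x 2 table)
result <- chisq.test(observed)
round(unname(result$statistic), 4)
round(result$p.value, 3)

# TRUE means the p-value is greater than alpha, so we fail to reject the null
result$p.value > alpha
```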

Conclusion:

The data do not provide sufficient evidence that where a student sits in class (SitClass) depends on the student’s age group (G21).


Part 4

ANOVA Test

I will be conducting an ANOVA test using the variable Play (adult playfulness) and the grouping variable PHS (primary humor style).

H0: μAF = μAG = μSE = μSD

Ha: At least one is different

Verification

favstats(Play ~ PHS, data = Data3350)

The overall sample size is 135, which is more than 20. The ratio of the largest to the smallest group size should be 2 : 1 or less; here it is 39 : 29 –> 1.34 : 1, so we do not have sharply unequal group sizes. The data are appropriate for ANOVA procedures, the p-values should be accurate, and we may proceed.

Run Test α = 0.05

ANOVA

mod = lm(Play ~ PHS , data = Data3350)
anova(mod)
Analysis of Variance Table

Response: Play
           Df Sum Sq Mean Sq F value  Pr(>F)   
PHS         3   5242 1747.30  5.6889 0.00108 **
Residuals 131  40236  307.14                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Since p = 0.00108 is less than α = 0.05, we reject the null. We have strong evidence for a difference in at least some of the group means, but we don’t know for sure where those differences lie. Since we have rejected the null, we need to run a post hoc Tukey HSD test.
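The F value and its p-value follow from the mean squares in the ANOVA table: F is the between-group mean square divided by the residual mean square. A minimal sketch, using only the numbers printed in the table above (the variable names are illustrative):

```r
# Values copied from the ANOVA table above
ms_groups <- 1747.30   # Mean Sq for PHS
ms_resid  <- 307.14    # Mean Sq for Residuals
df_groups <- 3
df_resid  <- 131
alpha     <- 0.05

# F statistic: ratio of between-group to within-group variability
f_stat <- ms_groups / ms_resid
round(f_stat, 4)

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_groups, df_resid, lower.tail = FALSE)
round(p_value, 5)

# TRUE means we reject the null at this alpha
p_value < alpha
```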

post hoc Tukey HSD

TukeyHSD(mod, conf.level = 0.95)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = x)

$PHS
             diff         lwr       upr     p adj
AG-AF -14.5022104 -25.6850758 -3.319345 0.0053138
SD-AF -14.1475096 -25.5273552 -2.767664 0.0082794
SE-AF  -4.0489433 -15.8311251  7.733239 0.8078253
SD-AG   0.3547009 -10.1861828 10.895584 0.9997582
SE-AG  10.4532672  -0.5207544 21.427289 0.0679879
SE-SD  10.0985663  -1.0761175 21.273250 0.0918200

mplot

mplot(TukeyHSD(mod, conf.level = 0.95))

The mplot makes it easier to inspect the confidence intervals. We find two significant intervals:

AG-AF –> Significance: p = 0.0053138 –> AG < AF

SD-AF –> Significance: p = 0.0082794 –> SD < AF
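The significant pairs can also be picked out programmatically. The sketch below re-enters the Tukey table above as a data frame (the data frame and its column names are illustrative) and flags comparisons whose adjusted p-value is below 0.05, which should be the same comparisons whose confidence interval excludes zero.

```r
# Tukey HSD results, copied from the output above
tukey <- data.frame(
  comparison = c("AG-AF", "SD-AF", "SE-AF", "SD-AG", "SE-AG", "SE-SD"),
  diff  = c(-14.5022104, -14.1475096, -4.0489433, 0.3547009, 10.4532672, 10.0985663),
  lwr   = c(-25.6850758, -25.5273552, -15.8311251, -10.1861828, -0.5207544, -1.0761175),
  upr   = c(-3.319345, -2.767664, 7.733239, 10.895584, 21.427289, 21.273250),
  p_adj = c(0.0053138, 0.0082794, 0.8078253, 0.9997582, 0.0679879, 0.0918200),
  stringsAsFactors = FALSE
)

# An interval that lies entirely above or entirely below 0 indicates a significant difference
tukey$excludes_zero <- tukey$lwr > 0 | tukey$upr < 0

# Comparisons significant at the 0.05 level
significant <- tukey[tukey$p_adj < 0.05, ]
significant[, c("comparison", "diff", "p_adj")]
```

A negative `diff` for a pair such as AG-AF means the first group named has the lower mean.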

Conclusion

Evidence suggests that subjects with aggressive humor traits (AG) scored significantly lower on the adult playfulness measure than those with affiliative humor traits (AF). Subjects with self-defeating (SD) humor traits also scored significantly lower on the adult playfulness measure than those with affiliative humor traits (AF).


Part 5

Verifying Linearity Assumption

xyplot of Neuroticism vs. Thrill Seeking

xyplot(Thrill ~ Neuro, data = Data3350 , type = c("p","r"),
       main = "Neuroticism vs. Thrill Seeking",
       xlab = "Neuroticism",
       ylab = "Thrill Seeking")

The pattern in the scatter plot provides evidence of a linear pattern between the variables Neuroticism and Thrill Seeking. As the Neuroticism score increases, generally the Thrill Seeking scores decrease.

Linear model for Neuroticism vs. Thrill Seeking

lm(Thrill ~ Neuro, data = Data3350)

Call:
lm(formula = Thrill ~ Neuro, data = Data3350)

Coefficients:
(Intercept)        Neuro  
    24.2539      -0.1102  

Create an object called mod to store the model:

mod = lm(Thrill ~ Neuro, data = Data3350)

Verifying Normality Assumption

We will evaluate the normality assumption which says that the residuals should be normally distributed.

Histogram of the residuals.

histogram(~ resid(mod))

The histogram of the residuals is roughly bell-shaped with a slight skew to the left.

Q-Q plot to test normality assumption

qqmath( ~ resid(mod))

With the Q-Q plot, we should see a straight line if the residuals meet the model assumptions. There don’t seem to be any sharp drop-offs or any points that jump too far out of line. Therefore, the data appear to meet the normality assumption.

Analysis Statements for Linear Regression

summary(mod)

Call:
lm(formula = Thrill ~ Neuro, data = Data3350)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.7262  -3.8521   0.5888   3.8250  11.2581 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 24.25385    1.44049  16.837  < 2e-16 ***
Neuro       -0.11024    0.03201  -3.444 0.000753 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.304 on 143 degrees of freedom
  (20 observations deleted due to missingness)
Multiple R-squared:  0.07657,   Adjusted R-squared:  0.07012 
F-statistic: 11.86 on 1 and 143 DF,  p-value: 0.0007535

The \(R^{2}\) value is 0.07657, and \(\sqrt{0.07657}\) ≈ 0.2767. Because the slope is negative, r takes the negative root: r ≈ -0.2767.
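In simple linear regression, the correlation has the same sign as the slope, so r is the square root of \(R^{2}\) with the slope’s sign attached. A minimal sketch using the values from the summary output above (the variable names are illustrative):

```r
# Values copied from summary(mod) above
r_squared <- 0.07657
slope     <- -0.11024

# r has the magnitude sqrt(R^2) and the sign of the slope
r <- sign(slope) * sqrt(r_squared)
round(r, 4)
```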

In this example, we find that there is a weak, negative correlation between Neuroticism and Thrill Seeking.

Analyzing \(R^{2}\)

\(R^{2}\) is the coefficient of determination. Since \(R^{2}\) = 0.07657, about 7.66% of the variance in Thrill Seeking scores is accounted for by Neuroticism scores.

Prediction Equation

f = makeFun(mod)
f
function (Neuro, ..., transformation = function (x) 
x) 
return(transformation(predict(model, newdata = data.frame(Neuro = Neuro), 
    ...)))
<environment: 0x7f943c9725e0>
attr(,"coefficients")
(Intercept)       Neuro 
 24.2538528  -0.1102396 

Equation: f(x) = 24.254 - 0.1102x, where x is the Neuroticism score and f(x) is the predicted Thrill Seeking score.

We can estimate the level of Neuroticism that corresponds to a Thrill Seeking score of 21 by setting f(x) equal to 21 and solving for x.

If someone has a Thrill Seeking score of 21, we expect their Neuroticism score to be about 29.5 ≈ 30.
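The calculation can be sketched in R using the coefficients printed by makeFun above. Note that the model predicts Thrill Seeking from Neuroticism, so solving the line backwards is only a rough algebraic estimate, not a fitted prediction of Neuroticism; the function and variable names here are illustrative.

```r
# Coefficients copied from the makeFun output above
intercept <- 24.2538528
slope     <- -0.1102396

# Prediction equation: predicted Thrill Seeking for a given Neuroticism score
predict_thrill <- function(neuro) intercept + slope * neuro

# Solve 21 = intercept + slope * x for x
target_thrill   <- 21
neuro_at_target <- (target_thrill - intercept) / slope
round(neuro_at_target, 1)

# Plugging the solution back in returns the target score
predict_thrill(neuro_at_target)
```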

Conclusion

We verified the linearity and normality assumptions, constructed a linear model, and analyzed three of the model outputs: the correlation, \(R^{2}\), and the slope coefficient. Evidence suggests that there is a weak, negative correlation and that Neuroticism appears to account for about 7.66% of the variance in Thrill Seeking. We also understand from the slope how changes in the predictor variable influence the dependent variable.

