Load Packages

library(tidyverse)
library(ggplot2)
library(dslabs)
library(readxl)
library(car)

Review

Q1

Import dataframe

Modality <- read_excel("Modality.xlsx")

A. Boxplots

boxplot(Modality$Final ~ Modality$Modality, xlab = "Teaching Modality", ylab = "Final Exam Score")

The f2f group appears to score higher than the other two groups.

B. Assumptions of ANOVA

1. Normality

f2f <- Modality %>% filter(Modality == "f2f")
hist(f2f$Final) #looks normal

shapiro.test(f2f$Final) #0.961 ok!
## 
##  Shapiro-Wilk normality test
## 
## data:  f2f$Final
## W = 0.982, p-value = 0.961
hyb <- Modality %>% filter(Modality == "hybrid")
shapiro.test(hyb$Final) #0.8953 ok!
## 
##  Shapiro-Wilk normality test
## 
## data:  hyb$Final
## W = 0.96966, p-value = 0.8953
hist(hyb$Final) #looks ok

onl <- Modality %>% filter(Modality == "online")
shapiro.test(onl$Final) #0.9443 ok!
## 
##  Shapiro-Wilk normality test
## 
## data:  onl$Final
## W = 0.97663, p-value = 0.9443
hist(onl$Final)

All three groups have Shapiro-Wilk p-values above 0.05, so the assumption of normality is met.

2. Homogeneity of variance

leveneTest(Modality$Final ~ as.factor(Modality$Modality))
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2   0.534 0.5948
##       19

p-value = 0.5948 - ok! Assumption of homogeneity of variance is met!

3. Interval-level data

Final exam scores look fine.

4. Independence of observations

OK!

All assumptions are met
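
The per-group Shapiro-Wilk checks under assumption 1 can also be run in a single call. The sketch below is only an illustration: the `scores` data frame is simulated and hypothetical (it is not the contents of Modality.xlsx), and `normality_p` is a name introduced here.

```r
# Hypothetical data: 22 simulated exam scores across three modalities.
# These values are NOT the contents of Modality.xlsx.
set.seed(1)
scores <- data.frame(
  Modality = rep(c("f2f", "hybrid", "online"), times = c(8, 7, 7)),
  Final    = round(rnorm(22, mean = 75, sd = 12))
)

# Run shapiro.test() within each group and keep only the p-values
normality_p <- tapply(scores$Final, scores$Modality,
                      function(x) shapiro.test(x)$p.value)
print(normality_p)

# TRUE means that group shows no significant departure from normality
print(normality_p > 0.05)
```

With the real data, replacing `scores` with `Modality` should give the same three p-values as the separate calls above.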

If the assumptions were not met, we would run the Kruskal-Wallis test instead. It is shown here for reference only, since it is not needed in this case.

kruskal.test(Modality$Final ~ Modality$Modality)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Modality$Final by Modality$Modality
## Kruskal-Wallis chi-squared = 5.3961, df = 2, p-value = 0.06734
#post-hoc test (non-parametric)
pairwise.wilcox.test(Modality$Final, Modality$Modality, p.adjust.method = "holm", exact = FALSE)
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  Modality$Final and Modality$Modality 
## 
##        f2f  hybrid
## hybrid 0.12 -     
## online 0.12 0.96  
## 
## P value adjustment method: holm

C. Run appropriate model - ANOVA

anovaMod <- aov(Modality$Final ~ Modality$Modality)
summary(anovaMod)
##                   Df Sum Sq Mean Sq F value Pr(>F)  
## Modality$Modality  2   1527   763.6   4.097 0.0332 *
## Residuals         19   3541   186.4                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

p-value = 0.0332 - this is significant!

We conclude that at least one group mean differs from the others, but which one?
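
As a quick check on how the ANOVA table fits together, the F value and p-value can be recomputed from the mean squares and degrees of freedom printed above. This is a minimal sketch using the rounded values from the table, so the results agree only to the printed precision; the variable names are introduced here for illustration.

```r
# Values copied from the summary(anovaMod) table above
ms_between <- 763.6   # Mean Sq for Modality
ms_within  <- 186.4   # Mean Sq for Residuals
df_between <- 2
df_within  <- 19

# F is the ratio of between-group to within-group mean squares
f_value <- ms_between / ms_within

# The upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(f_value, 3))
print(round(p_value, 4))
```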

D. Post-hoc test

pairwise.t.test(Modality$Final, Modality$Modality, p.adj="holm")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  Modality$Final and Modality$Modality 
## 
##        f2f  hybrid
## hybrid 0.06 -     
## online 0.06 1.00  
## 
## P value adjustment method: holm

f2f is not significantly different from hybrid or online at alpha = 0.05, although both comparisons are close to significance (p = 0.06).

Hybrid and online are not significantly different from each other (p = 1.00).
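
The two f2f comparisons share the same adjusted p-value, which is a common side effect of the Holm method. The sketch below shows how that can happen. The raw p-values are hypothetical and chosen only for illustration; they are not the unadjusted values behind the output above.

```r
# Hypothetical unadjusted p-values for the three pairwise comparisons.
# They are illustrative only, NOT the raw values behind the output above.
raw_p <- c(f2f_hybrid = 0.02, f2f_online = 0.03, hybrid_online = 0.90)

# Holm: sort ascending, multiply by 3, 2, 1, cap at 1, and never let an
# adjusted value fall below the one before it
holm_p <- p.adjust(raw_p, method = "holm")
print(holm_p)
```

Here 0.02 * 3 and 0.03 * 2 both give 0.06, so two different raw p-values end up with the same adjusted value.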

E. What would you conclude?

There is a significant overall difference in mean final exam scores across modalities (F(2, 19) = 4.097, p = 0.0332). However, the Holm-adjusted post-hoc tests do not identify any pairwise difference that is significant at alpha = 0.05. Students in the face-to-face modality scored higher on the final exam than students in the hybrid and online modalities, but these differences were only marginally significant (p = 0.06 for each). There is no evidence of a difference in final exam scores between the hybrid and online modalities (p = 1.00).

Q2.

You are interested in the relationship between anger and heart disease

A. Test

# Rows: low, moderate, high anger; columns: CHD, no CHD
cont_table <- data.frame(CHD = c(53, 110, 27), no_CHD = c(3057, 4621, 606))
chisq.test(cont_table)
## 
##  Pearson's Chi-squared test
## 
## data:  cont_table
## X-squared = 16.077, df = 2, p-value = 0.0003228

p-value is 0.0003

There is a significant difference in the proportion of CHD across the anger groups.

B. Assumptions

The assumption that all expected counts are greater than 5 is met. I know this because chisq.test() printed no warning. When expected counts are too small, R warns that the “Chi-squared approximation may be incorrect”.
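
A more direct way to check this assumption is to inspect the expected counts themselves. The sketch below rebuilds the same counts as a labelled matrix; the row and column labels and the names `chd_counts` and `expected_counts` are introduced here for illustration.

```r
# Same counts as cont_table above, stored as a labelled matrix
chd_counts <- matrix(
  c(53, 110, 27, 3057, 4621, 606),
  ncol = 2,
  dimnames = list(anger = c("low", "moderate", "high"),
                  outcome = c("CHD", "no_CHD"))
)

# chisq.test() returns the expected counts under independence
expected_counts <- chisq.test(chd_counts)$expected
print(round(expected_counts, 1))

# The assumption holds when every expected count is above 5
print(all(expected_counts > 5))
```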

C. Interpret your results.

What would you conclude? We need descriptive statistics.

53/(53+3057)
## [1] 0.0170418

low anger = 0.017 ~ 1.7%

110 / (110 +4621)
## [1] 0.0232509

moderate anger = 0.023 ~ 2.3%

27 / (27 + 606)
## [1] 0.04265403

high anger = 0.043 ~ 4.3%

53/3057
## [1] 0.01733726
27/606
## [1] 0.04455446
(27/606) / (53/3057)
## [1] 2.569867

The proportion of CHD is higher among those with high anger (4.3%) than among those with moderate (2.3%) or low (1.7%) anger, and the difference in proportions across the groups is significant. The odds of having CHD are 2.57 times as high for a person who scores high on the easily angered scale as for a person who scores low on that scale.
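
The descriptive statistics above can be computed in one pass with named vectors, which keeps the proportions (CHD / total) and the odds (CHD / no CHD) clearly separate. This is a sketch of the same arithmetic; the vector names are introduced here for illustration.

```r
chd    <- c(low = 53,   moderate = 110,  high = 27)
no_chd <- c(low = 3057, moderate = 4621, high = 606)

# Proportion with CHD in each anger group
prop_chd <- chd / (chd + no_chd)
print(round(prop_chd, 3))

# Odds of CHD in each group, then the odds ratio for high vs. low anger
odds_chd   <- chd / no_chd
odds_ratio <- unname(odds_chd["high"] / odds_chd["low"])
print(round(odds_ratio, 2))
```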

Q3.

Depression and Recreational drugs

Enter the data

df <- data.frame(drug=c("E", "E", "E", "E", "E", "E", "E", "E", "E", "E", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"), 
    depression = c(15, 35, 16, 18, 19, 17, 27, 16, 13, 20, 16, 15, 20, 15, 16, 13, 14, 19, 18, 18))

A. Boxplot for alcohol & ecstasy

boxplot(df$depression ~ df$drug, xlab = "Type of drug", ylab="Depression score")

Interpret

The ecstasy group has higher depression scores on average, with some high outliers.

B. Determine the appropriate statistical test

independent t-test

Assumptions

Conduct tests to determine if this data meets the assumptions

  • Independence - yes!

  • Data is interval - yes!

  • Data within groups is normal

ecstasy <- df %>% filter(drug=="E")
shapiro.test(ecstasy$depression) # p = 0.0195: significantly non-normal
## 
##  Shapiro-Wilk normality test
## 
## data:  ecstasy$depression
## W = 0.81064, p-value = 0.01952
hist(ecstasy$depression)

alcohol <- df %>% filter(drug == "A")
shapiro.test(alcohol$depression) #p=0.78 normal
## 
##  Shapiro-Wilk normality test
## 
## data:  alcohol$depression
## W = 0.95947, p-value = 0.7798
hist(alcohol$depression)

Assumption not met. The Shapiro-Wilk normality test for ecstasy$depression (W = 0.81064, p-value = 0.01952) is significant, which means the ecstasy scores deviate significantly from a normal distribution.

  • Homogeneity of variance
leveneTest(df$depression ~ df$drug)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  1.8803 0.1872
##       18

p-value = 0.187 - ok!

Homogeneity of variance assumption met!

C. Run appropriate test

Since the normality assumption isn’t met, run the Wilcoxon rank-sum test

wilcox.test(df$depression ~ df$drug, exact = FALSE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  df$depression by df$drug
## W = 35.5, p-value = 0.2861
## alternative hypothesis: true location shift is not equal to 0
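# An alternative way to run the same test; W differs because the group order
# is reversed (35.5 + 64.5 = 10 * 10), but the p-value is the same
wilcox.test(ecstasy$depression, alcohol$depression, exact = FALSE)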
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  ecstasy$depression and alcohol$depression
## W = 64.5, p-value = 0.2861
## alternative hypothesis: true location shift is not equal to 0

p-value = 0.2861
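
The two calls above report different W statistics (35.5 and 64.5) but the same p-value. The sketch below reproduces both calls on the data entered earlier and checks that the two W values sum to n1 * n2 = 100. The names `drug_df`, `ecstasy_scores`, `alcohol_scores`, `w_alcohol` and `w_ecstasy` are introduced here for illustration.

```r
# Same data as df above
drug_df <- data.frame(
  drug = rep(c("E", "A"), each = 10),
  depression = c(15, 35, 16, 18, 19, 17, 27, 16, 13, 20,
                 16, 15, 20, 15, 16, 13, 14, 19, 18, 18)
)

# Formula interface: groups are ordered alphabetically, so W is for "A"
w_alcohol <- wilcox.test(depression ~ drug, data = drug_df, exact = FALSE)

# Two-vector interface: W is for the first vector (ecstasy)
ecstasy_scores <- drug_df$depression[drug_df$drug == "E"]
alcohol_scores <- drug_df$depression[drug_df$drug == "A"]
w_ecstasy <- wilcox.test(ecstasy_scores, alcohol_scores, exact = FALSE)

# The two W statistics sum to n1 * n2; the p-values are identical
print(unname(w_alcohol$statistic + w_ecstasy$statistic))
print(all.equal(w_alcohol$p.value, w_ecstasy$p.value))
```

Either W can be reported, as long as the write-up says which group it refers to.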

D. Interpret results

The p-value from the Wilcoxon test is 0.2861, which is not significant. There is no significant difference in depression scores between ecstasy and alcohol users.

mean(alcohol$depression)
## [1] 16.4
sd(alcohol$depression)
## [1] 2.270585
median(alcohol$depression)
## [1] 16
mean(ecstasy$depression)
## [1] 19.6
sd(ecstasy$depression)
## [1] 6.60303
median(ecstasy$depression)
## [1] 17.5

The depression score for ecstasy (Mdn = 17.5) is not significantly different from the depression score for alcohol (Mdn = 16). Since the ecstasy group did not meet the assumption of normality, a Wilcoxon rank-sum test was conducted. The results show that the difference between the groups was not significant, W = 64.5, p = 0.29.

Q4.

Fostering kittens & happiness

A. Create a graphical visualization of the data

Kittens <- read_excel("Kittens.xlsx")

One option: boxplot

boxplot(Kittens$Kitten, Kittens$No_kitten, ylab = "Happiness", names=c("fostering", "no fostering"))

Second option: look at the difference scores

diff <- Kittens$Kitten - Kittens$No_kitten
boxplot(diff, ylab = "Difference in happiness of fostering v. not")

hist(diff, xlab = "Happiness of fostering - happiness without fostering")

B. Determine the appropriate statistical test

  • Dependent t-test

  • Assumptions

    1. Differences are normally distributed

    2. Data are dependent - yes!

    3. Data are measured at least at the interval level - yes!

C. Test assumption of normality

shapiro.test(diff) 
## 
##  Shapiro-Wilk normality test
## 
## data:  diff
## W = 0.86632, p-value = 0.01013

p-value = 0.01013 - the difference scores are not normally distributed, so the normality assumption is not met.

D. Run statistical test

wilcox.test(Kittens$Kitten, Kittens$No_kitten, paired = TRUE, exact = FALSE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Kittens$Kitten and Kittens$No_kitten
## V = 141.5, p-value = 0.06372
## alternative hypothesis: true location shift is not equal to 0

p-value = 0.06. This is not significant (but close).
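
The paired Wilcoxon test works on the difference scores, so running it on the two columns with `paired = TRUE` should match a one-sample test on the differences. The sketch below illustrates this with hypothetical scores for eight people; these values are invented for the example and are not the Kittens.xlsx data.

```r
# Hypothetical happiness scores for eight people (NOT the Kittens.xlsx data)
with_kitten    <- c(13, 12, 12, 15, 16, 12, 11, 19)
without_kitten <- c(10, 12,  9, 14, 11, 13,  8, 15)

# Paired call, in the same form as the test above
paired_test <- wilcox.test(with_kitten, without_kitten,
                           paired = TRUE, exact = FALSE)

# One-sample call on the difference scores
diff_test <- wilcox.test(with_kitten - without_kitten, exact = FALSE)

# Both calls report the same V statistic and p-value
print(c(paired = unname(paired_test$statistic),
        diff   = unname(diff_test$statistic)))
print(all.equal(paired_test$p.value, diff_test$p.value))
```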

E. Interpret results

median(diff)
## [1] 4.5

Testing whether fostering kittens increases happiness, we find that people experience a median increase of 4.5 on their happiness score. The difference scores were not normally distributed, so we ran a Wilcoxon signed-rank test (V = 141.5, p = 0.06). The p-value is close to our 0.05 cutoff, but at alpha = 0.05 we do not find a statistically significant increase in happiness with fostering kittens.