Statistical Inference

2 sample tests

t-test for 2 independent samples (1)

You are a market analyst and want to make a survey analysis of customer satisfaction of 2 different car rental companies. Based on your survey (just 20 observations), can we conclude that the satisfaction levels of Panek and Traficar are significantly different?

trafficar <- c(18,21,16,22,19,24,17,21,23,18,14,16,16,19,18,20,12,22,15,17)
panek <- c(22, 25, 17, 24, 16, 29, 20, 23, 19,20,15,15,18,26,18,24,18,25,19,16)
mean(trafficar)

## [1] 18.4

mean(panek)

## [1] 20.45

First, let’s take a look at the plots.

# We are using ggboxplot function from ggpubr library.

dane <- data.frame( 
                company = as.factor(rep(c("traficar", "panek"), each = 20)),
                rating = c(trafficar,  panek)
                )

p <- ggplot(dane, aes(x=rating, y=company)) + 
  geom_boxplot()
p + geom_dotplot(binaxis='y', stackdir='center', dotsize=1)+ theme(legend.position="bottom")

var(panek)

## [1] 16.47105

var(trafficar)

## [1] 9.936842

Let’s draw a histogram with a density plot in the background:

ggdensity(panek, 
          main = "Density chart for Panek ratings",
          xlab = "Satisfaction with services")

ggdensity(trafficar, 
          main = "Density chart for Traficar ratings",
          xlab = "Satisfaction with services")

Tests of normality of both distributions:

ggqqplot(panek)

ggqqplot(trafficar)

# The null hypothesis: normality of distribution (the strongest test is the Shapiro-Wilk test):
shapiro.test(panek)

## 
##  Shapiro-Wilk normality test
## 
## data:  panek
## W = 0.94235, p-value = 0.2654

shapiro.test(trafficar)

## 
##  Shapiro-Wilk normality test
## 
## data:  trafficar
## W = 0.98197, p-value = 0.9569

# in both cases p>>alpha - that is, we can not reject the hypothesis of normality of the distributions! and so we can make full use of the Student's t-test, but we only need to watch out for differences in variances!

And we perform the test ;-) if the assumptions are met, variance, normality….

# Step one - which variance is higher? check:
var(trafficar)

## [1] 9.936842

var(panek) # - higher variance than trafficar

## [1] 16.47105

# Step two - in the F-test, we give in turn the variable with the higher and then the lower of the two variances - this is the variance quotient test (higher by lower) - one-sided, right-sided.
var.test(panek, trafficar, alternative="greater")  # no problem with homogeneity of variance

## 
##  F test to compare two variances
## 
## data:  panek and trafficar
## F = 1.6576, num df = 19, denom df = 19, p-value = 0.1398
## alternative hypothesis: true ratio of variances is greater than 1
## 95 percent confidence interval:
##  0.764475      Inf
## sample estimates:
## ratio of variances 
##           1.657574

Finally, thanks to those 2 assumptions tests, we know which test should be applied - it’s a t-test with pooled variances:

# we can perform the test with the usual t.test function:
t.test(trafficar, panek, mu=0, paired=FALSE, var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  trafficar and panek
## t = -1.784, df = 38, p-value = 0.08241
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.3761996  0.2761996
## sample estimates:
## mean of x mean of y 
##     18.40     20.45

# or visualize the results on a violin chart:
ggbetweenstats(
  data=dane,
  x=company, #2 groups
  y=rating,
  var.equal = TRUE,
  k=3  #3 Number of digits after decimal point
)

In this case, we were able to use the t-test for independent samples with homogeneous - common variance.

At a significance level of 0.05, we have no basis for rejecting the hypothesis that Panek and Trafficar’s customer ratings are equal (p=0.082).

t-test for 2 independent samples (2)

Based on the “mtcars” dataset, evaluate the assumption that cars with automatic transmission have significantly higher fuel consumption than those with a manual transmission.

data(mtcars)
attach(mtcars)
ggqqplot(mtcars[am==0,1])  #automatic

ggqqplot(mtcars[am==1,1])   #manual

shapiro.test(mtcars[am==0,1])

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars[am == 0, 1]
## W = 0.97677, p-value = 0.8987

shapiro.test(mtcars[am==1,1])

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars[am == 1, 1]
## W = 0.9458, p-value = 0.5363

var(mtcars[am==0,1])

## [1] 14.6993

var(mtcars[am==1,1])

## [1] 38.02577

# in the F-test we give successively the variable with the higher and then the lower of the two variances - this is the ratio of variance test (higher by lower) - one-sided, always right-sided.
var.test(mtcars[am==1,1],mtcars[am==0,1],
         alternative="greater")  #any conclusions here?

## 
##  F test to compare two variances
## 
## data:  mtcars[am == 1, 1] and mtcars[am == 0, 1]
## F = 2.5869, num df = 12, denom df = 18, p-value = 0.03345
## alternative hypothesis: true ratio of variances is greater than 1
## 95 percent confidence interval:
##  1.104542      Inf
## sample estimates:
## ratio of variances 
##           2.586911

# finally the t-test
ggbetweenstats(
  data=mtcars,
  x=am, #2 grupy
  y=mpg,
  var.equal=FALSE,
  k=3  #3 Number of digits after decimal point
)

We are forced to use the Student’s t-test for independent samples with heterogeneous variance. At the 0.01 level of significance, we reject the hypothesis of equality of average combustion (mpg) for cars with automatic and manual transmissions (p=0).

t-test for 2 dependent samples (paired)

A study was conducted to test a new sleeping drug. Ten students were invited to participate. Evaluate whether the additional sleep time (“extra”) is significantly longer when the students received the drug than when the students received a placebo.

The data are in the “sleep” set. Let’s check the normality of the students’ sleep time. For this, we use a quantile-quantile normality plot and a very strong Shapiro-Wilk normality test.

data(sleep)
attach(sleep)
ggqqplot(extra)

shapiro.test(extra)

## 
##  Shapiro-Wilk normality test
## 
## data:  extra
## W = 0.94607, p-value = 0.3114

Both the normality plot and the test (p>>alpha) indicate that the distribution of sleep time is normal.

For this reason, we will use the Student’s t-test for dependent samples for comparison (these are the same students tested on placebo and on the drug).

Null hypothesis: additional sleep time (“extra”) is not significantly different when students received the drug than when students received the placebo.

Alternative hypothesis: additional sleep time (“extra”) is significantly longer when students received the drug than when students received the placebo.

t.test(extra[group==2],extra[group==1],
       paired=TRUE,alternative="greater")

## 
##  Paired t-test
## 
## data:  extra[group == 2] and extra[group == 1]
## t = 4.0621, df = 9, p-value = 0.001416
## alternative hypothesis: true mean difference is greater than 0
## 95 percent confidence interval:
##  0.8669947       Inf
## sample estimates:
## mean difference 
##            1.58

As you can see, the test statistic is 4.06, p<<alpha, hence we are entitled to reject the null hypothesis and conclude that the new sleeping drug significantly increases sleep time.

Let’s conduct the same study using the ggstatsplot package. The “ggwithinstats” function is used to perform tests for dependent samples, while the “ggbetweenstats” function is used to perform tests for independent samples.

#very important! after - before:
sleep$group <- reorder(sleep$group, new.order=c(2, 1))

ggwithinstats(
  data=sleep,
  x=group,
  y=extra,
  type="p",
  k=5  #5 Number of digits after decimal point
)

Challenge 1.

Is that true that the amount of loans vary significantly by credit risk?

NOTE: if the assumption of normality of the distribution is not met - we must use a non-parametric substitute of t-test (Wilcoxon rank test; type=“np”). Please take a look at the lecture notes and read what these tests are based on.

data("GermanCredit")
attach(GermanCredit)
ggqqplot(GermanCredit$amount[credit_risk == "good"])

ggqqplot(GermanCredit$amount[credit_risk == "bad"])

#Both not normal

sd(GermanCredit$amount[credit_risk == "bad"])

## [1] 3535.819

sd(GermanCredit$amount[credit_risk == "good"])

## [1] 2401.472

var.test(GermanCredit$amount[credit_risk == "bad"], GermanCredit$amount[credit_risk == "good"], alternative = "two.sided")

## 
##  F test to compare two variances
## 
## data:  GermanCredit$amount[credit_risk == "bad"] and GermanCredit$amount[credit_risk == "good"]
## F = 2.1678, num df = 299, denom df = 699, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.796126 2.636073
## sample estimates:
## ratio of variances 
##           2.167828

report(wilcox.test(GermanCredit$amount[credit_risk == "good"], GermanCredit$amount[credit_risk == "bad"], type="nonparametric", alternative = "less"))

## Effect sizes were labelled following Funder's (2019) recommendations.
## 
## The Wilcoxon rank sum test with continuity correction testing the difference in
## ranks between GermanCredit$amount[credit_risk == "good"] and
## GermanCredit$amount[credit_risk == "bad"] suggests that the effect is negative,
## statistically significant, and small (W = 93480.00, p = 0.003; r (rank
## biserial) = -0.11, 95% CI [-1.00, -0.04])

ggbetweenstats(
  GermanCredit,
  x=credit_risk,
  y=amount,
  type="nonparametric",
)

Challenge 2.

We will use data from the IMDB portal for the analysis.

Let’s verify the hypothesis that dramas have better reviews than romantic comedies.

With 2 independent samples, we need to check that the distributions in both cases are normal and that the variance is homogeneous (at the same level) to use the Student’s t-test. If not, we must use the rank-based nonparametric tests instead.

Let’s check the normality of rating:

data("movies_long")   # IMDB datasets
help("movies_long")
str(movies_long)

## tibble [1,579 × 8] (S3: tbl_df/tbl/data.frame)
##  $ title : chr [1:1579] "Shawshank Redemption, The" "Lord of the Rings: The Return of the King, The" "Lord of the Rings: The Fellowship of the Ring, The" "Lord of the Rings: The Two Towers, The" ...
##  $ year  : int [1:1579] 1994 2003 2001 2002 1994 1993 1977 1980 1968 2002 ...
##  $ length: int [1:1579] 142 251 208 223 168 195 125 129 158 135 ...
##  $ budget: num [1:1579] 25 94 93 94 8 25 11 18 5 3.3 ...
##  $ rating: num [1:1579] 9.1 9 8.8 8.8 8.8 8.8 8.8 8.8 8.7 8.7 ...
##  $ votes : int [1:1579] 149494 103631 157608 114797 132745 97667 134640 103706 17241 25964 ...
##  $ mpaa  : Factor w/ 3 levels "PG","PG-13","R": 3 2 2 2 3 3 1 1 2 3 ...
##  $ genre : Factor w/ 9 levels "Action","Action Comedy",..: 7 1 1 1 7 7 1 1 7 7 ...

ggqqplot(rating[genre=="RomCom"])

ggqqplot(rating[genre=="Drama"])

movies_long %>%
  group_by(genre) %>%
  shapiro_test(rating)

## # A tibble: 9 × 4
##   genre         variable statistic          p
##   <fct>         <chr>        <dbl>      <dbl>
## 1 Action        rating       0.990 0.210     
## 2 Action Comedy rating       0.978 0.139     
## 3 Action Drama  rating       0.984 0.161     
## 4 Animated      rating       0.965 0.234     
## 5 Comedy        rating       0.993 0.311     
## 6 Comedy Drama  rating       0.982 0.0383    
## 7 Drama         rating       0.979 0.00000863
## 8 Romance Drama rating       0.940 0.000130  
## 9 RomCom        rating       0.972 0.000556

As we can see, the ratings of both movie genres are not distributed normally.

Finally, we know which test is the most appropriate one for verification of our claim. Let’s calculate the test statistic and the p-value:

wilcox.test(rating[genre=="RomCom"],rating[genre=="Drama"])

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  rating[genre == "RomCom"] and rating[genre == "Drama"]
## W = 26195, p-value = 8.639e-14
## alternative hypothesis: true location shift is not equal to 0

We can see the test statistic for Wilcoxon W = 26195 being significant (p=0).

We can also visualize and compute this test with the ggstatsplot package using the “ggbetweenstats” function:

movies_long %>%
  filter(genre %in% c("RomCom", "Drama")) %>%
ggbetweenstats(
  x=genre,
  y=rating,
  type="np",
  k=3
)

Here also (M-W test statistic = 57265 is significant, p=0).

Tests for 2 proportions

We will use ‘germancredit’ credit data.

Is there any basis for the claim that the % of people with a bad credit score is significantly higher among foreign workers?

data(GermanCredit)
attach(GermanCredit)
# test for 2 proportions - Chi2 test for proportions (contingency):
ggpiestats(
  data=GermanCredit,
  x=foreign_worker,
  y=credit_risk,
  proportion.test=TRUE,
  k=5  #5 Number of digits after decimal point 
)

# base version - without visualization:

tabelka<-table(foreign_worker,credit_risk) #umieszczamy dane w tabeli
prop.test(tabelka, correct = FALSE) # testujemy proporcje tabelki

## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  tabelka
## X-squared = 6.737, df = 1, p-value = 0.009443
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.30347582 -0.09505355
## sample estimates:
##    prop 1    prop 2 
## 0.6926272 0.8918919

# Continuity correction remarks:

# Unlike a number of statistical softwares, ggstatsplot doesn't provide the option for Yates' correction for the Pearson's chi-squared statistic. This is due to compelling amount of Monte-Carlo simulation research which suggests that the Yates' correction is overly conservative, even in small sample sizes. As such it is recommended that it should not ever be applied in practice (Camilli & Hopkins, 1978, 1979; Feinberg, 1980; Larntz, 1978; Thompson, 1988).

The Chi2 test of contingency-difference of proportions (without adjusting for continuity) showed that these differences are statistically significant (p=0.009).

We conclude that the level of credit risk is significantly different in domestic and foreign employees -> conclusion: this variable significantly determines risk - it is worth including in the scoring model.

Paired proportions tests

Presidential Approval Ratings - Approval of the President’s performance in office in two surveys, one month apart, for a random sample of 1600 voting-age Americans.

Performance <-
    matrix(c(794, 86, 150, 570),
           nrow = 2,
           dimnames = list("1st Survey" = c("Approve", "Disapprove"),
               "2nd Survey" = c("Approve", "Disapprove")))
Performance

##             2nd Survey
## 1st Survey   Approve Disapprove
##   Approve        794        150
##   Disapprove      86        570

mcnemar.test(Performance)

## 
##  McNemar's Chi-squared test with continuity correction
## 
## data:  Performance
## McNemar's chi-squared = 16.818, df = 1, p-value = 4.115e-05

Conclusion: We may reject the null hypothesis here - and conclude that the approval of the President’s performance significantly changed (A-D) - in fact, drop in approval ratings!

Challenge 3.

We are using here the Arthritis dataset from the package vcd. The data represent a double-blind clinical trial of new treatments for rheumatoid arthritis.

data(Arthritis)
head(Arthritis)   # top few observations from the data

##   ID Treatment  Sex Age Improved
## 1 57   Treated Male  27     Some
## 2 46   Treated Male  29     None
## 3 77   Treated Male  30     None
## 4 17   Treated Male  32   Marked
## 5 36   Treated Male  46   Marked
## 6 23   Treated Male  58   Marked

help(Arthritis)

Arthritis$Improvement <- ifelse(Arthritis$Improved=="Marked","yes","no")

tabelka <- xtabs(~Treatment+Improvement,data=Arthritis) 
mcnemar.test(tabelka)

## 
##  McNemar's Chi-squared test with continuity correction
## 
## data:  tabelka
## McNemar's chi-squared = 5.3333, df = 1, p-value = 0.02092

mosaic(Improvement~Treatment, gp = shading_Friendly, data=Arthritis)

Treatment (Placebo, Treated), Sex (Male, Female), and Improved (None, Some, Marked) are all categorical factors.

Please verify the hypothesis, that the new treatments for rheumatoid arthritis significantly improved the health of patients.

If the test is requested for two paired groups, the mcnemar.test is used.

If the test is requested for more than two paired groups, the test based on Cochran-Mantel-Haenzen for repeated measures is used (powered by mantelhaen.test).

COCHRAN–MANTEL–HAENSZEL TEST

The mantelhaen.test() function provides a Cochran–Mantel–Haenszel chi-square test of the null hypothesis that two nominal variables are conditionally independent in each stratum of a third variable.

The following code tests the hypothesis that Treatment and Improved variables are independent within each level Sex. The test assumes that there’s no three-way (Treatment x Improved x Sex) interaction.

mytable <- xtabs(~Treatment+Improved+Sex, data=Arthritis)
mantelhaen.test(mytable)

## 
##  Cochran-Mantel-Haenszel test
## 
## data:  mytable
## Cochran-Mantel-Haenszel M^2 = 14.632, df = 2, p-value = 0.0006647

mosaic(Improved ~ Treatment | Sex, data = Arthritis, zero_size = 0,
       highlighting_direction = "right")

Conclusion???