How to combine ANOVA and Tukey Tests in R and Python

Author

Shah Nawaz

0.1 Introduction

This document collects the assignments for the Machine Learning course and covers the following topics:

  • Statistical models and their implementation
  • Supervised and unsupervised machine learning applied to real-world data
  • Theory behind the above topics

1 General Steps in all chapters

1.1 Importing libraries

First of all, we import the libraries needed in each section in the code chunk below.


# Data Wrangling modules
import pandas as pd 
import numpy as np 

# Libraries with datasets and visualizations
import seaborn as sns
import matplotlib.pyplot as plt
import bokeh as bk
import plotly.express as px

# Statistics Modules
import scipy 

Afterwards we move on to importing the dataset in the @data-importing section.

tips = sns.load_dataset("tips")
tips.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
# `py$tips` pulls the Python `tips` dataframe into R via reticulate
library(reticulate)
library(ggplot2)

tips = py$tips

# Box plot for the categorical and continuous variable
ggplot(tips) + aes(x = sex, y = tip, fill = time) + geom_boxplot() +
    scale_fill_hue(direction = 1)

Figure 1: Boxplot of tips for each sex at different times
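For comparison, an equivalent boxplot can be drawn directly in Python with seaborn; a minimal sketch using the same tips dataframe already loaded above:

# seaborn equivalent of the ggplot boxplot above
sns.boxplot(data=tips, x="sex", y="tip", hue="time")
plt.title("Boxplot of tips for each sex at different times")
plt.show()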

1.2 ANOVA Tests

1.2.1 1 way ANOVA in python

A one-way ANOVA tests whether the mean of one continuous dependent variable differs across the levels of a single categorical independent variable. Here the dependent variable is tip and the factor is sex.

# dropping rows with `NA` values from the dataframe
tips <- na.omit(tips)
# getting a glimpse of dataframe
head(tips)
  total_bill  tip    sex smoker day   time size
1      16.99 1.01 Female     No Sun Dinner    2
2      10.34 1.66   Male     No Sun Dinner    3
3      21.01 3.50   Male     No Sun Dinner    3
4      23.68 3.31   Male     No Sun Dinner    2
5      24.59 3.61 Female     No Sun Dinner    4
6      25.29 4.71   Male     No Sun Dinner    4
# renaming all column names to lower case (rename_all comes from dplyr)
library(dplyr)
tips <- tips %>%
    rename_all(tolower)
#| label: shapiro-test
shapiro.test(tips$tip)

    Shapiro-Wilk normality test

data:  tips$tip
W = 0.89781, p-value = 8.2e-12

The tip column in the tips dataframe is not normally distributed, since the p-value for the Shapiro-Wilk test is < \(\alpha\) and we therefore reject the null hypothesis of normality. We can run the same normality check on other continuous variables as well.

shapiro.test(tips$total_bill)

    Shapiro-Wilk normality test

data:  tips$total_bill
W = 0.91972, p-value = 3.325e-10
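For completeness, the same normality checks can also be run on the Python copy of the data with scipy.stats.shapiro; a minimal sketch (the conclusion is the same, with both p-values far below \(\alpha\)):

from scipy import stats

# Shapiro-Wilk normality tests on the Python `tips` dataframe
w_tip, p_tip = stats.shapiro(tips['tip'])
w_bill, p_bill = stats.shapiro(tips['total_bill'])
print(f"tip:        W = {w_tip:.5f}, p = {p_tip:.3g}")
print(f"total_bill: W = {w_bill:.5f}, p = {p_bill:.3g}")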

The one-way ANOVA itself can be run with the model given in the Python code chunk below.



import statsmodels.api as sm 
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.sandbox.stats.multicomp import MultiComparison



one_way_anova = ols('tip ~ sex', data=tips).fit()
# note: the keyword is `typ`, not `type`; the table below is the default type-1 ANOVA
sm.stats.anova_lm(one_way_anova, typ=1)
             df      sum_sq   mean_sq         F    PR(>F)
sex         1.0    3.673534  3.673534  1.926155  0.166456
Residual  242.0  461.538943  1.907186       NaN       NaN
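As a quick cross-check, the same F statistic and p-value can be obtained with scipy.stats.f_oneway applied to the two sex groups; a minimal sketch:

from scipy import stats

# one-way ANOVA computed directly from the two groups
# (should reproduce F ≈ 1.93, p ≈ 0.166 from the table above)
female_tips = tips.loc[tips['sex'] == 'Female', 'tip']
male_tips = tips.loc[tips['sex'] == 'Male', 'tip']
f_stat, p_val = stats.f_oneway(female_tips, male_tips)
print(f"F = {f_stat:.4f}, p = {p_val:.4f}")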

1.2.2 1 way ANOVA in R

We are using the same variables as above and running the 1 way ANOVA in R below.

# As compared to python we first fit the model and then
# apply ANOVA below
LM = lm(tip ~ sex, data = tips)
# summary(LM)

OWA = aov(tip ~ sex, data = tips)
summary(OWA)
             Df Sum Sq Mean Sq F value Pr(>F)
sex           1    3.7   3.674   1.926  0.166
Residuals   242  461.5   1.907               

1.2.3 2 way ANOVA in python

In this test we check both the individual (main) effects of two categorical variables and their interaction on the response. In the example below we test how much the individual and combined effects of sex and smoker impact the tip variable. The other parameters for fitting the regression are the same as for the 1-way ANOVA given in Section 1.2.1.


two_way_anova = ols('tip ~ C(sex) + C(smoker) + C(sex):C(smoker)', data=tips).fit()
# sequential (type-1) ANOVA table; `typ` is the correct keyword in anova_lm
sm.stats.anova_lm(two_way_anova, typ=1)
                     df      sum_sq   mean_sq         F    PR(>F)
C(sex)              1.0    3.673534  3.673534  1.912950  0.167921
C(smoker)           1.0    0.015000  0.015000  0.007811  0.929648
C(sex):C(smoker)    1.0    0.639891  0.639891  0.333216  0.564313
Residual          240.0  460.884051  1.920350       NaN       NaN

Our two-way ANOVA results tell us that neither the individual nor the combined (interaction) effects of sex and smoker on the tip variable are significant, since the p-value in each case is greater than 0.05.

1.2.4 2 way ANOVA in R

TWA <- aov(tip ~ sex + smoker + sex:smoker, data = tips)

summary(TWA)
             Df Sum Sq Mean Sq F value Pr(>F)
sex           1    3.7   3.674   1.913  0.168
smoker        1    0.0   0.015   0.008  0.930
sex:smoker    1    0.6   0.640   0.333  0.564
Residuals   240  460.9   1.920               

Another way of running the 2-way ANOVA test is with the pingouin package.

import pingouin as pg

two_way_anova_pg = pg.anova(data=tips,dv='tip',between=['sex','smoker'],detailed=True)
print(two_way_anova_pg)
         Source          SS     DF        MS         F     p-unc       np2
0           sex    3.672183    1.0  3.672183  1.912247  0.167999  0.007905
1        smoker    0.015000    1.0  0.015000  0.007811  0.929648  0.000033
2  sex * smoker    0.639891    1.0  0.639891  0.333216  0.564313  0.001386
3      Residual  460.884051  240.0  1.920350       NaN       NaN       NaN

Although our p-values are greater than 0.05, we will still run the Tukey tests below for the sake of practice.

1.3 Tukey Tests

1.3.1 Tukey test - 1 way ANOVA in python

The section below describes the post hoc analysis with the Tukey test on the 1-way ANOVA performed in Section 1.2.1. Post hoc tests such as Tukey's HSD show which of the individual group means are significantly different from one another. In our case the target continuous variable is tip.

pt = pg.pairwise_tukey(data=tips,dv="tip",between='sex')
print(pt)
        A     B   mean(A)   mean(B)  ...        se        T   p-tukey    hedges
0  Female  Male  2.833448  3.089618  ...  0.184579 -1.38786  0.166456 -0.184919

[1 rows x 9 columns]

Another way of running the Tukey test after a 1-way ANOVA is given below.

import statsmodels.stats.multicomp as smm

one_tukey_smm = smm.pairwise_tukeyhsd(endog=tips['tip'],groups=tips['sex'],alpha=0.05)

print(one_tukey_smm)
 
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff p-adj   lower  upper  reject
---------------------------------------------------
Female   Male   0.2562 0.1665 -0.1074 0.6198  False
---------------------------------------------------
# extract the comparison rows from the Tukey summary table
rows = one_tukey_smm.summary().data[1:]
# one horizontal line per comparison, spanning its confidence interval
plt.hlines(range(len(rows)), [row[4] for row in rows], [row[5] for row in rows])
# dashed vertical line at zero: intervals crossing it are not significant
plt.vlines(0, -1, len(rows) - 1, linestyles='dashed')
plt.gca().set_yticks(range(len(rows)))
plt.gca().set_yticklabels([f'{x[0]}-{x[1]}' for x in rows])
plt.show()
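The same confidence-interval plot can also be produced with the built-in plot_simultaneous method on the Tukey result object; a minimal sketch:

# built-in simultaneous confidence-interval plot for the Tukey HSD result
one_tukey_smm.plot_simultaneous(xlabel='tip difference', ylabel='group')
plt.show()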

 

1.3.2 Tukey test - 1 way ANOVA in R

In the code below we run the Tukey test after the 1-way ANOVA in R.

library(agricolae)
# Tukey Test 1 way ANOVA
Tukey = HSD.test(OWA, "sex", group = TRUE, alpha = 0.05)
Tukey
$statistics
   MSerror  Df     Mean       CV
  1.907186 242 2.998279 46.06006

$parameters
   test name.t ntr StudentizedRange alpha
  Tukey    sex   2         2.785739  0.05

$means
            tip      std   r Min  Max Q25  Q50  Q75
Female 2.833448 1.159495  87   1  6.5   2 2.75 3.50
Male   3.089618 1.489102 157   1 10.0   2 3.00 3.76

$comparison
NULL

$groups
            tip groups
Male   3.089618      a
Female 2.833448      a

attr(,"class")
[1] "group"
# Tukey test representation:
plot(Tukey, las = 1, col = "brown")


1.3.3 Tukey test - 2 way ANOVA in python

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# build the factor combinations used in the 2-way ANOVA (sex x smoker)
tips['combination'] = tips['sex'].astype(str) + ":" + tips['smoker'].astype(str)

# perform multiple pairwise comparison (Tukey HSD) on the combined groups
two_tukey_smm = smm.pairwise_tukeyhsd(endog=tips['tip'], groups=tips['combination'], alpha=0.05)

# coerce the tukeyhsd table to a DataFrame
tukey_data = pd.DataFrame(data=two_tukey_smm._results_table.data[1:],
                          columns=two_tukey_smm._results_table.data[0])

# count, for each group, how many comparisons were rejected
group1_comp = tukey_data.loc[tukey_data.reject == True].groupby('group1').reject.count()
group2_comp = tukey_data.loc[tukey_data.reject == True].groupby('group2').reject.count()
tukey_data = pd.concat([group1_comp, group2_comp], axis=1)

tukey_data = tukey_data.fillna(0)
tukey_data.columns = ['reject1', 'reject2']
tukey_data['total_sum'] = tukey_data.reject1 + tukey_data.reject2

# just show the top 20 results (empty here, since no comparison is significant)
tukey_data.sort_values('total_sum', ascending=False).head(20)
Empty DataFrame
Columns: [reject1, reject2, total_sum]
Index: []

Let's repeat the above procedure with a different grouping variable (time) to see whether we still get an empty frame.


# perform multiple pairwise comparison (Tukey HSD) with `time` as the grouping variable
one_tukey_smm = smm.pairwise_tukeyhsd(endog=tips['tip'], groups=tips['time'], alpha=0.05)

# coerce the tukeyhsd table to a DataFrame
tukey_data = pd.DataFrame(data=one_tukey_smm._results_table.data[1:],
                          columns=one_tukey_smm._results_table.data[0])

group1_comp = tukey_data.loc[tukey_data.reject == True].groupby('group1').reject.count()
group2_comp = tukey_data.loc[tukey_data.reject == True].groupby('group2').reject.count()
tukey_data = pd.concat([group1_comp, group2_comp], axis=1)

tukey_data = tukey_data.fillna(0)
tukey_data.columns = ['reject1', 'reject2']
tukey_data['total_sum'] = tukey_data.reject1 + tukey_data.reject2

# just show the top 20 results (again empty: Lunch vs Dinner is not significant)
tukey_data.sort_values('total_sum', ascending=False).head(20)
Empty DataFrame
Columns: [reject1, reject2, total_sum]
Index: []

1.3.4 Tukey test - 2 way ANOVA in R

two_tukeyr <- HSD.test(TWA, c("sex", "smoker"), main = "tip~sex+smoker+sex:smoker",
    console = TRUE)

Study: tip~sex+smoker+sex:smoker

HSD Test for tip 

Mean Square Error:  1.92035 

sex:smoker,  means

                tip      std  r  Min  Max
Female:No  2.773519 1.128425 54 1.00  5.2
Female:Yes 2.931515 1.219916 33 1.00  6.5
Male:No    3.113402 1.489559 97 1.25  9.0
Male:Yes   3.051167 1.500120 60 1.00 10.0

Alpha: 0.05 ; DF Error: 240 
Critical Value of Studentized Range: 3.658742 

Groups according to probability of means differences and alpha level( 0.05 )

Treatments with the same letter are not significantly different.

                tip groups
Male:No    3.113402      a
Male:Yes   3.051167      a
Female:Yes 2.931515      a
Female:No  2.773519      a
two_tukeyr
$statistics
  MSerror  Df     Mean       CV
  1.92035 240 2.998279 46.21875

$parameters
   test     name.t ntr StudentizedRange alpha
  Tukey sex:smoker   4         3.658742  0.05

$means
                tip      std  r  Min  Max Q25  Q50    Q75
Female:No  2.773519 1.128425 54 1.00  5.2   2 2.68 3.4375
Female:Yes 2.931515 1.219916 33 1.00  6.5   2 2.88 3.5000
Male:No    3.113402 1.489559 97 1.25  9.0   2 2.74 3.7100
Male:Yes   3.051167 1.500120 60 1.00 10.0   2 3.00 3.8200

$comparison
NULL

$groups
                tip groups
Male:No    3.113402      a
Male:Yes   3.051167      a
Female:Yes 2.931515      a
Female:No  2.773519      a

attr(,"class")
[1] "group"
plot(two_tukeyr, las = 1, col = "brown")

1.3.5 Saving tukey results table

In the code below we save the Tukey results table as a dataframe (note that one_tukey_smm was most recently assigned in Section 1.3.3, so this saves that comparison). The same procedure can be followed for tabular results from other tests.


ANOVA_data = pd.DataFrame(data=one_tukey_smm._results_table.data[1:], columns = one_tukey_smm._results_table.data[0])
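Since _results_table is a private attribute, a slightly safer variant (a sketch, assuming the public summary() method of the result object) builds the same dataframe and optionally writes it to disk:

# build the same DataFrame from the public summary() table
summary_rows = one_tukey_smm.summary().data
tukey_df = pd.DataFrame(summary_rows[1:], columns=summary_rows[0])
# hypothetical output path; change as needed
tukey_df.to_csv("tukey_results.csv", index=False)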

1.3.6 Lettering and Grouping of 2-way ANOVA

In the significance representation after an ANOVA or Tukey test, groups are annotated with letters such as 'a', 'b', 'c'. Groups that share a letter are not significantly different from each other, while groups that share no letter are significantly different. The number of letters depends on how many distinct groupings emerge. In the 2-way ANOVA Tukey result above, all four sex:smoker groups share the single letter 'a', which means none of them differ significantly in mean tip. If Female:Yes had instead been labelled only with 'b', we would have concluded that the other three groups (Male:No, Male:Yes and Female:No) are significantly different from the Female:Yes group.
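As a rough illustration of where the letters come from, we can inspect the reject column of the Python Tukey result for the sex:smoker combinations from Section 1.3.3; pairs that are never flagged as significantly different end up sharing a letter (a minimal sketch, reusing the two_tukey_smm object):

# list the pairs flagged as significantly different; these drive the letter grouping
pairs = pd.DataFrame(two_tukey_smm.summary().data[1:],
                     columns=two_tukey_smm.summary().data[0])
significant = pairs.loc[pairs['reject'] == True, ['group1', 'group2', 'p-adj']]
if significant.empty:
    print("No pair differs significantly, so every group shares the letter 'a'.")
else:
    print(significant)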

1.4 Manova test

‘MANOVA’ stands for multivariate analysis of variance. In this test we examine whether the independent variable(s) have a statistically significant effect on two or more dependent variables considered jointly.

from statsmodels.multivariate.manova import MANOVA

mnova = MANOVA.from_formula("smoker+sex+time~ tip",data=tips)

print(mnova.mv_test())
                                Multivariate linear model
==========================================================================================
                                                                                          
------------------------------------------------------------------------------------------
       Intercept               Value         Num DF  Den DF         F Value         Pr > F
------------------------------------------------------------------------------------------
          Wilks' lambda               0.0000 4.0000 239.0000 10154342556052284.0000 0.0000
         Pillai's trace               1.0000 4.0000 239.0000 10154342556052286.0000 0.0000
 Hotelling-Lawley trace 169947155749829.0312 4.0000 239.0000 10154342556052284.0000 0.0000
    Roy's greatest root 169947155749829.0312 4.0000 239.0000 10154342556052284.0000 0.0000
------------------------------------------------------------------------------------------
                                                                                          
-----------------------------------------------------------------------------------------------
                   tip                Value       Num DF       Den DF       F Value      Pr > F
-----------------------------------------------------------------------------------------------
                   Wilks' lambda      0.9809      4.0000      239.0000       1.1607      0.3288
                  Pillai's trace      0.0191      4.0000      239.0000       1.1607      0.3288
          Hotelling-Lawley trace      0.0194      4.0000      239.0000       1.1607      0.3288
             Roy's greatest root      0.0194      4.0000      239.0000       1.1607      0.3288
==========================================================================================
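The formula above treats the categorical columns as the dependent block, which makes the intercept statistics hard to interpret. A more conventional formulation (a sketch, not part of the original analysis) puts the continuous measures on the left-hand side and a factor on the right:

from statsmodels.multivariate.manova import MANOVA

# tip and total_bill as the multivariate response, sex as the grouping factor
mnova_alt = MANOVA.from_formula("tip + total_bill ~ sex", data=tips)
print(mnova_alt.mv_test())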

1.4.1 Saving MANOVA data in a dataframe

We can save the two tables above into separate dataframes as shown in the code below.


# Saving our dataframe tip results 
x = pd.DataFrame((mnova.mv_test().results['tip']['stat']))
x.head()
# Saving the intercept 
                           Value Num DF Den DF  F Value    Pr > F
Wilks' lambda           0.980944      4  239.0  1.16074  0.328829
Pillai's trace          0.019056    4.0  239.0  1.16074  0.328829
Hotelling-Lawley trace  0.019427      4  239.0  1.16074  0.328829
Roy's greatest root     0.019427      4    239  1.16074  0.328829
x = pd.DataFrame((mnova.mv_test().results['Intercept']['stat']))
x.head()
# Saving only the values, same as above. 
#x = pd.DataFrame((mnova.mv_test().results[]))
#x.head()
                                        Value  ... Pr > F
Wilks' lambda                             0.0  ...    0.0
Pillai's trace                            1.0  ...    0.0
Hotelling-Lawley trace  169947155749829.03125  ...    0.0
Roy's greatest root     169947155749829.03125  ...    0.0

[4 rows x 5 columns]
Note

This is a fairly brief analysis, but hopefully the document provides a good overview of how to combine ANOVA and Tukey tests in R and Python!
