How to combine ANOVA and Tukey Tests in R and Python
Author
Shah Nawaz
0.1 Introduction
This is a file with all the assignments of the course Machine learning course and it will include following topics
Statistics Model and their implementation
Supervised and Unsupervised Machine learning implementation on Real world data
Theory of the above topics
1 General Steps in all chapters
1.1 Importing libraries
First of all we importing the libraries needed in each section in the code chunk below.
# Data Wrangling modulesimport pandas as pd import numpy as np # Libraries with datasets and visulationsimport seaborn as sns
/opt/anaconda3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.5
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
import matplotlib.pyplot as pltimport bokeh as bkimport plotly.express as px# Statistics Modulesimport scipy
After words we move towards the importing of data set in the @data-importing section.
tips = sns.load_dataset("tips")tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
tips=py$tips# Box plot for the categorical and continuous variableggplot(tips)+aes(x =sex, y =tip, fill =time)+geom_boxplot()+scale_fill_hue(direction =1)
Figure 1: Boxplot of tips for each sex at different times
1.2 ANOVA Tests
1.2.1 1 way ANOVA in python
This type of ANOVA test involves 2 dependent variables which are linked to 1 independent variable.
# dropping rows with `nan` from the dataframetips<-na.omit(tips)# getting a glimpse of dataframehead(tips)
total_bill tip sex smoker day time size
1 16.99 1.01 Female No Sun Dinner 2
2 10.34 1.66 Male No Sun Dinner 3
3 21.01 3.50 Male No Sun Dinner 3
4 23.68 3.31 Male No Sun Dinner 2
5 24.59 3.61 Female No Sun Dinner 4
6 25.29 4.71 Male No Sun Dinner 4
# renaming all column names to small letterstips<-tips%>%rename_all(tolower)
Shapiro-Wilk normality test
data: tips$tip
W = 0.89781, p-value = 8.2e-12
The tip column in the tips dataframe is normally distributed since p-value for shapiro-wilk test is < \(\alpha\). We can do the normaity check for other variables for 1-way ANOVA test.
Shapiro-Wilk normality test
data: tips$total_bill
W = 0.91972, p-value = 3.325e-10
The test can be done by using models given in python code chunk below
import statsmodels.api as sm from statsmodels.formula.api import olsfrom statsmodels.stats.anova import anova_lmfrom statsmodels.sandbox.stats.multicomp import MultiComparisonone_way_anova = ols('tip ~ sex',data=tips).fit()sm.stats.anova_lm(one_way_anova,type=2)
df sum_sq mean_sq F PR(>F)
sex 1.0 3.673534 3.673534 1.926155 0.166456
Residual 242.0 461.538943 1.907186 NaN NaN
1.2.2 1 way ANOVA in R
We are using the same variables as above and running the 1 way ANOVA in R below.
# As compared to python we first fit the model and then# apply ANOVA belowLM=lm(tip~sex, data =tips)# summary(LM)OWA=aov(tip~sex, data =tips)summary(OWA)
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 3.7 3.674 1.926 0.166
Residuals 242 461.5 1.907
1.2.3 2 way ANOVA in python
In this test we are going to check how the means of individuals variables effect each other. In the below example we are checking how much the individual and combined effect of sex and smoker are impacting the tip variable. Other parameter of fitting the regression are same as 1 way ANOVA given in section Section 1.2.1.
df sum_sq mean_sq F PR(>F)
C(sex) 1.0 3.673534 3.673534 1.912950 0.167921
C(smoker) 1.0 0.015000 0.015000 0.007811 0.929648
C(sex):C(smoker) 1.0 0.639891 0.639891 0.333216 0.564313
Residual 240.0 460.884051 1.920350 NaN NaN
Our results of two way ANOVA tells us that there is no significant either individual or combined effect of variables sex and smoker on varible tip since the p-value in each case is more than 0.05.
1.2.4 2 way ANOVA in R
TWA<-aov(tip~sex+smoker+sex:smoker, data =tips)summary(TWA)
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 3.7 3.674 1.913 0.168
smoker 1 0.0 0.015 0.008 0.930
sex:smoker 1 0.6 0.640 0.333 0.564
Residuals 240 460.9 1.920
Another way of doing 2 way ANOVA test is by using pingoin module.
import pingouin as pgtwo_way_anova_pg = pg.anova(data=tips,dv='tip',between=['sex','smoker'],detailed=True)print(two_way_anova_pg)
Source SS DF MS F p-unc np2
0 sex 3.672183 1.0 3.672183 1.912247 0.167999 0.007905
1 smoker 0.015000 1.0 0.015000 0.007811 0.929648 0.000033
2 sex * smoker 0.639891 1.0 0.639891 0.333216 0.564313 0.001386
3 Residual 460.884051 240.0 1.920350 NaN NaN NaN
Although our p-value is more than 0.05 we will still do the tukey tests for the sake of practice below.
1.3 Tukey Tests
1.3.1 Tukey test - 1 way ANOVA in python
The below section will describe the post hoc analysis of tukey test on 1 way ANOVA performed in Section 1.2.1. Post hoc analysis tests such Anderson Darling test and tukey test show us the how much the individual means are significantly different from the mean of target variable. In our cases the target continuous variable is tip.
/opt/anaconda3/lib/python3.9/site-packages/pingouin/parametric.py:941: FutureWarning:
Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
To preserve the previous behavior, use
>>> .groupby(..., group_keys=False)
To adopt the future behavior and silence this warning, use
>>> .groupby(..., group_keys=True)
print(pt)
A B mean(A) mean(B) ... se T p-tukey hedges
0 Female Male 2.833448 3.089618 ... 0.184579 -1.38786 0.166456 -0.184919
[1 rows x 9 columns]
Another way of doing tukey test for 1 way ANOVA is given below
import statsmodels.stats.multicomp as smmone_tukey_smm = smm.pairwise_tukeyhsd(endog=tips['tip'],groups=tips['sex'],alpha=0.05)print(one_tukey_smm)
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff p-adj lower upper reject
---------------------------------------------------
Female Male 0.2562 0.1665 -0.1074 0.6198 False
---------------------------------------------------
rows = one_tukey_smm.summary().data[1:]plt.hlines( range(len(rows)), [row[4] for row in rows], [row[5] for row in rows] )plt.vlines( 0, -1, len( rows )-1, linestyles='dashed' )plt.gca().set_yticks( range( len( rows ) ) )plt.gca().set_yticklabels( [ f'{x[0]}-{x[1]}'for x in rows ] )plt.show()
1.3.2 Tukey test - 1 way ANOVA in R
In the code we are going to do the tukey test after 1 way ANOVA in R.
library(agricolae)# Tukey Test 1 way ANOVATukey=HSD.test(OWA, "sex", group =TRUE, alpha =0.05)Tukey
$statistics
MSerror Df Mean CV
1.907186 242 2.998279 46.06006
$parameters
test name.t ntr StudentizedRange alpha
Tukey sex 2 2.785739 0.05
$means
tip std r Min Max Q25 Q50 Q75
Female 2.833448 1.159495 87 1 6.5 2 2.75 3.50
Male 3.089618 1.489102 157 1 10.0 2 3.00 3.76
$comparison
NULL
$groups
tip groups
Male 3.089618 a
Female 2.833448 a
attr(,"class")
[1] "group"
# Tuckey test representation :plot(Tukey, las =1, col ="brown")
$statistics
MSerror Df Mean CV
1.907186 242 2.998279 46.06006
$parameters
test name.t ntr StudentizedRange alpha
Tukey sex 2 2.785739 0.05
$means
tip std r Min Max Q25 Q50 Q75
Female 2.833448 1.159495 87 1 6.5 2 2.75 3.50
Male 3.089618 1.489102 157 1 10.0 2 3.00 3.76
$comparison
NULL
$groups
tip groups
Male 3.089618 a
Female 2.833448 a
attr(,"class")
[1] "group"
1.3.3 Tukey test - 2 way ANOVA in python
from statsmodels.stats.multicomp import pairwise_tukeyhsd# = tips['sex'] + " / " + tips['time']# perform multiple pairwise comparison (Tukey HSD)#m_comp = pairwise_tukeyhsd(endog=anova_df['volume'], groups=anova_df['combination'], alpha=0.05)tips['combination']= tips.loc[:,['sex']]two_tukey_smm =smm.pairwise_tukeyhsd(endog=tips['tip'],groups=tips['combination'],alpha=0.05)#print(one_tukey_smm)# coerce the tukeyhsd table to a DataFrametukey_data = pd.DataFrame(data=two_tukey_smm._results_table.data[1:], columns = two_tukey_smm._results_table.data[0])group1_comp =tukey_data.loc[tukey_data.reject ==True].groupby('group1').reject.count()group2_comp = tukey_data.loc[tukey_data.reject ==True].groupby('group2').reject.count()tukey_data = pd.concat([group1_comp, group2_comp], axis=1)tukey_data = tukey_data.fillna(0)tukey_data.columns = ['reject1', 'reject2']tukey_data['total_sum'] = tukey_data.reject1 + tukey_data.reject2# just show the top 20 resultstukey_data.sort_values('total_sum',ascending=False).head(20)
two_tukeyr<-HSD.test(TWA, c("sex", "smoker"), main ="tip~sex+smoker+sex:smoker", console =TRUE)
Study: tip~sex+smoker+sex:smoker
HSD Test for tip
Mean Square Error: 1.92035
sex:smoker, means
tip std r Min Max
Female:No 2.773519 1.128425 54 1.00 5.2
Female:Yes 2.931515 1.219916 33 1.00 6.5
Male:No 3.113402 1.489559 97 1.25 9.0
Male:Yes 3.051167 1.500120 60 1.00 10.0
Alpha: 0.05 ; DF Error: 240
Critical Value of Studentized Range: 3.658742
Groups according to probability of means differences and alpha level( 0.05 )
Treatments with the same letter are not significantly different.
tip groups
Male:No 3.113402 a
Male:Yes 3.051167 a
Female:Yes 2.931515 a
Female:No 2.773519 a
two_tukeyr
$statistics
MSerror Df Mean CV
1.92035 240 2.998279 46.21875
$parameters
test name.t ntr StudentizedRange alpha
Tukey sex:smoker 4 3.658742 0.05
$means
tip std r Min Max Q25 Q50 Q75
Female:No 2.773519 1.128425 54 1.00 5.2 2 2.68 3.4375
Female:Yes 2.931515 1.219916 33 1.00 6.5 2 2.88 3.5000
Male:No 3.113402 1.489559 97 1.25 9.0 2 2.74 3.7100
Male:Yes 3.051167 1.500120 60 1.00 10.0 2 3.00 3.8200
$comparison
NULL
$groups
tip groups
Male:No 3.113402 a
Male:Yes 3.051167 a
Female:Yes 2.931515 a
Female:No 2.773519 a
attr(,"class")
[1] "group"
In the below code we are going to save the results of section Section 1.3.1 as a dataframe. Same procedure can be followed for other tabular results from other tests.
In the significance representation of the ANOVA test or after tukey test we use special annotations named ‘a,b,c’ etc . If the group shares the letter a in the represenation it means they are highly signficantly different from the target variable as compared to the group with letter b or c representation. The number of letter depend upon the groups. In the above result of 2-way ANOVA tukey test we can observe that all groups under variable sex are sharing 1 letter a which denotes that they are not significantly different from each other. If we had one letter b in Female:Yes group we would have predicted that the other 3 groups of Male:No, Male:Yes and Female:No are significantly different from the Female:Yes group.
1.4 Manova test
‘Manova’ stands for multivariate analysis of variance. In this test we tend to look into the details of those variables which
from statsmodels.multivariate.manova import MANOVAmnova = MANOVA.from_formula("smoker+sex+time~ tip",data=tips)print(mnova.mv_test())
Multivariate linear model
==========================================================================================
------------------------------------------------------------------------------------------
Intercept Value Num DF Den DF F Value Pr > F
------------------------------------------------------------------------------------------
Wilks' lambda 0.0000 4.0000 239.0000 10154342556052284.0000 0.0000
Pillai's trace 1.0000 4.0000 239.0000 10154342556052286.0000 0.0000
Hotelling-Lawley trace 169947155749829.0312 4.0000 239.0000 10154342556052284.0000 0.0000
Roy's greatest root 169947155749829.0312 4.0000 239.0000 10154342556052284.0000 0.0000
------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
tip Value Num DF Den DF F Value Pr > F
-----------------------------------------------------------------------------------------------
Wilks' lambda 0.9809 4.0000 239.0000 1.1607 0.3288
Pillai's trace 0.0191 4.0000 239.0000 1.1607 0.3288
Hotelling-Lawley trace 0.0194 4.0000 239.0000 1.1607 0.3288
Roy's greatest root 0.0194 4.0000 239.0000 1.1607 0.3288
==========================================================================================
1.4.1 Saving MANOVA data in a dataframe
we can save the two tables above into separate dataframes as shown in code below.
# Saving our datframe tip results x = pd.DataFrame((mnova.mv_test().results['tip']['stat']))x.head()# Saving the intercept
Value Num DF Den DF F Value Pr > F
Wilks' lambda 0.980944 4 239.0 1.16074 0.328829
Pillai's trace 0.019056 4.0 239.0 1.16074 0.328829
Hotelling-Lawley trace 0.019427 4 239.0 1.16074 0.328829
Roy's greatest root 0.019427 4 239 1.16074 0.328829
x = pd.DataFrame((mnova.mv_test().results['Intercept']['stat']))x.head()# Saving only vavlues same as above. #x = pd.DataFrame((mnova.mv_test().results[]))#x.head()