Odoo’s 3rd project. ANOVA

odoo_team

13/03/2019

~ Introduction

Hello, everybody!

Briefly about us (mostly the same as before, but still).

Our team-members and their responsibilities:

Our data:

As usual, here are all the packages that we need. Let’s load them (installing them first where necessary):

library(foreign)
library(dplyr)
library(magrittr)
library(lubridate)
library(ggplot2)
library(psych)
library(knitr)
library(lsr)
library(vcd)
library(sjPlot)
library(corrplot)
library(RColorBrewer)

Intro to the topic, hypothesis

In the context of our topic of study, i.e. subjective well-being in the Netherlands, we would like to test a new idea: does the total number of hours a person works per week (overtime included) vary with how much time that person devotes to social activities compared to others of the same age? For instance, we assume that a person who spends a lot of time on social meetings or on other social activities that matter to him/her may either skip part of the formally assigned working time and make up these hours later, or may simply have little work in his/her life, as would be the case with a part-time job or a full-time student.

To start, we conduct a one-way analysis of variance (ANOVA) and check whether mean working hours differ significantly across the groups of the variable that indicates participation in social activities compared to others of the same age. If the means do not differ, our theory is not supported; if they do, the variation in means would support it.

Thus, the hypotheses would be the following:
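In the usual one-way ANOVA form they can be sketched as below. The notation is introduced here for illustration only: we assume k social-activity groups, with \mu_i standing for the mean number of weekly working hours in group i.

```latex
H_0:\ \mu_1 = \mu_2 = \dots = \mu_k
\qquad \text{vs.} \qquad
H_1:\ \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j
```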

Dataset

Loading our dataset:

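# Read the ESS file for the Netherlands, keeping value labels as factor levels
ESS <- read.spss("ESS8NL.sav", use.value.labels = TRUE, to.data.frame = TRUE)
# Keep only working hours and relative social activity, then drop missing values
NL <- ESS %>% select(wkhtot, sclact)
NL <- na.omit(NL)
# Convert wkhtot to numeric via character so that the actual hours are kept
whours <- as.numeric(as.character(NL$wkhtot))
NL <- cbind(NL, whours)
dim(NL)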
## [1] 1580    3

There are 1580 observations of 3 variables: the two we took from the full ESS dataset and one new variable, whours. We had to convert working hours to a numeric (continuous) type to work with it later.
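Why the conversion goes through as.character() can be illustrated with a small sketch. The factor below is made up for the example and is not taken from the ESS file:

```r
# Toy factor standing in for a labelled variable (the values are made up).
hours_factor <- factor(c("40", "36", "8", "40"))

# as.numeric() on a factor returns the internal level codes, not the values.
codes <- as.numeric(hours_factor)

# Going through as.character() first recovers the numbers shown in the labels.
hours <- as.numeric(as.character(hours_factor))

print(codes)
print(hours)
```

Calling as.numeric() directly on a factor would silently replace the hours with level codes, which is easy to miss because the result is still numeric.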

Checking normality of working hours’ distribution

ggplot() + 
  geom_histogram(data = NL, aes(x = whours), binwidth = 8, fill = "pink", col= "black", alpha = 0.7) +
  ggtitle("Total hours normally worked per week in main job") +
  xlab("Hours") + 
  ylab("Count") +
  theme_bw() 

summary(whours)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   24.00   36.00   34.37   42.00  168.00

The distribution is close to normal, although not perfectly so. It is slightly skewed to the left (the mean, 34.37, is below the median, 36) and peaks at around 40 hours, which is probably the most common length of a working week.

Social activity levels in our data:

Let’s take a look at the “sclact” variable. As the barplot below shows, the number of observations varies too much between the groups we want to examine. That may affect the F-ratio, so we have to do something about it.

par(mar=c(4, 10, 2, 10))
barplot(table(NL$sclact)/nrow(NL)*100, horiz = TRUE, las = 1,  xlim = c(0, 50))

Reorganising levels of social activity

One way to balance the group sizes is to merge the smallest groups into the bigger ones, and that is what we are going to do. With the code below we create three new groups, two of which combine several levels of the “sclact” variable: “More often than others” (“Much more than most” and “More than most”), “About the same” (unchanged) and “Less often than others” (“Less than most” and “Much less than most”).

We also store the result as a factor named sclact; its levels appear in alphabetical order in the outputs below.

NL$sclactEd <- rep(NA, length(NL$sclact))
NL$sclactEd[NL$sclact == "Much more than most" | 
            NL$sclact == "More than most"] <- "More often than others"
NL$sclactEd[NL$sclact == "About the same"] <- "About the same"
NL$sclactEd[NL$sclact == "Less than most" |
            NL$sclact == "Much less than most"] <- "Less often than others"
sclact <- as.factor(NL$sclactEd)
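If a substantive order of the levels (from less to more social activity) is preferred over the alphabetical one, it can be set explicitly. This is only a sketch on a made-up vector; the outputs in this report keep the alphabetical order produced by as.factor():

```r
# Toy vector with the three merged labels used in this report.
activity <- c("About the same", "More often than others",
              "Less often than others", "About the same")

# as.factor() sorts the levels alphabetically ...
alphabetical <- as.factor(activity)

# ... while factor(levels = ...) lets us fix the order ourselves.
activity_ordered <- factor(activity,
                           levels = c("Less often than others",
                                      "About the same",
                                      "More often than others"))

levels(alphabetical)
levels(activity_ordered)
```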

Descriptive statistics with describeBy

Now, let’s look at the descriptive statistics across our new groups. The group sizes are now more comparable (687, 548 and 345 observations). The skewness values of the three groups (0.21, 0.20 and 1.43) do not exceed 2, which hints that the distribution of our continuous variable “whours” is acceptably close to normal within each group.

describeBy(whours, sclact, mat = TRUE)
##     item                 group1 vars   n     mean       sd median  trimmed
## X11    1         About the same    1 687 32.70888 15.05402     34 32.48457
## X12    2 Less often than others    1 548 35.04197 16.45980     38 34.86364
## X13    3 More often than others    1 345 36.60000 17.15209     38 36.14801
##         mad min max range      skew   kurtosis        se
## X11 14.8260   0  90    90 0.2087559 0.02880948 0.5743468
## X12 14.8260   0 100   100 0.2027204 0.43345238 0.7031279
## X13 13.3434   0 168   168 1.4252582 9.56287314 0.9234375

Boxplots

boxplot(whours ~ sclact, ylab = "Working hours in total", xlab = "Social activity compared to others of same age", col = c("pink", "mistyrose", "palevioletred3"))

Here we can see some differences in total working hours between people who participate in social activities with different frequency (the darker the colour, the higher the frequency). The median number of working hours for people in the “About the same” category is lower than for people in the “Less often than others” and “More often than others” categories. However, we cannot be sure that this difference is significant without a test. There are also some outliers.

Final preparations for ANOVA - Checking homogeneity of variance

We are now ready to proceed to the one-way analysis of variance (ANOVA). However, ANOVA rests on several assumptions whose violation can lead to unreliable results, so we need some additional checks before running the test. While the independence of observations is implied, homogeneity of variances across the groups and normality of residuals still have to be checked. Firstly, let’s inspect the variances. To do that, we apply Levene’s test.

Variance homogeneity test

library(car)
leveneTest(whours ~ sclact)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    2  0.5173 0.5962
##       1577

It returns a p-value of 0.596, which is greater than the alpha level of 0.05, so there is no significant difference in variances between the groups. Based on this, we treat the variances as homogeneous and set var.equal = T in the call to oneway.test.
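To make clear what Levene’s test with center = median does, here is a minimal sketch in base R. It uses simulated toy data (not the ESS sample), and the object names are ours. The test amounts to a one-way ANOVA on the absolute deviations of each observation from its group median:

```r
set.seed(1)  # toy data, not the ESS sample
y <- c(rnorm(30, mean = 33, sd = 15),
       rnorm(30, mean = 35, sd = 16),
       rnorm(30, mean = 37, sd = 17))
g <- factor(rep(c("A", "B", "C"), each = 30))

# Absolute deviation of every observation from its own group median.
abs_dev <- abs(y - ave(y, g, FUN = median))

# Levene's test (center = median) is a one-way ANOVA on these deviations.
levene_table <- anova(lm(abs_dev ~ g))
print(levene_table)
```

A large F value in this table would mean that the spread around the median differs between groups, i.e. that the variances are not homogeneous.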

ANOVA

Now, let’s conduct the ANOVA test itself. We run it twice: with oneway.test and with aov, whose summary shows the full ANOVA table. Both report the same F-ratio and p-value.

oneway.test(whours ~ sclact, var.equal = T) 
## 
##  One-way analysis of means
## 
## data:  whours and sclact
## F = 7.5152, num df = 2, denom df = 1577, p-value = 0.0005645
aov.out <- aov(whours ~  sclact) 
summary(aov.out)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## sclact         2   3859  1929.4   7.515 0.000564 ***
## Residuals   1577 404863   256.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Thus, the F-ratio is equal to 7.515, which is larger than the critical F-value at the 0.05 level for 2 and 1577 degrees of freedom (approximately 3.0), and the p-value is 0.0006, which is smaller than the alpha level (0.05). This allows us to reject the null hypothesis and conclude that the group means differ because of true differences between the population means.
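The comparison with the critical value can be reproduced from the ANOVA table alone. The sketch below copies the sums of squares and degrees of freedom from the summary(aov.out) output; the variable names are ours:

```r
# Values copied from the summary(aov.out) table above.
ss_between <- 3859
df_between <- 2
ss_within  <- 404863
df_within  <- 1577

# F-ratio = mean square between groups / mean square within groups.
f_ratio <- (ss_between / df_between) / (ss_within / df_within)

# Critical F-value at alpha = 0.05 and the p-value of the observed F-ratio.
f_crit  <- qf(0.95, df_between, df_within)
p_value <- pf(f_ratio, df_between, df_within, lower.tail = FALSE)

print(c(F = f_ratio, F_crit = f_crit, p = p_value))
```

Because the table values are rounded, the F-ratio computed this way may differ from the reported one in the last digits.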

Checking normality of residuals

layout(matrix(1:4, 2, 2))
plot(aov.out)

In the upper graphs (residuals vs fitted values and scale-location) the red lines are almost straight, and the points on the Q-Q plot, though not ideal, still roughly follow a straight line.
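A visual check can be complemented by a formal test on the residuals. The following is only a sketch on simulated toy data (not our model), using the Shapiro-Wilk test from base R, which we did not run on the ESS data:

```r
set.seed(2)  # toy data, not the ESS sample
toy <- data.frame(
  hours = c(rnorm(40, 33, 15), rnorm(40, 35, 15), rnorm(40, 37, 15)),
  group = factor(rep(c("A", "B", "C"), each = 40))
)

# Fit a one-way ANOVA and extract its residuals.
toy_fit <- aov(hours ~ group, data = toy)
res <- residuals(toy_fit)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
sw <- shapiro.test(res)
print(sw$p.value)
```

With samples as large as ours, such a test tends to flag even small departures from normality, so it is best read together with the Q-Q plot.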

Double-check with a non-parametric test

Kruskal-Wallis test

kruskal.test(wkhtot ~ sclact, data = NL)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  wkhtot by sclact
## Kruskal-Wallis chi-squared = 18.94, df = 4, p-value = 0.0008075

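With a Kruskal-Wallis chi-squared of 18.94 and a p-value of 0.0008, the mean ranks of the groups are not all the same. Note that, because of data = NL, this call uses the original “wkhtot” and “sclact” columns of NL, i.e. the five original levels of social activity (hence df = 4) rather than the three merged groups. The same test on the three merged groups is reported at the top of the Dunn’s test output below (chi-squared = 14.79, df = 2). Either way, the result confirms the ANOVA result.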

Dunn’s post hoc test

As the Kruskal-Wallis test is significant, we can also run the non-parametric Dunn’s post hoc test:

library(dunn.test)
dunn.test(whours, sclact, kw=TRUE)
##   Kruskal-Wallis rank sum test
## 
## data: whours and sclact
## Kruskal-Wallis chi-squared = 14.7901, df = 2, p-value = 0
## 
## 
##                         Comparison of whours by sclact                         
##                                 (No adjustment)                                
## Col Mean-|
## Row Mean |   About th   Less oft
## ---------+----------------------
## Less oft |  -2.737675
##          |    0.0031*
##          |
## More oft |  -3.547013  -1.124059
##          |    0.0002*     0.1305
## 
## alpha = 0.05
## Reject Ho if p <= alpha/2

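These results show that two of the three pairs of groups differ significantly: “About the same” differs from both “Less often than others” (p = 0.0031) and “More often than others” (p = 0.0002), while the “Less often” and “More often” groups do not differ significantly from each other (p = 0.1305). Note that these p-values are not adjusted for multiple comparisons.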
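Since the output says “No adjustment”, it is worth checking whether the conclusions survive a correction for multiple comparisons. The sketch below applies a Bonferroni correction to the three p-values copied from the output, using p.adjust from base R. Choosing Bonferroni is our assumption here; other methods are available:

```r
# Unadjusted pairwise p-values copied from the dunn.test output above.
p_raw <- c("Less vs About the same" = 0.0031,
           "More vs About the same" = 0.0002,
           "More vs Less"           = 0.1305)

# Bonferroni multiplies each p-value by the number of comparisons (capped at 1).
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# dunn.test reports one-sided p-values, hence the alpha/2 rejection rule.
reject <- p_bonf <= 0.05 / 2

print(p_bonf)
print(reject)
```

The same two pairs remain significant after the correction.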

Application and interpretation of post hoc test

As the F-ratio is significant, we can apply a post hoc test to check which groups contribute to the statistical significance of this test. In our case we use Tukey’s post hoc test because the variances are equal across all three groups (as we checked earlier).

TukeyHSD(aov.out)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = whours ~ sclact)
## 
## $sclact
##                                                   diff        lwr      upr
## Less often than others-About the same         2.333092  0.1802306 4.485953
## More often than others-About the same         3.891121  1.4108272 6.371414
## More often than others-Less often than others 1.558029 -1.0252840 4.141342
##                                                   p adj
## Less often than others-About the same         0.0298638
## More often than others-About the same         0.0007040
## More often than others-Less often than others 0.3334238

According to the Tukey ‘Honestly Significant Differences’ test, at the 95% family-wise confidence level there is a significant difference in mean working hours between those who devote more time to social activities than others of the same age and those who devote about the same time (adjusted p-value = 0.0007). There is also a significant difference in mean working hours between people who devote less time to social activities than others of the same age and those who devote about the same amount of time, although the evidence here is weaker (adjusted p-value = 0.0299). The difference between the remaining pair of groups, “More often than others” and “Less often than others”, is not significant (adjusted p-value = 0.3334).

We can plot this:

par(cex.axis=0.8)
par(mar=c(4, 15, 2, 0))
plot(TukeyHSD(aov.out), las = 1)

The plot illustrates the conclusions above. If a confidence interval does not contain 0, there is evidence that the two groups differ. So the pairs of groups whose intervals do not cross the dotted line are significantly different.
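The same reading of the plot can be done numerically. As a small sketch, we copy the interval bounds from the TukeyHSD output into a data frame (the column names and shortened pair labels are ours) and flag the intervals that lie entirely on one side of zero:

```r
# Interval bounds copied from the TukeyHSD output above.
tukey <- data.frame(
  pair = c("Less often - About the same",
           "More often - About the same",
           "More often - Less often"),
  lwr  = c(0.1802306, 1.4108272, -1.0252840),
  upr  = c(4.485953, 6.371414, 4.141342)
)

# A pair differs significantly when its interval does not contain 0.
tukey$excludes_zero <- tukey$lwr > 0 | tukey$upr < 0

print(tukey)
```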

~ Thank you for your attention!