odoo_team
13/03/2019
Hello, everybody!
Briefly about us (mostly the same as before, but still).
Our team-members and their responsibilities:
Our data:
As usual, here are all the packages that we need. Let’s load them (installing them first if necessary):
library(foreign)
library(dplyr)
library(magrittr)
library(lubridate)
library(ggplot2)
library(psych)
library(knitr)
library(lsr)
library(vcd)
library(sjPlot)
library(corrplot)
library(RColorBrewer)
In the context of our topic of study, i.e. subjective well-being in the Netherlands, we would like to test a new theory. What if the variation in individuals’ total working hours (overtime included) is related to how much time they devote to social activities compared to others of the same age? For instance, we assume that if a person spends much time on social meetings or on any kind of social activity that is important to them, they may either skip part of their formally assigned working time and make these hours up later, or work may simply take up little of their life, as in the case of a full-time student with a part-time job.
To start, we can conduct a one-way analysis of variance (ANOVA) and check whether mean working hours are the same across the groups of the variable indicating participation in social activities compared to others of the same age, or whether the means differ significantly, which would support our theory.
Thus, the hypotheses would be the following:
Loading our dataset:
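# Read the ESS file for the Netherlands, keeping SPSS value labels
ESS <- read.spss("ESS8NL.sav", use.value.labels = TRUE, to.data.frame = TRUE)
# Keep total weekly working hours and relative social activity, drop missing values
NL <- ESS %>% select(wkhtot, sclact)
NL <- na.omit(NL)
# Convert working hours from labels to numbers and attach them as a new column
whours <- as.numeric(as.character(NL$wkhtot))
NL <- cbind(NL, whours)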
dim(NL)
## [1] 1580 3
There are 1580 observations of 3 variables: the two we took from the full ESS dataset and one new variable, whours. We had to convert working hours to a continuous (numeric) type to work with them later.
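The conversion goes through as.character() because, judging by the code above, wkhtot is read as a factor, and as.numeric() applied directly to a factor returns the internal level codes rather than the recorded values. A minimal sketch on made-up values (the vector hours_f is hypothetical and not taken from the ESS file):

```r
# Hypothetical factor of working hours, standing in for NL$wkhtot
hours_f <- factor(c("40", "36", "8", "40"))

# Direct conversion returns the level codes (levels sort as "36", "40", "8")
codes <- as.numeric(hours_f)

# Converting via character recovers the recorded values
hours <- as.numeric(as.character(hours_f))

print(codes)
print(hours)
```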
ggplot() +
  geom_histogram(data = NL, aes(x = whours), binwidth = 8, fill = "pink", col = "black", alpha = 0.7) +
  ggtitle("Total hours normally worked per week in main job") +
  xlab("Hours") +
  ylab("Count") +
  theme_bw()
summary(whours)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00   24.00   36.00   34.37   42.00  168.00
The distribution is close to normal, although not perfectly so. It is slightly skewed to the left and peaks at about 40 hours, probably because this is a very common length of the working week.
One way to balance the group sizes is to merge the smallest groups into bigger ones, and that is what we are going to do. With the code below we create three new groups, two of which combine several levels of the “sclact” variable. Thus, we have created the following groups:
Also, we established the order of levels for the “sclact” factor.
NL$sclactEd <- rep(NA, length(NL$sclact))
NL$sclactEd[NL$sclact == "Much more than most" |
              NL$sclact == "More than most"] <- "More often than others"
NL$sclactEd[NL$sclact == "About the same"] <- "About the same"
NL$sclactEd[NL$sclact == "Less than most" |
              NL$sclact == "Much less than most"] <- "Less often than others"
sclact <- as.factor(NL$sclactEd)
Now, let’s look at the descriptive values across our new groups. The group sizes are more or less comparable now. At the same time, the skewness values in the three groups do not exceed 2, which hints that our continuous variable “whours” is approximately normally distributed.
describeBy(whours, sclact, mat = TRUE)
##     item                 group1 vars   n     mean       sd median  trimmed
## X11    1         About the same    1 687 32.70888 15.05402     34 32.48457
## X12    2 Less often than others    1 548 35.04197 16.45980     38 34.86364
## X13    3 More often than others    1 345 36.60000 17.15209     38 36.14801
##         mad min max range      skew   kurtosis        se
## X11 14.8260   0  90    90 0.2087559 0.02880948 0.5743468
## X12 14.8260   0 100   100 0.2027204 0.43345238 0.7031279
## X13 13.3434   0 168   168 1.4252582 9.56287314 0.9234375
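In the output above the groups appear in alphabetical order, since as.factor() sorts levels alphabetically. If an order from less to more social activity is wanted, the levels can be set explicitly with factor(). A small sketch on made-up data (the vector x is hypothetical; the labels are the ones created above):

```r
# Hypothetical recoded answers, standing in for NL$sclactEd
x <- c("About the same", "More often than others",
       "Less often than others", "About the same")

# as.factor() orders levels alphabetically
print(levels(as.factor(x)))

# factor() with an explicit levels argument fixes the order from less to more
x_ord <- factor(x, levels = c("Less often than others",
                              "About the same",
                              "More often than others"))
print(levels(x_ord))
print(table(x_ord))
```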
boxplot(whours ~ sclact, ylab = "Working hours in total", xlab = "Social activity compared to others of same age", col = c("pink", "mistyrose", "palevioletred3"))
Here we can see some differences in total working hours between people who participate in social activities with different frequency (the darker the color, the more frequent the participation). The median number of working hours for people who describe their social activity as “about the same” is lower than for people in the “less often” or “more often than others” categories. However, we cannot be sure of the significance of this difference without a test. There are also some outliers.
Now we are ready to proceed to the one-way analysis of variance (ANOVA). However, ANOVA rests on several assumptions whose violation can lead to unreliable results, so we need some additional checks before running the test. While the independence of observations is assumed, homogeneity of variances across the groups and normality of residuals are still to be checked. First, let’s inspect the variances. To do that, we apply Levene’s test.
Variances homogeneity test
library(car)
leveneTest(whours ~ sclact)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    2  0.5173 0.5962
##       1577
It returns a p-value of 0.596, which is greater than the alpha level of 0.05, so there is no significant difference in variances between the groups. Given this, we can treat the variances as homogeneous and set var.equal = T in the ANOVA test.
Now, let’s conduct the ANOVA test itself in two ways, with oneway.test() and with aov(). Both report the F-ratio and its p-value.
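To make the logic of the test more concrete: with center = median, Levene’s test amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. Below is a rough sketch of that idea in base R on simulated data (the data frame toy and its columns are invented for illustration and are not our ESS data); it is not the car implementation itself.

```r
set.seed(1)
# Simulated data: three groups with equal spread (hypothetical, not ESS)
toy <- data.frame(
  y = c(rnorm(50, 33, 15), rnorm(50, 35, 15), rnorm(50, 37, 15)),
  g = factor(rep(c("a", "b", "c"), each = 50))
)

# Absolute deviation of every observation from its own group median
toy$dev <- abs(toy$y - ave(toy$y, toy$g, FUN = median))

# One-way ANOVA on the deviations: its F and p-value play the role of Levene's test
lev <- anova(lm(dev ~ g, data = toy))
lev_F <- lev[["F value"]][1]
lev_p <- lev[["Pr(>F)"]][1]
print(c(F = lev_F, p = lev_p))
```

A large p-value here would be read the same way as in the output above: no evidence that the group variances differ.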
oneway.test(whours ~ sclact, var.equal = T)
##
## One-way analysis of means
##
## data: whours and sclact
## F = 7.5152, num df = 2, denom df = 1577, p-value = 0.0005645
aov.out <- aov(whours ~ sclact)
summary(aov.out)
##               Df Sum Sq Mean Sq F value   Pr(>F)
## sclact         2   3859  1929.4   7.515 0.000564 ***
## Residuals   1577 404863   256.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Thus, the F-ratio is 7.515, which is bigger than the critical F-value for 2 and 1577 degrees of freedom, and the p-value is about 0.0006, which is smaller than the alpha level (0.05). This allows us to reject the null hypothesis and conclude that the group means differ because of true differences between the population means.
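The comparison with the critical value can be reproduced from the numbers in the ANOVA table alone. The sketch below copies the sums of squares and degrees of freedom from the summary(aov.out) output and uses the base R functions qf() and pf(); the variable names are our own.

```r
# Values copied from the summary(aov.out) table above
ss_between <- 3859;   df_between <- 2
ss_within  <- 404863; df_within  <- 1577

# F-ratio = mean square between groups / mean square within groups
f_ratio <- (ss_between / df_between) / (ss_within / df_within)

# Critical F at alpha = 0.05 and the p-value of the observed F
f_crit  <- qf(0.95, df_between, df_within)
p_value <- pf(f_ratio, df_between, df_within, lower.tail = FALSE)

print(c(F = f_ratio, F_crit = f_crit, p = p_value))
```

Because the sums of squares in the table are rounded, the recomputed F-ratio matches the reported 7.515 only up to rounding.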
layout(matrix(1:4, 2, 2))
plot(aov.out)
The red lines in the upper graphs are almost straight, and the Q-Q plot is not ideal but still roughly follows a straight line.
kruskal.test(wkhtot ~ sclact, data = NL)
##
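The visual check of the Q-Q plot could be complemented by a formal test on the residuals, for example Shapiro-Wilk. This is only a sketch on R’s built-in PlantGrowth data, used here as a stand-in; for our model the residuals would come from aov.out.

```r
# Stand-in model on a built-in dataset; in the report this would be aov.out
fit <- aov(weight ~ group, data = PlantGrowth)

# Shapiro-Wilk test of normality applied to the model residuals
sw <- shapiro.test(residuals(fit))
print(sw)

# A p-value above 0.05 gives no evidence against normality of the residuals
print(sw$p.value > 0.05)
```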
## Kruskal-Wallis rank sum test
##
## data: wkhtot by sclact
## Kruskal-Wallis chi-squared = 18.94, df = 4, p-value = 0.0008075
With a Kruskal-Wallis chi-squared of 18.94 and a p-value of 0.0008, we conclude that the mean ranks of the groups are not the same. Note that this call uses the original five-level “sclact” variable from NL, hence df = 4. This result is in line with the ANOVA results.
As the Kruskal-Wallis test is significant, we can also run a non-parametric Dunn’s post hoc test:
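The p-value reported above follows from comparing the Kruskal-Wallis statistic with a chi-squared distribution with the stated degrees of freedom. As a quick check using only the numbers printed in the output (variable names are our own):

```r
# Statistic and degrees of freedom copied from the kruskal.test() output above
kw_stat <- 18.94
kw_df   <- 4

# Upper-tail probability of the chi-squared distribution
kw_p <- pchisq(kw_stat, df = kw_df, lower.tail = FALSE)
print(kw_p)
```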
library(dunn.test)
dunn.test(whours, sclact, kw=TRUE)
## Kruskal-Wallis rank sum test
##
## data: whours and sclact
## Kruskal-Wallis chi-squared = 14.7901, df = 2, p-value = 0
##
##
## Comparison of whours by sclact
## (No adjustment)
## Col Mean-|
## Row Mean |   About th   Less oft
## ---------+----------------------
## Less oft |  -2.737675
##          |    0.0031*
##          |
## More oft |  -3.547013  -1.124059
##          |    0.0002*     0.1305
##
## alpha = 0.05
## Reject Ho if p <= alpha/2
These results show that two of the three pairs of groups differ significantly: “About the same” versus “Less often than others” (p = 0.0031) and “About the same” versus “More often than others” (p = 0.0002). The difference between “Less often than others” and “More often than others” is not significant (p = 0.1305).
As the F-ratio is significant, we can apply a post hoc test to check which pairs of groups drive this result. In our case we use Tukey’s post hoc test because the variances are equal across all three groups (as we checked earlier).
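The output above is labelled “No adjustment”, so the p-values are not corrected for multiple comparisons. As a rough sketch of how the conclusion could be checked, the one-sided p-values can be recovered from the z statistics and adjusted with base R’s p.adjust(). Bonferroni is picked here only as an example; it is not something the analysis above used, and the short pair names are our own.

```r
# z statistics copied from the dunn.test() output above
z <- c("Less-About" = -2.737675,
       "More-About" = -3.547013,
       "More-Less"  = -1.124059)

# One-sided p-values, as printed by dunn.test()
p_raw <- pnorm(-abs(z))

# Bonferroni adjustment for three comparisons (an illustrative choice)
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# dunn.test() rejects H0 when p <= alpha / 2
alpha <- 0.05
print(round(cbind(p_raw, p_bonf), 4))
print(p_bonf <= alpha / 2)
```

With these numbers the same two pairs remain significant after the adjustment.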
TukeyHSD(aov.out)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = whours ~ sclact)
##
## $sclact
##                                                   diff        lwr      upr
## Less often than others-About the same         2.333092  0.1802306 4.485953
## More often than others-About the same         3.891121  1.4108272 6.371414
## More often than others-Less often than others 1.558029 -1.0252840 4.141342
##                                                   p adj
## Less often than others-About the same         0.0298638
## More often than others-About the same         0.0007040
## More often than others-Less often than others 0.3334238
According to Tukey’s ‘Honestly Significant Differences’ test at the 95% family-wise confidence level, there is a significant difference in mean working hours between those who devote more time to social activities than others of the same age and those who devote about the same time (adjusted p-value = 0.0007040). There is also a significant difference in mean working hours between people who devote less time to social activities than others of the same age and those who devote about the same time, although here the evidence is weaker (adjusted p-value = 0.0298638). The difference between the remaining pair, those who devote more and those who devote less time than others, is not significant (adjusted p-value = 0.3334238).
We can plot this:
par(cex.axis=0.8)
par(mar=c(4, 15, 2, 0))
plot(TukeyHSD(aov.out), las = 1)
The plot illustrates the conclusions above. If a confidence interval does not contain 0, there is evidence that the two groups differ. So, the pairs of groups whose intervals do not cross the dotted line are significantly different.
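The same rule can be applied to the numbers instead of the plot. The sketch below uses R’s built-in PlantGrowth data purely as a stand-in (in our case the model would be aov.out and the component would be $sclact) and flags the pairs whose interval lies entirely on one side of zero.

```r
# Stand-in model on a built-in dataset; in the report this would be aov.out
fit <- aov(weight ~ group, data = PlantGrowth)
tk  <- TukeyHSD(fit)$group   # matrix with columns diff, lwr, upr, p adj

# A pair differs significantly when its interval lies entirely on one side of 0
excludes_zero <- tk[, "lwr"] > 0 | tk[, "upr"] < 0
print(cbind(round(tk, 3), excludes_zero))
```

For a 95% family-wise level this flag agrees with checking whether the adjusted p-value is below 0.05.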
Social activity levels in our data:
Let’s take a look at the “sclact” variable. As can be seen from the barplot below, the number of observations varies too much across the groups chosen for examination. That may affect the F-ratio, so we need to do something about it.