Hi! Our team is called Odoo (named after one business software, but that’s not so important). And here we are: Anna Moroz (our leader who dropkickes all of us (that’s useful!), Anna Podolskaya (always knits and checks everything before the submission), Anastasia Prokudina (the first alarmist, who calms all of us down) and Olga Chemodanova (data scientist + she googles faster than you).
So, here we would like to present you all the projects, that we have done during the course of Data Analysis :)
IN all our projects we work with the following data:
Let’s get started!
library(foreign)
library(dplyr)
library(lubridate)
library(ggplot2)
library(knitr)
Contribution:
Title <- "odoo"
Team <- c("Anna Podolskaya", "Anna Moroz", "Anastasia Prokudina", "Olga Chemodanova")
Team_Member <- sort(Team)
Contribution <- c("Construction of a scatterplot, a boxplot, a stacked barplot", "Construction of a histogram and a stacked barplot, variables' types description, arranging graphs' colours, labels", "Construction of a barplot, identifying CTM for the histogram with 'happy' variable, variables' type description, collecting all scripts in one and arranging the final layout", "Construction of a scatterplot and a boxplot, assistance to other members of the team with R troubles")
Individual_Contribution <- data.frame(Team_Member, Contribution)
kable(Individual_Contribution)
| Team_Member | Contribution |
|---|---|
| Anastasia Prokudina | Construction of a scatterplot, a boxplot, a stacked barplot |
| Anna Moroz | Construction of a histogram and a stacked barplot, variables’ types description, arranging graphs’ colours, labels |
| Anna Podolskaya | Construction of a barplot, identifying CTM for the histogram with ‘happy’ variable, variables’ type description, collecting all scripts in one and arranging the final layout |
| Olga Chemodanova | Construction of a scatterplot and a boxplot, assistance to other members of the team with R troubles |
Loading our data:
ESS <- read.spss("ESS8NL.sav", use.value.labels=T, to.data.frame=T)
NL <- ESS %>% select (gndr, agea, happy, sclmeet, inprdsc, crmvct, aesfdrk, health)
NL = na.omit(NL)
NL$happy <- as.numeric(NL$happy)
NL$health <- as.factor(NL$health)
NL$gndr <- as.factor(NL$gndr)
NL$agea <- as.numeric(NL$agea)
Variable_name <- c("gndr: Gender", "agea: Age of a respondent", "happy: How happy are you", "sclmeet: How often socially meet with friends, relatives or colleagues", "inprdsc: How many people with whom you can discuss intimate and personal matters", "crmvct: Respondent or household member victim of burglary/assault last 5 years", "aesfdrk: Feeling of safety of walking alone in local area after dark", "health: Subjective general health")
Level_Of_Measurement <- c("Nominal", "Ratio", "Nominal -> Interval", "Ordinal", "Ordinal", "Nominal", "Ordinal", "Ordinal")
Continious_Or_Discrete <- c("None", "Continious", "None -> Continious", "None", "None", "None", "None", "None")
df0 <- data.frame(Variable_name, Level_Of_Measurement, Continious_Or_Discrete, stringsAsFactors = FALSE, check.rows = TRUE, check.names = TRUE)
kable(df0)
| Variable_name | Level_Of_Measurement | Continious_Or_Discrete |
|---|---|---|
| gndr: Gender | Nominal | None |
| agea: Age of a respondent | Ratio | Continious |
| happy: How happy are you | Nominal -> Interval | None -> Continious |
| sclmeet: How often socially meet with friends, relatives or colleagues | Ordinal | None |
| inprdsc: How many people with whom you can discuss intimate and personal matters | Ordinal | None |
| crmvct: Respondent or household member victim of burglary/assault last 5 years | Nominal | None |
| aesfdrk: Feeling of safety of walking alone in local area after dark | Ordinal | None |
| health: Subjective general health | Ordinal | None |
A histogram and central tendency measures for the variable “happy”. The histogram can help us to understand the general distribution of happiness rates among people from Netherlands. As can be seen from it, the distribution of people’s happiness is skewed to the left. That means there are more happy people than unhappy ones. The dark blue line shows us the mean score of happiness (which is 8.82) and the dashed white line shows the median (which equals to 9).
ggplot() +
geom_histogram(data = NL,
aes(x = happy),
binwidth = 1,
fill = "salmon",
col= "black",
alpha = 0.7) +
labs(subtitle="Histogram",
y="Frequency",
x="Level of happiness",
title="Distribution of people's happiness") +
geom_vline(aes(xintercept = mean(NL$happy)), linetype="solid", color="darkblue", size=1) +
geom_vline(aes(xintercept = median(NL$happy)), linetype="dashed", color="white", size=1) +
theme_bw() #graphical presentation of the 'happy' variable's central tendency measures
happy <- NL$happy
Variable_happy <- c("Level of happiness")
#central tendency measures for varible "happy"
summary(NL$happy)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 8.000 9.000 8.882 10.000 11.000
A barplot for the variable “health”. This bar chart can help us if we, for instance, want to find if there is a general trend of subjective general health among people. According to the plot, more than 50% of respondents estimate their general health as “good”. While the least amount of people think that their health is “5 points out of 5”very bad".
ggplot() +
geom_bar(data = NL,
colour = "black",
fill = "cornflowerblue",
aes(x = health, y=..count../sum(..count..)*100)) +
labs(subtitle="Barplot",
y="Number of people, %",
x="General health state",
title="Subjective General Health") +
coord_flip() +
ylim(0,60) +
theme_bw()
A scatterplot for the variables “happy” and “agea”. This scatterplot can help us to see the distribution of people’s levels of happiness by their age and gender.
ggplot() +
geom_point(data = NL,
aes(x = happy, y = agea, color = gndr)) +
labs(subtitle="Scatterplot",
y="Age, years",
x="Level of happiness",
title="Distribution of Happiness by Age",
color = "Gender") +
theme_minimal()
A boxplot for the variables “agea” and “aesfdrk”. By this boxplot, it can be seen that most people of all ages feel unsafe while walking alone in local area after dark. Yet, those who feel very unsafe on the streets after dark are the minority out of all presented categories of variable ‘aesfdrk’. Besides, only in this category, outliers are observed.
ggplot(data = NL) +
geom_boxplot(aes(x = aesfdrk, y = NL$age),
colour = "orangered") +
labs(subtitle="Boxplots",
y="Age, years",
x="Feeling of safety",
title="Distribution of Feeling of Safety of Walking Alone\nin Local Area After Dark by Age") +
theme_bw()
Central tendency measures for the variable “age”. The information given tells us that the average age of respondents in our dataset is 37 years, maximum age is 80 years, and minimum age is 1 years, the latter being a bit suspicious.
summary(NL$agea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 22.00 38.00 37.11 52.00 80.00
A stacked barplot for the variables “gndr” and “sclmeet”. This graph helps us to trace how often people of each gender socially meet with friends, relatives or colleagues. We can see that the number of those who never socially meet with friends, relatives, and/or work colleagues is quite small. While the majority of respondents have social meetings several times a week. Also, it is clear that ther is a gender balance in each category.
stackedBP <- table(NL$gndr, NL$sclmeet)/nrow(NL)*100
par(mar=c(4, 10, 2, 1))
barplot(stackedBP,
main="Frequency of Social Meetings by Gender",
xlab="Number of people, %",
col=c("dodgerblue","brown1"),
xlim = c(0,50),
horiz = TRUE,
las = 1,
legend = rownames(stackedBP))
A stacked barplot for the variables “crmvct” and “inprdsc”. And this barplot measures if there is any interdependence between being a victim of burglary or assault in last 5 years and the number of people with whom a respondent can discuss intimate and personal matters. As it is graphically shown, the situation is not exactly like this, the majority of those with traumatizing experience disscuss their intimate issues with 4-6 people, which is quite the average out of all categories (from no people to discuss personal matters with to 10 or more).
stackedBP <- table(NL$crmvct, NL$inprdsc)/nrow(NL)*100
barplot(stackedBP,
main="Number of people to discuss intimate/personal matters\n by Having Traumatizing Experience",
xlab="Number of people to discuss intimate/personal matters with", col=c("salmon","mediumturquoise"),
ylab="Number of respondents, %",
ylim=c(0, 50),
args.legend=list(title="Traumatizing experience"),
legend = rownames(stackedBP))
And that’s all. Now you can move on to our next project!
library(magrittr)
library(psych)
library(lsr)
library(vcd)
library(sjPlot)
library(corrplot)
Contribution:
Our work is divided into two parts: chi-square test and T-test. For each we’ve created separate datasets: NL_chi and NL_ttest, and then cleaned all missing values in them.
NL_chi <- ESS %>% select(blgetmg, dscrgrp)
NL_chi = na.omit(NL_chi)
NL_ttest<- ESS %>% select (agea, crmvct)
NL_ttest = na.omit(NL_ttest)
NL_ttest$agea <- as.numeric(NL_ttest$agea)
To begin with, let’s look at the variables “blgetmg”, which defines belonging to a minority ethnic group in the Netherlands and “dscrgrp” - that stands for being a member of a group discriminated against in the Netherlands. We are interested in differences and association that may be noticed.
ggplot(NL_chi,
aes(x = blgetmg, y=..count../sum(..count..)*100,
fill = dscrgrp)) +
geom_bar(position = "stack") +
labs(title="Belong to a Minotity Ethnic Group by Belonging\n To a Group Discriminated Against In the Netherlands",
subtitle="Barplot",
x="Belonging to a minority ethnic group",
y="Number of observations, %",
fill="Being a member of a group\ndiscriminated against in the Netherlands") +
ggtitle("Number of respondents, who belong to minotity ethnic group and if they belong\nor not to a group discriminated against in Netherlands") +
theme_bw()
It is clear that number of respondents who belong to a minority ethnic group is significantly lower than the number of those who do not. If we look closer to the red parts on the graph, we can find that the number of those who are members of a discriminated group is about one third of respondents who belong to a minority ethnic group. In the second bar this proportion is much less. Let’s check it with the chi-square test.
However, firstly we are to check the assumptions of the chi-square test. In order to do it, we construct a contingency table with our variables. Then visualize it.
table <- table(NL_chi)
table
## dscrgrp
## blgetmg Yes No
## Yes 35 57
## No 94 1487
#visualize a contingency table
assoc(head(table, 5), shade = TRUE, las=3, xlab = "Being a member of a group discriminated against in Netherlands", ylab = "Belonging to minority ethnic groups")
First of all, from the table it can be seen that in each СЃell of the contingency table there are at least 5 observations (in fact, there are much more than 5). Secondly, each observation contribute to one group only, so the observations are independent. Thirdly, there are no empty cells. Considering all these conditions, we conclude that the assumptions are met.
Now, let’s observe the difference in proportion of people from the Netherlands who belong to a minority ethnic group and who are members of a group being discriminated. It can be seen that the percentage of those who don’t belong to a minority and are discriminated is small (equal to 6%), while the percentage of those who belong to a minority and are discriminated is higher (scoring 38%).
sjp.xtab(NL_chi$blgetmg, NL_chi$dscrgrp,
margin = "row",
bar.pos = "stack",
title = "The proportion of respondents who belong to minotity ethnic group and their belongness to a group discriminated against in the Netherlands",
axis.titles = "Belonging to ethnic minority",
legend.title = "Being a member of a group discriminated against in the Netherlands",
show.summary = TRUE,
coord.flip = TRUE)
The hypotheses for the test are:
We are perform chi-square test and looking at how many observed and expected observations we have. From the test we can see that variables are statistically significantly associated as p-value is tending towards zero. Also, we can see observed and expected values.
options(scipen=999)
#chi square
chisq.test(table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table
## X-squared = 121.4, df = 1, p-value < 0.00000000000000022
chi <- chisq.test(table)
str(chi)
## List of 9
## $ statistic: Named num 121
## ..- attr(*, "names")= chr "X-squared"
## $ parameter: Named int 1
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.000000000000000000000000000312
## $ method : chr "Pearson's Chi-squared test with Yates' continuity correction"
## $ data.name: chr "table"
## $ observed : 'table' int [1:2, 1:2] 35 94 57 1487
## ..- attr(*, "dimnames")=List of 2
## .. ..$ blgetmg: chr [1:2] "Yes" "No"
## .. ..$ dscrgrp: chr [1:2] "Yes" "No"
## $ expected : num [1:2, 1:2] 7.09 121.91 84.91 1459.09
## ..- attr(*, "dimnames")=List of 2
## .. ..$ blgetmg: chr [1:2] "Yes" "No"
## .. ..$ dscrgrp: chr [1:2] "Yes" "No"
## $ residuals: 'table' num [1:2, 1:2] 10.478 -2.527 -3.029 0.731
## ..- attr(*, "dimnames")=List of 2
## .. ..$ blgetmg: chr [1:2] "Yes" "No"
## .. ..$ dscrgrp: chr [1:2] "Yes" "No"
## $ stdres : 'table' num [1:2, 1:2] 11.2 -11.2 -11.2 11.2
## ..- attr(*, "dimnames")=List of 2
## .. ..$ blgetmg: chr [1:2] "Yes" "No"
## .. ..$ dscrgrp: chr [1:2] "Yes" "No"
## - attr(*, "class")= chr "htest"
chi$observed #observed
## dscrgrp
## blgetmg Yes No
## Yes 35 57
## No 94 1487
round(chi$expected,2) #expected
## dscrgrp
## blgetmg Yes No
## Yes 7.09 84.91
## No 121.91 1459.09
From chi-square test we have that p-value is smaller than 2.2e-16, so if the value of standardized residuals is lower than -3.29 it means that the cell contains fewer observations than it was expected (the case of variables independence). If the value of standardized residuals is higher than 3,29 then the cell contains more observations that it was expected. The standardized residuals and plots with the results are shown below:
chi$stdres #standardized residuals
## dscrgrp
## blgetmg Yes No
## Yes 11.2193 -11.2193
## No -11.2193 11.2193
assocplot(t(table), main="Residuals and number of observations", xlab = "Being a member of a group discriminated against in Netherlands", ylab = "Belonging to minority ethnic groups")
#Let’s visualize Pearson residuals using the package corrplot
corrplot(chi$stdres, is.cor = FALSE)
#we can see that there is association between variables
The probability of independence of variables (or that H0, the null hypothesis, is true) is zero. The function can be seen below:
1 - pchisq(121.4, 0.05)
## [1] 0
So, having the p-value < 2.2e-16 and looking at the plots above, we can conclude that H0 is false and reject it. That means that there is an association between belonging to a minority ethnic group in the Netherlands and being a member of a group discriminated against in the Netherlands.
###Box-plots and checking the normality assumption
While studying the topic of subjective well-being in Netherlands, we pose a question of what age do people typically become a victim of a burglary or an assault. To find this out, we can start with constructing a boxplot which would show us minimum, maximum, and median values of respondents’ age, and find out if these values tend to differ across those who had traumatizing experience in last 5 years and those who did not. Information about traumatizing experience is contained in the variable “crmvct”, the answer on which denotes if a respondent or household member was a victim of burglary or assault in last 5 years or not. Thus, we have two groups on which we can check the distribution of age parameter, the variable “agea”. Having constructed the boxplot, we got the infromation that median, maximum, and minimum values indeed vary across the two groups, and the data in both groups don’t produce any outliers. The summary of the mentioned values, and more, can be seen as the result of applying function psych::describeBy().
ggplot(NL_ttest,
aes(x = NL_ttest$crmvct, y = agea, fill = NL_ttest$crmvct)) +
ylim(0, 80) +
labs(title="Age Distribution by Presence of Traumatizing Experience",
subtitle="Boxplots",
x="Having had traumatizing experience",
y="Age of respondents",
fill="Traumatizing experience") +
geom_boxplot() +
theme_bw()
describeBy(NL_ttest$agea, NL_ttest$crmvct, mat = T)
Now, to be absolutely confident about the result given by the boxplots and its correctness, we want to suggest eresearch hypotheses and apply the t-test to verify it. Yet, before doing so, we need to check the assumptions required by this test. Precisely, the normality of the ‘agea’ variable distribution and homogeneity of the sample groups. Previously printed table on psych::describeBy() function has provided some paramaters that are applicable in the current task.
Those parameters are skews and kurtosis. The first group, that consists of people who were victims of vurglary or assault in last 5 years, shows a skew of value 0.19 meaning that the distribution of data in this group is slightly skewed to the right, while kurtosis is of value -0.83 meaning that the distribution of respondents’ age for this group is platykurtic (flat). The second group also shows a negative kurtosis, denoting flat distribution, this time of value -0.94. Yet, its skew is of negative value meaning that the data is insignificantly, as the value is -0.13, skewed to the left. Besides, sample groups are homogeneous as their variances are equal.
Also, we can to present a Q-Q plot or a histogram of our two variables, in order to double check normality of respondents’ age distribution in both groups. Well, let’s do both. The first Q-Q plot shows distribution of residuals of the age variable for the category of respondents who had traumatizing experience, and the second one presents distribution of residuals of the age variable for the category of respondents who didn’t have such experience.
ggplot(NL_ttest,
aes(x = agea)) +
geom_histogram(col = "orange",
fill = "wheat1",
binwidth = 2.5) +
labs(title = "Age Distribution",
subtitle="Histogram",
x = "Distribution of respondents' age",
y = "Frequency") +
scale_x_continuous(limits = c(0, 100)) +
geom_vline(aes(xintercept = mean(agea)),
color="darkblue",
size = 0.8) +
geom_vline(aes(xintercept = median(agea)),
color = "red4",
size = 0.8) +
facet_grid(NL_ttest$crmvct ~ .) +
theme_bw()
yes <- subset(NL_ttest, NL_ttest$crmvct == "Yes")
no <- subset(NL_ttest, NL_ttest$crmvct == "No")
plot_positive <- qqnorm(as.numeric(yes$agea)); qqline(as.numeric(yes$agea, col = 2)) #a Q-Q plot showing age distribution for those who had traumatizing experience
plot_negative <- qqnorm(as.numeric(no$agea)); qqline(as.numeric(no$agea, col = 2)) #a Q-Q plot showing age distribution for those without traumatizing experience
The histogram visualizes the data distribution and shows that the median and the mean, denoted by vertical lines of different color, appear to almost coincide, indicating that the distribution is close to symmetric. The Q-Q plot lets us assume that our sample of “agea” variable comes from a population that is approximately normally distributed with some signiffiant deviations from normality in the beginning and in the end of it. That is not far from the perfect normality required to conduct the t-test. Besides, for large samples (n > 300) this characteristic is not that important.
Now, let us apply the t-test. As our null hypothesis(H0) we suggest that the means in two populations, from which the sample groups are taken, are equal, and as an alternative hypothesis (H1) we suggest that these means are not equal.
t.test(NL_ttest$agea ~ NL_ttest$crmvct, na.rm = T, var.equal = F)
##
## Welch Two Sample t-test
##
## data: NL_ttest$agea by NL_ttest$crmvct
## t = -3.9936, df = 400.28, p-value = 0.00007744
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.867574 -2.336657
## sample estimates:
## mean in group Yes mean in group No
## 33.31579 37.91791
The results of the t-test show that the means in two populations are not equal, that allows us to reject the null hypothesis. P-value, which is less than 0.05, confirms it. This way, we conducted an analysis that has given us some insights for further exploration of our initial question (“at what age do people in the Netherlands typically become a victim of a burglary or an assault, a.k.a. come to have traumatizing experience”).
Now, let’s check the results of the t-test with unpaired Wilcoxon or Mann-Whitney test.
The hypotheses are:
wilcox.test(NL_ttest$agea ~ NL_ttest$crmvct)
##
## Wilcoxon rank sum test with continuity correction
##
## data: NL_ttest$agea by NL_ttest$crmvct
## W = 159900, p-value = 0.0001113
## alternative hypothesis: true location shift is not equal to 0
We can see that the p-value of the test is 0.0001113, which is less than the significance level alpha = 0.05. This way, the null hypothesis can be rejected. We can conclude that among respondents who expirienced burglary/assalt the median age is significantly different from a group that didn’t expirience burglary/assalt median age.
Let’s aso check the results with Cohen`s D test! It will show how strong the effect size and difference is.
cohensD(NL_ttest$agea ~ NL_ttest$crmvct)
## [1] 0.2473307
The result of 0.25 shows that there is a small effect size. With a Cohen’s d of 0.2:
All for the second part.
library(lsr)
library(vcd)
library(RColorBrewer)
In the context of our topic of the study, i.e. subjective well-being in the Netherlands, we would like to test a new theory. What if the variation of individuals’ total working hours (overtime included) can be subjected to change, if we examine it along with a parameter that indicates one’s time devoted to social activities compared to others of same age. For instance, we assume that if a person spend much time on social meetings or participating in any kinds of social activities that are of importance for him/her, they may either skip some part of the formally assigned working time and then work these hours off, or the work itself may not be presented much in his/her life, as it would be in case of part-time job of a full-time student.
For the start, we can conduct one-way analysis of variances (ANOVA) and check if means across groups of the variable which indicates participating in social activities compared to others of the same age (‘sclact’) are not signiffically different, or there is a variation in means which would support our theory.
Thus, the hypotheses would be the following:
Loading our dataset:
NL <- ESS %>% select (wkhtot, sclact)
NL = na.omit(NL)
whours <- as.numeric(as.character(NL$wkhtot))
NL <- cbind(NL, whours)
dim(NL)
## [1] 1580 3
There are 1580 observations of 3 variables: two, what we have taken from the whole ESS dataset and one new variable - whours. We had to convert it to a continious type to work with it later.
ggplot() +
geom_histogram(data = NL,
aes(x = whours),
binwidth = 8,
fill = "pink",
col= "black",
alpha = 0.7) +
labs(title="Total hours normally worked per week in main job",
subtitle="Histogram",
x="Hours",
y="Count") +
theme_bw()
summary(whours)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 24.00 36.00 34.37 42.00 168.00
The distribution is close to normal, even though it is seen that it is not perfect. It is slightly skewed to the left and has biggest number somewhere about 40 hours as possibly it is the very common number of total hours for work per week.
describeByNow, let’s look at the descriptive values across our new groups. It seems like group sizes are more or less comparable now. At the same time, we can see the values of skewness across three groups, which do not exceed 2, hinting at the normality of the distribution of our continious variable “whours”.
describeBy(whours, sclact, mat = TRUE)
boxplot(whours ~ sclact,
main="Social activity compared to others of same age",
ylab = "Working hours in total",
col = c("mistyrose", "pink", "palevioletred3"))
Here we can see some differences in number of total working hours for people who participate with different frequency in social activities (note that the darker the color, the more frequency). The median number of working hours for people from the category, describing social activity as “about the same” is lower then for people from the category “less” or “more than others”. However, we cannot be sure in significanve of this difference without a test. There are also some outliers observed.
Currently, we are ready to proceed to the one way analysis (i.e. ANOVA). However, considering several assumptions for ANOVA whose violation can lead to robust results, we need to do some additional preparations before running the test. While the independence of observations is implied, homogeneity of variances across the groups and normality of residuals are still to be checked. Firstly, let’s inspect the variances. In order to do that, we apply Levene’s test. Variances homogeneity test
library(car)
leveneTest(whours ~ sclact)
It returns the p-value equal to 0.596 which is greater than alpha level 0.05, indicating that there is no significant difference in variances between the groups. Obtaining this information, we can state that the variances are homogeneous and put var.equal = T in conditions of ANOVA test.
Now, let’s conduct the ANOVA test itself, first part of which would provide p-value, and the second one - F-ratio.
oneway.test(whours ~ sclact, var.equal = T)
##
## One-way analysis of means
##
## data: whours and sclact
## F = 7.5152, num df = 2, denom df = 1577, p-value = 0.0005645
aov.out <- aov(whours ~ sclact)
summary(aov.out)
## Df Sum Sq Mean Sq F value Pr(>F)
## sclact 2 3859 1929.4 7.515 0.000564 ***
## Residuals 1577 404863 256.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Thus, we can see that F-ratio is equal to 7.515, which is bigger than the computed critical F-value, and p-value is 0.0005, which is smaller than alpha level (0.05), that allows us to reject the null hypothesis, concluding that there are variations between the groups’ means due to true differences in the populations means.
layout(matrix(1:4, 2, 2))
plot(aov.out)
Upper graphs’ red lines are almost straight and Q-Q plot is not ideal but still has something like a straight line in it.
Now we are going to double-check the ANOVA results with the Kruskal-Wallis test
kruskal.test(wkhtot ~ sclact, data = NL)
##
## Kruskal-Wallis rank sum test
##
## data: wkhtot by sclact
## Kruskal-Wallis chi-squared = 18.94, df = 4, p-value = 0.0008075
With the KW chi-square equal to 18.94 and p-value is 0.0008, it means that the mean ranks of the different groups are not the same. This results confirms the ANOVA test results.
AS Kruskal-Wallis test is significant, we can also run a non-parametric Dunn`s post hoc test:
library(dunn.test)
dunn.test(whours, sclact, kw=TRUE)
## Kruskal-Wallis rank sum test
##
## data: whours and sclact
## Kruskal-Wallis chi-squared = 14.7901, df = 2, p-value = 0
##
##
## Comparison of whours by sclact
## (No adjustment)
## Col Mean-|
## Row Mean | 1. Less 2. About
## ---------+----------------------
## 2. About | 2.737675
## | 0.0031*
## |
## 3. More | -1.124059 -3.547013
## | 0.1305 0.0002*
##
## alpha = 0.05
## Reject Ho if p <= alpha/2
These results show that most pairs of groups have statistically significant differences in their medians.
As the F-ratio is significant, we can apply a post hoc test to check which groups contribute to the statistical significance of this test. In our case we use Tukey’s post hoc test because the variances are equal across all three groups (as we’ve checked earlier)
TukeyHSD(aov.out)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = whours ~ sclact)
##
## $sclact
## diff lwr
## 2. About the same-1. Less often than others -2.333092 -4.485953
## 3. More often than others-1. Less often than others 1.558029 -1.025284
## 3. More often than others-2. About the same 3.891121 1.410827
## upr p adj
## 2. About the same-1. Less often than others -0.1802306 0.0298638
## 3. More often than others-1. Less often than others 4.1413424 0.3334238
## 3. More often than others-2. About the same 6.3714144 0.0007040
According to the Tukey вЂHonestly Significant Differences’ test, with 95% confidence there is a significant difference between the mean number of working hours of those who devote more time to social activities than others of the same age and those who devote about the same time to social activities compared to others of the same age (the adjusted p-value = 0.0007040). Also, there is a significant difference between the mean number of working hours of people who devote less time to social activities compared to others of the same age and those who devote about the same amount of time to social activities compared to others of their age and people, but here the difference is less significant (the adjusted p-value = 0.0298638). The difference between the mean number of working hours of the remaining group is not significant (the adjusted p-value = 0.3334238)
We can plot this:
par(cex.axis=0.8)
par(mar=c(4, 15, 2, 0))
plot(TukeyHSD(aov.out), las = 1)
The plot represents the above mentioned conclusions. If confidence intervals do not contain 0 there is an evidence that the groups are different. So, the pairs of groups that does not cross the dotted line are significantly different.
End of the third part!
(OMG, cannot believe we are so close to the end!) Btw, all neccessary libraries have already been loaded, so let’s move straight to…
Recently, our team has found out an interesting fact: the country we chose to study, that is Netherlands, was at the top of the ranking of the European Union countries with the highest Internet access at home in 2018 (Statista, 2019). Further exploration revealed that, in 2017, 98% of Dutch households had Internet access, compared to the average 87% in Europe (CDS, 2018). Also, the article from the cbs.nl web-site points out the popularity of mobile Internet in Netherlands, saying that about 87% of population, which is quite a big number, uses the Internet on their mobile devices outside of home or work.
We got curious about this prevalence of the Internet in Netherlands and decided to give it a bit of research. We wanted to check if there are any factors that contribute to the Internet usage among the Dutch. Thus, we settled on applying the regression analysis on data that we got from European Social Survey (ESS) database. As the ESS database allows us to perform the analysis using data only from the 2016 round, or even the later ones, we follow the idea to predict the Internet usage variability in Netherlands just before it became so popular in this country. We assume, it could give us some insights into the prerequisites of such a tendency.
In the regression analysis we used the ‘netustm’ variable that speaks for how much time, in minutes, a respondent spends using the internet on a computer, tablet, smartphone or other device, whether for work or personal use. To predict this as an outcome, we, first of all, are trying a respondent’s age (‘agea’) as an explanatory variable. Further, to see a bigger picture, we are adding to the model respondents being hampered in daily activities by illness/disability/infirmity/mental problem (‘hlthhmp’ variable) and the frequency of taking part in social activities compared to others of same age (‘sclact’ variable) as additional predictors. This way we suggest that if there is any inconvenience for a person to take part in daily activities, then he/she may spend more time in the Internet as a distraction from issues of the day. We also support this model with the frequency of taking part in social activities as it may contibute to the proposition a lot.
Here we want to find out if there were are any relationships between minutes spent on the Internet on a typical day (dependent variable or outcome) and age of respondents (independent variable or predictor) in the Netherlands. In the second model, we also added another predictor, health, to age. We also wanted to predict minutes spent on the Internet on a typical day based on these two predictors in the Netherlands.
Our hypotheses were that the older the respondents, the less time they spent on the Internet on a typical day. Moreover, we assumed that healthy respondents spent less time on the Internet, as they devoted time to other activities. Also, respondents with bad health may spend more time on the Internet as they search for information about their diseases, or physically do not have the opportunity to actively spend time and surf the Internet instead.
Loading our dataset:
NL <- ESS %>% select (agea, netustm, sclact, hlthhmp)
NL = na.omit(NL)
NL$netustm <- as.numeric(NL$netustm)
NL$agea <- as.numeric(NL$agea)
dim(NL)
## [1] 1442 4
The resulting dataset that we are going to work with contains 4 variables with 1442 observations each.
Summary of the data:
summary(NL)
## agea netustm sclact
## Min. : 1.00 Min. : 1.00 Much less than most: 93
## 1st Qu.:20.00 1st Qu.: 12.00 Less than most :386
## Median :34.00 Median : 26.00 About the same :631
## Mean :34.13 Mean : 28.93 More than most :290
## 3rd Qu.:48.00 3rd Qu.: 39.00 Much more than most: 42
## Max. :75.00 Max. :102.00
## hlthhmp
## Yes a lot : 69
## Yes to some extent: 310
## No :1063
##
##
##
Distribution of our variables:
ggplot(NL, aes(x = agea)) +
labs(title="Respondents' age distribution",
subtitle="Histogram",
x="Age",
y="Count") +
geom_histogram(binwidth = 3,
fill = "sienna1",
col= "sienna2",
alpha = 0.7) +
geom_vline(aes(xintercept = mean(NL$agea)), linetype="solid", color="#8B0000", size=1) +
geom_vline(aes(xintercept = median(NL$agea)), linetype="dashed", color="white", size=1) +
theme_bw()
#Checking the normality of distribution
library(ggpubr)
ggqqplot(NL$agea)
summary(NL$agea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 20.00 34.00 34.13 48.00 75.00
From the histogram and the Q-Q plot presented above it can be concluded that the distribution is approximately normal (mean, the dashed white line, and median, the dark red line, are almost the same). The mean age of respondents is roughly 34 years. The oldest respondent is 75 years old. Generally, the majority of the respondents are younger than 60 years.
ggplot(NL, aes(x = netustm)) +
geom_histogram(binwidth = 7,
fill = "sienna1",
col= "sienna2",
alpha = 0.7) +
labs(title="The distribution of number of minutes spent on the Internet on a typical day",
subtitle="Histogram",
x = "Number of minutes spent on the Internet on a typical day",
y = "Count") +
geom_vline(aes(xintercept = mean(NL$netustm)), linetype="solid", color="#8B0000", size=1) +
geom_vline(aes(xintercept = median(NL$agea)), linetype="dashed", color="white", size=1) +
theme_bw()
#checking the normality of distribution
ggqqplot(NL$netustm)
summary(NL$netustm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 12.00 26.00 28.93 39.00 102.00
From the histogram, and Q-Q plot it can be seen that the distribution is not normal and skewed to the right, so many respondents spent less minutes on the Internet compared to the mean score (solid line on the histogram). The largest number of minutes spent on the Internet on a typical day is 102 minutes while the average number of minutes spent on the Internet among respondents is approximately 29 minutes.
ggplot() +
geom_bar(data = NL,
aes(x = sclact, y=..count../sum(..count..)*100),
fill = "red2",
alpha = 0.7) +
labs(title="Frequency of taking part in social activities\ncompared to others of same age",
subtitle="Barplot",
y="Number of respondents, %",
x="Taking part in socia activities") +
theme_bw()
It can be seen that more than 40% of the respondents take part in social activities about the same often as others of the their age. About 27% of the respondents spend less than most while almost 20% of the respondents - more than most. There are much less respondents who take part in social activities much less as well as much more frequent than those of the same age, about 7% and less than 5% correspondingly.
ggplot() +
geom_bar(data = NL,
aes(x = hlthhmp, y=..count../sum(..count..)*100),
fill = "darkgoldenrod1",
alpha = 0.7) +
labs(title="Distribution of respondents being hampered in daily activities\nby illness/disability/infirmity/mental problem",
subtitle="Barplot",
x="Respondents being hampered in daily activities by illness/disability/infirmity/mental problem",
y="Number of people, %") +
theme_bw()
The majority of the respondents, more than 65%, are not hampered in daily activities by illness/disability/infirmity/mental problem. About 20% of the respondents are humpered to some extent and only ~5% are humpered in daily activities due to some issues.
ggplot() +
geom_bar(data = NL,
aes(x = sclact, fill = hlthhmp),
position = "fill") +
xlab("Frequency") +
ggtitle("The Distribution of Frequency of Taking Part in Social Activities\nCompared to Others of Same Age by Respondents Being\nHampered In Daily Activities Due to Some Issues") +
labs(fill="If hampered") +
coord_flip() +
scale_fill_brewer(palette="Reds")+
theme_bw()
In general, the greater percentage of respondents who take part in social activities much less than others of the same age are hampered by illness/disability/infirmity/mental problem either a lot or to some extent compared to others.
ggplot() +
geom_boxplot(data = NL,
aes(x = sclact, y = netustm))+
xlab("The frequency of taking part in social activities compared to others of same age") +
ylab("Number of minutes spent on the Internet on a typical day") +
ggtitle("Numer of minutes spent on the Internet on a typical day\nby the frequency of taking part in social activities compared to others of same age") +
labs(subtitle="Boxplots") +
theme_bw()
NL %>% group_by(sclact) %>% dplyr::summarise(mean_netuse = mean(netustm))
As can be seen, boxplots are almost the same: their medians are around 20-27 minutes among all categories of taking part in social activities compared to others of the same age. It can be said about boxplots whiskers and the 1st and the 3rd quatiles, as well. All boxplots have outliers, but in the one that represents number of minutes on the Internet of people who partcicpate in social activities more often than most of the same age, there are less of them.
Also, from the table we can see that the mean number of minutes spent on the Internet is almost the same among respondents of different groups by frequency of taking part in social activities.
ggplot() +
geom_boxplot(data = NL, aes(x = hlthhmp, y = netustm)) +
xlab("If hampered in daily activities by illness/disability/infirmity/mental problem") +
ylab("Number of minutes spent on the Internet on a typical day") +
ggtitle("Being Hampered In Daily Activities by Illness/disability/infirmity/mental problem\nBy Number of Minutes Spent On the Internet On a Typical Day") +
labs(subtitle="Boxplots") +
theme_bw()
NL %>% group_by(hlthhmp) %>% dplyr::summarise(mean_internetuse = mean(netustm))
So, the medians and the 1st and the 3rd quartiles of the boxplots are almost the same, but the last boxplot has a larger upper whisker and more outliers. That means that more people who are not hampered in daily activities by illness/disability/infirmity/mental problem spend ~60-80 minutes on the Internet per day than those who are not hampered or hampered only to some extent. Still, that may be due to the fact that this group of people is larger than the other two.
From the table it can be concluded that people who are not hampered in daily activities have the highest mean number of minutes spent on the Internet.
First of all, let’s run a correlation test to check if any correlative relationship is possible between our chosen variables. It can be done with the parametric Pearson correlation method and the non-parametric Spearman correlation method which test the correlation coefficient for significance.
As we have close to a normal distribution of the age variable, we can apply the Pearson test.
cor.test(as.numeric(NL$age), as.numeric(NL$netustm), method = "pearson")
##
## Pearson's product-moment correlation
##
## data: as.numeric(NL$age) and as.numeric(NL$netustm)
## t = -14.701, df = 1440, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4053000 -0.3154979
## sample estimates:
## cor
## -0.3612362
The resulting p-value is much smaller than 0.05 that allows us to reject the null hypothesis. Besides, we can see that the correlation coefficient is equal to -0.36, indicating a negative, relatively weak correlation between variables.
To double-check the significance of the correlation coefficient we apply the Spearman test, as well. The hypotheses are the following:
cor.test(as.numeric(NL$age), as.numeric(NL$netustm), method = "spearman")
##
## Spearman's rank correlation rho
##
## data: as.numeric(NL$age) and as.numeric(NL$netustm)
## S = 705640000, p-value < 0.00000000000000022
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.4120086
The resulting p-value is much smaller than 0.05 that allows us to reject the null hypothesis. Besides, we can see that the correlation coefficient is equal to -0.41, indicating negative, relatively weak, yet stronger than shown with the Pearson test, correlation between variables.
Now, let’s visualize our first model which includes the relationship between age and Internet usage with the “best” fit line that approximates all points on the scatterplot to one tendency line.
# Firstly we centered the 'agea' variable to make intercept interpretable
NL$agea_centered <- NL$agea - mean(NL$agea)
ggplot(NL,
aes(x = as.numeric(agea_centered), y = as.numeric(netustm))) +
geom_point(shape = 19,
color = "tomato") +
geom_smooth(method = "lm",
se = FALSE,
color = "purple4") +
labs(subtitle="Age Vs Internet Usage",
y="Time Spent On the Internet",
x="Age",
title="Scatterplot") +
theme_minimal()
ggplot(NL,
aes(x = as.numeric(agea), y = as.numeric(netustm))) +
geom_point(shape = 19,
color = "orange") +
geom_smooth(method = "lm",
se = FALSE,
color = "purple4") +
labs(subtitle="Age Centered Vs Internet Usage",
y="Time Spent On the Internet",
x="Age Centered",
title="Scatterplot") +
theme_minimal()
Orange scatterplot is built with centered variable. You can also have a glance on the graph with no centring of the variable of respondents’ age - it is red-colored.
In general, it can be seen that the majority of respondents spend less than 60 minutes on the Internet per day. Also, people from ~65 years spend less than 60 minutes per day or do not use the Internet at all. Age and number of minutes spent on the Internet are negatively linearly related, means that with the increase of age the number of minutes spent on the Internet decreases. Still, the relationship is quite weak (as it was assumed by both of the correlation tests). Now, let’s proceed to the first linear regression model.
By this model, we are going to predict how much time people spend on the Internet in a day (‘netustm’ variable) by age variable predictor(‘agea’).
model1 <- lm(netustm ~ agea, data = NL)
summary(model1)
##
## Call:
## lm(formula = netustm ~ agea, data = NL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.387 -12.309 -3.835 8.522 76.483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.2313 1.0178 41.49 <0.0000000000000002 ***
## agea -0.3896 0.0265 -14.70 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.72 on 1440 degrees of freedom
## Multiple R-squared: 0.1305, Adjusted R-squared: 0.1299
## F-statistic: 216.1 on 1 and 1440 DF, p-value: < 0.00000000000000022
By looking at the summary, we can see that the model is a significant fit to the data and it is better than no model, as F-statistics line shows the p-value (< 2.2e-16) is much smaller than the conventional cut off of 0.05.
R-squared value, i.e. the coefficient of determination, equals 0.1299 and indicates that our model (model1) explains about 12% of the variability of Internet usage in the Netherlands. At the same time, there is a residual standard error (RSE) value that can tell us about the difference between the observed time spent on the Internet and the time spent on the Internet predicted by the model (model1). This difference is 17.66 percentage points.
Looking at the coefficients in the summary, we can construct a linear model equation. To accomplish that, we should know the coefficients for our equation, namely, the slope value and the intercept value, which have been discovered by now. To quickly remind ourselves of these numbers, we can call the following:
coef(model1)
## (Intercept) agea
## 42.231334 -0.389613
So, the intercept value is 42.22, while the slope value is -0,39. Thus, the linear model is the following:
\[\hat{InternetUsageInMinutes} = 42.23 - 0.39*age\]
It shows us that, when one’s age equals to 0, he/she typically spends 42.23 minutes on the Internet. With every additional year of age, a person spends 0.39 minutes less.
As our age variable doesn’t contain meaningful 0 values, we can also make the same model with the centered age variable as a predictor. That means that we move the mean attendance to zero by subtracting the mean from the variable.
model1_centered <- lm(netustm ~ agea_centered, data = NL)
summary(model1_centered)
##
## Call:
## lm(formula = netustm ~ agea_centered, data = NL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.387 -12.309 -3.835 8.522 76.483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.9334 0.4666 62.01 <0.0000000000000002 ***
## agea_centered -0.3896 0.0265 -14.70 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.72 on 1440 degrees of freedom
## Multiple R-squared: 0.1305, Adjusted R-squared: 0.1299
## F-statistic: 216.1 on 1 and 1440 DF, p-value: < 0.00000000000000022
Now, the intercept point in the model was changed and equals 28.93. The centered linear model is the following:
\[\hat{InternetUsageInMinutes} = 28.93 - 0.39*age\]
It suggests that when one’s age equals to 0, he/she typically spends 28.93 minutes on the Internet. With every additional year of age, a person spends 0.39 minutes less.
Now that we have two models that may predict Internet usage in the Netherlands, we are going to compare them and see which model fits better to our data. In order to do so, we apply the ANOVA testing.
This way, we are going to compare the first model with one predictor, age, and the second model with 3 predictors, age, being hampered in daily activities, and frequency of social meetings compared to others of the same age.
In our case, full model is model2, as it has three predictors, while model1 has only one predictor.
anova(model1, model2)
The anova testing reveals a small p-value, as it is less than 0.05, that allows us to reject the null hypothesis. Also we can have a look on RSS, which is smaller for the second model. This way, we can conclude that model2 is significantly better at capturing the data.
Taking all the analyses into consideration we may say that our substantive hypotheses are supported, but the correlation excessively weak. That was seen after testing with Pearson and Spearman tests. The same was confirmed with our models, which explained about 12-13% of the variability of Internet usage in the Netherlands. Also we found out that it’s better to analyse internet usage not only with variable of reprondents’ age - ANOVA testing have shown that additon of 2 new predictors is a good idea.
Is is also important to remember, that we explored the data for 2016 year, not the most recent one! Maybe things have changed throughout the last 3 years. It would be interesting to see if they have and how exactly, but (unfortunately) we have no such oportunity yet. All in all it was an interesting investigation!
Both models can be seen in a more delicate way in the following table:
tab_model(model1, model2)
| netustm | netustm | |||||
|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 42.23 | 40.24 – 44.23 | <0.001 | 44.82 | 39.24 – 50.39 | <0.001 |
| agea | -0.39 | -0.44 – -0.34 | <0.001 | -0.39 | -0.44 – -0.34 | <0.001 |
| Yes to some extent | -3.49 | -8.14 – 1.15 | 0.141 | |||
| No | -1.21 | -5.61 – 3.19 | 0.591 | |||
| Less than most | -1.00 | -5.02 – 3.01 | 0.624 | |||
| About the same | -2.39 | -6.29 – 1.51 | 0.229 | |||
| More than most | 1.39 | -2.78 – 5.56 | 0.515 | |||
| Much more than most | 2.39 | -4.08 – 8.87 | 0.469 | |||
| Observations | 1442 | 1442 | ||||
| R2 / R2 adjusted | 0.130 / 0.130 | 0.140 / 0.136 | ||||
Not the end yet… stay tuned…
Nevertheless we were not satisfied with the results of our linear regression models. For this reason we decided to try other predistors, for example working hours per week and gender.
We assumed that the effect of the variable “wkhtot” (working hours per week) may differ based on the level of variable “gndr” (respondents’ gender) because, firstly, women generally use the Internet less than men (Thanuskodi, 2013). Moreover, males and females tend to have different types of jobs, and the internet usage may depend on person`s type of job.
In this project our research question was if the relationship between the number of working hours per week and number of minutes spent on the Internet per day was different based on gender? In others words, is the effect of “wkhtot” variable differs based on the level of “gndr” variable?
In this part of project our contribution is:
Let’s upgrade our dataset with new variables:
NL <- ESS %>% select (agea, netustm, sclact, hlthhmp, gndr, wkhtot, wkhtotp)
NL = na.omit(NL)
NL$netustm <- as.numeric(NL$netustm)
NL$wkhtot<- as.numeric(NL$wkhtot)
dim(NL)
## [1] 587 7
The resulting dataset that we are going to work with contains 7 variables with 587 observations each.
ggplot(NL, aes(x = wkhtot)) +
ggtitle("Respondents total hours normally worked per week in main job distribution") +
xlab("Hours") +
ylab("Count") +
geom_histogram(binwidth = 4, fill = "sienna1", col= "sienna2", alpha = 0.7) +
geom_vline(aes(xintercept = mean(NL$wkhtot)), linetype="solid", color="#8B0000", size=1) +
geom_vline(aes(xintercept = median(NL$wkhtot)), linetype="dashed", color="white", size=1) +
theme_bw()
ggqqplot(NL$wkhtot)
summary(NL$wkhtot)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 24.00 35.00 33.72 42.00 68.00
From the histogram, and Q-Q plot it can be seen that the distribution is close to normal being a little bit right-skewed with some people who works much more hours than the mean number(solid line on the histogram). The largest number of hours worked per weer is 68 while the mean number is 34 hours per week.
ggplot(NL, aes(x = netustm)) +
geom_histogram(binwidth = 7, fill = "sienna1", col= "sienna2", alpha = 0.7) +
labs(x = "Number of minutes spent on the Internet on a typical day", y = "count") +
theme_bw() +
ggtitle("The distribution of number of minutes spent on the Internet on a typical day") +
geom_vline(aes(xintercept = mean(NL$netustm)), linetype="solid", color="#8B0000", size=1) +
geom_vline(aes(xintercept = median(NL$netustm)), linetype="dashed", color="white", size=1)
#checking the normality of distribution
ggqqplot(NL$netustm)
summary(NL$netustm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 17.00 26.00 29.05 39.00 102.00
From the histogram, and Q-Q plot it can be seen that the distribution is not normal and right-skewed, so many respondents spent less minutes on the Internet compared to the mean score (solid line on the histogram). The largest number of minutes spent on the Internet on a typical day was 102 minutes while the average number of minutes on the Internet among respondents was approximately 29 minutes.
ggplot() +
geom_bar(data = NL, aes(x = gndr), fill = "red2", alpha = 0.6) +
ylab("Number of respondents") +
xlab("Gender") +
ggtitle("Gender distribution") +
theme_bw()
From the bar plot, it is seen that there is more females than males in the sample.
First of all, let’s run a correlation test to check if any correlative relationship is possible between our chosen variables. It can be done with the Parametric Pearson correlation method.
Hypotheses:
cor.test(NL$wkhtot, NL$netustm, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: NL$wkhtot and NL$netustm
## t = 3.5709, df = 585, p-value = 0.000385
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06590721 0.22432974
## sample estimates:
## cor
## 0.1460547
The p-value is smaller than 0.05 that allows us to reject the null hypothesis meaning that there is a correlation between Working hours and Internet usage. The correlation coefficient is equal to 0.14, indicating a positive and weak correlation between variables.
let’s visualize the relationship between Number of working hours and Internet usage with the “best” fit line that approximates all points on the scatterplot to one tendency line.
ggplot(NL, aes(x = wkhtot , y = as.numeric(netustm))) +
geom_point(shape = 19, color = "tomato") +
geom_smooth(method = "lm", se = FALSE, color = "purple4") +
labs(subtitle="Number of working hours Vs Internet Usage",
y="Time Spent On the Internet",
x="Number of working hours",
title="Scatterplot") +
theme_bw()
By this model, we are going to predict how much time people spend on the Internet in a day (‘netustm’ variable) by Number of working hours per week variable predictor(‘wkhtot’) and gender (‘gndr’ variable).
model1 <- lm(netustm ~ wkhtot + gndr, data = NL)
summary(model1)
##
## Call:
## lm(formula = netustm ~ wkhtot + gndr, data = NL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.741 -12.562 -3.769 9.607 74.540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.7144 3.1498 6.259 0.000000000751 ***
## wkhtot 0.2416 0.0679 3.558 0.000405 ***
## gndrFemale 1.9484 1.8432 1.057 0.290908
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.12 on 584 degrees of freedom
## Multiple R-squared: 0.0232, Adjusted R-squared: 0.01986
## F-statistic: 6.936 on 2 and 584 DF, p-value: 0.001055
In this model, we can see that p-value is less than 0.05 meaning that our model is better than having no model. Adjusted R-squared value is equal to 0.02 meaning that the model explains about 2% of variance of the predicted variable.
The equation linear model is the following:
\[\hat{InternetUsageInMinutes} = 133.8 + 2.3*WorkingHours + 17.5*IfFemale\] It shows us that, when one’s working hours per week equals to 0, a person typically spends 133.8 minutes on the Internet. With every additional working hour, a person spends 2.3 minutes more and if it is a woman, number of minutes increases by 17.5.
In this model we predict how much time people spend on the Internet in a day (‘netustm’ variable) by Number of working hours per week variable predictor(‘wkhtot’) whcih relationship is moderated by gender (‘gndr’ variable). In means that effect of working hours per week variable differs based on the level of gender variable.
model2 <- lm(netustm ~ wkhtot * gndr , data = NL)
summary(model2)
##
## Call:
## lm(formula = netustm ~ wkhtot * gndr, data = NL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.831 -12.841 -4.163 9.362 75.075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.42028 5.11836 6.139 0.00000000154 ***
## wkhtot -0.03127 0.11603 -0.270 0.78763
## gndrFemale -13.63973 5.69532 -2.395 0.01694 *
## wkhtot:gndrFemale 0.41229 0.14263 2.891 0.00399 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.01 on 583 degrees of freedom
## Multiple R-squared: 0.037, Adjusted R-squared: 0.03205
## F-statistic: 7.467 on 3 and 583 DF, p-value: 0.00006522
Looking in the summary, we can see that p-value less than 0.05 meaning that our model is better than having no model. Adjusted R-squared value is equal to 0.03 meaning that the model explains about 3% of variance of the predicted variable. The p-value for coefficient of interaction effect is smaller than 0.05 meaning that there is an effect of Gender on Working hours for preddiction of Internet usage.
The equations of linear model for both Females and Males are :
\[\hat{InternetUsageInMinutes(F)} = 228.6 + 0.14*WorkingHours - 122*IfFemale + 3.77*WorkingHours*IfFemale\]
\[\hat{InternetUsageInMinutes(M)} = 228.5 + 0.14*WorkingHours\]
In other words, a female with 10 working hours per week is predicted to spend on internet:
\[228.6 + 0.14*10 - 122*1 + 3.77*10*1 = 145.7(minutes)\]
And a male with 10 working hours will spend:
\[228.6 + 0.14*10 = 230(minutes)\]
let’s compare the additive model and the interaction model.
anova(model1, model2)
tab_model(model1, model2)
| netustm | netustm | |||||
|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 19.71 | 13.54 – 25.89 | <0.001 | 31.42 | 21.39 – 41.45 | <0.001 |
| wkhtot | 0.24 | 0.11 – 0.37 | <0.001 | -0.03 | -0.26 – 0.20 | 0.788 |
| Female | 1.95 | -1.66 – 5.56 | 0.291 | -13.64 | -24.80 – -2.48 | 0.017 |
| wkhtot:gndrFemale | 0.41 | 0.13 – 0.69 | 0.004 | |||
| Observations | 587 | 587 | ||||
| R2 / R2 adjusted | 0.023 / 0.020 | 0.037 / 0.032 | ||||
The anova testing reveals a small p-value less than 0.05, that allows us to reject the null hypothesis meaning that we can conclude that Interation model2 is significantly better at capturing the data than additive model. Moreover, if we take a look on adjusted R-squared, which is bigger for the model2, we may conclude that the variation of the outcome variable (netuse) is better explained by the second (interaction) model.
# here we did some manipulations to save our variables with prettier names
Gender <- NL$gndr
cbind(NL, Gender)
Working_hours_per_week <- NL$wkhtot
cbind(NL, Working_hours_per_week)
Internet_usage <- NL$netustm
cbind(NL, Internet_usage)
modelC <- lm(Internet_usage ~ Working_hours_per_week * Gender, data=NL)
plot_model(modelC, type="int", title = "Predicted values of Internet usage variable", colors = "Dark2") + theme_sjplot2()
The marginal-effect plot above depicts predicted values of our model by gender group (male or female). The lines showing general trends of predicted values for each group are also supplemented with their confidence intervals, which seem to overlap everywhere except the interval from 49 to 70 points of working hours variable. This mere fact signifies that the amount of time spent on the Internet per week statistically signifficantly varies when we consider different gender groups in our prediction by total working hours per week in the interval from 49 to 70 points. At the same time, this difference in prediction of Internet usage by working hours for two groups is statistically insignifficant in the interval from 0 to 49 points. Besides, the interaction between gender and working hours is clearly observed, as the lines on the graph have diverging slopes.
To conclude, the alternative hypothesis is confirmed: the effect of the variabe “wkhtot” (working hours per week) does differ based on the level of the variable “gndr”, so the relationship between the nuber of working hours and the number of minutes spent on the Internet among the population of the Netherlands depends on the gender of people. The interaction model`s (which is the final one) adjusted R-square is about 0.04 which means that the model explaines 4% of variance in the number of minutes spent on the Internet. Generally, men spend on the Internet about the same number of minutes no matter how many hours a week they work. The number of minutes on the Internet decreases slightly with an increase in working hours. The number of minutes spent on the Internet for women, by contrast, varies depending on the number of working hours per week. In particular, with the increase in the number of working hours, the time spent on the Internet increases, as well. As we have mentioned earlier, his may be due to the specifics of the jobs held by women in the Netherlands.
And here’s the end for all our project.
Congrats to all of us!
We would like to say THANK YOU to all of the team of our teachers and teaching assistants for all their help and knowledge. You are great guys, honestly :) Send you this project with lots’ of love!!!
dat<- data.frame(t=seq(0, 2*pi, by=0.1))
xhrt <- function(t) 16*sin(t)^3
yhrt <- function(t) 13*cos(t)-5*cos(2*t)-2*cos(3*t)-cos(4*t)
dat$y=yhrt(dat$t)
dat$x=xhrt(dat$t)
with(dat, plot(x,y, type="l", col = "hotpink"))
with(dat, polygon(x,y, col="pink"))
Social activity levels in our data:
Let’s take a look at the “sclact” variable. As it can be seen from the barplot below, number of observations in groups, chosen to be examined, vary too much. That may affect the F-ratio, so we got to do something with them.