TO BEGIN WITH…

Hi! Our team is called Odoo (named after a piece of business software, but that’s not so important). And here we are: Anna Moroz (our leader who dropkicks all of us into action (that’s useful!)), Anna Podolskaya (always knits and checks everything before submission), Anastasia Prokudina (the first alarmist, who then calms all of us down) and Olga Chemodanova (data scientist + she googles faster than you).

So, here we would like to present to you all the projects that we completed during the Data Analysis course :)

In all our projects we work with the following data:

  • from the Netherlands;
  • collected in 2016;
  • and the topic of our interest is “Subjective well-being”.

Let’s get started!

Odoo’s very 1st project. Visualisation

Libraries, contribution and loading our data

library(foreign) 
library(dplyr)
library(lubridate)
library(ggplot2)
library(knitr)

Contribution:

Title <- "odoo"
Team <- c("Anna Podolskaya", "Anna Moroz", "Anastasia Prokudina", "Olga Chemodanova")
Team_Member <- sort(Team)
Contribution <- c("Construction of a scatterplot, a boxplot, a stacked barplot", "Construction of a histogram and a stacked barplot, variables' types description, arranging graphs' colours, labels", "Construction of a barplot, identifying CTM for the histogram with 'happy' variable, variables' type description, collecting all scripts in one and arranging the final layout", "Construction of a scatterplot and a boxplot, assistance to other members of the team with R troubles")
Individual_Contribution <- data.frame(Team_Member, Contribution)
kable(Individual_Contribution)
Team_Member Contribution
Anastasia Prokudina Construction of a scatterplot, a boxplot, a stacked barplot
Anna Moroz Construction of a histogram and a stacked barplot, variables’ types description, arranging graphs’ colours, labels
Anna Podolskaya Construction of a barplot, identifying CTM for the histogram with ‘happy’ variable, variables’ type description, collecting all scripts in one and arranging the final layout
Olga Chemodanova Construction of a scatterplot and a boxplot, assistance to other members of the team with R troubles

Loading our data:

ESS <- read.spss("ESS8NL.sav", use.value.labels=T, to.data.frame=T)
NL <- ESS %>% select (gndr, agea, happy, sclmeet, inprdsc, crmvct, aesfdrk, health)
NL <- na.omit(NL)
# note: as.numeric() applied to a labelled factor returns the level codes
# (e.g. 1-11 for 'happy' rather than the original 0-10 scale), which is what
# the summaries below reflect
NL$happy <- as.numeric(NL$happy)
NL$health <- as.factor(NL$health)
NL$gndr <- as.factor(NL$gndr)
NL$agea <- as.numeric(NL$agea)

Identifying variables’ types in the NL dataframe

Variable_name <- c("gndr: Gender", "agea: Age of a respondent", "happy: How happy are you", "sclmeet: How often socially meet with friends, relatives or colleagues", "inprdsc: How many people with whom you can discuss intimate and personal matters", "crmvct: Respondent or household member victim of burglary/assault last 5 years", "aesfdrk: Feeling of safety of walking alone in local area after dark", "health: Subjective general health") 
Level_Of_Measurement <- c("Nominal", "Ratio", "Nominal -> Interval", "Ordinal", "Ordinal", "Nominal", "Ordinal", "Ordinal") 
Continuous_Or_Discrete <- c("None", "Continuous", "None -> Continuous", "None", "None", "None", "None", "None") 
df0 <- data.frame(Variable_name, Level_Of_Measurement, Continuous_Or_Discrete, stringsAsFactors = FALSE, check.rows = TRUE, check.names = TRUE) 
kable(df0)
Variable_name Level_Of_Measurement Continuous_Or_Discrete
gndr: Gender Nominal None
agea: Age of a respondent Ratio Continuous
happy: How happy are you Nominal -> Interval None -> Continuous
sclmeet: How often socially meet with friends, relatives or colleagues Ordinal None
inprdsc: How many people with whom you can discuss intimate and personal matters Ordinal None
crmvct: Respondent or household member victim of burglary/assault last 5 years Nominal None
aesfdrk: Feeling of safety of walking alone in local area after dark Ordinal None
health: Subjective general health Ordinal None

A histogram and central tendency measures for the variable “happy”. The histogram helps us understand the general distribution of happiness scores among people from the Netherlands. As can be seen, the distribution is skewed to the left, which means there are more happy people than unhappy ones. The dark blue solid line shows the mean happiness score (8.88) and the dashed white line shows the median (9).

ggplot() + 
  geom_histogram(data = NL, 
                 aes(x = happy), 
                 binwidth = 1, 
                 fill = "salmon", 
                 col= "black", 
                 alpha = 0.7) +
  labs(subtitle="Histogram", 
       y="Frequency", 
       x="Level of happiness", 
       title="Distribution of people's happiness") +
  geom_vline(aes(xintercept = mean(NL$happy)), linetype="solid", color="darkblue", size=1) +
  geom_vline(aes(xintercept = median(NL$happy)), linetype="dashed", color="white", size=1) +
  theme_bw() #graphical presentation of the 'happy' variable's central tendency measures 

happy <- NL$happy
Variable_happy <- c("Level of happiness")
#central tendency measures for the variable "happy"
summary(NL$happy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   8.000   9.000   8.882  10.000  11.000

A barplot for the variable “health”. This bar chart is useful if we want to see whether there is a general trend in subjective general health. According to the plot, more than 50% of respondents rate their general health as “good”, while the fewest respondents describe their health as “very bad”.

ggplot() + 
  geom_bar(data = NL, 
           colour = "black", 
           fill = "cornflowerblue", 
           aes(x = health, y=..count../sum(..count..)*100)) + 
  labs(subtitle="Barplot", 
       y="Number of people, %", 
       x="General health state", 
       title="Subjective General Health") +
  coord_flip() +
  ylim(0,60) +
  theme_bw()
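
To back up the percentages read off this barplot, here is a minimal numeric check (a sketch, assuming NL is loaded as above):

round(prop.table(table(NL$health)) * 100, 1) #share of respondents in each health category, in %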

A scatterplot for the variables “happy” and “agea”. This scatterplot can help us to see the distribution of people’s levels of happiness by their age and gender.

ggplot() +
  geom_point(data = NL, 
             aes(x = happy, y = agea, color = gndr)) +
  labs(subtitle="Scatterplot", 
       y="Age, years", 
       x="Level of happiness", 
       title="Distribution of Happiness by Age",
       color = "Gender") +
  theme_minimal()

A boxplot for the variables “agea” and “aesfdrk”. From this boxplot, it can be seen that most people of all ages feel unsafe while walking alone in the local area after dark. Yet, those who feel very unsafe on the streets after dark form the smallest of the categories of the ‘aesfdrk’ variable, and it is the only category in which outliers are observed.

ggplot(data = NL) + 
  geom_boxplot(aes(x = aesfdrk, y = agea), 
               colour = "orangered") + 
  labs(subtitle="Boxplots", 
       y="Age, years", 
       x="Feeling of safety", 
       title="Distribution of Feeling of Safety of Walking Alone\nin Local Area After Dark by Age") +
  theme_bw()

Central tendency measures for the variable “agea”. The output tells us that the average age of respondents in our dataset is about 37 years, the maximum is 80, and the minimum is 1, the latter being a bit suspicious (it may be an artifact of converting the labelled age factor to numeric level codes rather than actual years).

summary(NL$agea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   22.00   38.00   37.11   52.00   80.00
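
Out of curiosity, a quick sketch (assuming NL is still loaded) to inspect the implausibly low ages flagged above:

table(NL$agea[NL$agea < 15]) #how many respondents have recorded ages below 15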

A stacked barplot for the variables “gndr” and “sclmeet”. This graph helps us trace how often people of each gender socially meet with friends, relatives or colleagues. We can see that the number of those who never meet socially is quite small, while the majority of respondents have social meetings several times a week. It is also clear that there is a gender balance in each category.

stackedBP <- table(NL$gndr, NL$sclmeet)/nrow(NL)*100 
par(mar=c(4, 10, 2, 1))
barplot(stackedBP,
        main="Frequency of Social Meetings by Gender", 
        xlab="Number of people, %",
        col=c("dodgerblue","brown1"), 
        xlim = c(0,50),
        horiz = TRUE, 
        las = 1,
        legend = rownames(stackedBP))

A stacked barplot for the variables “crmvct” and “inprdsc”. This barplot explores whether there is any interdependence between being a victim of burglary or assault in the last 5 years and the number of people with whom a respondent can discuss intimate and personal matters. As the graph shows, no clear interdependence emerges: the majority of those with a traumatizing experience discuss their intimate issues with 4-6 people, which is roughly the average across all categories (from nobody to discuss personal matters with to 10 or more).

stackedBP <- table(NL$crmvct, NL$inprdsc)/nrow(NL)*100 
barplot(stackedBP, 
        main="Number of people to discuss intimate/personal matters\n by Having Traumatizing Experience",
        xlab="Number of people to discuss intimate/personal matters with", col=c("salmon","mediumturquoise"),
        ylab="Number of respondents, %", 
        ylim=c(0, 50),
        args.legend=list(title="Traumatizing experience"),
        legend = rownames(stackedBP))

And that’s all. Now you can move on to our next project!

Odoo’s 2nd project. Chi-squared and t-test

Updating libraries, contribution and loading our data

library(magrittr)
library(psych)
library(lsr)
library(vcd)
library(sjPlot)
library(corrplot)

Contribution:

  • Anna Moroz - boxplots, checking normality and T-test;
  • Anna Podolskaya - barplots, collecting all the scripts in one and arranging the final layout;
  • Anastasia Prokudina - double-check of the t-test with non-parametric tests;
  • Olga Chemodanova - chi-square test.

Our work is divided into two parts: the chi-square test and the t-test. For each we created a separate dataset, NL_chi and NL_ttest, and then removed all missing values.

NL_chi <- ESS %>% select(blgetmg, dscrgrp)
NL_chi = na.omit(NL_chi)
NL_ttest<- ESS %>% select (agea, crmvct)
NL_ttest = na.omit(NL_ttest)
NL_ttest$agea <- as.numeric(NL_ttest$agea)

Chi-square test

Stacked barplot with two variables

To begin with, let’s look at the variables “blgetmg”, which indicates belonging to a minority ethnic group in the Netherlands, and “dscrgrp”, which stands for being a member of a group discriminated against in the Netherlands. We are interested in any differences and associations that may be noticed.

ggplot(NL_chi, 
       aes(x = blgetmg, y=..count../sum(..count..)*100, 
           fill = dscrgrp)) +
  geom_bar(position = "stack") +
  labs(title="Belong to a Minotity Ethnic Group by Belonging\n To a Group Discriminated Against In the Netherlands",
       subtitle="Barplot",
       x="Belonging to a minority ethnic group",
       y="Number of observations, %",
       fill="Being a member of a group\ndiscriminated against in the Netherlands") +
  ggtitle("Number of respondents, who belong to minotity ethnic group and if they belong\nor not to a group discriminated against in Netherlands") +
  theme_bw()

It is clear that the number of respondents who belong to a minority ethnic group is considerably lower than the number of those who do not. If we look more closely at the red parts of the graph, we can see that members of a discriminated group make up about one third of the respondents who belong to a minority ethnic group, whereas in the second bar this proportion is much smaller. Let’s check it with the chi-square test.
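
Before running the test, the proportions we just described can be computed directly; a minimal sketch (assuming NL_chi as defined above):

round(prop.table(table(NL_chi$blgetmg, NL_chi$dscrgrp), margin = 1) * 100, 1) #row percentages of 'dscrgrp' within each 'blgetmg' group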

Checking all the assumptions of the chi-square

However, first we need to check the assumptions of the chi-square test. To do so, we construct a contingency table with our variables and then visualize it.

table <- table(NL_chi)
table
##        dscrgrp
## blgetmg  Yes   No
##     Yes   35   57
##     No    94 1487
#visualize a contingency table
assoc(head(table, 5), shade = TRUE, las=3, xlab = "Being a member of a group discriminated against in Netherlands", ylab = "Belonging to minority ethnic groups")

First of all, from the table it can be seen that each cell of the contingency table contains at least 5 observations (in fact, far more than 5), and the expected counts shown further below are also all above 5. Secondly, each observation contributes to one group only, so the observations are independent. Thirdly, there are no empty cells. Considering all these conditions, we conclude that the assumptions are met.

Chi-square test and analysing residuals

Now, let’s look at the difference in the proportion of people in the Netherlands who belong to a minority ethnic group and who are members of a discriminated group. The percentage of those who do not belong to a minority and are discriminated against is small (about 6%), while the percentage of those who belong to a minority and are discriminated against is higher (about 38%).

sjp.xtab(NL_chi$blgetmg, NL_chi$dscrgrp, 
         margin = "row", 
         bar.pos = "stack", 
         title = "The proportion of respondents who belong to minotity ethnic group and their belongness to a group discriminated against in the Netherlands", 
         axis.titles = "Belonging to ethnic minority", 
         legend.title = "Being a member of a group discriminated against in the Netherlands", 
         show.summary = TRUE, 
         coord.flip = TRUE) 

The hypotheses for the test are:

  • H0: the variables are not associated
  • H1: the variables are associated

We perform the chi-square test and look at the observed and expected counts. The test shows that the variables are statistically significantly associated, as the p-value tends towards zero. The observed and expected values are also printed below.

options(scipen=999)
#chi square
chisq.test(table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table
## X-squared = 121.4, df = 1, p-value < 0.00000000000000022
chi <- chisq.test(table)
str(chi)
## List of 9
##  $ statistic: Named num 121
##   ..- attr(*, "names")= chr "X-squared"
##  $ parameter: Named int 1
##   ..- attr(*, "names")= chr "df"
##  $ p.value  : num 0.000000000000000000000000000312
##  $ method   : chr "Pearson's Chi-squared test with Yates' continuity correction"
##  $ data.name: chr "table"
##  $ observed : 'table' int [1:2, 1:2] 35 94 57 1487
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ blgetmg: chr [1:2] "Yes" "No"
##   .. ..$ dscrgrp: chr [1:2] "Yes" "No"
##  $ expected : num [1:2, 1:2] 7.09 121.91 84.91 1459.09
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ blgetmg: chr [1:2] "Yes" "No"
##   .. ..$ dscrgrp: chr [1:2] "Yes" "No"
##  $ residuals: 'table' num [1:2, 1:2] 10.478 -2.527 -3.029 0.731
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ blgetmg: chr [1:2] "Yes" "No"
##   .. ..$ dscrgrp: chr [1:2] "Yes" "No"
##  $ stdres   : 'table' num [1:2, 1:2] 11.2 -11.2 -11.2 11.2
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ blgetmg: chr [1:2] "Yes" "No"
##   .. ..$ dscrgrp: chr [1:2] "Yes" "No"
##  - attr(*, "class")= chr "htest"
chi$observed #observed
##        dscrgrp
## blgetmg  Yes   No
##     Yes   35   57
##     No    94 1487
round(chi$expected,2) #expected
##        dscrgrp
## blgetmg    Yes      No
##     Yes   7.09   84.91
##     No  121.91 1459.09

The chi-square test gives a p-value smaller than 2.2e-16. To see which cells drive this result, we examine the standardized residuals: if a standardized residual is lower than -3.29, the cell contains fewer observations than expected under independence; if it is higher than 3.29, the cell contains more observations than expected. The standardized residuals and the plots with the results are shown below:

chi$stdres #standardized residuals
##        dscrgrp
## blgetmg      Yes       No
##     Yes  11.2193 -11.2193
##     No  -11.2193  11.2193
assocplot(t(table), main="Residuals and number of observations", xlab = "Being a member of a group discriminated against in Netherlands", ylab = "Belonging to minority ethnic groups")

#Let’s visualize Pearson residuals using the package corrplot
corrplot(chi$stdres, is.cor = FALSE)

#we can see that there is association between variables

The probability of observing such a chi-square statistic if the variables were independent (i.e. the probability that H0, the null hypothesis, is true) is essentially zero, as the calculation below shows:

1 - pchisq(121.4, df = 1)
## [1] 0

So, having the p-value < 2.2e-16 and looking at the plots above, we can conclude that H0 is false and reject it. That means that there is an association between belonging to a minority ethnic group in the Netherlands and being a member of a group discriminated against in the Netherlands.

T-test

Box-plots and checking the normality assumption

While studying the topic of subjective well-being in the Netherlands, we pose the question of at what age people typically become victims of a burglary or an assault. To find this out, we can start by constructing a boxplot showing the minimum, maximum, and median values of respondents’ age, and see whether these values differ between those who had a traumatizing experience in the last 5 years and those who did not. Information about traumatizing experience is contained in the variable “crmvct”, which records whether a respondent or a household member was a victim of burglary or assault in the last 5 years. Thus, we have two groups across which we can compare the distribution of age, the variable “agea”. Having constructed the boxplot, we see that the median, maximum, and minimum values indeed vary across the two groups, and neither group produces any outliers. A summary of these values, and more, is given by the function psych::describeBy().

ggplot(NL_ttest, 
       aes(x = crmvct, y = agea, fill = crmvct)) +
  ylim(0, 80) +
  labs(title="Age Distribution by Presence of Traumatizing Experience",
       subtitle="Boxplots",
       x="Having had traumatizing experience",
       y="Age of respondents", 
       fill="Traumatizing experience") +
  geom_boxplot() +
  theme_bw()

describeBy(NL_ttest$agea, NL_ttest$crmvct, mat = T)

Now, to be confident that the picture given by the boxplots is correct, we formulate research hypotheses and apply the t-test to verify them. Yet, before doing so, we need to check the assumptions required by this test, namely the normality of the ‘agea’ distribution in each group and the homogeneity of variances across the groups. The table printed above by psych::describeBy() already provides some parameters that are useful for this task.

Those parameters are skewness and kurtosis. The first group, consisting of people who were victims of burglary or assault in the last 5 years, shows a skewness of 0.19, meaning that the age distribution in this group is slightly skewed to the right, while its kurtosis of -0.83 means that the distribution is platykurtic (flat). The second group also shows a negative kurtosis (-0.94), again denoting a flat distribution, and its skewness is slightly negative (-0.13), so the data are marginally skewed to the left. Besides, the sample groups are homogeneous, as their variances are roughly equal.
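
For transparency, here is a minimal sketch (assuming NL_ttest as defined above) of how these group-wise values and the equality of variances could be computed directly:

#group-wise skewness and kurtosis of age (functions from the already loaded psych package)
tapply(NL_ttest$agea, NL_ttest$crmvct, skew)
tapply(NL_ttest$agea, NL_ttest$crmvct, kurtosi)
#F test for equality of the two groups' variances (base R)
var.test(agea ~ crmvct, data = NL_ttest)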

We can also present a Q-Q plot or a histogram of the age variable in the two groups, in order to double-check the normality of the age distribution in both of them. Let’s do both. The first Q-Q plot shows the distribution of the age variable for respondents who had a traumatizing experience, and the second one for respondents who did not.

ggplot(NL_ttest, 
       aes(x = agea)) +
  geom_histogram(col = "orange", 
                 fill = "wheat1", 
                 binwidth = 2.5) +
  labs(title = "Age Distribution",
       subtitle="Histogram",
       x = "Distribution of respondents' age", 
       y = "Frequency") +
  scale_x_continuous(limits = c(0, 100)) +
  geom_vline(aes(xintercept = mean(agea)), 
             color="darkblue", 
             size = 0.8) +
  geom_vline(aes(xintercept = median(agea)), 
             color = "red4", 
             size = 0.8) +
  facet_grid(crmvct ~ .) +
  theme_bw()

yes <- subset(NL_ttest, NL_ttest$crmvct == "Yes")
no <- subset(NL_ttest, NL_ttest$crmvct == "No")
plot_positive <- qqnorm(as.numeric(yes$agea)); qqline(as.numeric(yes$agea), col = 2) #a Q-Q plot showing the age distribution for those who had a traumatizing experience 

plot_negative <- qqnorm(as.numeric(no$agea)); qqline(as.numeric(no$agea), col = 2) #a Q-Q plot showing the age distribution for those without a traumatizing experience

The histogram visualizes the data distribution and shows that the median and the mean, denoted by vertical lines of different colours, almost coincide, indicating that the distribution is close to symmetric. The Q-Q plots let us assume that the “agea” samples come from populations that are approximately normally distributed, with some noticeable deviations from normality at the tails. That is not far from the normality required to conduct the t-test; besides, for large samples (n > 300) this assumption is less critical.
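
If a formal check were desired, a Shapiro-Wilk test could be run on each group; a sketch (note that with large samples this test tends to flag even trivial deviations from normality):

shapiro.test(yes$agea) #respondents with a traumatizing experience
shapiro.test(no$agea)  #respondents without one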

T-test itself

Now, let us apply the t-test. As our null hypothesis (H0) we suggest that the means of the two populations from which the sample groups are taken are equal, and as the alternative hypothesis (H1) we suggest that these means are not equal.

t.test(NL_ttest$agea ~ NL_ttest$crmvct, na.rm = T, var.equal = F)
## 
##  Welch Two Sample t-test
## 
## data:  NL_ttest$agea by NL_ttest$crmvct
## t = -3.9936, df = 400.28, p-value = 0.00007744
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.867574 -2.336657
## sample estimates:
## mean in group Yes  mean in group No 
##          33.31579          37.91791

The results of the t-test show that the means of the two populations are not equal, which allows us to reject the null hypothesis; the p-value, which is less than 0.05, confirms it. This way, we conducted an analysis that has given us some insights for further exploration of our initial question (“at what age do people in the Netherlands typically become victims of a burglary or an assault, a.k.a. come to have a traumatizing experience?”).

Double-checking the results with a non-parametric test

Now, let’s check the results of the t-test with the unpaired Wilcoxon (Mann-Whitney) test.

The hypotheses are:

  • H0: the distribution of scores for the two groups are equal
  • HA: the distribution of scores for the two groups are not equal
wilcox.test(NL_ttest$agea ~ NL_ttest$crmvct)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  NL_ttest$agea by NL_ttest$crmvct
## W = 159900, p-value = 0.0001113
## alternative hypothesis: true location shift is not equal to 0

We can see that the p-value of the test is 0.0001113, which is less than the significance level alpha = 0.05. This way, the null hypothesis can be rejected. We can conclude that the median age of respondents who experienced burglary/assault is significantly different from that of the group that did not.

Let’s also check the results with Cohen’s d! It shows how strong the effect size, i.e. the difference between the groups, is.

cohensD(NL_ttest$agea ~ NL_ttest$crmvct)
## [1] 0.2473307

The result of 0.25 indicates a small effect size. With a Cohen’s d of 0.2 (see the sketch after this list):

  • 58 % of the treatment group will be above the mean of the control group;
  • 92 % of the two groups will overlap, and there is a 56 % chance that a person picked at random from the treatment group will have a higher score than a person picked at random from the control group
  • Moreover, in order to have one more favorable outcome in the treatment group compared to the control group we need to treat 16.5 people. This means that if 100 people go through the treatment, 6.1 more people will have a favorable outcome compared to if they had received the control treatment.
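
These interpretation figures can be derived from the standard normal distribution; a minimal sketch (assuming normal distributions with equal variances):

d <- 0.2
pnorm(d)               #Cohen's U3: share of the treatment group above the control-group mean (~58%)
2 * pnorm(-abs(d) / 2) #overlap of the two distributions (~92%)
pnorm(d / sqrt(2))     #common-language effect size: chance a random treated person scores higher (~56%)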

That’s all for the second part.

Odoo’s 3rd project. ANOVA

Updating libraries and contribution

library(lsr)
library(vcd)
library(RColorBrewer)

Contribution:

  • Anna Moroz - spelling out all the hypotheses, regrouping variable levels, the descriptive table, checking homogeneity of variance, ANOVA performance;
  • Anna Podolskaya - creating boxplots, colouring of plots, collecting all the scripts in one and arranging the final layout, and a little bonus in the end;
  • Anastasia Prokudina - description of variables, checking and interpretation of normality, double-check with a non-parametric test;
  • Olga Chemodanova - application and interpretation of the post hoc test.

Intro to the topic, hypotheses and loading the data

In the context of our study topic, i.e. subjective well-being in the Netherlands, we would like to test a new idea: whether the variation in individuals’ total working hours (overtime included) is related to how much time they devote to social activities compared to others of the same age. For instance, we assume that if a person spends a lot of time on social meetings or on taking part in social activities that are important to him/her, they may either skip part of the formally assigned working time and work these hours off later, or work may simply take up a smaller share of their life, as in the case of a full-time student with a part-time job.

To start, we can conduct a one-way analysis of variance (ANOVA) and check whether the means across groups of the variable that indicates participation in social activities compared to others of the same age (‘sclact’) are significantly different, i.e. whether there is a variation in means which would support our idea.

Thus, the hypotheses would be the following:

  • H0: the data are drawn from populations whose means are not statistically significantly different;
  • H1: statistically, there is variation of means across the chosen populations.

Loading our dataset:

NL <- ESS %>% select (wkhtot, sclact)
NL = na.omit(NL)
whours <- as.numeric(as.character(NL$wkhtot)) #convert the labelled factor to actual numeric working hours
NL <- cbind(NL, whours)
dim(NL)
## [1] 1580    3

There are 1580 observations of 3 variables: the two that we took from the full ESS dataset and one new variable, whours, which we converted to a continuous (numeric) type to work with later.

Checking normality of working hours’ distribution

ggplot() + 
  geom_histogram(data = NL, 
                 aes(x = whours), 
                 binwidth = 8, 
                 fill = "pink", 
                 col= "black", 
                 alpha = 0.7) +
  labs(title="Total hours normally worked per week in main job",
       subtitle="Histogram",
       x="Hours",
       y="Count") +
  theme_bw() 

summary(whours)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   24.00   36.00   34.37   42.00  168.00

The distribution is close to normal, even though it is clearly not perfect. It is slightly skewed to the left and peaks at around 40 hours, which is presumably the most common total number of working hours per week.
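
A quick numeric check of that impression (a sketch; the psych package was loaded earlier):

skew(whours)    #sample skewness of total working hours
kurtosi(whours) #sample kurtosis of total working hours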

Social activity levels in our data:

Let’s take a look at the “sclact” variable. As can be seen from the barplot below, the sizes of the groups chosen for examination vary considerably. That may affect the F-ratio, so we need to do something about it.

par(mar=c(4, 10, 2, 10))
barplot(table(NL$sclact)/nrow(NL)*100, 
        horiz = TRUE, 
        las = 1,  
        xlim=c(0,50),
        main="Distribution of the 'sclact' variable by categories (levels)", 
        xlab="Number of observations, %")

Reorganising levels of social activity

One way to balance the group sizes is to merge the smallest groups into the bigger ones, and that’s what we are going to do. With the following code we create three new groups, two of which combine several levels of the “sclact” variable. Thus, we have created the groups:

  • “More often than others” - contains responses on the inquiry of taking part in social activities compared to others of same age as “Much more than most” and “More than most”;
  • “About the same” group remained the same;
  • “Less often than others” - contains responses for “Less than most” and “Much less than most”.

Also, we established the order of levels for the “sclact” factor.

NL$sclactEd <- rep(NA, length(NL$sclact))
NL$sclactEd[NL$sclact == "Much more than most" | 
            NL$sclact == "More than most"] <- "3. More often than others"
NL$sclactEd[NL$sclact == "About the same"] <- "2. About the same"
NL$sclactEd[NL$sclact == "Less than most" |
            NL$sclact == "Much less than most"] <- "1. Less often than others"
sclact <- as.factor(NL$sclactEd)

Descriptive statistics with describeBy

Now, let’s look at the descriptive values across our new groups. The group sizes are more or less comparable now. At the same time, we can see that the skewness values across the three groups do not exceed 2, hinting at the normality of the distribution of our continuous variable “whours”.

describeBy(whours, sclact, mat = TRUE)

Boxplots

boxplot(whours ~ sclact, 
        main="Social activity compared to others of same age",
        ylab = "Working hours in total", 
        col = c("mistyrose", "pink", "palevioletred3"))

Here we can see some differences in the total number of working hours for people who participate in social activities with different frequency (note that the darker the colour, the higher the frequency). The median number of working hours for people in the “about the same” category is lower than for people in the “less” or “more often than others” categories. However, we cannot be sure of the significance of this difference without a test. Some outliers are also observed.

Final preparations for ANOVA - Checking homogeneity of variance

Currently, we are ready to proceed to the one-way analysis of variance (ANOVA). However, since violating its assumptions can make the results unreliable, we need to do some additional preparations before running the test. While the independence of observations is implied by the survey design, homogeneity of variances across the groups and normality of residuals still have to be checked. First, let’s inspect the variances by applying Levene’s test.

library(car)
leveneTest(whours ~ sclact)

It returns a p-value of 0.596, which is greater than the alpha level of 0.05, indicating that there is no significant difference in variances between the groups. Given this, we can state that the variances are homogeneous and set var.equal = T in the ANOVA test.
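
As an optional cross-check (a sketch, relying on the groups being roughly normal), Bartlett's test gives another look at variance homogeneity:

bartlett.test(whours ~ sclact)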

ANOVA

Now, let’s conduct the ANOVA test itself: first with oneway.test(), which reports the F-ratio and the p-value, and then with aov(), whose summary gives the full ANOVA table.

oneway.test(whours ~ sclact, var.equal = T) 
## 
##  One-way analysis of means
## 
## data:  whours and sclact
## F = 7.5152, num df = 2, denom df = 1577, p-value = 0.0005645
aov.out <- aov(whours ~  sclact) 
summary(aov.out)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## sclact         2   3859  1929.4   7.515 0.000564 ***
## Residuals   1577 404863   256.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Thus, we can see that the F-ratio is 7.515, which is bigger than the critical F-value (see the sketch below), and the p-value is 0.0005, which is smaller than the alpha level (0.05). This allows us to reject the null hypothesis and conclude that the variation between the groups’ means reflects true differences in the population means.
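
For reference, the critical F-value mentioned above can be computed explicitly; a small sketch:

qf(0.95, df1 = 2, df2 = 1577) #critical F at alpha = 0.05, roughly 3, well below the observed 7.5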

Checking normality of residuals

layout(matrix(1:4, 2, 2))
plot(aov.out)

The red lines on the upper diagnostic plots are almost straight, and the Q-Q plot, while not ideal, still follows an approximately straight line.

Double-check with a non-parametric test

Kruskal-Wallis test

Now we are going to double-check the ANOVA results with the Kruskal-Wallis test.

kruskal.test(wkhtot ~ sclact, data = NL)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  wkhtot by sclact
## Kruskal-Wallis chi-squared = 18.94, df = 4, p-value = 0.0008075

With the KW chi-squared equal to 18.94 and a p-value of 0.0008, the mean ranks of the groups are not all the same (note that this test was run on the original five-level ‘sclact’ variable, hence df = 4). This result confirms the ANOVA findings.

Dunn's post hoc test

As the Kruskal-Wallis test is significant, we can also run the non-parametric Dunn's post hoc test:

library(dunn.test)
dunn.test(whours, sclact, kw=TRUE)
##   Kruskal-Wallis rank sum test
## 
## data: whours and sclact
## Kruskal-Wallis chi-squared = 14.7901, df = 2, p-value = 0
## 
## 
##                         Comparison of whours by sclact                         
##                                 (No adjustment)                                
## Col Mean-|
## Row Mean |   1. Less    2. About
## ---------+----------------------
## 2. About |   2.737675
##          |    0.0031*
##          |
## 3. More  |  -1.124059  -3.547013
##          |     0.1305    0.0002*
## 
## alpha = 0.05
## Reject Ho if p <= alpha/2

These results show that two of the three pairwise comparisons have statistically significant differences in their medians.

Application and interpretation of post hoc test

As the F-ratio is significant, we can apply a post hoc test to check which groups contribute to the statistical significance of this result. In our case we use Tukey’s post hoc test because the variances are equal across all three groups (as we checked earlier).

TukeyHSD(aov.out)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = whours ~ sclact)
## 
## $sclact
##                                                          diff       lwr
## 2. About the same-1. Less often than others         -2.333092 -4.485953
## 3. More often than others-1. Less often than others  1.558029 -1.025284
## 3. More often than others-2. About the same          3.891121  1.410827
##                                                            upr     p adj
## 2. About the same-1. Less often than others         -0.1802306 0.0298638
## 3. More often than others-1. Less often than others  4.1413424 0.3334238
## 3. More often than others-2. About the same          6.3714144 0.0007040

According to the Tukey ‘Honestly Significant Differences’ test, with 95% confidence there is a significant difference between the mean number of working hours of those who devote more time to social activities than others of the same age and those who devote about the same amount of time (adjusted p-value = 0.0007040). There is also a significant, though weaker, difference between the mean working hours of people who devote less time to social activities than others of the same age and those who devote about the same amount of time (adjusted p-value = 0.0298638). The difference between the mean working hours of the remaining pair of groups is not significant (adjusted p-value = 0.3334238).

We can plot this:

par(cex.axis=0.8)
par(mar=c(4, 15, 2, 0))
plot(TukeyHSD(aov.out), las = 1)

The plot represents the conclusions above. If a confidence interval does not contain 0, there is evidence that the corresponding groups are different. So, the pairs of groups whose intervals do not cross the dotted line are significantly different.

End of the third part!

Odoo’s 4th project. Linear regression

(OMG, cannot believe we are so close to the end!) By the way, all necessary libraries have already been loaded, so let’s move straight to…

Contribution

  • Anna Moroz - introduction, second model with 3 predictors (including its summary, interpretation and linear equation), comparison of two models via ANOVA, creation of scatterplots;
  • Anna Podolskaya - explanation whether the hypotheses are falsified or supported, creating an output table, formulating a conclusion, coloring of graphs, arranging the final layout;
  • Anastasia Prokudina - selection of variables, first model with 1 predictor (including its summary, interpretation and linear equation);
  • Olga Chemodanova - research question and hypotheses, descriptive statistics for the variables, scatterplot description.

Going deeper into the topic

Recently, our team found out an interesting fact: the country we chose to study, the Netherlands, topped the ranking of European Union countries with the highest Internet access at home in 2018 (Statista, 2019). Further exploration revealed that, in 2017, 98% of Dutch households had Internet access, compared to the average of 87% in Europe (CDS, 2018). Also, an article on the cbs.nl website points out the popularity of mobile Internet in the Netherlands, saying that about 87% of the population, which is quite a big number, uses the Internet on their mobile devices outside of home or work.

We got curious about this prevalence of the Internet in the Netherlands and decided to look into it. We wanted to check whether there are any factors that contribute to Internet usage among the Dutch, so we settled on applying regression analysis to data from the European Social Survey (ESS) database. As the ESS allows us to perform the analysis on the 2016 round, we follow the idea of predicting the variability in Internet usage in the Netherlands just before it became so widespread, assuming this could give us some insight into the prerequisites of such a tendency.

In the regression analysis we used the ‘netustm’ variable, which records how much time, in minutes, a respondent spends using the Internet on a computer, tablet, smartphone or other device, whether for work or personal use. To predict this outcome, we first try the respondent’s age (‘agea’) as an explanatory variable. Then, to see a bigger picture, we add to the model whether a respondent is hampered in daily activities by illness/disability/infirmity/mental problems (‘hlthhmp’) and the frequency of taking part in social activities compared to others of the same age (‘sclact’) as additional predictors. We suggest that if daily activities are inconvenient for a person, he/she may spend more time on the Internet as a distraction from the issues of the day, and the frequency of taking part in social activities may contribute a lot to this proposition as well.

Research question, hypotheses and loading the data

Here we want to find out whether there are any relationships between minutes spent on the Internet on a typical day (the dependent variable, or outcome) and the age of respondents (the independent variable, or predictor) in the Netherlands. In the second model, we also add health and social activity as predictors alongside age, and use them to predict minutes spent on the Internet on a typical day in the Netherlands.

Our hypotheses were that the older the respondents, the less time they spent on the Internet on a typical day. Moreover, we assumed that healthy respondents spent less time on the Internet, as they devoted time to other activities. Also, respondents with bad health may spend more time on the Internet as they search for information about their diseases, or physically do not have the opportunity to actively spend time and surf the Internet instead.

Loading our dataset:

NL <- ESS %>% select (agea, netustm, sclact, hlthhmp)
NL = na.omit(NL)
NL$netustm <- as.numeric(NL$netustm)
NL$agea <- as.numeric(NL$agea)
dim(NL)
## [1] 1442    4

The resulting dataset that we are going to work with contains 4 variables with 1442 observations each.

Descriptive statistics for the variables

Summary of the data:

summary(NL)
##       agea          netustm                       sclact   
##  Min.   : 1.00   Min.   :  1.00   Much less than most: 93  
##  1st Qu.:20.00   1st Qu.: 12.00   Less than most     :386  
##  Median :34.00   Median : 26.00   About the same     :631  
##  Mean   :34.13   Mean   : 28.93   More than most     :290  
##  3rd Qu.:48.00   3rd Qu.: 39.00   Much more than most: 42  
##  Max.   :75.00   Max.   :102.00                            
##                hlthhmp    
##  Yes a lot         :  69  
##  Yes to some extent: 310  
##  No                :1063  
##                           
##                           
## 

Distribution of our variables:

  • respondents` age:
ggplot(NL, aes(x = agea)) +
  labs(title="Respondents' age distribution",
       subtitle="Histogram",
       x="Age",
       y="Count") +
  geom_histogram(binwidth = 3, 
                 fill = "sienna1", 
                 col= "sienna2", 
                 alpha = 0.7) +
  geom_vline(aes(xintercept = mean(NL$agea)), linetype="solid", color="#8B0000", size=1) +
  geom_vline(aes(xintercept = median(NL$agea)), linetype="dashed", color="white", size=1) +
  theme_bw()

#Checking the normality of distribution
library(ggpubr)
ggqqplot(NL$agea)

summary(NL$agea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   20.00   34.00   34.13   48.00   75.00

From the histogram and the Q-Q plot presented above it can be concluded that the distribution is approximately normal (the mean, the solid dark red line, and the median, the dashed white line, almost coincide). The mean age of respondents is roughly 34 years and the oldest respondent is 75. Generally, the majority of respondents are younger than 60.

  • the number of minutes spent on the Internet on a typical day among the respondents:
ggplot(NL, aes(x = netustm)) +
  geom_histogram(binwidth = 7, 
                 fill = "sienna1", 
                 col= "sienna2", 
                 alpha = 0.7) + 
  labs(title="The distribution of number of minutes spent on the Internet on a typical day",
       subtitle="Histogram",
       x = "Number of minutes spent on the Internet on a typical day", 
       y = "Count") +
  geom_vline(aes(xintercept = mean(NL$netustm)), linetype="solid", color="#8B0000", size=1) +
  geom_vline(aes(xintercept = median(NL$netustm)), linetype="dashed", color="white", size=1) +
  theme_bw() 

#checking the normality of distribution
ggqqplot(NL$netustm)

summary(NL$netustm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   12.00   26.00   28.93   39.00  102.00

From the histogram and the Q-Q plot it can be seen that the distribution is not normal and is skewed to the right, so many respondents spend fewer minutes on the Internet than the mean score (the solid line on the histogram). The largest number of minutes spent on the Internet on a typical day is 102, while the average among respondents is approximately 29 minutes.

  • the frequency of taking part in social activities compared to others of same age:
ggplot() + 
  geom_bar(data = NL, 
           aes(x = sclact, y=..count../sum(..count..)*100), 
           fill = "red2", 
           alpha = 0.7) +
  labs(title="Frequency of taking part in social activities\ncompared to others of same age", 
       subtitle="Barplot",
       y="Number of respondents, %",
       x="Taking part in socia activities") +
  theme_bw() 

It can be seen that more than 40% of the respondents take part in social activities about as often as others of their age. About 27% of the respondents participate less than most, while almost 20% participate more than most. There are far fewer respondents who take part in social activities much less or much more frequently than others of the same age: about 7% and less than 5% respectively.

  • number of respondents being hampered in daily activities by illness/disability/infirmity/mental problems:
ggplot() + 
  geom_bar(data = NL, 
           aes(x = hlthhmp, y=..count../sum(..count..)*100), 
           fill = "darkgoldenrod1", 
           alpha = 0.7) +
  labs(title="Distribution of respondents being hampered in daily activities\nby illness/disability/infirmity/mental problem",
       subtitle="Barplot",
       x="Respondents being hampered in daily activities by illness/disability/infirmity/mental problem",
       y="Number of people, %") +
  theme_bw()

The majority of the respondents, more than 65%, are not hampered in daily activities by illness/disability/infirmity/mental problems. About 20% of the respondents are hampered to some extent and only ~5% are hampered a lot.

  • This barchart illustrates the distribution of frequency of taking part in social activities compared to others of same age along with the fact if respondents are hampered in daily activities by illness/disability/infirmity/mental problem.
ggplot() + 
  geom_bar(data = NL, 
           aes(x = sclact, fill = hlthhmp), 
           position = "fill") +
  xlab("Frequency") +
  ggtitle("The Distribution of Frequency of Taking Part in Social Activities\nCompared to Others of Same Age by Respondents Being\nHampered In Daily Activities Due to Some Issues") + 
  labs(fill="If hampered") +
  coord_flip() +
  scale_fill_brewer(palette="Reds")+
  theme_bw()

In general, a greater percentage of respondents who take part in social activities much less than others of the same age are hampered by illness/disability/infirmity/mental problems, either a lot or to some extent, compared to the other groups.

Boxplots for Internet activity

  • The boxplots showing number of minutes spent on the Internet on a typical day by the frequency of taking part in social activities compared to others of same age:
ggplot() + 
  geom_boxplot(data = NL, 
               aes(x = sclact, y = netustm))+
  xlab("The frequency of taking part in social activities compared to others of same age") +
  ylab("Number of minutes spent on the Internet on a typical day") +
  ggtitle("Numer of minutes spent on the Internet on a typical day\nby the frequency of taking part in social activities compared to others of same age") +
  labs(subtitle="Boxplots") +
  theme_bw()

NL %>% group_by(sclact) %>% dplyr::summarise(mean_netuse = mean(netustm))

As can be seen, the boxplots are almost the same: their medians are around 20-27 minutes across all categories of taking part in social activities compared to others of the same age. The same can be said about the whiskers and the 1st and 3rd quartiles. All boxplots have outliers, but the one representing people who participate in social activities more often than most of the same age has fewer of them.

Also, from the table we can see that the mean number of minutes spent on the Internet is almost the same among respondents of different groups by frequency of taking part in social activities.

  • boxplots representing the distribution of respondents being hampered in daily activities by illness/disability/infirmity/mental problem by number of minutes spent on the Internet on a typical day:
ggplot() + 
  geom_boxplot(data = NL, aes(x = hlthhmp, y = netustm)) +
  xlab("If hampered in daily activities by illness/disability/infirmity/mental problem") +
  ylab("Number of minutes spent on the Internet on a typical day") +
  ggtitle("Being Hampered In Daily Activities by Illness/disability/infirmity/mental problem\nBy Number of Minutes Spent On the Internet On a Typical Day") +
  labs(subtitle="Boxplots") +
  theme_bw()

NL %>% group_by(hlthhmp) %>% dplyr::summarise(mean_internetuse = mean(netustm))

So, the medians and the 1st and 3rd quartiles of the boxplots are almost the same, but the last boxplot has a larger upper whisker and more outliers. That means that more people who are not hampered in daily activities by illness/disability/infirmity/mental problems spend ~60-80 minutes on the Internet per day than those who are hampered a lot or to some extent. Still, that may be due to the fact that this group is larger than the other two.

From the table it can be concluded that people who are not hampered in daily activities have the highest mean number of minutes spent on the Internet.

Correlation test

First of all, let’s run a correlation test to check whether there is any correlation between our chosen variables. This can be done with the parametric Pearson method and the non-parametric Spearman method, which test the correlation coefficient for significance.

As we have close to a normal distribution of the age variable, we can apply the Pearson test.

  • H0: the population correlation coefficient between age and Internet usage is not significantly different from zero;
  • H1: the population correlation coefficient between age and Internet usage is significantly different from zero.
cor.test(as.numeric(NL$agea), as.numeric(NL$netustm), method = "pearson") 
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(NL$agea) and as.numeric(NL$netustm)
## t = -14.701, df = 1440, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4053000 -0.3154979
## sample estimates:
##        cor 
## -0.3612362

The resulting p-value is much smaller than 0.05, which allows us to reject the null hypothesis. Besides, we can see that the correlation coefficient is -0.36, indicating a negative, relatively weak correlation between the variables.

To double-check the significance of the correlation coefficient we apply the Spearman test, as well. The hypotheses are the following:

  • H0: the population correlation coefficient between age and Internet usage is not significantly different from zero;
  • H1: the population correlation coefficient between age and Internet usage is significantly different from zero.
cor.test(as.numeric(NL$agea), as.numeric(NL$netustm), method = "spearman") 
## 
##  Spearman's rank correlation rho
## 
## data:  as.numeric(NL$agea) and as.numeric(NL$netustm)
## S = 705640000, p-value < 0.00000000000000022
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.4120086

The resulting p-value is much smaller than 0.05, which allows us to reject the null hypothesis. Besides, the correlation coefficient is -0.41, indicating a negative, relatively weak correlation that is nevertheless somewhat stronger than the one estimated by the Pearson test.

Model 1: age vs. internet usage

Visualisation of the 1st model

Now, let’s visualize our first model, which includes the relationship between age and Internet usage, with a best-fit line summarizing the overall trend of the points on the scatterplot.

# Firstly we centered the 'agea' variable to make intercept interpretable
NL$agea_centered <- NL$agea - mean(NL$agea)
ggplot(NL, 
       aes(x = as.numeric(agea), y = as.numeric(netustm))) +
  geom_point(shape = 19, 
             color = "tomato") + 
  geom_smooth(method = "lm", 
              se = FALSE, 
              color = "purple4") +
  labs(subtitle="Age Vs Internet Usage", 
       y="Time Spent On the Internet", 
       x="Age", 
       title="Scatterplot") +
  theme_minimal()

ggplot(NL, 
       aes(x = as.numeric(agea_centered), y = as.numeric(netustm))) +
  geom_point(shape = 19, 
             color = "orange") + 
  geom_smooth(method = "lm", 
              se = FALSE, 
              color = "purple4") +
  labs(subtitle="Age Centered Vs Internet Usage", 
       y="Time Spent On the Internet", 
       x="Age Centered", 
       title="Scatterplot") +
  theme_minimal()

The orange scatterplot is built with the centered variable. You can also glance at the graph without centering the respondents’ age variable; it is the red-coloured one.

In general, it can be seen that the majority of respondents spend less than 60 minutes on the Internet per day. Also, people of about 65 and older spend less than 60 minutes per day or do not use the Internet at all. Age and the number of minutes spent on the Internet are negatively linearly related, meaning that as age increases, the number of minutes spent on the Internet decreases. Still, the relationship is quite weak (as both correlation tests suggested). Now, let’s proceed to the first linear regression model.

First linear regression model and its summary

With this model, we are going to predict how much time people spend on the Internet per day (the ‘netustm’ variable) using the age variable (‘agea’) as a predictor.

model1 <- lm(netustm ~ agea, data = NL) 
summary(model1) 
## 
## Call:
## lm(formula = netustm ~ agea, data = NL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.387 -12.309  -3.835   8.522  76.483 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  42.2313     1.0178   41.49 <0.0000000000000002 ***
## agea         -0.3896     0.0265  -14.70 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.72 on 1440 degrees of freedom
## Multiple R-squared:  0.1305, Adjusted R-squared:  0.1299 
## F-statistic: 216.1 on 1 and 1440 DF,  p-value: < 0.00000000000000022

Looking at the summary, we can see that the model is a significant fit to the data and is better than no model, as the F-statistic line shows a p-value (< 2.2e-16) much smaller than the conventional cut-off of 0.05.

The R-squared value, i.e. the coefficient of determination, equals about 0.13 and indicates that our model (model1) explains about 13% of the variability of Internet usage in the Netherlands. At the same time, the residual standard error (RSE) tells us about the typical difference between the observed time spent on the Internet and the time predicted by the model (model1): this difference is about 17.7 minutes.
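
These quantities can also be pulled straight from the fitted model object; a minimal sketch:

summary(model1)$r.squared     #multiple R-squared (~0.13)
summary(model1)$adj.r.squared #adjusted R-squared (~0.13)
sigma(model1)                 #residual standard error, in minutes (~17.7)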

Linear model equation

Looking at the coefficients in the summary, we can construct a linear model equation. To accomplish that, we should know the coefficients for our equation, namely, the slope value and the intercept value, which have been discovered by now. To quickly remind ourselves of these numbers, we can call the following:

coef(model1)
## (Intercept)        agea 
##   42.231334   -0.389613

So, the intercept is 42.23, while the slope is -0.39. Thus, the linear model is the following:

\[\hat{InternetUsageInMinutes} = 42.23 - 0.39*age\]

It shows us that when one’s age equals 0, he/she is predicted to spend 42.23 minutes on the Internet, and with every additional year of age, a person spends 0.39 minutes less.
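
To illustrate how the equation is used, a small sketch predicting Internet usage for a few hypothetical ages:

predict(model1, newdata = data.frame(agea = c(20, 40, 60))) #predictions follow the line 42.23 - 0.39*age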

As our age variable doesn’t contain meaningful 0 values, we can also fit the same model with the centered age variable as a predictor. That means we shift the mean age to zero by subtracting the mean from the variable.

model1_centered <- lm(netustm ~ agea_centered, data = NL)
summary(model1_centered)
## 
## Call:
## lm(formula = netustm ~ agea_centered, data = NL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.387 -12.309  -3.835   8.522  76.483 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)    28.9334     0.4666   62.01 <0.0000000000000002 ***
## agea_centered  -0.3896     0.0265  -14.70 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.72 on 1440 degrees of freedom
## Multiple R-squared:  0.1305, Adjusted R-squared:  0.1299 
## F-statistic: 216.1 on 1 and 1440 DF,  p-value: < 0.00000000000000022

Now the intercept has changed and equals 28.93. The centered linear model is the following:

\[\hat{InternetUsageInMinutes} = 28.93 - 0.39*(age - \overline{age})\]

It suggests that a person whose age equals the sample mean (roughly 34) typically spends 28.93 minutes on the Internet. With every additional year of age, a person spends 0.39 minutes less.

Model 2: age, hamperedness in daily activities and frequency of participation in social activities vs. internet usage

Visualisation of the 2nd model

From the visualization it can be seen that the slopes of the best-fit lines differ across the categories of the ‘hlthhmp’ variable. Still, the general tendency of a negative linear relationship between respondents’ age and Internet usage is observed for all three groups.

ggplot(NL, 
       aes(x = agea, y = netustm)) +
  geom_point(aes(color = hlthhmp)) + 
  geom_smooth(method = "lm", 
              se = FALSE, 
              color = "purple4") +
  facet_wrap(~hlthhmp) +
  labs(subtitle="Age Vs Internet Usage", 
       y="Time Spent On the Internet", 
       x="Age", 
       col = "Hampered in\ndaily activity",
       title="Scatterplot") +
  theme_bw()

The same patterns are observed in the following visualization, with ‘sclact’ variable this time.

ggplot(NL, 
       aes(x = as.numeric(agea), y = as.numeric(netustm))) +
  geom_point(aes(color = sclact)) + 
  geom_smooth(method = "lm", 
              se = FALSE, 
              color = "purple4") +
  facet_wrap(~sclact) +
  labs(subtitle="Age Vs Internet Usage", 
       y="Time Spent On the Internet", 
       x="Age", 
       col = "Take part in\nsocial activities",
       title="Scatterplot") +
  theme_minimal()

It is possible to visualise linear relationships between several predictors and an outcome with a 3D scatterplot, but we did it with facets instead. The fact that, for both being hampered in daily activities and taking part in social activities, different categories have different slopes of the regression line is also reflected in the linear model equation.
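For the curious, such a 3D view could be sketched as follows (purely illustrative and not part of our analysis; it assumes the scatterplot3d package is available and coerces the ordinal ‘sclact’ to numbers):

library(scatterplot3d)
scatterplot3d(x = NL$agea,
              y = as.numeric(NL$sclact),
              z = NL$netustm,
              xlab = "Age",
              ylab = "Take part in social activities (as numeric)",
              zlab = "Time Spent On the Internet",
              color = "purple4", pch = 19)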

Second linear regression model and its summary

Going further, let’s construct our second regression model, using age, being hampered in daily activities and frequency of participation in social activities as predictors for how much time people spend on the Internet.

model2 <- lm(netustm ~ agea + hlthhmp + sclact, data = NL)
summary(model2)
## 
## Call:
## lm(formula = netustm ~ agea + hlthhmp + sclact, data = NL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.057 -11.714  -3.800   8.049  76.063 
## 
## Coefficients:
##                           Estimate Std. Error t value            Pr(>|t|)
## (Intercept)               44.81548    2.84378  15.759 <0.0000000000000002
## agea                      -0.38888    0.02696 -14.426 <0.0000000000000002
## hlthhmpYes to some extent -3.49172    2.36995  -1.473               0.141
## hlthhmpNo                 -1.20876    2.24681  -0.538               0.591
## sclactLess than most      -1.00335    2.04908  -0.490               0.624
## sclactAbout the same      -2.39317    1.98919  -1.203               0.229
## sclactMore than most       1.38576    2.12743   0.651               0.515
## sclactMuch more than most  2.39442    3.30360   0.725               0.469
##                              
## (Intercept)               ***
## agea                      ***
## hlthhmpYes to some extent    
## hlthhmpNo                    
## sclactLess than most         
## sclactAbout the same         
## sclactMore than most         
## sclactMuch more than most    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.66 on 1434 degrees of freedom
## Multiple R-squared:  0.1401, Adjusted R-squared:  0.1359 
## F-statistic: 33.38 on 7 and 1434 DF,  p-value: < 0.00000000000000022

In the summary we see that the p-value for the F-statistic is smaller than 0.05, which makes this model a significant fit to the data. The coefficient of determination (R^2) equals 0.140 (0.136 adjusted), showing that the second model explains about 14% of the variability of the outcome variable, i.e. Internet usage.

Linear model equation

Looking at the coefficients in the summary, we can construct the linear model equation.

coef(model2)
##               (Intercept)                      agea 
##                44.8154848                -0.3888826 
## hlthhmpYes to some extent                 hlthhmpNo 
##                -3.4917222                -1.2087594 
##      sclactLess than most      sclactAbout the same 
##                -1.0033456                -2.3931728 
##      sclactMore than most sclactMuch more than most 
##                 1.3857626                 2.3944152

The linear model, a bit overloaded but still, is:

\[\hat{InternetUsageInMinutes} = 44.82 - 0.39*age - 3.49*hamperedToSomeExtent - 1.21*notHampered - \\ 1.00*sociallyActiveLessThanMost - 2.39*sociallyActiveAboutTheSame + \\ 1.39*sociallyActiveMoreThanMost + 2.39*sociallyActiveMuchMoreThanMost\]

That suggests that a respondent aged 0 who is in the reference categories of both hamperedness and social activity is predicted to spend 44.82 minutes on the Internet; with every additional year of age, predicted usage drops by 0.39 minutes. The categorical coefficients are read relative to their reference categories: being hampered in daily activities to some extent corresponds to 3.49 fewer minutes on the Net, and not being hampered to 1.21 fewer minutes. For social activity there is a hint of a positive relationship: taking part in social activities more than most people of the same age corresponds to 1.39 more minutes, and much more than most to 2.39 more minutes. Note, however, that none of these categorical coefficients is statistically significant in the summary above.
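To make the equation more tangible, here is a small prediction sketch for two hypothetical 30-year-old respondents who are not hampered in daily activities but differ in social activity (the factor labels are taken from the summary above):

new_people <- data.frame(
  agea    = c(30, 30),
  hlthhmp = factor(c("No", "No"), levels = levels(NL$hlthhmp)),
  sclact  = factor(c("Less than most", "Much more than most"), levels = levels(NL$sclact))
)
predict(model2, newdata = new_people)
# the more socially active person is predicted to spend a few minutes more per day on the Internet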

Comparison of model fit

Now that we have two models that may predict Internet usage in the Netherlands, we are going to compare them and see which one fits our data better. To do so, we apply an ANOVA test on the two nested models.

This way, we are going to compare the first model, with one predictor (age), and the second model, with three predictors (age, being hampered in daily activities, and frequency of taking part in social activities compared to others of the same age).

  • H0: removed factors are not different from 0, full model is not better;
  • H1: removed factors are different from 0, full model is better.

In our case, the full model is model2, as it has three predictors, while model1 has only one.

anova(model1, model2)

The ANOVA test reveals a small p-value (less than 0.05), which allows us to reject the null hypothesis. We can also have a look at the RSS, which is smaller for the second model. This way, we can conclude that model2 is significantly better at capturing the data.
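If we want to refer to the p-value and the residual sums of squares programmatically rather than read them off the printed table, a small sketch on top of the same anova() call:

cmp <- anova(model1, model2)
cmp$RSS          # residual sum of squares for model1 and model2
cmp$`Pr(>F)`[2]  # p-value of the F-test comparing the nested models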

Conclusion

Taking all the analyses into consideration, we may say that our substantive hypotheses are supported, but the correlation is very weak. That was seen after testing with the Pearson and Spearman tests. The same was confirmed by our models, which explained about 13-14% of the variability of Internet usage in the Netherlands. We also found out that it is better to analyse Internet usage with more than just respondents’ age - the ANOVA test has shown that the addition of 2 new predictors is a good idea.

It is also important to remember that we explored data from 2016, not the most recent year! Maybe things have changed throughout the last 3 years. It would be interesting to see if they have and how exactly, but (unfortunately) we have no such opportunity yet. All in all, it was an interesting investigation!

Both models can be seen in more detail in the following table:

tab_model(model1, model2)
Predictors                    model1 (netustm)                  model2 (netustm)
                              Estimate  CI              p       Estimate  CI              p
(Intercept)                   42.23     40.24 – 44.23   <0.001  44.82     39.24 – 50.39   <0.001
agea                          -0.39     -0.44 – -0.34   <0.001  -0.39     -0.44 – -0.34   <0.001
hlthhmp: Yes to some extent   –                                 -3.49     -8.14 – 1.15    0.141
hlthhmp: No                   –                                 -1.21     -5.61 – 3.19    0.591
sclact: Less than most        –                                 -1.00     -5.02 – 3.01    0.624
sclact: About the same        –                                 -2.39     -6.29 – 1.51    0.229
sclact: More than most        –                                 1.39      -2.78 – 5.56    0.515
sclact: Much more than most   –                                 2.39      -4.08 – 8.87    0.469
Observations                  1442                              1442
R2 / R2 adjusted              0.130 / 0.130                     0.140 / 0.136

Not the end yet… stay tuned…

Odoo’s upgraded 4th project with new predictors, research question and hypotheses! = the LAST project!

Nevertheless, we were not satisfied with the results of our linear regression models. For this reason, we decided to try other predictors, for example working hours per week and gender.

We assumed that the effect of the variable “wkhtot” (working hours per week) may differ based on the level of the variable “gndr” (respondent’s gender) because, firstly, women generally use the Internet less than men (Thanuskodi, 2013). Moreover, males and females tend to have different types of jobs, and Internet usage may depend on a person’s type of job.

In this project, our research question was whether the relationship between the number of working hours per week and the number of minutes spent on the Internet per day differs based on gender. In other words, does the effect of the “wkhtot” variable differ based on the level of the “gndr” variable?

  • H0: the effect of the variable “wkhtot” (working hours per week) does not differ based on the level of variable “gndr”;
  • H1: the effect of the variable “wkhtot” (working hours per week) differs based on the level of variable “gndr”.

Contribution and loading the data

In this part of project our contribution is:

  • Anna Moroz - editing previous projects, producing interaction plots and interpreting them;
  • Anna Podolskaya - testing the improvement in R-squared due to interaction, knitting EVERYTHING together (including all the previous projects), arranging in a nice file, correcting some little mistakes + writing additional intro and conclusion, because I can (giggles mysteriously);
  • Anastasia Prokudina - description of the distribution of the outcome and predictors, calculating the correlation between continuous predictors, checking significance of variables` effects and the presence of interaction effect, interpretation all the regression coefficients;
  • Olga Chemodanova - writing introduction, research question with hypotheses, conclusion.

Let’s upgrade our dataset with new variables:

NL <- ESS %>% select (agea, netustm, sclact, hlthhmp, gndr, wkhtot, wkhtotp)
NL = na.omit(NL)
NL$netustm <- as.numeric(NL$netustm)
NL$wkhtot<- as.numeric(NL$wkhtot)
dim(NL)
## [1] 587   7

The resulting dataset that we are going to work with contains 7 variables with 587 observations each.

Visualised distributions of the outcome and predictors

  • Distribution of respondents’ total hours normally worked per week in main job:
ggplot(NL, aes(x = wkhtot)) +
  ggtitle("Respondents total hours normally worked per week in main job distribution") +
  xlab("Hours") + 
  ylab("Count") +
  geom_histogram(binwidth = 4, fill = "sienna1", col= "sienna2", alpha = 0.7) +
  geom_vline(aes(xintercept = mean(NL$wkhtot)), linetype="solid", color="#8B0000", size=1) +
  geom_vline(aes(xintercept = median(NL$wkhtot)), linetype="dashed", color="white", size=1) +
  theme_bw()

ggqqplot(NL$wkhtot)

summary(NL$wkhtot)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   24.00   35.00   33.72   42.00   68.00

From the histogram and Q-Q plot it can be seen that the distribution is close to normal, being a little bit right-skewed, with some people who work many more hours than the mean (solid line on the histogram). The largest number of hours worked per week is 68, while the mean is about 34 hours per week.

  • Distribution of the number of minutes spent on the Internet on a typical day among the respondents:
ggplot(NL, aes(x = netustm)) +
  geom_histogram(binwidth = 7, fill = "sienna1", col= "sienna2", alpha = 0.7) + 
  labs(x = "Number of minutes spent on the Internet on a typical day", y = "count") +
  theme_bw() +
  ggtitle("The distribution of number of minutes spent on the Internet on a typical day") +
  geom_vline(aes(xintercept = mean(NL$netustm)), linetype="solid", color="#8B0000", size=1) +
  geom_vline(aes(xintercept = median(NL$netustm)), linetype="dashed", color="white", size=1) 

#checking the normality of distribution
ggqqplot(NL$netustm)

summary(NL$netustm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   17.00   26.00   29.05   39.00  102.00

From the histogram and Q-Q plot it can be seen that the distribution is not normal and is right-skewed, so most respondents spent fewer minutes on the Internet than the mean (solid line on the histogram). The largest number of minutes spent on the Internet on a typical day was 102, while the average among respondents was approximately 29 minutes.

  • Distribution of respondents’ gender:
ggplot() + 
  geom_bar(data = NL, aes(x = gndr), fill = "red2", alpha = 0.6) +
  ylab("Number of respondents") +
  xlab("Gender") +
  ggtitle("Gender distribution") +
  theme_bw() 

From the bar plot it can be seen that there are more females than males in the sample.

Correlation test

First of all, let’s run a correlation test to check whether any correlational relationship is possible between our chosen variables. This can be done with the parametric Pearson correlation method.

Hypotheses:

  • H0: the population correlation coefficient between working hours and Internet usage is not significantly different from zero;
  • H1: the population correlation coefficient between working hours and Internet usage is significantly different from zero.
cor.test(NL$wkhtot, NL$netustm, method = "pearson") 
## 
##  Pearson's product-moment correlation
## 
## data:  NL$wkhtot and NL$netustm
## t = 3.5709, df = 585, p-value = 0.000385
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06590721 0.22432974
## sample estimates:
##       cor 
## 0.1460547

The p-value is smaller than 0.05, which allows us to reject the null hypothesis, meaning that there is a correlation between working hours and Internet usage. The correlation coefficient equals 0.146, indicating a weak positive correlation between the variables.

Let’s visualise the relationship between the number of working hours and Internet usage with a line of best fit that summarises all points on the scatterplot as one overall tendency.

ggplot(NL, aes(x = wkhtot , y = as.numeric(netustm))) +
  geom_point(shape = 19, color = "tomato") + 
  geom_smooth(method = "lm", se = FALSE, color = "purple4") +
  labs(subtitle="Number of working hours Vs Internet Usage", 
       y="Time Spent On the Internet", 
       x="Number of working hours", 
       title="Scatterplot") +
  theme_bw()

Additive linear regression model

With this model, we are going to predict how much time people spend on the Internet in a day (the ‘netustm’ variable) from the number of working hours per week (the ‘wkhtot’ variable) and gender (the ‘gndr’ variable).

model1 <- lm(netustm ~ wkhtot + gndr, data = NL)
summary(model1)
## 
## Call:
## lm(formula = netustm ~ wkhtot + gndr, data = NL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.741 -12.562  -3.769   9.607  74.540 
## 
## Coefficients:
##             Estimate Std. Error t value       Pr(>|t|)    
## (Intercept)  19.7144     3.1498   6.259 0.000000000751 ***
## wkhtot        0.2416     0.0679   3.558       0.000405 ***
## gndrFemale    1.9484     1.8432   1.057       0.290908    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.12 on 584 degrees of freedom
## Multiple R-squared:  0.0232, Adjusted R-squared:  0.01986 
## F-statistic: 6.936 on 2 and 584 DF,  p-value: 0.001055

In this model, we can see that the p-value is less than 0.05, meaning that our model is better than having no model. The adjusted R-squared equals 0.02, meaning that the model explains about 2% of the variance of the predicted variable.

The equation of the linear model is the following:

\[\hat{InternetUsageInMinutes} = 19.71 + 0.24*WorkingHours + 1.95*IfFemale\]

It shows us that a man who works 0 hours per week is predicted to spend 19.71 minutes on the Internet. With every additional working hour, a person is predicted to spend 0.24 minutes more, and for a woman the number of minutes increases by a further 1.95.
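As a quick check of the equation (a sketch; the “Female” label is taken from the model summary above):

predict(model1, newdata = data.frame(wkhtot = 40,
                                     gndr = factor("Female", levels = levels(NL$gndr))))
# roughly 19.71 + 0.24*40 + 1.95, i.e. about 31 minutes for a woman working 40 hours per week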

Interaction linear regression model

In this model we predict how much time people spend on the Internet in a day (the ‘netustm’ variable) from the number of working hours per week (the ‘wkhtot’ variable), whose effect is moderated by gender (the ‘gndr’ variable). This means that the effect of the working hours per week variable differs based on the level of the gender variable.

model2 <- lm(netustm ~  wkhtot * gndr   , data = NL)
summary(model2)
## 
## Call:
## lm(formula = netustm ~ wkhtot * gndr, data = NL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.831 -12.841  -4.163   9.362  75.075 
## 
## Coefficients:
##                    Estimate Std. Error t value      Pr(>|t|)    
## (Intercept)        31.42028    5.11836   6.139 0.00000000154 ***
## wkhtot             -0.03127    0.11603  -0.270       0.78763    
## gndrFemale        -13.63973    5.69532  -2.395       0.01694 *  
## wkhtot:gndrFemale   0.41229    0.14263   2.891       0.00399 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.01 on 583 degrees of freedom
## Multiple R-squared:  0.037,  Adjusted R-squared:  0.03205 
## F-statistic: 7.467 on 3 and 583 DF,  p-value: 0.00006522

Looking at the summary, we can see that the p-value is less than 0.05, meaning that our model is better than having no model. The adjusted R-squared equals 0.03, meaning that the model explains about 3% of the variance of the predicted variable. The p-value for the interaction coefficient is smaller than 0.05, meaning that the effect of working hours on Internet usage differs by gender.

The equations of the linear model, written out with the interaction and then separately for females and males, are:

\[\hat{InternetUsageInMinutes} = 31.42 - 0.03*WorkingHours - 13.64*IfFemale + 0.41*WorkingHours*IfFemale\]

\[\hat{InternetUsageInMinutes(F)} = (31.42 - 13.64) + (-0.03 + 0.41)*WorkingHours = 17.78 + 0.38*WorkingHours\]

\[\hat{InternetUsageInMinutes(M)} = 31.42 - 0.03*WorkingHours\]

In other words, a female with 10 working hours per week is predicted to spend on the Internet:

\[31.42 - 0.03*10 - 13.64*1 + 0.41*10*1 \approx 21.6 (minutes)\]

And a male with 10 working hours will spend:

\[31.42 - 0.03*10 \approx 31.1 (minutes)\]
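The same numbers can be obtained directly with predict() (a small sketch; “Female” comes from the summary above, while “Male” is assumed to be the other value label in the data):

predict(model2, newdata = data.frame(wkhtot = c(10, 10),
                                     gndr = factor(c("Female", "Male"), levels = levels(NL$gndr))))
# approximately 21.6 minutes for the woman and 31.1 minutes for the man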

Comparing models

Let’s compare the additive model and the interaction model.

  • H0: the interaction coefficient is equal to 0, so the interaction model is not better than the additive model;
  • H1: the interaction coefficient is different from 0, so the interaction model is better than the additive model.
anova(model1, model2)
tab_model(model1, model2)
Predictors             model1, additive (netustm)       model2, interaction (netustm)
                       Estimate  CI              p      Estimate  CI               p
(Intercept)            19.71     13.54 – 25.89   <0.001  31.42     21.39 – 41.45    <0.001
wkhtot                 0.24      0.11 – 0.37     <0.001  -0.03     -0.26 – 0.20     0.788
gndr: Female           1.95      -1.66 – 5.56    0.291   -13.64    -24.80 – -2.48   0.017
wkhtot : gndr Female   –                                 0.41      0.13 – 0.69      0.004
Observations           587                               587
R2 / R2 adjusted       0.023 / 0.020                     0.037 / 0.032

The ANOVA test reveals a small p-value (less than 0.05), which allows us to reject the null hypothesis, so we can conclude that the interaction model (model2) is significantly better at capturing the data than the additive model. Moreover, if we take a look at the adjusted R-squared, which is bigger for model2, we may conclude that the variation of the outcome variable (netustm) is better explained by the second (interaction) model.

The marginal-effect plot

# here we save our variables under prettier names so that the plot labels look nicer
NL$Gender <- NL$gndr
NL$Working_hours_per_week <- NL$wkhtot
NL$Internet_usage <- NL$netustm

modelC <- lm(Internet_usage ~ Working_hours_per_week * Gender, data = NL)
plot_model(modelC, type = "int", title = "Predicted values of Internet usage variable", colors = "Dark2") + 
  theme_sjplot2()

The marginal-effect plot above depicts the predicted values of our model by gender group (male or female). The lines showing the general trend of predicted values for each group are supplemented with their confidence intervals, which seem to overlap everywhere except the interval from roughly 49 to 70 working hours. This means that the predicted amount of time spent on the Internet per day differs statistically significantly between the gender groups only in that range of total working hours per week, while from 0 to about 49 hours the difference in predictions for the two groups is not statistically significant. Besides, the interaction between gender and working hours is clearly visible, as the lines on the graph have diverging slopes.

The overall conclusion for the last project

To conclude, the alternative hypothesis is supported: the effect of the variable “wkhtot” (working hours per week) does differ based on the level of the variable “gndr”, so the relationship between the number of working hours and the number of minutes spent on the Internet among the population of the Netherlands depends on gender. The final (interaction) model’s adjusted R-squared is about 0.03, which means that the model explains roughly 3% of the variance in the number of minutes spent on the Internet. Generally, men spend about the same number of minutes on the Internet no matter how many hours a week they work; their predicted time even decreases slightly as working hours increase. The number of minutes spent on the Internet by women, by contrast, varies with the number of working hours per week: as working hours increase, the time spent on the Internet increases as well. As we mentioned earlier, this may be due to the specifics of the jobs held by women in the Netherlands.

And here’s the end of all our projects.

Congrats to all of us!

IN CONCLUSION…

We would like to say THANK YOU to the whole team of our teachers and teaching assistants for all their help and knowledge. You are great guys, honestly :) We send you this project with lots of love!!!

# a little parametric heart curve as a goodbye :)
dat <- data.frame(t = seq(0, 2*pi, by = 0.1))
xhrt <- function(t) 16*sin(t)^3
yhrt <- function(t) 13*cos(t) - 5*cos(2*t) - 2*cos(3*t) - cos(4*t)
dat$x <- xhrt(dat$t)
dat$y <- yhrt(dat$t)
with(dat, plot(x, y, type = "l", col = "hotpink"))
with(dat, polygon(x, y, col = "pink"))