Hello, there!
As usual, short intro about us.
Our team-members and their responsibilities:
Our data:
As usual, here are all the packages that we need. Let’s open them (or install and open):
library(foreign)
library(dplyr)
library(magrittr)
library(lubridate)
library(ggplot2)
library(psych)
library(knitr)
library(lsr)
library(vcd)
library(corrplot)
library(RColorBrewer)
library(ggpubr)
library(sjPlot)
Recently, our team has found out an interesting fact: the country we chose to study, that is Netherlands, was at the top of the ranking of the European Union countries with the highest Internet access at home in 2018 (source: https://www.statista.com/statistics/379130/internet-usage-at-home-netherlands/). Further exploration (source: https://www.cbs.nl/en-gb/news/2018/05/the-netherlands-leads-europe-in-internet-access) revealed that, in 2017, 98% of Dutch households had Internet access, compared to the average 87% in Europe. Also, the article from the cbs.nl web-site points out the popularity of mobile Internet in Netherlands, saying that about 87% of population, which is quite a big number, uses the Internet on their mobile devices outside of home or work.
We got curious about this prevalence of the Internet in Netherlands and decided to give it a bit of research. We wanted to check if there are any factors that contribute to the Internet usage among the Dutch. Thus, we settled on applying the regression analysis on data that we got from European Social Survey (ESS) database. As the ESS database allows us to perform the analysis using data only from the 2016 round, or even the later ones, we follow the idea to predict the Internet usage variability in Netherlands just before it became so popular in this country. We assume, it could give us some insights into the prerequisites of such a tendency.
In the regression analysis we used the ‘netustm’ variable that speaks for how much time, in minutes, a respondent spends using the internet on a computer, tablet, smartphone or other device, whether for work or personal use. To predict this as an outcome, we, first of all, are trying a respondent’s age (‘agea’) as an explanatory variable. Further, to see a bigger picture, we are adding to the model respondents being hampered in daily activities by illness/disability/infirmity/mental problem (‘hlthhmp’ variable) and the frequency of taking part in social activities compared to others of same age (‘sclact’ variable) as additional predictors. This way we suggest that if there is any inconvenience for a person to take part in daily activities, then he/she may spend more time in the Internet as a distraction from issues of the day. We also support this model with the frequency of taking part in social activities as it may contibute to the proposition a lot.
Here we want to find out if there were are any relationships between minutes spent on the Internet on a typical day (dependent variable or outcome) and age of respondents (independent variable or predictor) in the Netherlands. In the second model, we also added another predictor, health, to age. We also wanted to predict minutes spent on the Internet on a typical day based on these two predictors in the Netherlands.
Our hypotheses were that the older the respondents, the less time they spent on the Internet on a typical day. Moreover, we assumed that healthy respondents spent less time on the Internet, as they devoted time to other activities. Also, respondents with bad health may spend more time on the Internet as they search for information about their diseases, or physically do not have the opportunity to actively spend time and surf the Internet instead.
Loading our dataset:
ESS <- read.spss("ESS8NL.sav", use.value.labels=T, to.data.frame=T)
NL <- ESS %>% select (agea, netustm, sclact, hlthhmp)
NL = na.omit(NL)
NL$netustm <- as.numeric(NL$netustm)
NL$agea <- as.numeric(NL$agea)
dim(NL)
## [1] 1442 4
summary(NL)
## agea netustm sclact
## Min. : 1.00 Min. : 1.00 Much less than most: 93
## 1st Qu.:20.00 1st Qu.: 12.00 Less than most :386
## Median :34.00 Median : 26.00 About the same :631
## Mean :34.13 Mean : 28.93 More than most :290
## 3rd Qu.:48.00 3rd Qu.: 39.00 Much more than most: 42
## Max. :75.00 Max. :102.00
## hlthhmp
## Yes a lot : 69
## Yes to some extent: 310
## No :1063
##
##
##
ggplot(NL, aes(x = agea)) +
ggtitle("Respondents' age distribution") +
xlab("Age") +
ylab("Count") +
geom_histogram(binwidth = 3, fill = "sienna1", col= "sienna2", alpha = 0.7) +
geom_vline(aes(xintercept = mean(NL$agea)), linetype="solid", color="#8B0000", size=1) +
geom_vline(aes(xintercept = median(NL$agea)), linetype="dashed", color="white", size=1) +
theme_bw()
#Checking the normality of distribution
ggqqplot(NL$agea)
summary(NL$agea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 20.00 34.00 34.13 48.00 75.00
From the histogram and Q-Q plot it can be concluded that the distribution is not normal, bit close to it (mean and median are almost the same). The mean age of respondents is roughly 34 years. The oldest respondent is 75 years old. Generally, the majoruty of respondents are younger than 60 years.
ggplot(NL, aes(x = netustm)) +
geom_histogram(binwidth = 7, fill = "sienna1", col= "sienna2", alpha = 0.7) +
labs(x = "Number of minutes spent on the Internet on a typical day", y = "count") +
theme_bw() +
ggtitle("The distribution of number of minutes spent on the Internet on a typical day") +
geom_vline(aes(xintercept = mean(NL$netustm)), linetype="solid", color="#8B0000", size=1) +
geom_vline(aes(xintercept = median(NL$agea)), linetype="dashed", color="white", size=1)
#checking the normality of distribution
library(ggpubr)
ggqqplot(NL$netustm)
summary(NL$netustm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 12.00 26.00 28.93 39.00 102.00
From the histogram, and Q-Q plot it can be seen that the distribution is not normal and right-skewed, so many respondents spent less minutes on the Internet compared to the mean score (solid line on the histogram). The largest number of minutes spent on the Internet on a typical day was 102 minutes while the average number of minutes on the Internet among respondents was approximately 29 minutes.
ggplot() +
geom_bar(data = NL, aes(x = sclact), fill = "red2", alpha = 0.7) +
xlab("The frequency of taking part in social activities compared to others of same age") +
ggtitle("Distribution of the frequency of taking part in social activities\ncompared to others of same age") +
theme_bw()
It can be seen that more than 600 people take part in social activity about the same often as others of the same age. Almost 400 people spend less than most while almost 300 people - more than most. There are much less respondents who take part in social activities much less as well as much more frequent than those of the same age, almost 100 people and less than 50 people correspondingly.
ggplot() +
geom_bar(data = NL, aes(x = hlthhmp), fill = "red2", alpha = 0.7) +
xlab("Respondents being hampered in daily activities by illness/disability/infirmity/mental problem") +
ggtitle("Distribution of respondents being hampered in daily activities\nby illness/disability/infirmity/mental problem") +
theme_bw()
The majority of respondents are not hampered in daily activities by illness/disability/infirmity/mental problem. About 310 people are humpered to some extent and only ~70 people are humpered in daily activities due to these reasons.
ggplot() +
geom_bar(data = NL, aes(x = sclact, fill = hlthhmp), position = "fill") +
xlab("The frequency of taking part in social activities compared to others of same age") +
ggtitle("The distribution of frequency of taking part in social activities compared to others of same age\nwith the fact if respondents being hampered\nin daily activities by illness/disability/infirmity/mental problem") +
scale_fill_brewer(palette="Reds")+
theme_bw()
In general, the greater percentage of respondents who take part in social activities much less than others of the same age are hampered by illness/disability/infirmity/mental problem either a lot or to some extent compared to others.
ggplot() +
geom_boxplot(data = NL, aes(x = sclact, y = netustm))+
xlab("The frequency of taking part in social activities compared to others of same age") +
ylab("Number of minutes spent on the Internet on a typical day") +
ggtitle("Numer of minutes spent on the Internet on a typical day\nby the frequency of taking part in social activities compared to others of same age") +
theme_bw()
NL %>% group_by(sclact) %>% dplyr::summarise(mean_netuse = mean(netustm))
As can be seen, boxplots are almost the same: their medians are around 20-27 minutes among all categories of taking part of social activities compared toothers of same age.The same can be said about boxplots whiskers and the 1st and the 3rd quatiles. All boxplots have outliers, but one that represents number of minutes on the Internet of people who partcicpate in social activities more often than most of the same age, there are less of them.
Also, from the table we can see that the mean number of minutes spent on the Internet is almost the same among respondents of different groups by frequency of taking part in social activities.
ggplot() +
geom_boxplot(data = NL, aes(x = hlthhmp, y = netustm)) +
xlab("Respondents being hampered in daily activities by illness/disability/infirmity/mental problem") +
ylab("Number of minutes spent on the Internet on a typical day") +
ggtitle("Distribution of respondents being hampered in daily activities by illness/disability/infirmity/mental problem\nby number of minutes spent on the Internet on a typical day") +
theme_bw()
NL %>% group_by(hlthhmp) %>% dplyr::summarise(mean_internetuse = mean(netustm))
So, the medians and the 1st and the 3rd quartiles of the boxplots are almost the same, but the last boxplot has larger upper whisker and more outliers. That means that more people who are not hampered in daily activities by illness/disability/infirmity/mental problem spend spend ~60-80 minutes on the Internet per day than those who are not hampered or only to some extent. Still, that may be due to the fact that this group of people is larger than the other two.
From the table it can be concluded that people who are not hampered in daily activities habe the highest mean number of minutes spent on the Internet.
First of all, let’s run a correlation test to check if any correlative relationship is possible between our chosen variables. It can be done with the parametric Pearson correlation method and the non-parametric Spearman correlation method which test the correlation coefficient for significance.
As we have close to a normal distribution of the age variable, we can apply the Pearson test.
cor.test(as.numeric(NL$age), as.numeric(NL$netustm), method = "pearson")
##
## Pearson's product-moment correlation
##
## data: as.numeric(NL$age) and as.numeric(NL$netustm)
## t = -14.701, df = 1440, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4053000 -0.3154979
## sample estimates:
## cor
## -0.3612362
The resulting p-value is much smaller than 0.05 that allows us to reject the null hypothesis. Besides, we can see that the correlation coefficient is equal to -0.36, indicating a negative, relatively weak correlation between variables.
To double-check the significance of the correlation coefficient we apply the Spearman test. As well…
cor.test(as.numeric(NL$age), as.numeric(NL$netustm), method = "spearman")
##
## Spearman's rank correlation rho
##
## data: as.numeric(NL$age) and as.numeric(NL$netustm)
## S = 705640000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.4120086
The resulting p-value is much smaller than 0.05 that allows us to reject the null hypothesis. Besides, we can see that the correlation coefficient is equal to -0.41, indicating negative, relatively weak, yet stronger than shown with the Pearson test, correlation between variables.
Now, let’s visualize our first model which includes the relationship between age and Internet usage with the “best” fit line that approximates all points on the scatterplot to one tendency line.
# Firstly we centered the 'agea' variable to make intercept interpretable
NL$agea_centered <- NL$agea - mean(NL$agea)
Orange scatterplot is built with centered variable. You can also have a glance on the graph with no centration for the variable of respondents’ age - it is red-colored.
In general, it can be seen that the majority of respondents spend less than 60 minutes on the Internet per day. Also, people from ~65 years spend less than 60 minutes per day or do not use the Internet at all. Age and number of minutes spent on the Internet are negatively linearly related, means that with the increase of age the numer of minutes spent on the Internet decreases. Still, the relationship is quite weak (as it was assumed by both correlation tests). Now, let’s proceed to the first linear regression model.
By this model, we are going to predict how much time people spend on the Internet in a day (‘netustm’ variable) by age variable predictor(‘agea’).
model1 <- lm(netustm ~ agea, data = NL)
summary(model1)
##
## Call:
## lm(formula = netustm ~ agea, data = NL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.387 -12.309 -3.835 8.522 76.483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.2313 1.0178 41.49 <2e-16 ***
## agea -0.3896 0.0265 -14.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.72 on 1440 degrees of freedom
## Multiple R-squared: 0.1305, Adjusted R-squared: 0.1299
## F-statistic: 216.1 on 1 and 1440 DF, p-value: < 2.2e-16
By looking at the summary, we can see that the model is a significant fit to the data and it is better than no model, as F-statistics line shows the p-value (< 2.2e-16) is much smaller than the conventional cut off of 0.05.
R-squared value, i.e. the coefficient of determination, equals 0.1299 and indicates that our model (model1) explains about 12% of the variability of Internet usage in the Netherlands. At the same time, there is a residual standard error (RSE) value that can tell us about the difference between the observed time spent on the Internet and the time spent on the Internet predicted by the model (model1). This difference is 17.66 percentage points.
Looking at the coefficients in the summary, we can construct a linear model equation. To accomplish that, we should know the coefficients for our equation, namely, the slope value and the intercept value, which have been discovered by now. To quickly remind ourselves of these numbers, we can call the following:
coef(model1)
## (Intercept) agea
## 42.231334 -0.389613
So, the intercept value is 42.22, while the slope value is -0,39. Thus, the linear model is the following: \[\hat{InternetUsageInMinutes} = 42.23 - 0.39*age\]
It shows us that, when one’s age equals to 0, he/she typically spends 42.23 minutes on the Internet. With every additional year of age, a person spends 0.39 minutes less.
As our age variable doesn’t contain meaningful 0 values, we can also make the same model with the centered age variable as a predictor. That means that we move the mean attendance to zero by subtracting the mean from the variable.
model1_centered <- lm(netustm ~ agea_centered, data = NL)
summary(model1_centered)
##
## Call:
## lm(formula = netustm ~ agea_centered, data = NL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.387 -12.309 -3.835 8.522 76.483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.9334 0.4666 62.01 <2e-16 ***
## agea_centered -0.3896 0.0265 -14.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.72 on 1440 degrees of freedom
## Multiple R-squared: 0.1305, Adjusted R-squared: 0.1299
## F-statistic: 216.1 on 1 and 1440 DF, p-value: < 2.2e-16
Now, the intercept point in the model was changed and equals 28.93. The centered linear model is the following: \[\hat{InternetUsageInMinutes} = 28.93 - 0.39*age\] It suggests that when one’s age equals to 0, he/she typically spends 28.93 minutes on the Internet. With every additional year of age, a person spends 0.39 minutes less.
Now that we have two models that may predict Internet usage in Netherlands, we are going to compare them and see which model fits better to our data. In order to do so, we apply the ANOVA testing.
Firstly, we compare the first model with one predictor, age, and the second model.
In our case, full model is model2, as it has three predictors, while model1 has only one predictor.
anova(model1, model2)
The anova testing reveals a small p-value, as it is less than 0.05, that allows us to reject the null hypothesis. Also we can have a look on RSS, which is smaller for the second model. This way, we can conclude that model2 is significantly better at capturing the data.
Taking all the analyses into consideration we may say that our substantive hypotheses are supported, but the correlation excessively weak. That was seen after testing with Pearson and Spearman tests. The same was confirmed with our models, which explained about 12-13% of the variability of Internet usage in the Netherlands. Also we found out that it’s better to analyse internet usage not only with variable of reprondents’ age - ANOVA testing have shown that additon of 2 new predictors is a good idea.
Is is also important to remember, that we explored the data for 2016 year, not the most recent one! Maybe things have changed throughout 3 years. It would be interesting to see if it is and how, but (unfortunately) for today we have no such oportunity. All in all it was an interesting investigation!
Both models can be seen in the table:
| netustm | netustm | |||||
|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 42.23 | 40.24 – 44.23 | <0.001 | 44.82 | 39.24 – 50.39 | <0.001 |
| agea | -0.39 | -0.44 – -0.34 | <0.001 | -0.39 | -0.44 – -0.34 | <0.001 |
| Yes to some extent | -3.49 | -8.14 – 1.15 | 0.141 | |||
| No | -1.21 | -5.61 – 3.19 | 0.591 | |||
| Less than most | -1.00 | -5.02 – 3.01 | 0.624 | |||
| About the same | -2.39 | -6.29 – 1.51 | 0.229 | |||
| More than most | 1.39 | -2.78 – 5.56 | 0.515 | |||
| Much more than most | 2.39 | -4.08 – 8.87 | 0.469 | |||
| Observations | 1442 | 1442 | ||||
| R2 / R2 adjusted | 0.130 / 0.130 | 0.140 / 0.136 | ||||
Thank you for attention and have a nice weekends!!!