~ Introduction

Hello, there!

As usual, short intro about us.

Our team-members and their responsibilities:

  • Anna Moroz - introduction, second model with 3 predictors (including its summary, interpretation and linear equation), comparison of two models via ANOVA, creating scatterplots;
  • Anna Podolskaya - explanation whether the hypotheses are falsified or supported, creating an output table, formulating a conclusion, coloring graphs, arranging the final layout;
  • Anastasia Prokudina - selection of variables, first model with 1 predictor (including its summary, interpretation and linear equation);
  • Olga Chemodanova - research question and hypotheses, descriptive statistics for the variables, scatterplot description.

Our data:

  • country, we are interested in is Netherlands;
  • the data was collected in 2016;
  • the theme is “Subjective well-being”.

As usual, here are all the packages that we need. Let’s open them (or install and open):

library(foreign)
library(dplyr)
library(magrittr)
library(lubridate)
library(ggplot2)
library(psych)
library(knitr)
library(lsr)
library(vcd)
library(corrplot)
library(RColorBrewer)
library(ggpubr)
library(sjPlot)

Going deeper into the topic

Recently, our team has found out an interesting fact: the country we chose to study, that is Netherlands, was at the top of the ranking of the European Union countries with the highest Internet access at home in 2018 (source: https://www.statista.com/statistics/379130/internet-usage-at-home-netherlands/). Further exploration (source: https://www.cbs.nl/en-gb/news/2018/05/the-netherlands-leads-europe-in-internet-access) revealed that, in 2017, 98% of Dutch households had Internet access, compared to the average 87% in Europe. Also, the article from the cbs.nl web-site points out the popularity of mobile Internet in Netherlands, saying that about 87% of population, which is quite a big number, uses the Internet on their mobile devices outside of home or work.

We got curious about this prevalence of the Internet in Netherlands and decided to give it a bit of research. We wanted to check if there are any factors that contribute to the Internet usage among the Dutch. Thus, we settled on applying the regression analysis on data that we got from European Social Survey (ESS) database. As the ESS database allows us to perform the analysis using data only from the 2016 round, or even the later ones, we follow the idea to predict the Internet usage variability in Netherlands just before it became so popular in this country. We assume, it could give us some insights into the prerequisites of such a tendency.

In the regression analysis we used the ‘netustm’ variable that speaks for how much time, in minutes, a respondent spends using the internet on a computer, tablet, smartphone or other device, whether for work or personal use. To predict this as an outcome, we, first of all, are trying a respondent’s age (‘agea’) as an explanatory variable. Further, to see a bigger picture, we are adding to the model respondents being hampered in daily activities by illness/disability/infirmity/mental problem (‘hlthhmp’ variable) and the frequency of taking part in social activities compared to others of same age (‘sclact’ variable) as additional predictors. This way we suggest that if there is any inconvenience for a person to take part in daily activities, then he/she may spend more time in the Internet as a distraction from issues of the day. We also support this model with the frequency of taking part in social activities as it may contibute to the proposition a lot.

Research question and hypotheses

Here we want to find out if there were are any relationships between minutes spent on the Internet on a typical day (dependent variable or outcome) and age of respondents (independent variable or predictor) in the Netherlands. In the second model, we also added another predictor, health, to age. We also wanted to predict minutes spent on the Internet on a typical day based on these two predictors in the Netherlands.

Our hypotheses were that the older the respondents, the less time they spent on the Internet on a typical day. Moreover, we assumed that healthy respondents spent less time on the Internet, as they devoted time to other activities. Also, respondents with bad health may spend more time on the Internet as they search for information about their diseases, or physically do not have the opportunity to actively spend time and surf the Internet instead.

Dataset

Loading our dataset:

ESS <- read.spss("ESS8NL.sav", use.value.labels=T, to.data.frame=T)
NL <- ESS %>% select (agea, netustm, sclact, hlthhmp)
NL = na.omit(NL)
NL$netustm <- as.numeric(NL$netustm)
NL$agea <- as.numeric(NL$agea)
dim(NL)
## [1] 1442    4

Descriptive statistics for the variables

Summary of the data:

summary(NL)
##       agea          netustm                       sclact   
##  Min.   : 1.00   Min.   :  1.00   Much less than most: 93  
##  1st Qu.:20.00   1st Qu.: 12.00   Less than most     :386  
##  Median :34.00   Median : 26.00   About the same     :631  
##  Mean   :34.13   Mean   : 28.93   More than most     :290  
##  3rd Qu.:48.00   3rd Qu.: 39.00   Much more than most: 42  
##  Max.   :75.00   Max.   :102.00                            
##                hlthhmp    
##  Yes a lot         :  69  
##  Yes to some extent: 310  
##  No                :1063  
##                           
##                           
## 

Distribution of our variables

  • Distribution of respondents` age:
ggplot(NL, aes(x = agea)) +
  ggtitle("Respondents' age distribution") +
  xlab("Age") + 
  ylab("Count") +
  geom_histogram(binwidth = 3, fill = "sienna1", col= "sienna2", alpha = 0.7) +
  geom_vline(aes(xintercept = mean(NL$agea)), linetype="solid", color="#8B0000", size=1) +
  geom_vline(aes(xintercept = median(NL$agea)), linetype="dashed", color="white", size=1) +
  theme_bw()

#Checking the normality of distribution
ggqqplot(NL$agea)

summary(NL$agea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   20.00   34.00   34.13   48.00   75.00

From the histogram and Q-Q plot it can be concluded that the distribution is not normal, bit close to it (mean and median are almost the same). The mean age of respondents is roughly 34 years. The oldest respondent is 75 years old. Generally, the majoruty of respondents are younger than 60 years.

  • Distribution of the number of minutes spent on the Internet on a typical day among the respondents:
ggplot(NL, aes(x = netustm)) +
  geom_histogram(binwidth = 7, fill = "sienna1", col= "sienna2", alpha = 0.7) + 
  labs(x = "Number of minutes spent on the Internet on a typical day", y = "count") +
  theme_bw() +
  ggtitle("The distribution of number of minutes spent on the Internet on a typical day") +
  geom_vline(aes(xintercept = mean(NL$netustm)), linetype="solid", color="#8B0000", size=1) +
  geom_vline(aes(xintercept = median(NL$agea)), linetype="dashed", color="white", size=1) 

#checking the normality of distribution
library(ggpubr)
ggqqplot(NL$netustm)

summary(NL$netustm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   12.00   26.00   28.93   39.00  102.00

From the histogram, and Q-Q plot it can be seen that the distribution is not normal and right-skewed, so many respondents spent less minutes on the Internet compared to the mean score (solid line on the histogram). The largest number of minutes spent on the Internet on a typical day was 102 minutes while the average number of minutes on the Internet among respondents was approximately 29 minutes.

  • Distribution of the frequency of taking part in social activities compared to others of same age:
ggplot() + 
  geom_bar(data = NL, aes(x = sclact), fill = "red2", alpha = 0.7) +
  xlab("The frequency of taking part in social activities compared to others of same age") +
  ggtitle("Distribution of the frequency of taking part in social activities\ncompared to others of same age") +
  theme_bw() 

It can be seen that more than 600 people take part in social activity about the same often as others of the same age. Almost 400 people spend less than most while almost 300 people - more than most. There are much less respondents who take part in social activities much less as well as much more frequent than those of the same age, almost 100 people and less than 50 people correspondingly.

  • Distribution of respondents being hampered in daily activities by illness/disability/infirmity/mental problems:
ggplot() + 
  geom_bar(data = NL, aes(x = hlthhmp), fill = "red2", alpha = 0.7) +
  xlab("Respondents being hampered in daily activities by illness/disability/infirmity/mental problem") +
  ggtitle("Distribution of respondents being hampered in daily activities\nby illness/disability/infirmity/mental problem") +
  theme_bw()

The majority of respondents are not hampered in daily activities by illness/disability/infirmity/mental problem. About 310 people are humpered to some extent and only ~70 people are humpered in daily activities due to these reasons.

  • This barchart illustrates the distribution of frequency of taking part in social activities compared to others of same age along with the fact if respondents being hampered in daily activities by illness/disability/infirmity/mental problem.
ggplot() + 
  geom_bar(data = NL, aes(x = sclact, fill = hlthhmp), position = "fill") +
  xlab("The frequency of taking part in social activities compared to others of same age") +
  ggtitle("The distribution of frequency of taking part in social activities compared to others of same age\nwith the fact if respondents being hampered\nin daily activities by illness/disability/infirmity/mental problem") + 
  scale_fill_brewer(palette="Reds")+
  theme_bw()

In general, the greater percentage of respondents who take part in social activities much less than others of the same age are hampered by illness/disability/infirmity/mental problem either a lot or to some extent compared to others.

Boxplots for internet activity

  • The boxplots showing number of minutes spent on the Internet on a typical day by the frequency of taking part in social activities compared to others of same age:
ggplot() + 
  geom_boxplot(data = NL, aes(x = sclact, y = netustm))+
  xlab("The frequency of taking part in social activities compared to others of same age") +
  ylab("Number of minutes spent on the Internet on a typical day") +
  ggtitle("Numer of minutes spent on the Internet on a typical day\nby the frequency of taking part in social activities compared to others of same age") +
  theme_bw()

NL %>% group_by(sclact) %>% dplyr::summarise(mean_netuse = mean(netustm))

As can be seen, boxplots are almost the same: their medians are around 20-27 minutes among all categories of taking part of social activities compared toothers of same age.The same can be said about boxplots whiskers and the 1st and the 3rd quatiles. All boxplots have outliers, but one that represents number of minutes on the Internet of people who partcicpate in social activities more often than most of the same age, there are less of them.

Also, from the table we can see that the mean number of minutes spent on the Internet is almost the same among respondents of different groups by frequency of taking part in social activities.

  • boxplots representing the distribution of respondents being hampered in daily activities by illness/disability/infirmity/mental problem by number of minutes spent on the Internet on a typical day:
ggplot() + 
  geom_boxplot(data = NL, aes(x = hlthhmp, y = netustm)) +
  xlab("Respondents being hampered in daily activities by illness/disability/infirmity/mental problem") +
  ylab("Number of minutes spent on the Internet on a typical day") +
  ggtitle("Distribution of respondents being hampered in daily activities by illness/disability/infirmity/mental problem\nby number of minutes spent on the Internet on a typical day") +
  theme_bw()

NL %>% group_by(hlthhmp) %>% dplyr::summarise(mean_internetuse = mean(netustm))

So, the medians and the 1st and the 3rd quartiles of the boxplots are almost the same, but the last boxplot has larger upper whisker and more outliers. That means that more people who are not hampered in daily activities by illness/disability/infirmity/mental problem spend spend ~60-80 minutes on the Internet per day than those who are not hampered or only to some extent. Still, that may be due to the fact that this group of people is larger than the other two.

From the table it can be concluded that people who are not hampered in daily activities habe the highest mean number of minutes spent on the Internet.

Correlation test

First of all, let’s run a correlation test to check if any correlative relationship is possible between our chosen variables. It can be done with the parametric Pearson correlation method and the non-parametric Spearman correlation method which test the correlation coefficient for significance.

As we have close to a normal distribution of the age variable, we can apply the Pearson test.

  • The null hypothesis (H0): the population correlation coefficient between age and Internet usage is not significantly different from zero;
  • the alternative hypothesis (H1): the population correlation coefficient between age and Internet usage is significantly different from zero.
cor.test(as.numeric(NL$age), as.numeric(NL$netustm), method = "pearson") 
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(NL$age) and as.numeric(NL$netustm)
## t = -14.701, df = 1440, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4053000 -0.3154979
## sample estimates:
##        cor 
## -0.3612362

The resulting p-value is much smaller than 0.05 that allows us to reject the null hypothesis. Besides, we can see that the correlation coefficient is equal to -0.36, indicating a negative, relatively weak correlation between variables.

To double-check the significance of the correlation coefficient we apply the Spearman test. As well…

  • The null hypothesis (H0): the population correlation coefficient between age and Internet usage is not significantly different from zero;
  • the alternative hypothesis (H1): the population correlation coefficient between age and Internet usage is significantly different from zero.
cor.test(as.numeric(NL$age), as.numeric(NL$netustm), method = "spearman") 
## 
##  Spearman's rank correlation rho
## 
## data:  as.numeric(NL$age) and as.numeric(NL$netustm)
## S = 705640000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.4120086

The resulting p-value is much smaller than 0.05 that allows us to reject the null hypothesis. Besides, we can see that the correlation coefficient is equal to -0.41, indicating negative, relatively weak, yet stronger than shown with the Pearson test, correlation between variables.

Model 1: age vs. internet usage

Visualisation of the 1st model

Now, let’s visualize our first model which includes the relationship between age and Internet usage with the “best” fit line that approximates all points on the scatterplot to one tendency line.

# Firstly we centered the 'agea' variable to make intercept interpretable
NL$agea_centered <- NL$agea - mean(NL$agea)

Orange scatterplot is built with centered variable. You can also have a glance on the graph with no centration for the variable of respondents’ age - it is red-colored.

In general, it can be seen that the majority of respondents spend less than 60 minutes on the Internet per day. Also, people from ~65 years spend less than 60 minutes per day or do not use the Internet at all. Age and number of minutes spent on the Internet are negatively linearly related, means that with the increase of age the numer of minutes spent on the Internet decreases. Still, the relationship is quite weak (as it was assumed by both correlation tests). Now, let’s proceed to the first linear regression model.

First linear regression model and its summary

By this model, we are going to predict how much time people spend on the Internet in a day (‘netustm’ variable) by age variable predictor(‘agea’).

model1 <- lm(netustm ~ agea, data = NL) 
summary(model1) 
## 
## Call:
## lm(formula = netustm ~ agea, data = NL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.387 -12.309  -3.835   8.522  76.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.2313     1.0178   41.49   <2e-16 ***
## agea         -0.3896     0.0265  -14.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.72 on 1440 degrees of freedom
## Multiple R-squared:  0.1305, Adjusted R-squared:  0.1299 
## F-statistic: 216.1 on 1 and 1440 DF,  p-value: < 2.2e-16

By looking at the summary, we can see that the model is a significant fit to the data and it is better than no model, as F-statistics line shows the p-value (< 2.2e-16) is much smaller than the conventional cut off of 0.05.

R-squared value, i.e. the coefficient of determination, equals 0.1299 and indicates that our model (model1) explains about 12% of the variability of Internet usage in the Netherlands. At the same time, there is a residual standard error (RSE) value that can tell us about the difference between the observed time spent on the Internet and the time spent on the Internet predicted by the model (model1). This difference is 17.66 percentage points.

Linear model equation

Looking at the coefficients in the summary, we can construct a linear model equation. To accomplish that, we should know the coefficients for our equation, namely, the slope value and the intercept value, which have been discovered by now. To quickly remind ourselves of these numbers, we can call the following:

coef(model1)
## (Intercept)        agea 
##   42.231334   -0.389613

So, the intercept value is 42.22, while the slope value is -0,39. Thus, the linear model is the following: \[\hat{InternetUsageInMinutes} = 42.23 - 0.39*age\]

It shows us that, when one’s age equals to 0, he/she typically spends 42.23 minutes on the Internet. With every additional year of age, a person spends 0.39 minutes less.

As our age variable doesn’t contain meaningful 0 values, we can also make the same model with the centered age variable as a predictor. That means that we move the mean attendance to zero by subtracting the mean from the variable.

model1_centered <- lm(netustm ~ agea_centered, data = NL)
summary(model1_centered)
## 
## Call:
## lm(formula = netustm ~ agea_centered, data = NL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.387 -12.309  -3.835   8.522  76.483 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    28.9334     0.4666   62.01   <2e-16 ***
## agea_centered  -0.3896     0.0265  -14.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.72 on 1440 degrees of freedom
## Multiple R-squared:  0.1305, Adjusted R-squared:  0.1299 
## F-statistic: 216.1 on 1 and 1440 DF,  p-value: < 2.2e-16

Now, the intercept point in the model was changed and equals 28.93. The centered linear model is the following: \[\hat{InternetUsageInMinutes} = 28.93 - 0.39*age\] It suggests that when one’s age equals to 0, he/she typically spends 28.93 minutes on the Internet. With every additional year of age, a person spends 0.39 minutes less.

Model 2: age, hamperedness in daily activities and frequency of participation in social activities vs. internet usage

Visualisation of the 2nd model

Here on the graphs the whole picture can be seen (which does not differ from the graph you have already seen) and also faceting for each variable. It is possible to visualize linear relationships between several predictors and an outcome via 3d scatterplot, but we did that by facets. It can be seen that different in cases of both being hampered in daily activities and participating in social activities different categories have different slopes of linear regression line, that is also reflected in linear model equation. All the graphs whow negative relation.

Second linear regression model and its summary

Going further, let’s construct our second regression model, using age, hamperedness in daily activities and frequency of participation in social activities variables as predictors for how much people spend on the Internet.

model2 <- lm(netustm ~ agea + hlthhmp + sclact, data = NL)
summary(model2)
## 
## Call:
## lm(formula = netustm ~ agea + hlthhmp + sclact, data = NL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.057 -11.714  -3.800   8.049  76.063 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               44.81548    2.84378  15.759   <2e-16 ***
## agea                      -0.38888    0.02696 -14.426   <2e-16 ***
## hlthhmpYes to some extent -3.49172    2.36995  -1.473    0.141    
## hlthhmpNo                 -1.20876    2.24681  -0.538    0.591    
## sclactLess than most      -1.00335    2.04908  -0.490    0.624    
## sclactAbout the same      -2.39317    1.98919  -1.203    0.229    
## sclactMore than most       1.38576    2.12743   0.651    0.515    
## sclactMuch more than most  2.39442    3.30360   0.725    0.469    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.66 on 1434 degrees of freedom
## Multiple R-squared:  0.1401, Adjusted R-squared:  0.1359 
## F-statistic: 33.38 on 7 and 1434 DF,  p-value: < 2.2e-16

In the summary, we see that p-value for the F-statistics is smaller than 0.05 that makes this model significant fit to the data. The coefficient of determination (R^2) value is 0.136, showing that the second model explains about 13% of the outcome variable, i.e. Internet usage.

Linear model equation

Looking at the coefficients in the summary, we can construct linear model equation.

coef(model2)
##               (Intercept)                      agea 
##                44.8154848                -0.3888826 
## hlthhmpYes to some extent                 hlthhmpNo 
##                -3.4917222                -1.2087594 
##      sclactLess than most      sclactAbout the same 
##                -1.0033456                -2.3931728 
##      sclactMore than most sclactMuch more than most 
##                 1.3857626                 2.3944152

The linear model, a bit overloaded but still, is: \[\hat{InternetUsageInMinutes} = 44.8 - 0.38*age - 3.49*hamperedToSomeExtent + \\ - 1.2*notHampered - SociallyMeetLessThanMost - 2.39*SociallyMeetAboutTheSame + \\ + 1.39*SociallyMeetMoreThanMost + 2.39*SociallyMeetMuchMoreThanMost\]

That suggests that when one’s age, hamperedness and social activity equal to 0, he/she typically spends 44.8 minutes on the Internet. With every additional year of age, a person spends 0.38 minutes less. The more the person is hampered in their daily activities, the less they spend time on the Net: the presence of hamperedness in daily life decreases the amount of time spent in the Internet by 3.49 minutes. It’s interesting to look at the positive correlation: the more person is socially active, the more time they spend in the Net: being more active in social life, than others of the same age, increase the time spend in the Internet by 1.38 minutes and being much more active in social life increase the time spend in the Internet by 2.39 minutes.

Comparison of model fit

Now that we have two models that may predict Internet usage in Netherlands, we are going to compare them and see which model fits better to our data. In order to do so, we apply the ANOVA testing.

Firstly, we compare the first model with one predictor, age, and the second model.

  • The null hypothesis (H0): removed factors are not different from 0, full model is not better;
  • the alternative hypothesis (H1): removed factors are different from 0, full model is better.

In our case, full model is model2, as it has three predictors, while model1 has only one predictor.

anova(model1, model2)

The anova testing reveals a small p-value, as it is less than 0.05, that allows us to reject the null hypothesis. Also we can have a look on RSS, which is smaller for the second model. This way, we can conclude that model2 is significantly better at capturing the data.

Conclusion

Taking all the analyses into consideration we may say that our substantive hypotheses are supported, but the correlation excessively weak. That was seen after testing with Pearson and Spearman tests. The same was confirmed with our models, which explained about 12-13% of the variability of Internet usage in the Netherlands. Also we found out that it’s better to analyse internet usage not only with variable of reprondents’ age - ANOVA testing have shown that additon of 2 new predictors is a good idea.

Is is also important to remember, that we explored the data for 2016 year, not the most recent one! Maybe things have changed throughout 3 years. It would be interesting to see if it is and how, but (unfortunately) for today we have no such oportunity. All in all it was an interesting investigation!

Both models can be seen in the table:

  netustm netustm
Predictors Estimates CI p Estimates CI p
(Intercept) 42.23 40.24 – 44.23 <0.001 44.82 39.24 – 50.39 <0.001
agea -0.39 -0.44 – -0.34 <0.001 -0.39 -0.44 – -0.34 <0.001
Yes to some extent -3.49 -8.14 – 1.15 0.141
No -1.21 -5.61 – 3.19 0.591
Less than most -1.00 -5.02 – 3.01 0.624
About the same -2.39 -6.29 – 1.51 0.229
More than most 1.39 -2.78 – 5.56 0.515
Much more than most 2.39 -4.08 – 8.87 0.469
Observations 1442 1442
R2 / R2 adjusted 0.130 / 0.130 0.140 / 0.136

Thank you for attention and have a nice weekends!!!