INTRODUCTION

Roles

Anna Gorobtsova - Build and interpret the first model, interpret coefficients, write out regression equations

Elizaveta Dyachenko - Build and interpret the second model, interpret coefficients, write out regression equations, edit and format final document

Marina Romanova - Formulate research question, describe and visualise the variables, explain whether the hypothesis has been falsified or not

Alex Stephenson - Compare the models, test accuracy of the models through prediction, create an output table, edit and format final document

Downloading Libraries & Data

library(foreign)
library(ggplot2)
library(sjPlot)
library(dplyr)  
library(sjlabelled)  
library(lubridate)
library(GGally)
library(moderndive)
library(scales)
library(stargazer)
library(broom)
library(psych)

ESS <- read.spss("ESS7SI.sav", na.rm = T, to.data.frame = T, use.value.labels = T)
data <- select(ESS, alcwkdy, alcwknd, gndr)
data <- na.omit(data)

Research Question & Hypothesis

The health problems associated with drinking are both well documented and vary from country to country. Given this, when looking at improving public health having culturally and socially relative targets are vital.

We know that Slovenians drink, on average, 11.6L of alcohol a year, putting them 24th ranked globally - just above the United Kingdom and slightly below Germany. We also know that alcohol consumption is roughly evenly split along beer and wine (44.5% and 46.9% respectively) with only 8.6% being made up by spirits.

What isn’t as well documented is how this alcohol is consumed - over the course of a Friday or Saturday night or evenly thorough out the week. Consequences for the former can be severe (in short, in moderation the body can filter out the harmful products of a few drinks - however once you start to over-consume your liver can’t cope and these harmful products spread around your body (Cullen et al, 2011)).

To get an idea of drinking within Slovenia we decided to take a look at their weekday drinking and compare it to their weekend drinking and then look at how we can use this data to make predictions and solve health problems. If there was a large difference between weekend drinking and weekday drinking this would indicate a culture of binge drinking - and thus influence public health campaigns. On the other hand, if there is only a moderate to no increase this suggests Slovenian drinking culture to be evenly spread, indicating different public health priorities.

Our research question aims to study the relationship between drinking during the week and gender, and drinking at the weekend. We expect that the results will show that men are more likely than women to drink at all as drinking bahaviour is much more widespread among men (Wilsnack et al., 2009). Also, the fact that people drink much more on weekends that on week days (Lau-Barraco et al., 2016) can lead us to the thought that the more people drink on a weekday the more they drink on a weeekend day. Thus, our hypothesis is:

The gender of a person and the amount of alcohol drunk by him/her during the week can predict the amount of alcohol drunk at the weekend by this person.

Variables

For our report we chose the following variables:

gndr - Gender of the respondent
alcwkdy - Grams of alcohol, last time drinking on a weekday, Monday to Thursday
alcwknd - Grams of alcohol, last time drinking on a weekend day, Friday to Sunday

Firstly, we check if our variables are of the correct type.

str(data)
## 'data.frame':    938 obs. of  3 variables:
##  $ alcwkdy: Factor w/ 80 levels "0","4","7","9",..: 15 1 2 1 29 9 32 19 11 19 ...
##  $ alcwknd: Factor w/ 87 levels "0","4","7","9",..: 12 5 2 11 38 9 5 32 18 25 ...
##  $ gndr   : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 2 1 1 1 1 ...
##  - attr(*, "variable.labels")= Named chr  "Title of dataset" "ESS round" "Edition" "Production date" ...
##   ..- attr(*, "names")= chr  "name" "essround" "edition" "proddate" ...
##  - attr(*, "codepage")= int 65001
##  - attr(*, "na.action")= 'omit' Named int  1 3 10 11 12 13 15 16 17 18 ...
##   ..- attr(*, "names")= chr  "1" "3" "10" "11" ...

All the variables except for ‘gender’ should be numeric. Thus, we changed the type of these variables.

for (i in 1:2)
{
  data[,i] <- as.numeric(as.character(data[,i]))
}  

summary(data)
##     alcwkdy          alcwknd           gndr    
##  Min.   :  0.00   Min.   :  0.00   Male  :466  
##  1st Qu.: 10.00   1st Qu.: 10.00   Female:472  
##  Median : 13.00   Median : 19.00               
##  Mean   : 21.63   Mean   : 27.81               
##  3rd Qu.: 26.00   3rd Qu.: 38.00               
##  Max.   :354.00   Max.   :286.00

Now, it looks fine!

DESCRIPTIVE STATISTICS

We decided to present our description of chosen variables in form of the table.

Data frame: data
ID Name Label Values Value Labels
1 alcwkdy Grams of alcohol drunk on last weekday range: 0-354
2 alcwknd Grams of alcohol drunk on last weekend day range: 0-286
3 gndr Gender Male
Female

Then we made graphs for all of our variables and needed pairs of variables.

Gender

As we can see, amount of males and females in our data is almost the same, so we do not need to apply set.seed() function in order to make it equal.

Alcohol Consumption

Let’s firstly build graphs of alcohol consumption on weekday and weekends.

Both distributions are not destributed normaly and both right-tailed. Let’s also build qqplot separated for both variables.

qqnorm(data$alcwknd); qqline(data$alcwknd, col= "red", lty = 5, lwd = 2)

qqnorm(data$alcwkdy); qqline(data$alcwkdy, col = "red", lty = 5, lwd = 2)

Distributions are not normal as the line is not straight.

t.test(data$alcwkdy, data$alcwknd, paired = F, var.equal = F)
## 
##  Welch Two Sample t-test
## 
## data:  data$alcwkdy and data$alcwknd
## t = -4.5873, df = 1862.5, p-value = 4.789e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.816316 -3.535497
## sample estimates:
## mean of x mean of y 
##  21.63326  27.80917

Difference in the amount of alcohol last consumed on a weekday and weekend does has statistical significance, with a p value >0.001.

RELATIONSHIP PLOTS

Males VS Females

As it is seen from the boxplots males drink more than females of both weekends and weekdays.

Weekdays VS Weekend Days

Finally, lets look at the relationship between amount of alcohol drunk on weekdays and weekend days.

There is a positive weak correlation between amount of alcohol drank on weekends and weekdays and it is 0.31.

MODELS

model1 <- lm(alcwknd ~ alcwkdy, data=data)
model2 <- lm(alcwknd ~ alcwkdy + gndr, data=data)  
## 
## ==============================================================================
##                                              Dependent variable:              
##                                -----------------------------------------------
##                                     Amount of alcohol drunk on a weekend      
##                                          (1)                     (2)          
## ------------------------------------------------------------------------------
## Amount of alcohol on a weekday        0.333***                0.288***        
##                                        (0.034)                 (0.034)        
##                                                                               
## Gender (Female)                                              -12.532***       
##                                                                (1.879)        
##                                                                               
## Constant                              20.598***               27.887***       
##                                        (1.190)                 (1.596)        
##                                                                               
## ------------------------------------------------------------------------------
## Observations                             938                     938          
## R2                                      0.095                   0.136         
## Adjusted R2                             0.094                   0.134         
## Residual Std. Error               28.823 (df = 936)       28.176 (df = 935)   
## F Statistic                    98.164*** (df = 1; 936) 73.601*** (df = 2; 935)
## ==============================================================================
## Note:                                              *p<0.1; **p<0.05; ***p<0.01

Model 1

The F-statistic is significant(p-value: < 2.2e-16), therefore we can say that our model is better than no model at all. According to adjusted R-squared, which shows the explanatory power of a model and estimates 0.09, we can claim that the given model explains aproximately 9% of the variance. Here the intercept is the predicted value of grams of alcohol consumed at weekends, when amount of grams of alcohol consumed at weekdays is equal to 0, here it estimates 20.6 grams. Regression equation for the first model:

\[Predicted Grams Of Alcohol On Weekend Day = 20,6 + 0,33 * Grams Of Alcohol On Weekday\]

The interprtation:
With each increase in grams of alcohol drank on weekday by one, grams of alcohol drank on weekend day increases by 0.33.

plot(model1, which=1)  

The graph named ‘Residuals vs Fitted’ shows that residuals are distributed not quite linearly because the red line is not straight enough and that the residuals are also not homoscedastic as the points are not equally distributed around the 0 line (observations 836, 176, 524 are all above 0).

Model 2

F-statistic is statistically significant as the p-value < 2.2e-16 and to be significant it needs to be less than 0.05. Thus, we can conclude that our model is better than the one that predicts the amount of alcohol that people drink on weekends with mean value. From the adjusted R-squared we can conclude that our second model explains 13% of variation in grams of alcohol people drink on weekends. The intercept is the predicted value of grams of alcohol drunk on weekends when the amount of grams of alcohol drank on weekdays is equal to 0 and a person is a male, and it is equal to 27,89 grams. So, our regression equation looks like this:

\[Predicted Grams Of Alcohol On Weekend Day = 27,89 + 0,29 * Grams Of Alcohol On Weekday - 12,53 * Female\]

The interpretation is:

With each increase in grams of alcohol drank on weekday by one, grams of alcohol drank on weekend day increases by 0.29 and if a respondent is a female, the amount of grams of alcohol drank on weekend day decreases by 12.53.

plot(model2, which = 1)  

The graph illustrates that residuals are distributed almost linearly because the red line is very close to the straight one and that the residuals are not homoscedastic as the points are not equally distributed around the 0 line (observations 836, 176, 524 are all above 0).

Model of Best Fit

Having made our models and examined the details of our various models we should try and deduce which model is the best fit. To do this we will use both the ANOVA function, to compare the statistical difference in the means, and then look at quantifying how well the model fits the data. Firstly, lets run ANOVA.

anova(model1, model2)

We have a p value of 4.298e-11 ( <0.001). This suggests that there is a significant difference between the mean values of our ANOVAs. In other words - our second model is better than our first. But by how much?

One way of understanding what this actually means - and by how much our models differ - is to run a comparison on the sum of square residuals and calculate the square root of the mean of residuals^2. The first gives us an overall figure to compare how much our model is out by and the second gives the average difference between our model and the data point.

rgrs <- list(model1, model2)

ssq_function <- function(lm) {
  get_regression_points(lm) %>%
    mutate(sq_residuals = residual^2) %>%
    summarize(sum_sq_residuals = sum(sq_residuals), rmse = sqrt(mean(sq_residuals)))
}

for(i in 1:2) {
  print(ssq_function(rgrs[[i]]))
}
## # A tibble: 1 x 2
##   sum_sq_residuals  rmse
##              <dbl> <dbl>
## 1          777586.  28.8
## # A tibble: 1 x 2
##   sum_sq_residuals  rmse
##              <dbl> <dbl>
## 1          742279.  28.1

As we see, the sum of square residuals is 35306.9 lower for the second model than the first.

Example

In accordance with Bernardi et al’s best practice for reporting statistical significance we’re going to demonstrate the difference in our models.

An (imaginary) man comes to us and tells us he’s worried about how much his father drinks. We enquire how much his father usually drinks on a weekday. The man tells us he drinks 25g. We use our model to predict how much he is likely to be drinking on a weekend day (the man, owing to working on the weekend, is unable to closely monitor his father, but the empty vodka bottles he finds the following morning are worrying…) and thus whether this man has any cause to be concerned.

model1_pred <- data.frame("gndr" = 'Male', "alcwkdy" = 25)
augment(model1, newdata = model1_pred, type.predict = "response")

196g of alcohol is the recommended weekly allowance according to US health guidelines. So far we know he drinks 100g of alcohol, on average, during the week. Based on our first (more inaccurate) model we would conclude that he drinks a further 87g over the weekend (3 * 29). This totals at 187g a week - under the limit.

model2_pred <- data.frame("gndr" = 'Male', "alcwkdy" = 25)
augment(model2, newdata = model2_pred, type.predict = "response")

However, when we run the same data through our more accurate model we reach a different conclusion. Instead, we predict he drinks 105g (3 * 35) of alcohol over the weekend. This means his father drinks 205g, over the recommended limit.

In this instance, our more accurate model has allowed us to determine whether someone’s lifestyle has become harmful or not.

Finally, lets compare how accurate our prediction for this man has been compared to what our real data tells us.

man <- data %>%
  filter(gndr == 'Male') %>%
  filter(alcwkdy >= 24) %>%
  filter(alcwkdy <= 26) %>%
  summarize("Number of observations" = n(), "Grams of alcohol drunk on the weekend" = mean(alcwknd))
man

Comparing the real data to our prediction shows that the amount of alcohol drunk on the weekend was actually 35.6, compared to our models prediction of 35.1. Pretty accurate! Whilst our small sample size is only 13 it nonetheless shows the accuracy of using linear models for predictions.

CONCLUSION

We found alcohol consumption to increase by 28% at the weekend - an increase but not a drastic one. Furthermore, we found there to be a correlation between weekday and weekend consumption, albeit a weak one of 0.31.

Using our model of weekday consumption and gender we could explain 13% of the variance in our predictor variable - weekend consumption. When controlling for gender we expect to see a increase of 0.29g of weekend alcohol consumption for every gram of weekday consumption. The difference between weekend and weekday alcohol consumption was statistically significant. Finally, using ANOVA, sum of squares and square root of mean residuals we concluded the incorportation of gender into our model improved the accuracy of it, as demonstrated through our examples.

What does this mean for our understanding of Slovenian alcohol consumption? Understanding weekday consumption plays a part in weekend consumption yet other explanatory variables clearly play a role in the increase. As a whole, however, we do not believe this to be a pressing area for further study. The relatively modest increase suggests that binge drinking isn’t a huge problem and whilst other factors clearly play a role in the increase we do not believe it is substantial enough to warrant further investigation, and will instead look at other public health issues in the future.

Maybe add in some more info about our regression models but this is the minimum we need, I think. Tbh its not really obvious why we need to make a linear model for this but hopefully by including an example earlier on where we predict something we’ve demonstrated how our data can be applied