Anna Gorobtsova - Build and interpret the first model, interpret coefficients, write out regression equations
Elizaveta Dyachenko - Build and interpret the second model, interpret coefficients, write out regression equations, edit and format final document
Marina Romanova - Formulate research question, describe and visualise the variables, explain whether the hypothesis has been falsified or not
Alex Stephenson - Compare the models, test accuracy of the models through prediction, create an output table, edit and format final document
library(foreign)
library(ggplot2)
library(sjPlot)
library(dplyr)
library(sjlabelled)
library(lubridate)
library(GGally)
library(moderndive)
library(scales)
library(stargazer)
library(broom)
ESS <- read.spss("ESS7SI.sav", na.rm = T, to.data.frame = T, use.value.labels = T)
data <- select(ESS, alcwkdy, alcwknd, gndr)
data <- na.omit(data)Our research question aims to study the relationship between drinking during the week and gender and drinking at the weekend. We expect that the results will show that men are more likely than women to drink at all. Also, we think that the more a person drinks on a weekday the more he/she drinks on a weeekend day. Thus, our hypothesis is:
The gender of a person and the amount of alcohol drunk by him/her during the week can predict the amount of alcohol drunk at the weekend by this person.
For our report we chose the following variables:
gndr - Gender of the respondent
alcwkdy - Grams of alcohol, last time drinking on a weekday, Monday to Thursday
alcwknd - Grams of alcohol, last time drinking on a weekend day, Friday to Sunday
Firstly, we check if our variables are of the correct type.
str(data)## 'data.frame': 938 obs. of 3 variables:
## $ alcwkdy: Factor w/ 80 levels "0","4","7","9",..: 15 1 2 1 29 9 32 19 11 19 ...
## $ alcwknd: Factor w/ 87 levels "0","4","7","9",..: 12 5 2 11 38 9 5 32 18 25 ...
## $ gndr : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 2 1 1 1 1 ...
## - attr(*, "variable.labels")= Named chr "Title of dataset" "ESS round" "Edition" "Production date" ...
## ..- attr(*, "names")= chr "name" "essround" "edition" "proddate" ...
## - attr(*, "codepage")= int 65001
## - attr(*, "na.action")= 'omit' Named int 1 3 10 11 12 13 15 16 17 18 ...
## ..- attr(*, "names")= chr "1" "3" "10" "11" ...
All the variables except for ‘gender’ should be numeric. Thus, we changed the type of these variables.
for (i in 1:2)
{
data[,i] <- as.numeric(as.character(data[,i]))
}
summary(data)## alcwkdy alcwknd gndr
## Min. : 0.00 Min. : 0.00 Male :466
## 1st Qu.: 10.00 1st Qu.: 10.00 Female:472
## Median : 13.00 Median : 19.00
## Mean : 21.63 Mean : 27.81
## 3rd Qu.: 26.00 3rd Qu.: 38.00
## Max. :354.00 Max. :286.00
Now, it looks fine!
We decided to present our description of chosen variables in form of the table.
labs <- c("Grams of alcohol drunk on last weekday", "Grams of alcohol drunk on last weekend day", "Gender")
data <- set_label(data, label = labs)
view_df(data, show.prc = F, verbose = F) | ID | Name | Label | Values | Value Labels |
|---|---|---|---|---|
| 1 | alcwkdy | Grams of alcohol drunk on last weekday | range: 0-354 | |
| 2 | alcwknd | Grams of alcohol drunk on last weekend day | range: 0-286 | |
| 3 | gndr | Gender |
Male Female |
|
Then we made graphs for all of our variables and needed pairs of variables.
As we can see, amount of males and females in our data is almost the same, so we do not need to apply set.seed() function in order to make it equal.
qqnorm(data$alcwknd); qqline(data$alcwknd, col= "red", lty = 5, lwd = 2)qqnorm(data$alcwkdy); qqline(data$alcwkdy, col = "red", lty = 5, lwd = 2)Distributions are not normal.
t.test(data$alcwkdy, data$alcwknd, paired = F, var.equal = F)##
## Welch Two Sample t-test
##
## data: data$alcwkdy and data$alcwknd
## t = -4.5873, df = 1862.5, p-value = 4.789e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.816316 -3.535497
## sample estimates:
## mean of x mean of y
## 21.63326 27.80917
Difference in the amount of alcohol last consumed on a weekday and weekend does has statistical significance, with a p value >0.001.
ggplot(data, aes(y = alcwkdy, x = gndr)) +
geom_boxplot(fill = c("skyblue", "salmon")) +
stat_summary(fun.y = mean, geom = "point", shape = 4, size = 4) +
theme_classic() +
labs(title = "Weekday Drinking Between Males and Females", x = "Gender", y = "Amount of alcohol last drunk on a weekday")ggplot(data, aes(y = alcwknd, x = gndr)) +
geom_boxplot(fill = c("skyblue", "salmon")) +
stat_summary(fun.y = mean, geom = "point", shape = 4, size = 4) +
theme_classic() +
labs(title = "Weekend Drinking Between Males and Females", x = "Gender", y = "Amount of alcohol last drunk on a weekend")Finally, lets look at the relationship between amount of alcohol drunk on weekdays and weekend days.
corrl <- data.frame(cor(data$alcwkdy, data$alcwknd))
ggplot(data, aes(x = alcwkdy, y = alcwknd)) +
geom_point() +
geom_smooth(method = "lm") +
labs(caption = "Correlation = 0.31", y = "Amount of alcohol drunk on a weekend", x = "Amount of alcohol drunk on a weekday")model1 <- lm(alcwknd ~ alcwkdy, data=data)
model2 <- lm(alcwknd ~ alcwkdy + gndr, data=data)
stargazer(model1, model2, type="text",
dep.var.labels=c("Amount of alcohol drunk on a weekend"),
covariate.labels=c("Amount of alcohol on a weekday", "Gender (Female)"),
out="models.txt") ##
## ==============================================================================
## Dependent variable:
## -----------------------------------------------
## Amount of alcohol drunk on a weekend
## (1) (2)
## ------------------------------------------------------------------------------
## Amount of alcohol on a weekday 0.333*** 0.288***
## (0.034) (0.034)
##
## Gender (Female) -12.532***
## (1.879)
##
## Constant 20.598*** 27.887***
## (1.190) (1.596)
##
## ------------------------------------------------------------------------------
## Observations 938 938
## R2 0.095 0.136
## Adjusted R2 0.094 0.134
## Residual Std. Error 28.823 (df = 936) 28.176 (df = 935)
## F Statistic 98.164*** (df = 1; 936) 73.601*** (df = 2; 935)
## ==============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The F-statistic is significant(p-value: < 2.2e-16), therefore we can say that our model is better than no model at all. According to adjusted R-squared, which shows the explanatory power of a model and estimates 0.09, we can claim that the given model explains aproximately 9% of the variance. Here the intercept is the predicted value of grams of alcohol consumed at weekends, when amount of grams of alcohol consumed at weekdays is equal to 0, here it estimates 20.6 grams. Regression equation for the first model:
\[Predicted Grams Of Alcohol On Weekend Day = 20,6 + 0,33 * Grams Of Alcohol On Weekday\]
The interprtation:
With each increase in grams of alcohol drank on weekday by one, grams of alcohol drank on weekend day increases by 0.33.
plot(model1) The first graph named Residuals vs Fitted shows that residuals are distributed almost linearly because the points are almost equally spread around 0.
F-statistic is statistically significant as the p-value < 2.2e-16 and to be significant it needs to be less than 0.05. Thus, we can conclude that our model is better than the one that predicts the amount of alcohol that people drink on weekends with mean value. From the adjusted R-squared we can conclude that our second model explains 13% of variation in grams of alcohol people drink on weekends. The intercept is the predicted value of grams of alcohol drunk on weekends when the amount of grams of alcohol drank on weekdays is equal to 0 and a person is a male, and it is equal to 27,89 grams. So, our regression equation looks like this:
\[Predicted Grams Of Alcohol On Weekend Day = 27,89 + 0,29 * Grams Of Alcohol On Weekday - 12,53 * Female\]
The interpretation is:
With each increase in grams of alcohol drank on weekday by one, grams of alcohol drank on weekend day increases by 0.29 and if a respondent is a female, the amount of grams of alcohol drank on weekend day decreases by 12.53.
plot(model2) Having made our models and examined the details of our various models we should try and deduce which model is the best fit. To do this we will use both the ANOVA function, to compare the statistical difference in the means, and then look at quantifying how well the model fits the data. Firstly, lets run ANOVA.
anova(model1, model2)We have a p value of 4.298e-11 ( <0.001). This suggests that there is a significant difference between the mean values of our ANOVAs. In other words - our second model is better than our first. But by how much?
One way of understanding what this actually means - and by how much our models differ - is to run a comparison on the sum of square residuals and calculate the square root of the mean of residuals^2. The first gives us an overall figure to compare how much our model is out by and the second gives the average difference between our model and the data point.
rgrs <- list(model1, model2)
ssq_function <- function(lm) {
get_regression_points(lm) %>%
mutate(sq_residuals = residual^2) %>%
summarize(sum_sq_residuals = sum(sq_residuals), rmse = sqrt(mean(sq_residuals)))
}
for(i in 1:2) {
print(ssq_function(rgrs[[i]]))
}## # A tibble: 1 x 2
## sum_sq_residuals rmse
## <dbl> <dbl>
## 1 777586. 28.8
## # A tibble: 1 x 2
## sum_sq_residuals rmse
## <dbl> <dbl>
## 1 742279. 28.1
As we see, the sum of square residuals is 35306.9 lower for the second model than the first.
In accordance with Bernardi et al’s best practice for reporting statistical significance we’re going to demonstrate the difference in our models.
An (imaginary) man comes to us and tells us he’s worried about how much his father drinks. We enquire how much his father usually drinks on a weekday. The man tells us he drinks 25g. We use our model to predict how much he is likely to be drinking on a weekend day (the man, owing to working on the weekend, is unable to closely monitor his father, but the empty vodka bottles he finds the following morning are worrying…) and thus whether this man has any cause to be concerned.
model1_pred <- data.frame("gndr" = 'Male', "alcwkdy" = 25)
augment(model1, newdata = model1_pred, type.predict = "response")196g of alcohol is the recommended weekly allowance according to US health guidelines. So far we know he drinks 100g of alcohol, on average, during the week. Based on our first (more inaccurate) model we would conclude that he drinks a further 87g over the weekend (3 * 29). This totals at 187g a week - under the limit.
model2_pred <- data.frame("gndr" = 'Male', "alcwkdy" = 25)
augment(model2, newdata = model2_pred, type.predict = "response")However, when we run the same data through our more accurate model we reach a different conclusion. Instead, we predict he drinks 105g (3 * 35) of alcohol over the weekend. This means his father drinks 205g, over the recommended limit.
In this instance, our more accurate model has allowed us to determine whether someone’s lifestyle has become harmful or not.
Finally, lets compare how accurate our prediction for this man has been compared to what our real data tells us.
data %>%
filter(gndr == 'Male') %>%
filter(alcwkdy >= 24) %>%
filter(alcwkdy <= 26) %>%
summarize("Number of observations" = n(), "Grams of alcohol drunk on the weekend" = mean(alcwknd))Comparing the real data to our prediction shows that the amount of alcohol drunk on the weekend was actually 35.6, compared to our models prediction of 35.1. Pretty accurate! Whilst our small sample size is only 13 it nonetheless shows the accuracy of using linear models for predictions.
Lets summarize our findings! Our data roughly looks as follows:
There is a correlation of 0.31 between weekend and weekday alcohol consumption.
The difference between weekend and weekday alcohol consumption was statistically significant
Using ANOVA, sum of squares and square root of mean residuals we concluded the incorportation of gender into our model improved the accuracy of it, as demonstrated through our examples.
Our r / f statistics were
Our substantive hypothesis was falsified / supported