library(haven)
library(dplyr)
library(labelled)
library(lattice)
library(ggplot2)
#tinytex::install_tinytex()
This case study is an integrated set of exercises in R that build on the knowledge you gained during the first eight days of the course. The goal of the exercises is to determine whether physical activity is associated with alcohol use. To get to the answer to this question, you will first have to generate new variables from the data available and check the data before analysis.
Read in the entire dataset and select the data for your country
Dataset_casestudy <- read_sav("Dataset_casestudy1.sav") # Read .sav file
Eurobar_ES <- subset(Dataset_casestudy, v7 == "ES") # Select Spain
We are interested in the variables on alcohol use (alc: [v150-v154]) and on physical activity (pa: [v119-v122]). Furthermore, you will need information on age (v331), gender (v330), and country (v6). Names of the variables in relation to original Eurobarometer coding are given in table 1 of the exercises of Day 5.
Make a smaller dataset for you country with only the variables mentioned above
ES <- subset(Eurobar_ES, select = c(v6, v330, v150:v154, v119:v122, v331)) # Select variables
Rename the variables to the names mentioned in table 1 on Day 5
ES <- ES %>% # Rename using dplyr
rename(
country = v6,
alc12m = v150,
alcfreq_5dr = v151,
alc_30da = v152,
alcfreq_30d = v153,
alc_am_dd = v154,
pa7d_work = v119,
pa7d_mov = v120,
pa7d_house = v121,
pa7d_recr = v122,
gender = v330,
age = v331
)
Describe the data, using R to help you find information on number of observations, number of variables, number of missing values for each variable.
str(ES) # See variables and sample size
missing_counts <- sapply(ES, function(x) sum(is.na(x))) # Function to compute missing counts
print(missing_counts)
## country gender alc12m alcfreq_5dr alc_30da alcfreq_30d
## 0 0 0 337 337 409
## alc_am_dd pa7d_work pa7d_mov pa7d_house pa7d_recr age
## 410 21 0 2 1 0
There are 1003 observations in the sample with 12 variables in total. See missing counts table above for number of missing values for each variable.
Determine which of variables are actually categorical. Hint: determine the number of unique values per variable.
sapply(ES, function(x) length(unique(x))) # Function to see amount of unique values per variable.
## country gender alc12m alcfreq_5dr alc_30da alcfreq_30d
## 1 2 2 6 3 7
## alc_am_dd pa7d_work pa7d_mov pa7d_house pa7d_recr age
## 8 5 4 5 5 79
str(ES) # Check type of variable.
Every variable except for age is categorical.
Create a function that can convert the numerical variables that are categorical ones into factors and drop the levels that are non-informative (determine from the labels). Use the solution you found on day 5, where you already converted numerical variables into categorical ones. Function input arguments should be the variable and the levels to be dropped / set to missing. Inside the body of the function, use the variable to determine its levels and labels. Output should be the factor variable with the right labels and without non-informative levels.
var_names <- colnames(ES[,-12]) # Create vector of all variable names except for age.
# Loop through each variable name
for (x in var_names) {
lv <- attr(ES[[x]], "labels") # Get labels of ES
# Remove non-informative levels (9 and 99)
lv <- lv[!lv %in% c(9, 99)]
ES[[x]] <- factor(ES[[x]],
levels = lv,
labels = names(lv))
}
# Loop through each variable name
for (x in var_names) {
lv2 <- attr(ES[[x]], "levels") # Get levels of ES
# Remove non-informative levels (DK/Refusal/It depends/Don't remember etc.)
lv2 <- lv2[!lv2 %in% c("DK/Refusal", "DK/ Refusal", "DK", "Don't remember/Refusal (SPONT.)", "It depends (SPONT.)" )]
ES[[x]] <- factor(ES[[x]],
levels = lv2)
}
Use the function created in 1f to convert the categorical variables determined in exercise 1e to factors and check if the function worked correctly. NB. Use different names for the new variables to avoid that the original one is lost! In the remainder we will however refer to the original names.
str(ES) # Check if transformation succesful
## tibble [1,003 × 12] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 33 levels "France","Belgium",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ gender : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
## $ alc12m : Factor w/ 2 levels "Yes","No": 1 2 2 1 1 1 1 2 1 1 ...
## $ alcfreq_5dr: Factor w/ 5 levels "Several times a week",..: 5 NA NA 5 5 5 5 NA 2 4 ...
## $ alc_30da : Factor w/ 2 levels "Yes","No": 2 NA NA 1 1 1 1 NA 1 1 ...
## $ alcfreq_30d: Factor w/ 6 levels "Daily","4 - 5 times a week",..: NA NA NA 6 1 5 1 NA 1 4 ...
## $ alc_am_dd : Factor w/ 6 levels "Less than 1 drink",..: NA NA NA 2 2 1 1 NA 5 NA ...
## $ pa7d_work : Factor w/ 4 levels "A lot","Some",..: 3 4 4 4 1 4 4 4 1 2 ...
## $ pa7d_mov : Factor w/ 4 levels "A lot","Some",..: 2 3 4 3 3 3 3 4 3 1 ...
## $ pa7d_house : Factor w/ 4 levels "A lot","Some",..: 2 3 4 3 2 3 3 2 2 1 ...
## $ pa7d_recr : Factor w/ 4 levels "A lot","Some",..: 4 4 4 1 1 4 4 4 4 1 ...
## $ age : num [1:1003] 45 86 83 76 45 86 84 33 52 42 ...
## ..- attr(*, "label")= chr "D11 AGE EXACT"
## ..- attr(*, "format.spss")= chr "F2.0"
## ..- attr(*, "display_width")= int 6
Function worked correctly transforming all variables except age to factor and removing inappropriate labels.
Calculate descriptive statistics of all variables. Think well on what kind of statistic that would be depending on the type of variable.
hist(ES$age) # Check distribution of ages
summary(ES$age) # Calculate descriptive stats
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 33.00 47.00 48.55 64.00 96.00
table1 <- colnames(ES[,-c(1, 12)]) # Create vector of all variable names except for age and country.
# Loop through each variable name
for (x in table1) {
cat("\nVariable:", x, "\n") # Get variable name
table <- table(ES[[x]]) # Get the frequency table
table2 <- prop.table(table) # Make prop table with frequency table
print(table2)
}
##
## Variable: gender
##
## Male Female
## 0.441675 0.558325
##
## Variable: alc12m
##
## Yes No
## 0.665005 0.334995
##
## Variable: alcfreq_5dr
##
## Several times a week Once a week Once a month
## 0.1501502 0.1816817 0.1201201
## Less than once a month Never
## 0.2042042 0.3438438
##
## Variable: alc_30da
##
## Yes No
## 0.8918919 0.1081081
##
## Variable: alcfreq_30d
##
## Daily 4 - 5 times a week 2 - 3 times a week Once a week
## 0.23737374 0.09932660 0.21548822 0.26094276
## 2 - 3 times a month Once
## 0.11616162 0.07070707
##
## Variable: alc_am_dd
##
## Less than 1 drink 1 - 2 drinks 3 - 4 drinks 5 - 6 drinks
## 0.096774194 0.641765705 0.203735144 0.040747029
## 7 - 9 drinks 10 drinks or more
## 0.015280136 0.001697793
##
## Variable: pa7d_work
##
## A lot Some Little None
## 0.1344196 0.1771894 0.1079430 0.5804481
##
## Variable: pa7d_mov
##
## A lot Some Little None
## 0.1744766 0.4376869 0.2771685 0.1106680
##
## Variable: pa7d_house
##
## A lot Some Little None
## 0.1708292 0.3826174 0.2747253 0.1718282
##
## Variable: pa7d_recr
##
## A lot Some Little None
## 0.1586826 0.2804391 0.1816367 0.3792415
For age, which is a continuous variable, we choose a histogram to check the distribution and we calculate the median and the IQR.
For the categorical variables we calculate proportional tables for each one of them.
Make a bar chart of pa7d_work to visualize this variable. Use both
hist() and barplot() as a command to accomplish this and describe which
plot is preferred.
N.B. using barplot(pa7d_work) won’t
properly display the variable. Check what input the function
needs.
ggplot(ES, aes(x = pa7d_work)) + # Bar plot
geom_bar()
Bar chart works for categorical variables. Histogram is only for continuous, which pa7d_work is not and therefore we were not able to create a histogram. Thus, we prefer the bar chart.
Since we are interested in problematic alcohol use, the data on alcohol use will have to be processed to produce a new binary variable “problematic alcohol use”. Alcohol use may be defined as problematic when it is chronically too much (more than 40 g on average per day for male or 20 g per day for female persons) or when too much is consumed on a single occasion, i.e. more than 5 glasses (binge drinking). Questions on both issues are in the dataset. Define someone as a problematic drinker if he/she either chronically drinks too much and/or does binge drinking. You will now produce a new variable “problematic drinker” in a couple of steps. Do this for your own small dataset with only the records from your chosen country.
Define a new numeric variable “number of drinking days/month” (e.g. called n_drinksmonth) and use the information from alc_30da and freq_30d to fill this variable with numeric values on the number of days a persons was drinking alcohol per month. You may need to make some pragmatic assumptions here. E.g. concerning the number of days in category “Daily” should be converted to 30.
State your assumptions in a comment. There are many approaches exist
to do this conversion, e.g. like in exercise 24 of Day 1, using the
ifelse() function, or using the revalue() or
mapvalues() functions from plyr package.
NB. There are multiple solutions to do this conversion. If you
want to use conditional statements, remember that if() does
not work on a vector, while ifelse() does. Alternatively
use mapply() for loop or another solution.
levels(ES$alc_30da) # Drinking yes or no
## [1] "Yes" "No"
levels(ES$alcfreq_30d) # Frequency of drinking per 30 days
## [1] "Daily" "4 - 5 times a week" "2 - 3 times a week"
## [4] "Once a week" "2 - 3 times a month" "Once"
ES <- ES %>%
mutate(n_drinksmonth = # Use dplyr to create new variable n_drinksmonth.
ifelse(alc_30da == "No", 0, # Differentiate between drinkers and non-drinkers
ifelse(alcfreq_30d == "Daily", 30, # Specify numerical values for how much someone drinks
ifelse(alcfreq_30d == "4 - 5 times a week", 20,
ifelse(alcfreq_30d == "2 - 3 times a week", 12,
ifelse(alcfreq_30d == "Once a week", 4,
ifelse(alcfreq_30d == "2 - 3 times a month", 3,
ifelse(alcfreq_30d == "Once", 1, 0))))))))
Define a new variable “quantity of alcohol (in grams) consumed on a drinking day” (e.g. called quantday) and use the information from alc_30da and alc_am_dd to do so. Use the following assumption: 1 drink/consumption contains 12 g alcohol.
levels(ES$alc_am_dd) # Check levels that we need to quantify
## [1] "Less than 1 drink" "1 - 2 drinks" "3 - 4 drinks"
## [4] "5 - 6 drinks" "7 - 9 drinks" "10 drinks or more"
ES <- ES %>%
mutate(quantday = # Use dplyr to create new variable quantday.
ifelse(alc_30da == "No", 0, # Differentiate between drinkers and non-drinkers
ifelse(alc_am_dd == "Less than 1 drink", 6, # Specify numerical values for how much someone drinks in grams
ifelse(alc_am_dd == "1 - 2 drinks", 18,
ifelse(alc_am_dd == "3 - 4 drinks", 36,
ifelse(alc_am_dd == "5 - 6 drinks", 54,
ifelse(alc_am_dd == "7 - 9 drinks", 72,
ifelse(alc_am_dd == "10 drinks or more", 120, 0))))))))
Now calculate a third new variable “average quantity of alcohol (in grams) per day over the last month”, multiplying the two variables above and dividing the result by 30.
ES <- ES %>%
mutate(avgquant = (quantday * n_drinksmonth) / 30) # Calculate new variable
Make a histogram or barplot of daily intake created in exercise 2c.
hist(ES$avgquant) # Make histogram
Yes.
Yes because it can take on any value within a range.
Do the same as in exercise 2d, but now for men and women separately.
Hint: think of the trellis plots from the lattice
package.
histogram(~ avgquant | gender , data = ES) # Make charts for males and females.
Compute descriptive statistics for the daily (average) alcohol use in grams.
descriptives_quantday <- summary(ES$quantday) # Make descriptives
print(descriptives_quantday)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 18.00 18.00 20.47 18.00 120.00 342
Create a variable “heavy drinker” for men and women, using the cut-off values mentioned above (40 and 20 g daily, respectively).
levels(ES$gender) # Check levels
## [1] "Male" "Female"
ES <- ES %>% # use dplyr to create new variable heavydrinker
mutate(heavydrinker = ifelse(gender == "Male" & avgquant > 40, "Heavy drinker", # Specificy when someone is heavy drinker, differentiating between genders.
ifelse(gender == "Female" & avgquant > 20, "Heavy drinker", "Not")))
Now consider binge drinking in addition to drinking too much
Define a new variable “binge” that gets value 1 if someone is a binge drinker (i.e. >=5 glasses on a single occasion) and value 0 else. This has the same definition for men and women.
ES <- ES %>% # Create new variable with dplyr
mutate(binge =
ifelse(alc_am_dd %in% c("5 - 6 drinks", "7 - 9 drinks", "10 drinks or more"), 1, 0)) # Specify values when someone is binge drinker
Use these two variables (heavy drinker and binge) to fill a new
binary variable “problem drinker” using logistic operators
(|, &) in the appropriate way.
ES <- ES %>% # Create new variable with dplyr
mutate(problemdrinker =
ifelse(heavydrinker == "Heavy drinker" | binge == 1, 1, 0))
# Specify values when someone is a problem drinker
Check the contents of this new variable
table(ES$problemdrinker)
##
## 0 1
## 617 44
Yes.
Define a new variable “total physical activity”, adding relevant information from all “pa variables” on physical activity in various settings. Clearly explain your definitions and assumptions. Categorize people as “inactive” or “moderately active” or “active”.
levels(ES$pa7d_work) # Check levels
## [1] "A lot" "Some" "Little" "None"
levels(ES$pa7d_mov)
## [1] "A lot" "Some" "Little" "None"
levels(ES$pa7d_house)
## [1] "A lot" "Some" "Little" "None"
levels(ES$pa7d_recr)
## [1] "A lot" "Some" "Little" "None"
pa <- colnames(ES[, c(8:11)])
ES <- ES %>%
mutate(across(all_of(pa), # Recode into score
~ recode(.,
"A Lot" = 3, # Linear scoring system up to 3
"Some" = 2,
"Little" = 1,
"None" = 0), # 0 because there is no activity
.names = "num_{.col}")) %>% # Creates new variable with prefix "num_"
mutate(total_pa = num_pa7d_work + num_pa7d_mov + num_pa7d_house + num_pa7d_recr) %>%
mutate(total_pa = ifelse(total_pa < 4, "Inactive", # Assign categories of physical activity
ifelse(total_pa < 7, "Moderately Active", "Active"))) %>%
mutate(total_pa = as.factor(total_pa))
# Check if it went well
summary(ES$total_pa)
## Active Inactive Moderately Active NA's
## 88 248 257 410
We assign a score to the total physical activity based on the individual physical activity variables. This ranges from 0 to 3, indicating “none” up to “a lot” of activity. All 4 individual components add up to the final score, where under 4 is considered “Inactive”, from 4 to 6 is “Moderately Active”, and 7 and above is ” Active”.
Now these new variables will be used in a regression analysis to
analyze the effects of age, gender, and physical activity on daily
alcohol use and the assumptions of this analysis will be tested. Please
remember that we are using categorical variables here and therefore you
have check how your data have been stored in R, as numeric values or
factors, e.g. by using summary() or table().
In case of the first, the variable will be considered numeric. Sometimes
it matters how your data are stored and will it be worth trying the
alternative if you run into trouble or your results don’t make
sense.
Create a table of gender versus problem drinkers (yes/no). Apply a test to check for differences in percentages of problematic drinking between males and females.
table_gpb <- table(ES$gender, ES$problemdrinker) # Make contingency table
prop_gpb <- prop.table(table_gpb, margin = 1) * 100 # Make prop.table for percentages
chisq.test(prop_gpb) # Perform chi-square on prop table
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: prop_gpb
## X-squared = 0.010108, df = 1, p-value = 0.9199
P-value is 0.9199. This means there is no difference in the percentages of problematic drinkers between males and females.
Do the same for the variable on physical activity created in 2k versus problem drinkers, that is present and test for differences in problematic drinking in the three different categories of physical activity.
table(ES$total_pa, ES$problemdrinker)
##
## 0 1
## Active 66 1
## Inactive 124 4
## Moderately Active 171 6
# Some rows have less than 5, we use fisher exact test
fisher.test(ES$total_pa, ES$problemdrinker) # Fisher test
##
## Fisher's Exact Test for Count Data
##
## data: ES$total_pa and ES$problemdrinker
## p-value = 0.8482
## alternative hypothesis: two.sided
Fisher’s exact test was used to compare the difference between total physical activity and whether someone is a problem drinker or not. These are 2 categorical variables. We had to use this because some rows have less than 5.
P-value is 0.8482, which means that there is no difference in the total physical activity score between problem drinkers and non-problem drinkers.
Make a graph of percentages of problem drinkers versus age (in 10-year age categories). Test whether the distribution of age (continuous) is different between problem drinkers and non-problem drinkers. Check the summary statistics between the two groups.
ES <- ES %>%
mutate(agegroup10 = cut(age, seq(10, 100, 10))) # Make age categories
problemdrinker_percent <- ES %>%
group_by(agegroup10) %>%
summarize(
total = n(), # Total number of people in each age group
problem_drinkers = sum(problemdrinker == 1, na.rm = TRUE), # Number of problem drinkers
percentage = (problem_drinkers / total) * 100 # Percentage of problem drinkers
)
ggplot(problemdrinker_percent, aes(x = agegroup10, y = percentage)) +
geom_col() +
theme_classic() # Make plot to check distribution
# Normality tests
shapiro.test(ES$age) # Not significant
##
## Shapiro-Wilk normality test
##
## data: ES$age
## W = 0.9729, p-value = 9.821e-13
qqnorm(ES$age)
qqline(ES$age) # Deviates from line
wilcox.test(ES$age, ES$problemdrinker) # No normality so we use alternative test
##
## Wilcoxon rank sum test with continuity correction
##
## data: ES$age and ES$problemdrinker
## W = 662983, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
Checking the bar graph of percentages of problem drinkers across age groups, we see that the percentages of problem drinkers are concentrated in the younger ages (<40). Performing normality tests show that the age variable is not normally distributed. This means we need to use a non-parametric alternative. In this case, the Wilcoxon rank-sum test. Outcome: P = < 0.001. This means there is a difference in terms of age between the amount of problem drinkers.
Perform a multivariable test to determine the effects of age (continuous), gender and physical activity on problematic drinking. Think well about the type of regression. Explain what you found.
m1 <- glm(problemdrinker ~ age + gender + total_pa, data = ES)
summary(m1) # Fit logistic regression model
##
## Call:
## glm(formula = problemdrinker ~ age + gender + total_pa, data = ES)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0773309 0.0317278 2.437 0.0153 *
## age -0.0013105 0.0005087 -2.576 0.0104 *
## genderFemale -0.0121391 0.0179172 -0.678 0.4985
## total_paInactive 0.0235012 0.0258104 0.911 0.3631
## total_paModerately Active 0.0262364 0.0243740 1.076 0.2825
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.02850198)
##
## Null deviance: 10.675 on 371 degrees of freedom
## Residual deviance: 10.460 on 367 degrees of freedom
## (631 observations deleted due to missingness)
## AIC: -260.84
##
## Number of Fisher Scoring iterations: 2
exp(-0.0013105) # Calculate odds-ratio
## [1] 0.9986904
The problem drinker variable is binary, so we use multivariate logistic regression. With this, we test the associations between problem drinkers and three predictor variables: age, gender, and total physical activity.
The results show that age is associated with being a problem drinker (P = 0.0104), while gender and physical activity are not (P = > 0.05).
Calculating the odds ration using the coefficient for age, we determine for each one-year increase in age, the odds of being a problem drinker decrease by approximately 0.13%.
Now regress total daily alcohol use on age, gender, and physical activity (multivariable model).
m2 <- lm(avgquant ~ age + gender + total_pa, data = ES) # Fit linear regression
summary(m2)
##
## Call:
## lm(formula = avgquant ~ age + gender + total_pa, data = ES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.964 -4.565 -2.502 3.832 108.314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.407652 1.861504 5.591 4.41e-08 ***
## age -0.007109 0.029847 -0.238 0.812
## genderFemale -6.735645 1.051219 -6.407 4.54e-10 ***
## total_paInactive 1.691109 1.514324 1.117 0.265
## total_paModerately Active 0.849405 1.430047 0.594 0.553
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.905 on 367 degrees of freedom
## (631 observations deleted due to missingness)
## Multiple R-squared: 0.11, Adjusted R-squared: 0.1003
## F-statistic: 11.34 on 4 and 367 DF, p-value: 1.086e-08
# Check normality of residuals
shapiro.test(resid(m2)) # Not significant
##
## Shapiro-Wilk normality test
##
## data: resid(m2)
## W = 0.72592, p-value < 2.2e-16
qqnorm(resid(m2))
qqline(resid(m2))
# Check homoscedasticity
plot(m2, 1)
avgquant is a continuous variable, so we use linear regression accordingly. We test associations between average quantity of alcohol and age, gender and total physical activity.
Only gender is significant (P = < 0.001).
However, when we check the assumptions of linear regression, most importantly normality of residuals and homoscedasticity, it reveals that both are violated. The result is thus not applicable and data transformation is necessary.
Check the assumptions of the regression analysis in the previous exercise. Hint: plot(model). Explain what you see in statistical terms. Adapt your analysis and repeat 3e if the assumptions are not met.
ES$log_avgquant <- log(1+ES$avgquant) # Transform to log
m3 <- lm(log_avgquant ~ age + gender + total_pa, data = ES) # Fit new model
summary(m3)
##
## Call:
## lm(formula = log_avgquant ~ age + gender + total_pa, data = ES)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.24605 -0.83082 0.00653 0.85037 2.79210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.234083 0.182276 12.257 <2e-16 ***
## age -0.003791 0.002923 -1.297 0.195
## genderFemale -0.908890 0.102934 -8.830 <2e-16 ***
## total_paInactive -0.010502 0.148280 -0.071 0.944
## total_paModerately Active 0.068837 0.140028 0.492 0.623
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9699 on 367 degrees of freedom
## (631 observations deleted due to missingness)
## Multiple R-squared: 0.1773, Adjusted R-squared: 0.1683
## F-statistic: 19.77 on 4 and 367 DF, p-value: 9.422e-15
# Test normality of residuals
shapiro.test(resid(m3))
##
## Shapiro-Wilk normality test
##
## data: resid(m3)
## W = 0.9816, p-value = 0.0001105
qqnorm(resid(m3))
qqline(resid(m3))
# Check homoscedasticity
plot(m3, 1)
We checked the assumptions and they are not met. Thus, we make a transformation of the outcome variable using log transformation.
Log transformation worked, but residuals are not normally distributed and homoscedasticity is violated. This means that assumptions are violated and that the result of this regression is not valid.
We conclude that the data is not suitable for this type of analysis, and that we should consider a non-linear regression, for example GAM models.