install.packages(“haven”) install.packages(“lattice”) install.packages(“dplyr”) install.packages(“ggplot2”)
#### Preparation
library(haven)
## Warning: package 'haven' was built under R version 4.4.3
library(lattice)
## Warning: package 'lattice' was built under R version 4.4.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.1
This case study is an integrated set of exercises in R that build on the knowledge you gained during the first eight days of the course. The goal of the exercises is to determine whether physical activity is associated with alcohol use. To get to the answer to this question, you will first have to generate new variables from the data available and check the data before analysis.
Read in the entire dataset and select the data for your country.
# Set the working directory
#setwd("X:/1. R course/end assignment")
# Load dataset from SPSS file
dataset_case_study <- read_sav("C:/Users/tungl/Downloads/Dataset_casestudy1.sav")
# View labels of variable 6
attr(dataset_case_study$v6,"labels")
## France Belgium The Netherlands
## 1 2 3
## Germany West Italy Luxembourg
## 4 5 6
## Denmark Ireland Great Britain
## 7 8 9
## Northern Ireland Greece Spain
## 10 11 12
## Portugal Germany East Norway (not included)
## 13 14 15
## Finland Sweden Austria
## 16 17 18
## Cyprus (Republic) Czech Republic Estonia
## 19 20 21
## Hungary Latvia Lithuania
## 22 23 24
## Malta Poland Slovakia
## 25 26 27
## Slovenia Bulgaria Romania
## 28 29 30
## Turkey Croatia Cyprus (CY-TCC)
## 31 32 33
## Macedonia (FYROM)
## 34
# Select West Germany
Eurobar_WG <- subset(dataset_case_study,subset=(v6==4))
We are interested in the variables on alcohol use (alc: [v150-v154]) and on physical activity (pa: [v119-v122]). Furthermore, you will need information on age (v331), gender (v330), and country (v6). Names of the variables in relation to original Eurobarometer coding are given in table 1 of the exercises of Day 5.
Make a smaller dataset for you country with only the variables mentioned above
# Select variables of interest
WG <- subset(Eurobar_WG,select=paste("v",c(6,150:154,119:122,331,330),sep=""))
Rename the variables to the names mentioned in table 1 on Day 5
#Rename variables
names(WG) <- c("country","alc_12m","alcfreq_5dr","alc_30da","alcfreq_30d","alc_am_dd","pa7d_work","pa7d_mov","pa7d_house","pa7d_recr","age","gender")
Describe the data, using R to help you find information on number of observations, number of variables, number of missing values for each variable.
View(WG) # View the data
nrow(WG) # Check number of observations
## [1] 1004
summary(WG) # Check summary
## country alc_12m alcfreq_5dr alc_30da alcfreq_30d
## Min. :4 Min. :1.000 Min. :1.00 Min. :1.00 Min. :1.000
## 1st Qu.:4 1st Qu.:1.000 1st Qu.:2.00 1st Qu.:1.00 1st Qu.:3.000
## Median :4 Median :1.000 Median :3.00 Median :1.00 Median :4.000
## Mean :4 Mean :1.221 Mean :3.25 Mean :1.13 Mean :3.708
## 3rd Qu.:4 3rd Qu.:1.000 3rd Qu.:5.00 3rd Qu.:1.00 3rd Qu.:5.000
## Max. :4 Max. :2.000 Max. :5.00 Max. :2.00 Max. :6.000
## NA's :3 NA's :228 NA's :226 NA's :334
## alc_am_dd pa7d_work pa7d_mov pa7d_house
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :4.000 Median :2.000 Median :2.000
## Mean :2.257 Mean :2.946 Mean :1.952 Mean :1.889
## 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :7.000 Max. :4.000 Max. :4.000 Max. :4.000
## NA's :332 NA's :49
## pa7d_recr age gender
## Min. :1.000 Min. :15.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:35.00 1st Qu.:1.000
## Median :2.000 Median :51.00 Median :2.000
## Mean :2.344 Mean :50.49 Mean :1.549
## 3rd Qu.:3.000 3rd Qu.:67.00 3rd Qu.:2.000
## Max. :4.000 Max. :94.00 Max. :2.000
## NA's :3
missing_counts <- sapply(WG, function(x) sum(is.na(x))) # Check missing values
print(missing_counts)
## country alc_12m alcfreq_5dr alc_30da alcfreq_30d alc_am_dd
## 0 3 228 226 334 332
## pa7d_work pa7d_mov pa7d_house pa7d_recr age gender
## 49 0 0 3 0 0
The number of observations is 1004 and there are 12 variables in the dataset. The missing counts table as obtained by the command above shows the number of missing values for each variable.
Determine which of variables are actually categorical. Hint: determine the number of unique values per variable.
sapply(WG, function(x) length(unique(x))) # Function to determine the number of unique values per variable
## country alc_12m alcfreq_5dr alc_30da alcfreq_30d alc_am_dd
## 1 3 6 3 7 8
## pa7d_work pa7d_mov pa7d_house pa7d_recr age gender
## 5 4 4 5 77 2
str(WG) # Check variable type
## tibble [1,004 × 12] (S3: tbl_df/tbl/data.frame)
## $ country : dbl+lbl [1:1004] 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,...
## ..@ label : chr "NATION - ALL SAMPLES"
## ..@ format.spss : chr "F2.0"
## ..@ display_width: int 4
## ..@ labels : Named num [1:34] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..- attr(*, "names")= chr [1:34] "France" "Belgium" "The Netherlands" "Germany West" ...
## $ alc_12m : dbl+lbl [1:1004] 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1,...
## ..@ label : chr "QC1A DRINKING ALCOHOL - PAST 12 MONTHS"
## ..@ format.spss : chr "F1.0"
## ..@ display_width: int 6
## ..@ labels : Named num [1:4] 1 2 3 9
## .. ..- attr(*, "names")= chr [1:4] "Yes" "No" "DK/Refusal" "Inap. (not 1 to 30 in V6)"
## $ alcfreq_5dr: dbl+lbl [1:1004] NA, 5, NA, 5, 5, 4, 5, 5, 4, 4, 5, 5, NA, N...
## ..@ label : chr "QC1B DRINKING ALCOHOL - 5 OR MORE DRINKS"
## ..@ format.spss : chr "F2.0"
## ..@ display_width: int 6
## ..@ labels : Named num [1:8] 1 2 3 4 5 6 9 99
## .. ..- attr(*, "names")= chr [1:8] "Several times a week" "Once a week" "Once a month" "Less than once a month" ...
## $ alc_30da : dbl+lbl [1:1004] NA, 1, NA, 1, 2, 1, 1, 1, 2, 2, 2, 1, NA, N...
## ..@ label : chr "QC1C DRINKING ALCOHOL - LAST 30 DAYS"
## ..@ format.spss : chr "F2.0"
## ..@ display_width: int 6
## ..@ labels : Named num [1:5] 1 2 3 9 99
## .. ..- attr(*, "names")= chr [1:5] "Yes" "No" "DK/Refusal" "Inap. (not 1 in V150)" ...
## $ alcfreq_30d: dbl+lbl [1:1004] NA, 3, NA, 6, NA, 5, 2, 3, NA, NA, NA, 5, NA, N...
## ..@ label : chr "QC2 DRINKING ALCOHOL - FREQ LAST 30 DAYS"
## ..@ format.spss : chr "F2.0"
## ..@ display_width: int 6
## ..@ labels : Named num [1:9] 1 2 3 4 5 6 7 9 99
## .. ..- attr(*, "names")= chr [1:9] "Daily" "4 - 5 times a week" "2 - 3 times a week" "Once a week" ...
## $ alc_am_dd : dbl+lbl [1:1004] NA, 2, NA, 1, NA, 2, 2, 1, NA, NA, NA, 2, NA, N...
## ..@ label : chr "QC3 DRINKING ALCOHOL - N OF DRINKS USUALLY"
## ..@ format.spss : chr "F2.0"
## ..@ display_width: int 6
## ..@ labels : Named num [1:10] 1 2 3 4 5 6 7 8 9 99
## .. ..- attr(*, "names")= chr [1:10] "Less than 1 drink" "1 - 2 drinks" "3 - 4 drinks" "5 - 6 drinks" ...
## $ pa7d_work : dbl+lbl [1:1004] 4, 3, 4, 4, NA, 3, 4, NA, 2, NA, NA, 1, NA, ...
## ..@ label : chr "QA4 PHYS ACTIVITY LAST 7 DAYS: AT WORK"
## ..@ format.spss : chr "F1.0"
## ..@ display_width: int 6
## ..@ labels : Named num [1:6] 1 2 3 4 5 9
## .. ..- attr(*, "names")= chr [1:6] "A lot" "Some" "Little" "None" ...
## $ pa7d_mov : dbl+lbl [1:1004] 3, 3, 1, 2, 2, 2, 1, 1, 2, 1, 4, 2, 2, 2, 1, 2, 2, 4,...
## ..@ label : chr "QA4 PHYS ACTIVITY LAST 7 DAYS: MOVING"
## ..@ format.spss : chr "F1.0"
## ..@ display_width: int 6
## ..@ labels : Named num [1:6] 1 2 3 4 5 9
## .. ..- attr(*, "names")= chr [1:6] "A lot" "Some" "Little" "None" ...
## $ pa7d_house : dbl+lbl [1:1004] 2, 1, 3, 1, 1, 2, 4, 1, 2, 1, 4, 2, 2, 2, 2, 2, 1, 3,...
## ..@ label : chr "QA4 PHYS ACTIVITY LAST 7 DAYS: AT HOME"
## ..@ format.spss : chr "F1.0"
## ..@ display_width: int 6
## ..@ labels : Named num [1:6] 1 2 3 4 5 9
## .. ..- attr(*, "names")= chr [1:6] "A lot" "Some" "Little" "None" ...
## $ pa7d_recr : dbl+lbl [1:1004] 4, 1, 2, 1, 4, 4, 2, 1, 2, 1, 4, 4, 4, 3, 2, 4, 1, 4,...
## ..@ label : chr "QA4 PHYS ACTIVITY LAST 7 DAYS: RECREATION"
## ..@ format.spss : chr "F1.0"
## ..@ display_width: int 6
## ..@ labels : Named num [1:6] 1 2 3 4 5 9
## .. ..- attr(*, "names")= chr [1:6] "A lot" "Some" "Little" "None" ...
## $ age : num [1:1004] 79 50 56 35 70 63 73 66 29 55 ...
## ..- attr(*, "label")= chr "D11 AGE EXACT"
## ..- attr(*, "format.spss")= chr "F2.0"
## ..- attr(*, "display_width")= int 6
## $ gender : dbl+lbl [1:1004] 1, 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2,...
## ..@ label : chr "D10 GENDER"
## ..@ format.spss : chr "F1.0"
## ..@ display_width: int 6
## ..@ labels : Named num [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "Male" "Female"
summary(WG) # Check summary
## country alc_12m alcfreq_5dr alc_30da alcfreq_30d
## Min. :4 Min. :1.000 Min. :1.00 Min. :1.00 Min. :1.000
## 1st Qu.:4 1st Qu.:1.000 1st Qu.:2.00 1st Qu.:1.00 1st Qu.:3.000
## Median :4 Median :1.000 Median :3.00 Median :1.00 Median :4.000
## Mean :4 Mean :1.221 Mean :3.25 Mean :1.13 Mean :3.708
## 3rd Qu.:4 3rd Qu.:1.000 3rd Qu.:5.00 3rd Qu.:1.00 3rd Qu.:5.000
## Max. :4 Max. :2.000 Max. :5.00 Max. :2.00 Max. :6.000
## NA's :3 NA's :228 NA's :226 NA's :334
## alc_am_dd pa7d_work pa7d_mov pa7d_house
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :4.000 Median :2.000 Median :2.000
## Mean :2.257 Mean :2.946 Mean :1.952 Mean :1.889
## 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :7.000 Max. :4.000 Max. :4.000 Max. :4.000
## NA's :332 NA's :49
## pa7d_recr age gender
## Min. :1.000 Min. :15.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:35.00 1st Qu.:1.000
## Median :2.000 Median :51.00 Median :2.000
## Mean :2.344 Mean :50.49 Mean :1.549
## 3rd Qu.:3.000 3rd Qu.:67.00 3rd Qu.:2.000
## Max. :4.000 Max. :94.00 Max. :2.000
## NA's :3
Every variable except age is categorical, but in the dataset they are regarded as continuous variables.
Create a function that can convert the numerical variables that are categorical ones into factors and drop the levels that are non-informative (determine from the labels). Use the solution you found on day 5, where you already converted numerical variables into categorical ones. Function input arguments should be the variable and the levels to be dropped / set to missing. Inside the body of the function, use the variable to determine its levels and labels. Output should be the factor variable with the right labels and without non-informative levels.
convert_to_factors <- function(WG) {
# Convert the numerical variables that are categorical into factors
WG$country_f <- as_factor(WG$country)
WG$alc_12m_f <- as_factor(WG$alc_12m)
WG$alcfreq_5dr_f <- as_factor(WG$alcfreq_5dr)
WG$alc_30da_f <- as_factor(WG$alc_30da)
WG$alcfreq_30d_f <- as_factor(WG$alcfreq_30d)
WG$alc_am_dd_f <- as_factor(WG$alc_am_dd)
WG$pa7d_work_f <- as_factor(WG$pa7d_work)
WG$pa7d_mov_f <- as_factor(WG$pa7d_mov)
WG$pa7d_house_f <- as_factor(WG$pa7d_house)
WG$pa7d_recr_f <- as_factor(WG$pa7d_recr)
WG$gender_f <- as_factor(WG$gender)
# Specify the levels that are non-informative
na_values <- list(
alc_12m_f = c("DK/Refusal", "Inap. (not 1 to 30 in V6)"),
alcfreq_5dr_f = c("DK/Refusal", "Inap. (not 1 in V150)", "Inap. (not 1 to 30 in V6)"),
alc_30da_f = c("DK/Refusal", "Inap. (not 1 in V150)", "Inap. (not 1 to 30 in V6)"),
alcfreq_30d_f = c("Don't remember/Refusal (SPONT.)", "Inap. (not 1 in V152)", "Inap. (not 1 to 30 in V6)"),
alc_am_dd_f = c("It depends (SPONT.)", "DK/ Refusal", "Inap. (not 1 in V152)", "Inap. (not 1 to 30 in V6)"),
pa7d_work_f = c("DK", "Inap. (not 1 to 30 in V6)"),
pa7d_mov_f = c("DK", "Inap. (not 1 to 30 in V6)"),
pa7d_house_f = c("DK", "Inap. (not 1 to 30 in V6)"),
pa7d_recr_f = c("DK", "Inap. (not 1 to 30 in V6)")
)
# Convert non-informative levels to NA
for (var in names(na_values)) {
if (var %in% names(WG)) {
current_levels <- levels(WG[[var]])
na_indices <- which(current_levels %in% na_values[[var]])
current_levels[na_indices] <- NA
levels(WG[[var]]) <- current_levels
}
}
return(WG)
}
Use the function created in 1f to convert the categorical variables determined in exercise 1e to factors and check if the function worked correctly. NB. Use different names for the new variables to avoid that the original one is lost! In the remainder we will however refer to the original names.
# Check if the conversion worked
WG <- convert_to_factors(WG)
summary(WG)
## country alc_12m alcfreq_5dr alc_30da alcfreq_30d
## Min. :4 Min. :1.000 Min. :1.00 Min. :1.00 Min. :1.000
## 1st Qu.:4 1st Qu.:1.000 1st Qu.:2.00 1st Qu.:1.00 1st Qu.:3.000
## Median :4 Median :1.000 Median :3.00 Median :1.00 Median :4.000
## Mean :4 Mean :1.221 Mean :3.25 Mean :1.13 Mean :3.708
## 3rd Qu.:4 3rd Qu.:1.000 3rd Qu.:5.00 3rd Qu.:1.00 3rd Qu.:5.000
## Max. :4 Max. :2.000 Max. :5.00 Max. :2.00 Max. :6.000
## NA's :3 NA's :228 NA's :226 NA's :334
## alc_am_dd pa7d_work pa7d_mov pa7d_house
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :4.000 Median :2.000 Median :2.000
## Mean :2.257 Mean :2.946 Mean :1.952 Mean :1.889
## 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :7.000 Max. :4.000 Max. :4.000 Max. :4.000
## NA's :332 NA's :49
## pa7d_recr age gender country_f
## Min. :1.000 Min. :15.00 Min. :1.000 Germany West :1004
## 1st Qu.:1.000 1st Qu.:35.00 1st Qu.:1.000 France : 0
## Median :2.000 Median :51.00 Median :2.000 Belgium : 0
## Mean :2.344 Mean :50.49 Mean :1.549 The Netherlands: 0
## 3rd Qu.:3.000 3rd Qu.:67.00 3rd Qu.:2.000 Italy : 0
## Max. :4.000 Max. :94.00 Max. :2.000 Luxembourg : 0
## NA's :3 (Other) : 0
## alc_12m_f alcfreq_5dr_f alc_30da_f alcfreq_30d_f
## Yes :780 Several times a week :113 Yes :677 Daily : 62
## No :221 Once a week :168 No :101 4 - 5 times a week : 67
## NA's: 3 Once a month :114 NA's:226 2 - 3 times a week :145
## Less than once a month:174 Once a week :206
## Never :207 2 - 3 times a month:111
## NA's :228 Once : 79
## NA's :334
## alc_am_dd_f pa7d_work_f pa7d_mov_f pa7d_house_f pa7d_recr_f
## Less than 1 drink:129 A lot :205 A lot :364 A lot :402 A lot :318
## 1 - 2 drinks :383 Some :137 Some :390 Some :381 Some :269
## 3 - 4 drinks : 96 Little:118 Little:184 Little:151 Little:166
## 5 - 6 drinks : 28 None :495 None : 66 None : 70 None :248
## 7 - 9 drinks : 10 NA's : 49 NA's : 3
## 10 drinks or more: 10
## NA's :348
## gender_f
## Male :453
## Female:551
##
##
##
##
##
The function worked, as it converted all variables except age to factor and removed inappropriate labels.
Calculate descriptive statistics of all variables. Think well on what kind of statistic that would be depending on the type of variable.
hist(WG$age) #Check distribution of age (continuous)
summary(WG$age) #Descriptive statistics of age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 35.00 51.00 50.49 67.00 94.00
median(WG$age) #Calculate median for age because of non-normal distribution
## [1] 51
IQR(WG$age) #Calculate IQR for age because of non-normal distribution
## [1] 32
We calculated the median and IQR for age since the histogram showed a non-normal distribution.
#Calculate descriptive statistics for all categorical variables
calculate_proportions_and_frequencies_table <- function(dataset) {
variables <- c("alc_12m_f", "alcfreq_5dr_f", "alc_30da_f", "alcfreq_30d_f",
"alc_am_dd_f", "pa7d_work_f", "pa7d_mov_f", "pa7d_house_f",
"pa7d_recr_f", "gender_f")
result_list <- list()
for (var in variables) {
freq <- table(dataset[[var]])
prop <- prop.table(freq)
df <- data.frame(
Category = names(freq),
Frequency = as.vector(freq),
Proportion = as.vector(prop)
)
result_list[[var]] <- df
}
for (var in variables) {
cat("\nTable for", var, ":\n")
print(result_list[[var]])
cat("\n------------------------\n")
}
}
calculate_proportions_and_frequencies_table(WG)
##
## Table for alc_12m_f :
## Category Frequency Proportion
## 1 Yes 780 0.7792208
## 2 No 221 0.2207792
##
## ------------------------
##
## Table for alcfreq_5dr_f :
## Category Frequency Proportion
## 1 Several times a week 113 0.1456186
## 2 Once a week 168 0.2164948
## 3 Once a month 114 0.1469072
## 4 Less than once a month 174 0.2242268
## 5 Never 207 0.2667526
##
## ------------------------
##
## Table for alc_30da_f :
## Category Frequency Proportion
## 1 Yes 677 0.8701799
## 2 No 101 0.1298201
##
## ------------------------
##
## Table for alcfreq_30d_f :
## Category Frequency Proportion
## 1 Daily 62 0.09253731
## 2 4 - 5 times a week 67 0.10000000
## 3 2 - 3 times a week 145 0.21641791
## 4 Once a week 206 0.30746269
## 5 2 - 3 times a month 111 0.16567164
## 6 Once 79 0.11791045
##
## ------------------------
##
## Table for alc_am_dd_f :
## Category Frequency Proportion
## 1 Less than 1 drink 129 0.19664634
## 2 1 - 2 drinks 383 0.58384146
## 3 3 - 4 drinks 96 0.14634146
## 4 5 - 6 drinks 28 0.04268293
## 5 7 - 9 drinks 10 0.01524390
## 6 10 drinks or more 10 0.01524390
##
## ------------------------
##
## Table for pa7d_work_f :
## Category Frequency Proportion
## 1 A lot 205 0.2146597
## 2 Some 137 0.1434555
## 3 Little 118 0.1235602
## 4 None 495 0.5183246
##
## ------------------------
##
## Table for pa7d_mov_f :
## Category Frequency Proportion
## 1 A lot 364 0.36254980
## 2 Some 390 0.38844622
## 3 Little 184 0.18326693
## 4 None 66 0.06573705
##
## ------------------------
##
## Table for pa7d_house_f :
## Category Frequency Proportion
## 1 A lot 402 0.40039841
## 2 Some 381 0.37948207
## 3 Little 151 0.15039841
## 4 None 70 0.06972112
##
## ------------------------
##
## Table for pa7d_recr_f :
## Category Frequency Proportion
## 1 A lot 318 0.3176823
## 2 Some 269 0.2687313
## 3 Little 166 0.1658342
## 4 None 248 0.2477522
##
## ------------------------
##
## Table for gender_f :
## Category Frequency Proportion
## 1 Male 453 0.4511952
## 2 Female 551 0.5488048
##
## ------------------------
Make a bar chart of pa7d_work to visualize this variable. Use both
hist() and barplot() as a command to accomplish this and describe which
plot is preferred.
N.B. using barplot(pa7d_work) won’t
properly display the variable. Check what input the function
needs.
#Making histogram of pa7d_work
ggplot(WG, aes(x = pa7d_work)) +
geom_histogram(binwidth = 1, fill = "lightblue", color = "black", na.rm = TRUE) +
labs(title = "Histogram of Physical activity at work", x = "Physical activity at work (pa7d_work)", y = "Frequency") +
theme_minimal()
#Making barplot of pa7d_work
ggplot(WG, aes(x = pa7d_work)) +
geom_bar(fill = "lightblue", color = "black", na.rm = TRUE) +
labs(title = "Barplot of Physical activity at work during the last 7 days",
x = "Physical activity at work (pa7d_work)",
y = "Frequency") +
theme_minimal()
We prefer a bar plot, since the variable pa7d_work is a categorical variable. For categorical variables a bar plot is the first choice to represent the frequencies for each group separately. A histogram is preferred for continuous variables.
Since we are interested in problematic alcohol use, the data on alcohol use will have to be processed to produce a new binary variable “problematic alcohol use”. Alcohol use may be defined as problematic when it is chronically too much (more than 40 g on average per day for male or 20 g per day for female persons) or when too much is consumed on a single occasion, i.e. more than 5 glasses (binge drinking). Questions on both issues are in the dataset. Define someone as a problematic drinker if he/she either chronically drinks too much and/or does binge drinking. You will now produce a new variable “problematic drinker” in a couple of steps. Do this for your own small dataset with only the records from your chosen country.
Define a new numeric variable “number of drinking days/month” (e.g. called n_drinksmonth) and use the information from alc_30da and freq_30d to fill this variable with numeric values on the number of days a persons was drinking alcohol per month. You may need to make some pragmatic assumptions here. E.g. concerning the number of days in category “Daily” should be converted to 30.
State your assumptions in a comment. There are many approaches exist
to do this conversion, e.g. like in exercise 24 of Day 1, using the
ifelse() function, or using the revalue() or
mapvalues() functions from plyr package.
NB. There are multiple solutions to do this conversion. If you
want to use conditional statements, remember that if() does
not work on a vector, while ifelse() does. Alternatively
use mapply() for loop or another solution.
#Define numeric variable n_drinksmonth
WG$n_drinksmonth <- ifelse(
WG$alc_30da_f == "Yes",
ifelse(WG$alcfreq_30d_f == "Daily", 30,
ifelse(WG$alcfreq_30d_f == "4 - 5 times a week", 18,
ifelse(WG$alcfreq_30d_f == "2 - 3 times a week", 10,
ifelse(WG$alcfreq_30d_f == "Once a week", 4,
ifelse(WG$alcfreq_30d_f == "2 - 3 times a month", 2.5,
ifelse(WG$alcfreq_30d_f == "Once", 1, 0)
)
)
)
)
),
0
)
Define a new variable “quantity of alcohol (in grams) consumed on a drinking day” (e.g. called quantday) and use the information from alc_30da and alc_am_dd to do so. Use the following assumption: 1 drink/consumption contains 12 g alcohol.
#Convert variable alc_am_dd to a numeric variable
WG$alc_am_dd_numeric <- ifelse(WG$alc_am_dd_f == "Less than 1 drink", 0.5,
ifelse(WG$alc_am_dd_f == "1 - 2 drinks", 1.5,
ifelse(WG$alc_am_dd_f == "3 - 4 drinks", 3.5,
ifelse(WG$alc_am_dd_f == "5 - 6 drinks", 5.5,
ifelse(WG$alc_am_dd_f == "7 - 9 drinks", 8,
ifelse(WG$alc_am_dd_f == "10 drinks or more", 10, NA))))))
#Compute variable quantday (quantity of alcohol in g consumed on a drinking day)
WG$quantday <- ifelse(
WG$alc_30da_f == "Yes", # Only calculate when alc_30_da_f is "Yes" (drinking alcohol in the last 30 days)
WG$alc_am_dd_numeric * 12, # Multiply by 12 (grams of alcohol per drink)
0 # If alc_30_da_f is "No", than quantday is 0
)
#Check the first values of the new variable quantday
head(WG$quantday)
## [1] NA 18 NA 6 0 18
In this case we state that if alc_30da == 2 (no alcohol is consumed in the last 30 days), the return will be 0 for quantday, and if alc_30da == 1 (alcohol consumed in the last 30 days), the return will be the number of drinks multiplied by 12 g to calculate the total grams of alcohol consumed.Hereby we assume that the number of drinks on a drinking day is the average of the range given.
Now calculate a third new variable “average quantity of alcohol (in grams) per day over the last month”, multiplying the two variables above and dividing the result by 30.
# Calculate the average quantity of alcohol (in grams) per day over the last month
WG$average_quant_day <- (WG$quantday * WG$n_drinksmonth) / 30
#Check the first values of the new variable
head(WG$average_quant_day)
## [1] NA 6.0 NA 0.2 0.0 1.5
Make a histogram or barplot of daily intake created in exercise 2c.
# Checking the structure of the new variable average_quant_day
str(WG$average_quant_day)
## num [1:1004] NA 6 NA 0.2 0 1.5 10.8 2 0 0 ...
summary(WG$average_quant_day)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.800 2.400 5.702 6.000 66.000 252
#Remove NA values of the new variable by creating a clean dataset
WG_clean <- WG %>% filter(!is.na(average_quant_day))
#Perform a barplot
ggplot(WG_clean, aes(x = as.factor(average_quant_day))) +
geom_bar(fill = "lightblue", color = "black", width = 0.8) +
labs(title = "Barplot of daily alcohol intake (average_quant_day)",
x = "Average quantity of alcohol per day (grams)",
y = "Frequency") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We chose to make a barplot instead of a histogram since the variable
average_quant_day is discrete and therefore a barplot is preferred.
The barplot shows that there is a large spread in the distribution of average quantity of alcohol a day. These results make sense, since the number of drinks during drinking days varies a lot, from less than 1 drinks a day to 10 drinks or more. Since this was multiplied by 12 (for grams), the total quantity shows a distribution between 0 and 66 grams of alcohol.
No, the variable is discrete instead of continuous, since it can not take all possible values. This is because the categories of number of drinks during drinking days was limited to a maximum of 10 (10 drinks or more). Therefore the total amount of alcohol in g can be max. 120 g. Another argument is the specified range in number of drinks (e.g. less than 1 or 7-9), for which we had to make assumptions by using the average value.
Do the same as in exercise 2d, but now for men and women separately,
in a single plot.
Hint: think of the trellis plots from the
lattice package.
# Perform a bar chart
barchart(average_quant_day ~ factor(gender, levels = c(1, 2), labels = c("Male", "Female")),
data = WG_clean,
main = "Barplot of Average Alcohol Intake per Day by Gender",
xlab = "Gender",
ylab = "Average Quantity of Alcohol per Day (grams)",
col = "lightblue",
auto.key = TRUE,
scales = list(y = list(relation = "free")),
layout = c(1, 1))
Compute descriptive statistics for the daily (average) alcohol use in grams.
calculate_average_quant_day_table <- function(dataset) {
freq <- table(WG_clean$average_quant_day)
prop <- prop.table(freq)
df <- data.frame(
Category = names(freq),
Frequency = as.vector(freq),
Proportion = as.vector(prop)
)
cat("\nTable for average_quant_day:\n")
print(df)
cat("\n------------------------\n")
}
calculate_average_quant_day_table(WG)
##
## Table for average_quant_day:
## Category Frequency Proportion
## 1 0 101 0.134308511
## 2 0.2 32 0.042553191
## 3 0.5 22 0.029255319
## 4 0.6 32 0.042553191
## 5 0.8 33 0.043882979
## 6 1.4 8 0.010638298
## 7 1.5 64 0.085106383
## 8 2 22 0.029255319
## 9 2.2 5 0.006648936
## 10 2.4 117 0.155585106
## 11 3.2 1 0.001329787
## 12 3.5 15 0.019946809
## 13 3.6 10 0.013297872
## 14 5.5 5 0.006648936
## 15 5.6 30 0.039893617
## 16 6 92 0.122340426
## 17 8 2 0.002659574
## 18 8.8 10 0.013297872
## 19 10 1 0.001329787
## 20 10.8 46 0.061170213
## 21 12.8 4 0.005319149
## 22 14 25 0.033244681
## 23 16 6 0.007978723
## 24 18 38 0.050531915
## 25 22 3 0.003989362
## 26 25.2 9 0.011968085
## 27 32 3 0.003989362
## 28 40 3 0.003989362
## 29 42 9 0.011968085
## 30 66 4 0.005319149
##
## ------------------------
Create a variable “heavy drinker” for men and women, using the cut-off values mentioned above (40 and 20 g daily, respectively).
WG$heavy_drinker <- ifelse(
(WG$gender == 1 & WG$average_quant_day > 40) | # For men the cut-off is set at above 40g a day
(WG$gender == 2 & WG$average_quant_day > 20), # For women the cut-off is set at above 20g a day
"Yes", # If the condition is met, the person is categorized as a heavy drinker
"No" # If the condition is not met, the person is not categorized as a heavy drinker
)
Now consider binge drinking in addition to drinking too much
Define a new variable “binge” that gets value 1 if someone is a binge drinker (i.e. >=5 glasses or 60 gram alcohol as defined in 2b on a single occasion) and value 0 else. This has the same definition for men and women.
#Compute new variable "binge"
WG$binge <- ifelse(
WG$alc_am_dd_f %in% c("5 - 6 drinks", "7 - 9 drinks", "10 drinks or more") |
WG$quantday >= 60,
1,
0
)
#Add labels
WG$binge <- factor(WG$binge, levels = c(0, 1), labels = c("No", "Yes"))
Use these two variables (heavy drinker and binge) to fill a new
binary variable “problem drinker” using logistic operators
(|, &) in the appropriate way.
#Compute new variable problem_drinker
WG$problem_drinker <- ifelse( # Create new variable on problem drinker
WG$heavy_drinker == "Yes" | WG$binge == "Yes", # Specify values when someone is a problem drinker
1,
0
)
#Add labels
WG$problem_drinker <- factor(WG$problem_drinker, levels = c(0, 1), labels = c("No", "Yes"))
Check the contents of this new variable
#Check contents of variable problem_drinker
summary(WG$problem_drinker)
## No Yes NA's
## 695 58 251
table(WG$problem_drinker, useNA = "always")
##
## No Yes <NA>
## 695 58 251
Yes, because there are three groups of people for whom problem_drinker can be “Yes”: people who are both binge drinker and heavy drinker (n= 4), people who are only a binge drinker (n=44) and people who are only a heavy drinker (n=10). The sum equals the total number of problem drinkers (n=58).
Define a new variable “total physical activity”, adding relevant information from all “pa variables” on physical activity in various settings. Clearly explain your definitions and assumptions. Categorize people as “inactive” or “moderately active” or “active”.
#Define variable total physical activity (total_pa)
WG$total_pa <- apply(WG[, c("pa7d_work_f", "pa7d_mov_f", "pa7d_house_f", "pa7d_recr_f")], 1, function(x) {
active_count <- sum(x %in% c("A lot"))
inactive_count <- sum(x %in% c("None"))
if (active_count >= 2) {
return("Active")
}
if (inactive_count >= 2) {
return("Inactive")
}
return("Moderately active")
})
#Checking content of new variable total_pa
table(WG$total_pa)
##
## Active Inactive Moderately active
## 403 187 414
While creating the new variable ‘total physical activity’ we did the following assumptions: ‘Inactive’ people have responded ‘None’ on 2 ore more PA variable, ‘Active’ people have responded ‘A lot’ on 2 or more PA variables. People who did not fulfil any of the previous criteria were categorized as ‘moderately active’.
Now these new variables will be used in a regression analysis to
analyze the effects of age, gender, and physical activity on daily
alcohol use and the assumptions of this analysis will be tested. Please
remember that we are using categorical variables here and therefore you
have check how your data have been stored in R, as numeric values or
factors, e.g. by using summary() or table().
In case of the first, the variable will be considered numeric. Sometimes
it matters how your data are stored and will it be worth trying the
alternative if you run into trouble or your results don’t make
sense.
Create a table of gender versus problem drinkers (yes/no). Apply a test to check for differences in percentages of problematic drinking between males and females.
#Creating table of gender vs problem drinkers excluding NA values
summary(WG$gender_f)
## Male Female
## 453 551
summary(na.omit(WG$problem_drinker))
## No Yes
## 695 58
#Create a contingency table for gender and problem drinking
gender_problem_drinker_table <- table(WG$gender_f, (WG$problem_drinker))
table(WG$gender_f, (WG$problem_drinker))
##
## No Yes
## Male 325 46
## Female 370 12
#Chi-square test for testing differences in problem drinking between males and females
chisq.test(gender_problem_drinker_table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: gender_problem_drinker_table
## X-squared = 21.405, df = 1, p-value = 3.717e-06
The summary() function shows that when using ‘gender_f’, both the ‘problem_drinker’ and ‘gender_f’ are stored as factors, while ‘gender’ is a numeric variable. So, we use ‘gender_f’. Since both problem drinking and gender are categorical variables, a chi-square test was conducted to assess whether the distribution of problem drinking differs across genders. The test yielded a p-value of 3.717e-06, which is below the significance threshold of 0.05. Therefore, the null hypothesis of independence between gender and problem drinking is rejected, indicating that there is a statistically significant association between the two variables in this dataset.
Do the same for the variable on physical activity created in 2k versus problem drinkers, that is present and test for differences in problematic drinking in the three different categories of physical activity.
#Creating table of total_pa vs problem drinkers
table(WG$total_pa)
##
## Active Inactive Moderately active
## 403 187 414
#Create a contingency table for total_pa and problem drinking
total_pa_problem_drinker_table <- table(WG$total_pa, WG$problem_drinker)
print(total_pa_problem_drinker_table)
##
## No Yes
## Active 290 29
## Inactive 115 9
## Moderately active 290 20
#Chi-square test for testing differences in problematic drinking in the 3 PA-categories
chisq.test(total_pa_problem_drinker_table)
##
## Pearson's Chi-squared test
##
## data: total_pa_problem_drinker_table
## X-squared = 1.5817, df = 2, p-value = 0.4535
Since both total physical activity and problem drinking are categorical variables, a chi-square test was conducted to examine whether the distribution of problem drinking differs across physical activity levels. The test yielded a p-value of 0.4535, which is above the significance threshold of 0.05. Therefore, the null hypothesis of independence between physical activity and problem drinking cannot be rejected, indicating that there is no statistically significant association between the two variables in this dataset.
Make a graph of percentages of problem drinkers versus age (in 10-year age categories). Here, think about what would be most informative: percentage of total or percentages per age category. Test whether the distribution of age (continuous) is different between problem drinkers and non-problem drinkers. Check the summary statistics between the two groups.
#Checking the summary and range of the variable age
summary(WG$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 35.00 51.00 50.49 67.00 94.00
#Convert age into 10-year categories (10-20, 20-30, ..., 90-100). Say right = FALSE to include 10 but exclude 20 in e.g. 10-20 as age group.
WG$age_groups <-cut(WG$age, breaks = seq(10, 100, by = 10), labels = paste(seq(10, 90, by = 10), seq(20, 100, by = 10), sep = "-"), right = FALSE)
#Check the frequency distribution of age groups
table (WG$age_groups)
##
## 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90 90-100
## 56 132 117 167 174 159 144 52 3
#Print table of percentages of problem drinking for age groups
age_problem_drinker_table <- table(WG$age_groups, WG$problem_drinker)
table(WG$age_groups, WG$problem_drinker)
##
## No Yes
## 10-20 32 10
## 20-30 94 17
## 30-40 78 9
## 40-50 120 8
## 50-60 129 7
## 60-70 110 4
## 70-80 101 2
## 80-90 31 1
## 90-100 0 0
age_problem_drinker_perc <- prop.table(age_problem_drinker_table, 1) * 100
print(age_problem_drinker_perc)
##
## No Yes
## 10-20 76.190476 23.809524
## 20-30 84.684685 15.315315
## 30-40 89.655172 10.344828
## 40-50 93.750000 6.250000
## 50-60 94.852941 5.147059
## 60-70 96.491228 3.508772
## 70-80 98.058252 1.941748
## 80-90 96.875000 3.125000
## 90-100
#Create a bar plot to show the percentage of problem drinkers for each age group, excluding NA
age_problem_drinker_perc_df <- as.data.frame(age_problem_drinker_perc)
age_problem_drinker_perc_clean <- age_problem_drinker_perc_df[!is.na(age_problem_drinker_perc_df$Var2) & !is.na(age_problem_drinker_perc_df$Freq), ]
ggplot(data = age_problem_drinker_perc_clean,
aes(x = Var1, y = Freq, fill = Var2)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Distribution of Problem Drinkers by Age Group",
x = "Age Group",
y = "Percentage") +
scale_fill_manual(values = c("pink", "purple"), labels = c("No", "Yes")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = guide_legend(title = "Problem drinking"))
#Check distribution of age (continuous) and test for differences between groups
hist(WG$age)
shapiro.test(WG$age)
##
## Shapiro-Wilk normality test
##
## data: WG$age
## W = 0.97099, p-value = 2.832e-13
wilcox.test(age ~ problem_drinker, data = WG)
##
## Wilcoxon rank sum test with continuity correction
##
## data: age by problem_drinker
## W = 28597, p-value = 1.129e-07
## alternative hypothesis: true location shift is not equal to 0
We calculated percentages per age category, since we argued it would be most informative to see whether the likelihood of being a problem drinker changes across different age groups, instead of in the total population. The histogram of age (continous) demonstrated a non-normal distribution, which is why we performed a Mann-Whitney U test for assessing statistical differences between age groups. The Mann-Whitney U test demonstrated a P-value of 1.129e-07 which is <0.05. Therefore we can reject the null hypothesis that there are no significant differences in age between problem drinkers and non-problem drinkers.
Perform a multivariable test to determine the effects of age (continuous), gender and physical activity on problematic drinking. Think well about the type of regression. Explain what you found.
model <- glm(problem_drinker ~ age + gender_f + total_pa,
data = WG,
family = binomial)
summary(model)
##
## Call:
## glm(formula = problem_drinker ~ age + gender_f + total_pa, family = binomial,
## data = WG)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.309815 0.403001 0.769 0.4420
## age -0.043690 0.008421 -5.188 2.12e-07 ***
## gender_fFemale -1.615135 0.343013 -4.709 2.49e-06 ***
## total_paInactive -0.331714 0.420783 -0.788 0.4305
## total_paModerately active -0.665885 0.320321 -2.079 0.0376 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 408.79 on 752 degrees of freedom
## Residual deviance: 349.60 on 748 degrees of freedom
## (251 observations deleted due to missingness)
## AIC: 359.6
##
## Number of Fisher Scoring iterations: 6
We performed a logistic regression since problematic drinking, the dependent variable, is binary.The regression analysis shows that age (p=2.12e-07 <0.0001), being female (p=2.49e-06 <0.0001), and moderate physical activity (p=0.0376) have a significant negative association with problematic drinking in our dataset.So: people with higher age, females and people with moderate physical activity have lower odds of problematic drinking compared to others.
Now regress total daily alcohol use on age, gender, and physical activity (multivariable model).
model2 <- lm(average_quant_day ~ age + gender_f + total_pa, data = WG)
summary(model2)
##
## Call:
## lm(formula = average_quant_day ~ age + gender_f + total_pa, data = WG)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.393 -3.916 -1.950 2.016 58.100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.84630 0.99145 7.914 9.00e-15 ***
## age 0.01983 0.01627 1.219 0.2232
## gender_fFemale -5.00868 0.60890 -8.226 8.59e-16 ***
## total_paInactive -1.66816 0.88780 -1.879 0.0606 .
## total_paModerately active -0.75911 0.66209 -1.147 0.2519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.233 on 747 degrees of freedom
## (252 observations deleted due to missingness)
## Multiple R-squared: 0.08551, Adjusted R-squared: 0.08062
## F-statistic: 17.46 on 4 and 747 DF, p-value: 1.041e-13
Since the outcome (average daily alcohol use) is a continuous variable, we performed a linear regression.
The linear regression analysis shows that gender (female), inactive PA and moderatie PA are all negatively associated with average daily alcohol use. However, only for female gender this association is significant (p= 8.59e-16). So: gender shows that females consume significantly less alcohol than males. The model explains 8.6% of the variation in daily alcohol consumption.The residuals suggest that the model may not fully capture the variation in alcohol consumption, as the R-squared is low (8.6%), suggesting that the residuals are large compared to the explained variance. A log transformation of the dependent variable could improve the model fit by addressing non-normality in the residuals.
Check the assumptions of the regression analysis in the previous exercise. Hint: plot(model). Explain what you see in statistical terms. Adapt your analysis and repeat 3e if the assumptions are not met.
plot(model2)
WG$log_average_quant_day <- log(1 + WG$average_quant_day)
model3 <- lm(log_average_quant_day ~ age + gender_f + total_pa, data = WG)
summary(model3)
##
## Call:
## lm(formula = log_average_quant_day ~ age + gender_f + total_pa,
## data = WG)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86367 -0.66432 0.00601 0.76877 2.64545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.632597 0.112629 14.495 < 2e-16 ***
## age 0.002962 0.001848 1.603 0.10936
## gender_fFemale -0.670894 0.069171 -9.699 < 2e-16 ***
## total_paInactive -0.279912 0.100854 -2.775 0.00565 **
## total_paModerately active -0.045127 0.075213 -0.600 0.54870
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9353 on 747 degrees of freedom
## (252 observations deleted due to missingness)
## Multiple R-squared: 0.1178, Adjusted R-squared: 0.113
## F-statistic: 24.93 on 4 and 747 DF, p-value: < 2.2e-16
We conclude that the assumptions of the linear regression are not met, as observed in the plots. For instance, the Residuals vs Fitted plot shows a pattern, which indicates a non-linear relationship between the dependent variable and the independent variables. In addition, in the Q-Q plot there is a clear deviation from the diagonal line at the end of the plot, which suggests non-normality in the residuals. Therefore, a transformation (such as a log transformation) would be indicated in this case.This can be performed using the command above.