Assignment Introduction into R

install.packages(“haven”) install.packages(“lattice”) install.packages(“dplyr”) install.packages(“ggplot2”)

#### Preparation
library(haven)

## Warning: package 'haven' was built under R version 4.4.3

library(lattice)

## Warning: package 'lattice' was built under R version 4.4.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.4.1

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.1

Case study

This case study is an integrated set of exercises in R that build on the knowledge you gained during the first eight days of the course. The goal of the exercises is to determine whether physical activity is associated with alcohol use. To get to the answer to this question, you will first have to generate new variables from the data available and check the data before analysis.

Exercise 1

1a

Read in the entire dataset and select the data for your country.

# Set the working directory
#setwd("X:/1. R course/end assignment") 
# Load dataset from SPSS file
dataset_case_study <- read_sav("C:/Users/tungl/Downloads/Dataset_casestudy1.sav") 
# View labels of variable 6
attr(dataset_case_study$v6,"labels")

##                France               Belgium       The Netherlands 
##                     1                     2                     3 
##          Germany West                 Italy            Luxembourg 
##                     4                     5                     6 
##               Denmark               Ireland         Great Britain 
##                     7                     8                     9 
##      Northern Ireland                Greece                 Spain 
##                    10                    11                    12 
##              Portugal          Germany East Norway (not included) 
##                    13                    14                    15 
##               Finland                Sweden               Austria 
##                    16                    17                    18 
##     Cyprus (Republic)        Czech Republic               Estonia 
##                    19                    20                    21 
##               Hungary                Latvia             Lithuania 
##                    22                    23                    24 
##                 Malta                Poland              Slovakia 
##                    25                    26                    27 
##              Slovenia              Bulgaria               Romania 
##                    28                    29                    30 
##                Turkey               Croatia       Cyprus (CY-TCC) 
##                    31                    32                    33 
##     Macedonia (FYROM) 
##                    34

# Select West Germany
Eurobar_WG <- subset(dataset_case_study,subset=(v6==4))

We are interested in the variables on alcohol use (alc: [v150-v154]) and on physical activity (pa: [v119-v122]). Furthermore, you will need information on age (v331), gender (v330), and country (v6). Names of the variables in relation to original Eurobarometer coding are given in table 1 of the exercises of Day 5.

1b

Make a smaller dataset for you country with only the variables mentioned above

# Select variables of interest
WG <- subset(Eurobar_WG,select=paste("v",c(6,150:154,119:122,331,330),sep=""))

1c

Rename the variables to the names mentioned in table 1 on Day 5

#Rename variables
names(WG) <- c("country","alc_12m","alcfreq_5dr","alc_30da","alcfreq_30d","alc_am_dd","pa7d_work","pa7d_mov","pa7d_house","pa7d_recr","age","gender")

1d

Describe the data, using R to help you find information on number of observations, number of variables, number of missing values for each variable.

View(WG) # View the data
nrow(WG) # Check number of observations

## [1] 1004

summary(WG) # Check summary

##     country     alc_12m       alcfreq_5dr      alc_30da     alcfreq_30d   
##  Min.   :4   Min.   :1.000   Min.   :1.00   Min.   :1.00   Min.   :1.000  
##  1st Qu.:4   1st Qu.:1.000   1st Qu.:2.00   1st Qu.:1.00   1st Qu.:3.000  
##  Median :4   Median :1.000   Median :3.00   Median :1.00   Median :4.000  
##  Mean   :4   Mean   :1.221   Mean   :3.25   Mean   :1.13   Mean   :3.708  
##  3rd Qu.:4   3rd Qu.:1.000   3rd Qu.:5.00   3rd Qu.:1.00   3rd Qu.:5.000  
##  Max.   :4   Max.   :2.000   Max.   :5.00   Max.   :2.00   Max.   :6.000  
##              NA's   :3       NA's   :228    NA's   :226    NA's   :334    
##    alc_am_dd       pa7d_work        pa7d_mov       pa7d_house   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.000   Median :4.000   Median :2.000   Median :2.000  
##  Mean   :2.257   Mean   :2.946   Mean   :1.952   Mean   :1.889  
##  3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :7.000   Max.   :4.000   Max.   :4.000   Max.   :4.000  
##  NA's   :332     NA's   :49                                     
##    pa7d_recr          age            gender     
##  Min.   :1.000   Min.   :15.00   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:35.00   1st Qu.:1.000  
##  Median :2.000   Median :51.00   Median :2.000  
##  Mean   :2.344   Mean   :50.49   Mean   :1.549  
##  3rd Qu.:3.000   3rd Qu.:67.00   3rd Qu.:2.000  
##  Max.   :4.000   Max.   :94.00   Max.   :2.000  
##  NA's   :3

missing_counts <- sapply(WG, function(x) sum(is.na(x))) # Check missing values
print(missing_counts)

##     country     alc_12m alcfreq_5dr    alc_30da alcfreq_30d   alc_am_dd 
##           0           3         228         226         334         332 
##   pa7d_work    pa7d_mov  pa7d_house   pa7d_recr         age      gender 
##          49           0           0           3           0           0

Summarize the description here:

The number of observations is 1004 and there are 12 variables in the dataset. The missing counts table as obtained by the command above shows the number of missing values for each variable.

1e

Determine which of variables are actually categorical. Hint: determine the number of unique values per variable.

sapply(WG, function(x) length(unique(x))) # Function to determine the number of unique values per variable

##     country     alc_12m alcfreq_5dr    alc_30da alcfreq_30d   alc_am_dd 
##           1           3           6           3           7           8 
##   pa7d_work    pa7d_mov  pa7d_house   pa7d_recr         age      gender 
##           5           4           4           5          77           2

str(WG) # Check variable type

## tibble [1,004 × 12] (S3: tbl_df/tbl/data.frame)
##  $ country    : dbl+lbl [1:1004] 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,...
##    ..@ label        : chr "NATION - ALL SAMPLES"
##    ..@ format.spss  : chr "F2.0"
##    ..@ display_width: int 4
##    ..@ labels       : Named num [1:34] 1 2 3 4 5 6 7 8 9 10 ...
##    .. ..- attr(*, "names")= chr [1:34] "France" "Belgium" "The Netherlands" "Germany West" ...
##  $ alc_12m    : dbl+lbl [1:1004] 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1,...
##    ..@ label        : chr "QC1A DRINKING ALCOHOL - PAST 12 MONTHS"
##    ..@ format.spss  : chr "F1.0"
##    ..@ display_width: int 6
##    ..@ labels       : Named num [1:4] 1 2 3 9
##    .. ..- attr(*, "names")= chr [1:4] "Yes" "No" "DK/Refusal" "Inap. (not 1 to 30 in V6)"
##  $ alcfreq_5dr: dbl+lbl [1:1004] NA,  5, NA,  5,  5,  4,  5,  5,  4,  4,  5,  5, NA, N...
##    ..@ label        : chr "QC1B DRINKING ALCOHOL - 5 OR MORE DRINKS"
##    ..@ format.spss  : chr "F2.0"
##    ..@ display_width: int 6
##    ..@ labels       : Named num [1:8] 1 2 3 4 5 6 9 99
##    .. ..- attr(*, "names")= chr [1:8] "Several times a week" "Once a week" "Once a month" "Less than once a month" ...
##  $ alc_30da   : dbl+lbl [1:1004] NA,  1, NA,  1,  2,  1,  1,  1,  2,  2,  2,  1, NA, N...
##    ..@ label        : chr "QC1C DRINKING ALCOHOL - LAST 30 DAYS"
##    ..@ format.spss  : chr "F2.0"
##    ..@ display_width: int 6
##    ..@ labels       : Named num [1:5] 1 2 3 9 99
##    .. ..- attr(*, "names")= chr [1:5] "Yes" "No" "DK/Refusal" "Inap. (not 1 in V150)" ...
##  $ alcfreq_30d: dbl+lbl [1:1004] NA,  3, NA,  6, NA,  5,  2,  3, NA, NA, NA,  5, NA, N...
##    ..@ label        : chr "QC2 DRINKING ALCOHOL - FREQ LAST 30 DAYS"
##    ..@ format.spss  : chr "F2.0"
##    ..@ display_width: int 6
##    ..@ labels       : Named num [1:9] 1 2 3 4 5 6 7 9 99
##    .. ..- attr(*, "names")= chr [1:9] "Daily" "4 - 5 times a week" "2 - 3 times a week" "Once a week" ...
##  $ alc_am_dd  : dbl+lbl [1:1004] NA,  2, NA,  1, NA,  2,  2,  1, NA, NA, NA,  2, NA, N...
##    ..@ label        : chr "QC3 DRINKING ALCOHOL - N OF DRINKS USUALLY"
##    ..@ format.spss  : chr "F2.0"
##    ..@ display_width: int 6
##    ..@ labels       : Named num [1:10] 1 2 3 4 5 6 7 8 9 99
##    .. ..- attr(*, "names")= chr [1:10] "Less than 1 drink" "1 - 2 drinks" "3 - 4 drinks" "5 - 6 drinks" ...
##  $ pa7d_work  : dbl+lbl [1:1004]  4,  3,  4,  4, NA,  3,  4, NA,  2, NA, NA,  1, NA,  ...
##    ..@ label        : chr "QA4 PHYS ACTIVITY LAST 7 DAYS: AT WORK"
##    ..@ format.spss  : chr "F1.0"
##    ..@ display_width: int 6
##    ..@ labels       : Named num [1:6] 1 2 3 4 5 9
##    .. ..- attr(*, "names")= chr [1:6] "A lot" "Some" "Little" "None" ...
##  $ pa7d_mov   : dbl+lbl [1:1004] 3, 3, 1, 2, 2, 2, 1, 1, 2, 1, 4, 2, 2, 2, 1, 2, 2, 4,...
##    ..@ label        : chr "QA4 PHYS ACTIVITY LAST 7 DAYS: MOVING"
##    ..@ format.spss  : chr "F1.0"
##    ..@ display_width: int 6
##    ..@ labels       : Named num [1:6] 1 2 3 4 5 9
##    .. ..- attr(*, "names")= chr [1:6] "A lot" "Some" "Little" "None" ...
##  $ pa7d_house : dbl+lbl [1:1004] 2, 1, 3, 1, 1, 2, 4, 1, 2, 1, 4, 2, 2, 2, 2, 2, 1, 3,...
##    ..@ label        : chr "QA4 PHYS ACTIVITY LAST 7 DAYS: AT HOME"
##    ..@ format.spss  : chr "F1.0"
##    ..@ display_width: int 6
##    ..@ labels       : Named num [1:6] 1 2 3 4 5 9
##    .. ..- attr(*, "names")= chr [1:6] "A lot" "Some" "Little" "None" ...
##  $ pa7d_recr  : dbl+lbl [1:1004] 4, 1, 2, 1, 4, 4, 2, 1, 2, 1, 4, 4, 4, 3, 2, 4, 1, 4,...
##    ..@ label        : chr "QA4 PHYS ACTIVITY LAST 7 DAYS: RECREATION"
##    ..@ format.spss  : chr "F1.0"
##    ..@ display_width: int 6
##    ..@ labels       : Named num [1:6] 1 2 3 4 5 9
##    .. ..- attr(*, "names")= chr [1:6] "A lot" "Some" "Little" "None" ...
##  $ age        : num [1:1004] 79 50 56 35 70 63 73 66 29 55 ...
##   ..- attr(*, "label")= chr "D11 AGE EXACT"
##   ..- attr(*, "format.spss")= chr "F2.0"
##   ..- attr(*, "display_width")= int 6
##  $ gender     : dbl+lbl [1:1004] 1, 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2,...
##    ..@ label        : chr "D10 GENDER"
##    ..@ format.spss  : chr "F1.0"
##    ..@ display_width: int 6
##    ..@ labels       : Named num [1:2] 1 2
##    .. ..- attr(*, "names")= chr [1:2] "Male" "Female"

summary(WG) # Check summary

##     country     alc_12m       alcfreq_5dr      alc_30da     alcfreq_30d   
##  Min.   :4   Min.   :1.000   Min.   :1.00   Min.   :1.00   Min.   :1.000  
##  1st Qu.:4   1st Qu.:1.000   1st Qu.:2.00   1st Qu.:1.00   1st Qu.:3.000  
##  Median :4   Median :1.000   Median :3.00   Median :1.00   Median :4.000  
##  Mean   :4   Mean   :1.221   Mean   :3.25   Mean   :1.13   Mean   :3.708  
##  3rd Qu.:4   3rd Qu.:1.000   3rd Qu.:5.00   3rd Qu.:1.00   3rd Qu.:5.000  
##  Max.   :4   Max.   :2.000   Max.   :5.00   Max.   :2.00   Max.   :6.000  
##              NA's   :3       NA's   :228    NA's   :226    NA's   :334    
##    alc_am_dd       pa7d_work        pa7d_mov       pa7d_house   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.000   Median :4.000   Median :2.000   Median :2.000  
##  Mean   :2.257   Mean   :2.946   Mean   :1.952   Mean   :1.889  
##  3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :7.000   Max.   :4.000   Max.   :4.000   Max.   :4.000  
##  NA's   :332     NA's   :49                                     
##    pa7d_recr          age            gender     
##  Min.   :1.000   Min.   :15.00   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:35.00   1st Qu.:1.000  
##  Median :2.000   Median :51.00   Median :2.000  
##  Mean   :2.344   Mean   :50.49   Mean   :1.549  
##  3rd Qu.:3.000   3rd Qu.:67.00   3rd Qu.:2.000  
##  Max.   :4.000   Max.   :94.00   Max.   :2.000  
##  NA's   :3

Enter your answer here below:

Every variable except age is categorical, but in the dataset they are regarded as continuous variables.

1f

Create a function that can convert the numerical variables that are categorical ones into factors and drop the levels that are non-informative (determine from the labels). Use the solution you found on day 5, where you already converted numerical variables into categorical ones. Function input arguments should be the variable and the levels to be dropped / set to missing. Inside the body of the function, use the variable to determine its levels and labels. Output should be the factor variable with the right labels and without non-informative levels.

convert_to_factors <- function(WG) {
  # Convert the numerical variables that are categorical into factors 
  WG$country_f <- as_factor(WG$country)
  WG$alc_12m_f <- as_factor(WG$alc_12m)
  WG$alcfreq_5dr_f <- as_factor(WG$alcfreq_5dr)
  WG$alc_30da_f <- as_factor(WG$alc_30da)
  WG$alcfreq_30d_f <- as_factor(WG$alcfreq_30d)
  WG$alc_am_dd_f <- as_factor(WG$alc_am_dd)
  WG$pa7d_work_f <- as_factor(WG$pa7d_work)
  WG$pa7d_mov_f <- as_factor(WG$pa7d_mov)
  WG$pa7d_house_f <- as_factor(WG$pa7d_house)
  WG$pa7d_recr_f <- as_factor(WG$pa7d_recr)
  WG$gender_f <- as_factor(WG$gender)

  # Specify the levels that are non-informative
  na_values <- list(
    alc_12m_f = c("DK/Refusal", "Inap. (not 1 to 30 in V6)"),
    alcfreq_5dr_f = c("DK/Refusal", "Inap. (not 1 in V150)", "Inap. (not 1 to 30 in V6)"),
    alc_30da_f = c("DK/Refusal", "Inap. (not 1 in V150)", "Inap. (not 1 to 30 in V6)"),
    alcfreq_30d_f = c("Don't remember/Refusal (SPONT.)", "Inap. (not 1 in V152)", "Inap. (not 1 to 30 in V6)"),
    alc_am_dd_f = c("It depends (SPONT.)", "DK/ Refusal", "Inap. (not 1 in V152)", "Inap. (not 1 to 30 in V6)"),
    pa7d_work_f = c("DK", "Inap. (not 1 to 30 in V6)"),
    pa7d_mov_f = c("DK", "Inap. (not 1 to 30 in V6)"),
    pa7d_house_f = c("DK", "Inap. (not 1 to 30 in V6)"),
    pa7d_recr_f = c("DK", "Inap. (not 1 to 30 in V6)")
  )

  # Convert non-informative levels to NA
  for (var in names(na_values)) {
    if (var %in% names(WG)) {
      current_levels <- levels(WG[[var]])
          na_indices <- which(current_levels %in% na_values[[var]])
            current_levels[na_indices] <- NA
              levels(WG[[var]]) <- current_levels
    }
  }

  return(WG)
}

1g

Use the function created in 1f to convert the categorical variables determined in exercise 1e to factors and check if the function worked correctly. NB. Use different names for the new variables to avoid that the original one is lost! In the remainder we will however refer to the original names.

# Check if the conversion worked
WG <- convert_to_factors(WG) 
summary(WG)

##     country     alc_12m       alcfreq_5dr      alc_30da     alcfreq_30d   
##  Min.   :4   Min.   :1.000   Min.   :1.00   Min.   :1.00   Min.   :1.000  
##  1st Qu.:4   1st Qu.:1.000   1st Qu.:2.00   1st Qu.:1.00   1st Qu.:3.000  
##  Median :4   Median :1.000   Median :3.00   Median :1.00   Median :4.000  
##  Mean   :4   Mean   :1.221   Mean   :3.25   Mean   :1.13   Mean   :3.708  
##  3rd Qu.:4   3rd Qu.:1.000   3rd Qu.:5.00   3rd Qu.:1.00   3rd Qu.:5.000  
##  Max.   :4   Max.   :2.000   Max.   :5.00   Max.   :2.00   Max.   :6.000  
##              NA's   :3       NA's   :228    NA's   :226    NA's   :334    
##    alc_am_dd       pa7d_work        pa7d_mov       pa7d_house   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.000   Median :4.000   Median :2.000   Median :2.000  
##  Mean   :2.257   Mean   :2.946   Mean   :1.952   Mean   :1.889  
##  3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :7.000   Max.   :4.000   Max.   :4.000   Max.   :4.000  
##  NA's   :332     NA's   :49                                     
##    pa7d_recr          age            gender                country_f   
##  Min.   :1.000   Min.   :15.00   Min.   :1.000   Germany West   :1004  
##  1st Qu.:1.000   1st Qu.:35.00   1st Qu.:1.000   France         :   0  
##  Median :2.000   Median :51.00   Median :2.000   Belgium        :   0  
##  Mean   :2.344   Mean   :50.49   Mean   :1.549   The Netherlands:   0  
##  3rd Qu.:3.000   3rd Qu.:67.00   3rd Qu.:2.000   Italy          :   0  
##  Max.   :4.000   Max.   :94.00   Max.   :2.000   Luxembourg     :   0  
##  NA's   :3                                       (Other)        :   0  
##  alc_12m_f                 alcfreq_5dr_f alc_30da_f             alcfreq_30d_f
##  Yes :780   Several times a week  :113   Yes :677   Daily              : 62  
##  No  :221   Once a week           :168   No  :101   4 - 5 times a week : 67  
##  NA's:  3   Once a month          :114   NA's:226   2 - 3 times a week :145  
##             Less than once a month:174              Once a week        :206  
##             Never                 :207              2 - 3 times a month:111  
##             NA's                  :228              Once               : 79  
##                                                     NA's               :334  
##             alc_am_dd_f  pa7d_work_f   pa7d_mov_f  pa7d_house_f pa7d_recr_f 
##  Less than 1 drink:129   A lot :205   A lot :364   A lot :402   A lot :318  
##  1 - 2 drinks     :383   Some  :137   Some  :390   Some  :381   Some  :269  
##  3 - 4 drinks     : 96   Little:118   Little:184   Little:151   Little:166  
##  5 - 6 drinks     : 28   None  :495   None  : 66   None  : 70   None  :248  
##  7 - 9 drinks     : 10   NA's  : 49                             NA's  :  3  
##  10 drinks or more: 10                                                      
##  NA's             :348                                                      
##    gender_f  
##  Male  :453  
##  Female:551  
##              
##              
##              
##              
##

The function worked, as it converted all variables except age to factor and removed inappropriate labels.

1h

Calculate descriptive statistics of all variables. Think well on what kind of statistic that would be depending on the type of variable.

hist(WG$age) #Check distribution of age (continuous)

summary(WG$age) #Descriptive statistics of age

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   35.00   51.00   50.49   67.00   94.00

median(WG$age) #Calculate median for age because of non-normal distribution

## [1] 51

IQR(WG$age) #Calculate IQR for age because of non-normal distribution

## [1] 32

We calculated the median and IQR for age since the histogram showed a non-normal distribution.

#Calculate descriptive statistics for all categorical variables 
calculate_proportions_and_frequencies_table <- function(dataset) {
  variables <- c("alc_12m_f", "alcfreq_5dr_f", "alc_30da_f", "alcfreq_30d_f", 
                 "alc_am_dd_f", "pa7d_work_f", "pa7d_mov_f", "pa7d_house_f", 
                 "pa7d_recr_f", "gender_f")
  result_list <- list()
  for (var in variables) {
    freq <- table(dataset[[var]])
    prop <- prop.table(freq)
    
    df <- data.frame(
      Category = names(freq),
      Frequency = as.vector(freq),
      Proportion = as.vector(prop)
    )
    
    result_list[[var]] <- df
  }
  
  for (var in variables) {
    cat("\nTable for", var, ":\n")
    print(result_list[[var]])
    cat("\n------------------------\n")
  }
}

calculate_proportions_and_frequencies_table(WG)

## 
## Table for alc_12m_f :
##   Category Frequency Proportion
## 1      Yes       780  0.7792208
## 2       No       221  0.2207792
## 
## ------------------------
## 
## Table for alcfreq_5dr_f :
##                 Category Frequency Proportion
## 1   Several times a week       113  0.1456186
## 2            Once a week       168  0.2164948
## 3           Once a month       114  0.1469072
## 4 Less than once a month       174  0.2242268
## 5                  Never       207  0.2667526
## 
## ------------------------
## 
## Table for alc_30da_f :
##   Category Frequency Proportion
## 1      Yes       677  0.8701799
## 2       No       101  0.1298201
## 
## ------------------------
## 
## Table for alcfreq_30d_f :
##              Category Frequency Proportion
## 1               Daily        62 0.09253731
## 2  4 - 5 times a week        67 0.10000000
## 3  2 - 3 times a week       145 0.21641791
## 4         Once a week       206 0.30746269
## 5 2 - 3 times a month       111 0.16567164
## 6                Once        79 0.11791045
## 
## ------------------------
## 
## Table for alc_am_dd_f :
##            Category Frequency Proportion
## 1 Less than 1 drink       129 0.19664634
## 2      1 - 2 drinks       383 0.58384146
## 3      3 - 4 drinks        96 0.14634146
## 4      5 - 6 drinks        28 0.04268293
## 5      7 - 9 drinks        10 0.01524390
## 6 10 drinks or more        10 0.01524390
## 
## ------------------------
## 
## Table for pa7d_work_f :
##   Category Frequency Proportion
## 1    A lot       205  0.2146597
## 2     Some       137  0.1434555
## 3   Little       118  0.1235602
## 4     None       495  0.5183246
## 
## ------------------------
## 
## Table for pa7d_mov_f :
##   Category Frequency Proportion
## 1    A lot       364 0.36254980
## 2     Some       390 0.38844622
## 3   Little       184 0.18326693
## 4     None        66 0.06573705
## 
## ------------------------
## 
## Table for pa7d_house_f :
##   Category Frequency Proportion
## 1    A lot       402 0.40039841
## 2     Some       381 0.37948207
## 3   Little       151 0.15039841
## 4     None        70 0.06972112
## 
## ------------------------
## 
## Table for pa7d_recr_f :
##   Category Frequency Proportion
## 1    A lot       318  0.3176823
## 2     Some       269  0.2687313
## 3   Little       166  0.1658342
## 4     None       248  0.2477522
## 
## ------------------------
## 
## Table for gender_f :
##   Category Frequency Proportion
## 1     Male       453  0.4511952
## 2   Female       551  0.5488048
## 
## ------------------------

1i

Make a bar chart of pa7d_work to visualize this variable. Use both hist() and barplot() as a command to accomplish this and describe which plot is preferred.
N.B. using barplot(pa7d_work) won’t properly display the variable. Check what input the function needs.

#Making histogram of pa7d_work 
ggplot(WG, aes(x = pa7d_work)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "black", na.rm = TRUE) + 
  labs(title = "Histogram of Physical activity at work", x = "Physical activity at work (pa7d_work)", y = "Frequency") +
  theme_minimal()

#Making barplot of pa7d_work 
ggplot(WG, aes(x = pa7d_work)) +
  geom_bar(fill = "lightblue", color = "black", na.rm = TRUE) +  
  labs(title = "Barplot of Physical activity at work during the last 7 days", 
       x = "Physical activity at work (pa7d_work)", 
       y = "Frequency") +
  theme_minimal()

Which one do you prefer?

We prefer a bar plot, since the variable pa7d_work is a categorical variable. For categorical variables a bar plot is the first choice to represent the frequencies for each group separately. A histogram is preferred for continuous variables.

Exercise 2

Since we are interested in problematic alcohol use, the data on alcohol use will have to be processed to produce a new binary variable “problematic alcohol use”. Alcohol use may be defined as problematic when it is chronically too much (more than 40 g on average per day for male or 20 g per day for female persons) or when too much is consumed on a single occasion, i.e. more than 5 glasses (binge drinking). Questions on both issues are in the dataset. Define someone as a problematic drinker if he/she either chronically drinks too much and/or does binge drinking. You will now produce a new variable “problematic drinker” in a couple of steps. Do this for your own small dataset with only the records from your chosen country.

PROBLEMATIC ALCOHOL USE

2a

Define a new numeric variable “number of drinking days/month” (e.g. called n_drinksmonth) and use the information from alc_30da and freq_30d to fill this variable with numeric values on the number of days a persons was drinking alcohol per month. You may need to make some pragmatic assumptions here. E.g. concerning the number of days in category “Daily” should be converted to 30.

State your assumptions in a comment. There are many approaches exist to do this conversion, e.g. like in exercise 24 of Day 1, using the ifelse() function, or using the revalue() or mapvalues() functions from plyr package.

NB. There are multiple solutions to do this conversion. If you want to use conditional statements, remember that if() does not work on a vector, while ifelse() does. Alternatively use mapply() for loop or another solution.

#Define numeric variable n_drinksmonth
WG$n_drinksmonth <- ifelse(
  WG$alc_30da_f == "Yes",
  ifelse(WG$alcfreq_30d_f == "Daily", 30,  
    ifelse(WG$alcfreq_30d_f == "4 - 5 times a week", 18,  
      ifelse(WG$alcfreq_30d_f == "2 - 3 times a week", 10,
        ifelse(WG$alcfreq_30d_f == "Once a week", 4,                                
          ifelse(WG$alcfreq_30d_f == "2 - 3 times a month", 2.5, 
            ifelse(WG$alcfreq_30d_f == "Once", 1, 0)
          )
        )
      )
    )
  ),
  0
)

2b

Define a new variable “quantity of alcohol (in grams) consumed on a drinking day” (e.g. called quantday) and use the information from alc_30da and alc_am_dd to do so. Use the following assumption: 1 drink/consumption contains 12 g alcohol.

#Convert variable alc_am_dd to a numeric variable 
WG$alc_am_dd_numeric <- ifelse(WG$alc_am_dd_f == "Less than 1 drink", 0.5,
                        ifelse(WG$alc_am_dd_f == "1 - 2 drinks", 1.5,
                        ifelse(WG$alc_am_dd_f == "3 - 4 drinks", 3.5,
                        ifelse(WG$alc_am_dd_f == "5 - 6 drinks", 5.5,
                        ifelse(WG$alc_am_dd_f == "7 - 9 drinks", 8,
                        ifelse(WG$alc_am_dd_f == "10 drinks or more", 10, NA))))))

#Compute variable quantday (quantity of alcohol in g consumed on a drinking day)
WG$quantday <- ifelse(
  WG$alc_30da_f == "Yes",  # Only calculate when alc_30_da_f is "Yes" (drinking alcohol in the last 30 days)
  WG$alc_am_dd_numeric * 12,  # Multiply by 12 (grams of alcohol per drink) 
  0  # If alc_30_da_f is "No", than quantday is 0
)

#Check the first values of the new variable quantday
head(WG$quantday)

## [1] NA 18 NA  6  0 18

In this case we state that if alc_30da == 2 (no alcohol is consumed in the last 30 days), the return will be 0 for quantday, and if alc_30da == 1 (alcohol consumed in the last 30 days), the return will be the number of drinks multiplied by 12 g to calculate the total grams of alcohol consumed.Hereby we assume that the number of drinks on a drinking day is the average of the range given.

2c

Now calculate a third new variable “average quantity of alcohol (in grams) per day over the last month”, multiplying the two variables above and dividing the result by 30.

# Calculate the average quantity of alcohol (in grams) per day over the last month
WG$average_quant_day <- (WG$quantday * WG$n_drinksmonth) / 30

#Check the first values of the new variable 
head(WG$average_quant_day)

## [1]  NA 6.0  NA 0.2 0.0 1.5

2d

Make a histogram or barplot of daily intake created in exercise 2c.

# Checking the structure of the new variable average_quant_day 
str(WG$average_quant_day)

##  num [1:1004] NA 6 NA 0.2 0 1.5 10.8 2 0 0 ...

summary(WG$average_quant_day)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.800   2.400   5.702   6.000  66.000     252

#Remove NA values of the new variable by creating a clean dataset
WG_clean <- WG %>%  filter(!is.na(average_quant_day))

#Perform a barplot 
ggplot(WG_clean, aes(x = as.factor(average_quant_day))) +
  geom_bar(fill = "lightblue", color = "black", width = 0.8) +  
  labs(title = "Barplot of daily alcohol intake (average_quant_day)", 
       x = "Average quantity of alcohol per day (grams)", 
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

We chose to make a barplot instead of a histogram since the variable average_quant_day is discrete and therefore a barplot is preferred.

Do the results make sense?

The barplot shows that there is a large spread in the distribution of average quantity of alcohol a day. These results make sense, since the number of drinks during drinking days varies a lot, from less than 1 drinks a day to 10 drinks or more. Since this was multiplied by 12 (for grams), the total quantity shows a distribution between 0 and 66 grams of alcohol.

Is this a continious variable? Explain why (not)

No, the variable is discrete instead of continuous, since it can not take all possible values. This is because the categories of number of drinks during drinking days was limited to a maximum of 10 (10 drinks or more). Therefore the total amount of alcohol in g can be max. 120 g. Another argument is the specified range in number of drinks (e.g. less than 1 or 7-9), for which we had to make assumptions by using the average value.

2e

Do the same as in exercise 2d, but now for men and women separately, in a single plot.
Hint: think of the trellis plots from the lattice package.

# Perform a bar chart
barchart(average_quant_day ~ factor(gender, levels = c(1, 2), labels = c("Male", "Female")), 
         data = WG_clean, 
         main = "Barplot of Average Alcohol Intake per Day by Gender",
         xlab = "Gender",
         ylab = "Average Quantity of Alcohol per Day (grams)",
         col = "lightblue",
         auto.key = TRUE,
         scales = list(y = list(relation = "free")), 
         layout = c(1, 1))

2f

Compute descriptive statistics for the daily (average) alcohol use in grams.

calculate_average_quant_day_table <- function(dataset) {
  freq <- table(WG_clean$average_quant_day)
  prop <- prop.table(freq)
  
  df <- data.frame(
    Category = names(freq),
    Frequency = as.vector(freq),
    Proportion = as.vector(prop)
  )
  
  cat("\nTable for average_quant_day:\n")
  print(df)
  cat("\n------------------------\n")
}

calculate_average_quant_day_table(WG)

## 
## Table for average_quant_day:
##    Category Frequency  Proportion
## 1         0       101 0.134308511
## 2       0.2        32 0.042553191
## 3       0.5        22 0.029255319
## 4       0.6        32 0.042553191
## 5       0.8        33 0.043882979
## 6       1.4         8 0.010638298
## 7       1.5        64 0.085106383
## 8         2        22 0.029255319
## 9       2.2         5 0.006648936
## 10      2.4       117 0.155585106
## 11      3.2         1 0.001329787
## 12      3.5        15 0.019946809
## 13      3.6        10 0.013297872
## 14      5.5         5 0.006648936
## 15      5.6        30 0.039893617
## 16        6        92 0.122340426
## 17        8         2 0.002659574
## 18      8.8        10 0.013297872
## 19       10         1 0.001329787
## 20     10.8        46 0.061170213
## 21     12.8         4 0.005319149
## 22       14        25 0.033244681
## 23       16         6 0.007978723
## 24       18        38 0.050531915
## 25       22         3 0.003989362
## 26     25.2         9 0.011968085
## 27       32         3 0.003989362
## 28       40         3 0.003989362
## 29       42         9 0.011968085
## 30       66         4 0.005319149
## 
## ------------------------

2g

Create a variable “heavy drinker” for men and women, using the cut-off values mentioned above (40 and 20 g daily, respectively).

WG$heavy_drinker <- ifelse(
  (WG$gender == 1 & WG$average_quant_day > 40) |  # For men the cut-off is set at above 40g a day
  (WG$gender == 2 & WG$average_quant_day > 20),   # For women the cut-off is set at above 20g a day
  "Yes", # If the condition is met, the person is categorized as a heavy drinker
  "No"  # If the condition is not met, the person is not categorized as a heavy drinker
)

Now consider binge drinking in addition to drinking too much

2h

Define a new variable “binge” that gets value 1 if someone is a binge drinker (i.e. >=5 glasses or 60 gram alcohol as defined in 2b on a single occasion) and value 0 else. This has the same definition for men and women.

#Compute new variable "binge"
WG$binge <- ifelse(
  WG$alc_am_dd_f %in% c("5 - 6 drinks", "7 - 9 drinks", "10 drinks or more") |
  WG$quantday >= 60, 
  1,  
  0  
)

#Add labels
WG$binge <- factor(WG$binge, levels = c(0, 1), labels = c("No", "Yes"))

2i

Use these two variables (heavy drinker and binge) to fill a new binary variable “problem drinker” using logistic operators (|, &) in the appropriate way.

#Compute new variable problem_drinker 
WG$problem_drinker <- ifelse( # Create new variable on problem drinker
  WG$heavy_drinker == "Yes" | WG$binge == "Yes", # Specify values when someone is a problem drinker
  1, 
  0 
)

#Add labels
WG$problem_drinker <- factor(WG$problem_drinker, levels = c(0, 1), labels = c("No", "Yes"))

2j

Check the contents of this new variable

#Check contents of variable problem_drinker  
summary(WG$problem_drinker)

##   No  Yes NA's 
##  695   58  251

table(WG$problem_drinker, useNA = "always")

## 
##   No  Yes <NA> 
##  695   58  251

Does it make sense to you?

Yes, because there are three groups of people for whom problem_drinker can be “Yes”: people who are both binge drinker and heavy drinker (n= 4), people who are only a binge drinker (n=44) and people who are only a heavy drinker (n=10). The sum equals the total number of problem drinkers (n=58).

PHYSICAL ACTIVITY

2k

Define a new variable “total physical activity”, adding relevant information from all “pa variables” on physical activity in various settings. Clearly explain your definitions and assumptions. Categorize people as “inactive” or “moderately active” or “active”.

#Define variable total physical activity (total_pa)
WG$total_pa <- apply(WG[, c("pa7d_work_f", "pa7d_mov_f", "pa7d_house_f", "pa7d_recr_f")], 1, function(x) {
  active_count <- sum(x %in% c("A lot"))
  inactive_count <- sum(x %in% c("None"))
  
  if (active_count >= 2) {
    return("Active")
  }
  
  if (inactive_count >= 2) {
    return("Inactive")
  }
  return("Moderately active")
})

#Checking content of new variable total_pa 
table(WG$total_pa)

## 
##            Active          Inactive Moderately active 
##               403               187               414

While creating the new variable ‘total physical activity’ we did the following assumptions: ‘Inactive’ people have responded ‘None’ on 2 ore more PA variable, ‘Active’ people have responded ‘A lot’ on 2 or more PA variables. People who did not fulfil any of the previous criteria were categorized as ‘moderately active’.

Exercise 3 Analyzing the data

Now these new variables will be used in a regression analysis to analyze the effects of age, gender, and physical activity on daily alcohol use and the assumptions of this analysis will be tested. Please remember that we are using categorical variables here and therefore you have check how your data have been stored in R, as numeric values or factors, e.g. by using summary() or table(). In case of the first, the variable will be considered numeric. Sometimes it matters how your data are stored and will it be worth trying the alternative if you run into trouble or your results don’t make sense.

3a

Create a table of gender versus problem drinkers (yes/no). Apply a test to check for differences in percentages of problematic drinking between males and females.

#Creating table of gender vs problem drinkers excluding NA values  
summary(WG$gender_f)

##   Male Female 
##    453    551

summary(na.omit(WG$problem_drinker))

##  No Yes 
## 695  58

#Create a contingency table for gender and problem drinking 
gender_problem_drinker_table <- table(WG$gender_f, (WG$problem_drinker))
table(WG$gender_f, (WG$problem_drinker))

##         
##           No Yes
##   Male   325  46
##   Female 370  12

#Chi-square test for testing differences in problem drinking between males and females
chisq.test(gender_problem_drinker_table)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  gender_problem_drinker_table
## X-squared = 21.405, df = 1, p-value = 3.717e-06

How do you interpret this analysis?

The summary() function shows that when using ‘gender_f’, both the ‘problem_drinker’ and ‘gender_f’ are stored as factors, while ‘gender’ is a numeric variable. So, we use ‘gender_f’. Since both problem drinking and gender are categorical variables, a chi-square test was conducted to assess whether the distribution of problem drinking differs across genders. The test yielded a p-value of 3.717e-06, which is below the significance threshold of 0.05. Therefore, the null hypothesis of independence between gender and problem drinking is rejected, indicating that there is a statistically significant association between the two variables in this dataset.

3b

Do the same for the variable on physical activity created in 2k versus problem drinkers, that is present and test for differences in problematic drinking in the three different categories of physical activity.

#Creating table of total_pa vs problem drinkers 
table(WG$total_pa)

## 
##            Active          Inactive Moderately active 
##               403               187               414

#Create a contingency table for total_pa and problem drinking 
total_pa_problem_drinker_table <- table(WG$total_pa, WG$problem_drinker)
print(total_pa_problem_drinker_table)

##                    
##                      No Yes
##   Active            290  29
##   Inactive          115   9
##   Moderately active 290  20

#Chi-square test for testing differences in problematic drinking in the 3 PA-categories
chisq.test(total_pa_problem_drinker_table)

## 
##  Pearson's Chi-squared test
## 
## data:  total_pa_problem_drinker_table
## X-squared = 1.5817, df = 2, p-value = 0.4535

Since both total physical activity and problem drinking are categorical variables, a chi-square test was conducted to examine whether the distribution of problem drinking differs across physical activity levels. The test yielded a p-value of 0.4535, which is above the significance threshold of 0.05. Therefore, the null hypothesis of independence between physical activity and problem drinking cannot be rejected, indicating that there is no statistically significant association between the two variables in this dataset.

3c

Make a graph of percentages of problem drinkers versus age (in 10-year age categories). Here, think about what would be most informative: percentage of total or percentages per age category. Test whether the distribution of age (continuous) is different between problem drinkers and non-problem drinkers. Check the summary statistics between the two groups.

#Checking the summary and range of the variable age 
summary(WG$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   35.00   51.00   50.49   67.00   94.00

#Convert age into 10-year categories (10-20, 20-30, ..., 90-100). Say right = FALSE to include 10 but exclude 20 in e.g. 10-20 as age group.
WG$age_groups <-cut(WG$age, breaks = seq(10, 100, by = 10), labels = paste(seq(10, 90, by = 10), seq(20, 100, by = 10), sep = "-"), right = FALSE)

#Check the frequency distribution of age groups
table (WG$age_groups)

## 
##  10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90 90-100 
##     56    132    117    167    174    159    144     52      3

#Print table of percentages of problem drinking for age groups 
age_problem_drinker_table <- table(WG$age_groups, WG$problem_drinker)
table(WG$age_groups, WG$problem_drinker)

##         
##           No Yes
##   10-20   32  10
##   20-30   94  17
##   30-40   78   9
##   40-50  120   8
##   50-60  129   7
##   60-70  110   4
##   70-80  101   2
##   80-90   31   1
##   90-100   0   0

age_problem_drinker_perc <- prop.table(age_problem_drinker_table, 1) * 100
print(age_problem_drinker_perc)

##         
##                 No       Yes
##   10-20  76.190476 23.809524
##   20-30  84.684685 15.315315
##   30-40  89.655172 10.344828
##   40-50  93.750000  6.250000
##   50-60  94.852941  5.147059
##   60-70  96.491228  3.508772
##   70-80  98.058252  1.941748
##   80-90  96.875000  3.125000
##   90-100

#Create a bar plot to show the percentage of problem drinkers for each age group, excluding NA 
age_problem_drinker_perc_df <- as.data.frame(age_problem_drinker_perc)

age_problem_drinker_perc_clean <- age_problem_drinker_perc_df[!is.na(age_problem_drinker_perc_df$Var2) & !is.na(age_problem_drinker_perc_df$Freq), ]

ggplot(data = age_problem_drinker_perc_clean, 
       aes(x = Var1, y = Freq, fill = Var2)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Distribution of Problem Drinkers by Age Group", 
       x = "Age Group", 
       y = "Percentage") +
  scale_fill_manual(values = c("pink", "purple"), labels = c("No", "Yes")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = guide_legend(title = "Problem drinking"))

#Check distribution of age (continuous) and test for differences between groups 
hist(WG$age)

shapiro.test(WG$age)

## 
##  Shapiro-Wilk normality test
## 
## data:  WG$age
## W = 0.97099, p-value = 2.832e-13

wilcox.test(age ~ problem_drinker, data = WG)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  age by problem_drinker
## W = 28597, p-value = 1.129e-07
## alternative hypothesis: true location shift is not equal to 0

We calculated percentages per age category, since we argued it would be most informative to see whether the likelihood of being a problem drinker changes across different age groups, instead of in the total population. The histogram of age (continous) demonstrated a non-normal distribution, which is why we performed a Mann-Whitney U test for assessing statistical differences between age groups. The Mann-Whitney U test demonstrated a P-value of 1.129e-07 which is <0.05. Therefore we can reject the null hypothesis that there are no significant differences in age between problem drinkers and non-problem drinkers.

3d

Perform a multivariable test to determine the effects of age (continuous), gender and physical activity on problematic drinking. Think well about the type of regression. Explain what you found.

model <- glm(problem_drinker ~ age + gender_f + total_pa, 
             data = WG, 
             family = binomial)
             
summary(model)

## 
## Call:
## glm(formula = problem_drinker ~ age + gender_f + total_pa, family = binomial, 
##     data = WG)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                0.309815   0.403001   0.769   0.4420    
## age                       -0.043690   0.008421  -5.188 2.12e-07 ***
## gender_fFemale            -1.615135   0.343013  -4.709 2.49e-06 ***
## total_paInactive          -0.331714   0.420783  -0.788   0.4305    
## total_paModerately active -0.665885   0.320321  -2.079   0.0376 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 408.79  on 752  degrees of freedom
## Residual deviance: 349.60  on 748  degrees of freedom
##   (251 observations deleted due to missingness)
## AIC: 359.6
## 
## Number of Fisher Scoring iterations: 6

We performed a logistic regression since problematic drinking, the dependent variable, is binary.The regression analysis shows that age (p=2.12e-07 <0.0001), being female (p=2.49e-06 <0.0001), and moderate physical activity (p=0.0376) have a significant negative association with problematic drinking in our dataset.So: people with higher age, females and people with moderate physical activity have lower odds of problematic drinking compared to others.

3e

Now regress total daily alcohol use on age, gender, and physical activity (multivariable model).

model2 <- lm(average_quant_day ~ age + gender_f + total_pa, data = WG)
summary(model2)

## 
## Call:
## lm(formula = average_quant_day ~ age + gender_f + total_pa, data = WG)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.393 -3.916 -1.950  2.016 58.100 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                7.84630    0.99145   7.914 9.00e-15 ***
## age                        0.01983    0.01627   1.219   0.2232    
## gender_fFemale            -5.00868    0.60890  -8.226 8.59e-16 ***
## total_paInactive          -1.66816    0.88780  -1.879   0.0606 .  
## total_paModerately active -0.75911    0.66209  -1.147   0.2519    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.233 on 747 degrees of freedom
##   (252 observations deleted due to missingness)
## Multiple R-squared:  0.08551,    Adjusted R-squared:  0.08062 
## F-statistic: 17.46 on 4 and 747 DF,  p-value: 1.041e-13

Since the outcome (average daily alcohol use) is a continuous variable, we performed a linear regression.

Explain what you found. Are there any transformation needed?

The linear regression analysis shows that gender (female), inactive PA and moderatie PA are all negatively associated with average daily alcohol use. However, only for female gender this association is significant (p= 8.59e-16). So: gender shows that females consume significantly less alcohol than males. The model explains 8.6% of the variation in daily alcohol consumption.The residuals suggest that the model may not fully capture the variation in alcohol consumption, as the R-squared is low (8.6%), suggesting that the residuals are large compared to the explained variance. A log transformation of the dependent variable could improve the model fit by addressing non-normality in the residuals.

3f

Check the assumptions of the regression analysis in the previous exercise. Hint: plot(model). Explain what you see in statistical terms. Adapt your analysis and repeat 3e if the assumptions are not met.

plot(model2)

WG$log_average_quant_day <- log(1 + WG$average_quant_day)
model3 <- lm(log_average_quant_day ~ age + gender_f + total_pa, data = WG)
summary(model3)

## 
## Call:
## lm(formula = log_average_quant_day ~ age + gender_f + total_pa, 
##     data = WG)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86367 -0.66432  0.00601  0.76877  2.64545 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.632597   0.112629  14.495  < 2e-16 ***
## age                        0.002962   0.001848   1.603  0.10936    
## gender_fFemale            -0.670894   0.069171  -9.699  < 2e-16 ***
## total_paInactive          -0.279912   0.100854  -2.775  0.00565 ** 
## total_paModerately active -0.045127   0.075213  -0.600  0.54870    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9353 on 747 degrees of freedom
##   (252 observations deleted due to missingness)
## Multiple R-squared:  0.1178, Adjusted R-squared:  0.113 
## F-statistic: 24.93 on 4 and 747 DF,  p-value: < 2.2e-16

We conclude that the assumptions of the linear regression are not met, as observed in the plots. For instance, the Residuals vs Fitted plot shows a pattern, which indicates a non-linear relationship between the dependent variable and the independent variables. In addition, in the Q-Q plot there is a clear deviation from the diagonal line at the end of the plot, which suggests non-normality in the residuals. Therefore, a transformation (such as a log transformation) would be indicated in this case.This can be performed using the command above.