Preparation

library(haven)
library(dplyr)
library(labelled)
library(lattice)
library(ggplot2)
#tinytex::install_tinytex()

Case study

This case study is an integrated set of exercises in R that build on the knowledge you gained during the first eight days of the course. The goal of the exercises is to determine whether physical activity is associated with alcohol use. To get to the answer to this question, you will first have to generate new variables from the data available and check the data before analysis.

Exercise 1

1a

Read in the entire dataset and select the data for your country

Dataset_casestudy <- read_sav("Dataset_casestudy1.sav") # Read .sav file
Eurobar_ES <- subset(Dataset_casestudy, v7 == "ES") # Select Spain

We are interested in the variables on alcohol use (alc: [v150-v154]) and on physical activity (pa: [v119-v122]). Furthermore, you will need information on age (v331), gender (v330), and country (v6). Names of the variables in relation to original Eurobarometer coding are given in table 1 of the exercises of Day 5.

1b

Make a smaller dataset for you country with only the variables mentioned above

ES <- subset(Eurobar_ES, select = c(v6, v330, v150:v154, v119:v122, v331)) # Select variables

1c

Rename the variables to the names mentioned in table 1 on Day 5

ES <- ES %>% # Rename using dplyr
  rename(
    country = v6,
    alc12m = v150,
    alcfreq_5dr = v151,
    alc_30da = v152,
    alcfreq_30d = v153,
    alc_am_dd = v154,
    pa7d_work = v119,
    pa7d_mov = v120,
    pa7d_house = v121,
    pa7d_recr = v122,
    gender = v330,
    age = v331
  )

1d

Describe the data, using R to help you find information on number of observations, number of variables, number of missing values for each variable.

str(ES) # See variables and sample size
missing_counts <- sapply(ES, function(x) sum(is.na(x))) # Function to compute missing counts
print(missing_counts)
##     country      gender      alc12m alcfreq_5dr    alc_30da alcfreq_30d 
##           0           0           0         337         337         409 
##   alc_am_dd   pa7d_work    pa7d_mov  pa7d_house   pa7d_recr         age 
##         410          21           0           2           1           0
Summarize the description here:

There are 1003 observations in the sample with 12 variables in total. See missing counts table above for number of missing values for each variable.

1e

Determine which of variables are actually categorical. Hint: determine the number of unique values per variable.

sapply(ES, function(x) length(unique(x))) # Function to see amount of unique values per variable.
##     country      gender      alc12m alcfreq_5dr    alc_30da alcfreq_30d 
##           1           2           2           6           3           7 
##   alc_am_dd   pa7d_work    pa7d_mov  pa7d_house   pa7d_recr         age 
##           8           5           4           5           5          79
str(ES) # Check type of variable.
Enter your answer here below:

Every variable except for age is categorical.

1f

Create a function that can convert the numerical variables that are categorical ones into factors and drop the levels that are non-informative (determine from the labels). Use the solution you found on day 5, where you already converted numerical variables into categorical ones. Function input arguments should be the variable and the levels to be dropped / set to missing. Inside the body of the function, use the variable to determine its levels and labels. Output should be the factor variable with the right labels and without non-informative levels.

var_names <- colnames(ES[,-12]) # Create vector of all variable names except for age.

# Loop through each variable name
for (x in var_names) {
  lv <- attr(ES[[x]], "labels") # Get labels of ES
  
  # Remove non-informative levels (9 and 99)
  lv <- lv[!lv %in% c(9, 99)] 
  
  ES[[x]] <- factor(ES[[x]], 
                    levels = lv, 
                    labels = names(lv))
}

# Loop through each variable name
for (x in var_names) {
  lv2 <- attr(ES[[x]], "levels") # Get levels of ES
  
  # Remove non-informative levels (DK/Refusal/It depends/Don't remember etc.)
  lv2 <- lv2[!lv2 %in% c("DK/Refusal", "DK/ Refusal", "DK", "Don't remember/Refusal (SPONT.)", "It depends (SPONT.)" )] 
  
  ES[[x]] <- factor(ES[[x]], 
                    levels = lv2) 
}

1g

Use the function created in 1f to convert the categorical variables determined in exercise 1e to factors and check if the function worked correctly. NB. Use different names for the new variables to avoid that the original one is lost! In the remainder we will however refer to the original names.

str(ES) # Check if transformation succesful
## tibble [1,003 × 12] (S3: tbl_df/tbl/data.frame)
##  $ country    : Factor w/ 33 levels "France","Belgium",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ gender     : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
##  $ alc12m     : Factor w/ 2 levels "Yes","No": 1 2 2 1 1 1 1 2 1 1 ...
##  $ alcfreq_5dr: Factor w/ 5 levels "Several times a week",..: 5 NA NA 5 5 5 5 NA 2 4 ...
##  $ alc_30da   : Factor w/ 2 levels "Yes","No": 2 NA NA 1 1 1 1 NA 1 1 ...
##  $ alcfreq_30d: Factor w/ 6 levels "Daily","4 - 5 times a week",..: NA NA NA 6 1 5 1 NA 1 4 ...
##  $ alc_am_dd  : Factor w/ 6 levels "Less than 1 drink",..: NA NA NA 2 2 1 1 NA 5 NA ...
##  $ pa7d_work  : Factor w/ 4 levels "A lot","Some",..: 3 4 4 4 1 4 4 4 1 2 ...
##  $ pa7d_mov   : Factor w/ 4 levels "A lot","Some",..: 2 3 4 3 3 3 3 4 3 1 ...
##  $ pa7d_house : Factor w/ 4 levels "A lot","Some",..: 2 3 4 3 2 3 3 2 2 1 ...
##  $ pa7d_recr  : Factor w/ 4 levels "A lot","Some",..: 4 4 4 1 1 4 4 4 4 1 ...
##  $ age        : num [1:1003] 45 86 83 76 45 86 84 33 52 42 ...
##   ..- attr(*, "label")= chr "D11 AGE EXACT"
##   ..- attr(*, "format.spss")= chr "F2.0"
##   ..- attr(*, "display_width")= int 6

Function worked correctly transforming all variables except age to factor and removing inappropriate labels.

1h

Calculate descriptive statistics of all variables. Think well on what kind of statistic that would be depending on the type of variable.

hist(ES$age) # Check distribution of ages

summary(ES$age) # Calculate descriptive stats
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   33.00   47.00   48.55   64.00   96.00
table1 <- colnames(ES[,-c(1, 12)]) # Create vector of all variable names except for age and country. 

# Loop through each variable name
for (x in table1) {
  cat("\nVariable:", x, "\n") # Get variable name
  
  table <- table(ES[[x]]) # Get the frequency table
  
 table2  <- prop.table(table) # Make prop table with frequency table
 print(table2)
  }
## 
## Variable: gender 
## 
##     Male   Female 
## 0.441675 0.558325 
## 
## Variable: alc12m 
## 
##      Yes       No 
## 0.665005 0.334995 
## 
## Variable: alcfreq_5dr 
## 
##   Several times a week            Once a week           Once a month 
##              0.1501502              0.1816817              0.1201201 
## Less than once a month                  Never 
##              0.2042042              0.3438438 
## 
## Variable: alc_30da 
## 
##       Yes        No 
## 0.8918919 0.1081081 
## 
## Variable: alcfreq_30d 
## 
##               Daily  4 - 5 times a week  2 - 3 times a week         Once a week 
##          0.23737374          0.09932660          0.21548822          0.26094276 
## 2 - 3 times a month                Once 
##          0.11616162          0.07070707 
## 
## Variable: alc_am_dd 
## 
## Less than 1 drink      1 - 2 drinks      3 - 4 drinks      5 - 6 drinks 
##       0.096774194       0.641765705       0.203735144       0.040747029 
##      7 - 9 drinks 10 drinks or more 
##       0.015280136       0.001697793 
## 
## Variable: pa7d_work 
## 
##     A lot      Some    Little      None 
## 0.1344196 0.1771894 0.1079430 0.5804481 
## 
## Variable: pa7d_mov 
## 
##     A lot      Some    Little      None 
## 0.1744766 0.4376869 0.2771685 0.1106680 
## 
## Variable: pa7d_house 
## 
##     A lot      Some    Little      None 
## 0.1708292 0.3826174 0.2747253 0.1718282 
## 
## Variable: pa7d_recr 
## 
##     A lot      Some    Little      None 
## 0.1586826 0.2804391 0.1816367 0.3792415

For age, which is a continuous variable, we choose a histogram to check the distribution and we calculate the median and the IQR.

For the categorical variables we calculate proportional tables for each one of them.

1g

Make a bar chart of pa7d_work to visualize this variable. Use both hist() and barplot() as a command to accomplish this and describe which plot is preferred.
N.B. using barplot(pa7d_work) won’t properly display the variable. Check what input the function needs.

ggplot(ES, aes(x = pa7d_work)) + # Bar plot
  geom_bar()

Which one do you prefer?

Bar chart works for categorical variables. Histogram is only for continuous, which pa7d_work is not and therefore we were not able to create a histogram. Thus, we prefer the bar chart.

Exercise 2

Since we are interested in problematic alcohol use, the data on alcohol use will have to be processed to produce a new binary variable “problematic alcohol use”. Alcohol use may be defined as problematic when it is chronically too much (more than 40 g on average per day for male or 20 g per day for female persons) or when too much is consumed on a single occasion, i.e. more than 5 glasses (binge drinking). Questions on both issues are in the dataset. Define someone as a problematic drinker if he/she either chronically drinks too much and/or does binge drinking. You will now produce a new variable “problematic drinker” in a couple of steps. Do this for your own small dataset with only the records from your chosen country.

PROBLEMATIC ALCOHOL USE

2a

Define a new numeric variable “number of drinking days/month” (e.g. called n_drinksmonth) and use the information from alc_30da and freq_30d to fill this variable with numeric values on the number of days a persons was drinking alcohol per month. You may need to make some pragmatic assumptions here. E.g. concerning the number of days in category “Daily” should be converted to 30.

State your assumptions in a comment. There are many approaches exist to do this conversion, e.g. like in exercise 24 of Day 1, using the ifelse() function, or using the revalue() or mapvalues() functions from plyr package.

NB. There are multiple solutions to do this conversion. If you want to use conditional statements, remember that if() does not work on a vector, while ifelse() does. Alternatively use mapply() for loop or another solution.

levels(ES$alc_30da) # Drinking yes or no
## [1] "Yes" "No"
levels(ES$alcfreq_30d) # Frequency of drinking per 30 days
## [1] "Daily"               "4 - 5 times a week"  "2 - 3 times a week" 
## [4] "Once a week"         "2 - 3 times a month" "Once"
ES <- ES %>% 
  mutate(n_drinksmonth = # Use dplyr to create new variable n_drinksmonth.
          ifelse(alc_30da == "No", 0, # Differentiate between drinkers and non-drinkers
          ifelse(alcfreq_30d == "Daily", 30, # Specify numerical values for how much someone drinks
          ifelse(alcfreq_30d == "4 - 5 times a week", 20,
          ifelse(alcfreq_30d == "2 - 3 times a week", 12,
          ifelse(alcfreq_30d == "Once a week", 4,
          ifelse(alcfreq_30d == "2 - 3 times a month", 3,
          ifelse(alcfreq_30d == "Once", 1, 0)))))))) 

2b

Define a new variable “quantity of alcohol (in grams) consumed on a drinking day” (e.g. called quantday) and use the information from alc_30da and alc_am_dd to do so. Use the following assumption: 1 drink/consumption contains 12 g alcohol.

levels(ES$alc_am_dd) # Check levels that we need to quantify
## [1] "Less than 1 drink" "1 - 2 drinks"      "3 - 4 drinks"     
## [4] "5 - 6 drinks"      "7 - 9 drinks"      "10 drinks or more"
ES <- ES %>% 
  mutate(quantday = # Use dplyr to create new variable quantday.
          ifelse(alc_30da == "No", 0, # Differentiate between drinkers and non-drinkers
          ifelse(alc_am_dd == "Less than 1 drink", 6, # Specify numerical values for how much someone drinks in grams
          ifelse(alc_am_dd == "1 - 2 drinks", 18,
          ifelse(alc_am_dd == "3 - 4 drinks", 36,
          ifelse(alc_am_dd == "5 - 6 drinks", 54,
          ifelse(alc_am_dd == "7 - 9 drinks", 72,
          ifelse(alc_am_dd == "10 drinks or more", 120, 0))))))))

2c

Now calculate a third new variable “average quantity of alcohol (in grams) per day over the last month”, multiplying the two variables above and dividing the result by 30.

ES <- ES %>% 
  mutate(avgquant = (quantday * n_drinksmonth) / 30) # Calculate new variable

2d

Make a histogram or barplot of daily intake created in exercise 2c.

hist(ES$avgquant) # Make histogram

Do the results make sense?

Yes.

Is this a continious variable? Explain why (not)

Yes because it can take on any value within a range.

2e

Do the same as in exercise 2d, but now for men and women separately.
Hint: think of the trellis plots from the lattice package.

histogram(~ avgquant | gender , data = ES) # Make charts for males and females.

2f

Compute descriptive statistics for the daily (average) alcohol use in grams.

descriptives_quantday <- summary(ES$quantday) # Make descriptives
print(descriptives_quantday)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   18.00   18.00   20.47   18.00  120.00     342

2g

Create a variable “heavy drinker” for men and women, using the cut-off values mentioned above (40 and 20 g daily, respectively).

levels(ES$gender) # Check levels
## [1] "Male"   "Female"
ES <- ES %>% # use dplyr to create new variable heavydrinker
  mutate(heavydrinker = ifelse(gender == "Male" & avgquant > 40, "Heavy drinker", # Specificy when someone is heavy drinker, differentiating between genders.
                        ifelse(gender == "Female" & avgquant > 20, "Heavy drinker", "Not")))

Now consider binge drinking in addition to drinking too much

2h

Define a new variable “binge” that gets value 1 if someone is a binge drinker (i.e. >=5 glasses on a single occasion) and value 0 else. This has the same definition for men and women.

ES <- ES %>% # Create new variable with dplyr
  mutate(binge = 
           ifelse(alc_am_dd %in% c("5 - 6 drinks", "7 - 9 drinks", "10 drinks or more"), 1, 0)) # Specify values when someone is binge drinker

2i

Use these two variables (heavy drinker and binge) to fill a new binary variable “problem drinker” using logistic operators (|, &) in the appropriate way.

ES <- ES %>% # Create new variable with dplyr
  mutate(problemdrinker = 
           ifelse(heavydrinker == "Heavy drinker" | binge == 1, 1, 0))
# Specify values when someone is a problem drinker

2j

Check the contents of this new variable

table(ES$problemdrinker)
## 
##   0   1 
## 617  44
Does it make sense to you?

Yes.

PHYSICAL ACTIVITY

2k

Define a new variable “total physical activity”, adding relevant information from all “pa variables” on physical activity in various settings. Clearly explain your definitions and assumptions. Categorize people as “inactive” or “moderately active” or “active”.

levels(ES$pa7d_work) # Check levels
## [1] "A lot"  "Some"   "Little" "None"
levels(ES$pa7d_mov)
## [1] "A lot"  "Some"   "Little" "None"
levels(ES$pa7d_house)
## [1] "A lot"  "Some"   "Little" "None"
levels(ES$pa7d_recr)
## [1] "A lot"  "Some"   "Little" "None"
pa <- colnames(ES[, c(8:11)])

ES <- ES %>%
  mutate(across(all_of(pa), # Recode into score
                ~ recode(., 
                         "A Lot" = 3, # Linear scoring system up to 3
                         "Some" = 2, 
                         "Little" = 1, 
                         "None" = 0), # 0 because there is no activity
                .names = "num_{.col}")) %>%  # Creates new variable with prefix "num_"
  mutate(total_pa = num_pa7d_work + num_pa7d_mov + num_pa7d_house + num_pa7d_recr) %>%
  mutate(total_pa = ifelse(total_pa < 4, "Inactive", # Assign categories of physical activity
                           ifelse(total_pa < 7, "Moderately Active", "Active"))) %>% 
  mutate(total_pa = as.factor(total_pa))

# Check if it went well
summary(ES$total_pa)
##            Active          Inactive Moderately Active              NA's 
##                88               248               257               410

We assign a score to the total physical activity based on the individual physical activity variables. This ranges from 0 to 3, indicating “none” up to “a lot” of activity. All 4 individual components add up to the final score, where under 4 is considered “Inactive”, from 4 to 6 is “Moderately Active”, and 7 and above is ” Active”.

Exercise 3 Analyzing the data

Now these new variables will be used in a regression analysis to analyze the effects of age, gender, and physical activity on daily alcohol use and the assumptions of this analysis will be tested. Please remember that we are using categorical variables here and therefore you have check how your data have been stored in R, as numeric values or factors, e.g. by using summary() or table(). In case of the first, the variable will be considered numeric. Sometimes it matters how your data are stored and will it be worth trying the alternative if you run into trouble or your results don’t make sense.

3a

Create a table of gender versus problem drinkers (yes/no). Apply a test to check for differences in percentages of problematic drinking between males and females.

table_gpb <- table(ES$gender, ES$problemdrinker) # Make contingency table
prop_gpb <- prop.table(table_gpb, margin = 1) * 100 # Make prop.table for percentages

chisq.test(prop_gpb) # Perform chi-square on prop table
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  prop_gpb
## X-squared = 0.010108, df = 1, p-value = 0.9199
How do you interpret this analysis?

P-value is 0.9199. This means there is no difference in the percentages of problematic drinkers between males and females.

3b

Do the same for the variable on physical activity created in 2k versus problem drinkers, that is present and test for differences in problematic drinking in the three different categories of physical activity.

table(ES$total_pa, ES$problemdrinker)
##                    
##                       0   1
##   Active             66   1
##   Inactive          124   4
##   Moderately Active 171   6
# Some rows have less than 5, we use fisher exact test

fisher.test(ES$total_pa, ES$problemdrinker) # Fisher test
## 
##  Fisher's Exact Test for Count Data
## 
## data:  ES$total_pa and ES$problemdrinker
## p-value = 0.8482
## alternative hypothesis: two.sided

Fisher’s exact test was used to compare the difference between total physical activity and whether someone is a problem drinker or not. These are 2 categorical variables. We had to use this because some rows have less than 5.

P-value is 0.8482, which means that there is no difference in the total physical activity score between problem drinkers and non-problem drinkers.

3c

Make a graph of percentages of problem drinkers versus age (in 10-year age categories). Test whether the distribution of age (continuous) is different between problem drinkers and non-problem drinkers. Check the summary statistics between the two groups.

ES <- ES %>%
  mutate(agegroup10 = cut(age, seq(10, 100, 10))) # Make age categories

problemdrinker_percent <- ES %>%
  group_by(agegroup10) %>%
  summarize(
    total = n(),  # Total number of people in each age group
    problem_drinkers = sum(problemdrinker == 1, na.rm = TRUE),  # Number of problem drinkers
    percentage = (problem_drinkers / total) * 100  # Percentage of problem drinkers
  )

ggplot(problemdrinker_percent, aes(x = agegroup10, y = percentage)) +
  geom_col() +
  theme_classic() # Make plot to check distribution

# Normality tests
shapiro.test(ES$age) # Not significant
## 
##  Shapiro-Wilk normality test
## 
## data:  ES$age
## W = 0.9729, p-value = 9.821e-13
qqnorm(ES$age)
qqline(ES$age) # Deviates from line

wilcox.test(ES$age, ES$problemdrinker) # No normality so we use alternative test
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  ES$age and ES$problemdrinker
## W = 662983, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Checking the bar graph of percentages of problem drinkers across age groups, we see that the percentages of problem drinkers are concentrated in the younger ages (<40). Performing normality tests show that the age variable is not normally distributed. This means we need to use a non-parametric alternative. In this case, the Wilcoxon rank-sum test. Outcome: P = < 0.001. This means there is a difference in terms of age between the amount of problem drinkers.

3d

Perform a multivariable test to determine the effects of age (continuous), gender and physical activity on problematic drinking. Think well about the type of regression. Explain what you found.

m1 <- glm(problemdrinker ~ age + gender + total_pa, data = ES)
summary(m1) # Fit logistic regression model
## 
## Call:
## glm(formula = problemdrinker ~ age + gender + total_pa, data = ES)
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                0.0773309  0.0317278   2.437   0.0153 *
## age                       -0.0013105  0.0005087  -2.576   0.0104 *
## genderFemale              -0.0121391  0.0179172  -0.678   0.4985  
## total_paInactive           0.0235012  0.0258104   0.911   0.3631  
## total_paModerately Active  0.0262364  0.0243740   1.076   0.2825  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.02850198)
## 
##     Null deviance: 10.675  on 371  degrees of freedom
## Residual deviance: 10.460  on 367  degrees of freedom
##   (631 observations deleted due to missingness)
## AIC: -260.84
## 
## Number of Fisher Scoring iterations: 2
exp(-0.0013105) # Calculate odds-ratio
## [1] 0.9986904
Explain what you found

The problem drinker variable is binary, so we use multivariate logistic regression. With this, we test the associations between problem drinkers and three predictor variables: age, gender, and total physical activity.

The results show that age is associated with being a problem drinker (P = 0.0104), while gender and physical activity are not (P = > 0.05).

Calculating the odds ration using the coefficient for age, we determine for each one-year increase in age, the odds of being a problem drinker decrease by approximately 0.13%.

3e

Now regress total daily alcohol use on age, gender, and physical activity (multivariable model).

m2 <- lm(avgquant ~ age + gender + total_pa, data = ES) # Fit linear regression
summary(m2)
## 
## Call:
## lm(formula = avgquant ~ age + gender + total_pa, data = ES)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.964  -4.565  -2.502   3.832 108.314 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               10.407652   1.861504   5.591 4.41e-08 ***
## age                       -0.007109   0.029847  -0.238    0.812    
## genderFemale              -6.735645   1.051219  -6.407 4.54e-10 ***
## total_paInactive           1.691109   1.514324   1.117    0.265    
## total_paModerately Active  0.849405   1.430047   0.594    0.553    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.905 on 367 degrees of freedom
##   (631 observations deleted due to missingness)
## Multiple R-squared:   0.11,  Adjusted R-squared:  0.1003 
## F-statistic: 11.34 on 4 and 367 DF,  p-value: 1.086e-08
# Check normality of residuals
shapiro.test(resid(m2)) # Not significant
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(m2)
## W = 0.72592, p-value < 2.2e-16
qqnorm(resid(m2))
qqline(resid(m2))

# Check homoscedasticity
plot(m2, 1)

Explain what you found. Are there any transformation needed?

avgquant is a continuous variable, so we use linear regression accordingly. We test associations between average quantity of alcohol and age, gender and total physical activity.

Only gender is significant (P = < 0.001).

However, when we check the assumptions of linear regression, most importantly normality of residuals and homoscedasticity, it reveals that both are violated. The result is thus not applicable and data transformation is necessary.

3f

Check the assumptions of the regression analysis in the previous exercise. Hint: plot(model). Explain what you see in statistical terms. Adapt your analysis and repeat 3e if the assumptions are not met.

ES$log_avgquant <- log(1+ES$avgquant) # Transform to log

m3 <- lm(log_avgquant ~ age + gender + total_pa, data = ES) # Fit new model
summary(m3)
## 
## Call:
## lm(formula = log_avgquant ~ age + gender + total_pa, data = ES)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.24605 -0.83082  0.00653  0.85037  2.79210 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.234083   0.182276  12.257   <2e-16 ***
## age                       -0.003791   0.002923  -1.297    0.195    
## genderFemale              -0.908890   0.102934  -8.830   <2e-16 ***
## total_paInactive          -0.010502   0.148280  -0.071    0.944    
## total_paModerately Active  0.068837   0.140028   0.492    0.623    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9699 on 367 degrees of freedom
##   (631 observations deleted due to missingness)
## Multiple R-squared:  0.1773, Adjusted R-squared:  0.1683 
## F-statistic: 19.77 on 4 and 367 DF,  p-value: 9.422e-15
# Test normality of residuals
shapiro.test(resid(m3))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(m3)
## W = 0.9816, p-value = 0.0001105
qqnorm(resid(m3))
qqline(resid(m3))

# Check homoscedasticity
plot(m3, 1)

We checked the assumptions and they are not met. Thus, we make a transformation of the outcome variable using log transformation.

Log transformation worked, but residuals are not normally distributed and homoscedasticity is violated. This means that assumptions are violated and that the result of this regression is not valid.

We conclude that the data is not suitable for this type of analysis, and that we should consider a non-linear regression, for example GAM models.