Linear Regression Analysis

Authors

Siti Nurhafizah binti Saharudin (23202464),

Muamar Iskandar bin Mohamaed Yusoff (23202846),

Nur Fazlina binti Dzulkifli (23202967),

Yogeswary Mithra A/P Nandi Mithra (23202677)

1 PART A: Introduction and Data Generation

1.1 Dataset / Simulation

Quality of life is shaped by a combination of occupational, behavioural, and health-related factors. This analysis applies linear regression techniques using a simulated dataset to reflect plausible relationships between working experience, physical activity, obesity status, and quality of life. Simulation was used to allow direct control over the data-generating process, enabling clear examination of model assumptions and interaction effects in a reproducible manner.

The simulated dataset comprising 200 observations was generated using R. Continuous predictors were simulated within realistic ranges, with years of working experience generated between 0 and 40 years and physical activity scores ranging from 0 to 10. Obesity status was simulated as a binary categorical variable to represent obese and non-obese individuals.

Quality of life was generated as a continuous outcome variable based on a predefined linear model that incorporated main effects of years of working experience, physical activity, and obesity status, as well as an interaction between physical activity and obesity. This interaction was specified to reflect effect modification, whereby physical activity was assumed to improve QOL for both groups but with a stronger positive effect among non-obese individuals. Random error was added to introduce natural variability, and QOL values were constrained to a plausible range between 0 and 100.

The data-generating process was guided by established public health and occupational health concepts, in which increased physical activity and longer working experience are generally associated with improved quality of life, while obesity is associated with lower baseline QOL. Parameter values were selected to reflect realistic effect sizes and to ensure sufficient variability and overlap between groups, allowing meaningful estimation and interpretation of regression coefficients. The simulated data were designed to balance signal and variability, allowing regression effects and diagnostics to be examined under realistic, non-ideal conditions. A fixed random seed was applied to ensure reproducibility of the simulated dataset across repeated renders of the analysis.

# setting seed to ensure reproducible random data
set.seed(123)

# number of observations
n <- 200

# continuous covariates
## uniform random between 0 and 40
years_working <- runif(n, 0, 40)

## uniform random between 0 and 10
phys_activity <- runif(n, 0, 10) 

# categorical covariate
obesity <- sample(c("obese","not obese"), n, replace=TRUE)
## Numeric code for obesity (0/1)
obese_num <- ifelse(obesity=="obese", 1, 0)

# define how qol depends on predictors, include main effects and an interaction between physical activity and obesity.
## Coefficients for the linear model
### intercept (baseline QOL)
beta0 <- 50 

### effect per year_working
beta_years <- 0.6  

### effect per unit phys_activity
beta_phys <- 3.0  

### effect if obese
beta_obesity <- -8 

### extra effect of phys_activity when obese
beta_int <- -1.5 
 
### random noise sd
sd_noise <- 8       

# Compute the “true” mean QOL (mu) including interaction: 
##beta_int * phys_activity * obese_num adds extra slope for obese group.
mu <- beta0 + beta_years*years_working + beta_phys*phys_activity +
      beta_obesity*obese_num + beta_int*phys_activity*obese_num

# Generate QOL with random normal noise
qol <- rnorm(n, mean=mu, sd=sd_noise)

# Assemble into a data frame
df <- data.frame(qol, years_working, phys_activity, obesity)
head(df,3)

       qol years_working phys_activity obesity
1 46.76081      11.50310      2.387260   obese
2 69.33320      31.53221      9.623589   obese
3 53.32762      16.35908      6.013657   obese

# Glimpse the generated dataset
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

glimpse(df)

Rows: 200
Columns: 4
$ qol           <dbl> 46.76081, 69.33320, 53.32762, 62.49776, 67.11254, 58.946…
$ years_working <dbl> 11.503101, 31.532205, 16.359077, 35.320696, 37.618691, 1…
$ phys_activity <dbl> 2.3872603, 9.6235894, 6.0136573, 5.1502973, 4.0257334, 8…
$ obesity       <chr> "obese", "obese", "obese", "obese", "obese", "obese", "n…

1.2 Analysis Overview

Following data generation, the analysis focused on examining the relationships between working experience, physical activity, obesity status, and quality of life using linear regression techniques.

Research questions The primary research questions addressed here were: 1. Is working experience associated with quality of life? 2. Is physical activity associated with quality of life? 3. Is obesity status independently associated with quality of life? 4. Does obesity status modify the association between physical activity and quality of life?

Objectives The objectives of the analysis were: 1. To examine the association between working experience and quality of life. 2. To examine the association between physical activity and quality of life. 3. To assess the independent association between obesity status and quality of life. 4. To evaluate whether obesity status modifies the association between physical activity and quality of life.

To address these objectives, exploratry data analysis and linear regression models with and without interaction terms were fitted and compared.

1.3 Prepare Environment and Read Data

1.3.1 Load Required Packages

# Data manipulation and workflow
library(tidyverse)
library(dplyr)

# Data reporting & tables
library(gtsummary)
library(knitr)

# Data visualisation
library(ggplot2)
library(corrplot)

# Modelling and diagnostics
library(MASS)

# Model ouputs
library(broom)
library(broom.helpers)

1.3.2 Glimpse Dataset

glimpse(df)

Rows: 200
Columns: 4
$ qol           <dbl> 46.76081, 69.33320, 53.32762, 62.49776, 67.11254, 58.946…
$ years_working <dbl> 11.503101, 31.532205, 16.359077, 35.320696, 37.618691, 1…
$ phys_activity <dbl> 2.3872603, 9.6235894, 6.0136573, 5.1502973, 4.0257334, 8…
$ obesity       <chr> "obese", "obese", "obese", "obese", "obese", "obese", "n…

1.3.3 Transform Data

Convert obesity to factors with ordered levels

df <- df |> 
  mutate(obesity = factor(obesity, level = c("not obese", "obese")))

glimpse(df)

Rows: 200
Columns: 4
$ qol           <dbl> 46.76081, 69.33320, 53.32762, 62.49776, 67.11254, 58.946…
$ years_working <dbl> 11.503101, 31.532205, 16.359077, 35.320696, 37.618691, 1…
$ phys_activity <dbl> 2.3872603, 9.6235894, 6.0136573, 5.1502973, 4.0257334, 8…
$ obesity       <fct> obese, obese, obese, obese, obese, obese, not obese, obe…

The structure of the dataset was examined to confirm variable types and ensure suitability for subsequent analysis.

2 PART B: Exploratory Data Analysis

2.1 Summary Statistics

2.1.1 Overall Descriptive Summary

# summary data
summary(df)

      qol         years_working      phys_activity          obesity   
 Min.   : 39.07   Min.   : 0.02499   Min.   :0.06301   not obese:107  
 1st Qu.: 59.36   1st Qu.:10.88472   1st Qu.:2.35623   obese    : 93  
 Median : 69.43   Median :19.28384   Median :4.69024                  
 Mean   : 69.82   Mean   :20.25568   Mean   :4.89122                  
 3rd Qu.: 79.30   3rd Qu.:29.33483   3rd Qu.:7.41910                  
 Max.   :105.60   Max.   :39.77079   Max.   :9.99405

tbl_summary(
  df,
  label = list(qol ~ 'Quality of Life Score',
               years_working ~ 'Working Experience (years)',
               phys_activity ~ 'Physical Activity Score',
               obesity ~ 'Obesity Status'),
  statistic = list(
    all_continuous() ~ "{mean}, ({sd})",
    all_categorical() ~ "{n} ({p}%)"
  )
)

Table 1: Overall Descriptive Statistics Summary
Characteristic	N = 200¹
Quality of Life Score	70, (14)
Working Experience (years)	20, (11)
Physical Activity Score	4.89, (2.93)
Obesity Status
not obese	107 (54%)
obese	93 (47%)
¹ Mean, (SD); n (%)

2.1.2 Descriptive Summary by Obesity Status

tbl_summary(
  df,
  by = obesity,
  label = list(qol ~ 'Quality of Life Score',
               years_working ~ 'Working Experience (years)',
               phys_activity ~ 'Physical Activity Score'),
  statistic = list(
    all_continuous() ~ "{mean} ({sd})",
    all_categorical() ~ "{n} ({p}%)")
) |>
  modify_header (label = "**Variable**",
                 stat_1 ~ "**Not obese**<br><span style='font-weight:normal;'>n = {n}</span>",
                 stat_2 ~ "**Obese**<br><span style='font-weight:normal;'>n = {n}</span>") |>
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Obesity Status**")

Table 2: Descriptive Statistics Summary by Obesity Status
Variable	Obesity Status
Variable	Not obese n = 107¹	Obese n = 93¹
Quality of Life Score	77 (12)	62 (11)
Working Experience (years)	20 (11)	20 (11)
Physical Activity Score	4.70 (2.79)	5.11 (3.09)
¹ Mean (SD)

2.2 Visualisation of Dataset Distribution

2.2.1 Continuous Variables Distribution

2.2.1.1 Histogram

Histograms were used in EDA to assess the distribution and identify skewness and outliers in continuous variables.

Histogram for Quality of Life Score

ggplot(df, aes(qol)) +
  geom_histogram() +
  labs(
    x = "Quality of Life Score",
    y = "Frequency",
    title = "Distribution of Quality of Life Score")

Figure 1: Distribution of Quality of Life Score

Interpretation: The distribution of QOL scores is approximately unimodal with mild right skewness. Most values clustered in the moderate to high range and no obvious extreme outliers.

Histogram for Working Experience in years

ggplot(df, aes(years_working)) +
  geom_histogram() +
  labs(
    x = "Working Experience in years",
    y = "Frequency",
    title = "Distribution of Working Experience in years")

Figure 2: Distribution of Working Experience in years

Interpretation: Years of working experience is distributed across the full range from 0 to 40 years, with no strong skewness or extreme outliers, indicating a plausible and well-distributed continuous predictor.

Histogram for Physical Activity Score

ggplot(df, aes(phys_activity)) +
  geom_histogram() +
  labs(
    x = "Physical Activity Score",
    y = "Frequency",
    title = "Distribution of Physical Activity Score")

Figure 3: Distribution of Physical Activity Score

Interpretation: Physical activity scores are distributed across the full 0–10 range with no strong skewness or extreme outliers, indicating sufficient variability for regression analysis.

2.2.1.2 Box Plot

Box plots were used in EDA to assess the distribution of continuous variables across obesity status.

Box Plot for Quality of Life Score across Obesity Status

ggplot(df, aes(x = obesity, y = qol)) +
  geom_boxplot(fill = "pink", width = 0.2) +
  geom_hline(yintercept = 0, color = "red") +
  labs(
    x = "Obesity Status",
    y = "Quality of Life Score",
    title = "Distribution of Quality of Life Score by Obesity Status"
  ) +
  theme_minimal()

Figure 4: Box plot of Quality of Life Score across Obesity Status

Interpretation: The boxplot indicates that non-obese individuals tend to have higher median quality of life scores than obese individuals, although there is some overlap between the two distributions.

Box Plot for Working Experience in years across Obesity Status

ggplot(df, aes(x = obesity, y = years_working)) +
  geom_boxplot(fill = "lightgreen", width = 0.2) +
  geom_hline(yintercept = 0, color = "red") +
  labs(
    x = "Obesity Status",
    y = "Working Expeience in years",
    title = "Distribution of Working Experience in years by Obesity Status"
  ) +
  theme_minimal()

Figure 5: Box plot of Working Experience in years across Obesity Status

Interpretatio: The distribution of years of working experience is similar across obesity status, with comparable medians and substantial overlap between groups.

Box Plot for Physical Activity Score by Obesity Status

ggplot(df, aes(x = obesity, y = phys_activity)) +
  geom_boxplot(fill = "lightblue", width = 0.2) +
  geom_hline(yintercept = 0, color = "red") +
  labs(
    x = "Obesity Status",
    y = "Physical Activity Score",
    title = "Distribution of Physical Activity Score by Obesity Status"
  ) +
  theme_minimal()

Figure 6: Box plot of Physical Activty Score across Obesity Status

Interpretation: Physical activity scores show substantial overlap between obese and non-obese groups, with a slightly higher median observed among obese individuals.

2.2.1.3 Scatter Plot

Scatter plots were used in EDA to visually explore the relationships between continuous variables.

Scatter Plot between Working Experience in years and Quality of Life Score

ggplot(df, aes(x = years_working, y = qol)) +
  geom_point(color = "magenta") +
  geom_smooth(color = "red", fill = "lightpink") +
  labs(title = "Working Experience in years vs Quality of Life Score",
       x = "Working Experience in years",
       y = "Quality of Life Score") +
  theme_minimal()

Figure 7: Scatter plot between Working Expereience in years and Quality of Life Score

Interpretation: The scatter plot indicates a positive association between working experience and quality of life, with a gradual increase in QOL that appears to level off at higher years of experience.

Scatter Plot between Physical Activity Score and Quality of Life Score

ggplot(df, aes(x = phys_activity, y = qol)) +
  geom_point(color = "darkgreen") +
  geom_smooth(color = "green", fill = "lightgreen") +
  labs(title = "Physical Activity Score vs Quality of Life Score",
       x = "Physical Activity Score",
       y = "Quality of Life Score") +
  theme_minimal()

Figure 8: Scatter plot between Physical Activity Scre and Quality of Life Score

Interpretation: Physical activity score is positively associated with quality of life, with QOL increasing across the activity range and showing a modest plateau at higher scores.

Scatter Plot between Working Experience in years and Physical Activity Score

ggplot(df, aes(x = years_working, y = phys_activity)) +
  geom_point(color = "darkblue") +
  geom_smooth(color = "blue", fill = "lightblue") +
  labs(title = " Working Experience in years vs Physical Activity Score",
       x = "Working Experience in years",
       y = "Physical Activity Score") +
  theme_minimal()

Figure 9: Scatter plot between Working Expereience in years and Physical Activity Score

Interpretation: There is no clear linear relationship between years of working experience and physical activity score, with substantial variability observed across the working-life span.

2.2.1.4 Bar Plot

Bar plots are used in EDA to summarise and compare frequencies or proportions of categorical variables.

Bar Plot for Obesity Status

ggplot(df, aes(x = obesity, fill = obesity)) +
  geom_bar(width = 0.5) +
  theme_minimal() +
  labs(title = "Distribution of Obesity Status",
       x = "Obesity Status",
       y = "Count") +
  scale_fill_brewer(palette = "Set3")

Interpretation: The dataset consisted of comparable numbers of obese and non-obese individuals, indicating a reasonably balanced distribution across obesity status.

2.3 Correlation Matrix for Continuous Variables

Correlation matrix was used to examine bivariate relationships and assess potential multicollinearity. This steps serves as an initial screening, as multicollinearity will be formally assessed using variance inflation factors during model diagnostics.

df2 <- df |> 
  dplyr::select(qol, years_working, phys_activity)

cor.data <- 
  cor(df2,
      use = "complete.obs", 
      method = "pearson")
head(round(cor.data,2))

               qol years_working phys_activity
qol           1.00          0.40          0.42
years_working 0.40          1.00         -0.05
phys_activity 0.42         -0.05          1.00

Here is the correlogram to visually summarise the strength and direction of associations among continuous variables, complementing the numeric correlation matrix.

cor_matrix <- cor(df |> select_if(is.numeric), use = "complete.obs")
corrplot(cor_matrix,
         type = "upper",
         method = "circle",
         addCoef.col = "black",
         tl.col = "black",
         tl.srt = 45)

Figure 11: Correlogram for continuous variables

Interpretation: QOL is moderately positively correlated with both working experience and physical activity, while working experience and physical activity show minimal correlation with each other, indicating low risk of multicollinearity.

2.4 Exploratory Data Analysis Summary

Exploratory analyses indicated that quality of life was generally higher among non-obese individuals and increased with both working experience and physical activity. Visualisations suggested approximately linear relationships between continuous predictors and quality of life, with no extreme departures from distributional assumptions. These findings supported the suitability of linear regression modelling and informed subsequent model specification.

3 PART C: Regression Analysis

3.1 Univariable Linear Regression Analysis

Univariable linear regression analyses were conducted to examine the association between each independent variable and Quality of Life Score. Working experience in years, physical activity scoe and obesity status were analysed individually.

3.1.1 Working Experience in years

slr.years <- lm(qol ~ years_working, data = df)

tidy(slr.years, conf.int = TRUE) |> 
  mutate(p.value = format.pval(p.value, digits = 2, eps = 0.001))

# A tibble: 2 × 7
  term          estimate std.error statistic p.value conf.low conf.high
  <chr>            <dbl>     <dbl>     <dbl> <chr>      <dbl>     <dbl>
1 (Intercept)     59.5      1.91       31.1  <0.001    55.7      63.3  
2 years_working    0.509    0.0831      6.12 <0.001     0.345     0.673

Interpretation: In the univariable linear regression model, years of working experience was significantly associated with quality of life. Each additional year of working experience was associated with an average increase of 0.51 points in quality of life (p-value <0.001, 95% CI: 0.34–0.67). The intercept represents the expected quality of life score for individuals with zero years of working experience.

3.1.2 Physical Activity Score

slr.pa <- lm(qol ~ phys_activity, data = df)

tidy(slr.pa, conf.int = TRUE) |> 
  mutate(p.value = format.pval(p.value, digits = 2, eps = 0.001))

# A tibble: 2 × 7
  term          estimate std.error statistic p.value conf.low conf.high
  <chr>            <dbl>     <dbl>     <dbl> <chr>      <dbl>     <dbl>
1 (Intercept)      60.0      1.75      34.3  <0.001     56.5      63.4 
2 phys_activity     2.02     0.307      6.58 <0.001      1.41      2.62

Interpretation: In the univariable linear regression analysis, physical activity score was significantly associated with quality of life. Each one-unit increase in physical activity score was associated with an average increase of 2.02 points in quality of life (p < 0.001, 95% CI: 1.41–2.62). The intercept represents the expected quality of life score for individuals with a physical activity score of zero.

3.1.3 Obesity Status

slr.obese <- lm(qol ~ obesity, data = df)

tidy(slr.obese, conf.int = TRUE) |> 
  mutate(p.value = format.pval(p.value, digits = 2, eps = 0.001))

# A tibble: 2 × 7
  term         estimate std.error statistic p.value conf.low conf.high
  <chr>           <dbl>     <dbl>     <dbl> <chr>      <dbl>     <dbl>
1 (Intercept)      76.5      1.16     66.0  <0.001      74.3      78.8
2 obesityobese    -14.5      1.70     -8.50 <0.001     -17.8     -11.1

Interpretation: In the univariable linear regression analysis, obesity status was significantly associated with quality of life. Compared to non-obese individuals, obese individuals had an average quality of life score that was 14.46 points lower (p < 0.001, 95% CI: −17.81 to −11.11). The intercept represents the mean quality of life score among non-obese individuals.

3.1.4 Univariable Summary Table

tbl_univ <- gtsummary::tbl_uvregression(
  df,
  method = lm,
  y = qol,
  include = c(years_working, phys_activity, obesity),
  label = list(years_working ~ "Working Experience in years",
               phys_activity ~ "Physical Activity Score",
               obesity ~ "Obesity Status")
  ) |>
  modify_fmt_fun(
    estimate ~ function(x) style_number(x, digits = 6),
    conf.low ~ function(x) style_number(x, digits = 6),
    conf.high ~ function(x) style_number(x, digits = 6)
  ) |>
  gtsummary::bold_p()

tbl_univ

Table 3: Summary Table of Univariable Regresion Model
Characteristic	N	Beta	95% CI	p-value
Working Experience in years	200	0.508721	0.344814, 0.672628	<0.001
Physical Activity Score	200	2.017734	1.412776, 2.622693	<0.001
Obesity Status	200
not obese		—	—
obese		-14.460833	-17.813958, -11.107709	<0.001
Abbreviation: CI = Confidence Interval

Interpretation: Univariable linear regression analyses demonstrated that working experience and physical activity were both positively associated with quality of life, while obesity status was negatively associated with quality of life. Each additional year of working experience was associated with a 0.51-point increase in quality of life, and each one-unit increase in physical activity score was associated with a 2.02-point increase in quality of life. In contrast, obese individuals had significantly lower quality of life scores compared with non-obese individuals, with an average difference of −14.46 points. All associations were statistically significant (p < 0.001), supporting inclusion of these variables in subsequent multivariable regression models.

3.2 Multivariable Linear Regression Analysis

3.2.1 Model 1: Model with main effects only

m1 <- lm(qol ~ years_working + phys_activity + obesity, data = df)
tidy(m1, conf.int = TRUE) |>
  mutate(p.value = format.pval(p.value, digits = 2, eps = 0.001))

# A tibble: 4 × 7
  term          estimate std.error statistic p.value conf.low conf.high
  <chr>            <dbl>     <dbl>     <dbl> <chr>      <dbl>     <dbl>
1 (Intercept)     54.8      1.67        32.7 <0.001    51.5      58.1  
2 years_working    0.538    0.0534      10.1 <0.001     0.433     0.643
3 phys_activity    2.31     0.200       11.5 <0.001     1.91      2.70 
4 obesityobese   -15.4      1.17       -13.1 <0.001   -17.7     -13.1

Interpretation: In the multivariable linear regression model including years of working experience, physical activity, and obesity status, all predictors remained independently associated with quality of life. Each additional year of working experience was associated with a 0.54-point increase in quality of life (p-value < 0.001, 95% CI: 0.43, 0.64), while each one-unit increase in physical activity score was associated with a 2.31-point increase in quality of life (p-value < 0.001, 95% CI: 1.91, 2.70). In contrast, obese individuals had significantly lower quality of life scores compared with non-obese individuals, with an adjusted mean difference of −15.36 points (p-value < 0.001, 95% CI: -17.67, -13.05). All predictors demonstrate independent associations with QOL, supporting their inclusion in the model.

3.2.2 Model 2: Model with inclusion of interation between physical activity score and obesity

m2 <- lm(qol ~ years_working + phys_activity + obesity + phys_activity * obesity, data = df)
tidy(m2, conf.int = TRUE) |>
  mutate(p.value = format.pval(p.value, digits = 2, eps = 0.001))

# A tibble: 5 × 7
  term                   estimate std.error statistic p.value conf.low conf.high
  <chr>                     <dbl>     <dbl>     <dbl> <chr>      <dbl>     <dbl>
1 (Intercept)              52.5      1.87       28.0  <0.001    48.8      56.1  
2 years_working             0.530    0.0527     10.1  <0.001     0.426     0.634
3 phys_activity             2.84     0.283      10.0  <0.001     2.28      3.39 
4 obesityobese            -10.3      2.25       -4.59 <0.001   -14.8      -5.89 
5 phys_activity:obesity…   -1.03     0.394      -2.60 0.0099    -1.80     -0.249

Interpretation: In the multivariable model including an interaction between physical activity and obesity status, all main effects remained statistically significant. Among non-obese individuals, each one-unit increase in physical activity score was associated with a 2.84-point increase in quality of life. However, this positive association was significantly weaker among obese individuals, as indicated by a negative interaction term (β = −1.03, p = 0.0099, 95% CI: -1.80, -0.25). This suggests that while physical activity is beneficial for quality of life in both groups, the magnitude of benefit is greater among non-obese individuals. This indicates that obesity status acts as an important effect modifier in the association between physical activity and quality of life.

3.2.3 Model Comparison

To evaluate whether inclusion of an interaction between physical activity and obesity status improved model performance, Model 1 (main effects only) and Model 2 (including the interaction term) were formally compared.

Model comparison was primarily conducted using ANOVA for nested linear models. Model fit was further evaluated using goodness-of-fit and model selection measures that account for model complexity, including coefficient of determination \((R^2)\), adjusted \(R^2\) and the Akaike Information Criterion (AIC).

3.2.3.1 Analysis of Variance (ANOVA) for nested linear models

mod_compare <- anova(m1, m2)

mod_compare_tidy <- tidy(mod_compare) |>
  mutate(Model = case_when(
           term == "qol ~ years_working + phys_activity + obesity" ~ "Model 1: Main effects",
           term == "qol ~ years_working + phys_activity + obesity + phys_activity * obesity" ~ "Model 2: Interaction model"),
         p.value = format.pval(p.value, digits = 2, eps = 0.001)
         ) |>
  dplyr::select(Model, df.residual, rss, df, sumsq, statistic, p.value)

mod_compare_tidy

# A tibble: 2 × 7
  Model                      df.residual    rss    df sumsq statistic p.value
  <chr>                            <dbl>  <dbl> <dbl> <dbl>     <dbl> <chr>  
1 Model 1: Main effects              196 13313.    NA   NA      NA    NA     
2 Model 2: Interaction model         195 12866.     1  447.      6.78 0.0099

Interpretation: ANOVA comparison of the nested linear models indicated that inclusion of the interaction between physical activity and obesity status significantly improved model fit. The interaction model demonstrated a significant reduction in residual sum of squares compared with the main-effects-only model (F = 6.78, p = 0.0099), supporting retention of Model 2 as the preferred model.

3.2.3.2 Goodness-of-fit and model selection measures

model_fit <- bind_rows(
  glance(m1) |> mutate(Model = "Model 1: Main effects"),
  glance(m2) |> mutate(Model = "Model 2: Interaction model")
  ) |>
  dplyr::select(Model, r.squared, adj.r.squared, AIC, sigma, nobs)

model_fit

# A tibble: 2 × 6
  Model                      r.squared adj.r.squared   AIC sigma  nobs
  <chr>                          <dbl>         <dbl> <dbl> <dbl> <int>
1 Model 1: Main effects          0.658         0.652 1417.  8.24   200
2 Model 2: Interaction model     0.669         0.662 1412.  8.12   200

Interpretation: Comparison of goodness-of-fit statistics indicates that Model 2, which includes the interaction between physical activity and obesity status, provides a better fit to the data than Model 1. Model 2 demonstrated higher \(𝑅^2\) (0.669 vs 0.658) and adjusted \(R^2\) (0.662 vs 0.652) indicating improved explanatory power after accounting for model complexity. In addition, Model 2 yielded a lower Akaike Information Criterion (AIC = 1412.38) compared to Model 1 (AIC = 1417.22), further supporting improved model performance. The residual standard deviation (sigma) was also slightly lower for Model 2, suggesting reduced unexplained variability. Taken together, these results support selection of the interaction model as the preferred final model.

1.2.4 Preliminary Final Model Based on the results of the nested model comparison and goodness-of-fit assessment, Model 2, which includes the interaction between physical activity and obesity status, demonstrated superior performance compared to the main-effects model. Accordingly, Model 2 was selected as the preliminary final model and was subsequently subjected to formal diagnostic evaluation to assess compliance with linear regression assumptions.

prelim_m <- tbl_regression(
    m2,
    intercept = TRUE,
    conf.level = 0.95,
    pvalue_fun = ~style_pvalue(.x, digits = 3, eps = 0.001),
    label = list(years_working ~ "Working Experience in years",
                 phys_activity ~ "Physical Activity Score",
                 obesity ~ "Obesity Status"),
    estimate_fun = ~style_number(.x, digits = 2)
    ) |>
    add_glance_table(
        include = c(r.squared, adj.r.squared, AIC, nobs))

prelim_m

Table 4: Preliminary Final Model with Fit Statistics
Characteristic	Beta	95% CI	p-value
(Intercept)	52.45	48.75, 56.15	<0.001
Working Experience in years	0.53	0.43, 0.63	<0.001
Physical Activity Score	2.84	2.28, 3.39	<0.001
Obesity Status
not obese	—	—
obese	-10.33	-14.77, -5.89	<0.001
Physical Activity Score * Obesity Status
Physical Activity Score * obese	-1.03	-1.80, -0.25	0.010
R²	0.669
Adjusted R²	0.662
AIC	1,412
No. Obs.	200
Abbreviation: CI = Confidence Interval

Interpretation: The preliminary final model indicated statistically significant associations between quality of life and years of working experience, physical activity, obesity status, and their interaction. However, formal interpretation of effect estimates was reserved until completion of model diagnostic checks.

4 PART D: Model Checking

The preliminary final model was evaluated to assess compliance with the assumptions of linear regression. Diagnostic procedures focused on linearity, normality of residuals, homoscedasticity, independence of errors, and the presence of influential observations.

4.1 Check Assumptions

4.1.1 Assumption 1: Linearity of the relationship between predictors and residuals

4.1.1.1 Overall linearity

Linearity was assessed using the residuals-versus-fitted values plot.

#Residuals vs Fitted plot
plot(m2, which = 1)

Interpretation: The residuals are randomly scattered around the zero line across the range of fitted values, with no clear systematic pattern or curvature. This suggests that the assumption of linearity is reasonably satisfied for the preliminary final model. A small number of observations exhibited larger residuals (labelled points: 60, 116, 156) however, no systematic structure was evident, and these points were further evaluated in subsequent influence diagnostics.

4.1.1.2 Linearity of each continuous predictors

Although the residuals-versus-fitted plot suggested that the overall linearity assumption was satisfied, residuals were also examined against each continuous predictor to ensure that their individual linear functional forms were appropriate.

library(gridExtra)

# Residuals vs working experience in years
p1 <- augment(m2) |>
  ggplot(aes(x = years_working, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red")

# Residuals vs physical activity score
p2 <- augment(m2) |>
  ggplot(aes(x = phys_activity, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red")

grid.arrange (p1,p2, ncol = 2)

Interpretation: Linearity with respect to the continuous predictors was further assessed by plotting residuals against years of working experience and physical activity score. In both plots, residuals are randomly scattered around the zero line with no clear systematic pattern or curvature. This indicates that the linear functional form for both predictors is appropriate and that no substantial non-linear relationships remain unaccounted for in the model.Therefore, the linearity assumption for both predictors was reasnably met.

4.2 Assumption 2: Normality of residuals

Normality of residuals was assessed visually using a normal Q–Q plot and histogram with supplementary evaluation using the Shapiro-Wilk test

par(mfrow = c(1, 2))

data.res <- augment(m2)

plot(m2, which = 2)

hist(data.res$.resid)

shapiro.test(m2$residuals)


    Shapiro-Wilk normality test

data:  m2$residuals
W = 0.99343, p-value = 0.5188

Interpretation: Normality of residuals was assessed using a normal Q–Q plot and a histogram of the residuals. The Q–Q plot shows that residuals closely follow the reference line across most of the distribution, with only minor deviations at the extreme tails. Consistently, the histogram of residuals appears approximately symmetric and bell-shaped. This was further supported by the Shapiro–Wilk test, which did not indicate evidence against normality (W = 0.993, p = 0.519). Together, these findings indicate that the assumption of normality of residuals is reasonably satisfied.

4.3 Assumption 3: Homoscedasticity (constant variance of residuals)

4.3.1 Visual diagnostics

Homoscedasticity was assessed by visual inspection of residuals versus fitted values and scale–location plots.

par(mfrow = c(1, 2))

plot(m2, which = c(1, 3))

Interpretation: In the residuals versus fitted plot, residuals are randomly scattered around zero with no clear systematic pattern, suggesting that the assumption of constant variance is reasonable. Similarly, the scale–location plot shows a relatively horizontal smoothing line with no strong funnel-shaped pattern, indicating that the spread of residuals is approximately constant across fitted values. Overall, there is no strong evidence of heteroscedasticity.

4.3.2 Breusch-Pagan test

Homoscedasticity was primarily assessed using graphical diagnostics, including residuals versus fitted values and scale–location plots. In addition, the Breusch–Pagan test was conducted as a formal statistical assessment to evaluate whether the variance of residuals was systematically associated with the predictors in the model.

library(lmtest)

Loading required package: zoo


Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

bptest(m2)


    studentized Breusch-Pagan test

data:  m2
BP = 2.6153, df = 4, p-value = 0.6241

Interpretation: The Breusch–Pagan test did not indicate evidence of heteroscedasticity (BP = 2.62, df = 4, p = 0.62), supporting the assumption of constant variance. This finding was consistent with visual inspection of residual and scale–location plots.

4.4 Assumption 4: Independence

The independence assumption in linear regression means residuals (errors) are independent of one another. Independence was primarily justified at the design stage and then checked again at the analysis/model‑checking stage whenever a method assumes independent observations.

4.5 Check multicollinearity

library(car)
vif(m2)

        years_working         phys_activity               obesity 
             1.005877              2.073625              3.819413 
phys_activity:obesity 
             5.134661

4.2 Check Influentials

We used three diagnostic criteria to identify potentially influential observations:

Cook’s Distance.

Threshold: values above 1, or above \[\frac{4}{n}\]

Standardized Residuals

Threshold: values greater than 2 or less than −2

Leverage Threshold: values above \[\frac{2k+2}{n}\]

Where: \(n\) = number of observations \(k\) = number of predictors in the model

These thresholds help flag data points that may disproportionately influence model estimates or distort inference.

4.2.1 Cook’s distance

4.2.1.1 Identify influential observations and cook’s cutoff point

# Add diagnostics columns
df.inf <- augment(m2, df)

# Define thresholds
n_obs <- nrow(df)
p_preds <- length(coef(m2))
cooks_cutoff <- 4 / n_obs
leverage_cutoff <- (2 * p_preds + 2) / n_obs

# Filter Influential Observations
influential_obs <- df.inf %>%
  filter(.cooksd > cooks_cutoff | abs(.std.resid) > 2 | .hat > leverage_cutoff) 

# Print count
cat("Number of influential observation:", nrow(influential_obs), "\n")

Number of influential observation: 16

cat("Cook's distance cutoff:", round(cooks_cutoff, 4), "\n")

Cook's distance cutoff: 0.02

We identify 16 potential influence observations with Cook’s distance greater than the cutoff value of 0.02 \[(\frac{4}{n})\] that may warrant further investigation.

4.2.1.2 Cook’s distance plot

A Cook’s distance plot was used to visually identify potentially influential observations and to evaluate whether influence was driven by isolated cases or broadly distributed across the dataset.

plot(m2, which = 4)

Interpretation: The Cook’s distance plot shows that while a few observations have moderately higher influence (e.g. 60, 116, 156), no single case exerts excessive leverage on the model (Threshold values > 1). However, the 3 points with highest Cook’s distance was further assessed to see whether it fulfills the other diagnostic criteria to be considered as influential observation.

4.2.2 Leverage and standardized residuals

Following the assessment of Cook’s distance, standardized residuals and leverage values were examined to better understand the source of potential influence. Standardized residuals were used to identify observations with unusually large deviations from the fitted values, while leverage values were assessed to detect observations with extreme predictor combinations.

Residuals vs leverage plot

plot(m2, which = 5)

Interpretation: Several observations exceeded conventional thresholds for standardized residuals and leverage, contributing to moderately elevated Cook’s distance values. However, no single observation exhibited extreme values across all influence diagnostics, suggesting that the fitted model is not unduly driven by a small number of influential cases.

df.inf <- df.inf |> 
  dplyr::mutate(obs_id = row_number())

top10_influential <- df.inf |>
  arrange(desc(.cooksd), desc(abs(.std.resid))) |>
  slice(1:10)

top10_influential |>
  dplyr::select(
    obs_id,
    qol,
    years_working,
    phys_activity,
    obesity,
    .cooksd,
    .hat,
    .std.resid
  )

# A tibble: 10 × 8
   obs_id   qol years_working phys_activity obesity   .cooksd   .hat .std.resid
    <int> <dbl>         <dbl>         <dbl> <fct>       <dbl>  <dbl>      <dbl>
 1    156  44.9         15.6          9.86  obese      0.0655 0.0367      -2.93
 2    116  56.4          5.69         8.03  not obese  0.0493 0.0318      -2.74
 3     60 106.          15.0          8.68  not obese  0.0406 0.0298       2.57
 4    127  79.9          6.17         9.99  obese      0.0393 0.0440       2.07
 5     71  92.5         30.2          8.71  obese      0.0347 0.0310       2.33
 6     92  51.4         26.1          1.01  not obese  0.0275 0.0273      -2.21
 7     34  74.1         31.8          0.585 obese      0.0247 0.0381       1.77
 8    139  75.0         39.2          6.44  not obese  0.0243 0.0280      -2.06
 9     59  50.8         35.8          4.72  obese      0.0235 0.0210      -2.34
10     10  71.8         18.3          1.72  obese      0.0222 0.0243       2.11

Interpretation: Influential observations were further examined by ranking cases based on Cook’s distance and standardized residuals. The top ten observations with the highest Cook’s distance values were identified to assess whether model estimates were disproportionately driven by individual cases, with observation identifiers were retained to ensure traceability. Although several observations exhibited moderately elevated influence measures, no single observation simultaneously exhibited extreme Cook’s distance, high leverage, and large standardized residuals. While several observations exceeded individual diagnostic thresholds, influence was not concentrated in any single case. Overall, these findings suggest that the fitted model is not unduly driven by a small number of influential observations and that the regression estimates are reasonably robust.

4.2.3 DFBETAS

DFBETAS were examined to assess whether any individual observation exerted disproportionate influence on specific regression coefficients.

# Compute DFBETAS and display as a datatable
library(DT)

dfb <- dfbetas(m2)
dfb_long <- dfb |> 
  as.data.frame() |> 
  mutate(id = row_number()) |> 
  pivot_longer(
    cols = -id,
    names_to = "term",
    values_to = "dfbeta"
  )

# Cutoff
dfb_threshold <- 2 / sqrt(nrow(df))

# Identify influential observations
influential_dfb <- dfb_long |> 
  filter(abs(dfbeta) > dfb_threshold) |>
  arrange(desc(abs(dfbeta)))
influential_dfb %>%
  datatable()

Interpretation: Several observations exceeded conventional screening thresholds, indicating moderate influence on individual parameters such as physical activity, working experience, and the interaction term. However, no observation consistently exerted large influence across multiple coefficients.

In conclusion, influential diagnostics were conducted using Cook’s distance, leverage values, standardized residuals, and DFBETAS to evaluate the impact of individual observations on the fitted regression model. While a small number of observations exhibited moderately elevated influence measures under individual diagnostic criteria, no single observation simultaneously demonstrated extreme Cook’s distance, high leverage, and large standardized residuals, nor did any case consistently exert excessive influence across multiple regression coefficients. Collectively, these findings indicate that the fitted model is not unduly driven by a small number of influential observations and that the regression estimates are stable and robust.

4.3 Model Checking Summary

Overall, diagnostic assessments indicated that the assumptions of linear regression were reasonably satisfied. Visual inspection of residual plots supported the assumptions of linearity and homoscedasticity, while normality of residuals was supported by Q–Q plots, histograms, and the Shapiro–Wilk test. Influential diagnostics, including Cook’s distance, leverage, standardized residuals, and DFBETAS, revealed no evidence that the fitted model was unduly driven by a small number of influential observations. Taken together, these findings indicate that the selected regression model is appropriate and provides a reliable basis for interpretation and inference.

Although no major violations of regression assumptions were identified, several standard remedial strategies were considered. In the presence of non-linearity, alternative functional forms or transformation of continuous predictors would have been explored. Evidence of heteroscedasticity would have warranted the use of robust standard errors or variance-stabilizing transformations, while substantial influence from individual observations would have prompted sensitivity analyses excluding influential cases. However, as diagnostic assessments did not indicate meaningful departures from model assumptions or excessive influence, no remedial measures were required, and all observations were retained in the final model.

5 PART E: Result Presentation and Interpretation

5.1 Final model

Results from the final multivariable linear regression model examining the association between working experience, physical activity, obesity status, and quality of life are presented below. Following model comparison and diagnostic evaluation, the model including an interaction between physical activity and obesity status was retained. Regression coefficients are reported with corresponding 95% confidence intervals and p-values.

library(gt)

final_m <- tbl_regression(
  m2,
  intercept = TRUE,
  label = list(years_working ~ "Working Experience in years",
               phys_activity ~ "Physical Activity Score",
               obesity ~ "Obesity Status")
  ) |>
    bold_p(t = 0.05) |>
    bold_labels() |>
    add_glance_table(include = c(r.squared, adj.r.squared, AIC, BIC)) |>
    modify_fmt_fun(
      estimate ~ function(x) style_number(x, digits = 2),
      conf.low ~ function(x) style_number(x, digits = 2),
      conf.high ~ function(x) style_number(x, digits = 2)) |>
    modify_header(label = "**Variables**",
                  estimate = "**Adj. B**",
                  statistic = "**t-stat**") |>
    modify_caption("**Final Multivariable Linear Regression Model for Quality of Life Score**")

final_m

Table : Final Multvariable Linear Regression Model fo Quality of Life Score
Variables	Adj. B	t-stat	95% CI	p-value
(Intercept)	52.45	28.0	48.75, 56.15	<0.001
Working Experience in years	0.53	10.1	0.43, 0.63	<0.001
Physical Activity Score	2.84	10.0	2.28, 3.39	<0.001
Obesity Status
not obese	—	—	—
obese	-10.33	-4.59	-14.77, -5.89	<0.001
Physical Activity Score * Obesity Status
Physical Activity Score * obese	-1.03	-2.60	-1.80, -0.25	0.010
R²	0.67
Adjusted R²	0.66
AIC	1,412.38
BIC	1,432.17
Abbreviation: CI = Confidence Interval

5.2 Final Model Equation

\(\widehat{QOL} = 52.45 + 0.53(\text{Working Experience in years}) + 2.84(\text{Physical Activity Score}) - 10.33 (\text{Obese}) + - 1.03 (\text{Physical Activity Score × Obese}) + \varepsilon\)

5.3 Interpretation

Model performance and overall fit Overall, the final model explained approximately 67% of the variability in quality of life (adjusted \(R²\) = 0.66). Inclusion of the interaction term improved model fit, supporting obesity status as an important effect modifier in the relationship between physical activity and quality of life.
Analysis of coefficients:

Working experience in years After adjustment for physical activity and obesity status, each additional year of working experience was associated with a 0.53-point increase in quality of life (95% CI: 0.43–0.63; p < 0.001). This suggests that longer working experience is positively associated with quality of life, independent of lifestyle and obesity status.
Physical activity score Among non-obese individuals, each one-unit increase in physical activity score was associated with a 2.84-point increase in quality of life (95% CI: 2.28, 3.39; p < 0.001), after adjusting for working experience and obesity status.
Obesity status Compared with non-obese individuals, obese individuals had a 10.33-point lower quality of life score on average (95% CI: −14.77, −5.89; p < 0.001), after adjusting for working experience and physical activity.This indicates a substantial negative association between obesity and quality of life.
Physical activity score * obese The interaction between physical activity and obesity status was statistically significant (β = −1.03, p = 0.010), indicating that the positive association between physical activity and quality of life was weaker among obese individuals. Specifically, while physical activity was beneficial for quality of life in both groups, the magnitude of benefit was reduced by approximately 1.03 points per unit increase in physical activity among obese individuals compared with non-obese individuals.

Model assessment Model assessment was conducted by evaluating overall goodness-of-fit and comparing competing model specifications. The interaction model demonstrated a modest improvement in explanatory power relative to the main-effects model, as reflected by a higher adjusted \(𝑅^2\) and a lower Akaike Information Criterion (AIC). Formal comparison of nested models using ANOVA further indicated that inclusion of the interaction term significantly improved model fit. Taken together, these findings support the interaction model as providing a better balance between explanatory adequacy and model parsimony, and it was therefore selected for final interpretation.
Practical implication The findings of this analysis have several practical implications for public health and workplace health promotion.

Physical activity was positively associated with quality of life across the study population, indicating that promoting physical activity remains a key strategy for improving overall well-being.
The magnitude of benefit from physical activity differed by obesity status, with stronger improvements observed among non-obese individuals, highlighting the importance of considering effect modification in intervention design.
For obese individuals, physical activity alone may be insufficient to maximise quality-of-life gains. Complementary strategies such as weight management, metabolic health optimisation, and management of comorbid conditions may be required.
Longer working experience was independently associated with higher quality of life, suggesting that occupational stability, accumulated skills, and workplace adaptation may play an important role in supporting well-being among working adults.

Collectively, these findings support the development of integrated and tailored public health and workplace interventions that address both lifestyle behaviours and underlying health status to improve quality of life in adult populations.

Limitations Several limitations should be acknowledged.

The cross-sectional design of the study limits causal inference, as temporal relationships between working experience, physical activity, obesity status, and quality of life cannot be established.
Physical activity and quality of life were modelled as continuous variables with linear effects, which may not fully capture potential non-linear or threshold relationships present in real-world settings.
Obesity status was operationalised as a binary variable, which may oversimplify heterogeneity in adiposity, metabolic risk, and health profiles within obese and non-obese groups.
Several potential confounders, including socioeconomic status, comorbid medical conditions, mental health factors, and workplace characteristics, were not included in the model and may have contributed to residual confounding.

Suggestion for improvement

Future studies should adopt longitudinal study designs to better assess temporal relationships and strengthen causal inference between physical activity, occupational factors, obesity, and quality of life.
Use of more granular measures of obesity, such as body mass index categories, waist circumference, or body composition indicators, may improve model precision and better capture heterogeneity in metabolic health.
Inclusion of psychosocial, socioeconomic, and occupational environment variables (e.g., income, education, job stress, workplace conditions) would provide a more comprehensive understanding of determinants of quality of life.
From a methodological perspective, exploring non-linear functional forms or conducting stratified analyses may help elucidate subgroup-specific effects and potential threshold relationships.
Collectively, these methodological and substantive enhancements would improve the robustness, interpretability, and real-world applicability of findings for public health and occupational health interventions.

5.4 Model Performance Visualisation

#  Actual vs Predicted
augmented_data <- augment(m2)
ggplot(augmented_data, aes(x = qol, y = .fitted)) +
  geom_point(alpha = 0.4, color = "#667eea", size = 2) +
  geom_abline(intercept = 0, slope = 1, color = "#F57C00", 
              linewidth = 1.2, linetype = "dashed") +
  geom_smooth(method = "lm", se = FALSE, color = "#D32F2F", linewidth = 1) +
  labs(title = "Actual vs Predicted QOL",
       x = "Actual QOL",
       y = "Predicted QOL") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Interpretation: The observed versus predicted quality of life plot shows good agreement between predicted and actual values, with most points distributed around the line of equality. This indicates that the model captures the central tendency of the data well, with no clear evidence of systematic prediction bias. While some variability remains, the overall pattern suggests adequate model calibration and reasonable predictive accuracy.

6 Prediction

6.1 Create new data for prediction

newdata <- expand.grid(
  years_working = mean(df$years_working),
  phys_activity = seq(0, 10, by = 0.5),
  obesity = factor(c("not obese", "obese"),
                   levels = c("not obese", "obese")))


newdata |> datatable()

6.2 Generate predictions with confidence intervals using final model

pred.qol <- augment(m2, newdata = newdata, interval = "confidence", level = 0.95)

Warning: The `level` argument is not supported in the `augment()` method for
`lm` objects and will be ignored.

pred.qol |> datatable()

6.3 Prediction visualization

ggplot(pred.qol, 
       aes(x = phys_activity, 
           y = .fitted,
           color = obesity, 
           fill = obesity)) +
  geom_ribbon(
    aes(ymin = .lower, ymax = .upper),
    alpha = 0.2,
    color = NA) +
  geom_line(linewidth = 1.2) +
  scale_color_brewer(palette = "Set2") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Predicted Effect of Physical Activity Score on Quality of Life Score",
    subtitle = "Stratified by Obesity Status",
    x = "Physical Activity",
    y = "Predicted QOL",
    color = "obesity"
    ) +
  theme_minimal(base_size = 12)

Interpretation: This interaction plot demonstrates that physical activity is positively associated with quality of life in both obese and non-obese individuals; however, the magnitude of benefit is greater among non-obese individuals, confirming obesity status as an important effect modifier.

7 Conclusion

This analysis examined the relationship between working experience, physical activity, obesity status, and quality of life using a simulated dataset designed to reflect plausible public health relationships. Through exploratory data analysis, regression modelling, and comprehensive diagnostic assessment, the application of linear regression techniques was demonstrated in a structured and reproducible manner.

Multivariable regression results indicated that both years of working experience and physical activity were positively associated with quality of life, while obesity status was associated with lower baseline quality of life. Importantly, the inclusion of an interaction term between physical activity and obesity status significantly improved model fit, indicating that the beneficial effect of physical activity on quality of life was modified by obesity status. Specifically, physical activity conferred greater improvements in quality of life among non-obese individuals compared to obese individuals.

Model comparison using ANOVA for nested models, alongside goodness-of-fit measures including \(R²\), adjusted \(R²\), and AIC, supported the interaction model as the preferred specification. Diagnostic evaluations demonstrated that key regression assumptions were reasonably satisfied, and influential diagnostics provided no evidence that model estimates were driven by a small number of observations. Consequently, all observations were retained, and the final model was deemed appropriate for inference.

Overall, this analysis illustrates the importance of considering interaction effects when modelling complex health outcomes and highlights how behavioural factors such as physical activity may operate differently across population subgroups. The findings underscore the value of careful model specification, rigorous diagnostic assessment, and clear interpretation in applied regression analysis.

Thank you.