Practice with Prof. Dr Fittri

Author

Munir

Published

January 20, 2026

1 Load library

Pacman is a package management tool that helps to load and install multiple R packages in a single command. If a package is not already installed, pacman will install it first before loading it.

Show code
library(pacman)
p_load(haven,
       readxl,
       tidyverse,
       broom,
       gtsummary,
       corrplot,
       officer,
       flextable,
       lmtest,
       lubridate,
       summarytools,
       skimr,
       car,
       psych,
       MASS,
       performance,
       caret,
       pROC,
       ggeffects,
       mfp,
       ResourceSelection,
       survival,
       survminer)

2 Read data

Code explanation: The read.csv() function is used to read a CSV (Comma-Separated Values) file and load its contents into R as a data frame. In this case, it reads the file named ‘qol_data.csv’ and stores the resulting data frame in the variable qol_life.

Show code
qol_life <- read.csv('qol_data.csv') 

3 Glimpse data

Code explanation: The glimpse() function from the dplyr package provides a quick overview of the structure of a data frame. It displays the variable names, data types, and a preview of the first few values for each variable, allowing users to quickly understand the contents and format of the dataset.

Show code
glimpse(qol_life)
Rows: 200
Columns: 4
$ qol           <dbl> 90.9, 48.6, 48.7, 93.9, 71.2, 86.5, 80.4, 77.2, 76.8, 91…
$ years_working <dbl> 29.0, 16.1, 27.0, 17.9, 24.1, 21.0, 15.2, 11.2, 19.1, 12…
$ phys_activity <dbl> 6.4, 5.1, 8.2, 7.2, 4.2, 4.5, 2.9, 3.9, 3.4, 8.7, 8.6, 4…
$ obesity       <chr> "not obese", "obese", "obese", "not obese", "not obese",…

4 Change to factor

Code explanation: The mutate(across(where(is.character), as.factor)) function is used to convert all character columns in the qol_life data frame into factor variables. This is useful for categorical data analysis, as factors are treated differently than character strings in statistical modeling and visualization.

Show code
qol_life <- qol_life %>% 
  mutate(across(where(is.character), as.factor))

glimpse(qol_life)
Rows: 200
Columns: 4
$ qol           <dbl> 90.9, 48.6, 48.7, 93.9, 71.2, 86.5, 80.4, 77.2, 76.8, 91…
$ years_working <dbl> 29.0, 16.1, 27.0, 17.9, 24.1, 21.0, 15.2, 11.2, 19.1, 12…
$ phys_activity <dbl> 6.4, 5.1, 8.2, 7.2, 4.2, 4.5, 2.9, 3.9, 3.4, 8.7, 8.6, 4…
$ obesity       <fct> not obese, obese, obese, not obese, not obese, not obese…

#Descriptive analysis

Show code
summary(qol_life)
      qol         years_working   phys_activity         obesity   
 Min.   :  5.00   Min.   : 0.00   Min.   : 0.000   not obese:131  
 1st Qu.: 59.00   1st Qu.:12.85   1st Qu.: 3.075   obese    : 69  
 Median : 72.80   Median :19.45   Median : 4.850                  
 Mean   : 70.63   Mean   :19.39   Mean   : 4.880                  
 3rd Qu.: 83.92   3rd Qu.:26.43   3rd Qu.: 6.700                  
 Max.   :100.00   Max.   :40.00   Max.   :10.000                  

5 Histogram

Code explanation: A histogram is a graphical representation of the distribution of a numerical variable. It displays the frequency of data points within specified intervals (bins) along the x-axis, allowing for visualization of the shape, spread, and central tendency of the data.

Show code
hist(qol_life$years_working)

6 Box plot

Code explanation: A box plot (or box-and-whisker plot) is a graphical representation of the distribution of a numerical variable, showing its median, quartiles, and potential outliers. It provides a visual summary of the data’s central tendency and variability, allowing for easy comparison between different groups or categories.

Show code
boxplot(years_working ~ obesity, data = qol_life, main = 'Years of Working x Obesity status', xlab ="", ylab = "")

7 Select all numerical variable

Code explanation: The select(where(is.numeric)) function is used to filter and select only the numeric columns from the qol_life data frame. This is useful for performing numerical analyses, such as correlation calculations or statistical modeling, on only the relevant numeric variables.

Show code
num_var <- qol_life %>%
  dplyr::select(where(is.numeric))

8 Create corr plot

Code explanation: A correlation plot is a graphical representation of the correlation matrix, which shows the pairwise correlations between multiple numerical variables. It helps to visualize the strength and direction of relationships between variables, making it easier to identify patterns and potential associations in the data.

Show code
cor_mat <- cor(num_var, use = "complete.obs") %>% 
  corrplot(method = "number")

9 Descriptive table

Code explanation: The tbl_summary() function from the gtsummary package is used to create a summary table of selected variables in a data frame. In this case, it summarizes the variables years_working, phys_activity, and obesity, providing descriptive statistics (mean and standard deviation) for the continuous variables, grouped by the levels of the categorical variable obesity.

Show code
qol_life %>% dplyr::select(years_working, phys_activity, obesity) %>% 
  tbl_summary(by = obesity, statistic = list(all_continuous() ~ "{mean}({sd})"))
Characteristic not obese
N = 1311
obese
N = 691
years_working 19(9) 20(10)
phys_activity 4.86(2.10) 4.92(2.60)
1 Mean(SD)

10 Model without interaction

Code explanation: A linear regression model is a statistical method used to model the relationship between a dependent variable (response) and one or more independent variables (predictors). In this case, the model predicts the quality of life (qol) based on years of working experience (years_working), physical activity level (phys_activity), and obesity status (obesity).

Show code
mod1 <- lm(qol ~ years_working + phys_activity + obesity, data = qol_life)

tidy(mod1, conf.int = T)
# A tibble: 4 × 7
  term          estimate std.error statistic  p.value conf.low conf.high
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)     76.3      2.91       26.2  5.10e-66   70.6      82.0  
2 years_working   -0.330    0.0967     -3.41 7.96e- 4   -0.520    -0.139
3 phys_activity    1.90     0.390       4.88 2.18e- 6    1.13      2.67 
4 obesityobese   -24.9      1.87      -13.3  3.23e-29  -28.5     -21.2  

11 Model with interaction

Code explanation: An interaction term in a linear regression model allows for the examination of how the effect of one independent variable on the dependent variable changes depending on the level of another independent variable. In this case, the interaction between physical activity (phys_activity) and obesity status (obesity) is included to see if the relationship between physical activity and quality of life (qol) differs based on whether an individual is obese or not.

Show code
mod2 <- lm(qol ~ years_working + phys_activity + obesity + phys_activity:obesity, data = qol_life)

tidy(mod2, conf.int = T)
# A tibble: 5 × 7
  term                  estimate std.error statistic  p.value conf.low conf.high
  <chr>                    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)             72.1      3.33       21.7  7.48e-54   65.6      78.7  
2 years_working           -0.329    0.0955     -3.44 7.08e- 4   -0.517    -0.140
3 phys_activity            2.76     0.516       5.34 2.62e- 7    1.74      3.77 
4 obesityobese           -15.5      4.21       -3.68 3.06e- 4  -23.8      -7.18 
5 phys_activity:obesit…   -1.92     0.774      -2.48 1.42e- 2   -3.44     -0.390

12 Compare model

12.1 Nested ANOVA

Code explanation: Nested ANOVA (Analysis of Variance) is a statistical method used to compare the fit of two or more hierarchical linear models. It tests whether the addition of one or more predictors significantly improves the model’s ability to explain the variability in the dependent variable. In this case, it compares mod1 (without interaction) and mod2 (with interaction) to see if including the interaction term significantly enhances the model.

Show code
anova(mod1, mod2)
Analysis of Variance Table

Model 1: qol ~ years_working + phys_activity + obesity
Model 2: qol ~ years_working + phys_activity + obesity + phys_activity:obesity
  Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
1    196 30841                              
2    195 29901  1    939.58 6.1275 0.01416 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

12.2 Performance comparison

Code explanation: The compare_performance() function from the performance package is used to compare the performance of multiple statistical models. It provides various metrics such as R-squared, AIC, BIC, and others to evaluate and compare how well each model fits the data. In this case, it compares mod1 and mod2 to assess which model performs better.

Show code
compare_performance(mod1, mod2)
# Comparison of Model Performance Indices

Name | Model |  AIC (weights) | AICc (weights) |  BIC (weights) |    R2
-----------------------------------------------------------------------
mod1 |    lm | 1585.2 (0.110) | 1585.5 (0.116) | 1601.7 (0.391) | 0.525
mod2 |    lm | 1581.0 (0.890) | 1581.5 (0.884) | 1600.8 (0.609) | 0.540

Name | R2 (adj.) |   RMSE |  Sigma
----------------------------------
mod1 |     0.518 | 12.418 | 12.544
mod2 |     0.530 | 12.227 | 12.383

13 Coefficient comparison

Code explanation: The following code compares the coefficients of two linear models (mod1 and mod2) by extracting the estimates from each model, merging them into a single data frame, and calculating the percentage change in coefficients between the two models. This helps to understand how the inclusion of interaction terms in mod2 affects the estimated effects of the predictors compared to mod1.

Show code
# Coefficient from model 1

b1 <- tidy(mod1) %>% dplyr::select(term,
estimate) %>% rename(beta1 =
estimate)

# Coefficient from model 2

b2 <- tidy(mod2) %>% dplyr::select(term,
estimate) %>% rename(beta2 =
estimate)

# Coefficient from model 2

compare <- inner_join(b1, b2, by =
"term") %>%
 mutate( pct_change = (beta2 - beta1) /
beta1 * 100)

# View comparison
compare
# A tibble: 4 × 4
  term            beta1   beta2 pct_change
  <chr>           <dbl>   <dbl>      <dbl>
1 (Intercept)    76.3    72.1       -5.45 
2 years_working  -0.330  -0.329     -0.304
3 phys_activity   1.90    2.76      44.7  
4 obesityobese  -24.9   -15.5      -37.7  

$ We choose Model 2 as pre liminary model $

14 Model assumption

List of model assumption that we will run: - residual vs fitted plot (to check linearity and homoscedasticity) - histogram of residuals (to check normality) - normality test of residuals (Shapiro-Wilk test) - homoscedasticity test (Breusch-Pagan test) - autocorrelation test (Durbin-Watson test)

14.1 Plot model

Code explanation: The plot() function applied to a linear model object in R generates a series of diagnostic plots that help assess the assumptions of the linear regression model. These plots typically include

List of plot: - Residuals vs Fitted (tip to check linearity and homoscedasticity) - Normal Q-Q (to check normality of residuals) - Scale-Location (to check homoscedasticity) - Residuals vs Leverage (to identify influential data points)

Show code
plot(mod2)

14.2 Create augmented data

Code explanation: The augment() function from the broom package is used to add additional information to a model object, such as predicted values, residuals, and other diagnostic measures. In this case, it is applied to the linear model mod2 to create a new data frame (aug.mod) that includes these additional details for further analysis and visualization.

Show code
aug.mod <- augment(mod2)

14.3 Residual vs Fitted plot

Code explanation: The residuals vs fitted plot is a diagnostic tool used to assess the assumptions of a linear regression model. It displays the residuals (the differences between observed and predicted values) on the y-axis against the fitted values (predicted values) on the x-axis. This plot helps to check for linearity, homoscedasticity (constant variance of residuals), and the presence of any patterns that may indicate model inadequacies.

Show code
ggplot(qol_life, aes(x = years_working, y =
residuals(mod2))) + geom_point() +
geom_hline(yintercept = 0)

14.4 Histogram of residuals

Code explanation: A histogram of residuals is a graphical representation that displays the distribution of the residuals (the differences between observed and predicted values) from a regression model. It helps to assess whether the residuals are normally distributed, which is an important assumption in linear regression analysis.

Show code
hist(residuals(mod2))

14.5 Shapiro-Wilk test

Code explanation: The Shapiro-Wilk test is a statistical test used to assess the normality of a dataset. In the context of regression analysis, it is often applied to the residuals of a model to determine whether they follow a normal distribution, which is an important assumption for many statistical tests and models.

Show code
shapiro.test(mod2$residuals)

    Shapiro-Wilk normality test

data:  mod2$residuals
W = 0.79951, p-value = 2.667e-15

14.6 Breusch-Pagan test

Code explanation: The Breusch-Pagan test is a statistical test used to assess the homoscedasticity (constant variance) of the residuals in a regression model. It tests whether the variance of the residuals is dependent on the values of the independent variables. A significant result indicates heteroscedasticity, meaning that the variance of the residuals varies with the predictors, which can violate the assumptions of linear regression.

Show code
bptest(mod2)

    studentized Breusch-Pagan test

data:  mod2
BP = 4.2319, df = 4, p-value = 0.3755

14.7 Durbin-Watson test

Code explanation: The Durbin-Watson test is a statistical test used to detect the presence of autocorrelation (correlation between residuals) in the residuals of a regression model. It helps to determine whether the residuals are independent of each other, which is an important assumption in linear regression analysis. A Durbin-Watson statistic close to 2 indicates no autocorrelation, while values significantly less than 2 suggest positive autocorrelation and values significantly greater than 2 suggest negative autocorrelation.

Show code
dwtest(mod2)

    Durbin-Watson test

data:  mod2
DW = 2.1229, p-value = 0.8104
alternative hypothesis: true autocorrelation is greater than 0

14.8 Transformasi Box-Cox

Code explanation: The Box-Cox transformation is a statistical technique used to stabilize variance and make data more normally distributed. It is often applied to the response variable in regression analysis to meet the assumptions of linear regression, such as normality of residuals and homoscedasticity. The transformation involves finding an optimal lambda parameter that maximizes the likelihood of the transformed data fitting a normal distribution.

Show code
boxcox(mod2)

14.9 Multicollinearity test

Code explanation: The Variance Inflation Factor (VIF) is a measure used to detect multicollinearity in regression analysis. It quantifies how much the variance of a regression coefficient is increased due to collinearity with other predictors. A VIF value greater than 5 or 10 is often considered indicative of significant multicollinearity, which can affect the stability and interpretability of the regression model.

Show code
vif(mod2)
        years_working         phys_activity               obesity 
             1.003451              1.801865              5.230799 
phys_activity:obesity 
             6.076052 

15 Create new data for prediction

Code explanation: The expand.grid() function in R is used to create a data frame from all combinations of the specified vectors or factors. In this case, it generates a new data frame (new_data) that contains all possible combinations of the specified values for years_working, obesity, and phys_activity. This is useful for making predictions or conducting analyses across a range of scenarios.

Show code
new_data <- expand.grid(
years_working = c(5, 10, 15),
obesity = c("not obese", "obese"),
phys_activiy = c(1, 5, 10)
)

16 Predict data

Code explanation: The predict() function in R is used to generate predicted values from a fitted model based on new input data. In this case, it uses the linear model mod2 to predict the quality of life (qol) for a specific set of predictor values: 5 years of working, not obese status, and a physical activity level of 1.

Predict QOL for :

  • obese yes
  • working 10
  • phys activity 5
Show code
predict(mod2, newdata = data.frame(
  years_working = 5,
  obesity = "not obese",
  phys_activity = 1
))
       1 
73.26122 

17 Manual equation

17.1 Manual Prediction Using Regression Equation

The linear regression model is:

\[ \hat{y} = 72.15 - 0.33 \cdot \text{YearsWorking} + 2.76 \cdot \text{PhysActivity} - 15.49 \cdot \text{Obesity} - 1.92 \cdot (\text{PhysActivity} \cdot \text{Obesity}) \]

17.1.1 Example: Predict for someone with:

  • Years working = 5
  • PhysActivity = 1 (active)
  • Obesity = 0 (obese)

Substitute into the equation:

\[ \hat{y} = 72.15 - 0.33 \cdot 5 + 2.76 \cdot 1 - 15.49 \cdot 0 - 1.92 \cdot (1 \cdot 0) \]

Step-by-step:

\[ \hat{y} = 72.15 - 1.65 + 2.76 - 0 - 0 \]

Final result:

\[ \hat{y} = 73.26 \]

So, the predicted Quality of Life Score is 73.26.