---
title: "Practice with Prof. Dr Fittri "
author: "Munir"
date: "`r Sys.Date()`"
format:
html:
theme: cerulean
toc: true
toc-expand: 1
toc-location: left
code-fold: true
code-summary: "Show code"
code-tools: true
number-sections: true
number-depth: 3
execute:
echo: true
warning: false
message: false
comment: NA
---
# Load library
Pacman is a package management tool that helps to load and install multiple R packages in a single command. If a package is not already installed, pacman will install it first before loading it.
```{r}
library(pacman)
p_load(haven,
readxl,
tidyverse,
broom,
gtsummary,
corrplot,
officer,
flextable,
lmtest,
lubridate,
summarytools,
skimr,
car,
psych,
MASS,
performance,
caret,
pROC,
ggeffects,
mfp,
ResourceSelection,
survival,
survminer)
```
# Read data
Code explanation: The `read.csv()` function is used to read a CSV (Comma-Separated Values) file and load its contents into R as a data frame. In this case, it reads the file named 'qol_data.csv' and stores the resulting data frame in the variable `qol_life`.
```{r}
qol_life <- read.csv('qol_data.csv')
```
# Glimpse data
Code explanation: The `glimpse()` function from the `dplyr` package provides a quick overview of the structure of a data frame. It displays the variable names, data types, and a preview of the first few values for each variable, allowing users to quickly understand the contents and format of the dataset.
```{r}
glimpse(qol_life)
```
# Change to factor
Code explanation: The `mutate(across(where(is.character), as.factor))` function is used to convert all character columns in the `qol_life` data frame into factor variables. This is useful for categorical data analysis, as factors are treated differently than character strings in statistical modeling and visualization.
```{r}
qol_life <- qol_life %>%
mutate(across(where(is.character), as.factor))
glimpse(qol_life)
```
#Descriptive analysis
```{r}
summary(qol_life)
```
# Histogram
Code explanation: A histogram is a graphical representation of the distribution of a numerical variable. It displays the frequency of data points within specified intervals (bins) along the x-axis, allowing for visualization of the shape, spread, and central tendency of the data.
```{r}
hist(qol_life$years_working)
```
# Box plot
Code explanation: A box plot (or box-and-whisker plot) is a graphical representation of the distribution of a numerical variable, showing its median, quartiles, and potential outliers. It provides a visual summary of the data's central tendency and variability, allowing for easy comparison between different groups or categories.
```{r}
boxplot(years_working ~ obesity, data = qol_life, main = 'Years of Working x Obesity status', xlab ="", ylab = "")
```
# Select all numerical variable
Code explanation: The `select(where(is.numeric))` function is used to filter and select only the numeric columns from the `qol_life` data frame. This is useful for performing numerical analyses, such as correlation calculations or statistical modeling, on only the relevant numeric variables.
```{r}
num_var <- qol_life %>%
dplyr::select(where(is.numeric))
```
# Create corr plot
Code explanation: A correlation plot is a graphical representation of the correlation matrix, which shows the pairwise correlations between multiple numerical variables. It helps to visualize the strength and direction of relationships between variables, making it easier to identify patterns and potential associations in the data.
```{r}
cor_mat <- cor(num_var, use = "complete.obs") %>%
corrplot(method = "number")
```
# Descriptive table
Code explanation: The `tbl_summary()` function from the `gtsummary` package is used to create a summary table of selected variables in a data frame. In this case, it summarizes the variables `years_working`, `phys_activity`, and `obesity`, providing descriptive statistics (mean and standard deviation) for the continuous variables, grouped by the levels of the categorical variable `obesity`.
```{r}
qol_life %>% dplyr::select(years_working, phys_activity, obesity) %>%
tbl_summary(by = obesity, statistic = list(all_continuous() ~ "{mean}({sd})"))
```
# Model without interaction
Code explanation: A linear regression model is a statistical method used to model the relationship between a dependent variable (response) and one or more independent variables (predictors). In this case, the model predicts the quality of life (`qol`) based on years of working experience (`years_working`), physical activity level (`phys_activity`), and obesity status (`obesity`).
```{r}
mod1 <- lm(qol ~ years_working + phys_activity + obesity, data = qol_life)
tidy(mod1, conf.int = T)
```
# Model with interaction
Code explanation: An interaction term in a linear regression model allows for the examination of how the effect of one independent variable on the dependent variable changes depending on the level of another independent variable. In this case, the interaction between physical activity (`phys_activity`) and obesity status (`obesity`) is included to see if the relationship between physical activity and quality of life (`qol`) differs based on whether an individual is obese or not.
```{r}
mod2 <- lm(qol ~ years_working + phys_activity + obesity + phys_activity:obesity, data = qol_life)
tidy(mod2, conf.int = T)
```
# Compare model
## Nested ANOVA
Code explanation: Nested ANOVA (Analysis of Variance) is a statistical method used to compare the fit of two or more hierarchical linear models. It tests whether the addition of one or more predictors significantly improves the model's ability to explain the variability in the dependent variable. In this case, it compares `mod1` (without interaction) and `mod2` (with interaction) to see if including the interaction term significantly enhances the model.
```{r}
anova(mod1, mod2)
```
## Performance comparison
Code explanation: The `compare_performance()` function from the `performance` package is used to compare the performance of multiple statistical models. It provides various metrics such as R-squared, AIC, BIC, and others to evaluate and compare how well each model fits the data. In this case, it compares `mod1` and `mod2` to assess which model performs better.
```{r}
compare_performance(mod1, mod2)
```
# Coefficient comparison
Code explanation: The following code compares the coefficients of two linear models (`mod1` and `mod2`) by extracting the estimates from each model, merging them into a single data frame, and calculating the percentage change in coefficients between the two models. This helps to understand how the inclusion of interaction terms in `mod2` affects the estimated effects of the predictors compared to `mod1`.
```{r}
# Coefficient from model 1
b1 <- tidy(mod1) %>% dplyr::select(term,
estimate) %>% rename(beta1 =
estimate)
# Coefficient from model 2
b2 <- tidy(mod2) %>% dplyr::select(term,
estimate) %>% rename(beta2 =
estimate)
# Coefficient from model 2
compare <- inner_join(b1, b2, by =
"term") %>%
mutate( pct_change = (beta2 - beta1) /
beta1 * 100)
# View comparison
compare
```
\$ We choose Model 2 as pre liminary model \$
# Model assumption
List of model assumption that we will run:
- residual vs fitted plot (to check linearity and homoscedasticity)
- histogram of residuals (to check normality)
- normality test of residuals (Shapiro-Wilk test)
- homoscedasticity test (Breusch-Pagan test)
- autocorrelation test (Durbin-Watson test)
## Plot model
Code explanation: The `plot()` function applied to a linear model object in R generates a series of diagnostic plots that help assess the assumptions of the linear regression model. These plots typically include
List of plot:
- Residuals vs Fitted (tip to check linearity and homoscedasticity)
- Normal Q-Q (to check normality of residuals)
- Scale-Location (to check homoscedasticity)
- Residuals vs Leverage (to identify influential data points)
```{r}
plot(mod2)
```
## Create augmented data
Code explanation: The `augment()` function from the `broom` package is used to add additional information to a model object, such as predicted values, residuals, and other diagnostic measures. In this case, it is applied to the linear model `mod2` to create a new data frame (`aug.mod`) that includes these additional details for further analysis and visualization.
```{r}
aug.mod <- augment(mod2)
```
## Residual vs Fitted plot
Code explanation: The residuals vs fitted plot is a diagnostic tool used to assess the assumptions of a linear regression model. It displays the residuals (the differences between observed and predicted values) on the y-axis against the fitted values (predicted values) on the x-axis. This plot helps to check for linearity, homoscedasticity (constant variance of residuals), and the presence of any patterns that may indicate model inadequacies.
```{r}
ggplot(qol_life, aes(x = years_working, y =
residuals(mod2))) + geom_point() +
geom_hline(yintercept = 0)
```
## Histogram of residuals
Code explanation: A histogram of residuals is a graphical representation that displays the distribution of the residuals (the differences between observed and predicted values) from a regression model. It helps to assess whether the residuals are normally distributed, which is an important assumption in linear regression analysis.
```{r}
hist(residuals(mod2))
```
## Shapiro-Wilk test
Code explanation: The Shapiro-Wilk test is a statistical test used to assess the normality of a dataset. In the context of regression analysis, it is often applied to the residuals of a model to determine whether they follow a normal distribution, which is an important assumption for many statistical tests and models.
```{r}
shapiro.test(mod2$residuals)
```
## Breusch-Pagan test
Code explanation: The Breusch-Pagan test is a statistical test used to assess the homoscedasticity (constant variance) of the residuals in a regression model. It tests whether the variance of the residuals is dependent on the values of the independent variables. A significant result indicates heteroscedasticity, meaning that the variance of the residuals varies with the predictors, which can violate the assumptions of linear regression.
```{r}
bptest(mod2)
```
## Durbin-Watson test
Code explanation: The Durbin-Watson test is a statistical test used to detect the presence of autocorrelation (correlation between residuals) in the residuals of a regression model. It helps to determine whether the residuals are independent of each other, which is an important assumption in linear regression analysis. A Durbin-Watson statistic close to 2 indicates no autocorrelation, while values significantly less than 2 suggest positive autocorrelation and values significantly greater than 2 suggest negative autocorrelation.
```{r}
dwtest(mod2)
```
## Transformasi Box-Cox
Code explanation: The Box-Cox transformation is a statistical technique used to stabilize variance and make data more normally distributed. It is often applied to the response variable in regression analysis to meet the assumptions of linear regression, such as normality of residuals and homoscedasticity. The transformation involves finding an optimal lambda parameter that maximizes the likelihood of the transformed data fitting a normal distribution.
```{r}
boxcox(mod2)
```
## Multicollinearity test
Code explanation: The Variance Inflation Factor (VIF) is a measure used to detect multicollinearity in regression analysis. It quantifies how much the variance of a regression coefficient is increased due to collinearity with other predictors. A VIF value greater than 5 or 10 is often considered indicative of significant multicollinearity, which can affect the stability and interpretability of the regression model.
```{r}
vif(mod2)
```
# Create new data for prediction
Code explanation: The `expand.grid()` function in R is used to create a data frame from all combinations of the specified vectors or factors. In this case, it generates a new data frame (`new_data`) that contains all possible combinations of the specified values for `years_working`, `obesity`, and `phys_activity`. This is useful for making predictions or conducting analyses across a range of scenarios.
```{r}
new_data <- expand.grid(
years_working = c(5, 10, 15),
obesity = c("not obese", "obese"),
phys_activiy = c(1, 5, 10)
)
```
# Predict data
Code explanation: The `predict()` function in R is used to generate predicted values from a fitted model based on new input data. In this case, it uses the linear model `mod2` to predict the quality of life (`qol`) for a specific set of predictor values: 5 years of working, not obese status, and a physical activity level of 1.
Predict QOL for :
- obese yes
- working 10
- phys activity 5
```{r}
predict(mod2, newdata = data.frame(
years_working = 5,
obesity = "not obese",
phys_activity = 1
))
```
# Manual equation
## Manual Prediction Using Regression Equation
The linear regression model is:
$$
\hat{y} = 72.15 - 0.33 \cdot \text{YearsWorking} + 2.76 \cdot \text{PhysActivity} - 15.49 \cdot \text{Obesity} - 1.92 \cdot (\text{PhysActivity} \cdot \text{Obesity})
$$
### Example: Predict for someone with:
- Years working = 5
- PhysActivity = 1 (active)
- Obesity = 0 (obese)
Substitute into the equation:
$$
\hat{y} = 72.15 - 0.33 \cdot 5 + 2.76 \cdot 1 - 15.49 \cdot 0 - 1.92 \cdot (1 \cdot 0)
$$
Step-by-step:
$$
\hat{y} = 72.15 - 1.65 + 2.76 - 0 - 0
$$
Final result:
$$
\hat{y} = 73.26
$$
So, the predicted Quality of Life Score is **73.26**.