Lab - 11

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(purrr)

## Warning: package 'purrr' was built under R version 4.3.2

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.2

## Warning: package 'ggplot2' was built under R version 4.3.2

## Warning: package 'tibble' was built under R version 4.3.2

## Warning: package 'tidyr' was built under R version 4.3.2

## Warning: package 'readr' was built under R version 4.3.2

## Warning: package 'stringr' was built under R version 4.3.2

## Warning: package 'forcats' was built under R version 4.3.2

## Warning: package 'lubridate' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ readr     2.1.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.3.2

library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.3.2

data_path <- "C:/Users/shanata/Downloads/smoking_driking_dataset_Ver01.csv"

data <- read.csv(data_path)

Building Linear Model

I want to predict systolic blood pressure based on age, height, wiastline, cholestral levels. I am building a linear regression model

model <- lm(SBP ~ age + height + weight + waistline + HDL_chole + LDL_chole, data = data)

summary(model)

## 
## Call:
## lm(formula = SBP ~ age + height + weight + waistline + HDL_chole + 
##     LDL_chole, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -108.381   -8.889   -0.581    8.106  151.021 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 99.7943156  0.3384147 294.888  < 2e-16 ***
## age          0.3019872  0.0010752 280.876  < 2e-16 ***
## height      -0.1351739  0.0021128 -63.980  < 2e-16 ***
## weight       0.3989575  0.0019047 209.456  < 2e-16 ***
## waistline    0.0529521  0.0015721  33.681  < 2e-16 ***
## HDL_chole    0.0134980  0.0008194  16.473  < 2e-16 ***
## LDL_chole   -0.0011301  0.0003737  -3.024  0.00249 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.24 on 991339 degrees of freedom
## Multiple R-squared:  0.1711, Adjusted R-squared:  0.1711 
## F-statistic: 3.411e+04 on 6 and 991339 DF,  p-value: < 2.2e-16

Interpretations:

The intercept of approximately 99.79 represents the estimated Systolic Blood Pressure (SBP) when all other explanatory variables are zero, which may not have a practical interpretation in this context.

Age has a positive coefficient of approximately 0.302, indicating that for each year increase in age, SBP is expected to increase by about 0.302 units, holding all other variables constant.

Height has a negative coefficient of approximately -0.135, suggesting that for each additional centimeter in height, SBP is expected to decrease by approximately 0.135 units, assuming other factors remain the same.

Weight has a positive coefficient of approximately 0.399, implying that for each additional kilogram in weight, SBP is expected to increase by about 0.399 units, all else being equal.

Waistline has a positive coefficient of approximately 0.053, indicating that for each additional centimeter in waistline measurement, SBP is expected to rise by approximately 0.053 units, while keeping other variables constant.

Both HDL cholesterol and LDL cholesterol have significant effects on SBP. HDL cholesterol has a positive coefficient of approximately 0.0135, suggesting that higher HDL cholesterol levels are associated with higher SBP. In contrast, LDL cholesterol has a negative coefficient of approximately -0.0011, indicating that higher LDL cholesterol levels are associated with lower SBP.

The overall model explains 17.11% of the variance in SBP, as indicated by the adjusted R-squared value. This means that the included variables collectively account for 17.11% of the variation in SBP. The F-statistic is highly significant (p-value < 2.2e-16), indicating that the model as a whole is statistically significant.

par(mfrow=c(2,2))
plot(model)

library(car)

## Warning: package 'car' was built under R version 4.3.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.2

## 
## Attaching package: 'car'

## The following object is masked from 'package:purrr':
## 
##     some

## The following object is masked from 'package:dplyr':
## 
##     recode

vif(model)

##       age    height    weight waistline HDL_chole LDL_chole 
##  1.314615  2.175155  3.212861  1.962747  1.128252  1.014379

VIF values are relatively low for most variables, which suggests that multicollinearity is not a significant concern

The p-values (Pr(>|t|)) for all coefficients, except LDL cholesterol, are extremely low, suggesting high statistical significance. This implies that these variables have a strong relationship with the response variable (SBP)

Building a generalised linear model

Converting categorical data atypes

data$sex <- as.factor(data$sex)
data$SMK_stat_type_cd <- as.factor(data$SMK_stat_type_cd)

data$DRK_YN <- as.factor(ifelse(data$DRK_YN == "Y", 1, 0))

model <- glm(DRK_YN ~ age + height + weight + waistline + HDL_chole + LDL_chole + SMK_stat_type_cd, data = data, family = binomial,control = glm.control(maxit = 1000))

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model)

## 
## Call:
## glm(formula = DRK_YN ~ age + height + weight + waistline + HDL_chole + 
##     LDL_chole + SMK_stat_type_cd, family = binomial, data = data, 
##     control = glm.control(maxit = 1000))
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.618e+00  6.895e-02 -110.48   <2e-16 ***
## age               -3.213e-02  2.017e-04 -159.27   <2e-16 ***
## height             4.012e-02  4.052e-04   99.02   <2e-16 ***
## weight             1.112e-02  4.372e-04   25.43   <2e-16 ***
## waistline          4.935e-03  4.659e-04   10.59   <2e-16 ***
## HDL_chole          2.248e-02  1.719e-04  130.78   <2e-16 ***
## LDL_chole         -2.309e-03  6.693e-05  -34.49   <2e-16 ***
## SMK_stat_type_cd2  1.252e+00  6.761e-03  185.12   <2e-16 ***
## SMK_stat_type_cd3  1.375e+00  6.494e-03  211.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1374297  on 991345  degrees of freedom
## Residual deviance: 1122393  on 991337  degrees of freedom
## AIC: 1122411
## 
## Number of Fisher Scoring iterations: 4

# Odds ratios for coefficients
exp(coef(model))

##       (Intercept)               age            height            weight 
##       0.000491697       0.968382506       1.040938064       1.011180976 
##         waistline         HDL_chole         LDL_chole SMK_stat_type_cd2 
##       1.004947566       1.022737204       0.997693941       3.495933664 
## SMK_stat_type_cd3 
##       3.955600920

# Predicted probabilities
data$predicted_prob <- predict(model, type = "response")

# Diagnostic plots (if needed)
par(mfrow=c(2,2))
plot(model)

# Check for multicollinearity (if needed)
vif(model)

##                      GVIF Df GVIF^(1/(2*Df))
## age              1.414349  1        1.189264
## height           2.227012  1        1.492318
## weight           5.190639  1        2.278297
## waistline        3.899107  1        1.974616
## HDL_chole        1.256425  1        1.120904
## LDL_chole        1.013410  1        1.006683
## SMK_stat_type_cd 1.329183  2        1.073733

The model did not fit well, so I will try to use another GLM

model <- glm(SBP ~ age + height + weight + waistline + HDL_chole + LDL_chole, data = data)

summary(model)

## 
## Call:
## glm(formula = SBP ~ age + height + weight + waistline + HDL_chole + 
##     LDL_chole, data = data)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 99.7943156  0.3384147 294.888  < 2e-16 ***
## age          0.3019872  0.0010752 280.876  < 2e-16 ***
## height      -0.1351739  0.0021128 -63.980  < 2e-16 ***
## weight       0.3989575  0.0019047 209.456  < 2e-16 ***
## waistline    0.0529521  0.0015721  33.681  < 2e-16 ***
## HDL_chole    0.0134980  0.0008194  16.473  < 2e-16 ***
## LDL_chole   -0.0011301  0.0003737  -3.024  0.00249 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 175.3104)
## 
##     Null deviance: 209672593  on 991345  degrees of freedom
## Residual deviance: 173792040  on 991339  degrees of freedom
## AIC: 7935174
## 
## Number of Fisher Scoring iterations: 2

The null deviance, which represents the variability explained by the null model (no predictors), is 209,672,593, while the residual deviance, which represents the unexplained variability by your model, is 173,792,040. The reduction in deviance between the null and your model indicates that model provides a statistically significant improvement in explaining the variance in SBP

The AIC value is 7,935,174, and it can be used to compare models. Lower AIC values indicate better model fit. In this context, the AIC value is a measure of the trade-off between model complexity and goodness of fit.

The model required two Fisher scoring iterations to converge, which is relatively low and suggests that the model fitting process was successful.

data$predicted_prob <- predict(model, type = "response")

I have created a seperate column to store the results

par(mfrow=c(2,2))
plot(model)

Lab - 11

2023-11-06

Building Linear Model

Interpretations:

Building a generalised linear model