library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(purrr)
## Warning: package 'purrr' was built under R version 4.3.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'tibble' was built under R version 4.3.2
## Warning: package 'tidyr' was built under R version 4.3.2
## Warning: package 'readr' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## Warning: package 'forcats' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.2
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.2
data_path <- "C:/Users/shanata/Downloads/smoking_driking_dataset_Ver01.csv"
data <- read.csv(data_path)
I want to predict systolic blood pressure based on age, height, wiastline, cholestral levels. I am building a linear regression model
model <- lm(SBP ~ age + height + weight + waistline + HDL_chole + LDL_chole, data = data)
summary(model)
##
## Call:
## lm(formula = SBP ~ age + height + weight + waistline + HDL_chole +
## LDL_chole, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -108.381 -8.889 -0.581 8.106 151.021
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.7943156 0.3384147 294.888 < 2e-16 ***
## age 0.3019872 0.0010752 280.876 < 2e-16 ***
## height -0.1351739 0.0021128 -63.980 < 2e-16 ***
## weight 0.3989575 0.0019047 209.456 < 2e-16 ***
## waistline 0.0529521 0.0015721 33.681 < 2e-16 ***
## HDL_chole 0.0134980 0.0008194 16.473 < 2e-16 ***
## LDL_chole -0.0011301 0.0003737 -3.024 0.00249 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.24 on 991339 degrees of freedom
## Multiple R-squared: 0.1711, Adjusted R-squared: 0.1711
## F-statistic: 3.411e+04 on 6 and 991339 DF, p-value: < 2.2e-16
The intercept of approximately 99.79 represents the estimated Systolic Blood Pressure (SBP) when all other explanatory variables are zero, which may not have a practical interpretation in this context.
Age has a positive coefficient of approximately 0.302, indicating that for each year increase in age, SBP is expected to increase by about 0.302 units, holding all other variables constant.
Height has a negative coefficient of approximately -0.135, suggesting that for each additional centimeter in height, SBP is expected to decrease by approximately 0.135 units, assuming other factors remain the same.
Weight has a positive coefficient of approximately 0.399, implying that for each additional kilogram in weight, SBP is expected to increase by about 0.399 units, all else being equal.
Waistline has a positive coefficient of approximately 0.053, indicating that for each additional centimeter in waistline measurement, SBP is expected to rise by approximately 0.053 units, while keeping other variables constant.
Both HDL cholesterol and LDL cholesterol have significant effects on SBP. HDL cholesterol has a positive coefficient of approximately 0.0135, suggesting that higher HDL cholesterol levels are associated with higher SBP. In contrast, LDL cholesterol has a negative coefficient of approximately -0.0011, indicating that higher LDL cholesterol levels are associated with lower SBP.
The overall model explains 17.11% of the variance in SBP, as indicated by the adjusted R-squared value. This means that the included variables collectively account for 17.11% of the variation in SBP. The F-statistic is highly significant (p-value < 2.2e-16), indicating that the model as a whole is statistically significant.
par(mfrow=c(2,2))
plot(model)
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:dplyr':
##
## recode
vif(model)
## age height weight waistline HDL_chole LDL_chole
## 1.314615 2.175155 3.212861 1.962747 1.128252 1.014379
VIF values are relatively low for most variables, which suggests that multicollinearity is not a significant concern
The p-values (Pr(>|t|)) for all coefficients, except LDL cholesterol, are extremely low, suggesting high statistical significance. This implies that these variables have a strong relationship with the response variable (SBP)
Converting categorical data atypes
data$sex <- as.factor(data$sex)
data$SMK_stat_type_cd <- as.factor(data$SMK_stat_type_cd)
data$DRK_YN <- as.factor(ifelse(data$DRK_YN == "Y", 1, 0))
model <- glm(DRK_YN ~ age + height + weight + waistline + HDL_chole + LDL_chole + SMK_stat_type_cd, data = data, family = binomial,control = glm.control(maxit = 1000))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
##
## Call:
## glm(formula = DRK_YN ~ age + height + weight + waistline + HDL_chole +
## LDL_chole + SMK_stat_type_cd, family = binomial, data = data,
## control = glm.control(maxit = 1000))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.618e+00 6.895e-02 -110.48 <2e-16 ***
## age -3.213e-02 2.017e-04 -159.27 <2e-16 ***
## height 4.012e-02 4.052e-04 99.02 <2e-16 ***
## weight 1.112e-02 4.372e-04 25.43 <2e-16 ***
## waistline 4.935e-03 4.659e-04 10.59 <2e-16 ***
## HDL_chole 2.248e-02 1.719e-04 130.78 <2e-16 ***
## LDL_chole -2.309e-03 6.693e-05 -34.49 <2e-16 ***
## SMK_stat_type_cd2 1.252e+00 6.761e-03 185.12 <2e-16 ***
## SMK_stat_type_cd3 1.375e+00 6.494e-03 211.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1374297 on 991345 degrees of freedom
## Residual deviance: 1122393 on 991337 degrees of freedom
## AIC: 1122411
##
## Number of Fisher Scoring iterations: 4
# Odds ratios for coefficients
exp(coef(model))
## (Intercept) age height weight
## 0.000491697 0.968382506 1.040938064 1.011180976
## waistline HDL_chole LDL_chole SMK_stat_type_cd2
## 1.004947566 1.022737204 0.997693941 3.495933664
## SMK_stat_type_cd3
## 3.955600920
# Predicted probabilities
data$predicted_prob <- predict(model, type = "response")
# Diagnostic plots (if needed)
par(mfrow=c(2,2))
plot(model)
# Check for multicollinearity (if needed)
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## age 1.414349 1 1.189264
## height 2.227012 1 1.492318
## weight 5.190639 1 2.278297
## waistline 3.899107 1 1.974616
## HDL_chole 1.256425 1 1.120904
## LDL_chole 1.013410 1 1.006683
## SMK_stat_type_cd 1.329183 2 1.073733
The model did not fit well, so I will try to use another GLM
model <- glm(SBP ~ age + height + weight + waistline + HDL_chole + LDL_chole, data = data)
summary(model)
##
## Call:
## glm(formula = SBP ~ age + height + weight + waistline + HDL_chole +
## LDL_chole, data = data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.7943156 0.3384147 294.888 < 2e-16 ***
## age 0.3019872 0.0010752 280.876 < 2e-16 ***
## height -0.1351739 0.0021128 -63.980 < 2e-16 ***
## weight 0.3989575 0.0019047 209.456 < 2e-16 ***
## waistline 0.0529521 0.0015721 33.681 < 2e-16 ***
## HDL_chole 0.0134980 0.0008194 16.473 < 2e-16 ***
## LDL_chole -0.0011301 0.0003737 -3.024 0.00249 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 175.3104)
##
## Null deviance: 209672593 on 991345 degrees of freedom
## Residual deviance: 173792040 on 991339 degrees of freedom
## AIC: 7935174
##
## Number of Fisher Scoring iterations: 2
The null deviance, which represents the variability explained by the null model (no predictors), is 209,672,593, while the residual deviance, which represents the unexplained variability by your model, is 173,792,040. The reduction in deviance between the null and your model indicates that model provides a statistically significant improvement in explaining the variance in SBP
The AIC value is 7,935,174, and it can be used to compare models. Lower AIC values indicate better model fit. In this context, the AIC value is a measure of the trade-off between model complexity and goodness of fit.
The model required two Fisher scoring iterations to converge, which is relatively low and suggests that the model fitting process was successful.
data$predicted_prob <- predict(model, type = "response")
I have created a seperate column to store the results
par(mfrow=c(2,2))
plot(model)