This study investigates the effects of subliminal messages on test scores.
Students who failed a math exam were randomly assigned to receive either a treatment message (“Each day I am getting better at mathematics”) or a control message (a boring message unrelated to math).
All students took an initial test (pre-test), went to a math course during the summer, and then took a post-test. The researchers claim that the positive subliminal message improves test scores.
We will test this claim using the change in test scores (post-test score - pre-test score) for both the treatment and control groups. #loading up data sets
# Load required libraries and data set
library(haven)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
setwd("/Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin")
data <- read_dta("subliminal.dta")
head(data)
## # A tibble: 6 × 4
## pre_test post_test improvement group
## <dbl> <dbl> <dbl> <chr>
## 1 18 24 6 T
## 2 18 25 7 T
## 3 21 33 12 T
## 4 18 29 11 T
## 5 18 33 15 T
## 6 20 36 16 T
The hypotheses for this test are as follows:
Null Hypothesis (\(H_0\)): There is no difference in the change in test scores between the treatment and control groups. That is, the subliminal message has no effect on test scores. \[ H_0: \mu_{\text{treatment}} = \mu_{\text{control}} \]
Alternative Hypothesis (\(H_1\)): The treatment group has a higher change in test scores than the control group. This is a one-sided hypothesis, as we are specifically testing for an improvement in the treatment group. \[ H_1: \mu_{\text{treatment}} > \mu_{\text{control}} \]
First, we calculate the mean change in test scores for both the treatment and control groups, and summarize the changes
#summarising data
data <- data %>% mutate(change_in_score = post_test - pre_test)
# Summarize the mean change in score by group
group_summary <- data %>%
group_by(group) %>%
summarise(mean_change = mean(change_in_score, na.rm = TRUE))
# Print the summary
print(group_summary)
## # A tibble: 2 × 2
## group mean_change
## <chr> <dbl>
## 1 C 7.43
## 2 T 11.4
##3. Now we run a t test to. analyze the data
t_test_result <- t.test(
change_in_score ~ group,
data = data,
alternative = "greater",
var.equal = TRUE
)
print(t_test_result)
##
## Two Sample t-test
##
## data: change_in_score by group
## t = -2.5646, df = 15, p-value = 0.9892
## alternative hypothesis: true difference in means between group C and group T is greater than 0
## 95 percent confidence interval:
## -6.686134 Inf
## sample estimates:
## mean in group C mean in group T
## 7.428571 11.400000
Mean:” The mean of treatment group was 11.4 and the mean of our control group was 7.43. The mean is the average change in the test scores”
Test-Statistic: “The T-value is -2.5646, which indicates the difference between the groups”
P-value: “The P-value is 0.9892
Since the p-value (0.9892) is much greater than 0.01, we do not reject the null hypothesis at the 99%(C.I) confidence level. This means we don’t have enough evidence to support the claim that the positive subliminal message improved test scores in the treatment group compared to the control group.
One Final Note on the Hypothesis Test
Since we used a one-sided test with alternative = “greater”, this assumes that the treatment group was hypothesized to have a higher mean than the control group. However, the results didn’t support this hypothesis in this case.
This report aims to investigate the determinants of mortality among U.S. adults by analyzing the mortality dataset provided.
We are particularly interested in understanding how factors such as age, gender, income, smoking status, and alcohol use relate to mortality rates.
Our findings aim to inform policy recommendations for the Department of Health, helping to address key health priorities
file.info("/Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds")
## size
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds 85468
## isdir
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds FALSE
## mode
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds 644
## mtime
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds 2024-11-06 12:20:58
## ctime
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds 2024-11-21 16:23:03
## atime
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds 2024-11-14 14:35:09
## uid
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds 501
## gid
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds 20
## uname
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds ciaralouisenoctor
## grname
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds staff
library(ggplot2)
library(broom)
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
mortality_data <- readRDS("/Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds")
head(mortality_data)
## # A tibble: 6 × 67
## seqn qsmk death yrdth modth dadth sbp dbp sex age race income
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct> <dbl>
## 1 233 0 0 NA NA NA 175 96 0 42 1 19
## 2 235 0 0 NA NA NA 123 80 0 36 0 18
## 3 244 0 0 NA NA NA 115 75 1 56 1 15
## 4 245 0 1 85 2 14 148 78 0 68 1 15
## 5 252 0 0 NA NA NA 118 77 0 40 0 18
## 6 257 0 0 NA NA NA 141 83 1 43 1 11
## # ℹ 55 more variables: marital <dbl>, school <dbl>, education <fct>, ht <dbl>,
## # wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>, birthplace <dbl>,
## # smokeintensity <dbl>, smkintensity82_71 <dbl>, smokeyrs <dbl>,
## # asthma <dbl>, bronch <dbl>, tb <dbl>, hf <dbl>, hbp <dbl>,
## # pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
## # hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>,
## # nervousbreak <dbl>, alcoholpy <dbl>, alcoholfreq <dbl>, …
summary(mortality_data)
## seqn qsmk death yrdth
## Min. : 233 Min. :0.0000 Min. :0.0000 Min. :83.00
## 1st Qu.:10625 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:85.00
## Median :20702 Median :0.0000 Median :0.0000 Median :88.00
## Mean :16639 Mean :0.2573 Mean :0.1858 Mean :87.64
## 3rd Qu.:22771 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:90.00
## Max. :25061 Max. :1.0000 Max. :1.0000 Max. :92.00
## NA's :1275
## modth dadth sbp dbp sex
## Min. : 1.000 Min. : 1.0 Min. : 87.0 Min. : 47.00 0:762
## 1st Qu.: 3.000 1st Qu.: 8.0 1st Qu.:115.0 1st Qu.: 70.00 1:804
## Median : 6.000 Median :16.0 Median :126.0 Median : 77.00
## Mean : 6.383 Mean :16.1 Mean :128.6 Mean : 77.74
## 3rd Qu.:10.000 3rd Qu.:24.5 3rd Qu.:139.0 3rd Qu.: 85.00
## Max. :12.000 Max. :31.0 Max. :229.0 Max. :130.00
## NA's :1271 NA's :1271 NA's :29 NA's :33
## age race income marital school
## Min. :25.00 0:1360 Min. :11.00 Min. :2.000 Min. : 0.00
## 1st Qu.:33.00 1: 206 1st Qu.:17.00 1st Qu.:2.000 1st Qu.:10.00
## Median :43.00 Median :19.00 Median :2.000 Median :12.00
## Mean :43.66 Mean :17.99 Mean :2.496 Mean :11.17
## 3rd Qu.:53.00 3rd Qu.:20.00 3rd Qu.:2.000 3rd Qu.:12.00
## Max. :74.00 Max. :22.00 Max. :8.000 Max. :17.00
## NA's :59
## education ht wt71 wt82 wt82_71
## 1:291 Min. :142.9 Min. : 39.58 Min. : 35.38 Min. :-41.280
## 2:340 1st Qu.:161.8 1st Qu.: 59.53 1st Qu.: 61.69 1st Qu.: -1.478
## 3:637 Median :168.2 Median : 69.23 Median : 72.12 Median : 2.604
## 4:121 Mean :168.7 Mean : 70.83 Mean : 73.47 Mean : 2.638
## 5:177 3rd Qu.:175.3 3rd Qu.: 79.80 3rd Qu.: 83.46 3rd Qu.: 6.690
## Max. :198.1 Max. :151.73 Max. :136.53 Max. : 48.538
##
## birthplace smokeintensity smkintensity82_71 smokeyrs
## Min. : 1.00 Min. : 1.00 Min. :-80.000 Min. : 1.00
## 1st Qu.:22.00 1st Qu.:10.00 1st Qu.:-10.000 1st Qu.:15.00
## Median :34.00 Median :20.00 Median : -1.000 Median :24.00
## Mean :31.67 Mean :20.53 Mean : -4.633 Mean :24.59
## 3rd Qu.:42.00 3rd Qu.:30.00 3rd Qu.: 1.000 3rd Qu.:33.00
## Max. :56.00 Max. :80.00 Max. : 50.000 Max. :64.00
## NA's :90
## asthma bronch tb hf
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.000000
## Mean :0.04853 Mean :0.08365 Mean :0.01341 Mean :0.005109
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.000000
##
## hbp pepticulcer colitis hepatitis
## Min. :0.000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :1.000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :1.059 Mean :0.1015 Mean :0.03448 Mean :0.01788
## 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :2.000 Max. :1.0000 Max. :1.00000 Max. :1.00000
##
## chroniccough hayfever diabetes polio
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.05109 Mean :0.08621 Mean :0.9898 Mean :0.01405
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:2.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :2.0000 Max. :1.00000
##
## tumor nervousbreak alcoholpy alcoholfreq
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:1.000
## Median :0.00000 Median :0.00000 Median :1.0000 Median :2.000
## Mean :0.02363 Mean :0.02746 Mean :0.8787 Mean :1.913
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1.00000 Max. :1.00000 Max. :2.0000 Max. :5.000
##
## alcoholtype alcoholhowmuch pica headache
## Min. :1.000 Min. : 1.000 Min. :0.000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.: 2.000 1st Qu.:0.000 1st Qu.:0.0000
## Median :3.000 Median : 2.000 Median :0.000 Median :1.0000
## Mean :2.466 Mean : 3.293 Mean :0.986 Mean :0.6328
## 3rd Qu.:4.000 3rd Qu.: 4.000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :4.000 Max. :48.000 Max. :2.000 Max. :1.0000
## NA's :397
## otherpain weakheart allergies nerves
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.000
## Mean :0.2433 Mean :0.02235 Mean :0.06322 Mean :0.145
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.000
##
## lackpep hbpmed boweltrouble wtloss
## Min. :0.00000 Min. :0.000 Min. :0.000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.00000
## Median :0.00000 Median :1.000 Median :1.000 Median :0.00000
## Mean :0.05045 Mean :1.015 Mean :1.046 Mean :0.02618
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.00000
## Max. :1.00000 Max. :2.000 Max. :2.000 Max. :1.00000
##
## infection active exercise birthcontrol pregnancies
## Min. :0.000 0:702 0:300 Min. :0.000 Min. : 1.000
## 1st Qu.:0.000 1:715 1:661 1st Qu.:0.000 1st Qu.: 2.000
## Median :0.000 2:149 2:605 Median :1.000 Median : 3.000
## Mean :0.145 Mean :1.081 Mean : 3.694
## 3rd Qu.:0.000 3rd Qu.:2.000 3rd Qu.: 5.000
## Max. :1.000 Max. :2.000 Max. :15.000
## NA's :866
## cholesterol hightax82 price71 price82
## Min. : 78.0 Min. :0.0000 Min. :1.507 Min. :1.452
## 1st Qu.:189.0 1st Qu.:0.0000 1st Qu.:2.037 1st Qu.:1.740
## Median :216.0 Median :0.0000 Median :2.168 Median :1.815
## Mean :219.9 Mean :0.1653 Mean :2.138 Mean :1.806
## 3rd Qu.:245.0 3rd Qu.:0.0000 3rd Qu.:2.242 3rd Qu.:1.868
## Max. :416.0 Max. :1.0000 Max. :2.693 Max. :2.103
## NA's :16 NA's :90 NA's :90 NA's :90
## tax71 tax82 price71_82 tax71_82
## Min. :0.5249 Min. :0.2200 Min. :-0.2027 Min. :0.0360
## 1st Qu.:0.9449 1st Qu.:0.4399 1st Qu.: 0.2010 1st Qu.:0.4610
## Median :1.0498 Median :0.5060 Median : 0.3360 Median :0.5440
## Mean :1.0580 Mean :0.5058 Mean : 0.3324 Mean :0.5522
## 3rd Qu.:1.1548 3rd Qu.:0.5719 3rd Qu.: 0.4438 3rd Qu.:0.6220
## Max. :1.5225 Max. :0.7479 Max. : 0.6121 Max. :0.8844
## NA's :90 NA's :90 NA's :90 NA's :90
## id censored older
## Min. : 1.0 Min. :0 Min. :0.0000
## 1st Qu.: 414.2 1st Qu.:0 1st Qu.:0.0000
## Median : 824.5 Median :0 Median :0.0000
## Mean : 821.0 Mean :0 Mean :0.2989
## 3rd Qu.:1228.8 3rd Qu.:0 3rd Qu.:1.0000
## Max. :1629.0 Max. :0 Max. :1.0000
##
##2. summarising the mortality by smoking status
# Summarize mortality by smoking status
mortality_vs_smoking <- mortality_data %>%
mutate(smoking_status = ifelse(qsmk == 1, "Smoker", "Non-Smoker")) %>%
group_by(smoking_status) %>%
summarise(mortality_rate = mean(death, na.rm = TRUE),
n = n()) # Number of observations in each group
# Print the summary
print(mortality_vs_smoking)
## # A tibble: 2 × 3
## smoking_status mortality_rate n
## <chr> <dbl> <int>
## 1 Non-Smoker 0.172 1163
## 2 Smoker 0.226 403
# Visualize mortality rate by smoking status using a bar plot
ggplot(mortality_vs_smoking, aes(x = smoking_status, y = mortality_rate, fill = smoking_status)) +
geom_bar(stat = "identity", show.legend = FALSE) +
labs(x = "Smoking Status",
y = "Mortality Rate",
title = "Mortality Rate by Smoking Status") +
scale_fill_manual(values = c("Smoker" = "pink", "Non-Smoker" = "skyblue")) +
theme_minimal()
#running an OLS regression on smoking status on mortality
# OLS regression
ols_model <- lm(death ~ qsmk, data = mortality_data)
# Summary of the model
summary(ols_model)
##
## Call:
## lm(formula = death ~ qsmk, data = mortality_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2258 -0.1720 -0.1720 -0.1720 0.8280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.17197 0.01139 15.096 <2e-16 ***
## qsmk 0.05384 0.02246 2.397 0.0166 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3885 on 1564 degrees of freedom
## Multiple R-squared: 0.003661, Adjusted R-squared: 0.003024
## F-statistic: 5.748 on 1 and 1564 DF, p-value: 0.01663
#AIC
aic_value <- AIC(ols_model)
cat("AIC for the OLS regression model:", aic_value, "\n")
## AIC for the OLS regression model: 1486.925
# Logistic regression
logit_model <- glm(death ~ qsmk, data = mortality_data, family = binomial)
# Summary of the model
summary(logit_model)
##
## Call:
## glm(formula = death ~ qsmk, family = binomial, data = mortality_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.57174 0.07771 -20.226 <2e-16 ***
## qsmk 0.33959 0.14224 2.387 0.017 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.7 on 1565 degrees of freedom
## Residual deviance: 1498.2 on 1564 degrees of freedom
## AIC: 1502.2
##
## Number of Fisher Scoring iterations: 4
# AIC value
aic_value <- AIC(logit_model)
cat("AIC for the logistic regression model:", aic_value, "\n")
## AIC for the logistic regression model: 1502.154
#comparing the AIC to determine which model balances goodness of fit
# Comparing AIC between models
cat("AIC for simple model:", AIC(logit_model), "\n")
## AIC for simple model: 1502.154
cat("AIC for model with controls:", AIC(ols_model), "\n")
## AIC for model with controls: 1486.925
##3. summarising the mortality by Alcohol intake frequency
# Summarize mortality by alcohol frequency
alcohol_mortality_summary <- mortality_data %>%
group_by(alcoholfreq) %>%
summarise(mortality_rate = mean(death, na.rm = TRUE),
n = n()) # Number of observations in each group
# Print the summary
print(alcohol_mortality_summary)
## # A tibble: 6 × 3
## alcoholfreq mortality_rate n
## <dbl> <dbl> <int>
## 1 0 0.231 325
## 2 1 0.178 219
## 3 2 0.123 494
## 4 3 0.186 328
## 5 4 0.282 195
## 6 5 0 5
# Visualize mortality rate by alcohol frequency using a bar plot
ggplot(alcohol_mortality_summary, aes(x = factor(alcoholfreq), y = mortality_rate, fill = factor(alcoholfreq))) +
geom_bar(stat = "identity") +
labs(x = "Alcohol Frequency",
y = "Mortality Rate",
title = "Mortality Rate by Alcohol Frequency") +
theme_minimal() +
scale_fill_brewer(palette = "Set3")
# OLS regression
ols_model <- lm(death ~ alcoholfreq, data = mortality_data)
# Summary of the model
summary(ols_model)
##
## Call:
## lm(formula = death ~ alcoholfreq, data = mortality_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.1943 -0.1888 -0.1861 -0.1806 0.8194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.180594 0.017468 10.338 <2e-16 ***
## alcoholfreq 0.002734 0.007546 0.362 0.717
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3892 on 1564 degrees of freedom
## Multiple R-squared: 8.39e-05, Adjusted R-squared: -0.0005554
## F-statistic: 0.1312 on 1 and 1564 DF, p-value: 0.7172
#AIC Model
aic_value <- AIC(ols_model)
cat("AIC for the logistic regression model:", aic_value, "\n")
## AIC for the logistic regression model: 1492.538
# Logistic regression
logit_model <- glm(death ~ alcoholfreq, data = mortality_data, family = binomial)
# Summary of the model
summary(logit_model)
##
## Call:
## glm(formula = death ~ alcoholfreq, family = binomial, data = mortality_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.51213 0.11622 -13.011 <2e-16 ***
## alcoholfreq 0.01807 0.04986 0.362 0.717
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.7 on 1565 degrees of freedom
## Residual deviance: 1503.6 on 1564 degrees of freedom
## AIC: 1507.6
##
## Number of Fisher Scoring iterations: 4
# AIC value
aic_value <- AIC(logit_model)
cat("AIC for the logistic regression model:", aic_value, "\n")
## AIC for the logistic regression model: 1507.574
new variable testing
#Create a new variable for change in weight
mortality <- mortality_data %>%
mutate(weight_change = wt82 - wt71) # Calculate weight change
# Fit the logistic regression model
logistic_model <- glm(
death ~ weight_change + age + sex + hbp + qsmk + diabetes,
data = mortality,
family = binomial(link = "logit") # Specify logistic regression
)
# Summarize the model
summary(logistic_model)
##
## Call:
## glm(formula = death ~ weight_change + age + sex + hbp + qsmk +
## diabetes, family = binomial(link = "logit"), data = mortality)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.33175 0.42331 -14.958 < 2e-16 ***
## weight_change -0.02403 0.01030 -2.334 0.01961 *
## age 0.11053 0.00797 13.868 < 2e-16 ***
## sex1 -0.63742 0.15390 -4.142 3.44e-05 ***
## hbp 0.38082 0.23706 1.606 0.10818
## qsmk 0.01492 0.17028 0.088 0.93018
## diabetes -0.58694 0.22635 -2.593 0.00951 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.7 on 1565 degrees of freedom
## Residual deviance: 1122.6 on 1559 degrees of freedom
## AIC: 1136.6
##
## Number of Fisher Scoring iterations: 5
Predicted Probabilities vs. Weight Change
We can visualize the effect of weight change on the probability of death, as weight change is likely to be one of the significant predictors. #visulaising weight change on mortality
# Add predicted probabilities to the dataset
mortality <- mortality %>%
mutate(predicted_prob = predict(logistic_model, type = "response"))
# Plot predicted probabilities vs. weight_change
ggplot(mortality, aes(x = weight_change, y = predicted_prob)) +
geom_point(alpha = 0.5, color = "blue") +
geom_smooth(method = "loess", color = "red") +
labs(
title = "Predicted Probability of Death vs. Weight Change",
x = "Weight Change (kg)",
y = "Predicted Probability of Death"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
#Visualizing the Impact of Age on the Predicted Probability of Death
# Plot predicted probability of death against age
ggplot(mortality, aes(x = age, y = predicted_prob)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "loess", color = "skyblue", size = 1) +
labs(
title = "Predicted Probability of Death vs. Age",
x = "Age in 1971",
y = "Predicted Probability of Death"
) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
#2. Visualizing the Effect of Other Variables You can also explore how sex, diabetes, and other variables affect the probability of death. Since these are categorical variables, you can visualize their effects with bar plots or boxplots.
# Plot predicted probability of death by sex
ggplot(mortality, aes(x = factor(sex), y = predicted_prob)) +
geom_boxplot() +
labs(
title = "Predicted Probability of Death by Sex",
x = "Sex (0 = Female, 1 = Male)",
y = "Predicted Probability of Death"
) +
theme_minimal()
# Plot predicted probability of death by diabetes status
ggplot(mortality, aes(x = factor(diabetes), y = predicted_prob)) +
geom_boxplot() +
labs(
title = "Predicted Probability of Death by Diabetes Status",
x = "Diabetes (0 = No, 1 = Yes)",
y = "Predicted Probability of Death"
) +
theme_minimal()
mortality_vs_heartfailure <- mortality_data %>%
mutate(heartfailure = ifelse(hf == 1, "had hf", "no hf")) %>%
group_by(heartfailure) %>%
summarise(mortality_rate = mean(death, na.rm = TRUE),
n = n()) # Number of observations in each group
# Print the summary
print(mortality_vs_heartfailure)
## # A tibble: 2 × 3
## heartfailure mortality_rate n
## <chr> <dbl> <int>
## 1 had hf 0.5 8
## 2 no hf 0.184 1558
# OLS regression
ols_model <- lm(death ~ hf, data = mortality_data)
# Summary of the model
summary(ols_model)
##
## Call:
## lm(formula = death ~ hf, data = mortality_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5000 -0.1842 -0.1842 -0.1842 0.8158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.184211 0.009844 18.713 <2e-16 ***
## hf 0.315789 0.137729 2.293 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3886 on 1564 degrees of freedom
## Multiple R-squared: 0.00335, Adjusted R-squared: 0.002713
## F-statistic: 5.257 on 1 and 1564 DF, p-value: 0.02199
#AIC
aic_value <- AIC(ols_model)
cat("AIC for the OLS regression model:", aic_value, "\n")
## AIC for the OLS regression model: 1487.415
# Logistic regression
logit_model <- glm(death ~ hf, data = mortality_data, family = binomial)
# Summary of the model
summary(logit_model)
##
## Call:
## glm(formula = death ~ hf, family = binomial, data = mortality_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.48808 0.06535 -22.770 <2e-16 ***
## hf 1.48808 0.71012 2.096 0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.7 on 1565 degrees of freedom
## Residual deviance: 1499.7 on 1564 degrees of freedom
## AIC: 1503.7
##
## Number of Fisher Scoring iterations: 4
# AIC value
aic_value <- AIC(logit_model)
cat("AIC for the logistic regression model:", aic_value, "\n")
## AIC for the logistic regression model: 1503.661
# Fit logistic regression model including interaction terms
regression_model <- glm(death ~ hf * qsmk * alcoholfreq,
data = mortality,
family = binomial)
# Summary of the model
summary(regression_model)
##
## Call:
## glm(formula = death ~ hf * qsmk * alcoholfreq, family = binomial,
## data = mortality)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.54455 0.13725 -11.254 <2e-16 ***
## hf -17.65150 454.69537 -0.039 0.969
## qsmk 0.17298 0.25910 0.668 0.504
## alcoholfreq -0.02078 0.06018 -0.345 0.730
## hf:qsmk -8.38589 151.58171 -0.055 0.956
## hf:alcoholfreq 6.78566 151.56610 0.045 0.964
## qsmk:alcoholfreq 0.08735 0.10974 0.796 0.426
## hf:qsmk:alcoholfreq NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.7 on 1565 degrees of freedom
## Residual deviance: 1489.6 on 1559 degrees of freedom
## AIC: 1503.6
##
## Number of Fisher Scoring iterations: 12
# Add predicted probabilities to the dataset
mortality <- mortality %>%
mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))
# View first few rows of the updated dataset
head(mortality)
## # A tibble: 6 × 69
## seqn qsmk death yrdth modth dadth sbp dbp sex age race income
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct> <dbl>
## 1 233 0 0 NA NA NA 175 96 0 42 1 19
## 2 235 0 0 NA NA NA 123 80 0 36 0 18
## 3 244 0 0 NA NA NA 115 75 1 56 1 15
## 4 245 0 1 85 2 14 148 78 0 68 1 15
## 5 252 0 0 NA NA NA 118 77 0 40 0 18
## 6 257 0 0 NA NA NA 141 83 1 43 1 11
## # ℹ 57 more variables: marital <dbl>, school <dbl>, education <fct>, ht <dbl>,
## # wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>, birthplace <dbl>,
## # smokeintensity <dbl>, smkintensity82_71 <dbl>, smokeyrs <dbl>,
## # asthma <dbl>, bronch <dbl>, tb <dbl>, hf <dbl>, hbp <dbl>,
## # pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
## # hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>,
## # nervousbreak <dbl>, alcoholpy <dbl>, alcoholfreq <dbl>, …
# Assuming 'mortality' has been updated with predicted probabilities
library(ggplot2)
# Ensure relevant variables are factors for proper grouping
mortality <- mortality %>%
mutate(
hf = factor(hf, levels = c(0, 1), labels = c("No Heart Failure", "Heart Failure")),
qsmk = factor(qsmk, levels = c(0, 1), labels = c("Non-smoker", "Smoker")),
alcoholfreq = factor(alcoholfreq, levels = c(0, 1), labels = c("No Alcohol", "Alcohol Consumer"))
)
# Create the plot
ggplot(mortality, aes(x = hf, y = predicted_prob, color = factor(qsmk), shape = factor(alcoholfreq))) +
geom_point(alpha = 0.7) + # Use alpha to make points semi-transparent
labs(
title = "Predicted Probability of Death by Heart Failure, Smoking, and Alcohol Consumption",
x = "Heart Failure (1 = Yes, 0 = No)",
y = "Predicted Probability of Death",
color = "Smoking Status",
shape = "Alcohol Consumption"
) +
theme_minimal() + # Optional: change to a minimal theme for clarity
theme(legend.position = "top") # Optional: position legend at the top
## Warning: Removed 1022 rows containing missing values or values outside the scale range
## (`geom_point()`).
new regression including income, death alcohol frequency smoke intenstity and education
regression_data <- mortality %>%
select(death, income, alcoholfreq, smokeintensity, education) %>% # Select relevant columns
filter(!is.na(death)) # Remove rows with missing values for 'death'
# Then run your regression model
model <- glm(death ~ income + alcoholfreq + smokeintensity + education,
data = regression_data,
family = binomial)
predicted probabilities
# Load necessary libraries
library(dplyr)
# Fit a logistic regression model (ensure your model is defined)
regression_model <- glm(death ~ income + alcoholfreq + smokeintensity + education,
data = mortality, family = binomial)
# Add predicted probabilities to the dataset
regression_data <- mortality %>%
mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))
# View the first few rows of the modified dataset
head(regression_data)
## # A tibble: 6 × 69
## seqn qsmk death yrdth modth dadth sbp dbp sex age race income
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct> <dbl>
## 1 233 Non-smoker 0 NA NA NA 175 96 0 42 1 19
## 2 235 Non-smoker 0 NA NA NA 123 80 0 36 0 18
## 3 244 Non-smoker 0 NA NA NA 115 75 1 56 1 15
## 4 245 Non-smoker 1 85 2 14 148 78 0 68 1 15
## 5 252 Non-smoker 0 NA NA NA 118 77 0 40 0 18
## 6 257 Non-smoker 0 NA NA NA 141 83 1 43 1 11
## # ℹ 57 more variables: marital <dbl>, school <dbl>, education <fct>, ht <dbl>,
## # wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>, birthplace <dbl>,
## # smokeintensity <dbl>, smkintensity82_71 <dbl>, smokeyrs <dbl>,
## # asthma <dbl>, bronch <dbl>, tb <dbl>, hf <fct>, hbp <dbl>,
## # pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
## # hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>,
## # nervousbreak <dbl>, alcoholpy <dbl>, alcoholfreq <fct>, …
#summary
summary(regression_model)
##
## Call:
## glm(formula = death ~ income + alcoholfreq + smokeintensity +
## education, family = binomial, data = mortality)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.731899 0.790522 2.191 0.028464 *
## income -0.115288 0.045092 -2.557 0.010567 *
## alcoholfreqAlcohol Consumer -0.392983 0.240248 -1.636 0.101894
## smokeintensity 0.003301 0.009084 0.363 0.716293
## education2 -0.953184 0.349417 -2.728 0.006373 **
## education3 -1.120995 0.324028 -3.460 0.000541 ***
## education4 -1.500196 0.515461 -2.910 0.003610 **
## education5 -1.098010 0.403960 -2.718 0.006566 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 527.71 on 517 degrees of freedom
## Residual deviance: 490.59 on 510 degrees of freedom
## (1048 observations deleted due to missingness)
## AIC: 506.59
##
## Number of Fisher Scoring iterations: 4
##graphing Income
graph_income <- ggplot(regression_data, aes(x = income, y = predicted_prob)) +
geom_point(alpha = 0.3, color = "blue") +
geom_smooth(method = "loess", color = "red", se = FALSE) +
labs(title = "Predicted Probability of Death vs. Income",
x = "Income",
y = "Predicted Probability of Death") +
theme_minimal()
print(graph_income)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1048 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1048 rows containing missing values or values outside the scale range
## (`geom_point()`).
##graphing education
graph_education <- ggplot(regression_data, aes(x = factor(education), y = predicted_prob)) +
geom_boxplot(fill = "orange", color = "black", alpha = 0.7) +
labs(title = "Predicted Probability of Death vs. Education Levels",
x = "Education Level",
y = "Predicted Probability of Death") +
theme_minimal()
print(graph_education)
## Warning: Removed 1048 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Load necessary libraries
library(dplyr)
# Fit logistic regression model
regression_model <- glm(death ~ asthma + bronch + tb + hf + hbp + pepticulcer +
colitis + hepatitis + chroniccough + diabetes + polio,
family = binomial, data = mortality)
# Summarize the model results
summary(regression_model)
##
## Call:
## glm(formula = death ~ asthma + bronch + tb + hf + hbp + pepticulcer +
## colitis + hepatitis + chroniccough + diabetes + polio, family = binomial,
## data = mortality)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.47603 0.10234 -14.423 < 2e-16 ***
## asthma 0.24692 0.29647 0.833 0.404927
## bronch 0.05816 0.24426 0.238 0.811801
## tb 0.38623 0.53032 0.728 0.466436
## hfHeart Failure 1.61919 0.73401 2.206 0.027386 *
## hbp 0.85394 0.21591 3.955 7.65e-05 ***
## pepticulcer 0.69891 0.19428 3.598 0.000321 ***
## colitis -0.12711 0.37599 -0.338 0.735314
## hepatitis 0.16835 0.50743 0.332 0.740060
## chroniccough 0.94140 0.27249 3.455 0.000551 ***
## diabetes -1.13700 0.20648 -5.507 3.66e-08 ***
## polio -0.77026 0.75272 -1.023 0.306164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.7 on 1565 degrees of freedom
## Residual deviance: 1431.1 on 1554 degrees of freedom
## AIC: 1455.1
##
## Number of Fisher Scoring iterations: 4
# Add predicted probabilities to the dataset
mortality <- mortality %>%
mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))
# Plot predicted probabilities based on asthma, heart failure, and diabetes (for example)
library(ggplot2)
ggplot(mortality, aes(x = asthma, y = predicted_prob, color = factor(hf), shape = factor(diabetes))) +
geom_point(alpha = 0.7) +
labs(title = "Predicted Probability of Death by Asthma, Heart Failure, and Diabetes",
x = "Asthma (1 = Yes, 0 = No)",
y = "Predicted Probability of Death",
color = "Heart Failure (1 = Yes, 0 = No)",
shape = "Diabetes (1 = Yes, 0 = No)") +
theme_minimal()
# Fit a logistic regression model (ensure your model is defined)
regression_model <- glm(death ~ active + exercise,
data = mortality, family = binomial)
# Add predicted probabilities to the dataset
regression_data <- mortality %>%
mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))
# Summarize the model results
summary(regression_model)
##
## Call:
## glm(formula = death ~ active + exercise, family = binomial, data = mortality)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.824815 0.167567 -10.890 <2e-16 ***
## active1 0.298240 0.144724 2.061 0.0393 *
## active2 0.401316 0.224350 1.789 0.0736 .
## exercise1 -0.007505 0.199869 -0.038 0.9700
## exercise2 0.397755 0.195535 2.034 0.0419 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.7 on 1565 degrees of freedom
## Residual deviance: 1486.4 on 1561 degrees of freedom
## AIC: 1496.4
##
## Number of Fisher Scoring iterations: 4
library(ggplot2)
# Bar plot with predicted probabilities
ggplot(regression_data, aes(x = factor(active), y = predicted_prob, fill = factor(exercise))) +
geom_col(position = "dodge") +
labs(
title = "Predicted Probabilities of Death by Activity Level",
x = "Active Level",
y = "Predicted Probability of Death",
fill = "Exercise Level"
) +
scale_fill_brewer(palette = "Set3") +
theme_minimal()
activity on mortality
# Fit a logistic regression model (ensure your model is defined)
regression_model <- glm(death ~ active,
data = mortality, family = binomial)
# Add predicted probabilities to the dataset
regression_data <- mortality %>%
mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))
# Summarize the model results
summary(regression_model)
##
## Call:
## glm(formula = death ~ active, family = binomial, data = mortality)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6830 0.1038 -16.210 <2e-16 ***
## active1 0.3141 0.1394 2.253 0.0242 *
## active2 0.5392 0.2177 2.476 0.0133 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.7 on 1565 degrees of freedom
## Residual deviance: 1495.4 on 1563 degrees of freedom
## AIC: 1501.4
##
## Number of Fisher Scoring iterations: 4
# Bar plot with predicted probabilities
ggplot(regression_data, aes(x = factor(active), y = predicted_prob, )) +
geom_col(position = "dodge") +
labs(
title = "Predicted Probabilities of Death by Activity Level",
x = "Active Level",
y = "Predicted Probability of Death",
fill = "Exercise Level"
) +
scale_fill_brewer(palette = "Set3") +
theme_minimal()
death by amount of exercise
# Fit a logistic regression model (ensure your model is defined)
regression_model <- glm(death ~ exercise ,
data = mortality, family = binomial)
# Add predicted probabilities to the dataset
regression_data <- mortality %>%
mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))
# Summarize the model results
summary(regression_model)
##
## Call:
## glm(formula = death ~ exercise, family = binomial, data = mortality)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7346 0.1617 -10.728 < 2e-16 ***
## exercise1 0.1014 0.1929 0.526 0.59922
## exercise2 0.5155 0.1885 2.735 0.00624 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.7 on 1565 degrees of freedom
## Residual deviance: 1492.0 on 1563 degrees of freedom
## AIC: 1498
##
## Number of Fisher Scoring iterations: 4
# Bar plot with predicted probabilities
ggplot(regression_data, aes(x = factor(exercise), y = predicted_prob, )) +
geom_col(position = "dodge") +
labs(
title = "Predicted Probabilities of Death by Activity Level",
x = "Active Level",
y = "Predicted Probability of Death",
fill = "Exercise Level"
) +
scale_fill_brewer(palette = "Set3") +
theme_minimal()
this is using the data from 1971 #getting actual bmi
regression_data$ht<-regression_data$ht/100
regression_data$bmi<-regression_data$wt71/(regression_data$ht)^2
#adding in our categorys of bmi
regression_data$weight_category <- ifelse(
regression_data$bmi < 18.5, "Underweight",
ifelse(regression_data$bmi < 25, "Ideal weight", "Overweight")
)
# adding in thresholds based on age and gender
regression_data$weight_category <- ifelse(
regression_data$age < 18 & regression_data$bmi < 19, "Underweight (Child)",
ifelse(regression_data$age < 18 & regression_data$bmi < 24, "Ideal weight (Child)",
ifelse(regression_data$sex == "1" & regression_data$bmi >= 24 & regression_data$bmi < 30, "Overweight",
ifelse(regression_data$bmi >= 30, "Obese", regression_data$weight_category)))
)
#we can also check our data at anytime using
summary(regression_data$weight_category)
## Length Class Mode
## 1566 character character
head(regression_data)
## # A tibble: 6 × 71
## seqn qsmk death yrdth modth dadth sbp dbp sex age race income
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct> <dbl>
## 1 233 Non-smoker 0 NA NA NA 175 96 0 42 1 19
## 2 235 Non-smoker 0 NA NA NA 123 80 0 36 0 18
## 3 244 Non-smoker 0 NA NA NA 115 75 1 56 1 15
## 4 245 Non-smoker 1 85 2 14 148 78 0 68 1 15
## 5 252 Non-smoker 0 NA NA NA 118 77 0 40 0 18
## 6 257 Non-smoker 0 NA NA NA 141 83 1 43 1 11
## # ℹ 59 more variables: marital <dbl>, school <dbl>, education <fct>, ht <dbl>,
## # wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>, birthplace <dbl>,
## # smokeintensity <dbl>, smkintensity82_71 <dbl>, smokeyrs <dbl>,
## # asthma <dbl>, bronch <dbl>, tb <dbl>, hf <fct>, hbp <dbl>,
## # pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
## # hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>,
## # nervousbreak <dbl>, alcoholpy <dbl>, alcoholfreq <fct>, …
#Create a new dataframe with relevant columns
new_data <- regression_data[, c("wt71", "ht", "age","death", "sex", "bmi", "weight_category")]
# Frequency count of weight categories
table(new_data$weight_category)
##
## Ideal weight Obese Overweight Underweight
## 755 184 557 70
# Proportion of each weight category giving us more insight then our frequency
prop.table(table(new_data$weight_category))
##
## Ideal weight Obese Overweight Underweight
## 0.48212005 0.11749681 0.35568327 0.04469987
# Proportion of each weight category
prop.table(table(new_data$weight_category))
##
## Ideal weight Obese Overweight Underweight
## 0.48212005 0.11749681 0.35568327 0.04469987
continue <- TRUE # Ensure this is logically valid
# Check the type of the variable `continue`
print(typeof(continue))
## [1] "logical"
if (continue) {
# Proceed with further processing
print("Continuing process...")
} else {
# Handle case when `continue` is FALSE
print("Process halted.")
}
## [1] "Continuing process..."
continue <- TRUE # or FALSE based on your logic
if (continue) {
# Processing code
} else {
# Alternative processing
}
## NULL
continue <- TRUE
if (continue) {
print("Continuing...")
} else {
print("Process halted.")
}
## [1] "Continuing..."
#table of deaths by weight category
table_death <- table(new_data$weight_category, new_data$death)
table_death
##
## 0 1
## Ideal weight 634 121
## Obese 142 42
## Overweight 443 114
## Underweight 56 14
# Convert it to a proportion table
proportion_table <- prop.table(table_death, margin = 1) # Proportions within each row (weight category)
print(proportion_table)
##
## 0 1
## Ideal weight 0.8397351 0.1602649
## Obese 0.7717391 0.2282609
## Overweight 0.7953321 0.2046679
## Underweight 0.8000000 0.2000000
#now only looking at the deaths of the weight categorys
dead_data<- new_data%>%
filter(death=="1")
# Count number of deaths per weight category
death_counts <- dead_data %>%
group_by(weight_category) %>%
summarize(count = n()) %>%
arrange(desc(count))
print(death_counts)
## # A tibble: 4 × 2
## weight_category count
## <chr> <int>
## 1 Ideal weight 121
## 2 Overweight 114
## 3 Obese 42
## 4 Underweight 14
#proportions of death per weight category
#proportion of deaths per weight category
death_proportions <- new_data %>%
group_by(weight_category) %>%
summarize(
death_counts = sum(death == "1"), # Count deaths
total_count = n() # Total count in each group
) %>%
mutate(
proportion = death_counts / total_count # Calculate proportions
)
print(death_proportions)
## # A tibble: 4 × 4
## weight_category death_counts total_count proportion
## <chr> <int> <int> <dbl>
## 1 Ideal weight 121 755 0.160
## 2 Obese 42 184 0.228
## 3 Overweight 114 557 0.205
## 4 Underweight 14 70 0.2
#visualisation
#Bar plot for proportions
ggplot(death_proportions, aes(x = reorder(weight_category, -proportion), y = proportion, fill = weight_category)) +
geom_bar(stat = "identity") +
labs(
title = "Proportion of Deaths by Weight Category",
x = "Weight Category",
y = "Proportion of Deaths"
) +
scale_fill_manual(values = c("pink", "blue", "purple", "red")) +
theme_minimal()
#statistical significance
# Ensure correct data types
new_data$death <- as.factor(new_data$death)
new_data$weight_category <- as.factor(new_data$weight_category)
# Contingency table
contingency_table <- table(new_data$weight_category, new_data$death)
print(contingency_table)
##
## 0 1
## Ideal weight 634 121
## Obese 142 42
## Overweight 443 114
## Underweight 56 14
# Chi-square test
chi_test <- chisq.test(contingency_table)
print(chi_test)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 6.8505, df = 3, p-value = 0.07682
# Summarize deaths and totals by weight category
death_summary <- new_data %>%
group_by(weight_category) %>%
summarize(death_count = sum(death == "1"), total_count = n(), .groups = "drop")
print(death_summary)
## # A tibble: 4 × 3
## weight_category death_count total_count
## <fct> <int> <int>
## 1 Ideal weight 121 755
## 2 Obese 42 184
## 3 Overweight 114 557
## 4 Underweight 14 70
#proportions test
prop_test <- prop.test(death_summary$death_count, death_summary$total_count)
print(prop_test)
##
## 4-sample test for equality of proportions without continuity correction
##
## data: death_summary$death_count out of death_summary$total_count
## X-squared = 6.8505, df = 3, p-value = 0.07682
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4
## 0.1602649 0.2282609 0.2046679 0.2000000
# Pairwise comparisons of proportions
pairwise_prop <- pairwise.prop.test(death_summary$death_count, death_summary$total_count)
print(pairwise_prop)
##
## Pairwise comparisons using Pairwise comparison of proportions
##
## data: death_summary$death_count out of death_summary$total_count
##
## 1 2 3
## 2 0.23 - -
## 3 0.23 1.00 -
## 4 1.00 1.00 1.00
##
## P value adjustment method: holm
# Calculate odds ratios and 95% confidence intervals
exp(cbind(Odds_Ratio = coef(logistic_model), confint(logistic_model)))
## Waiting for profiling to be done...
## Odds_Ratio 2.5 % 97.5 %
## (Intercept) 0.001778922 0.0007570502 0.003985288
## weight_change 0.976255711 0.9565913643 0.996033196
## age 1.116866204 1.0999815425 1.134925533
## sex1 0.528652542 0.3900780193 0.713498137
## hbp 1.463478198 0.9180982033 2.328244475
## qsmk 1.015031983 0.7247401826 1.413651329
## diabetes 0.556025921 0.3571477052 0.868582671
# Logistic regression
logistic_model <- glm(death ~ weight_category, data = new_data, family = binomial)
summary(logistic_model)
##
## Call:
## glm(formula = death ~ weight_category, family = binomial, data = new_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.65626 0.09921 -16.695 <2e-16 ***
## weight_categoryObese 0.43810 0.20173 2.172 0.0299 *
## weight_categoryOverweight 0.29889 0.14447 2.069 0.0386 *
## weight_categoryUnderweight 0.26996 0.31485 0.857 0.3912
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.7 on 1565 degrees of freedom
## Residual deviance: 1496.9 on 1562 degrees of freedom
## AIC: 1504.9
##
## Number of Fisher Scoring iterations: 4
# look at what aspects we have calculated in that had the most significance
logistic_model_interact <- glm(death ~ weight_category * age, data = new_data, family = binomial)
summary(logistic_model_interact)
##
## Call:
## glm(formula = death ~ weight_category * age, family = binomial,
## data = new_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.455003 0.604797 -12.326 <2e-16 ***
## weight_categoryObese 1.906025 1.099553 1.733 0.083 .
## weight_categoryOverweight 0.188991 0.920780 0.205 0.837
## weight_categoryUnderweight -1.887844 2.365007 -0.798 0.425
## age 0.121788 0.011520 10.572 <2e-16 ***
## weight_categoryObese:age -0.030745 0.021253 -1.447 0.148
## weight_categoryOverweight:age -0.002313 0.017411 -0.133 0.894
## weight_categoryUnderweight:age 0.045586 0.045695 0.998 0.318
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.7 on 1565 degrees of freedom
## Residual deviance: 1151.4 on 1558 degrees of freedom
## AIC: 1167.4
##
## Number of Fisher Scoring iterations: 6
#Extract coefficients and calculate odds ratios (OR)
coef_summary <- tidy(logistic_model, conf.int = TRUE, exponentiate = TRUE)
# Remove the intercept for plotting (optional)
coef_summary <- coef_summary[-1, ]
# Create the odds ratio plot
ggplot(coef_summary, aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high, color = p.value < 0.05)) +
geom_pointrange() +
scale_color_manual(values = c("red", "black"), name = "Significance") +
geom_hline(yintercept = 1, linetype = "dashed", color = "gray") +
labs(
title = "Odds Ratios from Logistic Regression",
x = "Weight Categories",
y = "Odds Ratio (95% CI)"
) +
theme_minimal() +
coord_flip()