“Statistical Analysis of Subliminal Message Effects on Test Scores”

This study investigates the effects of subliminal messages on test scores.

Students who failed a math exam were randomly assigned to receive either a treatment message (“Each day I am getting better at mathematics”) or a control message (a boring message unrelated to math).

All students took an initial test (pre-test), went to a math course during the summer, and then took a post-test. The researchers claim that the positive subliminal message improves test scores.

We will test this claim using the change in test scores (post-test score - pre-test score) for both the treatment and control groups. #loading up data sets

# Load required libraries and data set 
library(haven)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

setwd("/Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin")
data <- read_dta("subliminal.dta")
head(data)

## # A tibble: 6 × 4
##   pre_test post_test improvement group
##      <dbl>     <dbl>       <dbl> <chr>
## 1       18        24           6 T    
## 2       18        25           7 T    
## 3       21        33          12 T    
## 4       18        29          11 T    
## 5       18        33          15 T    
## 6       20        36          16 T

1. Formulation of Hypotheses

The hypotheses for this test are as follows:

Null Hypothesis (\(H_0\)): There is no difference in the change in test scores between the treatment and control groups. That is, the subliminal message has no effect on test scores. \[ H_0: \mu_{\text{treatment}} = \mu_{\text{control}} \]
Alternative Hypothesis (\(H_1\)): The treatment group has a higher change in test scores than the control group. This is a one-sided hypothesis, as we are specifically testing for an improvement in the treatment group. \[ H_1: \mu_{\text{treatment}} > \mu_{\text{control}} \]

2. Descriptive Statistics

First, we calculate the mean change in test scores for both the treatment and control groups, and summarize the changes

#summarising data

data <- data %>% mutate(change_in_score = post_test - pre_test)

# Summarize the mean change in score by group
group_summary <- data %>%
  group_by(group) %>%
  summarise(mean_change = mean(change_in_score, na.rm = TRUE))

# Print the summary
print(group_summary)

## # A tibble: 2 × 2
##   group mean_change
##   <chr>       <dbl>
## 1 C            7.43
## 2 T           11.4

##3. Now we run a t test to. analyze the data

t_test_result <- t.test(
  change_in_score ~ group,
  data = data,
  alternative = "greater",  
  var.equal = TRUE         
)
print(t_test_result)

## 
##  Two Sample t-test
## 
## data:  change_in_score by group
## t = -2.5646, df = 15, p-value = 0.9892
## alternative hypothesis: true difference in means between group C and group T is greater than 0
## 95 percent confidence interval:
##  -6.686134       Inf
## sample estimates:
## mean in group C mean in group T 
##        7.428571       11.400000

Our Final Interpretation of all of the Results

Mean:” The mean of treatment group was 11.4 and the mean of our control group was 7.43. The mean is the average change in the test scores”

Test-Statistic: “The T-value is -2.5646, which indicates the difference between the groups”

P-value: “The P-value is 0.9892

Since the p-value (0.9892) is much greater than 0.01, we do not reject the null hypothesis at the 99%(C.I) confidence level. This means we don’t have enough evidence to support the claim that the positive subliminal message improved test scores in the treatment group compared to the control group.

One Final Note on the Hypothesis Test

Since we used a one-sided test with alternative = “greater”, this assumes that the treatment group was hypothesized to have a higher mean than the control group. However, the results didn’t support this hypothesis in this case.

“Statistical Analysis of the determinants of Mortality among U.S adults”

This report aims to investigate the determinants of mortality among U.S. adults by analyzing the mortality dataset provided.

We are particularly interested in understanding how factors such as age, gender, income, smoking status, and alcohol use relate to mortality rates.

Our findings aim to inform policy recommendations for the Department of Health, helping to address key health priorities

file.info("/Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds")

##                                                                                               size
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds 85468
##                                                                                              isdir
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds FALSE
##                                                                                              mode
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds  644
##                                                                                                            mtime
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds 2024-11-06 12:20:58
##                                                                                                            ctime
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds 2024-11-21 16:23:03
##                                                                                                            atime
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds 2024-11-14 14:35:09
##                                                                                              uid
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds 501
##                                                                                              gid
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds  20
##                                                                                                          uname
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds ciaralouisenoctor
##                                                                                              grname
## /Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds  staff

1. load the nescessary packages and the data set

library(ggplot2) 
library(broom)
library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

mortality_data <- readRDS("/Users/ciaralouisenoctor/Library/CloudStorage/OneDrive-UniversityCollegeDublin/mortality.rds")

head(mortality_data)

## # A tibble: 6 × 67
##    seqn  qsmk death yrdth modth dadth   sbp   dbp sex     age race  income
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct>  <dbl>
## 1   233     0     0    NA    NA    NA   175    96 0        42 1         19
## 2   235     0     0    NA    NA    NA   123    80 0        36 0         18
## 3   244     0     0    NA    NA    NA   115    75 1        56 1         15
## 4   245     0     1    85     2    14   148    78 0        68 1         15
## 5   252     0     0    NA    NA    NA   118    77 0        40 0         18
## 6   257     0     0    NA    NA    NA   141    83 1        43 1         11
## # ℹ 55 more variables: marital <dbl>, school <dbl>, education <fct>, ht <dbl>,
## #   wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>, birthplace <dbl>,
## #   smokeintensity <dbl>, smkintensity82_71 <dbl>, smokeyrs <dbl>,
## #   asthma <dbl>, bronch <dbl>, tb <dbl>, hf <dbl>, hbp <dbl>,
## #   pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
## #   hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>,
## #   nervousbreak <dbl>, alcoholpy <dbl>, alcoholfreq <dbl>, …

summary(mortality_data)

##       seqn            qsmk            death            yrdth      
##  Min.   :  233   Min.   :0.0000   Min.   :0.0000   Min.   :83.00  
##  1st Qu.:10625   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:85.00  
##  Median :20702   Median :0.0000   Median :0.0000   Median :88.00  
##  Mean   :16639   Mean   :0.2573   Mean   :0.1858   Mean   :87.64  
##  3rd Qu.:22771   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:90.00  
##  Max.   :25061   Max.   :1.0000   Max.   :1.0000   Max.   :92.00  
##                                                    NA's   :1275   
##      modth            dadth           sbp             dbp         sex    
##  Min.   : 1.000   Min.   : 1.0   Min.   : 87.0   Min.   : 47.00   0:762  
##  1st Qu.: 3.000   1st Qu.: 8.0   1st Qu.:115.0   1st Qu.: 70.00   1:804  
##  Median : 6.000   Median :16.0   Median :126.0   Median : 77.00          
##  Mean   : 6.383   Mean   :16.1   Mean   :128.6   Mean   : 77.74          
##  3rd Qu.:10.000   3rd Qu.:24.5   3rd Qu.:139.0   3rd Qu.: 85.00          
##  Max.   :12.000   Max.   :31.0   Max.   :229.0   Max.   :130.00          
##  NA's   :1271     NA's   :1271   NA's   :29      NA's   :33              
##       age        race         income         marital          school     
##  Min.   :25.00   0:1360   Min.   :11.00   Min.   :2.000   Min.   : 0.00  
##  1st Qu.:33.00   1: 206   1st Qu.:17.00   1st Qu.:2.000   1st Qu.:10.00  
##  Median :43.00            Median :19.00   Median :2.000   Median :12.00  
##  Mean   :43.66            Mean   :17.99   Mean   :2.496   Mean   :11.17  
##  3rd Qu.:53.00            3rd Qu.:20.00   3rd Qu.:2.000   3rd Qu.:12.00  
##  Max.   :74.00            Max.   :22.00   Max.   :8.000   Max.   :17.00  
##                           NA's   :59                                     
##  education       ht             wt71             wt82           wt82_71       
##  1:291     Min.   :142.9   Min.   : 39.58   Min.   : 35.38   Min.   :-41.280  
##  2:340     1st Qu.:161.8   1st Qu.: 59.53   1st Qu.: 61.69   1st Qu.: -1.478  
##  3:637     Median :168.2   Median : 69.23   Median : 72.12   Median :  2.604  
##  4:121     Mean   :168.7   Mean   : 70.83   Mean   : 73.47   Mean   :  2.638  
##  5:177     3rd Qu.:175.3   3rd Qu.: 79.80   3rd Qu.: 83.46   3rd Qu.:  6.690  
##            Max.   :198.1   Max.   :151.73   Max.   :136.53   Max.   : 48.538  
##                                                                               
##    birthplace    smokeintensity  smkintensity82_71    smokeyrs    
##  Min.   : 1.00   Min.   : 1.00   Min.   :-80.000   Min.   : 1.00  
##  1st Qu.:22.00   1st Qu.:10.00   1st Qu.:-10.000   1st Qu.:15.00  
##  Median :34.00   Median :20.00   Median : -1.000   Median :24.00  
##  Mean   :31.67   Mean   :20.53   Mean   : -4.633   Mean   :24.59  
##  3rd Qu.:42.00   3rd Qu.:30.00   3rd Qu.:  1.000   3rd Qu.:33.00  
##  Max.   :56.00   Max.   :80.00   Max.   : 50.000   Max.   :64.00  
##  NA's   :90                                                       
##      asthma            bronch              tb                hf          
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.000000  
##  Mean   :0.04853   Mean   :0.08365   Mean   :0.01341   Mean   :0.005109  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.000000  
##                                                                          
##       hbp         pepticulcer        colitis          hepatitis      
##  Min.   :0.000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :1.000   Median :0.0000   Median :0.00000   Median :0.00000  
##  Mean   :1.059   Mean   :0.1015   Mean   :0.03448   Mean   :0.01788  
##  3rd Qu.:2.000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :2.000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
##                                                                      
##   chroniccough        hayfever          diabetes          polio        
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.05109   Mean   :0.08621   Mean   :0.9898   Mean   :0.01405  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:2.0000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :2.0000   Max.   :1.00000  
##                                                                        
##      tumor          nervousbreak       alcoholpy       alcoholfreq   
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:1.000  
##  Median :0.00000   Median :0.00000   Median :1.0000   Median :2.000  
##  Mean   :0.02363   Mean   :0.02746   Mean   :0.8787   Mean   :1.913  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :2.0000   Max.   :5.000  
##                                                                      
##   alcoholtype    alcoholhowmuch        pica          headache     
##  Min.   :1.000   Min.   : 1.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.: 2.000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :3.000   Median : 2.000   Median :0.000   Median :1.0000  
##  Mean   :2.466   Mean   : 3.293   Mean   :0.986   Mean   :0.6328  
##  3rd Qu.:4.000   3rd Qu.: 4.000   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :4.000   Max.   :48.000   Max.   :2.000   Max.   :1.0000  
##                  NA's   :397                                      
##    otherpain        weakheart         allergies           nerves     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000  
##  Median :0.0000   Median :0.00000   Median :0.00000   Median :0.000  
##  Mean   :0.2433   Mean   :0.02235   Mean   :0.06322   Mean   :0.145  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.000  
##                                                                      
##     lackpep            hbpmed       boweltrouble       wtloss       
##  Min.   :0.00000   Min.   :0.000   Min.   :0.000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.00000  
##  Median :0.00000   Median :1.000   Median :1.000   Median :0.00000  
##  Mean   :0.05045   Mean   :1.015   Mean   :1.046   Mean   :0.02618  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :2.000   Max.   :2.000   Max.   :1.00000  
##                                                                     
##    infection     active  exercise  birthcontrol    pregnancies    
##  Min.   :0.000   0:702   0:300    Min.   :0.000   Min.   : 1.000  
##  1st Qu.:0.000   1:715   1:661    1st Qu.:0.000   1st Qu.: 2.000  
##  Median :0.000   2:149   2:605    Median :1.000   Median : 3.000  
##  Mean   :0.145                    Mean   :1.081   Mean   : 3.694  
##  3rd Qu.:0.000                    3rd Qu.:2.000   3rd Qu.: 5.000  
##  Max.   :1.000                    Max.   :2.000   Max.   :15.000  
##                                                   NA's   :866     
##   cholesterol      hightax82         price71         price82     
##  Min.   : 78.0   Min.   :0.0000   Min.   :1.507   Min.   :1.452  
##  1st Qu.:189.0   1st Qu.:0.0000   1st Qu.:2.037   1st Qu.:1.740  
##  Median :216.0   Median :0.0000   Median :2.168   Median :1.815  
##  Mean   :219.9   Mean   :0.1653   Mean   :2.138   Mean   :1.806  
##  3rd Qu.:245.0   3rd Qu.:0.0000   3rd Qu.:2.242   3rd Qu.:1.868  
##  Max.   :416.0   Max.   :1.0000   Max.   :2.693   Max.   :2.103  
##  NA's   :16      NA's   :90       NA's   :90      NA's   :90     
##      tax71            tax82          price71_82         tax71_82     
##  Min.   :0.5249   Min.   :0.2200   Min.   :-0.2027   Min.   :0.0360  
##  1st Qu.:0.9449   1st Qu.:0.4399   1st Qu.: 0.2010   1st Qu.:0.4610  
##  Median :1.0498   Median :0.5060   Median : 0.3360   Median :0.5440  
##  Mean   :1.0580   Mean   :0.5058   Mean   : 0.3324   Mean   :0.5522  
##  3rd Qu.:1.1548   3rd Qu.:0.5719   3rd Qu.: 0.4438   3rd Qu.:0.6220  
##  Max.   :1.5225   Max.   :0.7479   Max.   : 0.6121   Max.   :0.8844  
##  NA's   :90       NA's   :90       NA's   :90        NA's   :90      
##        id            censored     older       
##  Min.   :   1.0   Min.   :0   Min.   :0.0000  
##  1st Qu.: 414.2   1st Qu.:0   1st Qu.:0.0000  
##  Median : 824.5   Median :0   Median :0.0000  
##  Mean   : 821.0   Mean   :0   Mean   :0.2989  
##  3rd Qu.:1228.8   3rd Qu.:0   3rd Qu.:1.0000  
##  Max.   :1629.0   Max.   :0   Max.   :1.0000  
##

##2. summarising the mortality by smoking status

# Summarize mortality by smoking status
mortality_vs_smoking <- mortality_data %>%
  mutate(smoking_status = ifelse(qsmk == 1, "Smoker", "Non-Smoker")) %>%
  group_by(smoking_status) %>%
  summarise(mortality_rate = mean(death, na.rm = TRUE),
            n = n())  # Number of observations in each group

# Print the summary
print(mortality_vs_smoking)

## # A tibble: 2 × 3
##   smoking_status mortality_rate     n
##   <chr>                   <dbl> <int>
## 1 Non-Smoker              0.172  1163
## 2 Smoker                  0.226   403

visualizing the mortality rate by smoking status

# Visualize mortality rate by smoking status using a bar plot
ggplot(mortality_vs_smoking, aes(x = smoking_status, y = mortality_rate, fill = smoking_status)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(x = "Smoking Status", 
       y = "Mortality Rate", 
       title = "Mortality Rate by Smoking Status") +
  scale_fill_manual(values = c("Smoker" = "pink", "Non-Smoker" = "skyblue")) +
  theme_minimal()

#running an OLS regression on smoking status on mortality

# OLS regression
ols_model <- lm(death ~ qsmk, data = mortality_data)

# Summary of the model
summary(ols_model)

## 
## Call:
## lm(formula = death ~ qsmk, data = mortality_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2258 -0.1720 -0.1720 -0.1720  0.8280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.17197    0.01139  15.096   <2e-16 ***
## qsmk         0.05384    0.02246   2.397   0.0166 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3885 on 1564 degrees of freedom
## Multiple R-squared:  0.003661,   Adjusted R-squared:  0.003024 
## F-statistic: 5.748 on 1 and 1564 DF,  p-value: 0.01663

#AIC
aic_value <- AIC(ols_model)
cat("AIC for the OLS regression model:", aic_value, "\n")

## AIC for the OLS regression model: 1486.925

running a logistic regression as well

# Logistic regression
logit_model <- glm(death ~ qsmk, data = mortality_data, family = binomial)

# Summary of the model
summary(logit_model)

## 
## Call:
## glm(formula = death ~ qsmk, family = binomial, data = mortality_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.57174    0.07771 -20.226   <2e-16 ***
## qsmk         0.33959    0.14224   2.387    0.017 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1498.2  on 1564  degrees of freedom
## AIC: 1502.2
## 
## Number of Fisher Scoring iterations: 4

# AIC value
aic_value <- AIC(logit_model)
cat("AIC for the logistic regression model:", aic_value, "\n")

## AIC for the logistic regression model: 1502.154

#comparing the AIC to determine which model balances goodness of fit

# Comparing AIC between models
cat("AIC for simple model:", AIC(logit_model), "\n")

## AIC for simple model: 1502.154

cat("AIC for model with controls:", AIC(ols_model), "\n")

## AIC for model with controls: 1486.925

##3. summarising the mortality by Alcohol intake frequency

# Summarize mortality by alcohol frequency
alcohol_mortality_summary <- mortality_data %>%
  group_by(alcoholfreq) %>%
  summarise(mortality_rate = mean(death, na.rm = TRUE),
            n = n())  # Number of observations in each group

# Print the summary
print(alcohol_mortality_summary)

## # A tibble: 6 × 3
##   alcoholfreq mortality_rate     n
##         <dbl>          <dbl> <int>
## 1           0          0.231   325
## 2           1          0.178   219
## 3           2          0.123   494
## 4           3          0.186   328
## 5           4          0.282   195
## 6           5          0         5

# Visualize mortality rate by alcohol frequency using a bar plot
ggplot(alcohol_mortality_summary, aes(x = factor(alcoholfreq), y = mortality_rate, fill = factor(alcoholfreq))) +
  geom_bar(stat = "identity") +
  labs(x = "Alcohol Frequency", 
       y = "Mortality Rate", 
       title = "Mortality Rate by Alcohol Frequency") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

# OLS regression
ols_model <- lm(death ~ alcoholfreq, data = mortality_data)

# Summary of the model
summary(ols_model)

## 
## Call:
## lm(formula = death ~ alcoholfreq, data = mortality_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1943 -0.1888 -0.1861 -0.1806  0.8194 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.180594   0.017468  10.338   <2e-16 ***
## alcoholfreq 0.002734   0.007546   0.362    0.717    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3892 on 1564 degrees of freedom
## Multiple R-squared:  8.39e-05,   Adjusted R-squared:  -0.0005554 
## F-statistic: 0.1312 on 1 and 1564 DF,  p-value: 0.7172

#AIC Model 
aic_value <- AIC(ols_model)
cat("AIC for the logistic regression model:", aic_value, "\n")

## AIC for the logistic regression model: 1492.538

# Logistic regression
logit_model <- glm(death ~ alcoholfreq, data = mortality_data, family = binomial)

# Summary of the model
summary(logit_model)

## 
## Call:
## glm(formula = death ~ alcoholfreq, family = binomial, data = mortality_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.51213    0.11622 -13.011   <2e-16 ***
## alcoholfreq  0.01807    0.04986   0.362    0.717    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1503.6  on 1564  degrees of freedom
## AIC: 1507.6
## 
## Number of Fisher Scoring iterations: 4

# AIC value
aic_value <- AIC(logit_model)
cat("AIC for the logistic regression model:", aic_value, "\n")

## AIC for the logistic regression model: 1507.574

new variable testing

#Create a new variable for change in weight
mortality <- mortality_data %>%
  mutate(weight_change = wt82 - wt71)  # Calculate weight change

# Fit the logistic regression model
logistic_model <- glm(
  death ~ weight_change + age + sex + hbp + qsmk + diabetes,
  data = mortality,
  family = binomial(link = "logit")  # Specify logistic regression
)

# Summarize the model
summary(logistic_model)

## 
## Call:
## glm(formula = death ~ weight_change + age + sex + hbp + qsmk + 
##     diabetes, family = binomial(link = "logit"), data = mortality)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -6.33175    0.42331 -14.958  < 2e-16 ***
## weight_change -0.02403    0.01030  -2.334  0.01961 *  
## age            0.11053    0.00797  13.868  < 2e-16 ***
## sex1          -0.63742    0.15390  -4.142 3.44e-05 ***
## hbp            0.38082    0.23706   1.606  0.10818    
## qsmk           0.01492    0.17028   0.088  0.93018    
## diabetes      -0.58694    0.22635  -2.593  0.00951 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1122.6  on 1559  degrees of freedom
## AIC: 1136.6
## 
## Number of Fisher Scoring iterations: 5

Predicted Probabilities vs. Weight Change

We can visualize the effect of weight change on the probability of death, as weight change is likely to be one of the significant predictors. #visulaising weight change on mortality

# Add predicted probabilities to the dataset
mortality <- mortality %>%
  mutate(predicted_prob = predict(logistic_model, type = "response"))

# Plot predicted probabilities vs. weight_change
ggplot(mortality, aes(x = weight_change, y = predicted_prob)) +
  geom_point(alpha = 0.5, color = "blue") +
  geom_smooth(method = "loess", color = "red") +
  labs(
    title = "Predicted Probability of Death vs. Weight Change",
    x = "Weight Change (kg)",
    y = "Predicted Probability of Death"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

#Visualizing the Impact of Age on the Predicted Probability of Death

# Plot predicted probability of death against age
ggplot(mortality, aes(x = age, y = predicted_prob)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess", color = "skyblue", size = 1) +
  labs(
    title = "Predicted Probability of Death vs. Age",
    x = "Age in 1971",
    y = "Predicted Probability of Death"
  ) +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `geom_smooth()` using formula = 'y ~ x'

#2. Visualizing the Effect of Other Variables You can also explore how sex, diabetes, and other variables affect the probability of death. Since these are categorical variables, you can visualize their effects with bar plots or boxplots.

Effect of Sex on Mortality

# Plot predicted probability of death by sex
ggplot(mortality, aes(x = factor(sex), y = predicted_prob)) +
  geom_boxplot() +
  labs(
    title = "Predicted Probability of Death by Sex",
    x = "Sex (0 = Female, 1 = Male)",
    y = "Predicted Probability of Death"
  ) +
  theme_minimal()

Effect of Diabetes on Mortality

# Plot predicted probability of death by diabetes status
ggplot(mortality, aes(x = factor(diabetes), y = predicted_prob)) +
  geom_boxplot() +
  labs(
    title = "Predicted Probability of Death by Diabetes Status",
    x = "Diabetes (0 = No, 1 = Yes)",
    y = "Predicted Probability of Death"
  ) +
  theme_minimal()

mortality_vs_heartfailure <- mortality_data %>%
  mutate(heartfailure = ifelse(hf == 1, "had hf", "no hf")) %>%
  group_by(heartfailure) %>%
  summarise(mortality_rate = mean(death, na.rm = TRUE),
            n = n())  # Number of observations in each group

# Print the summary
print(mortality_vs_heartfailure)

## # A tibble: 2 × 3
##   heartfailure mortality_rate     n
##   <chr>                 <dbl> <int>
## 1 had hf                0.5       8
## 2 no hf                 0.184  1558

# OLS regression
ols_model <- lm(death ~ hf, data = mortality_data)

# Summary of the model
summary(ols_model)

## 
## Call:
## lm(formula = death ~ hf, data = mortality_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5000 -0.1842 -0.1842 -0.1842  0.8158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.184211   0.009844  18.713   <2e-16 ***
## hf          0.315789   0.137729   2.293    0.022 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3886 on 1564 degrees of freedom
## Multiple R-squared:  0.00335,    Adjusted R-squared:  0.002713 
## F-statistic: 5.257 on 1 and 1564 DF,  p-value: 0.02199

#AIC
aic_value <- AIC(ols_model)
cat("AIC for the OLS regression model:", aic_value, "\n")

## AIC for the OLS regression model: 1487.415

# Logistic regression
logit_model <- glm(death ~ hf, data = mortality_data, family = binomial)

# Summary of the model
summary(logit_model)

## 
## Call:
## glm(formula = death ~ hf, family = binomial, data = mortality_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.48808    0.06535 -22.770   <2e-16 ***
## hf           1.48808    0.71012   2.096   0.0361 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1499.7  on 1564  degrees of freedom
## AIC: 1503.7
## 
## Number of Fisher Scoring iterations: 4

# AIC value
aic_value <- AIC(logit_model)
cat("AIC for the logistic regression model:", aic_value, "\n")

## AIC for the logistic regression model: 1503.661

Analyzing the relationship between weak heart smoking, alcohol frequencey on mortality

# Fit logistic regression model including interaction terms
regression_model <- glm(death ~ hf * qsmk * alcoholfreq, 
                        data = mortality, 
                        family = binomial)

# Summary of the model
summary(regression_model)

## 
## Call:
## glm(formula = death ~ hf * qsmk * alcoholfreq, family = binomial, 
##     data = mortality)
## 
## Coefficients: (1 not defined because of singularities)
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.54455    0.13725 -11.254   <2e-16 ***
## hf                  -17.65150  454.69537  -0.039    0.969    
## qsmk                  0.17298    0.25910   0.668    0.504    
## alcoholfreq          -0.02078    0.06018  -0.345    0.730    
## hf:qsmk              -8.38589  151.58171  -0.055    0.956    
## hf:alcoholfreq        6.78566  151.56610   0.045    0.964    
## qsmk:alcoholfreq      0.08735    0.10974   0.796    0.426    
## hf:qsmk:alcoholfreq        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1489.6  on 1559  degrees of freedom
## AIC: 1503.6
## 
## Number of Fisher Scoring iterations: 12

# Add predicted probabilities to the dataset
mortality <- mortality %>%
  mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))

# View first few rows of the updated dataset
head(mortality)

## # A tibble: 6 × 69
##    seqn  qsmk death yrdth modth dadth   sbp   dbp sex     age race  income
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct>  <dbl>
## 1   233     0     0    NA    NA    NA   175    96 0        42 1         19
## 2   235     0     0    NA    NA    NA   123    80 0        36 0         18
## 3   244     0     0    NA    NA    NA   115    75 1        56 1         15
## 4   245     0     1    85     2    14   148    78 0        68 1         15
## 5   252     0     0    NA    NA    NA   118    77 0        40 0         18
## 6   257     0     0    NA    NA    NA   141    83 1        43 1         11
## # ℹ 57 more variables: marital <dbl>, school <dbl>, education <fct>, ht <dbl>,
## #   wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>, birthplace <dbl>,
## #   smokeintensity <dbl>, smkintensity82_71 <dbl>, smokeyrs <dbl>,
## #   asthma <dbl>, bronch <dbl>, tb <dbl>, hf <dbl>, hbp <dbl>,
## #   pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
## #   hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>,
## #   nervousbreak <dbl>, alcoholpy <dbl>, alcoholfreq <dbl>, …

# Assuming 'mortality' has been updated with predicted probabilities
library(ggplot2)

# Ensure relevant variables are factors for proper grouping
mortality <- mortality %>%
  mutate(
    hf = factor(hf, levels = c(0, 1), labels = c("No Heart Failure", "Heart Failure")),
    qsmk = factor(qsmk, levels = c(0, 1), labels = c("Non-smoker", "Smoker")),
    alcoholfreq = factor(alcoholfreq, levels = c(0, 1), labels = c("No Alcohol", "Alcohol Consumer"))
  )

# Create the plot
ggplot(mortality, aes(x = hf, y = predicted_prob, color = factor(qsmk), shape = factor(alcoholfreq))) +
  geom_point(alpha = 0.7) +  # Use alpha to make points semi-transparent
  labs(
    title = "Predicted Probability of Death by Heart Failure, Smoking, and Alcohol Consumption",
    x = "Heart Failure (1 = Yes, 0 = No)",
    y = "Predicted Probability of Death",
    color = "Smoking Status",
    shape = "Alcohol Consumption"
  ) +
  theme_minimal() +  # Optional: change to a minimal theme for clarity
  theme(legend.position = "top")  # Optional: position legend at the top

## Warning: Removed 1022 rows containing missing values or values outside the scale range
## (`geom_point()`).

new regression including income, death alcohol frequency smoke intenstity and education

regression_data <- mortality %>%
  select(death, income, alcoholfreq, smokeintensity, education) %>%  # Select relevant columns
  filter(!is.na(death))  # Remove rows with missing values for 'death'

# Then run your regression model
model <- glm(death ~ income + alcoholfreq + smokeintensity + education, 
             data = regression_data, 
             family = binomial)

predicted probabilities

# Load necessary libraries
library(dplyr)

# Fit a logistic regression model (ensure your model is defined)
regression_model <- glm(death ~ income + alcoholfreq + smokeintensity + education,
                        data = mortality, family = binomial)

# Add predicted probabilities to the dataset
regression_data <- mortality %>%
  mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))

# View the first few rows of the modified dataset
head(regression_data)

## # A tibble: 6 × 69
##    seqn qsmk       death yrdth modth dadth   sbp   dbp sex     age race  income
##   <dbl> <fct>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct>  <dbl>
## 1   233 Non-smoker     0    NA    NA    NA   175    96 0        42 1         19
## 2   235 Non-smoker     0    NA    NA    NA   123    80 0        36 0         18
## 3   244 Non-smoker     0    NA    NA    NA   115    75 1        56 1         15
## 4   245 Non-smoker     1    85     2    14   148    78 0        68 1         15
## 5   252 Non-smoker     0    NA    NA    NA   118    77 0        40 0         18
## 6   257 Non-smoker     0    NA    NA    NA   141    83 1        43 1         11
## # ℹ 57 more variables: marital <dbl>, school <dbl>, education <fct>, ht <dbl>,
## #   wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>, birthplace <dbl>,
## #   smokeintensity <dbl>, smkintensity82_71 <dbl>, smokeyrs <dbl>,
## #   asthma <dbl>, bronch <dbl>, tb <dbl>, hf <fct>, hbp <dbl>,
## #   pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
## #   hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>,
## #   nervousbreak <dbl>, alcoholpy <dbl>, alcoholfreq <fct>, …

#summary
summary(regression_model)

## 
## Call:
## glm(formula = death ~ income + alcoholfreq + smokeintensity + 
##     education, family = binomial, data = mortality)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  1.731899   0.790522   2.191 0.028464 *  
## income                      -0.115288   0.045092  -2.557 0.010567 *  
## alcoholfreqAlcohol Consumer -0.392983   0.240248  -1.636 0.101894    
## smokeintensity               0.003301   0.009084   0.363 0.716293    
## education2                  -0.953184   0.349417  -2.728 0.006373 ** 
## education3                  -1.120995   0.324028  -3.460 0.000541 ***
## education4                  -1.500196   0.515461  -2.910 0.003610 ** 
## education5                  -1.098010   0.403960  -2.718 0.006566 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 527.71  on 517  degrees of freedom
## Residual deviance: 490.59  on 510  degrees of freedom
##   (1048 observations deleted due to missingness)
## AIC: 506.59
## 
## Number of Fisher Scoring iterations: 4

##graphing Income

graph_income <- ggplot(regression_data, aes(x = income, y = predicted_prob)) +
  geom_point(alpha = 0.3, color = "blue") +
  geom_smooth(method = "loess", color = "red", se = FALSE) +
  labs(title = "Predicted Probability of Death vs. Income",
       x = "Income",
       y = "Predicted Probability of Death") +
  theme_minimal()

print(graph_income)

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1048 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 1048 rows containing missing values or values outside the scale range
## (`geom_point()`).

##graphing education

graph_education <- ggplot(regression_data, aes(x = factor(education), y = predicted_prob)) +
  geom_boxplot(fill = "orange", color = "black", alpha = 0.7) +
  labs(title = "Predicted Probability of Death vs. Education Levels",
       x = "Education Level",
       y = "Predicted Probability of Death") +
  theme_minimal()

print(graph_education)

## Warning: Removed 1048 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

genuine health status on mortality

# Load necessary libraries
library(dplyr)

# Fit logistic regression model
regression_model <- glm(death ~ asthma + bronch + tb + hf + hbp + pepticulcer + 
                         colitis + hepatitis + chroniccough + diabetes + polio, 
                         family = binomial, data = mortality)

# Summarize the model results
summary(regression_model)

## 
## Call:
## glm(formula = death ~ asthma + bronch + tb + hf + hbp + pepticulcer + 
##     colitis + hepatitis + chroniccough + diabetes + polio, family = binomial, 
##     data = mortality)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -1.47603    0.10234 -14.423  < 2e-16 ***
## asthma           0.24692    0.29647   0.833 0.404927    
## bronch           0.05816    0.24426   0.238 0.811801    
## tb               0.38623    0.53032   0.728 0.466436    
## hfHeart Failure  1.61919    0.73401   2.206 0.027386 *  
## hbp              0.85394    0.21591   3.955 7.65e-05 ***
## pepticulcer      0.69891    0.19428   3.598 0.000321 ***
## colitis         -0.12711    0.37599  -0.338 0.735314    
## hepatitis        0.16835    0.50743   0.332 0.740060    
## chroniccough     0.94140    0.27249   3.455 0.000551 ***
## diabetes        -1.13700    0.20648  -5.507 3.66e-08 ***
## polio           -0.77026    0.75272  -1.023 0.306164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1431.1  on 1554  degrees of freedom
## AIC: 1455.1
## 
## Number of Fisher Scoring iterations: 4

# Add predicted probabilities to the dataset
mortality <- mortality %>%
  mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))

# Plot predicted probabilities based on asthma, heart failure, and diabetes (for example)
library(ggplot2)
ggplot(mortality, aes(x = asthma, y = predicted_prob, color = factor(hf), shape = factor(diabetes))) +
  geom_point(alpha = 0.7) +
  labs(title = "Predicted Probability of Death by Asthma, Heart Failure, and Diabetes",
       x = "Asthma (1 = Yes, 0 = No)",
       y = "Predicted Probability of Death",
       color = "Heart Failure (1 = Yes, 0 = No)",
       shape = "Diabetes (1 = Yes, 0 = No)") +
  theme_minimal()

# Fit a logistic regression model (ensure your model is defined)
 regression_model <- glm(death ~ active + exercise,
data = mortality, family = binomial)

 # Add predicted probabilities to the dataset
 regression_data <- mortality %>%
mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))
# Summarize the model results
summary(regression_model)

## 
## Call:
## glm(formula = death ~ active + exercise, family = binomial, data = mortality)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.824815   0.167567 -10.890   <2e-16 ***
## active1      0.298240   0.144724   2.061   0.0393 *  
## active2      0.401316   0.224350   1.789   0.0736 .  
## exercise1   -0.007505   0.199869  -0.038   0.9700    
## exercise2    0.397755   0.195535   2.034   0.0419 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1486.4  on 1561  degrees of freedom
## AIC: 1496.4
## 
## Number of Fisher Scoring iterations: 4

library(ggplot2)

# Bar plot with predicted probabilities
ggplot(regression_data, aes(x = factor(active), y = predicted_prob, fill = factor(exercise))) +
  geom_col(position = "dodge") +
  labs(
    title = "Predicted Probabilities of Death by Activity Level",
    x = "Active Level",
    y = "Predicted Probability of Death",
    fill = "Exercise Level"
  ) +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal()

activity on mortality

# Fit a logistic regression model (ensure your model is defined)
 regression_model <- glm(death ~ active,
data = mortality, family = binomial)

 # Add predicted probabilities to the dataset
 regression_data <- mortality %>%
mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))
# Summarize the model results
summary(regression_model)

## 
## Call:
## glm(formula = death ~ active, family = binomial, data = mortality)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.6830     0.1038 -16.210   <2e-16 ***
## active1       0.3141     0.1394   2.253   0.0242 *  
## active2       0.5392     0.2177   2.476   0.0133 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1495.4  on 1563  degrees of freedom
## AIC: 1501.4
## 
## Number of Fisher Scoring iterations: 4

# Bar plot with predicted probabilities
ggplot(regression_data, aes(x = factor(active), y = predicted_prob, )) +
  geom_col(position = "dodge") +
  labs(
    title = "Predicted Probabilities of Death by Activity Level",
    x = "Active Level",
    y = "Predicted Probability of Death",
    fill = "Exercise Level"
  ) +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal()

death by amount of exercise

# Fit a logistic regression model (ensure your model is defined)
 regression_model <- glm(death ~ exercise ,
data = mortality, family = binomial)

 # Add predicted probabilities to the dataset
 regression_data <- mortality %>%
mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))
# Summarize the model results
summary(regression_model)

## 
## Call:
## glm(formula = death ~ exercise, family = binomial, data = mortality)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.7346     0.1617 -10.728  < 2e-16 ***
## exercise1     0.1014     0.1929   0.526  0.59922    
## exercise2     0.5155     0.1885   2.735  0.00624 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1492.0  on 1563  degrees of freedom
## AIC: 1498
## 
## Number of Fisher Scoring iterations: 4

# Bar plot with predicted probabilities
ggplot(regression_data, aes(x = factor(exercise), y = predicted_prob, )) +
  geom_col(position = "dodge") +
  labs(
    title = "Predicted Probabilities of Death by Activity Level",
    x = "Active Level",
    y = "Predicted Probability of Death",
    fill = "Exercise Level"
  ) +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal()

this is using the data from 1971 #getting actual bmi

regression_data$ht<-regression_data$ht/100

regression_data$bmi<-regression_data$wt71/(regression_data$ht)^2

#adding in our categorys of bmi
regression_data$weight_category <- ifelse(
  regression_data$bmi < 18.5, "Underweight",
  ifelse(regression_data$bmi < 25, "Ideal weight", "Overweight")
)
# adding in thresholds based on age and gender
regression_data$weight_category <- ifelse(
  regression_data$age < 18 & regression_data$bmi < 19, "Underweight (Child)",
  ifelse(regression_data$age < 18 & regression_data$bmi < 24, "Ideal weight (Child)",
         ifelse(regression_data$sex == "1" & regression_data$bmi >= 24 & regression_data$bmi < 30, "Overweight",
                ifelse(regression_data$bmi >= 30, "Obese", regression_data$weight_category)))
)

#we can also check our data at anytime using

summary(regression_data$weight_category)

##    Length     Class      Mode 
##      1566 character character

head(regression_data)

## # A tibble: 6 × 71
##    seqn qsmk       death yrdth modth dadth   sbp   dbp sex     age race  income
##   <dbl> <fct>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct>  <dbl>
## 1   233 Non-smoker     0    NA    NA    NA   175    96 0        42 1         19
## 2   235 Non-smoker     0    NA    NA    NA   123    80 0        36 0         18
## 3   244 Non-smoker     0    NA    NA    NA   115    75 1        56 1         15
## 4   245 Non-smoker     1    85     2    14   148    78 0        68 1         15
## 5   252 Non-smoker     0    NA    NA    NA   118    77 0        40 0         18
## 6   257 Non-smoker     0    NA    NA    NA   141    83 1        43 1         11
## # ℹ 59 more variables: marital <dbl>, school <dbl>, education <fct>, ht <dbl>,
## #   wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>, birthplace <dbl>,
## #   smokeintensity <dbl>, smkintensity82_71 <dbl>, smokeyrs <dbl>,
## #   asthma <dbl>, bronch <dbl>, tb <dbl>, hf <fct>, hbp <dbl>,
## #   pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
## #   hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>,
## #   nervousbreak <dbl>, alcoholpy <dbl>, alcoholfreq <fct>, …

#Create a new dataframe with relevant columns

new_data <- regression_data[, c("wt71", "ht", "age","death", "sex", "bmi", "weight_category")]
# Frequency count of weight categories
table(new_data$weight_category)

## 
## Ideal weight        Obese   Overweight  Underweight 
##          755          184          557           70

# Proportion of each weight category giving us more insight then our frequency 
prop.table(table(new_data$weight_category))

## 
## Ideal weight        Obese   Overweight  Underweight 
##   0.48212005   0.11749681   0.35568327   0.04469987

# Proportion of each weight category
prop.table(table(new_data$weight_category))

## 
## Ideal weight        Obese   Overweight  Underweight 
##   0.48212005   0.11749681   0.35568327   0.04469987

now having the bmi categorized we can see if there is any relation with the death figures

continue <- TRUE  # Ensure this is logically valid

# Check the type of the variable `continue`
print(typeof(continue))

## [1] "logical"

if (continue) { 
  # Proceed with further processing
  print("Continuing process...")
} else {
  # Handle case when `continue` is FALSE
  print("Process halted.")
}

## [1] "Continuing process..."

continue <- TRUE  # or FALSE based on your logic
if (continue) {
  # Processing code
} else {
  # Alternative processing
}

## NULL

continue <- TRUE
if (continue) {
  print("Continuing...")
} else {
  print("Process halted.")
}

## [1] "Continuing..."

#table of deaths by weight category
table_death <- table(new_data$weight_category, new_data$death)
table_death

##               
##                  0   1
##   Ideal weight 634 121
##   Obese        142  42
##   Overweight   443 114
##   Underweight   56  14

# Convert it to a proportion table
proportion_table <- prop.table(table_death, margin = 1)  # Proportions within each row (weight category)
print(proportion_table)

##               
##                        0         1
##   Ideal weight 0.8397351 0.1602649
##   Obese        0.7717391 0.2282609
##   Overweight   0.7953321 0.2046679
##   Underweight  0.8000000 0.2000000

#now only looking at the deaths of the weight categorys
dead_data<- new_data%>% 
  filter(death=="1")
# Count number of deaths per weight category
death_counts <- dead_data %>%
  group_by(weight_category) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) 
print(death_counts)

## # A tibble: 4 × 2
##   weight_category count
##   <chr>           <int>
## 1 Ideal weight      121
## 2 Overweight        114
## 3 Obese              42
## 4 Underweight        14

#proportions of death per weight category

#proportion of deaths per weight category
death_proportions <- new_data %>%
  group_by(weight_category) %>%
  summarize(
    death_counts = sum(death == "1"),  # Count deaths
    total_count = n()                  # Total count in each group
  ) %>%
  mutate(
    proportion = death_counts / total_count  # Calculate proportions
  )
print(death_proportions)

## # A tibble: 4 × 4
##   weight_category death_counts total_count proportion
##   <chr>                  <int>       <int>      <dbl>
## 1 Ideal weight             121         755      0.160
## 2 Obese                     42         184      0.228
## 3 Overweight               114         557      0.205
## 4 Underweight               14          70      0.2

#visualisation

#Bar plot for proportions
ggplot(death_proportions, aes(x = reorder(weight_category, -proportion), y = proportion, fill = weight_category)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Proportion of Deaths by Weight Category",
    x = "Weight Category",
    y = "Proportion of Deaths"
  ) +
  scale_fill_manual(values = c("pink", "blue", "purple", "red")) +
  theme_minimal()

#statistical significance

# Ensure correct data types
new_data$death <- as.factor(new_data$death)
new_data$weight_category <- as.factor(new_data$weight_category)

# Contingency table
contingency_table <- table(new_data$weight_category, new_data$death)
print(contingency_table)

##               
##                  0   1
##   Ideal weight 634 121
##   Obese        142  42
##   Overweight   443 114
##   Underweight   56  14

# Chi-square test
chi_test <- chisq.test(contingency_table)
print(chi_test)

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 6.8505, df = 3, p-value = 0.07682

# Summarize deaths and totals by weight category
death_summary <- new_data %>%
  group_by(weight_category) %>%
  summarize(death_count = sum(death == "1"), total_count = n(), .groups = "drop")

print(death_summary)

## # A tibble: 4 × 3
##   weight_category death_count total_count
##   <fct>                 <int>       <int>
## 1 Ideal weight            121         755
## 2 Obese                    42         184
## 3 Overweight              114         557
## 4 Underweight              14          70

#proportions test

prop_test <- prop.test(death_summary$death_count, death_summary$total_count)
print(prop_test)

## 
##  4-sample test for equality of proportions without continuity correction
## 
## data:  death_summary$death_count out of death_summary$total_count
## X-squared = 6.8505, df = 3, p-value = 0.07682
## alternative hypothesis: two.sided
## sample estimates:
##    prop 1    prop 2    prop 3    prop 4 
## 0.1602649 0.2282609 0.2046679 0.2000000

# Pairwise comparisons of proportions
pairwise_prop <- pairwise.prop.test(death_summary$death_count, death_summary$total_count)
print(pairwise_prop)

## 
##  Pairwise comparisons using Pairwise comparison of proportions 
## 
## data:  death_summary$death_count out of death_summary$total_count 
## 
##   1    2    3   
## 2 0.23 -    -   
## 3 0.23 1.00 -   
## 4 1.00 1.00 1.00
## 
## P value adjustment method: holm

retesting for statistical significance

# Calculate odds ratios and 95% confidence intervals
exp(cbind(Odds_Ratio = coef(logistic_model), confint(logistic_model)))

## Waiting for profiling to be done...

##                Odds_Ratio        2.5 %      97.5 %
## (Intercept)   0.001778922 0.0007570502 0.003985288
## weight_change 0.976255711 0.9565913643 0.996033196
## age           1.116866204 1.0999815425 1.134925533
## sex1          0.528652542 0.3900780193 0.713498137
## hbp           1.463478198 0.9180982033 2.328244475
## qsmk          1.015031983 0.7247401826 1.413651329
## diabetes      0.556025921 0.3571477052 0.868582671

# Logistic regression
logistic_model <- glm(death ~ weight_category, data = new_data, family = binomial)
summary(logistic_model)

## 
## Call:
## glm(formula = death ~ weight_category, family = binomial, data = new_data)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.65626    0.09921 -16.695   <2e-16 ***
## weight_categoryObese        0.43810    0.20173   2.172   0.0299 *  
## weight_categoryOverweight   0.29889    0.14447   2.069   0.0386 *  
## weight_categoryUnderweight  0.26996    0.31485   0.857   0.3912    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1496.9  on 1562  degrees of freedom
## AIC: 1504.9
## 
## Number of Fisher Scoring iterations: 4

# look at what aspects we have calculated in that had the most significance
logistic_model_interact <- glm(death ~ weight_category * age, data = new_data, family = binomial)
summary(logistic_model_interact)

## 
## Call:
## glm(formula = death ~ weight_category * age, family = binomial, 
##     data = new_data)
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -7.455003   0.604797 -12.326   <2e-16 ***
## weight_categoryObese            1.906025   1.099553   1.733    0.083 .  
## weight_categoryOverweight       0.188991   0.920780   0.205    0.837    
## weight_categoryUnderweight     -1.887844   2.365007  -0.798    0.425    
## age                             0.121788   0.011520  10.572   <2e-16 ***
## weight_categoryObese:age       -0.030745   0.021253  -1.447    0.148    
## weight_categoryOverweight:age  -0.002313   0.017411  -0.133    0.894    
## weight_categoryUnderweight:age  0.045586   0.045695   0.998    0.318    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1151.4  on 1558  degrees of freedom
## AIC: 1167.4
## 
## Number of Fisher Scoring iterations: 6

#Extract coefficients and calculate odds ratios (OR)

coef_summary <- tidy(logistic_model, conf.int = TRUE, exponentiate = TRUE)

# Remove the intercept for plotting (optional)
coef_summary <- coef_summary[-1, ]

# Create the odds ratio plot

ggplot(coef_summary, aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high, color = p.value < 0.05)) +
  geom_pointrange() +
  scale_color_manual(values = c("red", "black"), name = "Significance") +
  geom_hline(yintercept = 1, linetype = "dashed", color = "gray") +
  labs(
    title = "Odds Ratios from Logistic Regression",
    x = "Weight Categories",
    y = "Odds Ratio (95% CI)"
  ) +
  theme_minimal() +
  coord_flip()

Econ30130 Assignment

Michael, Ciara, Sadhbh