“Statistical Analysis of Subliminal Message Effects on Test Scores”

This study investigates the effects of subliminal messages on test scores.

Students who failed a math exam were randomly assigned to receive either a treatment message (“Each day I am getting better at mathematics”) or a control message (a boring message unrelated to math).

All students took an initial test (pre-test), went to a math course during the summer, and then took a post-test. The researchers claim that the positive subliminal message improves test scores.

We will test this claim using the change in test scores (post-test score - pre-test score) for both the treatment and control groups.

# Load required libraries and data set 
library(haven)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
setwd("/Users/ciaralouisenoctor/Downloads")
data <- read_dta("subliminal.dta")
head(data)
## # A tibble: 6 × 4
##   pre_test post_test improvement group
##      <dbl>     <dbl>       <dbl> <chr>
## 1       18        24           6 T    
## 2       18        25           7 T    
## 3       21        33          12 T    
## 4       18        29          11 T    
## 5       18        33          15 T    
## 6       20        36          16 T

1. Formulation of Hypotheses

The hypotheses for this test are as follows:

2. Descriptive Statistics

First, we calculate the mean change in test scores for both the treatment and control groups, and summarize the changes

data <- data %>% mutate(change_in_score = post_test - pre_test)

# Summarize the mean change in score by group
group_summary <- data %>%
  group_by(group) %>%
  summarise(mean_change = mean(change_in_score, na.rm = TRUE))

# Print the summary
print(group_summary)
## # A tibble: 2 × 2
##   group mean_change
##   <chr>       <dbl>
## 1 C            7.43
## 2 T           11.4

##3. Now we run a t test to. analyze the data

t_test_result <- t.test(
  change_in_score ~ group,
  data = data,
  alternative = "greater",  
  var.equal = TRUE         
)
print(t_test_result)
## 
##  Two Sample t-test
## 
## data:  change_in_score by group
## t = -2.5646, df = 15, p-value = 0.9892
## alternative hypothesis: true difference in means between group C and group T is greater than 0
## 95 percent confidence interval:
##  -6.686134       Inf
## sample estimates:
## mean in group C mean in group T 
##        7.428571       11.400000

Our Final Interpretation of all of the Results

Mean:” The mean of treatment group was 11.4 and the mean of our control group was 7.43. The mean is the average change in the test scores”

Test-Statistic: “The T-value is -2.5646, which indicates the difference between the groups”

P-value: “The P-value is 0.9892

Since the p-value (0.9892) is much greater than 0.01, we do not reject the null hypothesis at the 99%(C.I) confidence level. This means we don’t have enough evidence to support the claim that the positive subliminal message improved test scores in the treatment group compared to the control group.

One Final Note on the Hypothesis Test

Since we used a one-sided test with alternative = “greater”, this assumes that the treatment group was hypothesized to have a higher mean than the control group. However, the results didn’t support this hypothesis in this case.

“Statistical Analysis of the determinants of Mortality among U.S adults”

This report aims to investigate the determinants of mortality among U.S. adults by analyzing the mortality dataset provided.

We are particularly interested in understanding how factors such as age, gender, income, smoking status, and alcohol use relate to mortality rates.

Our findings aim to inform policy recommendations for the Department of Health, helping to address key health priorities

1. load the nescessary packages and the data set

library(ggplot2) 
library(broom)
library(stargazer) 
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
mortality_data <- readRDS("/Users/ciaralouisenoctor/Downloads/mortality.rds")
head(mortality_data)
## # A tibble: 6 × 67
##    seqn  qsmk death yrdth modth dadth   sbp   dbp sex     age race  income
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct>  <dbl>
## 1   233     0     0    NA    NA    NA   175    96 0        42 1         19
## 2   235     0     0    NA    NA    NA   123    80 0        36 0         18
## 3   244     0     0    NA    NA    NA   115    75 1        56 1         15
## 4   245     0     1    85     2    14   148    78 0        68 1         15
## 5   252     0     0    NA    NA    NA   118    77 0        40 0         18
## 6   257     0     0    NA    NA    NA   141    83 1        43 1         11
## # ℹ 55 more variables: marital <dbl>, school <dbl>, education <fct>, ht <dbl>,
## #   wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>, birthplace <dbl>,
## #   smokeintensity <dbl>, smkintensity82_71 <dbl>, smokeyrs <dbl>,
## #   asthma <dbl>, bronch <dbl>, tb <dbl>, hf <dbl>, hbp <dbl>,
## #   pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
## #   hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>,
## #   nervousbreak <dbl>, alcoholpy <dbl>, alcoholfreq <dbl>, …
summary(mortality_data) 
##       seqn            qsmk            death            yrdth      
##  Min.   :  233   Min.   :0.0000   Min.   :0.0000   Min.   :83.00  
##  1st Qu.:10625   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:85.00  
##  Median :20702   Median :0.0000   Median :0.0000   Median :88.00  
##  Mean   :16639   Mean   :0.2573   Mean   :0.1858   Mean   :87.64  
##  3rd Qu.:22771   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:90.00  
##  Max.   :25061   Max.   :1.0000   Max.   :1.0000   Max.   :92.00  
##                                                    NA's   :1275   
##      modth            dadth           sbp             dbp         sex    
##  Min.   : 1.000   Min.   : 1.0   Min.   : 87.0   Min.   : 47.00   0:762  
##  1st Qu.: 3.000   1st Qu.: 8.0   1st Qu.:115.0   1st Qu.: 70.00   1:804  
##  Median : 6.000   Median :16.0   Median :126.0   Median : 77.00          
##  Mean   : 6.383   Mean   :16.1   Mean   :128.6   Mean   : 77.74          
##  3rd Qu.:10.000   3rd Qu.:24.5   3rd Qu.:139.0   3rd Qu.: 85.00          
##  Max.   :12.000   Max.   :31.0   Max.   :229.0   Max.   :130.00          
##  NA's   :1271     NA's   :1271   NA's   :29      NA's   :33              
##       age        race         income         marital          school     
##  Min.   :25.00   0:1360   Min.   :11.00   Min.   :2.000   Min.   : 0.00  
##  1st Qu.:33.00   1: 206   1st Qu.:17.00   1st Qu.:2.000   1st Qu.:10.00  
##  Median :43.00            Median :19.00   Median :2.000   Median :12.00  
##  Mean   :43.66            Mean   :17.99   Mean   :2.496   Mean   :11.17  
##  3rd Qu.:53.00            3rd Qu.:20.00   3rd Qu.:2.000   3rd Qu.:12.00  
##  Max.   :74.00            Max.   :22.00   Max.   :8.000   Max.   :17.00  
##                           NA's   :59                                     
##  education       ht             wt71             wt82           wt82_71       
##  1:291     Min.   :142.9   Min.   : 39.58   Min.   : 35.38   Min.   :-41.280  
##  2:340     1st Qu.:161.8   1st Qu.: 59.53   1st Qu.: 61.69   1st Qu.: -1.478  
##  3:637     Median :168.2   Median : 69.23   Median : 72.12   Median :  2.604  
##  4:121     Mean   :168.7   Mean   : 70.83   Mean   : 73.47   Mean   :  2.638  
##  5:177     3rd Qu.:175.3   3rd Qu.: 79.80   3rd Qu.: 83.46   3rd Qu.:  6.690  
##            Max.   :198.1   Max.   :151.73   Max.   :136.53   Max.   : 48.538  
##                                                                               
##    birthplace    smokeintensity  smkintensity82_71    smokeyrs    
##  Min.   : 1.00   Min.   : 1.00   Min.   :-80.000   Min.   : 1.00  
##  1st Qu.:22.00   1st Qu.:10.00   1st Qu.:-10.000   1st Qu.:15.00  
##  Median :34.00   Median :20.00   Median : -1.000   Median :24.00  
##  Mean   :31.67   Mean   :20.53   Mean   : -4.633   Mean   :24.59  
##  3rd Qu.:42.00   3rd Qu.:30.00   3rd Qu.:  1.000   3rd Qu.:33.00  
##  Max.   :56.00   Max.   :80.00   Max.   : 50.000   Max.   :64.00  
##  NA's   :90                                                       
##      asthma            bronch              tb                hf          
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.000000  
##  Mean   :0.04853   Mean   :0.08365   Mean   :0.01341   Mean   :0.005109  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.000000  
##                                                                          
##       hbp         pepticulcer        colitis          hepatitis      
##  Min.   :0.000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :1.000   Median :0.0000   Median :0.00000   Median :0.00000  
##  Mean   :1.059   Mean   :0.1015   Mean   :0.03448   Mean   :0.01788  
##  3rd Qu.:2.000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :2.000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
##                                                                      
##   chroniccough        hayfever          diabetes          polio        
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.05109   Mean   :0.08621   Mean   :0.9898   Mean   :0.01405  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:2.0000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :2.0000   Max.   :1.00000  
##                                                                        
##      tumor          nervousbreak       alcoholpy       alcoholfreq   
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:1.000  
##  Median :0.00000   Median :0.00000   Median :1.0000   Median :2.000  
##  Mean   :0.02363   Mean   :0.02746   Mean   :0.8787   Mean   :1.913  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :2.0000   Max.   :5.000  
##                                                                      
##   alcoholtype    alcoholhowmuch        pica          headache     
##  Min.   :1.000   Min.   : 1.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.: 2.000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :3.000   Median : 2.000   Median :0.000   Median :1.0000  
##  Mean   :2.466   Mean   : 3.293   Mean   :0.986   Mean   :0.6328  
##  3rd Qu.:4.000   3rd Qu.: 4.000   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :4.000   Max.   :48.000   Max.   :2.000   Max.   :1.0000  
##                  NA's   :397                                      
##    otherpain        weakheart         allergies           nerves     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000  
##  Median :0.0000   Median :0.00000   Median :0.00000   Median :0.000  
##  Mean   :0.2433   Mean   :0.02235   Mean   :0.06322   Mean   :0.145  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.000  
##                                                                      
##     lackpep            hbpmed       boweltrouble       wtloss       
##  Min.   :0.00000   Min.   :0.000   Min.   :0.000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.00000  
##  Median :0.00000   Median :1.000   Median :1.000   Median :0.00000  
##  Mean   :0.05045   Mean   :1.015   Mean   :1.046   Mean   :0.02618  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :2.000   Max.   :2.000   Max.   :1.00000  
##                                                                     
##    infection     active  exercise  birthcontrol    pregnancies    
##  Min.   :0.000   0:702   0:300    Min.   :0.000   Min.   : 1.000  
##  1st Qu.:0.000   1:715   1:661    1st Qu.:0.000   1st Qu.: 2.000  
##  Median :0.000   2:149   2:605    Median :1.000   Median : 3.000  
##  Mean   :0.145                    Mean   :1.081   Mean   : 3.694  
##  3rd Qu.:0.000                    3rd Qu.:2.000   3rd Qu.: 5.000  
##  Max.   :1.000                    Max.   :2.000   Max.   :15.000  
##                                                   NA's   :866     
##   cholesterol      hightax82         price71         price82     
##  Min.   : 78.0   Min.   :0.0000   Min.   :1.507   Min.   :1.452  
##  1st Qu.:189.0   1st Qu.:0.0000   1st Qu.:2.037   1st Qu.:1.740  
##  Median :216.0   Median :0.0000   Median :2.168   Median :1.815  
##  Mean   :219.9   Mean   :0.1653   Mean   :2.138   Mean   :1.806  
##  3rd Qu.:245.0   3rd Qu.:0.0000   3rd Qu.:2.242   3rd Qu.:1.868  
##  Max.   :416.0   Max.   :1.0000   Max.   :2.693   Max.   :2.103  
##  NA's   :16      NA's   :90       NA's   :90      NA's   :90     
##      tax71            tax82          price71_82         tax71_82     
##  Min.   :0.5249   Min.   :0.2200   Min.   :-0.2027   Min.   :0.0360  
##  1st Qu.:0.9449   1st Qu.:0.4399   1st Qu.: 0.2010   1st Qu.:0.4610  
##  Median :1.0498   Median :0.5060   Median : 0.3360   Median :0.5440  
##  Mean   :1.0580   Mean   :0.5058   Mean   : 0.3324   Mean   :0.5522  
##  3rd Qu.:1.1548   3rd Qu.:0.5719   3rd Qu.: 0.4438   3rd Qu.:0.6220  
##  Max.   :1.5225   Max.   :0.7479   Max.   : 0.6121   Max.   :0.8844  
##  NA's   :90       NA's   :90       NA's   :90        NA's   :90      
##        id            censored     older       
##  Min.   :   1.0   Min.   :0   Min.   :0.0000  
##  1st Qu.: 414.2   1st Qu.:0   1st Qu.:0.0000  
##  Median : 824.5   Median :0   Median :0.0000  
##  Mean   : 821.0   Mean   :0   Mean   :0.2989  
##  3rd Qu.:1228.8   3rd Qu.:0   3rd Qu.:1.0000  
##  Max.   :1629.0   Max.   :0   Max.   :1.0000  
## 

##2. summarising the mortality by smoking status

# Summarize mortality by smoking status
mortality_vs_smoking <- mortality_data %>%
  mutate(smoking_status = ifelse(qsmk == 1, "Smoker", "Non-Smoker")) %>%
  group_by(smoking_status) %>%
  summarise(mortality_rate = mean(death, na.rm = TRUE),
            n = n())  # Number of observations in each group

# Print the summary
print(mortality_vs_smoking)
## # A tibble: 2 × 3
##   smoking_status mortality_rate     n
##   <chr>                   <dbl> <int>
## 1 Non-Smoker              0.172  1163
## 2 Smoker                  0.226   403

visualizing the mortality rate by smoking status

# Visualize mortality rate by smoking status using a bar plot
ggplot(mortality_vs_smoking, aes(x = smoking_status, y = mortality_rate, fill = smoking_status)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(x = "Smoking Status", 
       y = "Mortality Rate", 
       title = "Mortality Rate by Smoking Status") +
  scale_fill_manual(values = c("Smoker" = "pink", "Non-Smoker" = "skyblue")) +
  theme_minimal()

#running an OLS regression on smoking status on mortality

# OLS regression
ols_model <- lm(death ~ qsmk, data = mortality_data)

# Summary of the model
summary(ols_model)
## 
## Call:
## lm(formula = death ~ qsmk, data = mortality_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2258 -0.1720 -0.1720 -0.1720  0.8280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.17197    0.01139  15.096   <2e-16 ***
## qsmk         0.05384    0.02246   2.397   0.0166 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3885 on 1564 degrees of freedom
## Multiple R-squared:  0.003661,   Adjusted R-squared:  0.003024 
## F-statistic: 5.748 on 1 and 1564 DF,  p-value: 0.01663
#AIC
aic_value <- AIC(ols_model)
cat("AIC for the OLS regression model:", aic_value, "\n")
## AIC for the OLS regression model: 1486.925

running a logistic regression as well

# Logistic regression
logit_model <- glm(death ~ qsmk, data = mortality_data, family = binomial)

# Summary of the model
summary(logit_model)
## 
## Call:
## glm(formula = death ~ qsmk, family = binomial, data = mortality_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.57174    0.07771 -20.226   <2e-16 ***
## qsmk         0.33959    0.14224   2.387    0.017 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1498.2  on 1564  degrees of freedom
## AIC: 1502.2
## 
## Number of Fisher Scoring iterations: 4
# AIC value
aic_value <- AIC(logit_model)
cat("AIC for the logistic regression model:", aic_value, "\n")
## AIC for the logistic regression model: 1502.154

#comparing the AIC to determine which model balances goodness of fit

# Comparing AIC between models
cat("AIC for simple model:", AIC(logit_model), "\n")
## AIC for simple model: 1502.154
cat("AIC for model with controls:", AIC(ols_model), "\n")
## AIC for model with controls: 1486.925

##3. summarising the mortality by Alcohol intake frequency

# Summarize mortality by alcohol frequency
alcohol_mortality_summary <- mortality_data %>%
  group_by(alcoholfreq) %>%
  summarise(mortality_rate = mean(death, na.rm = TRUE),
            n = n())  # Number of observations in each group

# Print the summary
print(alcohol_mortality_summary)
## # A tibble: 6 × 3
##   alcoholfreq mortality_rate     n
##         <dbl>          <dbl> <int>
## 1           0          0.231   325
## 2           1          0.178   219
## 3           2          0.123   494
## 4           3          0.186   328
## 5           4          0.282   195
## 6           5          0         5
# Visualize mortality rate by alcohol frequency using a bar plot
ggplot(alcohol_mortality_summary, aes(x = factor(alcoholfreq), y = mortality_rate, fill = factor(alcoholfreq))) +
  geom_bar(stat = "identity") +
  labs(x = "Alcohol Frequency", 
       y = "Mortality Rate", 
       title = "Mortality Rate by Alcohol Frequency") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

# OLS regression
ols_model <- lm(death ~ alcoholfreq, data = mortality_data)

# Summary of the model
summary(ols_model)
## 
## Call:
## lm(formula = death ~ alcoholfreq, data = mortality_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1943 -0.1888 -0.1861 -0.1806  0.8194 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.180594   0.017468  10.338   <2e-16 ***
## alcoholfreq 0.002734   0.007546   0.362    0.717    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3892 on 1564 degrees of freedom
## Multiple R-squared:  8.39e-05,   Adjusted R-squared:  -0.0005554 
## F-statistic: 0.1312 on 1 and 1564 DF,  p-value: 0.7172
#AIC Model 
aic_value <- AIC(ols_model)
cat("AIC for the logistic regression model:", aic_value, "\n")
## AIC for the logistic regression model: 1492.538
# Logistic regression
logit_model <- glm(death ~ alcoholfreq, data = mortality_data, family = binomial)

# Summary of the model
summary(logit_model)
## 
## Call:
## glm(formula = death ~ alcoholfreq, family = binomial, data = mortality_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.51213    0.11622 -13.011   <2e-16 ***
## alcoholfreq  0.01807    0.04986   0.362    0.717    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1503.6  on 1564  degrees of freedom
## AIC: 1507.6
## 
## Number of Fisher Scoring iterations: 4
# AIC value
aic_value <- AIC(logit_model)
cat("AIC for the logistic regression model:", aic_value, "\n")
## AIC for the logistic regression model: 1507.574

new variable testing

#Create a new variable for change in weight
mortality <- mortality_data %>%
  mutate(weight_change = wt82 - wt71)  # Calculate weight change

# Fit the logistic regression model
logistic_model <- glm(
  death ~ weight_change + age + sex + hbp + qsmk + diabetes,
  data = mortality,
  family = binomial(link = "logit")  # Specify logistic regression
)

# Summarize the model
summary(logistic_model)
## 
## Call:
## glm(formula = death ~ weight_change + age + sex + hbp + qsmk + 
##     diabetes, family = binomial(link = "logit"), data = mortality)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -6.33175    0.42331 -14.958  < 2e-16 ***
## weight_change -0.02403    0.01030  -2.334  0.01961 *  
## age            0.11053    0.00797  13.868  < 2e-16 ***
## sex1          -0.63742    0.15390  -4.142 3.44e-05 ***
## hbp            0.38082    0.23706   1.606  0.10818    
## qsmk           0.01492    0.17028   0.088  0.93018    
## diabetes      -0.58694    0.22635  -2.593  0.00951 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1122.6  on 1559  degrees of freedom
## AIC: 1136.6
## 
## Number of Fisher Scoring iterations: 5

Predicted Probabilities vs. Weight Change

We can visualize the effect of weight change on the probability of death, as weight change is likely to be one of the significant predictors.

# Create a new column in the dataset to store predicted probabilities
mortality$predicted_prob <- predict(logistic_model, type = "response")

# Plot the predicted probability of death against weight change
ggplot(mortality, aes(x = weight_change, y = predicted_prob)) +
  geom_point(alpha = 0.2) +  # Scatter plot for individual data points
  geom_smooth(method = "loess", color = "blue", size = 1) +  # Add smoothed line
  labs(
    title = "Predicted Probability of Death vs. Weight Change",
    x = "Weight Change (wt82 - wt71)",
    y = "Predicted Probability of Death"
  ) +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'

#Visualizing the Impact of Age on the Predicted Probability of Death

# Plot predicted probability of death against age
ggplot(mortality, aes(x = age, y = predicted_prob)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess", color = "skyblue", size = 1) +
  labs(
    title = "Predicted Probability of Death vs. Age",
    x = "Age in 1971",
    y = "Predicted Probability of Death"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

#2. Visualizing the Effect of Other Variables You can also explore how sex, diabetes, and other variables affect the probability of death. Since these are categorical variables, you can visualize their effects with bar plots or boxplots.

  1. Effect of Sex on Mortality
# Plot predicted probability of death by sex
ggplot(mortality, aes(x = factor(sex), y = predicted_prob)) +
  geom_boxplot() +
  labs(
    title = "Predicted Probability of Death by Sex",
    x = "Sex (0 = Female, 1 = Male)",
    y = "Predicted Probability of Death"
  ) +
  theme_minimal()

Effect of Diabetes on Mortality

# Plot predicted probability of death by diabetes status
ggplot(mortality, aes(x = factor(diabetes), y = predicted_prob)) +
  geom_boxplot() +
  labs(
    title = "Predicted Probability of Death by Diabetes Status",
    x = "Diabetes (0 = No, 1 = Yes)",
    y = "Predicted Probability of Death"
  ) +
  theme_minimal()

mortality_vs_heartfailure <- mortality_data %>%
  mutate(heartfailure = ifelse(hf == 1, "had hf", "no hf")) %>%
  group_by(heartfailure) %>%
  summarise(mortality_rate = mean(death, na.rm = TRUE),
            n = n())  # Number of observations in each group

# Print the summary
print(mortality_vs_heartfailure)
## # A tibble: 2 × 3
##   heartfailure mortality_rate     n
##   <chr>                 <dbl> <int>
## 1 had hf                0.5       8
## 2 no hf                 0.184  1558
# OLS regression
ols_model <- lm(death ~ hf, data = mortality_data)

# Summary of the model
summary(ols_model)
## 
## Call:
## lm(formula = death ~ hf, data = mortality_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5000 -0.1842 -0.1842 -0.1842  0.8158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.184211   0.009844  18.713   <2e-16 ***
## hf          0.315789   0.137729   2.293    0.022 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3886 on 1564 degrees of freedom
## Multiple R-squared:  0.00335,    Adjusted R-squared:  0.002713 
## F-statistic: 5.257 on 1 and 1564 DF,  p-value: 0.02199
#AIC
aic_value <- AIC(ols_model)
cat("AIC for the OLS regression model:", aic_value, "\n")
## AIC for the OLS regression model: 1487.415
# Logistic regression
logit_model <- glm(death ~ hf, data = mortality_data, family = binomial)

# Summary of the model
summary(logit_model)
## 
## Call:
## glm(formula = death ~ hf, family = binomial, data = mortality_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.48808    0.06535 -22.770   <2e-16 ***
## hf           1.48808    0.71012   2.096   0.0361 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1499.7  on 1564  degrees of freedom
## AIC: 1503.7
## 
## Number of Fisher Scoring iterations: 4
# AIC value
aic_value <- AIC(logit_model)
cat("AIC for the logistic regression model:", aic_value, "\n")
## AIC for the logistic regression model: 1503.661

Analyzing the relationship between weak heart smoking, alcohol frequencey on mortality

# Fit logistic regression model including interaction terms
regression_model <- glm(death ~ hf * qsmk * alcoholfreq, 
                        data = mortality, 
                        family = binomial)

# Summary of the model
summary(regression_model)
## 
## Call:
## glm(formula = death ~ hf * qsmk * alcoholfreq, family = binomial, 
##     data = mortality)
## 
## Coefficients: (1 not defined because of singularities)
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.54455    0.13725 -11.254   <2e-16 ***
## hf                  -17.65150  454.69537  -0.039    0.969    
## qsmk                  0.17298    0.25910   0.668    0.504    
## alcoholfreq          -0.02078    0.06018  -0.345    0.730    
## hf:qsmk              -8.38589  151.58171  -0.055    0.956    
## hf:alcoholfreq        6.78566  151.56610   0.045    0.964    
## qsmk:alcoholfreq      0.08735    0.10974   0.796    0.426    
## hf:qsmk:alcoholfreq        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1489.6  on 1559  degrees of freedom
## AIC: 1503.6
## 
## Number of Fisher Scoring iterations: 12
# Add predicted probabilities to the dataset
mortality <- mortality %>%
  mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))

# View first few rows of the updated dataset
head(mortality)
## # A tibble: 6 × 69
##    seqn  qsmk death yrdth modth dadth   sbp   dbp sex     age race  income
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct>  <dbl>
## 1   233     0     0    NA    NA    NA   175    96 0        42 1         19
## 2   235     0     0    NA    NA    NA   123    80 0        36 0         18
## 3   244     0     0    NA    NA    NA   115    75 1        56 1         15
## 4   245     0     1    85     2    14   148    78 0        68 1         15
## 5   252     0     0    NA    NA    NA   118    77 0        40 0         18
## 6   257     0     0    NA    NA    NA   141    83 1        43 1         11
## # ℹ 57 more variables: marital <dbl>, school <dbl>, education <fct>, ht <dbl>,
## #   wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>, birthplace <dbl>,
## #   smokeintensity <dbl>, smkintensity82_71 <dbl>, smokeyrs <dbl>,
## #   asthma <dbl>, bronch <dbl>, tb <dbl>, hf <dbl>, hbp <dbl>,
## #   pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
## #   hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>,
## #   nervousbreak <dbl>, alcoholpy <dbl>, alcoholfreq <dbl>, …
# Assuming 'mortality' has been updated with predicted probabilities
library(ggplot2)

# Ensure relevant variables are factors for proper grouping
mortality <- mortality %>%
  mutate(
    hf = factor(hf, levels = c(0, 1), labels = c("No Heart Failure", "Heart Failure")),
    qsmk = factor(qsmk, levels = c(0, 1), labels = c("Non-smoker", "Smoker")),
    alcoholfreq = factor(alcoholfreq, levels = c(0, 1), labels = c("No Alcohol", "Alcohol Consumer"))
  )

# Create the plot
ggplot(mortality, aes(x = hf, y = predicted_prob, color = factor(qsmk), shape = factor(alcoholfreq))) +
  geom_point(alpha = 0.7) +  # Use alpha to make points semi-transparent
  labs(
    title = "Predicted Probability of Death by Heart Failure, Smoking, and Alcohol Consumption",
    x = "Heart Failure (1 = Yes, 0 = No)",
    y = "Predicted Probability of Death",
    color = "Smoking Status",
    shape = "Alcohol Consumption"
  ) +
  theme_minimal() +  # Optional: change to a minimal theme for clarity
  theme(legend.position = "top")  # Optional: position legend at the top
## Warning: Removed 1022 rows containing missing values or values outside the scale range
## (`geom_point()`).

new regression including income, death alcohol frequency smoke intenstity and education

regression_data <- mortality %>%
  select(death, income, alcoholfreq, smokeintensity, education) %>%  # Select relevant columns
  filter(!is.na(death))  # Remove rows with missing values for 'death'

# Then run your regression model
model <- glm(death ~ income + alcoholfreq + smokeintensity + education, 
             data = regression_data, 
             family = binomial)

predicted probabilities

# Load necessary libraries
library(dplyr)

# Fit a logistic regression model (ensure your model is defined)
regression_model <- glm(death ~ income + alcoholfreq + smokeintensity + education,
                        data = mortality, family = binomial)

# Add predicted probabilities to the dataset
regression_data <- mortality %>%
  mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))

# View the first few rows of the modified dataset
head(regression_data)
## # A tibble: 6 × 69
##    seqn qsmk       death yrdth modth dadth   sbp   dbp sex     age race  income
##   <dbl> <fct>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct>  <dbl>
## 1   233 Non-smoker     0    NA    NA    NA   175    96 0        42 1         19
## 2   235 Non-smoker     0    NA    NA    NA   123    80 0        36 0         18
## 3   244 Non-smoker     0    NA    NA    NA   115    75 1        56 1         15
## 4   245 Non-smoker     1    85     2    14   148    78 0        68 1         15
## 5   252 Non-smoker     0    NA    NA    NA   118    77 0        40 0         18
## 6   257 Non-smoker     0    NA    NA    NA   141    83 1        43 1         11
## # ℹ 57 more variables: marital <dbl>, school <dbl>, education <fct>, ht <dbl>,
## #   wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>, birthplace <dbl>,
## #   smokeintensity <dbl>, smkintensity82_71 <dbl>, smokeyrs <dbl>,
## #   asthma <dbl>, bronch <dbl>, tb <dbl>, hf <fct>, hbp <dbl>,
## #   pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
## #   hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>,
## #   nervousbreak <dbl>, alcoholpy <dbl>, alcoholfreq <fct>, …

##graphing Income

graph_income <- ggplot(regression_data, aes(x = income, y = predicted_prob)) +
  geom_point(alpha = 0.3, color = "blue") +
  geom_smooth(method = "loess", color = "red", se = FALSE) +
  labs(title = "Predicted Probability of Death vs. Income",
       x = "Income",
       y = "Predicted Probability of Death") +
  theme_minimal()

print(graph_income)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1048 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1048 rows containing missing values or values outside the scale range
## (`geom_point()`).

##graphing education

graph_education <- ggplot(regression_data, aes(x = factor(education), y = predicted_prob)) +
  geom_boxplot(fill = "orange", color = "black", alpha = 0.7) +
  labs(title = "Predicted Probability of Death vs. Education Levels",
       x = "Education Level",
       y = "Predicted Probability of Death") +
  theme_minimal()

print(graph_education)
## Warning: Removed 1048 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

genuine health status on mortality

# Load necessary libraries
library(dplyr)

# Fit logistic regression model
regression_model <- glm(death ~ asthma + bronch + tb + hf + hbp + pepticulcer + 
                         colitis + hepatitis + chroniccough + diabetes + polio, 
                         family = binomial, data = mortality)

# Summarize the model results
summary(regression_model)
## 
## Call:
## glm(formula = death ~ asthma + bronch + tb + hf + hbp + pepticulcer + 
##     colitis + hepatitis + chroniccough + diabetes + polio, family = binomial, 
##     data = mortality)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -1.47603    0.10234 -14.423  < 2e-16 ***
## asthma           0.24692    0.29647   0.833 0.404927    
## bronch           0.05816    0.24426   0.238 0.811801    
## tb               0.38623    0.53032   0.728 0.466436    
## hfHeart Failure  1.61919    0.73401   2.206 0.027386 *  
## hbp              0.85394    0.21591   3.955 7.65e-05 ***
## pepticulcer      0.69891    0.19428   3.598 0.000321 ***
## colitis         -0.12711    0.37599  -0.338 0.735314    
## hepatitis        0.16835    0.50743   0.332 0.740060    
## chroniccough     0.94140    0.27249   3.455 0.000551 ***
## diabetes        -1.13700    0.20648  -5.507 3.66e-08 ***
## polio           -0.77026    0.75272  -1.023 0.306164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1431.1  on 1554  degrees of freedom
## AIC: 1455.1
## 
## Number of Fisher Scoring iterations: 4
# Add predicted probabilities to the dataset
mortality <- mortality %>%
  mutate(predicted_prob = predict(regression_model, newdata = ., type = "response"))

# Plot predicted probabilities based on asthma, heart failure, and diabetes (for example)
library(ggplot2)
ggplot(mortality, aes(x = asthma, y = predicted_prob, color = factor(hf), shape = factor(diabetes))) +
  geom_point(alpha = 0.7) +
  labs(title = "Predicted Probability of Death by Asthma, Heart Failure, and Diabetes",
       x = "Asthma (1 = Yes, 0 = No)",
       y = "Predicted Probability of Death",
       color = "Heart Failure (1 = Yes, 0 = No)",
       shape = "Diabetes (1 = Yes, 0 = No)") +
  theme_minimal()