Week 10 Data Dive

Creating the Binary Column

The binary column is derived from Cause.Name: it takes the value 1 when the cause of death is Heart disease and 0 for every other cause.

# Load dplyr for mutate(), if_else(), and count()
library(dplyr)

# Add a binary column: 1 when Cause.Name is "Heart disease", 0 otherwise
data <- data |>
  mutate(
    heart_disease_binary = if_else(Cause.Name == "Heart disease", 1, 0)
  )

# Verifying proper completion
data |>
  count(heart_disease_binary)

## # A tibble: 2 × 2
##   heart_disease_binary     n
##                  <dbl> <int>
## 1                    0  9360
## 2                    1   936
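The two counts above fix the base rate that every model in this report is measured against. A minimal sketch in base R, using only the counts from the output (the variable names are illustrative and not part of the original analysis):

```r
# Counts copied from the count() output above
n_heart <- 936
n_other <- 9360

# Share of rows coded 1 (Heart disease)
base_rate <- n_heart / (n_heart + n_other)

# Log-odds of the base rate; an intercept-only logistic model would estimate this
base_log_odds <- log(n_heart / n_other)

base_rate      # 1/11, about 0.0909
base_log_odds  # log(0.1), about -2.3026
```

The response is imbalanced at roughly 1 row in 11, so intercepts near -2.30 in the models below mostly reflect this base rate.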

Logistic Regression Models

Model 1

# Build a logistic regression model using Year and State to predict whether a record is Heart disease or not Heart disease
heart_model <- glm(
  heart_disease_binary ~ Year + State,
  data = data,
  family = binomial
)

# Display the logistic regression model summary so the coefficients, standard errors, z-values, and p-values can be reviewed
summary(heart_model)

## 
## Call:
## glm(formula = heart_disease_binary ~ Year + State, family = binomial, 
##     data = data)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)
## (Intercept)               -2.303e+00  1.327e+01  -0.173    0.862
## Year                       2.417e-14  6.608e-03   0.000    1.000
## StateAlaska                4.166e-14  3.496e-01   0.000    1.000
## StateArizona               5.037e-14  3.496e-01   0.000    1.000
## StateArkansas              4.808e-14  3.496e-01   0.000    1.000
## StateCalifornia            4.818e-14  3.496e-01   0.000    1.000
## StateColorado              4.652e-14  3.496e-01   0.000    1.000
## StateConnecticut           5.096e-14  3.496e-01   0.000    1.000
## StateDelaware              4.670e-14  3.496e-01   0.000    1.000
## StateDistrict of Columbia  4.886e-14  3.496e-01   0.000    1.000
## StateFlorida               4.830e-14  3.496e-01   0.000    1.000
## StateGeorgia               4.725e-14  3.496e-01   0.000    1.000
## StateHawaii                4.306e-14  3.496e-01   0.000    1.000
## StateIdaho                 3.763e-14  3.496e-01   0.000    1.000
## StateIllinois              4.933e-14  3.496e-01   0.000    1.000
## StateIndiana               4.216e-14  3.496e-01   0.000    1.000
## StateIowa                  4.848e-14  3.496e-01   0.000    1.000
## StateKansas                4.808e-14  3.496e-01   0.000    1.000
## StateKentucky              5.191e-14  3.496e-01   0.000    1.000
## StateLouisiana             4.170e-14  3.496e-01   0.000    1.000
## StateMaine                 5.242e-14  3.496e-01   0.000    1.000
## StateMaryland              4.791e-14  3.496e-01   0.000    1.000
## StateMassachusetts         4.527e-14  3.496e-01   0.000    1.000
## StateMichigan              4.750e-14  3.496e-01   0.000    1.000
## StateMinnesota             5.172e-14  3.496e-01   0.000    1.000
## StateMississippi           4.177e-14  3.496e-01   0.000    1.000
## StateMissouri              4.595e-14  3.496e-01   0.000    1.000
## StateMontana               4.570e-14  3.496e-01   0.000    1.000
## StateNebraska              4.768e-14  3.496e-01   0.000    1.000
## StateNevada                4.975e-14  3.496e-01   0.000    1.000
## StateNew Hampshire         4.826e-14  3.496e-01   0.000    1.000
## StateNew Jersey            4.624e-14  3.496e-01   0.000    1.000
## StateNew Mexico            4.880e-14  3.496e-01   0.000    1.000
## StateNew York              4.517e-14  3.496e-01   0.000    1.000
## StateNorth Carolina        4.469e-14  3.496e-01   0.000    1.000
## StateNorth Dakota          4.790e-14  3.496e-01   0.000    1.000
## StateOhio                  4.813e-14  3.496e-01   0.000    1.000
## StateOklahoma              4.706e-14  3.496e-01   0.000    1.000
## StateOregon                4.832e-14  3.496e-01   0.000    1.000
## StatePennsylvania          4.786e-14  3.496e-01   0.000    1.000
## StateRhode Island          4.777e-14  3.496e-01   0.000    1.000
## StateSouth Carolina        4.768e-14  3.496e-01   0.000    1.000
## StateSouth Dakota          4.874e-14  3.496e-01   0.000    1.000
## StateTennessee             4.664e-14  3.496e-01   0.000    1.000
## StateTexas                 4.756e-14  3.496e-01   0.000    1.000
## StateUnited States         4.713e-14  3.496e-01   0.000    1.000
## StateUtah                  4.697e-14  3.496e-01   0.000    1.000
## StateVermont               1.040e-13  3.496e-01   0.000    1.000
## StateVirginia              4.824e-14  3.496e-01   0.000    1.000
## StateWashington            5.158e-14  3.496e-01   0.000    1.000
## StateWest Virginia         4.825e-14  3.496e-01   0.000    1.000
## StateWisconsin             4.641e-14  3.496e-01   0.000    1.000
## StateWyoming               5.003e-14  3.496e-01   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6273.1  on 10295  degrees of freedom
## Residual deviance: 6273.1  on 10243  degrees of freedom
## AIC: 6379.1
## 
## Number of Fisher Scoring iterations: 5

Summary and Explanation of the First Model

This first model used Year and State because they are simple, intuitive explanatory variables for testing whether heart disease outcomes differ across time and location. The results showed no predictive value: every coefficient was effectively zero (on the order of 1e-14), every p-value was 1.000, and the residual deviance (6273.1) matched the null deviance at the reported precision. This is the pattern one would expect if every state and year contributes the same share of Heart disease rows, and the intercept is consistent with that: -2.303 matches log(936/9360), the log-odds of the overall base rate. Because the model learned nothing from these variables, a different set of predictors was needed.
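One possible explanation for all-zero coefficients is a perfectly balanced design, where every State and Year combination holds the same number of rows per cause. The sketch below uses a small made-up data frame (`toy`, with hypothetical states "A" to "C" and 11 numbered causes) to show that behavior. Whether the real dataset has exactly this structure is an assumption that would need to be checked against the data.

```r
# Toy data (hypothetical, not the real dataset): 3 states x 4 years x 11 causes,
# with exactly one Heart disease row per state-year combination
toy <- expand.grid(
  State = c("A", "B", "C"),
  Year  = 2001:2004,
  Cause = 1:11
)
toy$heart <- as.integer(toy$Cause == 1)

toy_fit <- glm(heart ~ Year + State, data = toy, family = binomial)

# Year and State coefficients are numerically zero
round(coef(toy_fit)[-1], 6)

# Residual deviance equals null deviance: the predictors explain nothing
c(null = toy_fit$null.deviance, residual = toy_fit$deviance)
```

In a design like this, Year and State cannot separate the 1s from the 0s because the response rate is identical in every cell.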

Model 2

# Build a logistic regression model using age-adjusted death rate and deaths per 100k to predict whether a record is Heart disease
heart_model <- glm(
  heart_disease_binary ~ Age.adjusted.Death.Rate + Deaths_per_100k,
  data = data,
  family = binomial
)

# Display the logistic regression model summary so the updated coefficients and statistical significance can be reviewed
summary(heart_model)

## 
## Call:
## glm(formula = heart_disease_binary ~ Age.adjusted.Death.Rate + 
##     Deaths_per_100k, family = binomial, data = data)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.4844317  0.0411369 -60.394   <2e-16 ***
## Age.adjusted.Death.Rate  0.0001096  0.0008331   0.132    0.895    
## Deaths_per_100k          0.0009989  0.0007677   1.301    0.193    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6273.1  on 10295  degrees of freedom
## Residual deviance: 6190.7  on 10293  degrees of freedom
## AIC: 6196.7
## 
## Number of Fisher Scoring iterations: 5

Summary and Explanation of the Second Model

This second model used Age.adjusted.Death.Rate and Deaths_per_100k because both are tied directly to mortality patterns and seemed more likely than Year and State to distinguish heart disease rows from other rows. The model fit better than the first one (residual deviance 6190.7 versus 6273.1, AIC 6196.7 versus 6379.1), but neither predictor was individually statistically significant (p = 0.895 and p = 0.193). Because the explanatory variables did not show meaningful independent effects, this version was not chosen as the final model.
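The individual z-tests and the overall fit answer different questions. As a rough sketch, a likelihood-ratio test for the two predictors taken together can be computed from the deviances printed in the Model 2 summary (the variable names below are illustrative):

```r
# Deviances and degrees of freedom copied from the Model 2 summary
null_dev  <- 6273.1   # on 10295 df
resid_dev <- 6190.7   # on 10293 df

lr_stat <- null_dev - resid_dev      # drop in deviance
lr_df   <- 10295 - 10293             # number of added predictors

# Chi-squared p-value for the joint contribution of both predictors
lr_p <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(statistic = lr_stat, df = lr_df, p_value = lr_p)
```

A very small joint p-value alongside large individual p-values is often a sign that the predictors carry overlapping information. That reading is an assumption here; it could be checked with cor() on the two columns.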

Model 3

# Build a logistic regression model using Deaths and Total_Population to predict whether a record is Heart disease or not Heart disease
heart_model <- glm(
  heart_disease_binary ~ Deaths + Total_Population,
  data = data,
  family = binomial
)

# Display the logistic regression model summary so the coefficients, standard errors, z-values, and p-values can be reviewed
summary(heart_model)

## 
## Call:
## glm(formula = heart_disease_binary ~ Deaths + Total_Population, 
##     family = binomial, data = data)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.301e+00  3.566e-02 -64.509  < 2e-16 ***
## Deaths            6.893e-07  2.515e-07   2.741  0.00613 ** 
## Total_Population -1.279e-09  1.042e-09  -1.227  0.21967    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6273.1  on 10295  degrees of freedom
## Residual deviance: 6266.6  on 10293  degrees of freedom
## AIC: 6272.6
## 
## Number of Fisher Scoring iterations: 5

Summary and Explanation of the Third Model

This third model used Deaths and Total_Population because both are plausible quantitative predictors, and it was reasonable to test whether death counts and population size together help distinguish heart disease rows from other causes. This was the first model with a statistically significant predictor: Deaths (p = 0.00613), while Total_Population was not significant (p = 0.21967). Its overall fit, however, was weaker than Model 2's (residual deviance 6266.6 versus 6190.7, AIC 6272.6 versus 6196.7). Although it was a reasonable candidate, it was not selected as the final model because the simpler model using Deaths alone still gave a statistically significant result and was easier to explain.
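Dropping Total_Population can also be checked with a likelihood-ratio test between the two nested models. The sketch below uses the residual deviances printed in the Model 3 and Model 4 summaries of this report; the variable names are illustrative.

```r
# Residual deviances copied from the summaries in this report
dev_deaths_only <- 6268.3   # Deaths only, 10294 df
dev_with_pop    <- 6266.6   # Deaths + Total_Population, 10293 df

lr_stat <- dev_deaths_only - dev_with_pop
lr_df   <- 10294 - 10293

# p-value for the extra term (Total_Population)
lr_p <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
lr_p
```

The p-value is roughly 0.19, in line with the z-test for Total_Population, so removing that term costs little in fit. If both fitted models were kept under separate names, anova() with test = "Chisq" would run the same comparison directly.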

Model 4

# Build a logistic regression model using Deaths as the only predictor of whether a record is Heart disease or not Heart disease
heart_model <- glm(
  heart_disease_binary ~ Deaths,
  data = data,
  family = binomial
)

# Display the logistic regression model summary so the coefficient, standard error, z-value, and p-value can be reviewed
summary(heart_model)

## 
## Call:
## glm(formula = heart_disease_binary ~ Deaths, family = binomial, 
##     data = data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.312e+00  3.459e-02 -66.832   <2e-16 ***
## Deaths       5.024e-07  2.058e-07   2.441   0.0146 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6273.1  on 10295  degrees of freedom
## Residual deviance: 6268.3  on 10294  degrees of freedom
## AIC: 6272.3
## 
## Number of Fisher Scoring iterations: 5

Summary and Explanation of the Fourth Model

This final model used Deaths as the only explanatory variable because it was the simplest model that still produced a statistically significant predictor. The coefficient for Deaths is positive (5.024e-07), so each additional death raises the log-odds that a row is Heart disease rather than another cause by 5.024e-07. In practical terms, higher values of Deaths are associated with a greater probability that the row corresponds to Heart disease, and with p = 0.0146 the relationship is statistically significant at the 0.05 level. The improvement in fit is small, though: the residual deviance falls only from 6273.1 to 6268.3.
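Because a per-death change in log-odds is hard to picture, it can help to convert the coefficient to odds ratios over larger increments. A minimal sketch using the estimate printed in the Model 4 summary (the increments are chosen purely for illustration):

```r
# Coefficient copied from the Model 4 summary
b_deaths <- 5.024e-07

# Odds ratios for increases of 1, 10,000, and 100,000 deaths
increments  <- c(1, 1e4, 1e5)
odds_ratios <- exp(b_deaths * increments)

data.frame(increase_in_deaths = increments, odds_ratio = odds_ratios)
```

On this scale, an increase of 100,000 deaths multiplies the odds of a row being Heart disease by about 1.05, which is a modest effect.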

For the intercept, the estimate is -2.312, which represents the log-odds of a row being Heart disease when Deaths = 0. That value is not especially meaningful on its own in a real-world sense, but it serves as the baseline from which the effect of Deaths is added. This model was selected as the final version since it was easier to explain than the larger models while still giving a statistically meaningful result.
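The intercept and slope can be turned into predicted probabilities with the inverse logit, plogis(). The sketch below uses the rounded estimates from the Model 4 summary and a few illustrative death counts that are not taken from the data:

```r
# Estimates copied from the Model 4 summary
b0       <- -2.312
b_deaths <- 5.024e-07

# Predicted probability of Heart disease at a few illustrative death counts
deaths <- c(0, 1e4, 1e5, 1e6)
prob   <- plogis(b0 + b_deaths * deaths)

data.frame(deaths = deaths, probability = round(prob, 4))
```

At Deaths = 0 the predicted probability is about 0.09, and it rises slowly as the death count grows.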

Confidence Interval

# Calculate a 95% confidence interval for the Deaths coefficient using the estimate plus or minus 1.96 times the standard error
deaths_estimate <- coef(summary(heart_model))["Deaths", "Estimate"]
deaths_se <- coef(summary(heart_model))["Deaths", "Std. Error"]

deaths_ci_lower <- deaths_estimate - 1.96 * deaths_se
deaths_ci_upper <- deaths_estimate + 1.96 * deaths_se

deaths_ci_lower

## [1] 9.903683e-08

deaths_ci_upper

## [1] 9.058363e-07

Confidence Interval Summary and Explanation

A 95% confidence interval for the Deaths coefficient in the final logistic regression model was calculated as the estimate plus or minus 1.96 times its standard error (a Wald interval). The resulting interval is approximately (9.90e-08, 9.06e-07), which gives a plausible range for the true Deaths coefficient. Because the entire interval lies above 0, the effect of Deaths appears to be positive: higher death counts are associated with higher log-odds that a row is Heart disease. This agrees with the earlier conclusion that the relationship is statistically significant at the 0.05 level.
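The same kind of Wald interval is available from the built-in confint.default() function in the stats package. The sketch below checks that it matches the manual calculation, using simulated data (`x`, `y`, and `fit` are made up for illustration and are not the report's dataset or model):

```r
# Simulated data for illustration only (not the report's dataset)
set.seed(10)
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(-2 + 0.5 * x))

fit <- glm(y ~ x, family = binomial)

# Manual Wald interval: estimate +/- z * standard error
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
z   <- qnorm(0.975)   # about 1.96
manual_ci <- c(est - z * se, est + z * se)

# Built-in Wald interval from stats::confint.default()
builtin_ci <- unname(confint.default(fit, parm = "x", level = 0.95)[1, ])

manual_ci
builtin_ci
```

Note that calling confint() on a glm object uses a profile-likelihood method instead, so its limits can differ slightly from the Wald interval.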

Week 10 Data Dive Summary

This Week 10 data dive showed that model choice matters as much as model execution: several reasonable-looking predictor sets were tested, but only the Deaths-only model gave a result that was both statistically significant and easy to interpret. The final logistic regression suggested that records with larger death counts were more likely to correspond to Heart disease, while the other candidate predictors did not add clear explanatory value in this dataset. The confidence interval confirmed that the Deaths coefficient is positive across its whole 95% range, not just at the point estimate. Overall, the analysis suggests that some variables in this dataset are more informative than others for classification, and it raises a useful follow-up question: could a differently structured response variable or additional predictors produce a stronger and more substantively meaningful model?
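For reference, the fit statistics from the four summaries can be lined up in one table. This is a sketch built from the printed values only; the short model labels and the parameter counts (read off the degrees of freedom) are my own shorthand.

```r
# Values copied from the four model summaries in this report
models <- data.frame(
  model        = c("Year + State", "Rates", "Deaths + Total_Population", "Deaths"),
  resid_dev    = c(6273.1, 6190.7, 6266.6, 6268.3),
  n_parameters = c(53, 3, 3, 2),
  aic_reported = c(6379.1, 6196.7, 6272.6, 6272.3)
)

# For 0/1 responses, AIC = residual deviance + 2 * number of parameters
models$aic_check <- models$resid_dev + 2 * models$n_parameters

# Sort from lowest (best) to highest AIC
models[order(models$aic_reported), ]
```

On AIC alone the two-rate model ranks first, so the choice of the Deaths-only model in this report rests on coefficient-level significance and ease of interpretation rather than on overall fit.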