Part 1 - Introduction

One of the many wonderful aspects of the United States of America is its diversity of people and culture. Historically, outside of major port and immigration centers, cultures have tended to cluster geographically. This manifests not only in genetic differences, but also in differences in diet and activity. This raises a question: is there a relationship between geographic location, as measured by region, division, or state, and the death rate (per 100,000) due to disease, possibly changing over time? In this research project we focus on heart disease.

knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, message = FALSE)

# --- SET RANDOM SEED FOR REPRODUCIBILITY --- 
random_seed <- 15
set.seed(random_seed)

#  --- LOAD LIBRARIES ---
library(recipes)
library(caret)
library(rpart.plot)
library(jsonlite)
library(MASS)
library(tidyverse)
library(ggthemes)
library(glmnet)
library(data.table)
library(rsample)
library(corrr)

# Will be used for US Maps
library(maps)
us_states <- map_data("state")

# Will be used to build our Deep Learning model
library(tensorflow)
library(keras)

# Set random seed for TensorFlow
if(Sys.info()['sysname'] == 'Windows') {
  # TensorFlow bug with setting random seed
  tensorflow::tf$random$set_seed(random_seed)
} else {
  use_session_with_seed(random_seed, 
                        disable_gpu = TRUE, 
                        disable_parallel_cpu = TRUE, 
                        quiet = FALSE)
}

#' Attempt to load cached data, falling back to remote URL loading
#'
#' To avoid re-downloading data from the internet every time, this
#' function will cache URL data the first time it is fetched; all future
#' calls will use the locally cached file rather than re-downloading
#' from the internet.
#' @param fn The local cache filename (character)
#' @param url The remote URL to load the data (character)
#' @param type The type of data (affects which loading function is used). Valid: csv, json (character)
#' @examples
#' myData <- load_cache_file('./data/myfile.csv', 'http://someurl.com/somefile.csv', 'csv')
#' @return The desired data as a data.frame
#' @export
load_cache_file <- function(fn, url, type) {
  # TODO - Add error checking of parameters (fn, url and type)

  if (!(file.exists(fn))) {
    # No local cache yet: download from the remote URL
    if (type == 'csv') {
      data <- fread(url)
    } else if (type == 'json') {
      data <- fromJSON(url, flatten = TRUE)
    }

    # Cache locally as CSV regardless of source type so later runs read the local copy
    write.csv(data, file = fn, row.names = FALSE)
  } else {
    # Cached copy exists: read it instead of re-downloading
    data <- fread(fn)
  }
  
  return(data)
}


#' Print a side-by-side Histogram and QQPlot of Residuals
#'
#' @param model A model
#' @examples
#' residPlot(myModel)
#' @return null
#' @export
residPlot <- function(model) {
  par(mfrow = c(1, 2))
  # Histogram of residuals with kernel density (blue) and fitted normal (red, dotted) overlays
  hist(model[["residuals"]], freq = FALSE, breaks = "fd",
       main = "Residual Histogram", xlab = "Residuals", col = "lightgreen")
  lines(density(model[["residuals"]], kernel = "ep"), col = "blue", lwd = 3)
  curve(dnorm(x, mean = mean(model[["residuals"]]), sd = sd(model[["residuals"]])),
        col = "red", lwd = 3, lty = "dotted", add = TRUE)
  # Normal Q-Q plot of the residuals
  qqnorm(model[["residuals"]], main = "Residual Q-Q plot")
  qqline(model[["residuals"]], col = "red", lwd = 3, lty = "dotted")
  par(mfrow = c(1, 1))
}


#' Calculate the R-Squared given observed and predicted data
#'
#' @param y     Observed values in an object that supports vector operations
#' @param y_hat Predicted values in an object that supports vector operations
#' @examples
#' r_squared(test_labels, predicted_labels)
#' @return Float, at most 1.0 (can be negative when predictions fit worse than the mean)
#' @export
r_squared <- function(y, y_hat) {
  return(1 - sum((y - y_hat)^2) / sum((y - mean(y))^2))
}
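
As a quick sanity check on toy vectors (not project data): perfect predictions should return 1 and predicting the mean should return 0.

# Toy check of r_squared() with illustrative vectors
r_squared(c(1, 2, 3), c(1, 2, 3))   # perfect fit -> 1
r_squared(c(1, 2, 3), rep(2, 3))    # predicting the mean -> 0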

Part 2 - Data

Load raw data


# --- Load LCD Data ---
# Note: we keep an original copy of the data in case we need to reference pre-modified data
LCD_fn <- './data/lcd.csv'
LCD_url <- 'https://data.cdc.gov/resource/bi63-dtpu.json?$limit=50000'
LCD_orig <- load_cache_file(LCD_fn, LCD_url, 'json')

LCD <- LCD_orig

# --- Load 2018 All data ---
CD1_fn <- './data/nst-est2018-alldata.csv'
CD1_url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/national/totals/nst-est2018-alldata.csv"
CD1_orig <- load_cache_file(CD1_fn, CD1_url, 'csv')

CD1 <- CD1_orig

# --- Load 2010 State Age/Sex Data ---
CD2_fn <- './data/st-est00int-agesex.csv'
CD2_url <- "https://www2.census.gov/programs-surveys/popest/datasets/2000-2010/intercensal/state/st-est00int-agesex.csv"
CD2_orig <- load_cache_file(CD2_fn, CD2_url, 'csv')

CD2 <- CD2_orig

Prepare Data Frames

# --- CLEAN LCD DATA ---
# fromJSON doesn't allow defining column classes on import
names(LCD) <- c("Year", "113 Cause Name", "Cause Name" ,"State" ,"Deaths" ,
                "Age-adjusted Death Rate")

LCD$Year <- as.integer(LCD$Year)
LCD$Deaths <- as.integer(LCD$Deaths)
LCD$`Age-adjusted Death Rate` <- as.double(LCD$`Age-adjusted Death Rate`)
setDT(LCD)

# --- EXTRACT CD1 DATA ---
C1Extract <- CD1[, c("NAME", paste0("POPESTIMATE", 2010:2017))]
setkey(C1Extract, "NAME")

# --- EXTRACT CD2 DATA ---
C2Extract <- CD2[SEX == 0L & AGE == 999L,
                 c("NAME",  paste0("POPESTIMATE", 2000:2009))]
setkey(C2Extract, "NAME")

# --- CREATE POPULATION DATAFRAME ---
Pop <- C2Extract[C1Extract, on = "NAME"]
setnames(Pop, names(Pop), c("State", 2000:2017))

Population <- melt(Pop, id.vars = "State", variable.name = "Year",
                   value.name = "Population", variable.factor = FALSE)
Population$Year <- as.integer(Population$Year)

# --- PREPARE HEART DISEASE DATAFRAME ---
DT <- Population[LCD[Year > 1999L], on = c("State", "Year")]
SD <- data.table(State = state.name,
                 Region = state.region,
                 Division = state.division)
DT <- SD[DT, on = "State"]
setkey(DT, Year, State)

DT[, `DeathRate` := Deaths / Population * 1e5]
DTHD <- DT[`Cause Name` == "Heart disease" &
              !(State %chin% c("United States",
                               "District of Columbia",
                               "Puerto Rico"))]

head(DTHD)

Source

The investigation will be based on publicly available data on the leading causes of death from the National Center for Health Statistics (NCHS). Population data will be based on statewide census estimates from the US Census Bureau for both 2010–2018 and 2000–2009.

ETL

The dataset from the CDC contains 10,868 observations covering 52 locations (the 50 states, the District of Columbia, and the United States as a whole) across 11 causes of death (the top 10 plus an all-causes total) over the years 1999–2017. In order to perform aggregations, the data will be joined with US statewide population estimates for those years. Focusing on heart disease leaves 50 location observations per year (the United States total is simply the sum of the rest, and the District of Columbia will be ignored) across the 18 years from 2000 to 2017.
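
These counts can be verified directly from the loaded table (a quick check against the LCD data.table prepared above):

# Sanity check of the raw NCHS extract: rows, locations, causes, and years
LCD[, .(Observations = .N,
        Locations    = uniqueN(State),
        Causes       = uniqueN(`Cause Name`),
        Years        = uniqueN(Year))]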

Because the age-adjusted death rate depends on unseen age-cohort populations, there is no accurate way to calculate division or regional aggregate rates without that information. Therefore the raw rate per 100,000 people, called DeathRate, will be used as the dependent variable.
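
For a concrete sense of the scale (illustrative figures only, not taken from the data):

# Hypothetical example: 20,000 heart-disease deaths in a state of 8,000,000 people
20000 / 8000000 * 1e5   # DeathRate = 250 deaths per 100,000 population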

The NCHS data will be downloaded via their API as a JSON file and converted to a data frame. The census data will be downloaded directly as CSV files and also converted to data frames. The population data, especially that for 2000–2010, requires tidying. Both data sets need to have the population estimate fields extracted from the larger collection of fields.
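
For a quick look at what the extracted wide-format estimates hold (using the C1Extract and C2Extract tables built above):

# Wide format: one row per location, one POPESTIMATE column per year
head(C1Extract, 3)
head(C2Extract, 3)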

The earlier data needs more work: it is additionally split by age group and sex, but includes totals for “all” as well. Knowing that both sexes combined is coded as 0 and all ages as 999, only rows with those codes are extracted, and the SEX and AGE columns themselves are then dropped.
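
Assuming that coding holds throughout the file, the filter should yield exactly one total-population row per location, which is easy to confirm:

# Number of "both sexes, all ages" rows should equal the number of distinct locations
CD2[SEX == 0L & AGE == 999L, .N]
uniqueN(CD2$NAME)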

Once the population fields are extracted, the two census tables are joined to each other using the “NAME” field (the geographic location) as the key. At this point the data is in “wide” format and needs to be converted to “long” format, especially as geographic divisions and regions need to be added. The data is therefore melted into long format, joined to the cause-of-death data, and then to a table of US regions and divisions. This becomes the master disease table. The heart-disease-specific rows are extracted into their own table for ease of analysis.
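
A spot check that the melt produced one row per State and Year (Alabama chosen arbitrarily):

# Long format: one Population value per State x Year combination
Population[State == "Alabama"][order(Year)][1:3]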

Part 3 - Exploratory data analysis

Summary Statistics

Summary statistics for the heart-disease data by region and division are below:

options(digits = 4L)

DTHD[, .(Mean = mean(DeathRate),
         SD = sd(DeathRate),
         Min = min(DeathRate),
         FirstQ = quantile(DeathRate, 0.25),
         Median = median(DeathRate),
         ThirdQ = quantile(DeathRate, 0.75),
         Max = max(DeathRate),
         IQR = IQR(DeathRate)), by = Region]

DTHD[, .(Mean = mean(DeathRate),
         SD = sd(DeathRate),
         Min = min(DeathRate),
         FirstQ = quantile(DeathRate, 0.25),
         Median = median(DeathRate),
         ThirdQ = quantile(DeathRate, 0.75),
         Max = max(DeathRate),
         IQR = IQR(DeathRate)), by = Division]

State View

Death rates are first displayed on a US map, shaded by state. A graphical view of the overall kernel-smoothed density and of the average rate over time is also shown below: both charts are faceted by state and colored by division, and in the trend chart the line type indicates the region.


# We need to merge/join state data to build ggplot maps. The death rate is
# averaged over 2000-2017 first; joining the full State x Year panel would
# duplicate every polygon vertex once per year and make the fill ambiguous.
us_states$region <- str_to_title(us_states$region)
DTHD_states <- left_join(us_states,
                         DTHD[, .(DeathRate = mean(DeathRate)), by = State],
                         by = c("region" = "State"))

# Display Deathrate on US Map
ggplot(data = DTHD_states,
       mapping = aes(x = long, y = lat, group = group, fill = DeathRate)) +
  geom_polygon(color = "gray30", size = 0.1) +
  coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
  scale_fill_gradient(low = "white", high = "#CB454A") +
  labs(title = "Death Rates") + 
  theme_map() + 
  labs(fill = "Death rate per 100,000 population",
       title = "Heart Related Deaths by State, 2000-2017") +
  theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))
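
A minimal sketch of the faceted views described at the top of this subsection, assuming the DTHD table from Part 2; the aesthetics here are illustrative rather than the report's final formatting:

# Kernel-smoothed density of death rates, faceted by state and colored by division
ggplot(DTHD, aes(x = DeathRate, fill = Division)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ State) +
  labs(x = "Death rate per 100,000 population", y = "Density")

# Average rate over time: color marks the division, line type marks the region
ggplot(DTHD, aes(x = Year, y = DeathRate, color = Division, linetype = Region)) +
  geom_line() +
  facet_wrap(~ State) +
  labs(y = "Death rate per 100,000 population")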