With a high rate of avoidable deaths and chronic disease, as well as an obesity rate two times higher than the OECD average, the forecast for American health is gloomy. With that in mind, the purpose of this project was ultimately to answer three questions: 1) Which United States counties are most favorable for an active, healthy lifestyle? 2) What are the differentiating characteristics that make them so? and 3) What might the best regression model be for modeling the relationship between our healthy lifestyle metric and these differentiating characteristics?
Our methodology for addressing these questions consisted of three phases: Data Gathering & Preprocessing, Data Exploration & Preparation, and Multivariate Regression Analysis. We created our dependent health score variable, identified the top 10 healthiest counties, extended assumptions, gathered independent variables, and brought all data into a consistent format. We then explored and prepared our dataset, identified our strongest and weakest predictors, and built our model with 10 of the original 24 independent variables.
With raw and transformed data, we then optimized on the coefficient of determination (R-squared), explored numerous linear regression and regularization models, and selected the best regression model for describing the relationship between healthy lifestyle and our independent variables. We used the R-squared and RMSE values as part of our selection criteria and found that a linear regression model applied to transformed data was our strongest model, largely due to its strong predictive performance on unseen data.
Key Words: Feature Creation, Feature Selection, County Level Health, Regression, Regularization
The U.S. spends more on health care than any other OECD country, yet it has the lowest life expectancy, the highest number of hospitalizations from preventable causes, the highest rate of avoidable deaths, and a chronic disease burden and obesity rate two times higher than the OECD average.
The forecast for American health is gloomy, yet identifying the issue could aid in addressing it. One way of doing so would be to highlight the areas across the United States where the general populace is living long and living well, while exploring why these areas are able to succeed. Recognizing influential correlations and shedding light on the areas most conducive to leading a long and healthy life may provide examples for emulation and answer the following:
In an attempt to answer these questions we created our own healthy lifestyle metric, gathered typical and atypical variables, explored their relationships, and then explored the efficacy of numerous regression models. The data contained entries for all 3006 American counties as well as corresponding states, health scores, populations, alcohol consumption rates and numerous features to be discussed during the Literature Review section.
For this project I focused my research and literature review around health outcome drivers and means of feature selection.
While the scope of this project was to explore related United States county data, I took a broad approach and considered international and higher-level data to best determine which variables might be fused to create the dependent variable and which variables might be best to consider as independent variables:
The first group of researched variables dealt with our dependent variable and the variables (perceived as) most correlated with healthy lifestyle.
Bhardwaj, Amiri, Buchwald, and Amram identified social and environmental correlates of healthy aging and longevity. They found that demographic, environmental, and social factors all play a significant role in predicting an individual’s likelihood to lead a healthy life and live until an older age. From their study, the importance of longevity as a factor in living a healthy lifestyle was reinforced, as were a number of factors we might consider as independent variables when building a regression model (ie. education, socioeconomic status).
Kaminsky, Lessler, Sharfstein, and Wallace, observed the effect of clustering based on sociodemographic context when considering county-level percentile health rankings vs. nationwide rankings. They considered smoking, motor vehicle crash deaths, and obesity as factors and found that clustering based on sociodemographic characteristics allowed for a better understanding of how other factors may shape the prevalence of health outcomes. From their study, the importance of obesity as a healthy lifestyle factor was reinforced, as was the importance of nuance when interpreting health outcome variables at a county level.
The final factor for consideration in the make-up of our dependent variable was physical activity. The motivation for its inclusion came from Azevedo, Hallal, Wells, and Victoria’s research. Although their study was primarily focused on physical activity amongst adolescents, they considered the short term and long term effects of physical activity and found that “there is an indirect effect on all health benefits resulting from adult physical activity”.
Our dependent variable was a fusion of longevity, obesity, and physical activity county-level data. These three are important factors that take short and long term health into consideration, and thus their amalgamation provides a strong representation of healthy lifestyle (as a metric) at a county-level.
An, Li, and Jiang explored the impact of environmental determinants (via EQI) on physical inactivity among U.S. adults at the county level. Although environment and physical activity vary substantially across the United States, the study identified an inverse relationship between environmental quality and physical inactivity.
Holick and Wacker provided an in-depth glance into the incredible health benefits of sun exposure and increased vitamin D production. They cite an inverse association with latitude and many chronic illnesses including: cancer, autoimmune disease, diabetes, depression, seasonal affect disorder, hypertension, etc. and recommended increased sun exposure (without sun burn) and vitamin D fortification and supplementation as a natural remedy to many of these ailments.
From consideration of the environmental quality index (EQI) and yearly sunlight, we move to that of socialization and crowding by considering population density and the effect (positive or negative) it may have on the lifestyle of local inhabitants. Sven Bremberg analyzed the role population density played on life expectancy and years lost, and found that mortality rates were consistently high in less densely populated municipalities in Finland, Norway, and Sweden. We might extend, from this, the notion that (to a certain point) a higher population density is more favorable for a longer, healthier life.
In A National Study of the Association Between Food Environments and County Level Health Outcomes, Ahern, Brown, and Dukas applied linear regression models to estimate the relationship between food availability and access variables (ie. per capita grocery stores) and health outcomes. Positive health outcomes were linked to more per capita full-service restaurants, grocery stores, fitness and recreation centers, and direct farm sales, while negative health outcomes were linked to more fast-food restaurants and convenience stores. As a general takeaway, we might extend that access to nutrient-dense foods is favorable to a healthier lifestyle whereas living in a "food desert" would favor the opposite.
In Smoking, alcohol consumption and mental health: Data from the Brazilian study of Cardiovascular Risks in Adolescents (ERICA) Ferreira, Jardim, Jardim, Sousa, and Rosa found that second hand smoke, smoking, and alcohol consumption are all adversely associated with psychological distress in the adolescent population. When we extend this from adolescent Brazilians to consider adult Americans, it will be interesting to observe the strength of correlation between bad habits at scale and a county’s standing health-wise.
Islam, Leer, Teo, et al. found that among patients with coronary heart disease or a prior stroke event, the prevalence of healthy lifestyle behaviors was relatively low regardless of whether the nation under consideration was high, medium, or low income, but the behaviors were at their lowest levels in poorer countries. From this, we may extend that higher income counties (in our situation) are more likely to follow a healthy lifestyle – not smoking, quality diet, regular exercise, etc.
In Length of Unemployment and Health-related Outcomes, Hammarstrom, Janlert, and Winefield studied the relationship between cumulative length of intermittent spells of unemployment and different health-related outcomes using data from a longitudinal study of school leavers in a mid-size town in northern Sweden. Respondents were followed for 14 years and the researchers concluded that ‘cumulative length of unemployment is correlated with deteriorated health and health behavior’. Extended from this mid-size Swedish town to the US at a county-level, we may expect a strong, negative correlation between unemployment and healthy lifestyle.
One of the major differentiating factors between this research project and those made reference to above lies in the fact that I created my own dependent variable (as a combination of health factor variables) and then read in a broad range of health and non-health related independent variables. The research above motivated the creation of a custom dataset to then be explored for relationships between variables.
To succinctly summarize the preceding write up, the source, type, and name of each variable under consideration were put into tabular form:
Source | Type | Variable |
---|---|---|
Bhardwaj, Amiri, Buchwald, and Amram | Dependent | Longevity |
Kaminsky, Lessler, Sharfstein, and Wallace | Dependent | Obesity |
Azevedo, Hallal, Wells, and Victoria | Dependent | Physical Activity |
An, Li, and Jiang | Independent | EQI (Environmental Quality) |
Holick and Wacker | Independent | Sunlight |
Bremberg | Independent | Population Density |
Ahern, Brown, and Dukas | Independent | Nutrition |
Ferreira, Jardim, Jardim, Sousa, and Rosa | Independent | Alcohol Consumption |
Islam, Leer, Teo, et al. | Independent | Income |
Hammarstrom, Janlert, and Winefield | Independent | Unemployment |
Julian Faraway’s Linear Models with R and Simon Sheather’s A Modern Approach to Regression with R proved invaluable as references for applied regression, feature selection, model building and model selection. All Literature Review sources are cited in the Literature section of the Bibliography.
Our research was focused on attempting to answer the following questions:
Our methodology for answering these questions was a three step process: Data Gathering & Preprocessing, Data Exploration & Preparation, and Multivariate Regression Analysis. The nature of each is described briefly below:
We train-test split our data, maintained evaluation metrics for each model, and then compared and contrasted model performances on seen (training) and unseen (testing) data.
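As a rough illustration of this workflow, the sketch below mirrors the 75/25 train-test split used in the Appendix; the select_df name and split ratio are taken from that code, and everything else is illustrative.

```r
#Ensure reproducibility, then hold out 25% of counties for testing
set.seed(123)
dt <- sort(sample(nrow(select_df), nrow(select_df) * 0.75))
train <- select_df[dt, ]   #seen data: used to fit each model
test  <- select_df[-dt, ]  #unseen data: used to compare model performance
```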
In order to create our dependent ‘health score’ variable, I considered life expectancy, obesity, and physical activity data downloaded from the Institute for Health Metrics and Evaluation. The data were then converted to a csv-compatible form, a subset of columns was selected for consideration, and the following sub-variables were normalized and aggregated across multiple iterations in the manner documented below.
- Life expectancy: Male life expectancy, 2010 (years); Female life expectancy, 2010 (years); Difference in male life expectancy, 1985-2010 (years); and Difference in female life expectancy, 1985-2010 (years).
- Obesity: Male obesity prevalence, 2009 (%); Female obesity prevalence, 2009 (%); Difference in male obesity prevalence, 2001-2009 (percentage points); and Difference in female obesity prevalence, 2001-2009 (percentage points).
- Physical activity: Male sufficient physical activity prevalence, 2009 (%); Female sufficient physical activity prevalence, 2009 (%); Difference in male sufficient physical activity prevalence, 2001-2009 (percentage points); and Difference in female sufficient physical activity prevalence, 2001-2009 (percentage points).
For the sake of conciseness, the majority of exploratory details and plots for our dependent variable’s creation have been omitted (the code is available in the Appendix). Before moving on, though, a number of interesting observations were made regarding our sub-variables’ distributions:
Our dependent variable was built upon these sub-variables, and as such, the notes above provide important context regarding the target variable upon which this project is based.
For each over-arching variable (life expectancy, obesity, or physical activity), we read in the 4 corresponding variables listed above, normalized, compiled the over-arching variable as a summation of normalized sub-variables and then normalized the result.
The normalization of sub-variables and (later) over-arching variables was done in order to bring all data to a 0-to-1 scale via the following equation:
\[ Transformed.Values = \frac{(Values - Min)}{(Max - Min)} \]
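In R, this min-max scaling can be expressed as a small helper (a sketch consistent with the norm_minmax() function used in the Appendix; the name normalize is illustrative):

```r
#Min-max normalization: rescales any numeric vector to the 0-to-1 range
normalize <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))

#Example: normalize one life expectancy sub-variable
#n_m1 <- normalize(life_table$`Male life expectancy, 2010 (years)`)
```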
Upon normalization of our over-arching variables, we created our dependent ‘healthy lifestyle’ variable as a combination of longevity, obesity, and physical activity:
\[ Lifestyle = Normalized.Life - Normalized.Obesity + Normalized.Activity \]
The result was normalized, bringing our cycle of three normalization and two aggregation passes to a close, with a dependent variable on a 0 to 1 scale as shown in the distribution below:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3708 0.4642 0.4639 0.5546 1.0000
For our ‘healthy lifestyle’ metric, we observed a roughly normal distribution whose peak was centered between 0.4 and 0.5. Consulting the summary statistics, we verified a mean of 0.4639 and a median of 0.4642, confirming a slight left skew in our dependent variable’s distribution.
Using our health metric, we then filtered county data for the top 10 healthiest counties:
State | County | health_score |
---|---|---|
Colorado | Routt | 1.0000000 |
California | Marin | 0.9889641 |
Colorado | Pitkin | 0.9833580 |
Wyoming | Teton | 0.9778477 |
Colorado | Eagle | 0.9745991 |
California | San Francisco | 0.9139634 |
Colorado | Summit | 0.9072650 |
Colorado | Douglas | 0.9054401 |
Utah | Summit | 0.9014540 |
Colorado | Gunnison | 0.8978052 |
From the list above, we noted six Colorado counties, two California counties, one Utah county, and one Wyoming county, and extended assumptions regarding factors that may come into play for our top 10 healthiest counties:
Later, we would revisit these assumptions while selecting features based on their importance. Greater importance would be indicative of higher likelihood of the feature in question being one of our differentiating factors (recall our second question).
With our health metric created and a brief investigation into the top 10 healthiest counties, we moved on to reading in, exploring, and preparing our independent variables.
We sought county-level data where the counties of the United States would form the basis of our observations and variables could include numerous standard and non-standard health-related metrics:
- Alcohol Consumption and Heart Disease, sourced from the IHME.
- Education, Unemployment, and Population, sourced from the USDA.
- Environmental Quality Index, sourced from the EPA.
- Food Insecurity, sourced by request from Feeding America.
- Sunlight, sourced from CDC Wonder.
- Poverty and Median Income, sourced from the US Census Bureau.

While the IHME is an independent global health research center associated with the University of Washington and Feeding America is one of the largest nonprofit organizations in the United States, every other data source was connected to the United States government. The reputability and dependability of these sources and institutions motivated their selection.
Datasets were downloaded from their respective platforms, simplified to a .csv-compatible form (obviously impertinent variables and header rows were dropped), uploaded to GitHub, and then read in via readr’s read_csv() function. This operation was completed nine times because the majority of our variables were housed in separate datasets.
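A minimal sketch of one such read, using the GitHub raw URL shown in the Appendix; read_csv() pulls the pre-cleaned .csv directly from the repository:

```r
library(readr)
library(dplyr)

#Read one pre-cleaned dataset directly from GitHub and coerce to a tibble
alcohol <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/AlcoholConsumption.csv")
alcohol <- as_tibble(alcohol)
alcohol <- alcohol[-1, ]  #drop the national-level summary row
```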
Bringing each of our sources to a consistent format was of the utmost importance for us to merge dataframes and (later) explore our data.
In order to do so:
In order to do so:
- where State names were abbreviated, they were converted to the full state name,
- excess verbiage (ie. "County", "Parish", "Borough") was removed from the County variable, and
- dataframes were left-joined on matching State and County observations (a minimal sketch of one such join follows below).

The result of merging all variables into one master dataframe was a 3154 observation x 25 variable dataframe. With our master dataset created, we proceeded to its exploration and preparation.
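As an illustration of the join step referenced above, a minimal sketch using sqldf (mirroring the Appendix): each source dataframe is left-joined onto the health-score dataframe on matching State and County values.

```r
library(sqldf)

#Trim white space so county names match across sources
alcohol$County    <- trimws(alcohol$County)
starter_df$County <- trimws(starter_df$County)

#Left join the alcohol variables onto the health-score dataframe
df <- sqldf("SELECT *
             FROM starter_df
             LEFT JOIN alcohol
               ON starter_df.State = alcohol.State
              AND starter_df.County = alcohol.County")
```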
For our exploratory approach, we got to know our data’s structure and value ranges, visualized the relationship our variables had with one another and the target variable via correlation matrix, and explored the pertinence of each variable and the corresponding distribution for features we would select. We then prepared our data (handled NA’s and outliers, normalized, and created features) and proceeded to optimize a linear regression model based on our transformed data.
We utilized the glimpse() and summary() functions to gain insight into the dimensions, variable characteristics, and value ranges of our dataset:
## Rows: 3,154
## Columns: 25
## $ State <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",~
## $ County <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "Bu~
## $ health_score <dbl> 0.4597290, 0.5185262, 0.2292573, 0.2954267, 0.2025867,~
## $ Hvy <dbl> 14.7, 16.6, 16.9, 15.4, 14.3, 18.8, 16.8, 13.9, 17.7, ~
## $ HvyPctChg <dbl> -2.7, 1.2, 17.5, 1.4, 3.1, 21.4, 8.7, 12.4, 10.9, 12.1~
## $ Bng <dbl> 31.1, 30.4, 32.9, 31.9, 29.4, 38.1, 33.6, 31.0, 33.9, ~
## $ BngPctChg <dbl> -1.1, -1.8, 3.4, -2.3, -3.4, 15.9, 8.1, 3.9, 0.0, -5.0~
## $ Mortality <dbl> 316.36, 272.04, 255.09, 378.09, 307.90, 322.56, 382.55~
## $ MortalityChg <dbl> -52.47, -36.95, -45.95, -31.16, -39.89, -68.95, -39.47~
## $ LTHighSchool <dbl> 11.5, 9.2, 26.8, 20.9, 19.5, 25.3, 15.0, 15.6, 18.4, 1~
## $ HighSchool <dbl> 33.6, 27.7, 35.6, 44.9, 33.4, 40.3, 45.2, 32.8, 36.7, ~
## $ SomeCollege <dbl> 28.4, 31.3, 26.0, 23.8, 34.0, 22.3, 23.7, 33.2, 31.6, ~
## $ College <dbl> 26.6, 31.9, 11.6, 10.4, 13.1, 12.1, 16.1, 18.5, 13.3, ~
## $ EQI <dbl> 0.284681678, 0.754969418, -0.005637936, 0.304039747, 1~
## $ FoodInsecurity <dbl> 15.6, 12.9, 21.9, 15.1, 13.6, 20.5, 19.1, 17.4, 16.4, ~
## $ Sun <dbl> 18557.98, 19101.34, 18642.40, 18282.04, 17606.36, 1866~
## $ Unemployment <dbl> 2.7, 2.7, 3.8, 3.1, 2.7, 3.6, 3.6, 3.5, 2.9, 2.9, 2.7,~
## $ UnemploymentChg <dbl> -2.4, -2.6, -4.5, -3.3, -2.7, -3.2, -3.3, -3.0, -2.5, ~
## $ Poverty <dbl> 12.1, 10.1, 27.1, 20.3, 16.3, 30.0, 21.6, 17.2, 19.6, ~
## $ Income <dbl> 58233, 59871, 35972, 47918, 52902, 31906, 39944, 47747~
## $ Population <dbl> 55869, 223234, 24686, 22394, 57826, 10101, 19448, 1136~
## $ Births <dbl> 624, 2304, 256, 240, 651, 109, 213, 1269, 354, 222, 55~
## $ Deaths <dbl> 541, 2326, 312, 252, 657, 109, 272, 1532, 441, 343, 50~
## $ NetMig <dbl> 254, 5377, -128, 41, 65, -73, -123, -461, -259, 303, 2~
## $ PopChg <dbl> 1298, 40969, -2771, -521, 504, -813, -1499, -4967, -96~
We noted 3154 observations x 25 variables: 2 categorical variables, 23 numeric variables, significant NA counts for numerous variables, and quite a difference in the magnitude of values based on the variables.
Our categorical variables were State, our state identifier, and County, our county identifier. These variables were not of much use for anything beyond identification and would be excluded.

Our dependent variable, health_score, was quantitative and provided a measure of ‘healthy lifestyle’ as a function of longevity, obesity, and physical activity. This would be our “target” variable, the one against which we’d measure the impact of our independent variables.

As for our quantitative, independent variables:
- Hvy: % of population that drinks heavily. “Heavy” drinking is defined as the consumption, on average, of more than one drink per day for women or two drinks per day for men in the past 30 days.
- HvyPctChg: % change in heavy drinking from 2005 to 2009.
- Bng: % of population that binge drinks. “Binge” drinking is defined as the consumption of more than four drinks for women or five drinks for men on a single occasion at least once in the past 30 days.
- BngPctChg: % change in binge drinking from 2005 to 2009.
- Mortality: mortality rate in 2014 as a result of heart disease.
- MortalityChg: % change in mortality rate from 2005 to 2014 as a result of heart disease.
- LTHighSchool: % of population educated less than high school from 2015-2019.
- HighSchool: % of population educated at a high school level (and no further) from 2015-2019.
- SomeCollege: % of population who received some college education from 2015-2019.
- College: % of population who completed at least a Bachelor’s level education from 2015-2019.
- EQI: Environmental Quality Index.
- FoodInsecurity: % of population whose food intake was disrupted by lack of resources in 2018.
- Sun: average daily sunlight (KJ/m^2) in 2011.
- Unemployment: rate of unemployment in 2019.
- UnemploymentChg: change in unemployment rate from 2015 to 2019.
- Poverty: rate of poverty in 2019.
- Income: median household income in 2019.
- Population: county-level population for 2019.
- Births: county-level births for 2019.
- Deaths: county-level deaths for 2019.
- NetMig: county-level net migration for 2019.
- PopChg: county-level rate of population change from 2010 to 2019.

We dropped NA values from consideration (150 observations) and turned our attention to exploring the relationship our independent variables had with one another and with the target via correlation matrix. We considered only variables with a correlation significance > 0.2 in our plot:
From the above plot, we extended that multicollinearity was a concern. As such, we considered eliminating a large portion of the variables we’d carried up to this point and proceeding only with those offering strong predictive promise and / or unique value added (ie. BngPctChg or Sun).
Based on these guidelines, it appeared that we should proceed with at least BngPctChg, Mortality, College, FoodInsecurity, Unemployment, Income, and PopChg. These variables were relatively unique from one another and had a strong correlation with health_score.
To clarify, we consulted a table (code available in the Appendix) with the proportion of missing data and the correlation with our target variable. We found that none of our independent variables were missing data, that Hvy, MortalityChg, and NetMig had a weak correlation with health_score, and that the remainder of our variables carried a relatively strong correlation with the target.
To address weak target correlation and multicollinearity, we considered the exclusion of 13 variables: Hvy, MortalityChg, and NetMig due to weak correlation with the target variable, and HvyPctChg, Bng, LTHighSchool, HighSchool, SomeCollege, Unemployment, Poverty, Population, Births, and Deaths due to multicollinearity.
Before doing so, we utilized the Boruta function for feature ranking and selection as confirmation. In doing so, MortalityChg, BngPctChg, Unemployment, HvyPctChg, and Population were identified as the variables with the lowest importance.
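For reference, a minimal sketch of the Boruta call (consistent with the Appendix code); importance is estimated against shuffled "shadow" copies of each feature over repeated random-forest runs:

```r
library(Boruta)

#Rank every candidate predictor of health_score against its shadow copy
boruta_output <- Boruta(health_score ~ ., data = na.omit(df_num), doTrace = 0, maxRuns = 1000)

#Keep confirmed (and tentative) attributes, then visualize importance
boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)
plot(boruta_output, cex.axis = .7, las = 2, xlab = "", main = "Variable Importance")
```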
In addition, we observed that Mortality, College, FoodInsecurity, LTHighSchool, Sun, EQI, Income, PopChg, Bng, and UnemploymentChg were our strongest predictors.
Earlier, from the top 10 healthiest counties, we’d extended the assumption that sunshine, median income, sparser population clusters, and friendliness to an active, healthy lifestyle (ie. environment) may be differentiating factors. From our list of the strongest predictors, we confirm that these variables do indeed play a role in county-level health.
Their inclusion added to our ability to build a strong model. Based on variable importance, correlation with health_score, and multicollinearity, we proceeded with these variables and visited their corresponding histograms:
From the histograms above, we noted a number of non-normal (ie. PopChg), bimodal (ie. Sun), right-skewed (ie. College), left-skewed (ie. EQI), and relatively normal (ie. Mortality) distributions.
The wide-ranging difference in scales and non-normal distributions could have proven problematic in applying generalized linear regression models, so we elected, as a part of our data preparation, to explore the impact of normalization. Additionally, we would explore the impact of outlier removal and feature creation (ie. converting non-normal variables to dummy “flag” variables) to address the issue of non-normal distributions.
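As one hedged example of the dummy "flag" features mentioned above (the variable and cutoff here are illustrative, not the exact features created), a heavily skewed count such as PopChg can be recoded as a binary indicator of population growth:

```r
#Hypothetical dummy "flag" feature: 1 if the county gained population, 0 otherwise
select_df$PopGrowthFlag <- ifelse(select_df$PopChg > 0, 1, 0)
```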
First, we generated a baseline model with our 10 independent variables and observed an R^2 of 0.6603. Once we’d established our baseline model and performance statistics, we set out to optimize the results.
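The baseline fit itself is a one-line ordinary least squares model (as in the Appendix):

```r
#Baseline: regress health_score on all 10 selected predictors in the training set
model_1 <- lm(health_score ~ ., data = train)
summary(model_1)  #Multiple R-squared reported as 0.6603
```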
We proceeded to use R^2, the coefficient of determination, to assess the impact of each step on the strength of our model. If a step had no (or a negative) impact, it was deemed inessential, whereas if it had a positive impact, it was deemed essential. The results of each step are summarized below:
As a final transformation step, we utilized the stepAIC() function to optimize our model based on AIC score - a measure of goodness of fit and model complexity. During feature creation, we’d added 7 features and we wanted to ensure that each of these features was indeed improving our predictive capability rather than just making our model more complex.
After AIC optimization, model complexity was reduced from 17 to 16 independent variables, and predictive capability was maintained. It was a useful step that appears to have enhanced our model.
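A minimal sketch of the stepwise AIC optimization, assuming a fitted linear model on the transformed training data (the model name model_transformed is illustrative):

```r
library(MASS)

#Stepwise search (both directions) that trades goodness of fit against complexity via AIC
model_transformed_aic <- stepAIC(model_transformed, direction = "both", trace = FALSE)
summary(model_transformed_aic)
```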
As a next step, we’d proceed with model building and selection.
In order to strike the greatest balance between model complexity, goodness of fit and error, we explored two more linear regression models before moving on to regularization.
We explored linear regression models with raw and AIC-optimized data, and proceeded to use regularization (ridge and lasso) regression models on our transformed and raw data. The aim was to explore different models and select the strongest amongst them. To do so, we train-test split our data (raw and transformed), constructed our models, and then evaluated each model’s performance on seen and unseen data.
To conclude, we compared and contrasted the performance of all of our models using the R-squared and Root Mean Square Error (RMSE) as a part of our selection criteria.
As a means of comparing our (earlier) transformed model to a true baseline, a model with all of the original features and no operations performed, we considered a model with all 22 independent variables (from earlier).
The perceived strength of this “raw” model would have been the number of features whereas the weakness would be the higher incidence of over-fitting.
With this in mind, we also considered an AIC-optimized model. A model where variables were automatically selected in a step-wise manner based on the predictive potential that their inclusion might bring.
Corresponding performance metrics are noted in the ‘Model Selection’ section.
Ridge regression is an extension of linear regression where model complexity and the potential for over-fitting are addressed by adding a penalty parameter to penalize large coefficients.
One of the major differences between linear and regularized regression is the use of a lambda tuning parameter. To automate the task of finding the optimal lambda value, we made use of the cv.glmnet() function.
When lambda is zero, the penalty term has no effect, whereas when lambda increases to infinity, the shrinkage penalty grows and our coefficients approach zero. Our optimal lambda was computed as 0.003162278 (for raw data) and we can see from the above plot the critical role optimal lambda computation can have when evaluating our model.
We used the glmnet package to build our regularized regression models (with alpha = 0 for ridge regression and alpha = 1 for lasso regression). Because glmnet() does not work with dataframes, we used the dummyVars() function from the caret package to define our model matrices and then used the predict() function to create numeric model matrices for both the training and test data.
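A condensed sketch of that workflow, assuming train and test dataframes with health_score as the target (object names are illustrative; alpha = 0 gives ridge, alpha = 1 gives lasso):

```r
library(caret)
library(glmnet)

#Build numeric model matrices, since glmnet() does not accept dataframes
dummies <- dummyVars(health_score ~ ., data = train)
train_x <- predict(dummies, newdata = train)
test_x  <- predict(dummies, newdata = test)
train_y <- train$health_score

#Cross-validate to find the optimal lambda, then fit the ridge model
cv_ridge    <- cv.glmnet(train_x, train_y, alpha = 0)
ridge_model <- glmnet(train_x, train_y, alpha = 0, lambda = cv_ridge$lambda.min)

#Predict on unseen counties
ridge_preds <- predict(ridge_model, newx = test_x)
```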
Our first ridge regression model was trained on raw data while the second was trained on transformed data. Performance metrics are noted in the ‘Model Selection’ section.
LASSO (Least Absolute Shrinkage and Selection Operator) regression is an extension of linear regression where model complexity and the potential for over-fitting are addressed by limiting the sum of the absolute values of model coefficients.
Similar to ridge regression, the first step was to find the optimal lambda value and we automated this task through the use of the cv.glmnet() function.
When lambda approaches zero, the loss function of our model approaches the OLS function and we consider more variables, whereas when lambda grows, the regularization term has a greater effect and we see fewer variables in the model. Our optimal lambda was computed as 0.001 for raw data. We observed in the plots above the effect our lambda value (on different scales) had on corresponding coefficient values.
A similar model-fitting approach to ridge regression was taken (utilizing the glmnet package), with our first LASSO model trained on raw data and the second trained on transformed data. Performance metrics are noted in the ‘Model Selection’ section.
We output the performance metrics of all models in a table to select the model with the strongest predictive promise. We evaluated performance using the R-squared value and Root Mean Squared Error (RMSE). Lower RMSE and higher R-squared values were indicative of a stronger model.
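A minimal sketch of how these two metrics can be computed for any fitted model's test-set predictions (object names such as model_3 are illustrative; the Appendix uses an analogous eval_metrics() helper):

```r
#R-squared and RMSE on unseen (test) data
actual <- test$health_score
preds  <- predict(model_3, newdata = test)   #model_3: linear regression on transformed data

rss  <- sum((actual - preds)^2)
tss  <- sum((actual - mean(actual))^2)
r2   <- 1 - rss / tss
rmse <- sqrt(mean((actual - preds)^2))
c(R2 = r2, RMSE = rmse)
```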
To succinctly summarize our model optimization attempts and results, we captured the method used, the data the method was applied to, as well as the number of variables considered, and corresponding evaluation metrics for raw and transformed data:
Model | Method | Data | Var_Num | R2_train | RMSE_train | R2_test | RMSE_test |
---|---|---|---|---|---|---|---|
1 | Linear | raw | 22 | 0.6678 | 0.0759 | 0.6678 | 0.0827 |
2 | Linear | raw(AIC) | 17 | 0.673 | 0.0759 | 0.673 | 0.0842 |
3 | Linear | transformed | 16 | 0.671 | 0.0764 | 0.71 | 0.0815 |
4 | Ridge | raw | 22 | 0.6734 | 0.0761 | 0.659 | 0.0789 |
5 | Ridge | transformed | 16 | 0.6732 | 0.0764 | 0.632 | 0.0814 |
6 | Lasso | raw | 22 | 0.6681 | 0.0767 | 0.6642 | 0.0783 |
7 | Lasso | transformed | 16 | 0.6724 | 0.0765 | 0.6354 | 0.0811 |
From the above model statistics, we can extend that:
- Var_Num: Models 3, 5, and 7 are favorable for model complexity, having the lowest variable count (16). Our data transformations appear to have led to the optimal level of simplicity, followed closely by the AIC-optimized raw linear model.
- R2_train: Model 4 is the highest performing, followed by Models 5, 2, 7, and then 3. A higher R-squared value in this column is indicative of stronger predictive performance on (seen) training data. Regularization and raw (broader) data appear to be correlated with stronger predictive outcomes on training data. That said, over-training can lead to over-fitting, and thus this metric is secondary to performance on unseen data.
- RMSE_train: Models 1, 2, and then 4 have the lowest error for (seen) training data.
- R2_test: Model 3 is the highest performing (by a relatively significant margin), followed by Models 2, 1, 6, and then 4. A higher R-squared value in this column is indicative of stronger predictive performance on (unseen) testing data. Our data transformations appear to have led to the strongest predictive model for unseen data.
- RMSE_test: Models 6, 4, 7, 5, and then 3 have the lowest error for (unseen) testing data. Regularization methods appear to reduce error.

To find the point of balance between model complexity, goodness of fit, and error, we elected Model 3. Model 3, linear regression applied to transformed data, was the most promising of models due to its simplicity, relatively low error rate (all models performed well here), and strong performance on unseen data. This last metric, unseen-data performance, was the deciding factor.
We started out with the goal of answering the following 3 questions:
Because readily-available data suited to the scope of this project was difficult to find, the scope and aim of the project expanded to include the creation of a healthy lifestyle metric and the compilation of a set of independent variables to be read in alongside it. In other words, a large portion of this project was dedicated to the creation of the dataset we would consider for questions 2 and 3.
In the data gathering and pre-processing phase, we created our dependent health score variable as a combination of life expectancy, obesity, and physical activity data read in from the IHME website. Over-arching metrics (a combination of 4 sub-metrics) were summed and normalized to bring our variable onto a 0 to 1 scale. We also utilized our health score to filter for the top 10 healthiest counties. We then extended a few assumptions regarding the factors that might come into play for the healthiest counties and we took these factors into consideration when reading in our independent variables. Finally, we read in county-level standard and non-standard health-related variables (ie. Alcohol Consumption, Heart Disease, and Education) from different sources. We then brought the data from each of these sources into a consistent format so that we could merge data frames and explore our data.
In the data exploration and preparation phase, we first interpreted our 3154 observations x 25 variables. We then dropped impertinent variables and NA values, identified our strongest and weakest predictors (strongest predictors included: Mortality, College, FoodInsecurity, LTHighSchool, Sun, EQI, Income, PopChg, Bng, and UnemploymentChg), and proceeded with only 10 of the original 24 independent variables (due to multicollinearity and weak target-correlation). Through normalization, feature creation, and AIC-optimization, we optimized on the coefficient of determination (R-squared). We found our optimal (transformed) linear regression model included 16 independent variables with a coefficient of determination of 0.6734.
In the multivariate regression analysis phase, we expanded upon this transformed linear regression model. We set out to explore linear regression models with raw data and AIC-optimized data, and then proceeded to use regularization (ridge and lasso) regression models on both our transformed and raw data. The aim was to explore different models and select the strongest amongst them. To do so, we train-test split our data (raw and transformed), constructed our models, and then evaluated each model’s performance on seen and unseen data. We then output the performance metrics of all models in a table, where the R-squared and RMSE values were used as part of our selection criterion. The linear regression model applied to transformed data was found to be the strongest model, largely due to its stand-out performance on unseen (testing) data.
Note: 3154 was greater than the 3009 observations which we’d noted as the number of US counties. We considered a broader range of “territories” (ie. in Alaska) where the observations were consistent across our dataset.
Over the course of the project, while researching, and especially during the final phase, I realized a number of ways in which this project could have been taken to a higher level and I’d like to share these potential refinements here.
They include, but are not limited to:
In completing this research project, reference was made to the following:
Being that data gathering and pre-processing was a major focus of this project, I’ve dedicated a separate RPubs publication to document the creation of our dependent ‘healthy lifestyle’ metric as well as the pre-processing of our independent variables. Note: some light adaptations to this original code have been made above.
#Import libraries
library(tidyverse)
library(dplyr)
library(readr)
library(ggplot2)
library(RCurl)
library(rvest)
library(stringr)
library(tidyr)
library(kableExtra)
library(BBmisc)
library(tm)
library(sqldf)
library(inspectdf)
library(corrplot)
library(MASS)
library(caret)
library(glmnet)
options(scipen = 9)
set.seed(123)
#---User-defined function(s)---#
#Adapted correlation matrix used in EDA section:
plot_corr_matrix <- function(dataframe, significance_threshold){
  title <- paste0('Correlation Matrix for significance > ', significance_threshold)
  df_cor <- dataframe %>% mutate_if(is.character, as.factor)
  df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)
  #run a correlation and drop the insignificant ones
  corr <- cor(df_cor, use = 'na.or.complete')
  #prepare to drop duplicates and correlations of 1
  corr[lower.tri(corr, diag = TRUE)] <- NA
  #drop perfect correlations
  corr[corr == 1] <- NA
  #turn into a 3-column table
  corr <- as.data.frame(as.table(corr))
  #remove the NA values from above
  corr <- na.omit(corr)
  #select significant values
  corr <- subset(corr, abs(Freq) > significance_threshold)
  #sort by highest correlation
  corr <- corr[order(-abs(corr$Freq)), ]
  #print table
  # print(corr)
  #turn corr back into matrix in order to plot with corrplot
  mtx_corr <- reshape2::acast(corr, Var1 ~ Var2, value.var = "Freq")
  #plot correlations visually
  corrplot(mtx_corr,
           title = title,
           mar = c(0, 0, 1, 0),
           method = 'color',
           tl.col = "black",
           na.label = " ",
           addCoef.col = 'black',
           number.cex = .9)
}
#Evaluation metrics used in Model Building / Selection section:
eval_metrics = function(model, df, predictions, target){
  resids = df[, target] - predictions
  resids2 = resids**2
  N = length(predictions)
  r2 = as.character(round(summary(model)$r.squared, 4))
  adj_r2 = as.character(round(summary(model)$adj.r.squared, 4))
  print(adj_r2) #Adjusted R-squared
  print(as.character(round(sqrt(sum(resids2)/N), 4))) #RMSE
}
## --- Model Selection --- ##
#Create Kable table to succinctly summarize variable research
Source <- c('Bhardwaj, Amiri, Buchwald, and Amram', 'Kaminsky, Lessler, Sharfstein, and Wallace', 'Azevedo, Hallal, Wells, and Victoria', 'An, Li, and Jiang', 'Holick and Wacker', 'Bremberg', 'Ahern, Brown, and Dukas', 'Ferreira, Jardim, Jardim, Sousa, and Rosa', 'Islam, Leer, Teo, et al.', 'Hammarstrom, Janlert, and Winefield')
Type <- c('Dependent', 'Dependent', 'Dependent', 'Independent', 'Independent', 'Independent', 'Independent', 'Independent', 'Independent', 'Independent')
Variable <- c('Longevity', 'Obesity', 'Physical Activity', 'EQI (Environmental Quality)', 'Sunlight', 'Population Density', 'Nutrition', 'Alcohol Consumption', 'Income', 'Unemployment')

LRVS <- cbind(Source, Type, Variable)

LRVS %>%
  kbl(caption = "Literature Review: Variable Summary") %>%
  kable_minimal() %>%
  kable_styling(latex_options = "hold_position")
## --- Dependent Variable Creation --- ##
#Read in life expectancy data, convert to tibble, and select pertinent columns:
longevity <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/IHME_LifeExpectancy.csv")
life_table <- as_tibble(longevity)
life_table <- life_table %>% dplyr::select(1:2,13:16) %>% na.omit()

#Read in obesity data, convert to tibble, and select pertinent columns:
obesity <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/IHME_Obesity.csv")
obesity_table <- as_tibble(obesity)
obesity_table <- obesity_table %>% dplyr::select(1:2,5:6,9:10) %>% na.omit()

#Read in physical activity data, convert to tibble, and select pertinent columns:
activity <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/IHME_PhysicalActivity.csv")
act_table <- as_tibble(activity)
act_table <- act_table %>% dplyr::select(1:2,5:6,9:10) %>% na.omit()
###LIFE EXPECTANCY DATA: exploration, normalization, and compilation
#Explore life expectancy data at a county level:
glimpse(life_table)
summary(life_table)
#Extract variables of interest
m1 <- life_table$`Male life expectancy, 2010 (years)`
f1 <- life_table$`Female life expectancy, 2010 (years)`
dm1 <- life_table$`Difference in male life expectancy, 1985-2010 (years)`
df1 <- life_table$`Difference in female life expectancy, 1985-2010 (years)`

#Normalize data scale to be from 0 to 1
n_m1 = (m1-min(m1))/(max(m1)-min(m1))
n_f1 = (f1-min(f1))/(max(f1)-min(f1))
n_dm1 = (dm1-min(dm1))/(max(dm1)-min(dm1))
n_df1 = (df1-min(df1))/(max(df1)-min(df1))
#Histogram of original vs. normalized data
##Life expectancy histograms
par(mfrow=c(2,2))
hist(m1, breaks=10, xlab="Age (years)", col="lightblue", main="Male life expectancy, 2010")
hist(n_m1, breaks=10, xlab="Normalized Age (years)", col="lightblue", main="Male life expectancy, 2010")
hist(f1, breaks=10, xlab="Age (years)", col="lightblue", main="Female life expectancy, 2010")
hist(n_f1, breaks=10, xlab="Normalized Age (years)", col="lightblue", main="Female life expectancy, 2010")
##Longevity improvement histograms
par(mfrow=c(2,2))
hist(dm1, breaks=10, xlab="Age (years)", col="lightblue", main="Male longevity improvement, 1985-2010")
hist(n_dm1, breaks=10, xlab="Normalized Age (years)", col="lightblue", main="Male longevity improvement, 1985-2010")
hist(df1, breaks=10, xlab="Age (years)", col="lightblue", main="Female longevity improvement, 1985-2010")
hist(n_df1, breaks=10, xlab="Normalized Age (years)", col="lightblue", main="Female longevity improvement, 1985-2010")
#Add normalized variables together
life <- n_m1 + n_dm1 + n_f1 + n_df1

#Normalize longevity to 0-1 range
n_life = (life-min(life))/(max(life)-min(life))
#head(n_life)
#Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(life, breaks=10, xlab="Score", col="lightblue", main="Longevity metric")
hist(n_life, breaks=10, xlab="Normalized Score", col="lightblue", main="Longevity metric")
summary(n_life) #slight left skew
###OBESITY DATA: exploration, normalization, and compilation
#Explore obesity data at a county level:
glimpse(obesity_table)
summary(obesity_table)
#Extract variables of interest
m2 <- obesity_table$`Male obesity prevalence, 2009 (%)`
f2 <- obesity_table$`Female obesity prevalence, 2009 (%)`
dm2 <- obesity_table$`Difference in male obesity prevalence, 2001-2009 (percentage points)`
df2 <- obesity_table$`Difference in female obesity prevalence, 2001-2009 (percentage points)`

#Normalize
n_m2 = (m2-min(m2))/(max(m2)-min(m2))
n_f2 = (f2-min(f2))/(max(f2)-min(f2))
n_dm2 = (dm2-min(dm2))/(max(dm2)-min(dm2))
n_df2 = (df2-min(df2))/(max(df2)-min(df2))
#Histogram of original vs. normalized data
par(mfrow=c(2,2))
hist(m2, breaks=10, xlab="Obesity rate (%)", col="lightblue", main="Male obesity prevalence, 2009")
hist(n_m2, breaks=10, xlab="Normalized obesity rate (%)", col="lightblue", main="Male obesity prevalence, 2009")
hist(f2, breaks=10, xlab="Obesity rate (%)", col="lightblue", main="Female obesity prevalence, 2009")
hist(n_f2, breaks=10, xlab="Normalized obesity rate (%)", col="lightblue", main="Female obesity prevalence, 2009")
par(mfrow=c(2,2))
hist(dm2, breaks=10, xlab="Obesity rate (%)", col="lightblue", main="Male obesity increase, 2001-2009")
hist(n_dm2, breaks=10, xlab="Normalized obesity rate (%)", col="lightblue", main="Male obesity increase, 2001-2009")
hist(df2, breaks=10, xlab="Obesity rate (%)", col="lightblue", main="Female obesity increase, 2001-2009")
hist(n_df2, breaks=10, xlab="Normalized obesity rate (%)", col="lightblue", main="Female obesity increase, 2001-2009")
#Add normalized variables together
fat <- n_m2 + n_dm2 + n_f2 + n_df2

#Normalize obesity to 0-1 range
n_fat = (fat-min(fat))/(max(fat)-min(fat))
#head(n_fat)
# Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(fat, breaks=10, xlab="Score", col="lightblue", main="Obesity metric")
hist(n_fat, breaks=10, xlab="Normalized Score", col="lightblue", main="Obesity metric")
summary(n_fat) #right skewed
###PHYSICAL ACTIVITY DATA: exploration, normalization, and compilation
glimpse(act_table)
summary(act_table)
#Extract variables of interest
m3 <- act_table$`Male sufficient physical activity prevalence, 2009 (%)`
f3 <- act_table$`Female sufficient physical activity prevalence, 2009 (%)`
dm3 <- act_table$`Difference in male sufficient physical activity prevalence, 2001-2009 (percentage points)`
df3 <- act_table$`Difference in female sufficient physical activity prevalence, 2001-2009 (percentage points)`

#Normalize data scale to be from 0 to 1
n_m3 = (m3-min(m3))/(max(m3)-min(m3))
n_f3 = (f3-min(f3))/(max(f3)-min(f3))
n_dm3 = (dm3-min(dm3))/(max(dm3)-min(dm3))
n_df3 = (df3-min(df3))/(max(df3)-min(df3))
#Histogram of original vs. normalized data
par(mfrow=c(2,2))
hist(m3, breaks=10, xlab="Physical activity rate (%)", col="lightblue", main="Male activity prevalence, 2009")
hist(n_m3, breaks=10, xlab="Normalized physical activity rate (%)", col="lightblue", main="Male activity prevalence, 2009")
hist(f3, breaks=10, xlab="Physical activity rate (%)", col="lightblue", main="Female activity prevalence, 2009")
hist(n_f3, breaks=10, xlab="Normalized obesity rate (%)", col="lightblue", main="Female activity prevalence, 2009")
par(mfrow=c(2,2))
hist(dm3, breaks=10, xlab="Physical activity rate (%)", col="lightblue", main="Male activity difference, 2001-2009")
hist(n_dm3, breaks=10, xlab="Normalized physical activity rate (%)", col="lightblue", main="Male activity difference, 2001-2009")
hist(df3, breaks=10, xlab="Physical activity rate (%)", col="lightblue", main="Female activity difference, 2009")
hist(n_df3, breaks=10, xlab="Normalized obesity rate (%)", col="lightblue", main="Female activity difference, 2009")
#Add normalized variables together
active <- n_m3 + n_dm3 + n_f3 + n_df3

#Normalize activity to 0-1 range
n_active = (active-min(active))/(max(active)-min(active))
#head(n_active)
# Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(active, breaks=10, xlab="Score", col="lightblue", main="Physical activity metric")
hist(n_active, breaks=10, xlab="Normalized Score", col="lightblue", main="Physical activity metric")
summary(n_active) #slight right skew
###DEPENDENT VARIABLE CREATION: sum, normalize, and plot
#Add normalized variables together
lifestyle <- n_life - n_fat + n_active

#Normalize health to 0-1 range
normalized_lifestyle = (lifestyle-min(lifestyle))/(max(lifestyle)-min(lifestyle))
#Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(lifestyle, breaks=10, xlab="Score", col="lightblue", main="Health metric")
hist(normalized_lifestyle, breaks=10, xlab="Normalized Score", col="lightblue", main="Health metric")
summary(normalized_lifestyle)
#head(normalized_lifestyle)
#create new df with state | county | health score
starter_df <- life_table %>%
  dplyr::select(1:2)
starter_df$health_score <- normalized_lifestyle

healthiest_counties <- filter(starter_df, `health_score` > 0.895) #top 10
healthiest_counties <- healthiest_counties[order(-healthiest_counties$`health_score`),] #descending order
#head(starter_df)
#nrow(healthiest_counties) #10

healthiest_counties %>%
  kbl() %>%
  kable_minimal() %>%
  kable_styling(latex_options = "hold_position")
## --- Independent Variable Preprocessing --- ##
#Read in alcohol data, convert to tibble, and drop impertinent observation:
alcohol <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/AlcoholConsumption.csv")
alcohol <- as_tibble(alcohol)
alcohol <- alcohol[-1,] #drop State = National
#dim(alcohol) #3178 x 6
alcohol <- alcohol[alcohol$Location != alcohol$State,] #drop state observations from Location column
#dim(alcohol) #3178 - 3127 = 51 observations dropped

#rename columns
alcohol <- alcohol %>% rename(
  County = Location,
  Hvy = Hvy_2012,
  Bng = Binge_2012,
  HvyPctChg = `HvyPctChg_2005-2012`,
  BngPctChg = `BingePctChg_2005-2012`)

#drop excess verbage from County column
stopwords <- c("and", "Area", "Borough", "Census", "City", "County", "Division", "Municipality", "Parish")
alcohol$County <- removeWords(alcohol$County, stopwords)
#head(alcohol)
#Read in heart data, convert to tibble, and drop impertinent observation:
heart <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/CardiovascularDisease.csv")
heart <- as_tibble(heart)
heart <- subset(heart, select = -c(`Mortality Rate, 2010*`)) #drop 2010
heart <- heart[-1,] #drop State = National
#dim(heart) #3193 x 4

# remove State == Location
heart <- heart[heart$Location != heart$State,]
dim(heart) #3193 - 3143 = 50 observations removed

# retitle columns
heart <- heart %>% rename(
  County = Location,
  Mortality_2005 = `Mortality Rate, 2005*`,
  Mortality_2014 = `Mortality Rate, 2014*`)

# retain value EXCLUSIVELY for Mortality Rate columns
heart$Mortality_2005 <- gsub("\\s*\\([^\\)]+\\)","",as.character(heart$Mortality_2005))
heart$Mortality_2014 <- gsub("\\s*\\([^\\)]+\\)","",as.character(heart$Mortality_2014))

#convert columns to proper type
heart$Mortality_2005 <- as.double(heart$Mortality_2005)
heart$Mortality_2014 <- as.double(heart$Mortality_2014)

#drop excess verbage from County column
heart$County <- removeWords(heart$County, stopwords)
heart$County <- gsub("(.*),.*", "\\1", heart$County) #remove everything after comma

# add Chg column
heart$MortalityChg <- heart$Mortality_2014 - heart$Mortality_2005

#finalize format of df
heart <- subset(heart, select = -c(`Mortality_2005`)) #drop 2005
heart <- heart %>% rename(Mortality = Mortality_2014)
#head(heart)
#Read in education data, convert to tibble, and drop impertinent observation:
education <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Education.csv")
education <- as_tibble(education)
education <- education[-1,] #drop State = National

education$State <- state.name[match(education$State, state.abb)] #convert state abbreviation to name

education <- education[education$`Area name` != education$State,] #drop state observations from Area name column
#dim(education) #3281 - 3233 = 48 observations dropped

#rename columns
education <- education %>% rename(
  County = `Area name`,
  LTHighSchool = `Percent of adults with less than a high school diploma, 2015-19`,
  HighSchool = `Percent of adults with a high school diploma only, 2015-19`,
  SomeCollege = `Percent of adults completing some college or associate's degree, 2015-19`,
  College = `Percent of adults with a bachelor's degree or higher, 2015-19`)

#drop excess verbage from County column
education$County <- removeWords(education$County, stopwords)

#head(education) #verify
#Read in eqi data and convert to tibble:
eqi <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/EnvironmentalQualityIndex.csv")
eqi <- as_tibble(eqi)
eqi <- subset(eqi, select = -c(3:7)) #drop indices that make up EQI score

eqi$State <- state.name[match(eqi$State, state.abb)] #convert state abbreviation to name

#rename columns
eqi <- eqi %>% rename(
  County = County_Name,
  EQI = environmental_quality_index)

#drop excess verbage from County column
eqi$County <- removeWords(eqi$County, stopwords)

#head(eqi) #verify
#dim(eqi) #3281 x 6
#Read in food insecurity data and convert to tibble:
food <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/FoodInsecurity.csv")
food <- as_tibble(food)
#head(food)

#drop FIPS
food <- subset(food, select=-c(FIPS))
#dim(food) #3142 x 3: no need to drop observations

#convert State to full name
food$State <- state.name[match(food$State, state.abb)] #convert state abbreviation to name

#rename columns
food <- food %>% rename(
  County = `County, State`,
  FoodInsecurity = `2018 Food Insecurity Rate`)

#remove excess verbage from County
food$County <- removeWords(food$County, stopwords)
food$County <- gsub("(.*),.*", "\\1", food$County) #remove everything after comma

#drop % from Food Insecurity and convert to double
food$FoodInsecurity = as.double(gsub("[\\%,]", "", food$FoodInsecurity))

#head(food) #verify
#Read in sun data and convert to tibble:
sun <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Sunlight.csv")
sun <- as_tibble(sun)
#dim(sun) #3161 x 3

#rename column
sun <- sun %>% rename(Sun = `Avg Daily Sunlight`)

#drop excess verbage from County column
sun$County <- removeWords(sun$County, stopwords)
sun$County <- gsub("(.*),.*", "\\1", sun$County) #remove everything after comma

#head(sun) #verify
#Read in unemployment data and convert to tibble:
unemp <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Unemployment.csv")
unemp <- as_tibble(unemp)
unemp <- unemp[-1,] #drop State = National
unemp <- subset(unemp, select=-c(3)) #drop 2016

unemp$State <- state.name[match(unemp$State, state.abb)] #convert state abbreviation to name
unemp <- unemp[unemp$area_name != unemp$State,] #drop state observations from Area name column

unemp <- unemp %>% rename(
  County = area_name,
  Unemployment = Unemployment_rate_2019,
  UnemploymentChg = `Unemployment_chg_2016-2019`) #rename columns
unemp$County <- removeWords(unemp$County, stopwords) #drop excess verbage from County column
unemp$County <- gsub("(.*),.*", "\\1", unemp$County) #remove everything after comma

#dim(unemp) #3274 - 3224 = 50 dropped
#head(unemp) #verify
#Read in wealth data and convert to tibble:
wealth <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Wealth.csv")
wealth <- as_tibble(wealth)
wealth <- wealth[-1,] #drop State = National
#dim(wealth) #3193 x 4

wealth$State <- state.name[match(wealth$State, state.abb)] #convert state abbreviation to name
wealth <- wealth[wealth$County != wealth$State,] #drop state observations from County column
#dim(wealth) #3193 - 3143 = 50 observations dropped

#convert columns to proper type
wealth$PovertyRate <- as.double(wealth$PovertyRate)
wealth$MedianHouseholdIncome <- as.numeric(gsub(",","",wealth$MedianHouseholdIncome))

#rename columns
wealth <- wealth %>% rename(
  Poverty = PovertyRate,
  Income = MedianHouseholdIncome)
wealth$County <- removeWords(wealth$County, stopwords) #drop excess verbage from County column

#head(wealth)
#Read in population data and convert to tibble:
pop <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/population.csv")
pop <- as_tibble(pop)

# rename columns
pop <- pop %>% rename(
  State = STNAME,
  County = CTYNAME,
  Pop_2010 = CENSUS2010POP,
  Population = POPESTIMATE2019,
  Births = BIRTHS2019,
  Deaths = DEATHS2019,
  NetMig = NETMIG2019)

#add population change variable
pop$PopChg <- pop$Population - pop$Pop_2010

pop <- subset(pop, select=-c(3)) #drop 2010
#dim(pop) #3193 x 7

pop <- pop[pop$County != pop$State,] #drop state observations from County column
#dim(pop) #3193 - 3141 = 52 observations dropped

pop$County[1802] <- "Dona Ana County" #invalid UTF-8
pop$County <- removeWords(pop$County, stopwords) #drop excess verbage from County column

#head(pop)
## --- Merge df's --- ##
##1. merge health score and alcohol df's
#Trim white space
$County <- trimws(alcohol$County)
alcohol$County <- trimws(starter_df$County)
starter_df#SQL join
<- sqldf("SELECT *
df FROM starter_df
LEFT JOIN alcohol ON starter_df.State = alcohol.State AND starter_df.County = alcohol.County")
#Remove extra State, County columns
<- subset(df, select=-c(4,5))
df
##2. merge heart to df
#Trim white space
$County <- trimws(heart$County)
heart#SQL join
<- sqldf("SELECT *
df FROM df
LEFT JOIN heart ON df.State = heart.State AND df.County = heart.County")
#remove extra State, County columns
<- subset(df, select=-c(8,9))
df
##3. merge education to df
#Trim white space
$County <- trimws(education$County)
education#SQL join
<- sqldf("SELECT *
df FROM df
LEFT JOIN education ON df.State = education.State AND df.County = education.County")
#remove extra State, County columns
<- subset(df, select=-c(10,11))
df
##4. merge eqi to df
#Trim white space
$County <- trimws(eqi$County)
eqi#SQL join
<- sqldf("SELECT *
df FROM df
LEFT JOIN eqi ON df.State = eqi.State AND df.County = eqi.County")
#remove extra State, County columns
<- subset(df, select=-c(14,15))
df
##5. merge food to df
#Trim white space
$County <- trimws(food$County)
food#SQL join
<- sqldf("SELECT *
df FROM df
LEFT JOIN food ON df.State = food.State AND df.County = food.County")
#remove extra State, County columns
<- subset(df, select=-c(15,16))
df
##6. merge sun to df
#Trim white space
$County <- trimws(sun$County)
sun#SQL join
<- sqldf("SELECT *
df FROM df
LEFT JOIN sun ON df.State = sun.State AND df.County = sun.County")
#remove extra State, County columns
<- subset(df, select=-c(16,17))
df
##7. merge unemp to df
#Trim white space
unemp$County <- trimws(unemp$County)
#SQL join
df <- sqldf("SELECT *
             FROM df
             LEFT JOIN unemp ON df.State = unemp.State AND df.County = unemp.County")
#remove extra State, County columns
df <- subset(df, select=-c(17,18))
##8. merge wealth to df
#Trim white space
wealth$County <- trimws(wealth$County)
#SQL join
df <- sqldf("SELECT *
             FROM df
             LEFT JOIN wealth ON df.State = wealth.State AND df.County = wealth.County")
#remove extra State, County columns
df <- subset(df, select=-c(19,20))
##9. merge pop to df
#Trim white space
pop$County <- trimws(pop$County)
#SQL join
df <- sqldf("SELECT *
             FROM df
             LEFT JOIN pop ON df.State = pop.State AND df.County = pop.County")
#remove extra State, County columns
df <- subset(df, select=-c(21,22))
##verify variables and dimensions
head(df)
dim(df) #3154 x 25
## --- EDA --- ##
#baseline EDA: glimpse() and summary()
glimpse(df)
#summary(df)
#drop NAs from consideration
df <- drop_na(df)
#dim(df) #3154 - 3004 = 150 dropped observations
#Select numeric variables
df_num <- as.data.frame(df[3:25])
#dim(df_num)
#head(df_num)
#Utilize custom-built correlation matrix generation function
plot_corr_matrix(df_num, 0.2)
#Compute proportion of missing data per variable
v <- colnames(df_num)
incomplete <- function(x) sum(!complete.cases(x)) / 3004
Missing_Data <- sapply(df_num[v], incomplete)
#head(Missing_Data) #verify
#Compute correlation between each variable and TARGET
target_corr <- function(x, y) cor(y, x, use = "na.or.complete")
HealthScore_Corr <- sapply(df_num[v], target_corr, y=df_num$health_score)
#head(HealthScore_Corr) #verify
#Bind and output Missing Data and Correlation with Target
MDHSC <- data.frame(cbind(Missing_Data, HealthScore_Corr))
MDHSC %>%
  kbl(caption = "Proportion of Missing Data vs. Correlation with Health Score") %>%
  kable_minimal() %>%
  kable_styling(latex_options = "hold_position")
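#Note: target_corr() is plain Pearson correlation with incomplete pairs dropped, e.g.
#cor(c(1, 2, 3, NA), c(2, 4, 6, 8), use = "na.or.complete") evaluates to 1 because only the three
#complete pairs are used; Missing_Data is the count of incomplete rows divided by the 3004 counties.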
#Utilize Boruta for feature ranking and selection
library(Boruta)
# Perform Boruta search
boruta_output <- Boruta(health_score ~ ., data=na.omit(df_num), doTrace=0, maxRuns = 1000)
#Get significant variables including tentatives
boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)
#print(boruta_signif)
# Plot variable importance
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")
#Feature selection (address weak target correlation and multicollinearity)
select_df <- df_num[, c(1,4,6,8,11:14,16,18,23)]
#head(select_df) #verify
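#Note: Boruta wraps random forest importance -- it appends permuted "shadow" copies of every
#feature and confirms only the real features whose importance reliably beats the best shadow.
#attStats(boruta_output) would list the importance statistic and Confirmed/Tentative/Rejected
#decision per variable. The manual column selection above then drops weakly correlated and
#collinear predictors, leaving health_score plus 10 predictors in select_df.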
#Histograms for all variables
select_df %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free", ncol=4) +
  geom_histogram(bins=20, color="darkblue", fill="lightblue")
#Ensure reproducibility
set.seed(123)
#Train-test split data
dt <- sort(sample(nrow(select_df), nrow(select_df)*.75))
train <- select_df[dt,]
test <- select_df[-dt,]
#dim(train) #2253 x 11
#dim(test) # 751 x 11
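#Note: sample() draws a random 75% of row indices, so the set.seed(123) call above is what makes
#the 2253/751 split (and every downstream result that depends on it) reproducible.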
###LINEAR REGRESSION (baseline model): prior to outlier handling, normalization, feature engineering
model_1 <- lm(health_score ~ ., data = train)
summary(model_1) #R^2 = 0.6603, vars = 10
#Imputation / removal (address NAs): 0 missing values
#Normalization: 0 impact on R^2
norm_minmax <- function(x){(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))}
norm_df <- as.data.frame(lapply(train, norm_minmax))
#head(norm_df) #verify
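#Note: norm_minmax() rescales a vector to [0, 1]; quick sanity check (illustrative only):
#norm_minmax(c(2, 4, 6)) #returns 0.0 0.5 1.0
#Because min-max scaling is a linear transformation, it leaves R^2 unchanged, which is why
#norm_df is not carried forward.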
#Outlier handling: reduced R^2
#cooksD <- cooks.distance(model_1)
#influential <- as.numeric(names(cooksD)[(cooksD > (10 * mean(cooksD, na.rm = TRUE)))])
#train[influential,] #verify outliers - 20 rows
#out_df <- train[-influential,]
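#Note: the commented-out block above flags rows whose Cook's distance exceeds 10x the mean
#distance as influential (about 20 rows) and would refit without them; it stays commented out
#because removing those rows reduced R^2.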
#Feature engineering
#combine datasets so we don't have to make features twice
train$dataset <- 'train'
test$dataset <- 'test'
final_df <- rbind(train, test)

#Creating new features (started with 1st, 3rd quartile values then adjusted)
##strong EQI and lots of sun
final_df$env_and_sun <- as.factor(ifelse(final_df$EQI > 0.90017 & final_df$Sun > 16350, 1, 0))
##high proportion college graduates and high income (3rd quartile)
final_df$college_and_income <- as.factor(ifelse(final_df$College > 20.0 & final_df$Income > 65000, 1, 0))
##low binge drinking and high income (older, more responsible?)
final_df$lowDrink_and_HiIncome <- as.factor(ifelse(final_df$Bng < 32.5 & final_df$Income > 61629, 1, 0))
##high binge drinking and high income (expendable income, socially active)
final_df$HiDrink_and_HiIncome <- as.factor(ifelse(final_df$Bng > 35.0 & final_df$Income > 61629, 1, 0))
##high food insecurity and high mortality and less than HS education
final_df$LowFood_HiDeath_LowEd <- as.factor(ifelse(final_df$FoodInsecurity > 15.7 & final_df$Mortality > 500.00 & final_df$LTHighSchool > 20.0, 1, 0))
##low income and less than HS education and growth in unemployment
final_df$LoIncome_LowEd_HiUnemp <- as.factor(ifelse(final_df$Income < 30000 & final_df$UnemploymentChg < -1.7 & final_df$LTHighSchool > 16.8, 1, 0))
##big PopChg and low unemployment
final_df$HiPop_LoUnemp <- as.factor(ifelse(final_df$PopChg > 1595.5 & final_df$UnemploymentChg > -1, 1, 0))
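#Note: each engineered feature is a binary factor built from roughly quartile-based cutoffs on
#the underlying variables (e.g. env_and_sun = 1 only when a county clears both the EQI and Sun
#thresholds); as factors, they are later expanded into indicator columns by dummyVars() for glmnet.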
#Transformed (model 2): after handling outliers, normalization, feature engineering
train2 <- final_df %>% filter(dataset == 'train') %>% dplyr::select(-dataset)
test2 <- final_df %>% filter(dataset == 'test') %>% dplyr::select(-dataset)
dat <- final_df %>% dplyr::select(-dataset)
model_2 <- lm(health_score ~ ., data = train2)
#summary(model_2) #R^2 = 0.6735, vars = 17
###LINEAR REGRESSION (transformed model): features created, AIC optimized
#stepAIC(model_2)
model_3 <- lm(formula = health_score ~ Mortality + College + FoodInsecurity +
    LTHighSchool + Sun + EQI + Income + PopChg + Bng + UnemploymentChg +
    env_and_sun + lowDrink_and_HiIncome + HiDrink_and_HiIncome +
    LowFood_HiDeath_LowEd + LoIncome_LowEd_HiUnemp + HiPop_LoUnemp,
    data = train2)
#summary(model_3) #R^2 = 0.6734, vars = 16
# Predict and evaluate model_3 (transformed lm) on training data
predictions <- predict(model_3, newdata = train2)
eval_metrics(model_3, train2, predictions, target = 'health_score') #0.671, 0.0764
# Predict and evaluate model_3 (transformed lm) on testing data
predictions <- predict(model_3, newdata = test2)
eval_metrics(model_3, test2, predictions, target = 'health_score') #0.671, 0.0815
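#Note: stepAIC(model_2) performs stepwise selection, trading goodness of fit against model size
#via AIC; the formula it returned (model_3) keeps 16 of model_2's 17 predictors, dropping only
#college_and_income.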
## --- Model Building --- ##
###LINEAR REGRESSION (raw data)
#Train-test split data
dt2 <- sort(sample(nrow(df_num), nrow(df_num)*.75))
train_raw <- df_num[dt2,]
test_raw <- df_num[-dt2,]
#dim(train_raw) #2253 x 23
#dim(test_raw) # 751 x 23
#Train raw_lm model:
raw_lm <- lm(health_score ~ ., data = train_raw)
#summary(raw_lm) #R^2 = 0.6808, vars = 22
# Predict and evaluate raw_lm model on training data
predictions <- predict(raw_lm, newdata = train_raw)
eval_metrics(raw_lm, train_raw, predictions, target = 'health_score') #0.6678, 0.0759
# Predict and evaluate raw_lm model on testing data
predictions <- predict(raw_lm, newdata = test_raw)
eval_metrics(raw_lm, test_raw, predictions, target = 'health_score') #0.6678, 0.0827
###LINEAR REGRESSION (raw data, AIC optimized)
#stepAIC(raw_lm)
#Train aic_raw_lm model:
aic_raw_lm <- lm(formula = health_score ~ Hvy + HvyPctChg + Bng + Mortality +
    HighSchool + SomeCollege + College + EQI + FoodInsecurity +
    Sun + Unemployment + UnemploymentChg + Poverty + Population +
    Births + Deaths + NetMig, data = train_raw)
#summary(aic_raw_lm) #R^2 = 0.6802, vars = 17
# Predict and evaluate aic_raw_lm model on training data
predictions <- predict(aic_raw_lm, newdata = train_raw)
eval_metrics(aic_raw_lm, train_raw, predictions, target = 'health_score') #0.6671, 0.0761
# Predict and evaluate aic_raw_lm model on testing data
predictions <- predict(aic_raw_lm, newdata = test_raw)
eval_metrics(aic_raw_lm, test_raw, predictions, target = 'health_score') #0.6671, 0.0828
###RIDGE REGRESSION (raw data)
#Specify column names
cr_raw <- c('Hvy', 'HvyPctChg', 'BngPctChg', 'MortalityChg', 'HighSchool', 'SomeCollege', 'College', 'Unemployment', 'Poverty', 'Population', 'Births', 'Deaths', 'NetMig', 'Mortality', 'FoodInsecurity', 'LTHighSchool', 'Sun', 'EQI', 'Income', 'PopChg', 'Bng', 'UnemploymentChg', 'health_score')
#Generate dummy variables from data (if applicable)
dummies <- dummyVars(health_score ~ ., data = df_num[,cr_raw])
train_dummies <- predict(dummies, newdata = train_raw[,cr_raw]) #2253 x 22
test_dummies <- predict(dummies, newdata = test_raw[,cr_raw]) #751 x 22
print(dim(train_dummies)); print(dim(test_dummies))
#Create numeric model matrices
x <- as.matrix(train_dummies)
y_train <- train_raw$health_score
x_test <- as.matrix(test_dummies)
y_test <- test_raw$health_score
lambdas <- 10^seq(2, -3, by = -.1) #specify lambda sequence
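#Note: glmnet() expects numeric matrices rather than data frames, so dummyVars() (from caret)
#expands any factor features into 0/1 indicator columns before as.matrix() is applied; for the
#all-numeric raw data the dummy step is effectively a pass-through. The lambda grid above spans
#10^2 down to 10^-3 on a log scale.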
#Train model
raw_ridge_reg <- glmnet(x, y_train, nlambda = 25, alpha = 0, family = 'gaussian', lambda = lambdas)
summary(raw_ridge_reg)
#Compute optimal lambda
raw_cv_ridge <- cv.glmnet(x, y_train, alpha = 0, lambda = lambdas)
ol <- raw_cv_ridge$lambda.min
ol #0.003162278
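#Note: alpha = 0 makes glmnet fit ridge regression, i.e. it minimizes RSS + lambda * sum(beta^2),
#shrinking coefficients toward zero without eliminating any. cv.glmnet() cross-validates over the
#lambda grid (10 folds by default) and lambda.min is the lambda with the lowest cross-validated
#error; ol is reused below as the s argument to predict().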
#Compute R^2 and RMSE from true and predicted values
eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  R_square <- round((1 - SSE / SST), 4)
  RMSE <- round(sqrt(SSE / nrow(df)), 4)
  # Model performance metrics
  data.frame(RMSE = RMSE, Rsquare = R_square)
}
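#Note: eval_results() reports R^2 = 1 - SSE/SST and RMSE = sqrt(SSE/n), where n is the number of
#rows in the supplied data frame; both are rounded to 4 decimal places.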
#Predict and evaluate raw_ridge_reg model on training data
predictions_train <- predict(raw_ridge_reg, s = ol, newx = x)
eval_results(y_train, predictions_train, train_raw) #RMSE: 0.0761, RSquare: 0.6734
#Predict and evaluate raw_ridge_reg model on testing data
predictions_test <- predict(raw_ridge_reg, s = ol, newx = x_test)
eval_results(y_test, predictions_test, test_raw) #RMSE: 0.0789, RSquare: 0.659
#Ridge regression lambda plot
plot(raw_ridge_reg)
###RIDGE REGRESSION (transformed data)
#Specify column names
cols_reg <- c('Mortality', 'College', 'FoodInsecurity', 'LTHighSchool', 'Sun', 'EQI', 'Income', 'PopChg', 'Bng', 'UnemploymentChg', 'env_and_sun', 'lowDrink_and_HiIncome', 'HiDrink_and_HiIncome', 'LowFood_HiDeath_LowEd', 'LoIncome_LowEd_HiUnemp', 'HiPop_LoUnemp', 'health_score')
#Generate dummy variables from data (if applicable)
dummies <- dummyVars(health_score ~ ., data = dat[,cols_reg])
train_dummies2 <- predict(dummies, newdata = train2[,cols_reg]) #2253 x 22
test_dummies2 <- predict(dummies, newdata = test2[,cols_reg]) #751 x 22
print(dim(train_dummies2)); print(dim(test_dummies2))
#Create numeric model matrices
x2 <- as.matrix(train_dummies2)
y_train2 <- train2$health_score
x_test2 <- as.matrix(test_dummies2)
y_test2 <- test2$health_score
#Train model
ridge_reg <- glmnet(x2, y_train2, nlambda = 25, alpha = 0, family = 'gaussian', lambda = lambdas)
summary(ridge_reg)
#Compute optimal lambda
cv_ridge2 <- cv.glmnet(x2, y_train2, alpha = 0, lambda = lambdas)
ol2 <- cv_ridge2$lambda.min
ol2 #0.003162278
#Predict and evaluate ridge_reg model on training data
predictions_train2 <- predict(ridge_reg, s = ol2, newx = x2)
eval_results(y_train2, predictions_train2, train2) #RMSE: 0.0764, RSquare: 0.6732
#Predict and evaluate ridge_reg model on testing data
predictions_test2 <- predict(ridge_reg, s = ol2, newx = x_test2)
eval_results(y_test2, predictions_test2, test2) #RMSE: 0.0814, RSquare: 0.632
###LASSO REGRESSION (raw data)
# Setting alpha = 1 implements lasso regression
lasso_reg <- cv.glmnet(x, y_train, alpha = 1, lambda = lambdas, standardize = TRUE, nfolds = 5)
#Compute optimal lambda
lambda_best <- lasso_reg$lambda.min
lambda_best #0.001
#Refit lasso at the optimal lambda
raw_lasso <- glmnet(x, y_train, alpha = 1, lambda = lambda_best, standardize = TRUE)
#Predict and evaluate raw_lasso model on training data
predictions_train <- predict(raw_lasso, s = lambda_best, newx = x)
eval_results(y_train, predictions_train, train_raw) #RMSE: 0.0767, Rsquare: 0.6681
#Predict and evaluate raw_lasso model on testing data
predictions_test <- predict(raw_lasso, s = lambda_best, newx = x_test)
eval_results(y_test, predictions_test, test_raw) #RMSE: 0.0783, Rsquare: 0.6642
#Lasso regression lambda plots
op <- par(mfrow=c(1, 2))
plot(lasso_reg$glmnet.fit, "norm", label=TRUE)
plot(lasso_reg$glmnet.fit, "lambda", label=TRUE)
par(op)
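#Note: unlike ridge, the lasso's L1 penalty can shrink coefficients exactly to zero, so it
#performs variable selection as well as shrinkage; coef(raw_lasso) would show which predictors
#were dropped at lambda_best. The two plots trace the coefficient paths against the L1 norm and
#against log(lambda).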
###LASSO REGRESSION (transformed data)
# Setting alpha = 1 implements lasso regression
lasso_reg2 <- cv.glmnet(x2, y_train2, alpha = 1, lambda = lambdas, standardize = TRUE, nfolds = 5)
#Compute optimal lambda
lambda_best2 <- lasso_reg2$lambda.min
lambda_best2 #0.001
#Refit lasso at the optimal lambda
lasso_model <- glmnet(x2, y_train2, alpha = 1, lambda = lambda_best2, standardize = TRUE)
#Predict and evaluate lasso_model on training data
predictions_train2 <- predict(lasso_model, s = lambda_best2, newx = x2)
eval_results(y_train2, predictions_train2, train2) #RMSE: 0.0765, Rsquare: 0.6724
#Predict and evaluate lasso_model on testing data
predictions_test2 <- predict(lasso_model, s = lambda_best2, newx = x_test2)
eval_results(y_test2, predictions_test2, test2) #RMSE: 0.0811, Rsquare: 0.6354
## --- Model Selection --- ##
#Create kable table to succinctly summarize model optimization results
Model <- c('1', '2', '3', '4', '5', '6', '7')
Method <- c('Linear', 'Linear', 'Linear', 'Ridge', 'Ridge', 'Lasso', 'Lasso')
Data <- c('raw', 'raw(AIC)', 'transformed', 'raw', 'transformed', 'raw', 'transformed')
Var_Num <- c(22, 17, 16, 22, 16, 22, 16)
R2_train <- c(0.6678, 0.673, 0.671, 0.6734, 0.6732, 0.6681, 0.6724)
RMSE_train <- c(0.0759, 0.0759, 0.0764, 0.0761, 0.0764, 0.0767, 0.0765)
R2_test <- c(0.6678, 0.673, 0.71, 0.659, 0.632, 0.6642, 0.6354)
RMSE_test <- c(0.0827, 0.0842, 0.0815, 0.0789, 0.0814, 0.0783, 0.0811)

output <- cbind(Model, Method, Data, Var_Num, R2_train, RMSE_train, R2_test, RMSE_test)

output %>%
  kbl(caption = "Regression Model Comparison") %>%
  kable_minimal() %>%
  kable_styling(latex_options = "hold_position")