Background

The purpose of this project will be to investigate the impact of a variety of health and lifestyle metrics on county-level health scores.

We set out with the goal of answering the following 3 questions:

What United States counties are most favorable for an active, healthy lifestyle?
What are the differentiating characteristics that make them so?
What might the best regression model be for modeling the relationship between our healthy lifestyle metric and these differentiating characteristics?

The purpose of this document, more specifically, is to document the process of creating our dependent ‘healthy lifestyle’ metric as well as the pre-processing of our independent variables. It’s to answer Q1 and document the compilation of the dataset to be used for this Final Project.

……………………………………………………………………..

Dependent Variable Creation

In order to create our dependent ‘health score’ variable, we’ve first got to familiarize ourselves with the data at hand.

Life expectancy, obesity, and physical activity data were downloaded from IHME, converted to a csv-compatible form, where then a subset of columns were selected for our consideration for the creation of a dependent ‘health score’ variable.

For life expectancy data we read in the following columns: Male life expectancy, 2010 (years), Female life expectancy, 2010 (years), Difference in male life expectancy, 1985-2010 (years), and Difference in female life expectancy, 1985-2010 (years).
For obesity data we read in the following columns: Male obesity prevalence, 2009 (%), Female obesity prevalence, 2009 (%), Difference in male obesity prevalence, 2001-2009 (percentage points), and Difference in female obesity prevalence, 2001-2009 (percentage points).
For physical activity data we read in the following columns: Male sufficient physical activity prevalence, 2009 (%), Female sufficient physical activity prevalence, 2009 (%), Difference in male sufficient physical activity prevalence, 2001-2009 (percentage points), and Difference in female sufficient physical activity prevalence, 2001-2009 (percentage points).

Life Expectancy Data

We start by exploring life expectancy data at a county level:

glimpse(life_table)

## Rows: 3,142
## Columns: 6
## $ State                                                     <chr> "Alabama", "~
## $ County                                                    <chr> "Autauga", "~
## $ `Male life expectancy, 2010 (years)`                      <dbl> 73.3, 75.0, ~
## $ `Female life expectancy, 2010 (years)`                    <dbl> 78.8, 80.3, ~
## $ `Difference in male life expectancy, 1985-2010 (years)`   <dbl> 5.2, 3.8, 5.~
## $ `Difference in female life expectancy, 1985-2010 (years)` <dbl> 1.8, 1.5, 1.~

summary(life_table)

##     State              County          Male life expectancy, 2010 (years)
##  Length:3142        Length:3142        Min.   :63.90                     
##  Class :character   Class :character   1st Qu.:73.00                     
##  Mode  :character   Mode  :character   Median :75.00                     
##                                        Mean   :74.71                     
##                                        3rd Qu.:76.50                     
##                                        Max.   :81.70                     
##  Female life expectancy, 2010 (years)
##  Min.   :72.7                        
##  1st Qu.:78.4                        
##  Median :79.8                        
##  Mean   :79.7                        
##  3rd Qu.:81.1                        
##  Max.   :85.0                        
##  Difference in male life expectancy, 1985-2010 (years)
##  Min.   :-1.500                                       
##  1st Qu.: 2.900                                       
##  Median : 3.900                                       
##  Mean   : 3.919                                       
##  3rd Qu.: 4.900                                       
##  Max.   :13.000                                       
##  Difference in female life expectancy, 1985-2010 (years)
##  Min.   :-3.500                                         
##  1st Qu.: 0.600                                         
##  Median : 1.500                                         
##  Mean   : 1.528                                         
##  3rd Qu.: 2.400                                         
##  Max.   : 8.400

We’re dealing with 3142 observations x 6 variables with 2 categorical and 4 numeric variables and can extend, based upon the output above, that:

On average, males live to be ~75 years old while females live to be ~80 years old. Thus, females live ~5yrs more than males on average.
On average, male life expectancy increased by ~4 years while female life expectancy increased by ~1.5 years. Thus, male life expectancy increased at a greater rate than female life expectancy from 1985-2010.

Normalization

From here, we move on to normalizing our life expectancy data and bringing it to a 0 to 1 scale. To do so we apply the following formula:

\[ Transformed.Values = \frac{(Values - Min)}{(Max - Min)} \] We extract and normalize our variables of interest and then visualize histograms of our original vs. normalized data:

#Extract variables of interest
m1 <- life_table$`Male life expectancy, 2010 (years)`
f1 <- life_table$`Female life expectancy, 2010 (years)`
dm1 <- life_table$`Difference in male life expectancy, 1985-2010 (years)`
df1 <- life_table$`Difference in female life expectancy, 1985-2010 (years)`

#Normalize data scale to be from 0 to 1
n_m1 = (m1-min(m1))/(max(m1)-min(m1))
n_f1 = (f1-min(f1))/(max(f1)-min(f1))
n_dm1 = (dm1-min(dm1))/(max(dm1)-min(dm1))
n_df1 = (df1-min(df1))/(max(df1)-min(df1))

#Histogram of original vs. normalized data
##Life expectancy histograms
par(mfrow=c(2,2))
hist(m1, breaks=10, xlab="Age (years)", col="lightblue", main="Male life expectancy, 2010")
hist(n_m1, breaks=10, xlab="Normalized Age (years)", col="lightblue", main="Male life expectancy, 2010")
hist(f1, breaks=10, xlab="Age (years)", col="lightblue", main="Female life expectancy, 2010")
hist(n_f1, breaks=10, xlab="Normalized Age (years)", col="lightblue", main="Female life expectancy, 2010")

##Longevity improvement histograms
par(mfrow=c(2,2))
hist(dm1, breaks=10, xlab="Age (years)", col="lightblue", main="Male longevity improvement, 1985-2010")
hist(n_dm1, breaks=10, xlab="Normalized Age (years)", col="lightblue", main="Male longevity improvement, 1985-2010")
hist(df1, breaks=10, xlab="Age (years)", col="lightblue", main="Female longevity improvement, 1985-2010")
hist(n_df1, breaks=10, xlab="Normalized Age (years)", col="lightblue", main="Female longevity improvement, 1985-2010")

From the above plots we observe:

male life expectancy follows a normal, left skewed distribution with a peak at 75. Once we normalize our scales, this distribution centers on ~0.6-0.7.
female life expectancy follows a normal, left skewed distribution with a peak at 80. Once we normalize our scales, this distribution centers on ~0.6-0.7.
male longevity improvement follows a relatively normal, right skewed distribution with a peak at ~3-4. Once we normalize our scales, this distribution centers on 0.3.
female longevity improvement follows a relatively normal, right skewed distribution with a peak at ~1-2. Once we normalize our scales, this distribution centers on ~0.4-0.5.

Congregation

From this point, we congregrate each set of four variables into ONE ‘umbrella’ variable. Being that all variables have been normalized upto this point, we add our normalized variables together, normalize the result and then visit the corresponding histogram and output statistics:

#Add normalized variables together
life <- n_m1 + n_dm1 + n_f1 + n_df1

#Normalize activity to 0-1 range
n_life = (life-min(life))/(max(life)-min(life))
#head(n_life)

# Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(life, breaks=10, xlab="Score", col="lightblue", main="Longevity metric")
hist(n_life, breaks=10, xlab="Normalized Score", col="lightblue", main="Longevity metric")

summary(n_life) #slight left skew

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.4106  0.5039  0.4970  0.5804  1.0000

Our longevity metric appears to have been properly normalized. We observe a slight left skew and a peak centered at ~0.5-0.6.

When we consult the summary statistics, we verify a mean of 0.4970 and a median of 0.5039.

Obesity Data

We continue our EDA by exploring obesity data at a county level:

glimpse(obesity_table)

## Rows: 3,142
## Columns: 6
## $ State                                                                    <chr> ~
## $ County                                                                   <chr> ~
## $ `Male obesity prevalence, 2009 (%)`                                      <dbl> ~
## $ `Female obesity  prevalence, 2009 (%)`                                   <dbl> ~
## $ `Difference in male obesity prevalence, 2001-2009 (percentage points)`   <dbl> ~
## $ `Difference in female obesity prevalence, 2001-2009 (percentage points)` <dbl> ~

summary(obesity_table)

##     State              County          Male obesity prevalence, 2009 (%)
##  Length:3142        Length:3142        Min.   :17.8                     
##  Class :character   Class :character   1st Qu.:34.3                     
##  Mode  :character   Mode  :character   Median :36.2                     
##                                        Mean   :35.9                     
##                                        3rd Qu.:38.1                     
##                                        Max.   :45.6                     
##  Female obesity  prevalence, 2009 (%)
##  Min.   :17.20                       
##  1st Qu.:34.80                       
##  Median :37.60                       
##  Mean   :38.08                       
##  3rd Qu.:41.00                       
##  Max.   :59.30                       
##  Difference in male obesity prevalence, 2001-2009 (percentage points)
##  Min.   :-2.900                                                      
##  1st Qu.: 5.900                                                      
##  Median : 7.200                                                      
##  Mean   : 7.171                                                      
##  3rd Qu.: 8.475                                                      
##  Max.   :15.800                                                      
##  Difference in female obesity prevalence, 2001-2009 (percentage points)
##  Min.   :-1.400                                                        
##  1st Qu.: 5.300                                                        
##  Median : 6.700                                                        
##  Mean   : 6.754                                                        
##  3rd Qu.: 8.100                                                        
##  Max.   :16.400

We’re dealing with 3142 observations x 6 variables with 2 categorical and 4 numeric variables and can extend, based on the output above, that:

On average, 38% of females were obsese whereas 36% of males were. Thus, females have a slightly higher incidence of obesity than males.
On average, the male obesity rate increased by ~7.2% while the rate of female obesity increased by ~6.7%. Thus, males got fatter at a greater rate than females from 2001 to 2009.

Normalization

From here, we move on to normalizing our obesity data and bringing it to a 0 to 1 scale (by applying the formula noted earlier).

We extract and normalize our variables of interest and then visualize histograms of our original vs. normalized data:

#Extract variables of interest
m2 <- obesity_table$`Male obesity prevalence, 2009 (%)` 
f2 <- obesity_table$`Female obesity  prevalence, 2009 (%)` 
dm2 <- obesity_table$`Difference in male obesity prevalence, 2001-2009 (percentage points)` 
df2 <- obesity_table$`Difference in female obesity prevalence, 2001-2009 (percentage points)`

#Normalize
n_m2 = (m2-min(m2))/(max(m2)-min(m2))
n_f2 = (f2-min(f2))/(max(f2)-min(f2))
n_dm2 = (dm2-min(dm2))/(max(dm2)-min(dm2))
n_df2 = (df2-min(df2))/(max(df2)-min(df2))

#Histogram of original vs. normalized data
par(mfrow=c(2,2))
hist(m2, breaks=10, xlab="Obesity rate (%)", col="lightblue", main="Male obesity prevalence, 2009")
hist(n_m2, breaks=10, xlab="Normalized obesity rate (%)", col="lightblue", main="Male obesity prevalence, 2009")
hist(f2, breaks=10, xlab="Obesity rate (%)", col="lightblue", main="Female obesity prevalence, 2009")
hist(n_f2, breaks=10, xlab="Normalized obesity rate (%)", col="lightblue", main="Female obesity prevalence, 2009")

par(mfrow=c(2,2))
hist(dm2, breaks=10, xlab="Obesity rate (%)", col="lightblue", main="Male obesity increase, 2001-2009")
hist(n_dm2, breaks=10, xlab="Normalized obesity rate (%)", col="lightblue", main="Male obesity increase, 2001-2009")
hist(df2, breaks=10, xlab="Obesity rate (%)", col="lightblue", main="Female obesity increase, 2001-2009")
hist(n_df2, breaks=10, xlab="Normalized obesity rate (%)", col="lightblue", main="Female obesity increase, 2001-2009")

From the above plots we observe:

male obesity follows a normal, right skewed distribution with a peak at ~37-38. Once we normalize our scales, this distribution centers on ~0.6-0.7.
female obesity follows a normal distribution with a peak at ~35-40. It’s worth noting the difference in scales from male-to-female since that’s the reason we normalize. Once we normalize our scales, this distribution centers on ~0.4-0.5.
male obesity increase follows a relatively normal, slightly right skewed distribution with a peak at ~7. Once we normalize our scales, this distribution centers on ~0.5-0.6.
female obesity increase follows a normal distribution with a peak atright skewed distribution with a peak at ~7. Once we normalize our scales, this distribution centers on ~0.4-0.5.

Congregation

fat <- n_m2 + n_dm2 + n_f2 + n_df2
#Normalize activity to 0-1 range
n_fat = (fat-min(fat))/(max(fat)-min(fat))

#head(n_fat)

# Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(fat, breaks=10, xlab="Score", col="lightblue", main="Obesity metric")
hist(n_fat, breaks=10, xlab="Normalized Score", col="lightblue", main="Obesity metric")

summary(n_fat) #right skewed

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.5089  0.5819  0.5783  0.6565  1.0000

Our obesity metric appears to have been properly normalized. We observe a slight left skew and a peak centered at ~0.5-0.6.

When we consult the summary statistics, we verify a mean of 0.5783 and a median of 0.5819.

Physical Activity Data

We continue our EDA by exploring physical activity data at a county level:

glimpse(act_table)

## Rows: 3,142
## Columns: 6
## $ State                                                                                         <chr> ~
## $ County                                                                                        <chr> ~
## $ `Male sufficient physical activity  prevalence, 2009 (%)`                                     <dbl> ~
## $ `Female sufficient physical activity  prevalence, 2009 (%)`                                   <dbl> ~
## $ `Difference in male sufficient physical activity prevalence, 2001-2009 (percentage points)`   <dbl> ~
## $ `Difference in female sufficient physical activity prevalence, 2001-2009 (percentage points)` <dbl> ~

summary(act_table)

##     State              County         
##  Length:3142        Length:3142       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##  Male sufficient physical activity  prevalence, 2009 (%)
##  Min.   :34.4                                           
##  1st Qu.:51.0                                           
##  Median :55.5                                           
##  Mean   :55.0                                           
##  3rd Qu.:58.9                                           
##  Max.   :75.7                                           
##  Female sufficient physical activity  prevalence, 2009 (%)
##  Min.   :27.40                                            
##  1st Qu.:43.80                                            
##  Median :49.20                                            
##  Mean   :48.68                                            
##  3rd Qu.:53.70                                            
##  Max.   :73.10                                            
##  Difference in male sufficient physical activity prevalence, 2001-2009 (percentage points)
##  Min.   :-11.400                                                                          
##  1st Qu.: -0.500                                                                          
##  Median :  1.900                                                                          
##  Mean   :  1.954                                                                          
##  3rd Qu.:  4.200                                                                          
##  Max.   : 16.700                                                                          
##  Difference in female sufficient physical activity prevalence, 2001-2009 (percentage points)
##  Min.   :-6.200                                                                             
##  1st Qu.: 2.300                                                                             
##  Median : 4.500                                                                             
##  Mean   : 4.661                                                                             
##  3rd Qu.: 6.800                                                                             
##  Max.   :18.300

We’re dealing with 3142 observations x 6 variables with 2 categorical and 4 numeric variables and can extend, based on the output above, that:

On average, 55% of males vs. 48.7% of females received sufficient physical activity in 2009. Thus, males recorted a higher level of physical activity.
On average, males had a 1.9% increase in physical activity from 2001 to 2009 whereas females reported a 4.7% increase over the same period. Thus, females increased their activity levels at a greater rate than males from 2001 to 2009.

Normalization

From here, we move on to normalizing our physical activity data and bringing it to a 0 to 1 scale (by applying the formula noted in the Life Expectancy Data section).

We extract and normalize our variables of interest and then visualize histograms of our original vs. normalized data:

#Explore and normalize male physical activity data
m3 <- act_table$`Male sufficient physical activity  prevalence, 2009 (%)`
f3 <- act_table$`Female sufficient physical activity  prevalence, 2009 (%)`
dm3 <- act_table$`Difference in male sufficient physical activity prevalence, 2001-2009 (percentage points)`  
df3 <- act_table$`Difference in female sufficient physical activity prevalence, 2001-2009 (percentage points)` 

#Normalized Data
n_m3 = (m3-min(m3))/(max(m3)-min(m3))
n_f3 = (f3-min(f3))/(max(f3)-min(f3))
n_dm3 = (dm3-min(dm3))/(max(dm3)-min(dm3))
n_df3 = (df3-min(df3))/(max(df3)-min(df3))

# Histogram of original vs. normalized data
par(mfrow=c(2,2))
hist(m3, breaks=10, xlab="Physical activity rate (%)", col="lightblue", main="Male activity prevalence, 2009")
hist(n_m3, breaks=10, xlab="Normalized physical activity rate (%)", col="lightblue", main="Male activity prevalence, 2009")
hist(f3, breaks=10, xlab="Physical activity rate (%)", col="lightblue", main="Female activity prevalence, 2009")
hist(n_f3, breaks=10, xlab="Normalized obesity rate (%)", col="lightblue", main="Female activity prevalence, 2009")

par(mfrow=c(2,2))
hist(dm3, breaks=10, xlab="Physical activity rate (%)", col="lightblue", main="Male activity difference, 2001-2009")
hist(n_dm3, breaks=10, xlab="Normalized physical activity rate (%)", col="lightblue", main="Male activity difference, 2001-2009")
hist(df3, breaks=10, xlab="Physical activity rate (%)", col="lightblue", main="Female activity difference, 2009")
hist(n_df3, breaks=10, xlab="Normalized obesity rate (%)", col="lightblue", main="Female activity difference, 2009")

From the above plots we observe:

male physical activity follows a relatively normal distribution with a peak between 55 and 60. Once we normalize our scales, this distribution centers on ~0.5-0.6.
female physical activity also follows a relatively normal distribution with a peak between 55 and 60. Once we normalize our scales, this distribution centers on ~0.5-0.6.
male activity difference follows a relatively normal distribution with a peak between 0 and 5. Once we normalize our scales, this distribution centers on ~0.4-0.5.
female activity difference follows a relatively normal distribution with a peak at 5. Once we normalize our scales, this distribution centers on ~0.4-0.5.

Congregation

active <- n_m3 + n_dm3 + n_f3 + n_df3
#Normalize activity to 0-1 range
n_active = (active-min(active))/(max(active)-min(active))

#head(n_active)

# Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(active, breaks=10, xlab="Score", col="lightblue", main="Physical activity metric")
hist(n_active, breaks=10, xlab="Normalized Score", col="lightblue", main="Physical activity metric")

summary(n_active) #slight right skew

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.4152  0.5150  0.5086  0.6051  1.0000

Our physical activity metric appears to have been properly normalized. We observe a normal distribution whose peak is centered between 0.5 and 0.6.

When we consult the summary statistics, we verify a mean of 0.5086 and a median of 0.5150.

Dependent Variable Creation

From here, we move on to creating our dependent ‘healthy lifestyle’ variable as a combination of the longevity, obesity, and physical activity metrics we’ve explored, normalized, and congregated upto this point. To do so we apply the following formula:

\[ Lifestyle = Normalized.Life - Normalized.Obesity + Normalized.Activity \]

We sum our normalized ‘umbrella’ variables with longevity and physical activity as positive indicators and obesity as a negative indicator. We then normalize the summation to ensure we’re on a 0-1 scale and output the histograms of our summation and our normalized summation for comparison:

lifestyle <- n_life - n_fat + n_active

#Normalize health to 0-1 range
normalized_lifestyle = (lifestyle-min(lifestyle))/(max(lifestyle)-min(lifestyle))

#Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(lifestyle, breaks=10, xlab="Score", col="lightblue", main="Health metric")
hist(normalized_lifestyle, breaks=10, xlab="Normalized Score", col="lightblue", main="Health metric")

summary(normalized_lifestyle)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3708  0.4642  0.4639  0.5546  1.0000

#head(normalized_lifestyle)

For our ‘healthy lifestyle’ metric, we observe a normal distribution whose peak is centered between 0.4 and 0.5.

When we consult the summary statistics, we verify a mean of 0.4639 and a median of 0.4642.

……………………………………………………………………..

Top 10 Healthiest Counties

As a next step, we utilize our health score metric to filter through county data for the top 10 healthiest counties:

#create new df with state | county | health score
starter_df <- life_table %>% 
    dplyr::select(1:2)

starter_df$health_score <- normalized_lifestyle
healthiest_counties <- filter(starter_df, `health_score` > 0.895) #top 10
healthiest_counties <- healthiest_counties[order(-healthiest_counties$`health_score`),] #descending order

#head(starter_df)
#nrow(healthiest_counties) #10
kable(healthiest_counties)

State	County	health_score
Colorado	Routt	1.0000000
California	Marin	0.9889641
Colorado	Pitkin	0.9833580
Wyoming	Teton	0.9778477
Colorado	Eagle	0.9745991
California	San Francisco	0.9139634
Colorado	Summit	0.9072650
Colorado	Douglas	0.9054401
Utah	Summit	0.9014540
Colorado	Gunnison	0.8978052

From the list above, we see (6) Colorado counties, (2) California counties, (1) Utah and (1) Wyoming county. From this, we extend a few assumptions regarding factors that might come into play for the healthiest counties:

sunshine,
median income,
sparser population clusters (aside from San Fransisco), and
friendliness to an active, healthy lifestyle.

It will be interesting to see whether these factors carry once we’ve built our regression model.

From this point, we move on to reading in, exploring, and preparing our independent variables. We have to ensure the format of our data sets align …

Independent Variable Pre-Processing

Read in independent variable data

Alcohol

#Read in alcohol data, convert to tibble, and drop impertinent observation:
alcohol <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/AlcoholConsumption.csv")
alcohol <- as_tibble(alcohol)
alcohol <- alcohol[-1,] #drop State = National
#dim(alcohol) #3178 x 6

alcohol <- alcohol[alcohol$Location != alcohol$State,] #drop state observations from Location column
#dim(alcohol) #3178 - 3127 = 51 observations dropped

#rename columns
alcohol <- alcohol %>% rename( 
    County = Location,
    Hvy = Hvy_2012,
    Bng = Binge_2012,
    HvyPctChg = `HvyPctChg_2005-2012`,
    BngPctChg = `BingePctChg_2005-2012`)

#drop excess verbage from County column
stopwords <- c("and", "Area", "Borough", "Census", "City", "County", "Division", "Municipality", "Parish")
alcohol$County <- removeWords(alcohol$County, stopwords)
#head(alcohol)

For the Alcohol dataset we end up with a 3127 observation x 6 variable data frame with states written out in the State column and counties listed without additional verbage (ie. County, Borough) in the County column.

Cardiovascular Disease

#Read in heart data, convert to tibble, and drop impertinent observation:
heart <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/CardiovascularDisease.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   State = col_character(),
##   Location = col_character(),
##   `Mortality Rate, 2005*` = col_character(),
##   `Mortality Rate, 2010*` = col_character(),
##   `Mortality Rate, 2014*` = col_character()
## )

heart <- as_tibble(heart)
heart <- subset(heart, select = -c(`Mortality Rate, 2010*`)) #drop 2010
heart <- heart[-1,] #drop State = National
#dim(heart) #3193 x 4

# remove State == Location
heart <- heart[heart$Location != heart$State,]
dim(heart) #3193 - 3143 = 50 observations removed

## [1] 3143    4

# retitle columns
heart <- heart %>% rename( 
    County = Location,
    Mortality_2005 = `Mortality Rate, 2005*`,
    Mortality_2014 = `Mortality Rate, 2014*`)

# retain value EXCLUSIVELY for Mortality Rate columns
heart$Mortality_2005 <- gsub("\\s*\\([^\\)]+\\)","",as.character(heart$Mortality_2005))
heart$Mortality_2014 <- gsub("\\s*\\([^\\)]+\\)","",as.character(heart$Mortality_2014))

#convert columns to proper type
heart$Mortality_2005 <- as.double(heart$Mortality_2005)
heart$Mortality_2014 <- as.double(heart$Mortality_2014)

#drop excess verbage from County column
heart$County <- removeWords(heart$County, stopwords)
heart$County <- gsub("(.*),.*", "\\1", heart$County) #remove everything after comma

# add Chg column
heart$MortalityChg <- heart$Mortality_2014 - heart$Mortality_2005

#finalize format of df
heart <- subset(heart, select = -c(`Mortality_2005`)) #drop 2005
heart <- heart %>% rename( Mortality = Mortality_2014)

#head(heart)

For the heart dataset we end up with a 3143 observation x 5 variable data frame with states written out in the State column and counties listed without additional verbage (ie. County, Borough) in the County column. It’s important to note that the mortality rate is listed per 100,000 residents.

Education

#Read in education data, convert to tibble, and drop impertinent observation:
education <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Education.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   State = col_character(),
##   `Area name` = col_character(),
##   `Percent of adults with less than a high school diploma, 2015-19` = col_double(),
##   `Percent of adults with a high school diploma only, 2015-19` = col_double(),
##   `Percent of adults completing some college or associate's degree, 2015-19` = col_double(),
##   `Percent of adults with a bachelor's degree or higher, 2015-19` = col_double()
## )

education <- as_tibble(education)
education <- education[-1,] #drop State = National

education$State <- state.name[match(education$State, state.abb)] #convert state abbreviation to name

education <- education[education$`Area name` != education$State,] #drop state observations from Area name column
#dim(education) #3281 - 3233 = 48 observations dropped

#rename columns
education <- education %>% rename(
    County = `Area name`,
    LTHighSchool = `Percent of adults with less than a high school diploma, 2015-19`,
    HighSchool = `Percent of adults with a high school diploma only, 2015-19`,
    SomeCollege = `Percent of adults completing some college or associate's degree, 2015-19`,
    College = `Percent of adults with a bachelor's degree or higher, 2015-19`)

#drop excess verbage from County column
education$County <- removeWords(education$County, stopwords)

#head(education) #verify

For the Education dataset we end up with a 3233 observation x 6 variable data frame with states written out in the State column and counties listed without additional verbage (ie. County, Borough) in the County column.

EQI

#Read in eqi data and convert to tibble:
eqi <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/EnvironmentalQualityIndex.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   State = col_character(),
##   County_Name = col_character(),
##   socio_index = col_double(),
##   built_index = col_double(),
##   air_index = col_double(),
##   land_index = col_double(),
##   water_index = col_double(),
##   environmental_quality_index = col_double()
## )

eqi <- as_tibble(eqi)
eqi <- subset(eqi, select = -c(3:7)) #drop indices that makeup EQI score

eqi$State <- state.name[match(eqi$State, state.abb)] #convert state abbreviation to name

#rename columns
eqi <- eqi %>% rename(
    County = County_Name,
    EQI = environmental_quality_index)

#drop excess verbage from County column
eqi$County <- removeWords(eqi$County, stopwords)

#head(eqi) #verify
#dim(eqi) #3281 x 6

For the EQI dataset we end up with a 3143 observation x 8 variable data frame with states written out in the State column and counties listed without additional verbage (ie. County, Borough) in the County column.

Food Insecurity

#Read in food insecurity data and convert to tibble:
food <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/FoodInsecurity.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   FIPS = col_double(),
##   State = col_character(),
##   `County, State` = col_character(),
##   `2018 Food Insecurity Rate` = col_character()
## )

food <- as_tibble(food)
#head(food)

#drop FIPS
food <- subset(food, select=-c(FIPS))
#dim(food) #3142 x 3: no need to drop observations

#convert State to full name
food$State <- state.name[match(food$State, state.abb)] #convert state abbreviation to name

#rename columns
food <- food %>% rename(
    County = `County, State`,
    FoodInsecurity = `2018 Food Insecurity Rate`)

#remove excess verbage from County
food$County <- removeWords(food$County, stopwords)
food$County <- gsub("(.*),.*", "\\1", food$County) #remove everything after comma

#drop % from Food Insecurity and convert to double
food$FoodInsecurity = as.double(gsub("[\\%,]", "", food$FoodInsecurity))

#head(food) #verify

Sunlight

#Read in sun data and convert to tibble:
sun <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Sunlight.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   State = col_character(),
##   County = col_character(),
##   `Avg Daily Sunlight` = col_double()
## )

sun <- as_tibble(sun)
#dim(sun) #3161 x 3

#rename column
sun <- sun %>% rename(Sun = `Avg Daily Sunlight`)

#drop excess verbage from County column
sun$County <- removeWords(sun$County, stopwords)
sun$County <- gsub("(.*),.*", "\\1", sun$County) #remove everything after comma

#head(sun) #verify

For the sun dataset we end up with a 3161 observation x 3 variable data frame with states written out in the State column and counties listed without additional verbage (ie. County, Borough) in the County column.

Unemployment

#Read in une ployment data and convert to tibble:
unemp <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Unemployment.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   State = col_character(),
##   area_name = col_character(),
##   Unemployment_rate_2016 = col_double(),
##   Unemployment_rate_2019 = col_double(),
##   `Unemployment_chg_2016-2019` = col_double()
## )

unemp <- as_tibble(unemp)
unemp <- unemp[-1,] #drop State = National
unemp <- subset(unemp, select=-c(3)) #drop 2016

unemp$State <- state.name[match(unemp$State, state.abb)] #convert state abbreviation to name
unemp <- unemp[unemp$area_name != unemp$State,] #drop state observations from Area name column

unemp <- unemp %>% rename(
    County = area_name,
    Unemployment = Unemployment_rate_2019,
    UnemploymentChg = `Unemployment_chg_2016-2019`) #rename columns

unemp$County <- removeWords(unemp$County, stopwords) #drop excess verbage from County column
unemp$County <- gsub("(.*),.*", "\\1", unemp$County) #remove everything after comma

#dim(unemp) #3274 - 3224 = 50 dropped
#head(unemp) #verify

For the unemployment dataset we end up with a 3161 observation x 3 variable data frame with states written out in the State column and counties listed without additional verbage (ie. County, Borough) in the County column.

Income & Poverty

#Read in wealth data and convert to tibble:
wealth <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Wealth.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   State = col_character(),
##   County = col_character(),
##   PovertyRate = col_character(),
##   MedianHouseholdIncome = col_character()
## )

wealth <- as_tibble(wealth)
wealth <- wealth[-1,] #drop State = National
#dim(wealth) #3193 x 4

wealth$State <- state.name[match(wealth$State, state.abb)] #convert state abbreviation to name
wealth <- wealth[wealth$County != wealth$State,] #drop state observations from Area name column
#dim(wealth) #3193 - 3143 = 50 observations dropped

#convert columns to proper type
wealth$PovertyRate <- as.double(wealth$PovertyRate)

## Warning: NAs introduced by coercion

wealth$MedianHouseholdIncome <- as.numeric(gsub(",","",wealth$MedianHouseholdIncome))

## Warning: NAs introduced by coercion

#rename columns
wealth <- wealth %>% rename(
    Poverty = PovertyRate,
    Income = MedianHouseholdIncome) #rename columns

wealth$County <- removeWords(wealth$County, stopwords) #drop excess verbage from County column

#head(wealth)

For the wealth dataset we end up with a 3143 observation x 4 variable data frame with states written out in the State column and counties listed without additional verbage (ie. County, Borough) in the County column.

Population

#Read in population data and convert to tibble:
pop <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/population.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   STNAME = col_character(),
##   CTYNAME = col_character(),
##   CENSUS2010POP = col_double(),
##   POPESTIMATE2019 = col_double(),
##   BIRTHS2019 = col_double(),
##   DEATHS2019 = col_double(),
##   NETMIG2019 = col_double()
## )

pop <- as_tibble(pop)

# rename columns
pop <- pop %>% rename(
    State = STNAME,
    County = CTYNAME,
    Pop_2010 = CENSUS2010POP,
    Population = POPESTIMATE2019,
    Births = BIRTHS2019,
    Deaths = DEATHS2019,
    NetMig = NETMIG2019) #rename columns

#add population change variable
pop$PopChg <- pop$Population - pop$Pop_2010

pop <- subset(pop, select=-c(3)) #drop 2010

#dim(pop) #3193 x 7
pop <- pop[pop$County != pop$State,] #drop state observations from Area name column
#dim(pop) #3193 - 3141 = 52 observations dropped

pop$County[1802] <- "Dona Ana County" #invalid UTF-8
pop$County <- removeWords(pop$County, stopwords) #drop excess verbage from County column

#head(pop)

For the population dataset we end up with a 3141 observation x 7 variable data frame with states written out in the State column and counties listed without additional verbage (ie. County, Borough) in the County column.

Merge df’s

Place into consistent format with dependent variables (state-count-…)

#MERGE DF's

#1. merge health score and alcohol df's

##time white space
alcohol$County <- trimws(alcohol$County)
starter_df$County <- trimws(starter_df$County)

##SQL join
df <- sqldf("SELECT *
             FROM starter_df
             LEFT JOIN alcohol ON starter_df.State = alcohol.State AND starter_df.County = alcohol.County")

##remove extra State, County columns
df <- subset(df, select=-c(4,5))

#2. merge heart to df

##time white space
heart$County <- trimws(heart$County)

##SQL join
df <- sqldf("SELECT *
             FROM df
             LEFT JOIN heart ON df.State = heart.State AND df.County = heart.County")

##remove extra State, County columns
df <- subset(df, select=-c(8,9))

#3. merge education to df

##time white space
education$County <- trimws(education$County)

##SQL join
df <- sqldf("SELECT *
             FROM df
             LEFT JOIN education ON df.State = education.State AND df.County = education.County")

##remove extra State, County columns
df <- subset(df, select=-c(10,11))

#4. merge eqi to df

##time white space
eqi$County <- trimws(eqi$County)

##SQL join
df <- sqldf("SELECT *
             FROM df
             LEFT JOIN eqi ON df.State = eqi.State AND df.County = eqi.County")

##remove extra State, County columns
df <- subset(df, select=-c(14,15))

#5. merge food to df

##time white space
food$County <- trimws(food$County)

##SQL join
df <- sqldf("SELECT *
             FROM df
             LEFT JOIN food ON df.State = food.State AND df.County = food.County")

##remove extra State, County columns
df <- subset(df, select=-c(15,16))

#6. merge sun to df

##time white space
sun$County <- trimws(sun$County)

##SQL join
df <- sqldf("SELECT *
             FROM df
             LEFT JOIN sun ON df.State = sun.State AND df.County = sun.County")

##remove extra State, County columns
df <- subset(df, select=-c(16,17))

#7. merge unemp to df

##time white space
unemp$County <- trimws(unemp$County)

##SQL join
df <- sqldf("SELECT *
             FROM df
             LEFT JOIN unemp ON df.State = unemp.State AND df.County = unemp.County")

##remove extra State, County columns
df <- subset(df, select=-c(17,18))

#8. merge wealth to df

##time white space
wealth$County <- trimws(wealth$County)

##SQL join
df <- sqldf("SELECT *
             FROM df
             LEFT JOIN wealth ON df.State = wealth.State AND df.County = wealth.County")

##remove extra State, County columns
df <- subset(df, select=-c(19,20))

#9. merge pop to df

##time white space
pop$County <- trimws(pop$County)

##SQL join
df <- sqldf("SELECT *
             FROM df
             LEFT JOIN pop ON df.State = pop.State AND df.County = pop.County")

##remove extra State, County columns
df <- subset(df, select=-c(21,22))

#verify variables and dimensions
head(df)

##     State  County health_score  Hvy HvyPctChg  Bng BngPctChg Mortality
## 1 Alabama Autauga    0.4597290 14.7      -2.7 31.1      -1.1    316.36
## 2 Alabama Baldwin    0.5185262 16.6       1.2 30.4      -1.8    272.04
## 3 Alabama Barbour    0.2292573 16.9      17.5 32.9       3.4    255.09
## 4 Alabama    Bibb    0.2954267 15.4       1.4 31.9      -2.3    378.09
## 5 Alabama  Blount    0.2025867 14.3       3.1 29.4      -3.4    307.90
## 6 Alabama Bullock    0.2151089 18.8      21.4 38.1      15.9    322.56
##   MortalityChg LTHighSchool HighSchool SomeCollege College          EQI
## 1       -52.47         11.5       33.6        28.4    26.6  0.284681678
## 2       -36.95          9.2       27.7        31.3    31.9  0.754969418
## 3       -45.95         26.8       35.6        26.0    11.6 -0.005637936
## 4       -31.16         20.9       44.9        23.8    10.4  0.304039747
## 5       -39.89         19.5       33.4        34.0    13.1  1.141660571
## 6       -68.95         25.3       40.3        22.3    12.1  0.104423791
##   FoodInsecurity      Sun Unemployment UnemploymentChg Poverty Income
## 1           15.6 18557.98          2.7            -2.4    12.1  58233
## 2           12.9 19101.34          2.7            -2.6    10.1  59871
## 3           21.9 18642.40          3.8            -4.5    27.1  35972
## 4           15.1 18282.04          3.1            -3.3    20.3  47918
## 5           13.6 17606.36          2.7            -2.7    16.3  52902
## 6           20.5 18662.86          3.6            -3.2    30.0  31906
##   Population Births Deaths NetMig PopChg
## 1      55869    624    541    254   1298
## 2     223234   2304   2326   5377  40969
## 3      24686    256    312   -128  -2771
## 4      22394    240    252     41   -521
## 5      57826    651    657     65    504
## 6      10101    109    109    -73   -813

dim(df) #3154 x 25

## [1] 3154   25

With all of our dataframes merged into one master dataframe df, we see that we’re dealing with 3154 observations x 25 variables. The variables are listed above and we get an idea of their value ranges and such from a quick glance but we can glean much more via exploratory data analysis (EDA).

At this point we’re done with data pre-processing.

—END OF DOCUMENT—

References

https://www.datanovia.com/en/blog/how-to-normalize-and-standardize-data-in-r-for-great-heatmap-visualization/ (normalization)
https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range (normalization)

DATA698: Data Preprocessing

Magnus Skonberg

2021-05-10

Background

Dependent Variable Creation

Life Expectancy Data

Normalization

Congregation

Obesity Data

Normalization

Congregation

Physical Activity Data

Normalization

Congregation

Dependent Variable Creation

Top 10 Healthiest Counties

Independent Variable Pre-Processing

Alcohol

Cardiovascular Disease

Education

EQI

Food Insecurity

Sunlight

Unemployment

Income & Poverty

Population

Merge df’s

References