The purpose of this project is to investigate the impact of a variety of health and lifestyle metrics on county-level health scores.
We set out with the goal of answering the following 3 questions:
More specifically, the purpose of this document is to record the process of creating our dependent ‘healthy lifestyle’ metric and the pre-processing of our independent variables. It answers Q1 and documents the compilation of the dataset to be used for this Final Project.
To create our dependent ‘health score’ variable, we first need to familiarize ourselves with the data at hand.
Life expectancy, obesity, and physical activity data were downloaded from IHME and converted to CSV form, and a subset of columns was selected for consideration in creating the dependent ‘health score’ variable:
- Life expectancy: Male life expectancy, 2010 (years); Female life expectancy, 2010 (years); Difference in male life expectancy, 1985-2010 (years); and Difference in female life expectancy, 1985-2010 (years).
- Obesity: Male obesity prevalence, 2009 (%); Female obesity prevalence, 2009 (%); Difference in male obesity prevalence, 2001-2009 (percentage points); and Difference in female obesity prevalence, 2001-2009 (percentage points).
- Physical activity: Male sufficient physical activity prevalence, 2009 (%); Female sufficient physical activity prevalence, 2009 (%); Difference in male sufficient physical activity prevalence, 2001-2009 (percentage points); and Difference in female sufficient physical activity prevalence, 2001-2009 (percentage points).

We start by exploring life expectancy data at a county level:
glimpse(life_table)
## Rows: 3,142
## Columns: 6
## $ State <chr> "Alabama", "~
## $ County <chr> "Autauga", "~
## $ `Male life expectancy, 2010 (years)` <dbl> 73.3, 75.0, ~
## $ `Female life expectancy, 2010 (years)` <dbl> 78.8, 80.3, ~
## $ `Difference in male life expectancy, 1985-2010 (years)` <dbl> 5.2, 3.8, 5.~
## $ `Difference in female life expectancy, 1985-2010 (years)` <dbl> 1.8, 1.5, 1.~
summary(life_table)
## State County Male life expectancy, 2010 (years)
## Length:3142 Length:3142 Min. :63.90
## Class :character Class :character 1st Qu.:73.00
## Mode :character Mode :character Median :75.00
## Mean :74.71
## 3rd Qu.:76.50
## Max. :81.70
## Female life expectancy, 2010 (years)
## Min. :72.7
## 1st Qu.:78.4
## Median :79.8
## Mean :79.7
## 3rd Qu.:81.1
## Max. :85.0
## Difference in male life expectancy, 1985-2010 (years)
## Min. :-1.500
## 1st Qu.: 2.900
## Median : 3.900
## Mean : 3.919
## 3rd Qu.: 4.900
## Max. :13.000
## Difference in female life expectancy, 1985-2010 (years)
## Min. :-3.500
## 1st Qu.: 0.600
## Median : 1.500
## Mean : 1.528
## 3rd Qu.: 2.400
## Max. : 8.400
We’re dealing with 3142 observations x 6 variables with 2 categorical and 4 numeric variables and can extend, based upon the output above, that:
From here, we move on to normalizing our life expectancy data and bringing it to a 0 to 1 scale. To do so we apply the following formula:
\[ Transformed.Values = \frac{(Values - Min)}{(Max - Min)} \]

We extract and normalize our variables of interest and then visualize histograms of our original vs. normalized data:
#Extract variables of interest
m1 <- life_table$`Male life expectancy, 2010 (years)`
f1 <- life_table$`Female life expectancy, 2010 (years)`
dm1 <- life_table$`Difference in male life expectancy, 1985-2010 (years)`
df1 <- life_table$`Difference in female life expectancy, 1985-2010 (years)`

#Normalize data scale to be from 0 to 1
n_m1 = (m1-min(m1))/(max(m1)-min(m1))
n_f1 = (f1-min(f1))/(max(f1)-min(f1))
n_dm1 = (dm1-min(dm1))/(max(dm1)-min(dm1))
n_df1 = (df1-min(df1))/(max(df1)-min(df1))
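#The four lines above repeat the same min-max pattern; as a sketch, this could
#be wrapped in a helper function (hypothetical name, not used elsewhere here):
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
#e.g. normalize(c(63.9, 74.7, 81.7)) returns 0.0000000 0.6067416 1.0000000,
#so normalize(m1) would reproduce n_m1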
#Histogram of original vs. normalized data
##Life expectancy histograms
par(mfrow=c(2,2))
hist(m1, breaks=10, xlab="Age (years)", col="lightblue", main="Male life expectancy, 2010")
hist(n_m1, breaks=10, xlab="Normalized age", col="lightblue", main="Male life expectancy, 2010")
hist(f1, breaks=10, xlab="Age (years)", col="lightblue", main="Female life expectancy, 2010")
hist(n_f1, breaks=10, xlab="Normalized age", col="lightblue", main="Female life expectancy, 2010")

##Longevity improvement histograms
par(mfrow=c(2,2))
hist(dm1, breaks=10, xlab="Change (years)", col="lightblue", main="Male longevity improvement, 1985-2010")
hist(n_dm1, breaks=10, xlab="Normalized change", col="lightblue", main="Male longevity improvement, 1985-2010")
hist(df1, breaks=10, xlab="Change (years)", col="lightblue", main="Female longevity improvement, 1985-2010")
hist(n_df1, breaks=10, xlab="Normalized change", col="lightblue", main="Female longevity improvement, 1985-2010")
From the above plots we observe:
From this point, we combine each set of four variables into ONE ‘umbrella’ variable. Since all variables have been normalized up to this point, we add our normalized variables together, normalize the result, and then review the corresponding histogram and summary statistics:
#Add normalized variables together
life <- n_m1 + n_dm1 + n_f1 + n_df1

#Normalize longevity metric to 0-1 range
n_life = (life-min(life))/(max(life)-min(life))
#head(n_life)
# Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(life, breaks=10, xlab="Score", col="lightblue", main="Longevity metric")
hist(n_life, breaks=10, xlab="Normalized Score", col="lightblue", main="Longevity metric")
summary(n_life) #slight left skew
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4106 0.5039 0.4970 0.5804 1.0000
Our longevity metric appears to have been properly normalized. We observe a slight left skew and a peak centered at ~0.5-0.6.
When we consult the summary statistics, we verify a mean of 0.4970 and a median of 0.5039.
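Rather than only eyeballing the skew from the histogram, we could also quantify it; a minimal sketch, assuming the e1071 package is available:

#Sample skewness as a numeric check of the eyeballed skew (e1071 assumed installed)
library(e1071)
skewness(n_life) #a negative value confirms a left skew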
We continue our EDA by exploring obesity data at a county level:
glimpse(obesity_table)
## Rows: 3,142
## Columns: 6
## $ State <chr> ~
## $ County <chr> ~
## $ `Male obesity prevalence, 2009 (%)` <dbl> ~
## $ `Female obesity prevalence, 2009 (%)` <dbl> ~
## $ `Difference in male obesity prevalence, 2001-2009 (percentage points)` <dbl> ~
## $ `Difference in female obesity prevalence, 2001-2009 (percentage points)` <dbl> ~
summary(obesity_table)
## State County Male obesity prevalence, 2009 (%)
## Length:3142 Length:3142 Min. :17.8
## Class :character Class :character 1st Qu.:34.3
## Mode :character Mode :character Median :36.2
## Mean :35.9
## 3rd Qu.:38.1
## Max. :45.6
## Female obesity prevalence, 2009 (%)
## Min. :17.20
## 1st Qu.:34.80
## Median :37.60
## Mean :38.08
## 3rd Qu.:41.00
## Max. :59.30
## Difference in male obesity prevalence, 2001-2009 (percentage points)
## Min. :-2.900
## 1st Qu.: 5.900
## Median : 7.200
## Mean : 7.171
## 3rd Qu.: 8.475
## Max. :15.800
## Difference in female obesity prevalence, 2001-2009 (percentage points)
## Min. :-1.400
## 1st Qu.: 5.300
## Median : 6.700
## Mean : 6.754
## 3rd Qu.: 8.100
## Max. :16.400
We’re dealing with 3142 observations x 6 variables with 2 categorical and 4 numeric variables and can extend, based on the output above, that:
From here, we move on to normalizing our obesity data and bringing it to a 0 to 1 scale (by applying the formula noted earlier).
We extract and normalize our variables of interest and then visualize histograms of our original vs. normalized data:
#Extract variables of interest
m2 <- obesity_table$`Male obesity prevalence, 2009 (%)`
f2 <- obesity_table$`Female obesity prevalence, 2009 (%)`
dm2 <- obesity_table$`Difference in male obesity prevalence, 2001-2009 (percentage points)`
df2 <- obesity_table$`Difference in female obesity prevalence, 2001-2009 (percentage points)`

#Normalize
n_m2 = (m2-min(m2))/(max(m2)-min(m2))
n_f2 = (f2-min(f2))/(max(f2)-min(f2))
n_dm2 = (dm2-min(dm2))/(max(dm2)-min(dm2))
n_df2 = (df2-min(df2))/(max(df2)-min(df2))
#Histogram of original vs. normalized data
par(mfrow=c(2,2))
hist(m2, breaks=10, xlab="Obesity rate (%)", col="lightblue", main="Male obesity prevalence, 2009")
hist(n_m2, breaks=10, xlab="Normalized obesity rate", col="lightblue", main="Male obesity prevalence, 2009")
hist(f2, breaks=10, xlab="Obesity rate (%)", col="lightblue", main="Female obesity prevalence, 2009")
hist(n_f2, breaks=10, xlab="Normalized obesity rate", col="lightblue", main="Female obesity prevalence, 2009")
par(mfrow=c(2,2))
hist(dm2, breaks=10, xlab="Change (percentage points)", col="lightblue", main="Male obesity increase, 2001-2009")
hist(n_dm2, breaks=10, xlab="Normalized change", col="lightblue", main="Male obesity increase, 2001-2009")
hist(df2, breaks=10, xlab="Change (percentage points)", col="lightblue", main="Female obesity increase, 2001-2009")
hist(n_df2, breaks=10, xlab="Normalized change", col="lightblue", main="Female obesity increase, 2001-2009")
From the above plots we observe:
From this point, we combine each set of four variables into ONE ‘umbrella’ variable. Since all variables have been normalized up to this point, we add our normalized variables together, normalize the result, and then review the corresponding histogram and summary statistics:
fat <- n_m2 + n_dm2 + n_f2 + n_df2

#Normalize obesity metric to 0-1 range
n_fat = (fat-min(fat))/(max(fat)-min(fat))
#head(n_fat)
# Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(fat, breaks=10, xlab="Score", col="lightblue", main="Obesity metric")
hist(n_fat, breaks=10, xlab="Normalized Score", col="lightblue", main="Obesity metric")
summary(n_fat) #slight left skew
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.5089 0.5819 0.5783 0.6565 1.0000
Our obesity metric appears to have been properly normalized. We observe a slight left skew and a peak centered at ~0.5-0.6.
When we consult the summary statistics, we verify a mean of 0.5783 and a median of 0.5819.
We continue our EDA by exploring physical activity data at a county level:
glimpse(act_table)
## Rows: 3,142
## Columns: 6
## $ State <chr> ~
## $ County <chr> ~
## $ `Male sufficient physical activity prevalence, 2009 (%)` <dbl> ~
## $ `Female sufficient physical activity prevalence, 2009 (%)` <dbl> ~
## $ `Difference in male sufficient physical activity prevalence, 2001-2009 (percentage points)` <dbl> ~
## $ `Difference in female sufficient physical activity prevalence, 2001-2009 (percentage points)` <dbl> ~
summary(act_table)
## State County
## Length:3142 Length:3142
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Male sufficient physical activity prevalence, 2009 (%)
## Min. :34.4
## 1st Qu.:51.0
## Median :55.5
## Mean :55.0
## 3rd Qu.:58.9
## Max. :75.7
## Female sufficient physical activity prevalence, 2009 (%)
## Min. :27.40
## 1st Qu.:43.80
## Median :49.20
## Mean :48.68
## 3rd Qu.:53.70
## Max. :73.10
## Difference in male sufficient physical activity prevalence, 2001-2009 (percentage points)
## Min. :-11.400
## 1st Qu.: -0.500
## Median : 1.900
## Mean : 1.954
## 3rd Qu.: 4.200
## Max. : 16.700
## Difference in female sufficient physical activity prevalence, 2001-2009 (percentage points)
## Min. :-6.200
## 1st Qu.: 2.300
## Median : 4.500
## Mean : 4.661
## 3rd Qu.: 6.800
## Max. :18.300
We’re dealing with 3142 observations x 6 variables with 2 categorical and 4 numeric variables and can extend, based on the output above, that:
From here, we move on to normalizing our physical activity data and bringing it to a 0 to 1 scale (by applying the formula noted in the Life Expectancy Data section).
We extract and normalize our variables of interest and then visualize histograms of our original vs. normalized data:
#Extract variables of interest
m3 <- act_table$`Male sufficient physical activity prevalence, 2009 (%)`
f3 <- act_table$`Female sufficient physical activity prevalence, 2009 (%)`
dm3 <- act_table$`Difference in male sufficient physical activity prevalence, 2001-2009 (percentage points)`
df3 <- act_table$`Difference in female sufficient physical activity prevalence, 2001-2009 (percentage points)`

#Normalize data
n_m3 = (m3-min(m3))/(max(m3)-min(m3))
n_f3 = (f3-min(f3))/(max(f3)-min(f3))
n_dm3 = (dm3-min(dm3))/(max(dm3)-min(dm3))
n_df3 = (df3-min(df3))/(max(df3)-min(df3))
# Histogram of original vs. normalized data
par(mfrow=c(2,2))
hist(m3, breaks=10, xlab="Physical activity rate (%)", col="lightblue", main="Male activity prevalence, 2009")
hist(n_m3, breaks=10, xlab="Normalized physical activity rate", col="lightblue", main="Male activity prevalence, 2009")
hist(f3, breaks=10, xlab="Physical activity rate (%)", col="lightblue", main="Female activity prevalence, 2009")
hist(n_f3, breaks=10, xlab="Normalized physical activity rate", col="lightblue", main="Female activity prevalence, 2009")
par(mfrow=c(2,2))
hist(dm3, breaks=10, xlab="Change (percentage points)", col="lightblue", main="Male activity difference, 2001-2009")
hist(n_dm3, breaks=10, xlab="Normalized change", col="lightblue", main="Male activity difference, 2001-2009")
hist(df3, breaks=10, xlab="Change (percentage points)", col="lightblue", main="Female activity difference, 2001-2009")
hist(n_df3, breaks=10, xlab="Normalized change", col="lightblue", main="Female activity difference, 2001-2009")
From the above plots we observe:
From this point, we combine each set of four variables into ONE ‘umbrella’ variable. Since all variables have been normalized up to this point, we add our normalized variables together, normalize the result, and then review the corresponding histogram and summary statistics:
active <- n_m3 + n_dm3 + n_f3 + n_df3

#Normalize activity to 0-1 range
n_active = (active-min(active))/(max(active)-min(active))
#head(n_active)
# Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(active, breaks=10, xlab="Score", col="lightblue", main="Physical activity metric")
hist(n_active, breaks=10, xlab="Normalized Score", col="lightblue", main="Physical activity metric")
summary(n_active) #slight left skew
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4152 0.5150 0.5086 0.6051 1.0000
Our physical activity metric appears to have been properly normalized. We observe an approximately normal distribution, with a slight left skew, whose peak is centered between 0.5 and 0.6.
When we consult the summary statistics, we verify a mean of 0.5086 and a median of 0.5150.
From here, we move on to creating our dependent ‘healthy lifestyle’ variable as a combination of the longevity, obesity, and physical activity metrics we’ve explored, normalized, and combined up to this point. To do so we apply the following formula:
\[ Lifestyle = Normalized.Life - Normalized.Obesity + Normalized.Activity \]
We sum our normalized ‘umbrella’ variables, with longevity and physical activity entering as positive indicators and obesity as a negative indicator. We then normalize the sum to ensure we’re on a 0-1 scale and output histograms of the sum and the normalized sum for comparison:
lifestyle <- n_life - n_fat + n_active

#Normalize health to 0-1 range
normalized_lifestyle = (lifestyle-min(lifestyle))/(max(lifestyle)-min(lifestyle))
#Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(lifestyle, breaks=10, xlab="Score", col="lightblue", main="Health metric")
hist(normalized_lifestyle, breaks=10, xlab="Normalized Score", col="lightblue", main="Health metric")
summary(normalized_lifestyle)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3708 0.4642 0.4639 0.5546 1.0000
#head(normalized_lifestyle)
For our ‘healthy lifestyle’ metric, we observe a normal distribution whose peak is centered between 0.4 and 0.5.
When we consult the summary statistics, we verify a mean of 0.4639 and a median of 0.4642.
As a next step, we utilize our health score metric to filter through county data for the top 10 healthiest counties:
#create new df with state | county | health score
starter_df <- life_table %>%
  dplyr::select(1:2)

starter_df$health_score <- normalized_lifestyle
healthiest_counties <- filter(starter_df, `health_score` > 0.895) #top 10
healthiest_counties <- healthiest_counties[order(-healthiest_counties$`health_score`),] #descending order

#head(starter_df)
#nrow(healthiest_counties) #10
kable(healthiest_counties)
| State      | County        | health_score |
|------------|---------------|--------------|
| Colorado   | Routt         | 1.0000000    |
| California | Marin         | 0.9889641    |
| Colorado   | Pitkin        | 0.9833580    |
| Wyoming    | Teton         | 0.9778477    |
| Colorado   | Eagle         | 0.9745991    |
| California | San Francisco | 0.9139634    |
| Colorado   | Summit        | 0.9072650    |
| Colorado   | Douglas       | 0.9054401    |
| Utah       | Summit        | 0.9014540    |
| Colorado   | Gunnison      | 0.8978052    |
From the list above, we see six Colorado counties, two California counties, one Utah county, and one Wyoming county. From this, we can offer a few hypotheses regarding factors that might come into play for the healthiest counties:
It will be interesting to see whether these factors hold up once we’ve built our regression model.
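As an aside, the 0.895 cutoff used above was hand-tuned to yield ten counties; a threshold-free sketch of the same selection, assuming dplyr >= 1.0 for slice_max(), would be:

#Alternative sketch: take the top 10 directly rather than via a hand-tuned threshold
healthiest_counties <- starter_df %>%
  slice_max(health_score, n = 10) %>%
  arrange(desc(health_score))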
From this point, we move on to reading in, exploring, and preparing our independent variables. We have to ensure the formats of our data sets align …
Read in independent variable data
#Read in alcohol data, convert to tibble, and drop impertinent observation:
alcohol <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/AlcoholConsumption.csv")
alcohol <- as_tibble(alcohol)
alcohol <- alcohol[-1,] #drop State = National
#dim(alcohol) #3178 x 6

alcohol <- alcohol[alcohol$Location != alcohol$State,] #drop state observations from Location column
#dim(alcohol) #3178 - 3127 = 51 observations dropped

#rename columns
alcohol <- alcohol %>% rename(
  County = Location,
  Hvy = Hvy_2012,
  Bng = Binge_2012,
  HvyPctChg = `HvyPctChg_2005-2012`,
  BngPctChg = `BingePctChg_2005-2012`)

#drop excess verbiage from County column
stopwords <- c("and", "Area", "Borough", "Census", "City", "County", "Division", "Municipality", "Parish")
alcohol$County <- removeWords(alcohol$County, stopwords)
#head(alcohol)
For the Alcohol dataset we end up with a 3127 observation x 6 variable data frame, with states written out in the State column and counties listed without additional verbiage (i.e., County, Borough) in the County column.
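To illustrate what the stopword stripping does (the input below is a made-up example), removeWords() from the tm package deletes whole words but leaves the surrounding whitespace behind, which is why we trim the County keys with trimws() before the merges later on:

#Illustration of removeWords() from the tm package (hypothetical input)
removeWords("Autauga County", stopwords) #returns "Autauga " (note the trailing space)
trimws(removeWords("Autauga County", stopwords)) #returns "Autauga"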
#Read in heart data, convert to tibble, and drop impertinent observation:
heart <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/CardiovascularDisease.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## State = col_character(),
## Location = col_character(),
## `Mortality Rate, 2005*` = col_character(),
## `Mortality Rate, 2010*` = col_character(),
## `Mortality Rate, 2014*` = col_character()
## )
heart <- as_tibble(heart)
heart <- subset(heart, select = -c(`Mortality Rate, 2010*`)) #drop 2010
heart <- heart[-1,] #drop State = National
#dim(heart) #3193 x 4

# remove State == Location
heart <- heart[heart$Location != heart$State,]
dim(heart) #3193 - 3143 = 50 observations removed
## [1] 3143 4
# retitle columns
heart <- heart %>% rename(
  County = Location,
  Mortality_2005 = `Mortality Rate, 2005*`,
  Mortality_2014 = `Mortality Rate, 2014*`)

# retain value EXCLUSIVELY for Mortality Rate columns (strip parenthesized values)
heart$Mortality_2005 <- gsub("\\s*\\([^\\)]+\\)","",as.character(heart$Mortality_2005))
heart$Mortality_2014 <- gsub("\\s*\\([^\\)]+\\)","",as.character(heart$Mortality_2014))

#convert columns to proper type
heart$Mortality_2005 <- as.double(heart$Mortality_2005)
heart$Mortality_2014 <- as.double(heart$Mortality_2014)

#drop excess verbiage from County column
heart$County <- removeWords(heart$County, stopwords)
heart$County <- gsub("(.*),.*", "\\1", heart$County) #remove everything after comma

# add Chg column
heart$MortalityChg <- heart$Mortality_2014 - heart$Mortality_2005

#finalize format of df
heart <- subset(heart, select = -c(`Mortality_2005`)) #drop 2005
heart <- heart %>% rename(Mortality = Mortality_2014)

#head(heart)
For the heart dataset we end up with a 3143 observation x 4 variable data frame, with states written out in the State column and counties listed without additional verbiage (i.e., County, Borough) in the County column. It’s important to note that the mortality rate is listed per 100,000 residents.
#Read in education data, convert to tibble, and drop impertinent observation:
education <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Education.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## State = col_character(),
## `Area name` = col_character(),
## `Percent of adults with less than a high school diploma, 2015-19` = col_double(),
## `Percent of adults with a high school diploma only, 2015-19` = col_double(),
## `Percent of adults completing some college or associate's degree, 2015-19` = col_double(),
## `Percent of adults with a bachelor's degree or higher, 2015-19` = col_double()
## )
education <- as_tibble(education)
education <- education[-1,] #drop State = National

education$State <- state.name[match(education$State, state.abb)] #convert state abbreviation to name

education <- education[education$`Area name` != education$State,] #drop state observations from Area name column
#dim(education) #3281 - 3233 = 48 observations dropped

#rename columns
education <- education %>% rename(
  County = `Area name`,
  LTHighSchool = `Percent of adults with less than a high school diploma, 2015-19`,
  HighSchool = `Percent of adults with a high school diploma only, 2015-19`,
  SomeCollege = `Percent of adults completing some college or associate's degree, 2015-19`,
  College = `Percent of adults with a bachelor's degree or higher, 2015-19`)

#drop excess verbiage from County column
education$County <- removeWords(education$County, stopwords)

#head(education) #verify
For the Education dataset we end up with a 3233 observation x 6 variable data frame, with states written out in the State column and counties listed without additional verbiage (i.e., County, Borough) in the County column.
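The abbreviation-to-name conversion used above leans on base R’s built-in state.abb and state.name vectors: match() finds each abbreviation’s position in state.abb, and that position indexes state.name. Note that anything not in state.abb (e.g., "DC") would come back NA:

#How the state abbreviation conversion works (base R datasets)
state.name[match(c("AL", "CA", "WY"), state.abb)]
## [1] "Alabama"    "California" "Wyoming"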
#Read in eqi data and convert to tibble:
eqi <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/EnvironmentalQualityIndex.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## State = col_character(),
## County_Name = col_character(),
## socio_index = col_double(),
## built_index = col_double(),
## air_index = col_double(),
## land_index = col_double(),
## water_index = col_double(),
## environmental_quality_index = col_double()
## )
eqi <- as_tibble(eqi)
eqi <- subset(eqi, select = -c(3:7)) #drop indices that make up EQI score

eqi$State <- state.name[match(eqi$State, state.abb)] #convert state abbreviation to name

#rename columns
eqi <- eqi %>% rename(
  County = County_Name,
  EQI = environmental_quality_index)

#drop excess verbiage from County column
eqi$County <- removeWords(eqi$County, stopwords)

#head(eqi) #verify
#dim(eqi)
For the EQI dataset we end up with a 3143 observation x 3 variable data frame, with states written out in the State column and counties listed without additional verbiage (i.e., County, Borough) in the County column.
#Read in food insecurity data and convert to tibble:
food <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/FoodInsecurity.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## FIPS = col_double(),
## State = col_character(),
## `County, State` = col_character(),
## `2018 Food Insecurity Rate` = col_character()
## )
food <- as_tibble(food)
#head(food)

#drop FIPS
food <- subset(food, select=-c(FIPS))
#dim(food) #3142 x 3: no need to drop observations

#convert State to full name
food$State <- state.name[match(food$State, state.abb)] #convert state abbreviation to name

#rename columns
food <- food %>% rename(
  County = `County, State`,
  FoodInsecurity = `2018 Food Insecurity Rate`)

#remove excess verbiage from County
food$County <- removeWords(food$County, stopwords)
food$County <- gsub("(.*),.*", "\\1", food$County) #remove everything after comma

#drop % from Food Insecurity and convert to double
food$FoodInsecurity = as.double(gsub("[\\%,]", "", food$FoodInsecurity))

#head(food) #verify
For the food insecurity dataset we end up with a 3142 observation x 3 variable data frame, with states written out in the State column and counties listed without additional verbiage in the County column.
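The percent-stripping step deletes ‘%’ and ‘,’ characters before coercing to numeric; applied to a hypothetical raw cell:

#What the FoodInsecurity cleaning does to a hypothetical cell
as.double(gsub("[\\%,]", "", "15.6%")) #returns 15.6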
#Read in sun data and convert to tibble:
sun <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Sunlight.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## State = col_character(),
## County = col_character(),
## `Avg Daily Sunlight` = col_double()
## )
sun <- as_tibble(sun)
#dim(sun) #3161 x 3

#rename column
sun <- sun %>% rename(Sun = `Avg Daily Sunlight`)

#drop excess verbiage from County column
sun$County <- removeWords(sun$County, stopwords)
sun$County <- gsub("(.*),.*", "\\1", sun$County) #remove everything after comma

#head(sun) #verify
For the sun dataset we end up with a 3161 observation x 3 variable data frame, with states written out in the State column and counties listed without additional verbiage (i.e., County, Borough) in the County column.
#Read in unemployment data and convert to tibble:
unemp <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Unemployment.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## State = col_character(),
## area_name = col_character(),
## Unemployment_rate_2016 = col_double(),
## Unemployment_rate_2019 = col_double(),
## `Unemployment_chg_2016-2019` = col_double()
## )
unemp <- as_tibble(unemp)
unemp <- unemp[-1,] #drop State = National
unemp <- subset(unemp, select=-c(3)) #drop 2016

unemp$State <- state.name[match(unemp$State, state.abb)] #convert state abbreviation to name
unemp <- unemp[unemp$area_name != unemp$State,] #drop state observations from area_name column

unemp <- unemp %>% rename(
  County = area_name,
  Unemployment = Unemployment_rate_2019,
  UnemploymentChg = `Unemployment_chg_2016-2019`) #rename columns

unemp$County <- removeWords(unemp$County, stopwords) #drop excess verbiage from County column
unemp$County <- gsub("(.*),.*", "\\1", unemp$County) #remove everything after comma

#dim(unemp) #3274 - 3224 = 50 dropped
#head(unemp) #verify
For the unemployment dataset we end up with a 3224 observation x 4 variable data frame, with states written out in the State column and counties listed without additional verbiage (i.e., County, Borough) in the County column.
#Read in wealth data and convert to tibble:
wealth <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Wealth.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## State = col_character(),
## County = col_character(),
## PovertyRate = col_character(),
## MedianHouseholdIncome = col_character()
## )
wealth <- as_tibble(wealth)
wealth <- wealth[-1,] #drop State = National
#dim(wealth) #3193 x 4

wealth$State <- state.name[match(wealth$State, state.abb)] #convert state abbreviation to name
wealth <- wealth[wealth$County != wealth$State,] #drop state observations from County column
#dim(wealth) #3193 - 3143 = 50 observations dropped

#convert columns to proper type
wealth$PovertyRate <- as.double(wealth$PovertyRate)
## Warning: NAs introduced by coercion
wealth$MedianHouseholdIncome <- as.numeric(gsub(",","",wealth$MedianHouseholdIncome))
## Warning: NAs introduced by coercion
#rename columns
wealth <- wealth %>% rename(
  Poverty = PovertyRate,
  Income = MedianHouseholdIncome)

wealth$County <- removeWords(wealth$County, stopwords) #drop excess verbiage from County column

#head(wealth)
For the wealth dataset we end up with a 3143 observation x 4 variable data frame, with states written out in the State column and counties listed without additional verbiage (i.e., County, Borough) in the County column.
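The two “NAs introduced by coercion” warnings above mean some cells could not be parsed as numbers; a quick sanity check on how many:

#Count the NAs introduced by the coercions above
sum(is.na(wealth$Poverty))
sum(is.na(wealth$Income))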
#Read in population data and convert to tibble:
pop <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/population.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## STNAME = col_character(),
## CTYNAME = col_character(),
## CENSUS2010POP = col_double(),
## POPESTIMATE2019 = col_double(),
## BIRTHS2019 = col_double(),
## DEATHS2019 = col_double(),
## NETMIG2019 = col_double()
## )
pop <- as_tibble(pop)

# rename columns
pop <- pop %>% rename(
  State = STNAME,
  County = CTYNAME,
  Pop_2010 = CENSUS2010POP,
  Population = POPESTIMATE2019,
  Births = BIRTHS2019,
  Deaths = DEATHS2019,
  NetMig = NETMIG2019)

#add population change variable
pop$PopChg <- pop$Population - pop$Pop_2010

pop <- subset(pop, select=-c(3)) #drop 2010
#dim(pop) #3193 x 7

pop <- pop[pop$County != pop$State,] #drop state observations from County column
#dim(pop) #3193 - 3141 = 52 observations dropped

pop$County[1802] <- "Dona Ana County" #fix invalid UTF-8 entry
pop$County <- removeWords(pop$County, stopwords) #drop excess verbiage from County column

#head(pop)
For the population dataset we end up with a 3141 observation x 7 variable data frame, with states written out in the State column and counties listed without additional verbiage (i.e., County, Borough) in the County column.
Place into consistent format with dependent variables (state-county-…)
#MERGE DF's
#1. merge health score and alcohol df's
##trim white space
alcohol$County <- trimws(alcohol$County)
starter_df$County <- trimws(starter_df$County)

##SQL join
df <- sqldf("SELECT *
  FROM starter_df
  LEFT JOIN alcohol ON starter_df.State = alcohol.State AND starter_df.County = alcohol.County")

##remove extra State, County columns
df <- subset(df, select=-c(4,5))

#2. merge heart to df
##trim white space
heart$County <- trimws(heart$County)

##SQL join
df <- sqldf("SELECT *
  FROM df
  LEFT JOIN heart ON df.State = heart.State AND df.County = heart.County")

##remove extra State, County columns
df <- subset(df, select=-c(8,9))

#3. merge education to df
##trim white space
education$County <- trimws(education$County)

##SQL join
df <- sqldf("SELECT *
  FROM df
  LEFT JOIN education ON df.State = education.State AND df.County = education.County")

##remove extra State, County columns
df <- subset(df, select=-c(10,11))

#4. merge eqi to df
##trim white space
eqi$County <- trimws(eqi$County)

##SQL join
df <- sqldf("SELECT *
  FROM df
  LEFT JOIN eqi ON df.State = eqi.State AND df.County = eqi.County")

##remove extra State, County columns
df <- subset(df, select=-c(14,15))

#5. merge food to df
##trim white space
food$County <- trimws(food$County)

##SQL join
df <- sqldf("SELECT *
  FROM df
  LEFT JOIN food ON df.State = food.State AND df.County = food.County")

##remove extra State, County columns
df <- subset(df, select=-c(15,16))

#6. merge sun to df
##trim white space
sun$County <- trimws(sun$County)

##SQL join
df <- sqldf("SELECT *
  FROM df
  LEFT JOIN sun ON df.State = sun.State AND df.County = sun.County")

##remove extra State, County columns
df <- subset(df, select=-c(16,17))

#7. merge unemp to df
##trim white space
unemp$County <- trimws(unemp$County)

##SQL join
df <- sqldf("SELECT *
  FROM df
  LEFT JOIN unemp ON df.State = unemp.State AND df.County = unemp.County")

##remove extra State, County columns
df <- subset(df, select=-c(17,18))

#8. merge wealth to df
##trim white space
wealth$County <- trimws(wealth$County)

##SQL join
df <- sqldf("SELECT *
  FROM df
  LEFT JOIN wealth ON df.State = wealth.State AND df.County = wealth.County")

##remove extra State, County columns
df <- subset(df, select=-c(19,20))

#9. merge pop to df
##trim white space
pop$County <- trimws(pop$County)

##SQL join
df <- sqldf("SELECT *
  FROM df
  LEFT JOIN pop ON df.State = pop.State AND df.County = pop.County")

##remove extra State, County columns
df <- subset(df, select=-c(21,22))
#verify variables and dimensions
head(df)
## State County health_score Hvy HvyPctChg Bng BngPctChg Mortality
## 1 Alabama Autauga 0.4597290 14.7 -2.7 31.1 -1.1 316.36
## 2 Alabama Baldwin 0.5185262 16.6 1.2 30.4 -1.8 272.04
## 3 Alabama Barbour 0.2292573 16.9 17.5 32.9 3.4 255.09
## 4 Alabama Bibb 0.2954267 15.4 1.4 31.9 -2.3 378.09
## 5 Alabama Blount 0.2025867 14.3 3.1 29.4 -3.4 307.90
## 6 Alabama Bullock 0.2151089 18.8 21.4 38.1 15.9 322.56
## MortalityChg LTHighSchool HighSchool SomeCollege College EQI
## 1 -52.47 11.5 33.6 28.4 26.6 0.284681678
## 2 -36.95 9.2 27.7 31.3 31.9 0.754969418
## 3 -45.95 26.8 35.6 26.0 11.6 -0.005637936
## 4 -31.16 20.9 44.9 23.8 10.4 0.304039747
## 5 -39.89 19.5 33.4 34.0 13.1 1.141660571
## 6 -68.95 25.3 40.3 22.3 12.1 0.104423791
## FoodInsecurity Sun Unemployment UnemploymentChg Poverty Income
## 1 15.6 18557.98 2.7 -2.4 12.1 58233
## 2 12.9 19101.34 2.7 -2.6 10.1 59871
## 3 21.9 18642.40 3.8 -4.5 27.1 35972
## 4 15.1 18282.04 3.1 -3.3 20.3 47918
## 5 13.6 17606.36 2.7 -2.7 16.3 52902
## 6 20.5 18662.86 3.6 -3.2 30.0 31906
## Population Births Deaths NetMig PopChg
## 1 55869 624 541 254 1298
## 2 223234 2304 2326 5377 40969
## 3 24686 256 312 -128 -2771
## 4 22394 240 252 41 -521
## 5 57826 651 657 65 504
## 6 10101 109 109 -73 -813
dim(df) #3154 x 25
## [1] 3154 25
With all of our dataframes merged into one master dataframe df, we’re dealing with 3154 observations x 25 variables. The variables are listed above, and a quick glance gives a sense of their value ranges, but we can glean much more via exploratory data analysis (EDA).
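As an aside, the nine sqldf joins (and the manual removal of duplicated key columns after each) could be collapsed into a single fold; a sketch using dplyr and purrr, assuming the cleaned tables above:

#Equivalent sketch: fold every independent-variable table onto starter_df with
#keyed left joins; no duplicate State/County columns are produced this way
library(dplyr)
library(purrr)

tables <- list(alcohol, heart, education, eqi, food, sun, unemp, wealth, pop)
df <- reduce(tables, ~ left_join(.x, .y, by = c("State", "County")), .init = starter_df)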
At this point we’re done with data pre-processing.
—END OF DOCUMENT—