The purpose of this project is to investigate the impact of a variety of health and lifestyle metrics on county-level health scores.
We set out with the goal of answering the following 3 questions:
More specifically, this document covers the creation of our dependent ‘healthy lifestyle’ metric and the pre-processing of our independent variables. It answers Q1 and documents the compilation of the dataset to be used for this Final Project.
---
In order to create our dependent ‘health score’ variable, we first need to familiarize ourselves with the data at hand.
Life expectancy, obesity, and physical activity data were downloaded from IHME and converted to CSV form, from which a subset of columns was selected as candidates for the creation of a dependent ‘health score’ variable:
- Life expectancy: Male life expectancy, 2010 (years); Female life expectancy, 2010 (years); Difference in male life expectancy, 1985-2010 (years); and Difference in female life expectancy, 1985-2010 (years).
- Obesity: Male obesity prevalence, 2009 (%); Female obesity prevalence, 2009 (%); Difference in male obesity prevalence, 2001-2009 (percentage points); and Difference in female obesity prevalence, 2001-2009 (percentage points).
- Physical activity: Male sufficient physical activity prevalence, 2009 (%); Female sufficient physical activity prevalence, 2009 (%); Difference in male sufficient physical activity prevalence, 2001-2009 (percentage points); and Difference in female sufficient physical activity prevalence, 2001-2009 (percentage points).

We start by exploring life expectancy data at a county level:
glimpse(life_table)
## Rows: 3,142
## Columns: 6
## $ State <chr> "Alabama", "~
## $ County <chr> "Autauga", "~
## $ `Male life expectancy, 2010 (years)` <dbl> 73.3, 75.0, ~
## $ `Female life expectancy, 2010 (years)` <dbl> 78.8, 80.3, ~
## $ `Difference in male life expectancy, 1985-2010 (years)` <dbl> 5.2, 3.8, 5.~
## $ `Difference in female life expectancy, 1985-2010 (years)` <dbl> 1.8, 1.5, 1.~
summary(life_table)
## State County Male life expectancy, 2010 (years)
## Length:3142 Length:3142 Min. :63.90
## Class :character Class :character 1st Qu.:73.00
## Mode :character Mode :character Median :75.00
## Mean :74.71
## 3rd Qu.:76.50
## Max. :81.70
## Female life expectancy, 2010 (years)
## Min. :72.7
## 1st Qu.:78.4
## Median :79.8
## Mean :79.7
## 3rd Qu.:81.1
## Max. :85.0
## Difference in male life expectancy, 1985-2010 (years)
## Min. :-1.500
## 1st Qu.: 2.900
## Median : 3.900
## Mean : 3.919
## 3rd Qu.: 4.900
## Max. :13.000
## Difference in female life expectancy, 1985-2010 (years)
## Min. :-3.500
## 1st Qu.: 0.600
## Median : 1.500
## Mean : 1.528
## 3rd Qu.: 2.400
## Max. : 8.400
We’re dealing with 3,142 observations across 6 variables (2 categorical, 4 numeric) and, based upon the output above, can note that female life expectancy (mean 79.7 years) runs roughly five years higher than male life expectancy (mean 74.71 years), and that both sexes gained longevity between 1985 and 2010 (mean gains of 3.9 and 1.5 years, respectively).
From here, we move on to normalizing our life expectancy data and bringing it to a 0 to 1 scale. To do so we apply the following formula:
\[ Transformed.Values = \frac{Values - Min}{Max - Min} \]

We extract and normalize our variables of interest and then visualize histograms of our original vs. normalized data:
#Extract variables of interest
m1 <- life_table$`Male life expectancy, 2010 (years)`
f1 <- life_table$`Female life expectancy, 2010 (years)`
dm1 <- life_table$`Difference in male life expectancy, 1985-2010 (years)`
df1 <- life_table$`Difference in female life expectancy, 1985-2010 (years)`
#Normalize data scale to be from 0 to 1
n_m1 = (m1-min(m1))/(max(m1)-min(m1))
n_f1 = (f1-min(f1))/(max(f1)-min(f1))
n_dm1 = (dm1-min(dm1))/(max(dm1)-min(dm1))
n_df1 = (df1-min(df1))/(max(df1)-min(df1))
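#Side note (a sketch, not used below): since this same min-max
#transformation recurs throughout, it could be factored into a helper;
#normalize01 is a hypothetical name of our own
normalize01 <- function(x) (x - min(x)) / (max(x) - min(x))
#e.g. n_m1 above could equivalently be computed as normalize01(m1)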
#Histogram of original vs. normalized data
##Life expectancy histograms
par(mfrow=c(2,2))
hist(m1, breaks=10, xlab="Age (years)", col="lightblue", main="Male life expectancy, 2010")
hist(n_m1, breaks=10, xlab="Normalized life expectancy", col="lightblue", main="Male life expectancy, 2010")
hist(f1, breaks=10, xlab="Age (years)", col="lightblue", main="Female life expectancy, 2010")
hist(n_f1, breaks=10, xlab="Normalized life expectancy", col="lightblue", main="Female life expectancy, 2010")
##Longevity improvement histograms
par(mfrow=c(2,2))
hist(dm1, breaks=10, xlab="Change (years)", col="lightblue", main="Male longevity improvement, 1985-2010")
hist(n_dm1, breaks=10, xlab="Normalized change", col="lightblue", main="Male longevity improvement, 1985-2010")
hist(df1, breaks=10, xlab="Change (years)", col="lightblue", main="Female longevity improvement, 1985-2010")
hist(n_df1, breaks=10, xlab="Normalized change", col="lightblue", main="Female longevity improvement, 1985-2010")

From the above plots we observe that min-max normalization preserves the shape of each distribution while mapping it onto the [0, 1] interval; each normalized histogram mirrors its original.
From this point, we aggregate each set of four variables into ONE ‘umbrella’ variable. Since all variables have been normalized up to this point, we add our normalized variables together, normalize the result, and then inspect the corresponding histogram and summary statistics:
#Add normalized variables together
life <- n_m1 + n_dm1 + n_f1 + n_df1
#Normalize activity to 0-1 range
n_life = (life-min(life))/(max(life)-min(life))
#head(n_life)
# Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(life, breaks=10, xlab="Score", col="lightblue", main="Longevity metric")
hist(n_life, breaks=10, xlab="Normalized Score", col="lightblue", main="Longevity metric")
summary(n_life) #slight left skew
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4106 0.5039 0.4970 0.5804 1.0000
Our longevity metric appears to have been properly normalized. We observe a slight left skew and a peak centered at ~0.5-0.6.
When we consult the summary statistics, we verify a mean of 0.4970 and a median of 0.5039.
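As a quick numeric check on that skew claim (a sketch on our part, not output from the original analysis), note that a left-skewed distribution has its mean below its median:

mean(n_life) < median(n_life) #TRUE: 0.4970 < 0.5039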
We continue our EDA by exploring obesity data at a county level:
glimpse(obesity_table)
## Rows: 3,142
## Columns: 6
## $ State <chr> ~
## $ County <chr> ~
## $ `Male obesity prevalence, 2009 (%)` <dbl> ~
## $ `Female obesity prevalence, 2009 (%)` <dbl> ~
## $ `Difference in male obesity prevalence, 2001-2009 (percentage points)` <dbl> ~
## $ `Difference in female obesity prevalence, 2001-2009 (percentage points)` <dbl> ~
summary(obesity_table)
## State County Male obesity prevalence, 2009 (%)
## Length:3142 Length:3142 Min. :17.8
## Class :character Class :character 1st Qu.:34.3
## Mode :character Mode :character Median :36.2
## Mean :35.9
## 3rd Qu.:38.1
## Max. :45.6
## Female obesity prevalence, 2009 (%)
## Min. :17.20
## 1st Qu.:34.80
## Median :37.60
## Mean :38.08
## 3rd Qu.:41.00
## Max. :59.30
## Difference in male obesity prevalence, 2001-2009 (percentage points)
## Min. :-2.900
## 1st Qu.: 5.900
## Median : 7.200
## Mean : 7.171
## 3rd Qu.: 8.475
## Max. :15.800
## Difference in female obesity prevalence, 2001-2009 (percentage points)
## Min. :-1.400
## 1st Qu.: 5.300
## Median : 6.700
## Mean : 6.754
## 3rd Qu.: 8.100
## Max. :16.400
We’re dealing with 3,142 observations across 6 variables (2 categorical, 4 numeric) and, based on the output above, can note that female obesity prevalence (mean 38.08%) runs slightly higher than male (mean 35.9%), and that obesity prevalence rose by roughly 7 percentage points for both sexes between 2001 and 2009.
From here, we move on to normalizing our obesity data and bringing it to a 0 to 1 scale (by applying the formula noted earlier).
We extract and normalize our variables of interest and then visualize histograms of our original vs. normalized data:
#Extract variables of interest
m2 <- obesity_table$`Male obesity prevalence, 2009 (%)`
f2 <- obesity_table$`Female obesity prevalence, 2009 (%)`
dm2 <- obesity_table$`Difference in male obesity prevalence, 2001-2009 (percentage points)`
df2 <- obesity_table$`Difference in female obesity prevalence, 2001-2009 (percentage points)`
#Normalize
n_m2 = (m2-min(m2))/(max(m2)-min(m2))
n_f2 = (f2-min(f2))/(max(f2)-min(f2))
n_dm2 = (dm2-min(dm2))/(max(dm2)-min(dm2))
n_df2 = (df2-min(df2))/(max(df2)-min(df2))
#Histogram of original vs. normalized data
par(mfrow=c(2,2))
hist(m2, breaks=10, xlab="Obesity rate (%)", col="lightblue", main="Male obesity prevalence, 2009")
hist(n_m2, breaks=10, xlab="Normalized obesity rate", col="lightblue", main="Male obesity prevalence, 2009")
hist(f2, breaks=10, xlab="Obesity rate (%)", col="lightblue", main="Female obesity prevalence, 2009")
hist(n_f2, breaks=10, xlab="Normalized obesity rate", col="lightblue", main="Female obesity prevalence, 2009")
par(mfrow=c(2,2))
hist(dm2, breaks=10, xlab="Change (percentage points)", col="lightblue", main="Male obesity increase, 2001-2009")
hist(n_dm2, breaks=10, xlab="Normalized change", col="lightblue", main="Male obesity increase, 2001-2009")
hist(df2, breaks=10, xlab="Change (percentage points)", col="lightblue", main="Female obesity increase, 2001-2009")
hist(n_df2, breaks=10, xlab="Normalized change", col="lightblue", main="Female obesity increase, 2001-2009")

From the above plots we observe that min-max normalization again preserves each distribution's shape while mapping it onto the [0, 1] interval.
From this point, we aggregate each set of four variables into ONE ‘umbrella’ variable. Since all variables have been normalized up to this point, we add our normalized variables together, normalize the result, and then inspect the corresponding histogram and summary statistics:
fat <- n_m2 + n_dm2 + n_f2 + n_df2
#Normalize activity to 0-1 range
n_fat = (fat-min(fat))/(max(fat)-min(fat))
#head(n_fat)
# Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(fat, breaks=10, xlab="Score", col="lightblue", main="Obesity metric")
hist(n_fat, breaks=10, xlab="Normalized Score", col="lightblue", main="Obesity metric")
summary(n_fat) #slight left skew
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.5089 0.5819 0.5783 0.6565 1.0000
Our obesity metric appears to have been properly normalized. We observe a slight left skew and a peak centered at ~0.5-0.6.
When we consult the summary statistics, we verify a mean of 0.5783 and a median of 0.5819.
We continue our EDA by exploring physical activity data at a county level:
glimpse(act_table)
## Rows: 3,142
## Columns: 6
## $ State <chr> ~
## $ County <chr> ~
## $ `Male sufficient physical activity prevalence, 2009 (%)` <dbl> ~
## $ `Female sufficient physical activity prevalence, 2009 (%)` <dbl> ~
## $ `Difference in male sufficient physical activity prevalence, 2001-2009 (percentage points)` <dbl> ~
## $ `Difference in female sufficient physical activity prevalence, 2001-2009 (percentage points)` <dbl> ~
summary(act_table)
## State County
## Length:3142 Length:3142
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Male sufficient physical activity prevalence, 2009 (%)
## Min. :34.4
## 1st Qu.:51.0
## Median :55.5
## Mean :55.0
## 3rd Qu.:58.9
## Max. :75.7
## Female sufficient physical activity prevalence, 2009 (%)
## Min. :27.40
## 1st Qu.:43.80
## Median :49.20
## Mean :48.68
## 3rd Qu.:53.70
## Max. :73.10
## Difference in male sufficient physical activity prevalence, 2001-2009 (percentage points)
## Min. :-11.400
## 1st Qu.: -0.500
## Median : 1.900
## Mean : 1.954
## 3rd Qu.: 4.200
## Max. : 16.700
## Difference in female sufficient physical activity prevalence, 2001-2009 (percentage points)
## Min. :-6.200
## 1st Qu.: 2.300
## Median : 4.500
## Mean : 4.661
## 3rd Qu.: 6.800
## Max. :18.300
We’re dealing with 3,142 observations across 6 variables (2 categorical, 4 numeric) and, based on the output above, can note that male sufficient physical activity prevalence (mean 55.0%) runs higher than female (mean 48.68%), and that both sexes improved between 2001 and 2009 (mean gains of roughly 2.0 and 4.7 percentage points, respectively).
From here, we move on to normalizing our physical activity data and bringing it to a 0 to 1 scale (by applying the formula noted in the Life Expectancy Data section).
We extract and normalize our variables of interest and then visualize histograms of our original vs. normalized data:
#Explore and normalize male physical activity data
m3 <- act_table$`Male sufficient physical activity prevalence, 2009 (%)`
f3 <- act_table$`Female sufficient physical activity prevalence, 2009 (%)`
dm3 <- act_table$`Difference in male sufficient physical activity prevalence, 2001-2009 (percentage points)`
df3 <- act_table$`Difference in female sufficient physical activity prevalence, 2001-2009 (percentage points)`
#Normalized Data
n_m3 = (m3-min(m3))/(max(m3)-min(m3))
n_f3 = (f3-min(f3))/(max(f3)-min(f3))
n_dm3 = (dm3-min(dm3))/(max(dm3)-min(dm3))
n_df3 = (df3-min(df3))/(max(df3)-min(df3))
# Histogram of original vs. normalized data
par(mfrow=c(2,2))
hist(m3, breaks=10, xlab="Physical activity rate (%)", col="lightblue", main="Male activity prevalence, 2009")
hist(n_m3, breaks=10, xlab="Normalized activity rate", col="lightblue", main="Male activity prevalence, 2009")
hist(f3, breaks=10, xlab="Physical activity rate (%)", col="lightblue", main="Female activity prevalence, 2009")
hist(n_f3, breaks=10, xlab="Normalized activity rate", col="lightblue", main="Female activity prevalence, 2009")
par(mfrow=c(2,2))
hist(dm3, breaks=10, xlab="Change (percentage points)", col="lightblue", main="Male activity difference, 2001-2009")
hist(n_dm3, breaks=10, xlab="Normalized change", col="lightblue", main="Male activity difference, 2001-2009")
hist(df3, breaks=10, xlab="Change (percentage points)", col="lightblue", main="Female activity difference, 2001-2009")
hist(n_df3, breaks=10, xlab="Normalized change", col="lightblue", main="Female activity difference, 2001-2009")

From the above plots we observe the same pattern: min-max normalization preserves each distribution's shape while rescaling it onto the [0, 1] interval.
From this point, we aggregate each set of four variables into ONE ‘umbrella’ variable. Since all variables have been normalized up to this point, we add our normalized variables together, normalize the result, and then inspect the corresponding histogram and summary statistics:
active <- n_m3 + n_dm3 + n_f3 + n_df3
#Normalize activity to 0-1 range
n_active = (active-min(active))/(max(active)-min(active))
#head(n_active)
# Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(active, breaks=10, xlab="Score", col="lightblue", main="Physical activity metric")
hist(n_active, breaks=10, xlab="Normalized Score", col="lightblue", main="Physical activity metric")
summary(n_active) #approximately symmetric
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4152 0.5150 0.5086 0.6051 1.0000
Our physical activity metric appears to have been properly normalized. We observe a normal distribution whose peak is centered between 0.5 and 0.6.
When we consult the summary statistics, we verify a mean of 0.5086 and a median of 0.5150.
From here, we move on to creating our dependent ‘healthy lifestyle’ variable as a combination of the longevity, obesity, and physical activity metrics we’ve explored, normalized, and aggregated up to this point. To do so we apply the following formula:
\[ Lifestyle = Normalized.Life - Normalized.Obesity + Normalized.Activity \]
We sum our normalized ‘umbrella’ variables with longevity and physical activity as positive indicators and obesity as a negative indicator. We then normalize the summation to ensure we’re on a 0-1 scale and output the histograms of our summation and our normalized summation for comparison:
lifestyle <- n_life - n_fat + n_active
#Normalize health to 0-1 range
normalized_lifestyle = (lifestyle-min(lifestyle))/(max(lifestyle)-min(lifestyle))
#Histogram of original vs. normalized data
#par(mfrow=c(1,2))
#hist(lifestyle, breaks=10, xlab="Score", col="lightblue", main="Health metric")
hist(normalized_lifestyle, breaks=10, xlab="Normalized Score", col="lightblue", main="Health metric")
summary(normalized_lifestyle)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3708 0.4642 0.4639 0.5546 1.0000
#head(normalized_lifestyle)

For our ‘healthy lifestyle’ metric, we observe an approximately normal distribution whose peak is centered between 0.4 and 0.5.
When we consult the summary statistics, we verify a mean of 0.4639 and a median of 0.4642.
---
As a next step, we utilize our health score metric to filter through county data for the top 10 healthiest counties:
#create new df with state | county | health score
starter_df <- life_table %>%
dplyr::select(1:2)
starter_df$health_score <- normalized_lifestyle
healthiest_counties <- filter(starter_df, `health_score` > 0.895) #top 10
healthiest_counties <- healthiest_counties[order(-healthiest_counties$`health_score`),] #descending order
#head(starter_df)
#nrow(healthiest_counties) #10
kable(healthiest_counties)

| State | County | health_score |
|---|---|---|
| Colorado | Routt | 1.0000000 |
| California | Marin | 0.9889641 |
| Colorado | Pitkin | 0.9833580 |
| Wyoming | Teton | 0.9778477 |
| Colorado | Eagle | 0.9745991 |
| California | San Francisco | 0.9139634 |
| Colorado | Summit | 0.9072650 |
| Colorado | Douglas | 0.9054401 |
| Utah | Summit | 0.9014540 |
| Colorado | Gunnison | 0.8978052 |
From the list above, we see six Colorado counties, two California counties, one Utah county, and one Wyoming county. From this, we can venture a few assumptions regarding factors that might come into play for the healthiest counties: most are affluent, outdoor-oriented mountain or coastal communities, which suggests that income, education, and access to outdoor recreation may play a role.
It will be interesting to see whether these factors hold up once we’ve built our regression model.
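As an aside, the 0.895 cutoff used above was hand-tuned to return exactly ten rows. A threshold-free sketch using dplyr's slice_max (not the code actually run here) would be:

#Take the 10 counties with the highest health score, ties permitting
healthiest_counties <- starter_df %>%
  dplyr::slice_max(health_score, n = 10)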
From this point, we move on to reading in, exploring, and preparing our independent variables, ensuring that the format of each data set aligns with our dependent variable’s State | County | … layout.
Read in independent variable data
#Read in alcohol data, convert to tibble, and drop the irrelevant national-level observation:
alcohol <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/AlcoholConsumption.csv")
alcohol <- as_tibble(alcohol)
alcohol <- alcohol[-1,] #drop State = National
#dim(alcohol) #3178 x 6
alcohol <- alcohol[alcohol$Location != alcohol$State,] #drop state observations from Location column
#dim(alcohol) #3178 - 3127 = 51 observations dropped
#rename columns
alcohol <- alcohol %>% rename(
County = Location,
Hvy = Hvy_2012,
Bng = Binge_2012,
HvyPctChg = `HvyPctChg_2005-2012`,
BngPctChg = `BingePctChg_2005-2012`)
#drop excess verbiage from County column
stopwords <- c("and", "Area", "Borough", "Census", "City", "County", "Division", "Municipality", "Parish")
alcohol$County <- removeWords(alcohol$County, stopwords)
#head(alcohol)

For the Alcohol dataset we end up with a 3,127 observation x 6 variable data frame with states written out in the State column and counties listed without additional verbiage (e.g., County, Borough) in the County column.
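The same read/clean steps (convert to tibble, drop the national row, drop state-level rows, strip county verbiage) recur for each dataset below. As a sketch, they could be factored into a helper; clean_county and its arguments are hypothetical names of our own, and the function assumes tibble, tm's removeWords, and the stopwords vector defined above are in scope:

clean_county <- function(df, county_col = "County", abbrev_state = FALSE) {
  df <- as_tibble(df)
  df <- df[-1, ] #drop the national-level first row
  if (abbrev_state) {
    df$State <- state.name[match(df$State, state.abb)] #abbreviation to full name
  }
  df <- df[df[[county_col]] != df$State, ] #drop state-level rows
  df[[county_col]] <- trimws(removeWords(df[[county_col]], stopwords)) #strip verbiage
  df
}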
#Read in heart data, convert to tibble, and drop the irrelevant national-level observation:
heart <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/CardiovascularDisease.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## State = col_character(),
## Location = col_character(),
## `Mortality Rate, 2005*` = col_character(),
## `Mortality Rate, 2010*` = col_character(),
## `Mortality Rate, 2014*` = col_character()
## )
heart <- as_tibble(heart)
heart <- subset(heart, select = -c(`Mortality Rate, 2010*`)) #drop 2010
heart <- heart[-1,] #drop State = National
#dim(heart) #3193 x 4
# remove State == Location
heart <- heart[heart$Location != heart$State,]
dim(heart) #3193 - 3143 = 50 observations removed
## [1] 3143 4
# retitle columns
heart <- heart %>% rename(
County = Location,
Mortality_2005 = `Mortality Rate, 2005*`,
Mortality_2014 = `Mortality Rate, 2014*`)
# keep only the numeric estimate in the Mortality Rate columns (strip the parenthesized range)
heart$Mortality_2005 <- gsub("\\s*\\([^\\)]+\\)","",as.character(heart$Mortality_2005))
heart$Mortality_2014 <- gsub("\\s*\\([^\\)]+\\)","",as.character(heart$Mortality_2014))
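#e.g. an entry like "305.2 (290.1, 320.6)" becomes "305.2"; the
#parenthesized range shown here is an illustrative guess at the raw format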
#convert columns to proper type
heart$Mortality_2005 <- as.double(heart$Mortality_2005)
heart$Mortality_2014 <- as.double(heart$Mortality_2014)
#drop excess verbiage from County column
heart$County <- removeWords(heart$County, stopwords)
heart$County <- gsub("(.*),.*", "\\1", heart$County) #remove everything after comma
# add Chg column
heart$MortalityChg <- heart$Mortality_2014 - heart$Mortality_2005
#finalize format of df
heart <- subset(heart, select = -c(`Mortality_2005`)) #drop 2005
heart <- heart %>% rename( Mortality = Mortality_2014)
#head(heart)

For the heart dataset we end up with a 3,143 observation x 4 variable data frame (State, County, Mortality, MortalityChg) with states written out in the State column and counties listed without additional verbiage (e.g., County, Borough) in the County column. It’s important to note that the mortality rate is listed per 100,000 residents.
#Read in education data, convert to tibble, and drop the irrelevant national-level observation:
education <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Education.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## State = col_character(),
## `Area name` = col_character(),
## `Percent of adults with less than a high school diploma, 2015-19` = col_double(),
## `Percent of adults with a high school diploma only, 2015-19` = col_double(),
## `Percent of adults completing some college or associate's degree, 2015-19` = col_double(),
## `Percent of adults with a bachelor's degree or higher, 2015-19` = col_double()
## )
education <- as_tibble(education)
education <- education[-1,] #drop State = National
education$State <- state.name[match(education$State, state.abb)] #convert state abbreviation to name
education <- education[education$`Area name` != education$State,] #drop state observations from Area name column
#dim(education) #3281 - 3233 = 48 observations dropped
#rename columns
education <- education %>% rename(
County = `Area name`,
LTHighSchool = `Percent of adults with less than a high school diploma, 2015-19`,
HighSchool = `Percent of adults with a high school diploma only, 2015-19`,
SomeCollege = `Percent of adults completing some college or associate's degree, 2015-19`,
College = `Percent of adults with a bachelor's degree or higher, 2015-19`)
#drop excess verbiage from County column
education$County <- removeWords(education$County, stopwords)
#head(education) #verify

For the Education dataset we end up with a 3,233 observation x 6 variable data frame with states written out in the State column and counties listed without additional verbiage (e.g., County, Borough) in the County column.
#Read in eqi data and convert to tibble:
eqi <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/EnvironmentalQualityIndex.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## State = col_character(),
## County_Name = col_character(),
## socio_index = col_double(),
## built_index = col_double(),
## air_index = col_double(),
## land_index = col_double(),
## water_index = col_double(),
## environmental_quality_index = col_double()
## )
eqi <- as_tibble(eqi)
eqi <- subset(eqi, select = -c(3:7)) #drop indices that makeup EQI score
eqi$State <- state.name[match(eqi$State, state.abb)] #convert state abbreviation to name
#rename columns
eqi <- eqi %>% rename(
County = County_Name,
EQI = environmental_quality_index)
#drop excess verbiage from County column
eqi$County <- removeWords(eqi$County, stopwords)
#head(eqi) #verify
#dim(eqi)

For the EQI dataset we end up with a three-variable data frame (State, County, EQI; the five sub-indices that make up the EQI score were dropped), with states written out in the State column and counties listed without additional verbiage (e.g., County, Borough) in the County column.
#Read in food insecurity data and convert to tibble:
food <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/FoodInsecurity.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## FIPS = col_double(),
## State = col_character(),
## `County, State` = col_character(),
## `2018 Food Insecurity Rate` = col_character()
## )
food <- as_tibble(food)
#head(food)
#drop FIPS
food <- subset(food, select=-c(FIPS))
#dim(food) #3142 x 3: no need to drop observations
#convert State to full name
food$State <- state.name[match(food$State, state.abb)] #convert state abbreviation to name
#rename columns
food <- food %>% rename(
County = `County, State`,
FoodInsecurity = `2018 Food Insecurity Rate`)
#remove excess verbiage from County
food$County <- removeWords(food$County, stopwords)
food$County <- gsub("(.*),.*", "\\1", food$County) #remove everything after comma
#drop % from Food Insecurity and convert to double
food$FoodInsecurity = as.double(gsub("[\\%,]", "", food$FoodInsecurity))
#head(food) #verify

For the food dataset we end up with a 3,142 observation x 3 variable data frame with states written out in the State column and counties listed without additional verbiage (e.g., County, Borough) in the County column.

#Read in sun data and convert to tibble:
sun <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Sunlight.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## State = col_character(),
## County = col_character(),
## `Avg Daily Sunlight` = col_double()
## )
sun <- as_tibble(sun)
#dim(sun) #3161 x 3
#rename column
sun <- sun %>% rename(Sun = `Avg Daily Sunlight`)
#drop excess verbiage from County column
sun$County <- removeWords(sun$County, stopwords)
sun$County <- gsub("(.*),.*", "\\1", sun$County) #remove everything after comma
#head(sun) #verify

For the sun dataset we end up with a 3,161 observation x 3 variable data frame with states written out in the State column and counties listed without additional verbiage (e.g., County, Borough) in the County column.
#Read in unemployment data and convert to tibble:
unemp <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Unemployment.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## State = col_character(),
## area_name = col_character(),
## Unemployment_rate_2016 = col_double(),
## Unemployment_rate_2019 = col_double(),
## `Unemployment_chg_2016-2019` = col_double()
## )
unemp <- as_tibble(unemp)
unemp <- unemp[-1,] #drop State = National
unemp <- subset(unemp, select=-c(3)) #drop 2016
unemp$State <- state.name[match(unemp$State, state.abb)] #convert state abbreviation to name
unemp <- unemp[unemp$area_name != unemp$State,] #drop state observations from Area name column
unemp <- unemp %>% rename(
County = area_name,
Unemployment = Unemployment_rate_2019,
UnemploymentChg = `Unemployment_chg_2016-2019`) #rename columns
unemp$County <- removeWords(unemp$County, stopwords) #drop excess verbiage from County column
unemp$County <- gsub("(.*),.*", "\\1", unemp$County) #remove everything after comma
#dim(unemp) #3274 - 3224 = 50 dropped
#head(unemp) #verify

For the unemployment dataset we end up with a 3,224 observation x 4 variable data frame with states written out in the State column and counties listed without additional verbiage (e.g., County, Borough) in the County column.
#Read in wealth data and convert to tibble:
wealth <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/Wealth.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## State = col_character(),
## County = col_character(),
## PovertyRate = col_character(),
## MedianHouseholdIncome = col_character()
## )
wealth <- as_tibble(wealth)
wealth <- wealth[-1,] #drop State = National
#dim(wealth) #3193 x 4
wealth$State <- state.name[match(wealth$State, state.abb)] #convert state abbreviation to name
wealth <- wealth[wealth$County != wealth$State,] #drop state observations from Area name column
#dim(wealth) #3193 - 3143 = 50 observations dropped
#convert columns to proper type
wealth$PovertyRate <- as.double(wealth$PovertyRate)
## Warning: NAs introduced by coercion
wealth$MedianHouseholdIncome <- as.numeric(gsub(",","",wealth$MedianHouseholdIncome))
## Warning: NAs introduced by coercion
#rename columns
wealth <- wealth %>% rename(
Poverty = PovertyRate,
Income = MedianHouseholdIncome) #rename columns
wealth$County <- removeWords(wealth$County, stopwords) #drop excess verbiage from County column
#head(wealth)

For the wealth dataset we end up with a 3,143 observation x 4 variable data frame with states written out in the State column and counties listed without additional verbiage (e.g., County, Borough) in the County column.
#Read in population data and convert to tibble:
pop <- read_csv("https://raw.githubusercontent.com/Magnus-PS/DATA-698/data/population.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## STNAME = col_character(),
## CTYNAME = col_character(),
## CENSUS2010POP = col_double(),
## POPESTIMATE2019 = col_double(),
## BIRTHS2019 = col_double(),
## DEATHS2019 = col_double(),
## NETMIG2019 = col_double()
## )
pop <- as_tibble(pop)
# rename columns
pop <- pop %>% rename(
State = STNAME,
County = CTYNAME,
Pop_2010 = CENSUS2010POP,
Population = POPESTIMATE2019,
Births = BIRTHS2019,
Deaths = DEATHS2019,
NetMig = NETMIG2019) #rename columns
#add population change variable
pop$PopChg <- pop$Population - pop$Pop_2010
pop <- subset(pop, select=-c(3)) #drop 2010
#dim(pop) #3193 x 7
pop <- pop[pop$County != pop$State,] #drop state observations from Area name column
#dim(pop) #3193 - 3141 = 52 observations dropped
pop$County[1802] <- "Dona Ana County" #invalid UTF-8
pop$County <- removeWords(pop$County, stopwords) #drop excess verbiage from County column
#head(pop)

For the population dataset we end up with a 3,141 observation x 7 variable data frame with states written out in the State column and counties listed without additional verbiage (e.g., County, Borough) in the County column.
Place into a consistent format with the dependent variable (State | County | …)
#MERGE DF's
#1. merge health score and alcohol df's
##trim white space
alcohol$County <- trimws(alcohol$County)
starter_df$County <- trimws(starter_df$County)
##SQL join
df <- sqldf("SELECT *
FROM starter_df
LEFT JOIN alcohol ON starter_df.State = alcohol.State AND starter_df.County = alcohol.County")
##remove extra State, County columns
df <- subset(df, select=-c(4,5))
#2. merge heart to df
##trim white space
heart$County <- trimws(heart$County)
##SQL join
df <- sqldf("SELECT *
FROM df
LEFT JOIN heart ON df.State = heart.State AND df.County = heart.County")
##remove extra State, County columns
df <- subset(df, select=-c(8,9))
#3. merge education to df
##trim white space
education$County <- trimws(education$County)
##SQL join
df <- sqldf("SELECT *
FROM df
LEFT JOIN education ON df.State = education.State AND df.County = education.County")
##remove extra State, County columns
df <- subset(df, select=-c(10,11))
#4. merge eqi to df
##trim white space
eqi$County <- trimws(eqi$County)
##SQL join
df <- sqldf("SELECT *
FROM df
LEFT JOIN eqi ON df.State = eqi.State AND df.County = eqi.County")
##remove extra State, County columns
df <- subset(df, select=-c(14,15))
#5. merge food to df
##trim white space
food$County <- trimws(food$County)
##SQL join
df <- sqldf("SELECT *
FROM df
LEFT JOIN food ON df.State = food.State AND df.County = food.County")
##remove extra State, County columns
df <- subset(df, select=-c(15,16))
#6. merge sun to df
##trim white space
sun$County <- trimws(sun$County)
##SQL join
df <- sqldf("SELECT *
FROM df
LEFT JOIN sun ON df.State = sun.State AND df.County = sun.County")
##remove extra State, County columns
df <- subset(df, select=-c(16,17))
#7. merge unemp to df
##trim white space
unemp$County <- trimws(unemp$County)
##SQL join
df <- sqldf("SELECT *
FROM df
LEFT JOIN unemp ON df.State = unemp.State AND df.County = unemp.County")
##remove extra State, County columns
df <- subset(df, select=-c(17,18))
#8. merge wealth to df
##trim white space
wealth$County <- trimws(wealth$County)
##SQL join
df <- sqldf("SELECT *
FROM df
LEFT JOIN wealth ON df.State = wealth.State AND df.County = wealth.County")
##remove extra State, County columns
df <- subset(df, select=-c(19,20))
#9. merge pop to df
##trim white space
pop$County <- trimws(pop$County)
##SQL join
df <- sqldf("SELECT *
FROM df
LEFT JOIN pop ON df.State = pop.State AND df.County = pop.County")
##remove extra State, County columns
df <- subset(df, select=-c(21,22))
#verify variables and dimensions
head(df)
## State County health_score Hvy HvyPctChg Bng BngPctChg Mortality
## 1 Alabama Autauga 0.4597290 14.7 -2.7 31.1 -1.1 316.36
## 2 Alabama Baldwin 0.5185262 16.6 1.2 30.4 -1.8 272.04
## 3 Alabama Barbour 0.2292573 16.9 17.5 32.9 3.4 255.09
## 4 Alabama Bibb 0.2954267 15.4 1.4 31.9 -2.3 378.09
## 5 Alabama Blount 0.2025867 14.3 3.1 29.4 -3.4 307.90
## 6 Alabama Bullock 0.2151089 18.8 21.4 38.1 15.9 322.56
## MortalityChg LTHighSchool HighSchool SomeCollege College EQI
## 1 -52.47 11.5 33.6 28.4 26.6 0.284681678
## 2 -36.95 9.2 27.7 31.3 31.9 0.754969418
## 3 -45.95 26.8 35.6 26.0 11.6 -0.005637936
## 4 -31.16 20.9 44.9 23.8 10.4 0.304039747
## 5 -39.89 19.5 33.4 34.0 13.1 1.141660571
## 6 -68.95 25.3 40.3 22.3 12.1 0.104423791
## FoodInsecurity Sun Unemployment UnemploymentChg Poverty Income
## 1 15.6 18557.98 2.7 -2.4 12.1 58233
## 2 12.9 19101.34 2.7 -2.6 10.1 59871
## 3 21.9 18642.40 3.8 -4.5 27.1 35972
## 4 15.1 18282.04 3.1 -3.3 20.3 47918
## 5 13.6 17606.36 2.7 -2.7 16.3 52902
## 6 20.5 18662.86 3.6 -3.2 30.0 31906
## Population Births Deaths NetMig PopChg
## 1 55869 624 541 254 1298
## 2 223234 2304 2326 5377 40969
## 3 24686 256 312 -128 -2771
## 4 22394 240 252 41 -521
## 5 57826 651 657 65 504
## 6 10101 109 109 -73 -813
dim(df) #3154 x 25
## [1] 3154 25
With all of our data frames merged into one master data frame, df, we see that we’re dealing with 3,154 observations x 25 variables; note that this slightly exceeds the 3,142 counties we started with, which indicates a handful of duplicate matches in the joins. The variables are listed above, and a quick glance gives a sense of their value ranges, but we can glean much more via exploratory data analysis (EDA).
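As a side note, the nine pairwise sqldf joins above all follow one pattern, and the by= argument of dplyr's left_join would avoid the duplicated State/County columns that we had to drop manually. A sketch of an equivalent formulation (not the code actually run here):

#Fold each cleaned data frame into the starter via repeated left joins
library(dplyr)
df_alt <- Reduce(
  function(x, y) left_join(x, y, by = c("State", "County")),
  list(starter_df, alcohol, heart, education, eqi,
       food, sun, unemp, wealth, pop)
)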
At this point we’re done with data pre-processing.
—END OF DOCUMENT—