# load data
library(stringr)
comm <- read.csv("hc.csv")
# extract each community's state and assign it to a region of the U.S.
comm$state <- openintro::state2abbr(str_extract(comm$Community, '\\b[^,]+$'))
northeast <- c("CT","ME","MA","NH","RI","VT","NJ","NY","PA")
midwest <- c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD")
south <- c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN","AR","LA","OK","TX")
west <- c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","HI","CA","OR","WA")
comm$region[comm$state %in% northeast] <- "northeast"
comm$region[comm$state %in% midwest] <- "midwest"
comm$region[comm$state %in% south] <- "south"
comm$region[comm$state %in% west] <- "west"
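The four assignments above can also be sketched as a single named lookup vector, which makes it easier to spot any state abbreviation that did not receive a region. This is only a minimal sketch: `region_lists`, `region_lookup` and `toy_states` are illustrative names, and `toy_states` is a made-up input rather than the rankings data.

```r
# Region lists, copied from the vectors defined above
region_lists <- list(
  northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
  midwest   = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
  south     = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN","AR","LA","OK","TX"),
  west      = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","HI","CA","OR","WA")
)

# Named vector: names are state abbreviations, values are region labels
region_lookup <- setNames(
  rep(names(region_lists), lengths(region_lists)),
  unlist(region_lists, use.names = FALSE)
)

# Hypothetical input for illustration only; "PR" is deliberately in no list
toy_states <- c("VA", "CO", "NY", "IA", "PR")
toy_region <- unname(region_lookup[toy_states])

# Unmatched abbreviations come back as NA, so they can be counted explicitly
table(toy_region, useNA = "ifany")
```

With the real data, the same lookup could be applied as `comm$region <- unname(region_lookup[comm$state])`, assuming `comm$state` holds two-letter abbreviations.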
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Is there a significant relationship between a community’s education score and its population health score? In other words, do a community’s population health outcomes (access to care, health behaviours, health conditions, mental health) improve with better education (achievement, infrastructure, participation) in the community?
Also, where are these healthiest communities clustered geographically? Is there a difference in the relationship between education and population health by region of the U.S.?
What are the cases, and how many are there?
There are 500 cases, each representing one of the top 500 “healthiest communities” in the United States. The rankings take several social determinants into account to assess a community’s health.
Describe the method of data collection.
The source provides the data as an .xlsx file. We unmerged some cells in MS Excel and saved the file as a .csv for this analysis.
What type of study is this (observational/experiment)?
This is an observational study as there is no treatment or control group.
If you collected the data, state self-collected. If not, provide a citation/link.
“THE HEALTHIEST Communities rankings from U.S. News & World Report show how nearly 3,000 U.S. counties and county equivalents perform in 81 metrics across 10 health and health-related categories.”
https://www.usnews.com/news/healthiest-communities/rankings
https://www.usnews.com/media/healthiest-communities/2019/top-500-counties.xlsx
What is the response variable? Is it quantitative or qualitative?
The response variable is the community’s population health score, which is quantitative.
You should have two independent variables, one quantitative and one qualitative.
One independent variable is the community’s education score, which is quantitative. The other is the community’s region of the U.S., which is qualitative.
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
First we have some summary statistics for our Population Health score, Education score and region variables.
summary(comm$Population.Health)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   50.20   70.28   75.35   75.70   81.12  100.00
summary(comm$Education)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2.60   47.80   54.75   55.23   63.02  100.00
table(comm$region)
##
##   midwest northeast     south      west
##       249        67        69       115
We also plot Population Health scores against Education scores. The scatterplot suggests a positive relationship; we will examine it, and test its significance, in the final project.
plot(x = comm$Education, y = comm$Population.Health, xlab = "Education score", ylab = "Population Health score")
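A minimal sketch of how the significance of this relationship might later be assessed, using a simple linear regression and a Pearson correlation test from base R. The data frame `toy` is simulated purely so the snippet runs on its own; it is not the rankings data, and the choice of `lm()` is an assumption about the final analysis, not a result.

```r
# Simulated stand-in data (NOT the U.S. News rankings); column names mirror comm
set.seed(1)
toy <- data.frame(Education = runif(100, min = 0, max = 100))
toy$Population.Health <- 60 + 0.3 * toy$Education + rnorm(100, sd = 5)

# Simple linear regression of population health on education
fit <- lm(Population.Health ~ Education, data = toy)
summary(fit)$coefficients   # slope estimate, standard error, t value, p-value

# Pearson correlation test of the same linear association
cor.test(toy$Education, toy$Population.Health)
```

With the real data, `toy` would be replaced by `comm`. Changing the formula to `Population.Health ~ Education * region` is one possible way to let the slope differ by region.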
Finally, we compare the mean Population Health and Education scores of these communities by region.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked _by_ '.GlobalEnv':
##
## midwest
# `fun` replaces the `fun.y` argument, which is deprecated as of ggplot2 3.3.0
ggplot(comm, aes(x = factor(region), y = Population.Health)) + stat_summary(fun = "mean", geom = "bar")
ggplot(comm, aes(x = factor(region), y = Education)) + stat_summary(fun = "mean", geom = "bar")
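The means behind these bar charts can also be tabulated directly. The sketch below uses base R's `aggregate()` on a few made-up rows so that it runs on its own; `toy` and `region_means` are illustrative names and the values are not taken from the rankings.

```r
# Hypothetical rows for illustration only; not values from the rankings
toy <- data.frame(
  region            = c("midwest", "midwest", "south", "west"),
  Population.Health = c(70, 80, 60, 90),
  Education         = c(50, 60, 40, 70)
)

# Mean of both scores within each region
region_means <- aggregate(cbind(Population.Health, Education) ~ region,
                          data = toy, FUN = mean)
region_means
```

Applied to the real data, `data = toy` would become `data = comm`.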