Be sure to change the author in the YAML to your name. Remember to keep it inside the quotes.
Questions that require the use of R will have an R code chunk below them.
Download the Census 2015 dataset (titled census2015.csv) as well as the two image files from the Canvas page for this assignment and save these files to the folder where the RMD file is located.
Remember to change the filepath location in the
read.csv() function to where the .csv dataset is
located on your computer. You can find the filepath by using the
file.choose() function. Once this has been completed, then
you can delete the # at the start of lines 24 and
25.
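As a minimal sketch of the step above, assuming the CSV was saved in the same folder as the RMD file, the read could look like the following. The relative path and the existence check are illustrative assumptions, not part of the assignment template; substitute the path that file.choose() reports on your computer.

```r
# Hypothetical location: replace with the path that file.choose() returns on your machine.
csv_path <- file.path("census2015.csv")

# Read the file only if it is actually there, so a wrong path gives a clear message
# instead of an error halfway through knitting.
if (file.exists(csv_path)) {
  census2015 <- read.csv(csv_path)
} else {
  message("File not found at: ", csv_path)
}
```

If the message appears, the path does not point at the file; running file.choose() in the console and copying its result into csv_path is one way to correct it.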
Examine the Data
For this Challenge Problem assignment, you are going to be using the
2015 census dataset.1 (Note: It is called census2015
in the set-up code chunk.) The data includes demographic and economic
information for each county in the US [Note: States in the US are made
up of counties, which are bigger than a city but smaller than a state].
The documentation for this dataset can be found on Kaggle.
census2015 dataset. [Note: Feel free to use
eval=FALSE in the code chunk options so the output
doesn’t get long.]
dim(census2015)
glimpse(census2015)
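For reference, dim() reports the number of rows first and the number of columns second, so the number of cases and the number of variables can be read off directly. The sketch below uses a small made-up data frame (toy is hypothetical, not the census data) to show the same idea with nrow() and ncol():

```r
# Hypothetical stand-in for census2015: 3 cases (rows) and 2 variables (columns).
toy <- data.frame(
  County = c("Alpha", "Beta", "Gamma"),
  Income = c(41000, 52500, 38750)
)

dims <- dim(toy)      # rows first, then columns
n_cases <- nrow(toy)  # same value as dims[1]
n_vars <- ncol(toy)   # same value as dims[2]
```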
census2015 dataset? How many
variables? [Note: You may have to use R code to figure this out if you
didn’t obtain this information in a previous question.]
census2015 %>%
group_by(State) %>%
summarise(cases = n()) %>%
arrange(desc(cases))
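The group_by(), summarise(), and arrange() steps above can also be written with dplyr's count(), which takes name and sort arguments. This is only an alternative sketch on a made-up data frame (toy is hypothetical), not the required approach:

```r
library(dplyr)

# Hypothetical stand-in: one row per county, labeled by state.
toy <- data.frame(State = c("A", "A", "B", "C", "C", "C"))

# Long form: group, count rows per group, then sort descending.
long_form <- toy %>%
  group_by(State) %>%
  summarise(cases = n()) %>%
  arrange(desc(cases))

# Short form: count() does the grouping, counting, and sorting in one call.
short_form <- toy %>%
  count(State, name = "cases", sort = TRUE)
```

Both versions list the state with the most rows first.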
Practice Problems
Now let’s practice wrangling the data (if need be) and creating data visualizations to answer more in-depth questions. Each question is meant to stand alone and does not build on the others (unless specified).
ggplot(census2015, aes(x = Income)) +
geom_density(fill = "#7a0019", alpha = 0.6, na.rm = TRUE) +
labs(title = "Density of Median Household Income Across U.S. Counties",
x = "Median Household Income",
y = "Density") +
theme_bw()
geom_hex instead.
(you’ll need to make sure the “hexbin” package is installed)]
ggplot(census2015, aes(x = Income, y = MeanCommute)) +
geom_point(alpha = 0.4, color = "#7a0019", na.rm = TRUE) +
labs(title = "Mean Commute Time (in minutes) vs Median Household Income",
x = "Median Household Income",
y = "Mean Commute Time (minutes)") +
theme_bw()
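If the scatterplot is too crowded to read, the geom_hex option mentioned in the note could be sketched as follows. The data here are synthetic (toy is a hypothetical stand-in, and the means and standard deviations are made up); with the real data, census2015 would take its place. The bins value is also an assumption to adjust as needed.

```r
library(ggplot2)

# Synthetic stand-in data (hypothetical values, not the census data).
set.seed(1)
toy <- data.frame(
  Income = rnorm(2000, mean = 47000, sd = 12000),
  MeanCommute = rnorm(2000, mean = 23, sd = 5)
)

# geom_hex() groups the points into hexagonal bins and maps the count in each bin to fill.
# Drawing the plot requires the "hexbin" package to be installed.
p <- ggplot(toy, aes(x = Income, y = MeanCommute)) +
  geom_hex(bins = 30) +
  labs(x = "Median Household Income",
       y = "Mean Commute Time (minutes)",
       fill = "Counties") +
  theme_bw()
```

Printing p draws the plot. Darker or lighter hexagons then show where counties are concentrated, which a cloud of overlapping points can hide.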
ggplot(census2015, aes(x = Income, y = MeanCommute, color = TotalPop, size = TotalPop)) +
# No fixed color here: a constant color in geom_point() would override the color = TotalPop mapping
geom_point(alpha = 0.6, na.rm = TRUE) +
labs(title = "Mean Commute Time (in minutes) vs Median Household Income by Total Population",
x = "Median Household Income",
y = "Mean Commute Time (minutes)",
color = "Total Population",
size = "Total Population") +
scale_size_continuous(labels = scales::comma) +
scale_color_continuous(labels = scales::comma) +
theme_bw()
ggplot(census2015, aes(x = Income, y = MeanCommute)) +
geom_point(alpha = 0.6, color = "#7a0019", na.rm = TRUE) +
labs(title = "Mean Commute Time (in minutes) vs Median Household Income by State",
x = "Median Household Income",
y = "Mean Commute Time (minutes)") +
theme_bw() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
strip.text = element_text(size = 6.5),
plot.title = element_text(size = 15)) +
facet_wrap(~State)
state <- census2015 %>%
group_by(State) %>%
summarise(
income_median = median(Income, na.rm = TRUE),
mean_commute = mean(MeanCommute, na.rm = TRUE),
total = sum(TotalPop, na.rm = TRUE)
)
ggplot(state, aes(x = income_median, y = mean_commute, color = total, size = total)) +
geom_point(na.rm = TRUE) +
geom_text_repel(aes(label = State), vjust = -0.5, size = 2.5, color = "#7a0019", max.overlaps = Inf) +
labs(title = "Mean Commute Time (in minutes) vs Median Household Income by State Total Population",
x = "Median Household Income",
y = "Mean Commute Time (minutes)",
color = "Total Population",
size = "Total Population") +
scale_size_continuous(labels = scales::comma) +
scale_color_continuous(labels = scales::comma) +
theme_bw() +
theme(plot.title = element_text(size = 14))
Putting It All Together
Now let’s put the components of data visualization together to answer a question.
What is the distribution of median income by county for each state?
You are going to explore this in three different data visualizations (in questions 10 to 12). Feel free to customize the aesthetics of each graph as you see fit.
Income) where the state is
reordered by the median of the Income variable like the
plot below: [Note: Be sure to examine the results from each step to make
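sure they reflect what you think they are!]
# na.rm = TRUE is passed through reorder() to median(), so a missing Income value does not turn a state's median into NA
ggplot(census2015, aes(x = reorder(State, Income, FUN = median, na.rm = TRUE), y = Income)) +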
geom_point(alpha = 0.5, color = "#7a0019", na.rm = TRUE) +
labs(title = "Distribution of Median Income by County for Each State",
x = "State",
y = "Median Household Income") +
theme_bw() +
theme(
plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1, size = 6.8)
)
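To check what the reordering step does before plotting, reorder() can be run by itself. The sketch below uses a made-up data frame (toy and its values are hypothetical) that includes one missing income, and looks at the resulting factor levels:

```r
# Hypothetical stand-in data with one missing Income value.
toy <- data.frame(
  State = c("A", "A", "B", "B", "C", "C"),
  Income = c(50, 70, 30, NA, 90, 100)
)

# reorder() computes FUN for each State and sorts the levels by that value.
# Extra arguments such as na.rm = TRUE are passed along to FUN (here, median).
ordered_state <- reorder(toy$State, toy$Income, FUN = median, na.rm = TRUE)

# Group medians are A = 60, B = 30, C = 95, so the levels run from lowest to highest median.
state_levels <- levels(ordered_state)
```

Here state_levels is "B", "A", "C". Without na.rm = TRUE, the median for B would be NA and B would be sorted to the end instead.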
Income) where each state has its own panel.
ggplot(census2015, aes(x = Income)) +
geom_histogram(fill = "#7a0019", color = "white", bins = 20, na.rm = TRUE) +
labs(title = "Distribution of Median Household Income by State",
x = "Median Household Income",
y = "Count") +
theme_bw() +
theme(plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
strip.text = element_text(size = 6.5),
axis.title.x = element_text(size = 10),
axis.text.x = element_text(angle = 45, hjust = 1, size = 5)
) +
facet_wrap(~State)
Income) where each state is located approximately in its
own region of the country. [Note: To do this, you will need the function
facet_geo(~variable, scales="free") from the
{geofacet} package2] An example of this with density plots is
below:
library(geofacet)
census <- census2015 %>%
filter(!State %in% c("Puerto Rico", "District of Columbia"))
ggplot(census, aes(x = Income)) +
geom_histogram(fill = "#7a0019", color = "white", bins = 20, na.rm = TRUE) +
labs(title = "Distribution of Median Household Income by State",
x = "Median Household Income",
y = "Count") +
theme_bw() +
theme(plot.title = element_text(size = 24, face = "bold", hjust = 0.5),
strip.text = element_text(size = 12),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 6),
axis.text.x = element_text(size = 9, angle = 45, hjust = 1)) +
facet_geo(~State, scales = "free")
Resource: Introduction to geofacet.