---
title: "Weeks 10 and 11"
subtitle: "Introduction To Geographic Data and Quiz 2 Review"
author: "Penelope Pooler Eisenbies"
date: last-modified
lightbox: true
toc: true
toc-depth: 3
toc-location: left
toc-title: "Table of Contents"
toc-expand: 1
format:
  html:
    code-line-numbers: true
    code-fold: true
    code-tools: true
execute:
  echo: fenced
---
## Housekeeping
```{r include=F}
#|label: setup
knitr::opts_chunk$set(echo=T, highlight=T) # specifies default options for all chunks
options(scipen=100) # suppress scientific notation
# install pacman if needed, then install and load required packages
if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")
pacman::p_load(pacman, tidyverse, ggthemes, gridExtra, magrittr,
kableExtra, RColorBrewer, maps, usdata, countrycode,
mapproj, shadowtext, cowplot, grid)
p_loaded() # verify loaded packages
```
### Upcoming Dates
- **HW 5 - Part 1 was due yesterday (10/29)**
- **Draft Proposals are due Today 10/30 at 6:00 PM**
- **Quiz 2 is on Thursday, 11/6**
  - Mostly on Weeks 5 through 9, but material is cumulative.
  - Mostly on HW Assignments 4 and 5, but material is cumulative.
  - Data Mgmt. tasks will require 2 or 3 steps and then you will answer questions.
- HW 5 - Part 2 will be posted after Quiz 2.
## Today and Tuesday, 11/4
- **Practice Questions for Quiz 2 are posted.**
  - I am recording new videos and will post them by Sunday.
- **Tue. 11/4: Skills/Concepts Review for Quiz 2**
  - Putting skills together for different goals
  - Come with questions or submit them as `Engagement Questions`.
- **Today: Intro to Managing and Plotting Geographic Data**
  - Today's geographic maps will not be on Quiz 2.
  - Mapping data is an effective data curation tool that may be useful in this class, other classes, and your career.
  - We will cover more about geographic data and map visualizations after Quiz 2.
- **If you want help with mapping project data, please reach out to me or a TA**.
## In-class Exercise - Week 10
::::::: columns
:::: {.column width="48%"}
::: fragment
**Purpose:**
:::
- To gain some experience and understanding of map data available in R and elsewhere.
- To experiment with mapping data
- Students are encouraged to use domestic or international map data in their dashboards if appropriate.
::::
::: {.column width="4%"}
:::
::: {.column width="48%"}
{fig.align="center"}
:::
:::::::
## Data Preparation
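- The data for today's in-class exercise come from R packages: map polygons via `map_data()` (which draws on the **maps** package) and county demographics from `county_2019` (**usdata** package).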
- These geographic data are useful if you have information by state or county and you want to show a [choropleth map](https://en.wikipedia.org/wiki/Choropleth_map){target="_blank"} of your data.
- R also has world information, e.g., countries, continents, etc.
::: fragment
```{r us data prep}
us_states <- map_data("state") |> # state polygons (not used today)
rename("state" = "region")
us_counties <- map_data("county") |> # county polygons
rename("state" = "region", "county" = "subregion") |>
mutate(county = gsub(" ", "", county),
county = gsub("'","", county) |> tolower())
#unique(us_counties$county[us_counties$state=="louisiana"]) # note issue Louisiana counties
cnty2019_all <- county_2019
#unique(cnty2019_all$name[cnty2019_all$state=="Louisiana"]) # note issue Louisiana counties
cnty2019_all <- cnty2019_all |>
mutate(state = tolower(state),
county = tolower(name),
county = gsub(" county", "", county),
county = gsub(" parish", "", county),
county = gsub("\\.", "", county), # \\ escapes the period; an unescaped . is a regex wildcard matching any character
county = gsub(" ", "", county),
county = gsub("'","", county)) |>
relocate(county, .before=name)
cnty2019_all <- full_join(us_counties, cnty2019_all,
by = c("state", "county")) # geo data and demographic data, joined on the cleaned keys
```
:::
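
The commented-out `unique()` lines above hint that county names do not always match between the two sources. As a minimal sketch of one way to check the keys before joining, the example below uses small hypothetical tables (`geo_keys` and `demo` are made up for illustration, not the real data) and `anti_join()` to list rows in the demographic data that have no match in the map data.

```r
library(dplyr)

# Hypothetical stand-ins for the map keys and the demographic data
geo_keys <- tibble(state  = c("louisiana", "louisiana", "new york"),
                   county = c("stjohnthebaptist", "orleans", "stlawrence"))

demo <- tibble(state = c("Louisiana", "Louisiana", "New York"),
               name  = c("St. John the Baptist Parish", "Orleans Parish",
                         "St. Lawrence County"),
               pop   = c(43, 391, 108))          # made-up values

# Same cleaning idea as above: lower case, drop suffix, drop periods/spaces/apostrophes
demo_clean <- demo |>
  mutate(state  = tolower(state),
         county = tolower(name),
         county = gsub(" county| parish", "", county),
         county = gsub("[. ']", "", county))

# Rows of demo_clean with no matching state/county key in geo_keys
unmatched <- anti_join(demo_clean, geo_keys, by = c("state", "county"))
nrow(unmatched)
```

If `unmatched` has any rows, those county names still need cleaning before the join.
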
## County Demographic Plots
:::::::::::::::::::: panel-tabset
### [Data & Plot 1]{style="color:blue;"}
::::::: columns
:::: {.column width="48%"}
- Creating a new dataset is not required, but it is helpful
- Plot code could be converted into a function
::: fragment
```{r}
#|label: select data and plot 1 map
cnty_data1 <- cnty2019_all |>
select(long:county, hs_grad, bachelors,
household_has_computer,
household_has_broadband)
cnty_hs_grad <- cnty_data1 |>
ggplot(aes(x=long, y=lat,
group=group, fill=hs_grad)) +
geom_polygon() +
theme_map() +
coord_map("albers", lat0 = 39, lat1 = 45) +
labs(fill= "", title="Percent with High School Degree") +
scale_fill_continuous(type = "viridis") +
theme(plot.background = element_rect(fill = "lightgrey", color = NA),
legend.key.size = unit(.4, 'cm'),
plot.title = element_text(size = 10),
legend.text= element_text(size = 8))
```
:::
::::
::: {.column width="4%"}
:::
::: {.column width="48%"}
Font in plot adjusted for screen.
```{r plot 1 shown, echo=F, fig.dim=c(7,6)}
(cnty_hs_grad1 <- cnty_hs_grad +
theme(plot.background = element_rect(colour = "darkgrey", linewidth=2),
legend.key.size = unit(1, 'cm'),
plot.title = element_text(size = 18),
legend.text= element_text(size = 12)))
```
:::
:::::::
### [Plot 2]{style="color:blue;"}
:::::: columns
::: {.column width="48%"}
```{r}
#|label: plot 2 code
cnty_bachelors <- cnty_data1 |>
ggplot(aes(x=long, y=lat,
group=group,
fill=bachelors)) +
geom_polygon() +
theme_map() +
coord_map("albers", lat0 = 39, lat1 = 45) +
labs(fill= "", title="Percent with Bachelor's Degree") +
scale_fill_continuous(type = "viridis") +
theme(plot.background = element_rect(fill = "lightgrey", color = NA),
legend.key.size = unit(.4, 'cm'),
plot.title = element_text(size = 10),
legend.text= element_text(size = 8))
```
:::
::: {.column width="4%"}
:::
::: {.column width="48%"}
Font in plot adjusted for screen.
```{r plot 2 shown, echo=F, fig.dim=c(7,6)}
(cnty_bachelors1 <- cnty_bachelors +
theme(plot.background = element_rect(colour = "darkgrey", linewidth=2),
legend.key.size = unit(1, 'cm'),
plot.title = element_text(size = 18),
legend.text= element_text(size = 12)))
```
:::
::::::
### [Plot 3]{style="color:blue;"}
:::::: columns
::: {.column width="48%"}
```{r}
#|label: plot 3 code
cnty_computer <- cnty_data1 |>
ggplot(aes(x=long, y=lat,
group=group,
fill=household_has_computer)) +
geom_polygon() +
theme_map() +
coord_map("albers", lat0 = 39, lat1 = 45) +
labs(fill= "", title="Percent of Households with a Computer")+
scale_fill_continuous(type = "viridis") +
theme(plot.background = element_rect(fill = "lightgrey", color = NA),
legend.key.size = unit(.4, 'cm'),
plot.title = element_text(size = 10),
legend.text= element_text(size = 8))
```
:::
::: {.column width="4%"}
:::
::: {.column width="48%"}
Font in plot adjusted for screen
```{r plot 3 shown, echo=F, fig.dim=c(7,6)}
(cnty_computer1 <- cnty_computer +
theme(plot.background = element_rect(colour = "darkgrey", linewidth=2),
legend.key.size = unit(1, 'cm'),
plot.title = element_text(size = 18),
legend.text= element_text(size = 12)))
```
:::
::::::
### [Plot 4]{style="color:blue;"}
:::::: columns
::: {.column width="48%"}
```{r}
#|label: plot 4 code
cnty_brdbd <- cnty_data1 |>
ggplot(aes(x=long, y=lat,
group=group,
fill=household_has_broadband)) +
geom_polygon() +
theme_map() +
coord_map("albers", lat0 = 39, lat1 = 45) +
labs(fill= "", title="Percent of Households with Broadband") +
scale_fill_continuous(type = "viridis") +
theme(plot.background = element_rect(fill = "lightgrey", color = NA),
legend.key.size = unit(.4, 'cm'),
plot.title = element_text(size = 10),
legend.text= element_text(size = 8))
```
:::
::: {.column width="4%"}
:::
::: {.column width="48%"}
Font in plot adjusted for screen
```{r plot 4 shown, echo=F, fig.dim=c(7,6)}
(cnty_brdbd1 <- cnty_brdbd +
theme(plot.background = element_rect(colour = "darkgrey", linewidth=2),
legend.key.size = unit(1, 'cm'),
plot.title = element_text(size = 18),
legend.text= element_text(size = 12)))
```
:::
::::::
::::::::::::::::::::
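
The four map chunks above differ only in the fill variable and the title, so the plot code could be wrapped in a function. The sketch below is one possible version, not the required approach: `county_map()` and the `toy` data are hypothetical names introduced here, and `scale_fill_viridis_c()` is used as a stand-in for the `scale_fill_continuous(type = "viridis")` call in the slides.

```r
library(ggplot2)
library(ggthemes)   # theme_map()

# Hypothetical helper: builds one county choropleth for the column named in fill_var
county_map <- function(data, fill_var, title) {
  ggplot(data, aes(x = long, y = lat, group = group,
                   fill = .data[[fill_var]])) +   # look up the column by its name
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill = "", title = title) +
    scale_fill_viridis_c() +
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, "cm"),
          plot.title = element_text(size = 10),
          legend.text = element_text(size = 8))
}

# Toy data: two square "counties" with made-up percentages
toy <- data.frame(long    = c(-100, -99, -99, -100, -98, -97, -97, -98),
                  lat     = c(40, 40, 41, 41, 40, 40, 41, 41),
                  group   = rep(1:2, each = 4),
                  hs_grad = rep(c(85, 92), each = 4))

toy_map <- county_map(toy, "hs_grad", "Percent with High School Degree")
```

With the real data, a call such as `county_map(cnty_data1, "bachelors", "Percent with Bachelor's Degree")` would follow the same pattern.
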
## Demographic Plot Grid - Order Matters
Grid is populated left to right and then top to bottom unless otherwise specified.
```{r eval=F}
#|label: grid of pct plots
# note alternative to grid.arrange used
# code shown is for export version of plots
grid <- plot_grid(cnty_hs_grad, cnty_computer, cnty_bachelors, cnty_brdbd, ncol=2)
```
```{r grid of 4 demo pct plots for slides, fig.align='center', fig.dim=c(12,6), echo=F}
plot_grid(cnty_hs_grad1, cnty_computer1, cnty_bachelors1, cnty_brdbd1, ncol=2) # screen version
grid <- plot_grid(cnty_hs_grad, cnty_computer, cnty_bachelors, cnty_brdbd, ncol=2) # export version
save_plot("img/grid_plot_example_Penelope_Pooler.png", grid)
```
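
To see how the fill order works, here is a small sketch with four placeholder panels labeled A through D (`make_panel()` is a hypothetical helper used only for this illustration). By default `plot_grid()` fills row by row; setting `byrow = FALSE` fills column by column instead.

```r
library(ggplot2)
library(cowplot)

# Hypothetical helper: an empty panel showing only a letter
make_panel <- function(label) {
  ggplot() +
    annotate("text", x = 0, y = 0, label = label, size = 8) +
    theme_void()
}

panels <- lapply(c("A", "B", "C", "D"), make_panel)

# Default: left to right, then top to bottom (row 1: A B, row 2: C D)
by_row <- plot_grid(plotlist = panels, ncol = 2)

# byrow = FALSE: top to bottom, then left to right (row 1: A C, row 2: B D)
by_col <- plot_grid(plotlist = panels, ncol = 2, byrow = FALSE)
```

In most cases it is simpler to keep the default and list the plots in the order you want them to appear.
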
## In-class Exercise - Plot Grid (Steps 1 & 2)
**Steps to Follow**
1. Examine the available variables in the **`cnty2019_all`** dataset created above.
2. **Create Individual Plots**. Variable names in R dataset and definitions:
   - **`household_has_smartphone`: Households with Smart Phones**
   - **`median_age`: Median Age**
   - **`median_household_income`: Median Household Income**
   - **`median_individual_income`: Median Individual Income**
   - You can copy the provided plot code and modify it for these variables, or you could try to create a function (not required today).
::: fragment
```{r}
#|label: usmaps demographic maps exercise
# select data for plot (not required, but helpful)
# cnty_data2 <-
# create four individual plots, one for each variable
# use provided plot code and modify
```
:::
## In-class Exercise - Plot Grid (Steps 3 & 4)
3. Create a 2x2 plot grid of these four variables by US County as part of your class participation credit for this week.
   - **For full credit, plots must be in the order specified.**
     - **Row 1:** should have smartphone variable and median age.
     - **Row 2:** median household and individual income.
   - Create Plot Grid using `plot_grid` command.
::: fragment
```{r}
#|label: in-class exercise 2x2 plot grid
# for full credit grid of plots must be in order specified
# use plot_grid command
# export plots to img folder using save_plot
```
:::
4. Right click on plot grid, then save as... and save to `img` folder with correct name.
   - or use `save_plot` command
## Mapping Log Transformed Data
:::::: columns
::: {.column width="68%"}
- The previous examples are all percent data.
  - No transformations are needed.
- In contrast, population data or financial data are often right-skewed and need to be **log transformed.**
- Recall from MAS 261, BUA 345 (and perhaps FIN classes):
  - An effective transformation for right-skewed data is the natural log (LN) transformation.
- The following demo shows how useful it is for mapping right-skewed data.
:::
::: {.column width="4%"}
:::
::: {.column width="28%"}

:::
::::::
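
As a quick illustration of why the natural log helps, the sketch below simulates right-skewed values (the data are simulated with `rlnorm()`, not taken from the county data) and compares the mean and median before and after the transformation. For right-skewed data the mean sits well above the median; after the log transformation the two are close.

```r
set.seed(455)

# Simulated right-skewed data (log-normal); purely illustrative
x <- rlnorm(1000, meanlog = 3, sdlog = 1.5)

# Before: a few very large values pull the mean far above the median
raw_summary <- c(mean = mean(x), median = median(x))

# After: on the natural log scale the mean and median are close together
log_summary <- c(mean = mean(log(x)), median = median(log(x)))

raw_summary
log_summary
```
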
## Plot of Skewed Data
:::::: columns
::: {.column width="48%"}
```{r}
#|label: untransformed pop. map code
cnty_data3 <- cnty2019_all |>
select(long:county, pop) |>
mutate(pop1k = pop/1000)
cnty_pop <- cnty_data3 |>
ggplot(aes(x=long, y=lat, group=group, fill=pop1k)) +
geom_polygon() +
theme_map() +
coord_map("albers", lat0 = 39, lat1 = 45) +
labs(fill= "", title="Population by County",
subtitle="Unit is 1000 People") +
scale_fill_continuous(type = "viridis") +
theme(legend.key.width = unit(.4, "cm"))
```
:::
::: {.column width="4%"}
:::
::: {.column width="48%"}
```{r untransformed pop map shown, echo=F, fig.dim=c(7,6)}
cnty_pop +
theme(plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
:::
::::::
## Histogram Clarifies Data Skewness
:::::: columns
::: {.column width="48%"}
```{r}
#|label: untransformed pop hist code
hist_pop <- cnty_data3 |>
ggplot() +
geom_histogram(aes(x=pop1k),
fill="lightblue",
col="darkblue") +
labs(x="Population", title="Histogram of US Population Data",
y = "Count",subtitle="Unit is 1000 People") +
theme_classic()
```
**The Problem:** We see we have skewed data BUT presenting log transformed data in a map may complicate data interpretation.
:::
::: {.column width="4%"}
:::
::: {.column width="48%"}
```{r untransformed pop hist shown, fig.dim=c(7,6), echo=F}
hist_pop +
theme(plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
:::
::::::
## Solution - Natural Log Transformed Plot
::::::: columns
:::: {.column width="48%"}
- The data values are not transformed, but the fill scale and its legend are.
- Options specified in `scale_fill_continuous`:
  - `trans = "log"`
  - `breaks = c(...)`
    - break intervals determined by examining data
::: fragment
```{r}
#|label: log transformed map plot code
cnty_lpop <- cnty_data3 |>
ggplot(aes(x=long, y=lat, group=group, fill=pop1k)) +
geom_polygon() +
theme_map() +
coord_map("albers", lat0 = 39, lat1 = 45) +
labs(fill= "", title="Population by County",
subtitle="Unit is 1000 People and Data are Log-transformed") +
scale_fill_continuous(type = "viridis",trans="log",
breaks=c(1,10,100,1000,10000)) +
theme(legend.key.width = unit(.4, "cm"))
```
:::
::::
::: {.column width="4%"}
:::
::: {.column width="48%"}
```{r log transformed map plot shown, echo=F, fig.dim=c(7,6)}
cnty_lpop +
theme(plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
:::
:::::::
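
The break values for the log scale were chosen by examining the data. One possible way to do that, sketched here with a hypothetical vector `pop1k_example` (made-up values in thousands), is to take the range of the data and keep the powers of ten that fall inside it.

```r
# Hypothetical population values in thousands (not the real county data)
pop1k_example <- c(0.09, 2.5, 48, 730, 10040)

rng <- range(pop1k_example, na.rm = TRUE)

# All powers of ten from just below the minimum to just above the maximum
candidates <- 10^seq(floor(log10(rng[1])), ceiling(log10(rng[2])))

# Keep only the powers of ten that lie within the observed range
log_breaks <- candidates[candidates >= rng[1] & candidates <= rng[2]]
log_breaks
```

The resulting vector could then be passed to the `breaks` argument of the scale. With the real data, `range(cnty_data3$pop1k, na.rm = TRUE)` would be the starting point.
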
## Histogram of Log transformed Data
::::::: columns
:::: {.column width="48%"}
- Final dashboard doesn't include exploratory plots
- Data exploration plots:
  - Histograms, scatterplots, and boxplots are all useful
::: fragment
```{r}
#|label: log transformed pop hist code
hist_lpop <- cnty_data3 |>
ggplot() +
geom_histogram(aes(x=pop1k),fill="lightblue", col="darkblue") +
labs(x="Population",
title="Histogram of Natural Log of US Population Data",
y = "Count",
subtitle="Unit is 1000 People and Data are Log-transformed") +
theme_classic() +
scale_x_continuous(trans="log", breaks=c(1,10,100,1000,10000))
```
:::
::::
::: {.column width="4%"}
:::
::: {.column width="48%"}
**Histogram of Log-transformed Data**
```{r log transformed pop hist shown, echo=F, fig.dim=c(7,6)}
hist_lpop +
theme(plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
:::
:::::::
## Population Plot Grid
```{r eval=F}
#|label: plot code for pop grid
grid.arrange(cnty_pop, hist_pop,cnty_lpop, hist_lpop, ncol=2)
```
```{r grid of all four population plots, fig.align='center', fig.dim=c(14,6), echo=F}
grid.arrange(cnty_pop, hist_pop,cnty_lpop, hist_lpop, ncol=2)
grid.rect(.5,.5,width=unit(.99,"npc"), height=unit(0.99,"npc"), gp=gpar(lwd=3, fill=NA, col="darkgrey"))
```
## When and How to Log Transform
- Log transformations are useful if you have right-skewed POSITIVE data such as
  - Prices
  - Population
  - Sales
  - Income
- Note: If data (x) have zeros, a good option is to use log(x + 1)
  - ln(1) = 0 (In R `log(1)` = 0)
  - 0 values in the data will still be zeros
- In the following example we will create plots for number of households by county:
  - Histograms with and without LN transformation
  - Map plots with and without LN transformation
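
The note about zeros can be checked directly in R. This sketch uses a small made-up vector (`counts` is hypothetical) to show that `log(0)` is `-Inf`, while `log(x + 1)` keeps zeros at zero. Base R also provides `log1p(x)`, which computes the same quantity.

```r
# Hypothetical counts that include a zero
counts <- c(0, 9, 99, 999)

log(counts)        # the zero becomes -Inf, which cannot be plotted
log(counts + 1)    # the zero stays 0; other values are ln(10), ln(100), ln(1000)
log1p(counts)      # base R shortcut for log(x + 1)
```
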
## Number of Households Per County
### Without Transformation
:::::: columns
::: {.column width="48%"}
**Untransformed Data Histogram**
```{r}
#|label: data and untransformed hh data histogram code
cnty_data4 <- cnty2019_all |>
select(long:county, households) |>
mutate(households1K = households/1000)
hist_hholds <- cnty_data4 |>
ggplot() +
geom_histogram(aes(x=households1K),
fill="lightblue",
col="darkblue") +
labs(x="Number of Households", title="Histogram of US Household Data",
y = "Count",subtitle="Unit is 1000 Households") +
theme_classic()
```
:::
::: {.column width="4%"}
:::
::: {.column width="48%"}
**Number of Households is highly right-skewed.**
```{r untransformed hh data histogram shown, echo=F, fig.dim=c(7,6)}
hist_hholds +
theme(plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
:::
::::::
## Number of Households Per County
### With Transformation
:::::: columns
::: {.column width="48%"}
**Log transformed Data Histogram**
```{r}
#|label: log transformed hh data histogram code
hist_lhholds <- cnty_data4 |>
ggplot() +
geom_histogram(aes(x=households1K),
fill="lightblue",
col="darkblue") +
labs(x="Number of Households",
title="Histogram of Natural Log of US Household Data",
y = "Count",
subtitle="Unit is 1000 Households and Data are Log-transformed") +
theme_classic() +
scale_x_continuous(trans="log", breaks=c(1,10,100,1000))
```
:::
::: {.column width="4%"}
:::
::: {.column width="48%"}
**Log-transformed data appear approximately normally distributed**
```{r log transformed hh data histogram shown, echo=F, fig.dim=c(7,6)}
hist_lhholds +
theme(plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
:::
::::::
## Number of Households Per County
:::::: columns
::: {.column width="48%"}
```{r untransformed hh plot map code}
cnty_hholds <- cnty_data4 |>
ggplot(aes(x=long, y=lat, group=group, fill=households1K)) +
geom_polygon() +
theme_map() +
coord_map("albers", lat0 = 39, lat1 = 45) +
labs(fill= "", title="Number of Households by County",
subtitle="Unit is 1000 Households") +
scale_fill_continuous(type = "viridis") +
theme(legend.key.width = unit(.4, "cm"))
```
:::
::: {.column width="4%"}
:::
::: {.column width="48%"}
**Map of untransformed households per county data is uninformative.**
```{r untransformed data hh map shown, echo=F, fig.dim=c(7,6)}
cnty_hholds +
theme(plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
:::
::::::
## In-class Exercise 2
**Log-transformed data map is more informative about geographic variability.**
Submit R code to create log transformed households map in a text (`.txt`) file with your name.
```{r log transformed hh data map, echo=F, fig.align='center', fig.dim=c(14,6.5)}
(cnty_lhholds <- cnty_data4 |>
ggplot(aes(x=long, y=lat, group=group, fill=households1K)) +
geom_polygon() +
theme_map() +
coord_map("albers", lat0 = 39, lat1 = 45) +
labs(fill= "", title="Number of Households by County",
subtitle="Unit is 1000 Households and data are Log-transformed") +
scale_fill_continuous(type = "viridis",
trans="log", breaks=c(1,10,100,1000)) +
theme(legend.key.width = unit(.4, "cm")))
```
## Households per County Plot Grid
```{r household grid plot code, eval=F}
grid.arrange(hist_hholds, hist_lhholds, cnty_hholds, cnty_lhholds, ncol=2)
```
```{r grid of all four household plots, echo = F, fig.align='center', fig.dim=c(14,6)}
grid.arrange(hist_hholds, hist_lhholds, cnty_hholds, cnty_lhholds, ncol=2)
```
## Quiz 2 Information
- Questions and material from Quiz 1 may be on Quiz 2.
- Practice Questions are now posted.
  - Videos will be posted by Sunday 11/2.
- Review Quiz 1 and Quiz 1 Practice Questions.
- Review HW 4, HW 5 - Part 1, and recent lectures.
- Study Tip: Feel free to add extra chunks and notes to the practice questions `.qmd` file so that all of your notes are in one place.
## Quiz 2 Information Cont'd
- Converting text (character) date information to a date using [**lubridate**](https://rawgit.com/rstudio/cheatsheets/main/lubridate.pdf) commands (Week 5)
  - Example R commands: `ymd`, `dmy`, `mdy`, `ym` combined with `paste` to combine columns
- Extracting year, month, or day from the date variable using lubridate commands (Week 6)
  - Example R commands: `year`, `month`, `quarter`, `wday`, `day`
- Converting an `xts` dataset to a tibble (standard R dataset) (Week 7)
  - Creating a lineplot from time series (non-xts) dataset
- Converting a tibble to an `xts` dataset
  - Creating an interactive `hchart`
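
As a short reminder of how the date commands fit together, here is a sketch using a small made-up table (`dates` and its column names are hypothetical). It pastes separate year, month, and day columns into text, converts the text to a date with `ymd`, and then extracts parts of the date.

```r
library(lubridate)

# Hypothetical data with date parts stored in separate columns
dates <- data.frame(yr = c(2025, 2025), mo = c(3, 11), dy = c(15, 6))

# Combine the columns into text such as "2025-3-15", then convert to a date
dates$date <- ymd(paste(dates$yr, dates$mo, dates$dy, sep = "-"))

# Extract parts of the date
dates$year    <- year(dates$date)
dates$month   <- month(dates$date)
dates$quarter <- quarter(dates$date)
dates$day     <- day(dates$date)

# ym() handles text that has only a year and a month
first_of_month <- ym("2025-03")

dates
```
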
## Quiz 2 Info Cont'd
- You should be familiar with the `bls_tidy` function we created and how to use it to import similar datasets.
- There will be datasets to be imported AND combined (joined)
- Data sets can be joined side by side by matching rows on shared key columns.
  - You should know how to do the different joins we covered and what each one does:
    - `full_join`
    - `right_join`
    - `left_join`
    - `inner_join`
- Data sets can be stacked (rows appended) if the columns are identical
  - In BUA 455 we covered `bind_rows`
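
The differences between the joins are easiest to see with two tiny tables. In this sketch, `prices`, `units`, and `more_prices` are hypothetical tables made up for illustration. The key column `id` has values 2 and 3 in both tables, 1 only in `prices`, and 4 only in `units`.

```r
library(dplyr)

# Hypothetical tables sharing the key column id
prices <- tibble(id = c(1, 2, 3), price = c(10, 20, 30))
units  <- tibble(id = c(2, 3, 4), units = c(5, 6, 7))

both    <- inner_join(prices, units, by = "id")  # only ids in both tables: 2, 3
left    <- left_join(prices, units, by = "id")   # all ids from prices: 1, 2, 3
right   <- right_join(prices, units, by = "id")  # all ids from units: 2, 3, 4
everyone <- full_join(prices, units, by = "id")  # all ids from either: 1, 2, 3, 4

# Stacking: same columns, rows appended
more_prices <- tibble(id = c(5, 6), price = c(50, 60))
all_prices  <- bind_rows(prices, more_prices)    # 5 rows, 2 columns
```

Rows without a match in the other table get `NA` in the columns that come from that table.
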
## Quiz 2 Info Cont'd
- Cleaning messy data (Some examples in HW 5)
- Dealing with text (character variables)
  - `gsub`
  - `separate`
  - `unite` or `paste` or `paste0`
  - `ifelse` can be used for text OR for numeric data
  - `ifelse` followed by `factor` allows you to make any categorical variable you want.
- Other commands for modifying text:
  - `tolower` and `toupper`
  - `str_trim`, `str_squish` and `str_pad`
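
Here is a brief sketch that strings several of these text commands together. The `messy` table and its values are made up for illustration; the steps are one possible cleaning sequence, not the only one.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical messy data
messy <- tibble(place = c("  St. Louis,MO ", "new   york,NY"),
                sales = c(120, 45))

clean <- messy |>
  mutate(place = str_squish(place),            # trim ends, collapse repeated spaces
         place = gsub("\\.", "", place)) |>    # remove periods
  separate(place, into = c("city", "state"), sep = ",") |>
  mutate(city  = tolower(city),
         size  = ifelse(sales >= 100, "large", "small"),     # text from numeric data
         size  = factor(size, levels = c("small", "large")), # ordered categories
         label = paste0(city, " (", state, ")"))

# str_pad adds characters (on the left by default) to reach a fixed width
padded <- str_pad("7", width = 3, pad = "0")

clean
```
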
## Additional Text Commands
::: fragment
**Additional skills for Quiz 2 from HW 5 - Part 1:**
:::
- summing across rows using `sum(c_across(...))`
- using `pivot_wider` and then `pivot_longer` and then replacing NAs with 0 to create a 'complete' data set
  - Useful for area plots
- Plotting skills for Quiz 2
  - unformatted line plot, area plot, or grouped bar chart
  - `hchart` in **highcharter** package
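
The two data management skills above can be sketched with a tiny made-up table (`sales` and its values are hypothetical). The first pipeline widens the data, lengthens it again, and replaces the resulting NAs with 0 so that every region has a row for every year. The second sums across the year columns within each row.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: west has no row for 2023
sales <- tibble(region = c("east", "east", "west"),
                year   = c(2023, 2024, 2024),
                units  = c(5, 7, 3))

# 'Complete' long data: every region-year combination, with 0 where data were missing
complete <- sales |>
  pivot_wider(names_from = year, values_from = units) |>
  pivot_longer(cols = -region, names_to = "year", values_to = "units") |>
  mutate(units = replace_na(units, 0))

# Row totals across the year columns
totals <- sales |>
  pivot_wider(names_from = year, values_from = units, values_fill = 0) |>
  rowwise() |>
  mutate(total = sum(c_across(`2023`:`2024`))) |>
  ungroup()
```

Note that after `pivot_longer` the `year` column is text, so it may need to be converted back to a number or date before plotting.
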
::: fragment
**NOT ON QUIZ 2:**
:::
- [**Commands to convert case**](https://stringr.tidyverse.org/reference/case.html)
  - `str_to_title`: Capitalizes the first letter of each word
  - `str_to_sentence`: Capitalizes the first letter of the first word
##
### Key Points
::: fragment
**Introduction to Geographic Data**
:::
- Use skills already covered to
  - clean data and check text variables
  - join datasets
  - create plot grids for comparing variables
- Determine if variables need to be log-transformed.
- Quiz 2 Practice Questions are posted.
- Quiz 2 will be on Thursday, 11/6.
::: fragment
You may submit an 'Engagement Question' about each lecture until midnight on the day of the lecture. **A minimum of four submissions is required during the semester.**
:::