Weeks 10 and 11

Introduction To Geographic Data and Quiz 2 Review

Author

Penelope Pooler Eisenbies

Published

October 29, 2025

Housekeeping

Upcoming Dates

HW 5 - Part 1 was due yesterday (10/29)
Draft Proposals are due Today 10/30 at 6:00 PM
Quiz 2 is on Thursday, 11/6
- Mostly on Weeks 5 through 9, but material is cumulative.
- Mostly on HW Assignments 4 and 5, but material is cumulative.
- Data Mgmt. tasks will require 2 or 3 steps, and then you will answer questions.
HW 5 - Part 2 will be posted after Quiz 2.

Today and Tuesday, 11/4

Practice Questions for Quiz 2 are posted.
- I am recording new videos and will post them by Sunday.
Tue. 11/4: Skills/Concepts Review for Quiz 2
- Putting skills together for different goals
- Come with questions or submit them as Engagement Questions.
Today: Intro to Managing and Plotting Geographic Data
- Today’s geographic maps will not be on Quiz 2.
- Mapping data is an effective data curation tool that may be useful in this class, other classes, and your career.
We will cover more about geographic data and map visualizations after Quiz 2.
If you want help with mapping project data, please reach out to me or a TA.

In-class Exercise - Week 10

Purpose:

To gain some experience and understanding of map data available in R and elsewhere.
To experiment with mapping data
Students are encouraged to use domestic or international map data in their dashboards if appropriate.

Data Preparation

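The data for today’s in-class exercise come from R packages: state and county polygons from the maps package (accessed through map_data) and county demographics from county_2019 in the usdata package.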
These geographic data are useful if you have information by state or county and you want to show a choropleth map of your data.
R also has world information, e.g., countries, continents, etc.

Code

```{r us data prep}
us_states <- map_data("state") |>               # state polygons (not used today)
  rename("state" = "region")
us_counties <- map_data("county") |>            # county polygons
  rename("state" = "region", "county" = "subregion") |>
  mutate(county = gsub(" ", "", county), 
         county = gsub("'","", county) |> tolower())
#unique(us_counties$county[us_counties$state=="louisiana"]) # note issue Louisiana counties
cnty2019_all <- county_2019
#unique(cnty2019_all$name[cnty2019_all$state=="Louisiana"]) # note issue Louisiana counties

cnty2019_all <- cnty2019_all |> 
  mutate(state = tolower(state),
         county = tolower(name),
         county = gsub(" county", "", county),
         county = gsub(" parish", "", county),
         county = gsub("\\.", "", county),          # \\ escapes the period, which otherwise matches any character in a regex
         county = gsub(" ", "", county),
         county = gsub("'","", county)) |>
  relocate(county, .before=name)

cnty2019_all <- full_join(us_counties,cnty2019_all)  # geo data and demographic data
```

Joining with `by = join_by(state, county)`
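
Before relying on a join like the one above, it can help to confirm that the cleaned keys actually match. The following is a minimal sketch on made-up toy data, not the lecture's method: `clean_county()` is a hypothetical helper that bundles the same `tolower` and `gsub` steps, and `anti_join` lists demographic rows that have no matching polygon key.

```r
library(dplyr)

# Hypothetical helper (an assumption, not from the lecture):
# the same cleaning steps used in the chunk above
clean_county <- function(x) {
  x <- tolower(x)
  x <- gsub(" county| parish", "", x)  # drop the suffixes
  x <- gsub("\\.", "", x)              # \\. matches a literal period
  gsub("[ ']", "", x)                  # drop spaces and apostrophes
}

# Toy stand-ins for us_counties and county_2019 (values are made up)
geo_toy <- tibble(state  = c("louisiana", "maryland"),
                  county = c("stlandry", "princegeorges"))
demo_toy <- tibble(state = c("Louisiana", "Maryland", "New York"),
                   name  = c("St. Landry Parish", "Prince George's County",
                             "Onondaga County"),
                   value = c(1, 2, 3))

demo_clean <- demo_toy |>
  mutate(state = tolower(state), county = clean_county(name))

# Rows in the demographic data with no matching polygon key
unmatched <- anti_join(demo_clean, geo_toy, by = c("state", "county"))
unmatched$county
```

An empty result from `anti_join` would suggest every demographic row found a polygon; any rows returned are names that may still need cleaning.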

County Demographic Plots

Creating a new dataset not required, but helpful
Plot code could be converted into a function

Code

```{r}
#|label: select data and plot 1 map
cnty_data1 <- cnty2019_all |> 
  select(long:county, hs_grad, bachelors, 
         household_has_computer, 
         household_has_broadband) 
cnty_hs_grad <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, fill=hs_grad)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent with High School Degree") +
    scale_fill_continuous(type = "viridis") + 
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, 'cm'),
          plot.title = element_text(size = 10),
          legend.text= element_text(size = 8))
```

Font in plot adjusted for screen.

Code

```{r}
#|label: plot 2 code
cnty_bachelors <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=bachelors)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent with Bachelor's Degree") +
    scale_fill_continuous(type = "viridis") + 
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, 'cm'),
          plot.title = element_text(size = 10),
          legend.text= element_text(size = 8))
```

Font in plot adjusted for screen.

Code

```{r}
#|label:  plot 3 code
cnty_computer <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=household_has_computer)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent of Households with a Computer")+
    scale_fill_continuous(type = "viridis") + 
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, 'cm'),
          plot.title = element_text(size = 10),
          legend.text= element_text(size = 8))
```

Font in plot adjusted for screen

Code

```{r plot 4 code}
#|label:  plot 4 code
cnty_brdbd <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=household_has_broadband)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent of Households with Broadband") +
    scale_fill_continuous(type = "viridis") + 
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, 'cm'),
          plot.title = element_text(size = 10),
          legend.text= element_text(size = 8))
```

Font in plot adjusted for screen
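
The four chunks above differ only in the fill variable and the title, so the plot code could be wrapped in a function, as suggested earlier. The sketch below is one possible version under stated assumptions: `county_map()` and its argument names are hypothetical, and the toy triangle stands in for the county polygons so the example runs on its own.

```r
library(ggplot2)
library(ggthemes)  # provides theme_map()

# Hypothetical helper (an assumption, not from the lecture)
# Drawing the map also needs the mapproj package installed for coord_map()
county_map <- function(data, fill_var, title) {
  ggplot(data, aes(x = long, y = lat, group = group,
                   fill = .data[[fill_var]])) +  # look up the column by name
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill = "", title = title) +
    scale_fill_continuous(type = "viridis") +
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, "cm"),
          plot.title = element_text(size = 10),
          legend.text = element_text(size = 8))
}

# Toy triangle so the sketch runs without the county data (values are made up)
toy <- data.frame(long = c(-90, -89, -89), lat = c(40, 40, 41),
                  group = 1, hs_grad = 85)
toy_map <- county_map(toy, "hs_grad", "Percent with High School Degree")
```

With the lecture data, a call would presumably look like `county_map(cnty_data1, "bachelors", "Percent with Bachelor's Degree")`.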

Demographic Plot Grid - Order Matters

Grid is populated left to right and then top to bottom unless otherwise specified.

Code

```{r eval=F}
#|label: grid of pct plots

# note alternative to grid.arrange used
# code shown is for export version of plots
grid <- plot_grid(cnty_hs_grad, cnty_computer, cnty_bachelors, cnty_brdbd, ncol=2)
```
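
To see how the fill order works, the sketch below builds four labelled placeholder plots and arranges them two ways. The toy plots and object names are assumptions for illustration; `byrow` is the `plot_grid` argument that switches the fill direction.

```r
library(ggplot2)
library(cowplot)

# Four labelled placeholder plots (stand-ins for the county maps)
toy_plots <- lapply(c("A", "B", "C", "D"),
                    function(lbl) ggplot() + labs(title = lbl))

# Default: filled left to right, then top to bottom (A B / C D)
grid_by_row <- plot_grid(plotlist = toy_plots, ncol = 2)

# byrow = FALSE fills top to bottom, then left to right (A C / B D)
grid_by_col <- plot_grid(plotlist = toy_plots, ncol = 2, byrow = FALSE)

# Export with save_plot(); the file name here is a hypothetical example
# save_plot("img/toy_grid.png", grid_by_row)
```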

In-class Exercise - Plot Grid (Steps 1 & 2)

Steps to Follow

Examine the available variables in the cnty2019_all dataset created above.
Create Individual Plots. Variable names in R dataset and definitions:
- household_has_smartphone: Households with Smart Phones
- median_age: Median Age
- median_household_income: Median Household Income
- median_individual_income: Median Individual Income

You can copy provided plot code and modify it for these variables or you could try to create a function (not required today).

Code

```{r}
#|label: usmaps demographic maps exercise
# select data for plot (not required, but helpful)
# cnty_data2 <- 

# create four individual plots, one for each variable
# use provided plot code and modify
```

In-class Exercise - Plot Grid (Steps 3 & 4)

Create 2x2 plot grid of these four variables by US County as part of your class participation credit for this week.

For full credit, plots must be in the order specified.
- Row 1: should have smartphone variable and median age.
- Row 2: median household and individual income.
Create Plot Grid using plot_grid command.

Code

```{r}
#|label: in-class exercise 2x2 plot grid
# for full credit grid of plots must be in order specified
# use plot_grid command
# export plots to img folder using save_plot 
```

Right click on plot grid, then save as… and save to img folder with correct name.

or use save_plot command

Mapping Log Transformed Data

The examples above all use percent data.
No transformations are needed
In contrast, population data or financial data are often right-skewed and need to be log transformed.
Recall from MAS 261, BUA 345 (and perhaps FIN classes):
- An effective transformation for right skewed data is the natural log (LN) transformation.
- The following demo shows how useful it is for mapping right skewed data.
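
A small numeric check shows why the natural log helps with right-skewed values. The numbers below are illustrative only, chosen to span several orders of magnitude; they are not taken from the county data.

```r
# Illustrative values (in thousands), spanning several orders of magnitude
pop1k <- c(1, 10, 100, 1000, 10000)

diff(pop1k)       # gaps on the raw scale grow from 9 to 9000
log(pop1k)        # natural log (LN); log() in R defaults to base e
diff(log(pop1k))  # every gap on the log scale equals log(10)
```

On the raw scale the largest values dominate a color gradient; on the log scale each tenfold increase takes up the same amount of the gradient.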

Plot of Skewed Data

Code

```{r}
#|label: untransformed pop. map code

cnty_data3 <- cnty2019_all |> 
  select(long:county, pop) |>
  mutate(pop1k = pop/1000)
       
cnty_pop <- cnty_data3 |>
    ggplot(aes(x=long, y=lat, group=group, fill=pop1k)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Population by County",
         subtitle="Unit is 1000 People") +
    scale_fill_continuous(type = "viridis") +
    theme(legend.key.width = unit(.4, "cm"))
```

Histogram Clarifies Data Skewness

Code

```{r}
#|label: untransformed pop hist code
hist_pop <- cnty_data3 |>
  ggplot() +
  geom_histogram(aes(x=pop1k),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Population", title="Histogram of US Population Data",
       y = "Count",subtitle="Unit is 1000 People") +
  theme_classic()
```

The Problem: The histogram shows that the data are right-skewed, BUT presenting log-transformed data in a map may complicate interpretation.

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).
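
The two messages above come from `geom_histogram`: the first notes that the default number of bins was used, and the second reports rows dropped because the plotted value was missing or otherwise non-finite. The sketch below uses made-up toy data to show how one might count such rows and set `bins` explicitly. Why the lecture data have 212 such rows is not stated here; rows left unmatched by the `full_join` are one plausible cause.

```r
library(ggplot2)

toy <- data.frame(pop1k = c(2, 5, 8, 20, 150, NA, NA))  # made-up values

# How many rows would the histogram drop?
n_dropped <- sum(!is.finite(toy$pop1k))

# Choosing bins (or binwidth) explicitly avoids the stat_bin() message;
# na.rm = TRUE drops missing values without the warning
hist_toy <- ggplot(toy) +
  geom_histogram(aes(x = pop1k), bins = 5, na.rm = TRUE,
                 fill = "lightblue", col = "darkblue") +
  theme_classic()
```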

Solution - Natural Log Transformed Plot

The data are not transformed, but the plot's axes and scales are.
Options specified in scale_fill_continuous:
- trans = "log"
- breaks = c(...)
Break intervals are determined by examining the data.

Code

```{r}
#|label: log transformed map plot code
cnty_lpop <- cnty_data3 |>
  ggplot(aes(x=long, y=lat, group=group, fill=pop1k)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Population by County",
         subtitle="Unit is 1000 People and Data are Log-transformed") +
    scale_fill_continuous(type = "viridis",trans="log",
                          breaks=c(1,10,100,1000,10000)) +
    theme(legend.key.width = unit(.4, "cm"))
```

Histogram of Log transformed Data

Final dashboard doesn’t include exploratory plots
Data exploration plots:
- Histograms, Scatterplots and boxplots are all useful

Code

```{r}
#|label: log transformed pop hist code
hist_lpop <- cnty_data3 |>
  ggplot() +
  geom_histogram(aes(x=pop1k),fill="lightblue", col="darkblue") +
  labs(x="Population", 
       title="Histogram of Natural Log of US Population Data",
       y = "Count",
       subtitle="Unit is 1000 People and Data are Log-transformed") +
  theme_classic() + 
  scale_x_continuous(trans="log", breaks=c(1,10,100,1000,10000))
```

Histogram of Log-transformed Data

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

Population Plot Grid

Code

```{r eval=F}
#|label: plot code for pop grid
grid.arrange(cnty_pop, hist_pop,cnty_lpop, hist_lpop, ncol=2) 
```

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

When and How to Log Transform

Log transformations are useful if you have right-skewed POSITIVE data such as
- Prices
- Population
- Sales
- Income
Note: If data (x) have zeros, a good option is to use log(x + 1)
- ln(1) = 0 (In R log(1) = 0)
- 0 values in the data will still be zeros
In the following example we will create plots for number of households by county:
- Histograms with and without LN transformation
- Map plots with and without LN transformation
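
The log(x + 1) note above can be checked directly in base R. The values are made up; `log1p` is the base R shortcut for the same calculation.

```r
x <- c(0, 9, 99)   # made-up counts that include a zero

log(x)             # the zero becomes -Inf, which plots drop as non-finite
log(x + 1)         # the zero stays 0
log1p(x)           # base R shortcut for log(1 + x)
```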

Number of Households Per County

Without Transformation

Untransformed Data Histogram

Code

```{r}
#|label: data and untransformed hh data histogram code
#|
cnty_data4 <- cnty2019_all |> 
  select(long:county, households) |>
  mutate(households1K = households/1000)

hist_hholds <- cnty_data4 |>
  ggplot() +
  geom_histogram(aes(x=households1K),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Number of Households", title="Histogram of US Household Data",
       y = "Count",subtitle="Unit is 1000 Households") +
  theme_classic()
```

Number of Households is highly right-skewed.

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

Number of Households Per County

With Transformation

Log transformed Data Histogram

Code

```{r}
#|label: log transformed hh data histogram code
hist_lhholds <- cnty_data4 |>
  ggplot() +
  geom_histogram(aes(x=households1K),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Number of Households", 
       title="Histogram of Natural Log of US Household Data",
       y = "Count",
       subtitle="Unit is 1000 Households and Data are Log-transformed") +
  theme_classic() + 
  scale_x_continuous(trans="log", breaks=c(1,10,100,1000))
```

Log-transformed data appear approximately normally distributed

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

Number of Households Per County

Code

```{r untransformed hh plot map code}
cnty_hholds <- cnty_data4 |>
  ggplot(aes(x=long, y=lat, group=group, fill=households1K)) +
  geom_polygon() +
  theme_map() +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  labs(fill= "", title="Number of Households by County",
       subtitle="Unit is 1000 Households") +
  scale_fill_continuous(type = "viridis") +
  theme(legend.key.width = unit(.4, "cm"))
```

Map of untransformed households per county data is uninformative.

In-class Exercise 2

Log-transformed data map is more informative about geographic variability.

Submit R code to create log transformed households map in a text (.txt) file with your name.

Households per County Plot Grid

Code

```{r household grid plot code, eval=F}
grid.arrange(hist_hholds, hist_lhholds, cnty_hholds, cnty_lhholds, ncol=2)
```

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

Quiz 2 Information

Questions and Material from Quiz 1 may be on Quiz 2
Practice Questions are now posted.
- Videos will be posted by Sunday 11/2.
- Review Quiz 1 and Quiz 1 Practice Questions
- Review HW 4, HW 5 - Part 1, and recent lectures
Study Tip: Feel free to add on to practice questions .qmd file with extra chunks and notes so that all of your notes are in one place.

Quiz 2 Information Cont’d

Converting text (character) date information to a date using lubridate commands (Week 5)
- Example R commands: ymd, dmy, mdy, ym combined with paste to combine columns
Extracting year, month, or day from the date variable using lubridate commands (Week 6)
- Example R commands: year, month, quarter, wday, day
Converting an xts dataset to a tibble (standard R dataset) (Week 7)
- Creating a lineplot from time series (non-xts) dataset
Converting a tibble to an xts dataset
- Creating an interactive hchart

Quiz 2 Info Cont’d

You should be familiar with the bls_tidy function we created and how to use it to import similar datasets.
There will be datasets to be imported AND combined (joined)
- Data sets can be joined by matching rows on shared key columns.
- You should know how to do the different joins we covered and what each one does:
  - full_join
  - right_join
  - left_join
  - inner_join
Data sets can be stacked (rows appended) if the columns are identical
- In BUA 455 we covered bind_rows
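
The differences among the joins listed above are easiest to see on two small tables. The tables below are made up for illustration; `id` is the shared key column.

```r
library(dplyr)

# Made-up tables that share the key column id
a <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
b <- tibble(id = c(2, 3, 4), y = c("B", "C", "D"))

inner_join(a, b, by = "id")  # ids 2, 3: only keys found in both tables
left_join(a, b, by = "id")   # ids 1, 2, 3: all rows of a; y is NA for id 1
right_join(a, b, by = "id")  # ids 2, 3, 4: all rows of b; x is NA for id 4
full_join(a, b, by = "id")   # ids 1 to 4: all keys from either table

bind_rows(a, a)              # stacks rows; clean because the columns match
```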

Quiz 2 Info Cont’d

Cleaning messy data (Some examples in HW 5)
- Dealing with text (character variables)
  - gsub
  - separate
  - unite or paste or paste0
  - ifelse can be used for text OR for numeric data
  - ifelse followed by factor allows you to make any categorical variable you want.
Other commands for modifying text:
- tolower and toupper
- str_trim, str_squish and str_pad
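
The text commands above can be compared side by side on a couple of made-up strings. This is a brief sketch, not a full review of HW 5.

```r
library(stringr)

x <- c("  New   York ", "ohio")     # made-up messy text

str_trim(x)                         # removes leading and trailing spaces only
str_squish(x)                       # also collapses repeated inner spaces
toupper(x[2])                       # converts to upper case
str_pad("7", width = 3, pad = "0")  # pads on the left to width 3
gsub("-", " ", "st-lawrence")       # replaces every match of the pattern

# ifelse followed by factor creates a categorical variable
# and controls the order of its categories
pop <- c(50, 500)
size <- factor(ifelse(pop > 100, "large", "small"),
               levels = c("small", "large"))
```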

Additional Text Commands

Additional skills for Quiz 2 from HW 5 - Part 1:

summing across rows using sum(c_across(...))
using pivot_wider and then pivot_longer and then replacing NAs with 0 to create a ‘complete’ data set
- Useful for area plots
Plotting skills for Quiz 2
- unformatted line plot, area plot, or grouped bar chart
- hchart in highcharter package
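
The row-sum and pivot skills above can be sketched on a tiny made-up dataset with one missing combination. The object names are hypothetical, and `replace_na` is one of several ways to fill the NAs with 0.

```r
library(dplyr)
library(tidyr)

# Made-up long data with one missing combination (2024, type b)
sales <- tibble(year = c(2023, 2023, 2024),
                type = c("a", "b", "a"),
                n    = c(1, 2, 3))

# Widen, then replace the NA created by the missing combination with 0
wide <- sales |>
  pivot_wider(names_from = type, values_from = n) |>
  mutate(across(c(a, b), \(v) replace_na(v, 0)))

# Row totals across the type columns
wide_total <- wide |>
  rowwise() |>
  mutate(total = sum(c_across(a:b))) |>
  ungroup()

# Back to long form: every year now has a row for every type
complete <- wide |>
  pivot_longer(cols = c(a, b), names_to = "type", values_to = "n")
```

The 'complete' version has a row for 2024, type b with a value of 0, which is the gap an area plot would otherwise handle poorly.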

NOT ON QUIZ 2:

Commands to convert case
str_to_title: First letter of each word
str_to_sentence: First letter of first word

Key Points

Introduction to Geographic Data

Use skills already covered to
- clean data and check text variables
- join datasets
- create plot grids for comparing variables
Determine if variables need to be log-transformed.
Quiz 2 Practice Questions are posted.
Quiz 2 will be on Thursday, 11/6.

You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions is required during the semester.

--- title: "Weeks 10 and 11" subtitle: "Introduction To Geographic Data and Quiz 2 Review" author: "Penelope Pooler Eisenbies" date: last-modified lightbox: true toc: true toc-depth: 3 toc-location: left toc-title: "Table of Contents" toc-expand: 1 format: html: code-line-numbers: true code-fold: true code-tools: true execute: echo: fenced --- ## Housekeeping ```{r include=F} #|label: setup knitr::opts_chunk$set(echo=T, highlight=T) # specifies default options for all chunks options(scipen=100) # suppress scientific notation # install pacman if needed if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/") pacman::p_load(pacman, tidyverse, ggthemes, gridExtra, magrittr, kableExtra, RColorBrewer, maps, usdata, countrycode, mapproj, shadowtext, cowplot, grid) # install and load required packages p_loaded() # verify loaded packages ``` ### Upcoming Dates - **HW 5 - Part 1 was due yesterday (10/29)** - **Draft Proposals are due Today 10/30 at 6:00 PM** - **Quiz 2 is on Thursday, 11/6** - Mostly on Weeks 5 through 9, but material is cumulative. - Mostly on HW Assignments 4 and 5, but material is cumulative. - Data Mgmt. tasks will requires 2 or 3 steps and then you will answer questions. - HW 5 - Part 2 will be posted after Quiz 2. ## Today and Tuesday, 11/4 - **Practice Questions for Quiz 2 are posted.** - I am recording new videos and will post them by Sunday. - **Tue. 11/4: Skills/Concepts Review for Quiz 2** - Putting skills together for different goals - Come with questions or submit them as `Engagement Questions`. - **Today: Intro to Managing and Plotting Geographic Data** - Today's geographic maps will not be on Quiz 2. - Mapping data is an effective data curation tool that may be useful in this class, other classes, your career. - We will cover more about geographic data and map visualizations after Quiz 2. - **If you want help with mapping project data, please reach out to me or a TA**. 
## In-class Exercise - Week 10 ::::::: columns :::: {.column width="48%"} ::: fragment **Purpose:** ::: - To gain some experience and understanding of map data available in R and elsewhere. - To experiment with mapping data - Students are encouraged to use domestic or international map data in their dashboards if appropriate. :::: ::: {.column width="4%"} ::: ::: {.column width="48%"} ![](img/owl.png){fig.align="center"} ::: ::::::: ## Data Preparation - The data for today's in-class exercise is part of R. - These geographic data are useful if you have information by state or county and you want to show a [choropleth map](https://en.wikipedia.org/wiki/Choropleth_map){target="_blank"} of your data. - R also has world information, e.g., countries, continents, etc. ::: fragment ```{r us data prep} us_states <- map_data("state") |> # state polygons (not used today) rename("state" = "region") us_counties <- map_data("county") |> # county polygons rename("state" = "region", "county" = "subregion") |> mutate(county = gsub(" ", "", county), county = gsub("'","", county) |> tolower()) #unique(us_counties$county[us_counties$state=="louisiana"]) # note issue Louisiana counties cnty2019_all <- county_2019 #unique(cnty2019_all$name[cnty2019_all$state=="Louisiana"]) # note issue Louisiana counties cnty2019_all <- cnty2019_all |> mutate(state = tolower(state), county = tolower(name), county = gsub(" county", "", county), county = gsub(" parish", "", county), county = gsub("\\.", "", county), # \\ is required because . 
used in R coding county = gsub(" ", "", county), county = gsub("'","", county)) |> relocate(county, .before=name) cnty2019_all <- full_join(us_counties,cnty2019_all) # geo data and demographic data ``` ::: ## County Demographic Plots :::::::::::::::::::: panel-tabset ### [Data & Plot 1]{style="color:blue;"} ::::::: columns :::: {.column width="48%"} - Creating a new dataset not required, but helpful - Plot code could be converted into a function ::: fragment ```{r} #|label: select data and plot 1 map cnty_data1 <- cnty2019_all |> select(long:county, hs_grad, bachelors, household_has_computer, household_has_broadband) cnty_hs_grad <- cnty_data1 |> ggplot(aes(x=long, y=lat, group=group, fill=hs_grad)) + geom_polygon() + theme_map() + coord_map("albers", lat0 = 39, lat1 = 45) + labs(fill= "", title="Percent with High School Degree") + scale_fill_continuous(type = "viridis") + theme(plot.background = element_rect(fill = "lightgrey", color = NA), legend.key.size = unit(.4, 'cm'), plot.title = element_text(size = 10), legend.text= element_text(size = 8)) ``` ::: :::: ::: {.column width="4%"} ::: ::: {.column width="48%"} Font in plot adjusted for screen. 
```{r plot 1 shown, echo=F, fig.dim=c(7,6)} (cnty_hs_grad1 <- cnty_hs_grad + theme(plot.background = element_rect(colour = "darkgrey", linewidth=2), legend.key.size = unit(1, 'cm'), plot.title = element_text(size = 18), legend.text= element_text(size = 12))) ``` ::: ::::::: ### [Plot 2]{style="color:blue;"} :::::: columns ::: {.column width="48%"} ```{r} #|label: plot 2 code cnty_bachelors <- cnty_data1 |> ggplot(aes(x=long, y=lat, group=group, fill=bachelors)) + geom_polygon() + theme_map() + coord_map("albers", lat0 = 39, lat1 = 45) + labs(fill= "", title="Percent with Bachelor's Degree") + scale_fill_continuous(type = "viridis") + theme(plot.background = element_rect(fill = "lightgrey", color = NA), legend.key.size = unit(.4, 'cm'), plot.title = element_text(size = 10), legend.text= element_text(size = 8)) ``` ::: ::: {.column width="4%"} ::: ::: {.column width="48%"} Font in plot adjusted for screen. ```{r plot 2 shown, echo=F, fig.dim=c(7,6)} (cnty_bachelors1 <- cnty_bachelors + theme(plot.background = element_rect(colour = "darkgrey", linewidth=2), legend.key.size = unit(1, 'cm'), plot.title = element_text(size = 18), legend.text= element_text(size = 12))) ``` ::: :::::: ### [Plot 3]{style="color:blue;"} :::::: columns ::: {.column width="48%"} ```{r} #|label: plot 3 code cnty_computer <- cnty_data1 |> ggplot(aes(x=long, y=lat, group=group, fill=household_has_computer)) + geom_polygon() + theme_map() + coord_map("albers", lat0 = 39, lat1 = 45) + labs(fill= "", title="Percent of Households with A Computer")+ scale_fill_continuous(type = "viridis") + theme(plot.background = element_rect(fill = "lightgrey", color = NA), legend.key.size = unit(.4, 'cm'), plot.title = element_text(size = 10), legend.text= element_text(size = 8)) ``` ::: ::: {.column width="4%"} ::: ::: {.column width="48%"} Font in plot adjusted for screen ```{r plot 3 shown, echo=F, fig.dim=c(7,6)} (cnty_computer1 <- cnty_computer + theme(plot.background = element_rect(colour = "darkgrey", 
linewidth=2), legend.key.size = unit(1, 'cm'), plot.title = element_text(size = 18), legend.text= element_text(size = 12))) ``` ::: :::::: ### [Plot 4]{style="color:blue;"} :::::: columns ::: {.column width="48%"} ```{r plot 4 code} #|label: plot 4 code cnty_brdbd <- cnty_data1 |> ggplot(aes(x=long, y=lat, group=group, fill=household_has_broadband)) + geom_polygon() + theme_map() + coord_map("albers", lat0 = 39, lat1 = 45) + labs(fill= "", title="Percent of Households with BroadBand") + scale_fill_continuous(type = "viridis") + theme(plot.background = element_rect(fill = "lightgrey", color = NA), legend.key.size = unit(.4, 'cm'), plot.title = element_text(size = 10), legend.text= element_text(size = 8)) ``` ::: ::: {.column width="4%"} ::: ::: {.column width="48%"} Font in plot adjusted for screen ```{r plot 4 shown, echo=F, fig.dim=c(7,6)} (cnty_brdbd1 <- cnty_brdbd + theme(plot.background = element_rect(colour = "darkgrey", linewidth=2), legend.key.size = unit(1, 'cm'), plot.title = element_text(size = 18), legend.text= element_text(size = 12))) ``` ::: :::::: :::::::::::::::::::: ## Demographic Plot Grid - Order Matters Grid is populated left to right and then top to bottom unless otherwise specified. ```{r eval=F} #|label: grid of pct plots # note alternative to grid.arrange used # code shown is for export version of plots grid <- plot_grid(cnty_hs_grad, cnty_computer, cnty_bachelors, cnty_brdbd, ncol=2, ) ``` ```{r grid of 4 demo pct plots for slides, fig.align='center', fig.dim=c(12,6), echo=F} plot_grid(cnty_hs_grad1, cnty_computer1, cnty_bachelors1, cnty_brdbd1, ncol=2) # screen version grid <- plot_grid(cnty_hs_grad, cnty_computer, cnty_bachelors, cnty_brdbd, ncol=2) # export version save_plot("img/grid_plot_example_Penelope_Pooler.png", grid) ``` ## In-class Exercise - Plot Grid (Steps 1 & 2) **Steps to Follow** 1. Examine the available variables in the **`cnty_2019_all`** dataset saved from R. 2. **Create Individual Plots**. 
Variable names in R dataset and definitions: - **`household_has_smartphone`: Households with Smart Phones** - **`median_age`: Median Age** - **`median_household_income`: Median Household Income** - **`median_individual_income`: Median Individual Income** - You can copy provided plot code and modify it for these variables or you could try to create a function (not required today). ::: fragment ```{r} #|label: usmaps demographic maps exercise # select data for plot (not required, but helpful) # cnty_data2 <- # create four individual plots, one for each variable # use provided plot code and modify ``` ::: ## In-class Exercise - Plot Grid (Steps 3 & 4) 3. Create 2x2 plot grid of these four variables by US County as part of your class participation credit for this week. - **For full credit, plots must be in the order specified.** - **Row 1:** should have smartphone variable and median age. - **Row 2:** median household and individual income. - Create Plot Grid using `plot_grid` command. ::: fragment ```{r} #|label: in-class exercise 2x2 plot grid # for full credit grid of plots must be in order specified # use plot_grid command # export plots to img folder using save_plot ``` ::: 4. Right click on plot grid, then save as... and save to `img` folder with correct name. - or use `save_plot` command ## Mapping Log Transformed Data :::::: columns ::: {.column width="68%"} - The previous examples above are all percent data. - No transformations are needed - In contrast, population data or financial data are often right-skewed and need to be **log transformed.** - Recall from MAS 261, BUA 345 (and perhaps FIN classes): - An effective transformation for right skewed data is the natural log (LN) transformation. - The following demo shows how useful it is for mapping right skewed data. ::: ::: {.column width="4%"} ::: ::: {.column width="28%"} ![](img/beaver.png) ::: :::::: ## Plot of Skewed Data :::::: columns ::: {.column width="48%"} ```{r} #|label: untransformed pop. 
map code cnty_data3 <- cnty2019_all |> select(long:county, pop) |> mutate(pop1k = pop/1000) cnty_pop <- cnty_data3 |> ggplot(aes(x=long, y=lat, group=group, fill=pop1k)) + geom_polygon() + theme_map() + coord_map("albers", lat0 = 39, lat1 = 45) + labs(fill= "", title="Population by County", subtitle="Unit is 1000 People") + scale_fill_continuous(type = "viridis") + theme(legend.key.width = unit(.4, "cm")) ``` ::: ::: {.column width="4%"} ::: ::: {.column width="48%"} ```{r untransformed pop map shown, echo=F, fig.dim=c(7,6)} cnty_pop + theme(plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2)) ``` ::: :::::: ## Histogram Clarifies Data Skewness :::::: columns ::: {.column width="48%"} ```{r} #|label: untransformed pop hist code hist_pop <- cnty_data3 |> ggplot() + geom_histogram(aes(x=pop1k), fill="lightblue", col="darkblue") + labs(x="Population", title="Histogram of US Population Data", y = "Count",subtitle="Unit is 1000 People") + theme_classic() ``` **The Problem:** We see we have skewed data BUT presenting log transformed data in a map may complicate data interpretation. 
::: ::: {.column width="4%"} ::: ::: {.column width="48%"} ```{r untransformed pop hist shown, fig.dim=c(7,6), echo=F} hist_pop + theme(plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2)) ``` ::: :::::: ## Solution - Natural Log Transformed Plot ::::::: columns :::: {.column width="48%"} - Data are not transformed, but data axes and scale are - Options specified in `scale_fill_continuous`: - `trans = "log"` - `breaks = c(...)` - break intervals determined by examining data ::: fragment ```{r} #|label: log transformed map plot code cnty_lpop <- cnty_data3 |> ggplot(aes(x=long, y=lat, group=group, fill=pop1k)) + geom_polygon() + theme_map() + coord_map("albers", lat0 = 39, lat1 = 45) + labs(fill= "", title="Population by County", subtitle="Unit is 1000 People and Date are Log-transformed") + scale_fill_continuous(type = "viridis",trans="log", breaks=c(1,10,100,1000,10000)) + theme(legend.key.width = unit(.4, "cm")) ``` ::: :::: ::: {.column width="4%"} ::: ::: {.column width="48%"} ```{r log transformed map plot shown, echo=F, fig.dim=c(7,6)} cnty_lpop + theme(plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2)) ``` ::: ::::::: ## Histogram of Log transformed Data ::::::: columns :::: {.column width="48%"} - Final dashboard doesn't include exporatory plots - Data exploration plots: - Histograms, Scatterplots and boxplots are all useful ::: fragment ```{r} #|label: log transformed pop hist code hist_lpop <- cnty_data3 |> ggplot() + geom_histogram(aes(x=pop1k),fill="lightblue", col="darkblue") + labs(x="Population", title="Histogram of Natural Log of US Population Data", y = "Count", subtitle="Unit is 1000 People and Data are Log-transformed") + theme_classic() + scale_x_continuous(trans="log", breaks=c(1,10,100,1000,10000)) ``` ::: :::: ::: {.column width="4%"} ::: ::: {.column width="48%"} **Histogram of Log-transformed Data** ```{r log transformed pop hist shown, echo=F, fig.dim=c(7,6)} hist_lpop + theme(plot.background = 
element_rect(colour = "darkgrey", fill=NA, linewidth=2)) ``` ::: ::::::: ## Population Plot Grid ```{r eval=F} #|label: plot code for pop grid grid.arrange(cnty_pop, hist_pop,cnty_lpop, hist_lpop, ncol=2) ``` ```{r grid of all four population plots, fig.align='center', fig.dim=c(14,6), echo=F} grid.arrange(cnty_pop, hist_pop,cnty_lpop, hist_lpop, ncol=2) grid.rect(.5,.5,width=unit(.99,"npc"), height=unit(0.99,"npc"), gp=gpar(lwd=3, fill=NA, col="darkgrey")) ``` ## When and How to Log Transform - Log transformation are useful if you have right skewed POSITIVE data such as - Prices - Population - Sales - Income - Note: If data (x) have zeros, a good option is to use log(x + 1) - ln(1) = 0 (In R `log(1)` = 0) - 0 values in the data will still be zeros - In the following example we will create plots for number of households by county: - Histograms with and without LN transformation - Map plots of with and without LN transformation ## Number of Households Per County ### Without Transformation :::::: columns ::: {.column width="48%"} **Untransformed Data Histogram** ```{r} #|label: data and untransformed hh data histogram code #| cnty_data4 <- cnty2019_all |> select(long:county, households) |> mutate(households1K = households/1000) hist_hholds <- cnty_data4 |> ggplot() + geom_histogram(aes(x=households1K), fill="lightblue", col="darkblue") + labs(x="Number of Households", title="Histogram of US Household Data", y = "Count",subtitle="Unit is 1000 Households") + theme_classic() ``` ::: ::: {.column width="4%"} ::: ::: {.column width="48%"} **Number of Households is highly right-skewed.** ```{r untransformed hh data histogram shown, echo=F, fig.dim=c(7,6)} hist_hholds + theme(plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2)) ``` ::: :::::: ## Number of Households Per County ### With Transformation :::::: columns ::: {.column width="48%"} **Log transformed Data Histogram** ```{r} #|label: log transformed hh data histogram code hist_lhholds <- 
  cnty_data4 |>
  ggplot() +
  geom_histogram(aes(x=households1K), fill="lightblue", col="darkblue") +
  labs(x="Number of Households", 
       title="Histogram of Natural Log of US Household Data",
       y = "Count", 
       subtitle="Unit is 1000 Households and Data are Log-transformed") +
  theme_classic() +
  scale_x_continuous(trans="log", breaks=c(1,10,100,1000))
```
:::

::: {.column width="4%"}
:::

::: {.column width="48%"}
**Log-transformed data appear approximately normally distributed**

```{r log transformed hh data histogram shown, echo=F, fig.dim=c(7,6)}
hist_lhholds + 
  theme(plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
:::
::::::

## Number of Households Per County

:::::: columns
::: {.column width="48%"}
```{r untransformed hh plot map code}
cnty_hholds <- cnty_data4 |>
  ggplot(aes(x=long, y=lat, group=group, fill=households1K)) +
  geom_polygon() +
  theme_map() +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  labs(fill= "", title="Number of Households by County",
       subtitle="Unit is 1000 Households") +
  scale_fill_continuous(type = "viridis") +
  theme(legend.key.width = unit(.4, "cm"))
```
:::

::: {.column width="4%"}
:::

::: {.column width="48%"}
**Map of untransformed households per county data is uninformative.**

```{r untransformed data hh map shown, echo=F, fig.dim=c(7,6)}
cnty_hholds + 
  theme(plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
:::
::::::

## In-class Exercise 2

**The log-transformed data map is more informative about geographic variability.**

Submit R code to create the log-transformed households map in a text (`.txt`) file with your name.

```{r log transformed hh data map, echo=F, fig.align='center', fig.dim=c(14,6.5)}
(cnty_lhholds <- cnty_data4 |>
  ggplot(aes(x=long, y=lat, group=group, fill=households1K)) +
  geom_polygon() +
  theme_map() +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  labs(fill= "", title="Number of Households by County",
       subtitle="Unit is 1000 Households and data are Log-transformed") +
  scale_fill_continuous(type = "viridis", trans="log", 
                        breaks=c(1,10,100,1000)) +
  theme(legend.key.width = unit(.4, "cm")))
```

## Households per County Plot Grid

```{r household grid plot code, eval=F}
grid.arrange(hist_hholds, hist_lhholds, cnty_hholds, cnty_lhholds, ncol=2)
```

```{r grid of all four household plots, echo = F, fig.align='center', fig.dim=c(14,6)}
grid.arrange(hist_hholds, hist_lhholds, cnty_hholds, cnty_lhholds, ncol=2)
```

## Quiz 2 Information

-   Questions and material from Quiz 1 may be on Quiz 2

-   Practice Questions are now posted.

    -   Videos will be posted by Sunday 11/2.

-   Review Quiz 1 and Quiz 1 Practice Questions

-   Review HW 4, HW 5 - Part 1, and recent lectures

-   Study Tip: Feel free to add extra chunks and notes to the practice questions `.qmd` file so that all of your notes are in one place.

## Quiz 2 Information Cont'd

-   Converting text (character) date information to a date using [**lubridate**](https://rawgit.com/rstudio/cheatsheets/main/lubridate.pdf) commands (Week 5)

    -   Example R commands: `ymd`, `dmy`, `mdy`, `ym` combined with `paste` to combine columns

-   Extracting year, month, or day from the date variable using lubridate commands (Week 6)

    -   Example R commands: `year`, `month`, `quarter`, `wday`, `day`

-   Converting an `xts` dataset to a tibble (standard R dataset) (Week 7)

    -   Creating a line plot from a time series (non-xts) dataset

-   Converting a tibble to an `xts` dataset

    -   Creating an interactive `hchart`

## Quiz 2 Info Cont'd

-   You should be familiar with the `bls_tidy` function we created and how to use it to import similar datasets.
-   There will be datasets to be imported AND combined (joined)

-   Data sets can be joined by matching rows on shared key column(s).

    -   You should know how to do the different joins we covered and what each one does:

        -   `full_join`
        -   `right_join`
        -   `left_join`
        -   `inner_join`

-   Data sets can be stacked (one on top of the other) if the columns are identical

    -   In BUA 455 we covered `bind_rows`

## Quiz 2 Info Cont'd

-   Cleaning messy data (Some examples in HW 5)

-   Dealing with text (character variables)

    -   `gsub`
    -   `separate`
    -   `unite` or `paste` or `paste0`

-   `ifelse` can be used for text OR for numeric data

    -   `ifelse` followed by `factor` allows you to make any categorical variable you want.

-   Other commands for modifying text:

    -   `tolower` and `toupper`
    -   `str_trim`, `str_squish` and `str_pad`

## Additional Text Commands

::: fragment
**Additional skills for Quiz 2 from HW 5 - Part 1:**
:::

-   summing across rows using `sum(c_across(...))`

-   using `pivot_wider` and then `pivot_longer` and then replacing NAs with 0 to create a 'complete' data set

    -   Useful for area plots

-   Plotting skills for Quiz 2

    -   unformatted line plot, area plot, or grouped bar chart
    -   `hchart` in **highcharter** package

::: fragment
**NOT ON QUIZ 2:**
:::

-   [**Commands to convert case**](https://stringr.tidyverse.org/reference/case.html)

    -   `str_to_title`: First letter of each word
    -   `str_to_sentence`: First letter of first word

## 

### Key Points

::: fragment
**Introduction to Geographic Data**
:::

-   Use skills already covered to

    -   clean data and check text variables
    -   join datasets
    -   create plot grids for comparing variables

-   Determine if variables need to be log-transformed.

-   Quiz 2 Practice Questions are posted.

-   Quiz 2 will be on Thursday, 11/6.

::: fragment
You may submit an 'Engagement Question' about each lecture until midnight on the day of the lecture. **A minimum of four submissions is required during the semester.**
:::
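
The log(x + 1) note from the "When and How to Log Transform" slide can be sketched in base R. This is a minimal illustration, not part of the course code: the vector `x` is made-up data, and `log1p()` / `expm1()` are the base R functions for log(x + 1) and its inverse.

```r
# Hypothetical right-skewed counts that include zeros (not course data)
x <- c(0, 0, 3, 12, 150, 2400, 98000)

# Plain natural log: zeros become -Inf and cannot be plotted
log_x <- log(x)

# log(x + 1): zeros stay at 0 because ln(1) = 0
log1p_x <- log1p(x)

log_x[1]      # -Inf
log1p_x[1]    # 0

# log1p(x) matches log(x + 1)
all.equal(log1p_x, log(x + 1))

# expm1() reverses the transformation to recover the original values
expm1(log1p_x)
```

In a plot, one option under the same assumption is to create the transformed column first with `mutate()` and then map that column to the axis or fill.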