Week 10

Introduction To Geographic Data and Quiz 2 Information

Author

Penelope Pooler Eisenbies

Published

March 23, 2026

Housekeeping

Upcoming Dates

HW 5 - Part 1 is due tomorrow, 3/25
Rough Draft Proposals are due on Thursday, 3/26 at 6:00 PM
Project presentations will take place on April 23rd in class (4 weeks).
HW 5 - Part 2 will be posted after Quiz 2.
Quiz 2 is on Tuesday 4/7
- Mostly on Weeks 5 through 9, but material is cumulative.
- Mostly on HW Assignments 4 and 5, but material is cumulative.
- Data Mgmt. tasks will require 2 or 3 steps and then you will answer questions.

Next Few Lectures

Practice Questions and Demo Videos for Quiz 2 are available.
Next Tuesday: Skills/Concepts Review for Quiz 2
Putting skills together for different goals
Come with questions or submit them in Engagement Questions
Intro to Managing and Plotting Geographic Data
- Today’s geographic maps will not be on Quiz 2.
- Mapping data is an effective data curation tool that may be useful in this class, other classes, and your career.
We will cover more about geographic data and map visualizations in upcoming lectures.
If you want help with mapping project data, please reach out to me.

In-class Exercise - Week 10

Purpose:

To gain some experience and understanding of map data available in R and elsewhere.
To experiment with mapping data
Students are encouraged to use domestic or international map data in their dashboards if appropriate.

Data Preparation

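The data for today’s in-class exercise come from R packages: the map polygons are returned by map_data (which uses the maps package) and the county demographics are the county_2019 dataset in the usdata package.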
These geographic data are useful if you have information by state or county and you want to show a choropleth map of your data.
R also has world information, e.g., countries, continents, etc.

```{r us data prep}
us_states <- map_data("state") |>               # state polygons (not used today)
  rename("state" = "region")
us_counties <- map_data("county") |>            # county polygons
  rename("state" = "region", "county" = "subregion") |>
  mutate(county = gsub(" ", "", county), 
         county = gsub("'","", county) |> tolower())
#unique(us_counties$county[us_counties$state=="louisiana"]) # note issue Louisiana counties
cnty2019_all <- county_2019
#unique(cnty2019_all$name[cnty2019_all$state=="Louisiana"]) # note issue Louisiana counties

cnty2019_all <- cnty2019_all |> 
  mutate(state = tolower(state),
         county = tolower(name),
         county = gsub(" county", "", county),
         county = gsub(" parish", "", county),
         county = gsub("\\.", "", county),          # \\. matches a literal period; a bare . is a regex wildcard
         county = gsub(" ", "", county),
         county = gsub("'","", county)) |>
  relocate(county, .before=name)

cnty2019_all <- full_join(us_counties,cnty2019_all)  # geo data and demographic data
```

Joining with `by = join_by(state, county)`
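
The join only matches rows where the cleaned state and county text are identical in both datasets, so it helps to check for unmatched keys. The sketch below is an illustration, not part of the lecture code: clean_county is a hypothetical helper that repeats the cleaning steps above, the toy tibbles and their hs_grad values are made up, and anti_join is used to list polygon names that found no demographic match.

```r
library(dplyr)

# Hypothetical helper (not from the lecture): same cleaning steps as above
clean_county <- function(x) {
  x <- tolower(x)
  x <- gsub(" county| parish", "", x)   # drop the suffix
  x <- gsub("\\.", "", x)               # literal period
  x <- gsub("[ ']", "", x)              # spaces and apostrophes
  x
}

# A bare "." matches any character, so this strips the whole string
gsub(".", "", "st. john")

# Toy stand-ins for the polygon data and the demographic data (values made up)
geo <- tibble(
  state  = c("louisiana", "louisiana", "maryland"),
  county = clean_county(c("st john the baptist", "orleans", "prince georges"))
)
demo <- tibble(
  state   = c("louisiana", "louisiana", "maryland"),
  county  = clean_county(c("St. John the Baptist Parish", "Orleans Parish",
                           "Prince George's County")),
  hs_grad = c(80, 85, 87)
)

# Rows of geo with no match in demo; zero rows means every name lined up
unmatched <- anti_join(geo, demo, by = join_by(state, county))
nrow(unmatched)
```

Running the same anti_join on us_counties and the cleaned demographic data would show which county names still differ.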

County Demographic Plots

Creating a new dataset is not required, but it is helpful
Plot code could be converted into a function

```{r}
#|label: select data and plot 1 map
cnty_data1 <- cnty2019_all |> 
  select(long:county, hs_grad, bachelors, 
         household_has_computer, 
         household_has_broadband) 
cnty_hs_grad <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, fill=hs_grad)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", 
         title="Percent with High School Degree") +
    scale_fill_continuous(type = "viridis") 
```

```{r}
#|label: plot 2 code
cnty_bachelors <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=bachelors)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent with Bachelor's Degree") +
    scale_fill_continuous(type = "viridis")
```

Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.

```{r}
#|label:  plot 3 code
cnty_computer <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=household_has_computer)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent of Households with a Computer")+
    scale_fill_continuous(type = "viridis")
```

```{r plot 4 code}
cnty_brdbd <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=household_has_broadband)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent of Households with Broadband") +
    scale_fill_continuous(type = "viridis")
```
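
The four plot blocks above differ only in the fill variable and the title, which is why the plot code could be converted into a function. The sketch below is one possible version, not the lecture's own code: plot_county_map is a hypothetical name, the column is passed as text and looked up with .data[[ ]], and the toy data frame (two square "counties" with made-up values) is there only so the example runs on its own.

```r
library(ggplot2)
library(ggthemes)   # theme_map()

# Hypothetical helper: one county choropleth for the column named in fill_var
plot_county_map <- function(data, fill_var, title) {
  ggplot(data, aes(x = long, y = lat, group = group,
                   fill = .data[[fill_var]])) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +   # requires the mapproj package
    labs(fill = "", title = title) +
    scale_fill_continuous(type = "viridis")
}

# Toy data: two square polygons with made-up values
toy <- data.frame(
  long    = c(-100, -99, -99, -100, -99, -98, -98, -99),
  lat     = c(40, 40, 41, 41, 40, 40, 41, 41),
  group   = rep(c(1, 2), each = 4),
  hs_grad = rep(c(85, 92), each = 4)
)

toy_map <- plot_county_map(toy, "hs_grad", "Percent with High School Degree")
```

With the lecture data, a call such as plot_county_map(cnty_data1, "bachelors", "Percent with Bachelor's Degree") would stand in for one of the blocks above.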

Demographic Plot Grid - Order Matters

Grid is populated left to right and then top to bottom unless otherwise specified.

```{r eval=F}
#|label: grid of pct plots
grid.arrange(cnty_hs_grad, cnty_computer, cnty_bachelors, cnty_brdbd, ncol=2)
```
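
To illustrate "unless otherwise specified": gridExtra accepts a layout_matrix argument that overrides the default fill order. The sketch below uses four placeholder plots rather than the county maps, and arrangeGrob (which builds the grid without drawing it) in place of grid.arrange.

```r
library(ggplot2)
library(gridExtra)

# Four placeholder plots, labelled 1 to 4, standing in for the county maps
plots <- lapply(1:4, function(i) {
  ggplot() + labs(title = paste("Plot", i)) + theme_void()
})

# Default order: left to right, then top to bottom (row 1 holds plots 1 and 2)
by_row <- arrangeGrob(grobs = plots, ncol = 2)

# layout_matrix sets the position of each plot; matrix() fills by column,
# so column 1 holds plots 1 and 2 and column 2 holds plots 3 and 4
by_col <- arrangeGrob(grobs = plots, layout_matrix = matrix(1:4, nrow = 2))

# grid::grid.draw(by_col)   # uncomment to draw the grid
```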

In-class Exercise - Plot Grid (Steps 1 & 2)

Steps to Follow

Examine the available variables in the cnty2019_all dataset created in the Data Preparation step.
Create individual plots.

Variable names in R dataset and definitions:
- household_has_smartphone: Households with Smart Phones
- median_age: Median Age
- median_household_income: Median Household Income
- median_individual_income: Median Individual Income

You can copy provided plot code and modify it for these variables or you can try to create a function (not required today).

```{r}
#|label: usmaps demographic maps exercise
# select data for plot (not required, but helpful)
# cnty_data2 <- 

# create four individual plots, one for each variable
# use provided plot code and modify
```

In-class Exercise - Plot Grid (Steps 3 & 4)

Create a 2x2 plot grid of these four variables by US County as part of your class participation credit for this week.

For full credit, plots must be in the order specified.
- Row 1: should have smartphone variable and median age.
- Row 2: median household and individual income.
Create the plot grid using grid.arrange

```{r}
#|label: in-class exercise 2x2 plot grid
# for full credit grid of plots must be in order specified
# border around all four plots not required
```

Right click on plot grid, then save as… and save to img folder with correct name.

Note: ggsave saves the last ggplot by default, so it will not capture a grid drawn by grid.arrange unless the grid object is passed to it explicitly.
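
A possible alternative to right-click saving, offered as a sketch rather than the required method for this exercise: build the grid as an object with arrangeGrob and pass that object to the plot argument of ggsave. The plots use the built-in mtcars data, and the file name and dimensions are placeholders.

```r
library(ggplot2)
library(gridExtra)

p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

# arrangeGrob() builds the grid as an object without drawing it
grid_obj <- arrangeGrob(p1, p2, ncol = 2)

# Pass the grid object explicitly; the path here is a temporary placeholder
out_file <- file.path(tempdir(), "example_grid.png")
ggsave(out_file, plot = grid_obj, width = 10, height = 4)
```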

Mapping Log Transformed Data

The previous examples all use percent data.
No transformations are needed.
In contrast, population data or financial data are often right-skewed and need to be log transformed.
Recall from MAS 261, BUA 345 (and perhaps FIN classes):
- An effective transformation for right skewed data is the natural log (LN) transformation.
- The following demo shows how useful it is for mapping right skewed data.
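
As a quick numeric reminder of what the LN transformation does, the sketch below uses simulated data (not the county data): the values are drawn from a lognormal distribution, which is right-skewed by construction, and the seed and parameters are arbitrary.

```r
set.seed(455)                                  # arbitrary seed for reproducibility
x <- rlnorm(1000, meanlog = 3, sdlog = 1.5)    # simulated right-skewed positive data

# Right skew pulls the mean well above the median
c(mean = mean(x), median = median(x))

# After the natural log, the mean and median are close together
lx <- log(x)
c(mean = mean(lx), median = median(lx))
```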

Plot of Skewed Data

```{r}
#|label: untransformed pop. map code

cnty_data3 <- cnty2019_all |> 
  select(long:county, pop) |>
  mutate(pop1k = pop/1000)
       
cnty_pop <- cnty_data3 |>
    ggplot(aes(x=long, y=lat, group=group, fill=pop1k)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Population by County",
         subtitle="Unit is 1000 People") +
    scale_fill_continuous(type = "viridis") +
    theme(legend.position = "bottom",
          legend.key.width = unit(1, "cm"))
```

Histogram Clarifies Data Skewness

```{r}
#|label: untransformed pop hist code
hist_pop <- cnty_data3 |>
  ggplot() +
  geom_histogram(aes(x=pop1k),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Population", title="Histogram of US Population Data",
       y = "Count",subtitle="Unit is 1000 People") +
  theme_classic()
```

The Problem: The data are clearly right-skewed, BUT presenting log-transformed data in a map may complicate data interpretation.

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).
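
One detail to keep in mind when reading this histogram: the joined data have one row per polygon vertex, not one row per county, so counties with more detailed outlines contribute more rows. The sketch below assumes a one-row-per-county histogram is wanted; the toy tibble and its values are made up and only stand in for cnty_data3.

```r
library(dplyr)

# Toy stand-in for cnty_data3: one row per polygon vertex, values made up
toy_vertices <- tibble(
  state  = "new york",
  county = c(rep("onondaga", 5), rep("oswego", 3), "madison"),
  pop1k  = c(rep(460, 5), rep(117, 3), 71)
)

# Keep one row per county before drawing a histogram
toy_counties <- toy_vertices |>
  distinct(state, county, pop1k)

nrow(toy_vertices)   # number of vertex rows
nrow(toy_counties)   # number of counties
```

The same distinct step applied to cnty_data3 before geom_histogram would count each county once.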

Solution - Natural Log Transformed Plot

The data values are not transformed, but the fill scale (and, for histograms, the axis) is
Options specified in scale_fill_continuous:
- trans = "log"
- breaks = c(...)
break intervals determined by examining data

```{r}
#|label: log transformed map plot code
cnty_lpop <- cnty_data3 |>
  ggplot(aes(x=long, y=lat, group=group, fill=pop1k)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Population by County",
         subtitle="Unit is 1000 People and Data are Log-transformed") +
    scale_fill_continuous(type = "viridis",trans="log",
                          breaks=c(1,10,100,1000,10000)) +
    theme(legend.position = "bottom",
          legend.key.width = unit(1, "cm"))
```
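
The breaks in the plot above were chosen by examining the data. One way to do that, sketched here with made-up values in place of pop1k, is to take the powers of ten that span the observed range.

```r
# Made-up values in thousands, standing in for a right-skewed column such as pop1k
x <- c(1.2, 8, 55, 730, 9800)

# Powers of ten that span the observed range
lo <- floor(log10(min(x, na.rm = TRUE)))
hi <- ceiling(log10(max(x, na.rm = TRUE)))
log_breaks <- 10^(lo:hi)
log_breaks
```

The resulting vector can be passed to the breaks argument of the scale.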

Histogram of Log transformed Data

Final dashboard doesn’t include exploratory plots
Data exploration plots:
- Histograms, scatterplots, and boxplots are all useful

```{r}
#|label: log transformed pop hist code
hist_lpop <- cnty_data3 |>
  ggplot() +
  geom_histogram(aes(x=pop1k),fill="lightblue", col="darkblue") +
  labs(x="Population", 
       title="Histogram of Natural Log of US Population Data",
       y = "Count",
       subtitle="Unit is 1000 People and Data are Log-transformed") +
  theme_classic() + 
  scale_x_continuous(trans="log", breaks=c(1,10,100,1000,10000))
```

Histogram of Log-transformed Data

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

Population Plot Grid

```{r eval=F}
#|label: plot code for pop grid
grid.arrange(cnty_pop, hist_pop,cnty_lpop, hist_lpop, ncol=2) 
```

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

When and How to Log Transform

Log transformations are useful if you have right skewed POSITIVE data such as
- Prices
- Population
- Sales
- Income
Note: If data (x) have zeros, a good option is to use log(x + 1)
- ln(1) = 0 (In R log(1) = 0)
- 0 values in the data will still be zeros
In the following example we will create plots for number of households by county:
- Histograms with and without LN transformation
- Map plots with and without LN transformation
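
The note about zeros can be checked directly in R. The values below are made up; log1p is the base R shortcut for log(1 + x).

```r
x <- c(0, 1, 9, 99, 999)   # made-up counts that include a zero

log(x)       # log(0) is -Inf, which a plot scale cannot display
log(x + 1)   # zeros stay at 0
log1p(x)     # base R shortcut for log(1 + x)
```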

Number of Households Per County - Without Transformation

Untransformed Data Histogram

```{r}
#|label: data and untransformed hh data histogram code
cnty_data4 <- cnty2019_all |> 
  select(long:county, households) |>
  mutate(households1K = households/1000)

hist_hholds <- cnty_data4 |>
  ggplot() +
  geom_histogram(aes(x=households1K),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Number of Households", title="Histogram of US Household Data",
       y = "Count",subtitle="Unit is 1000 Households") +
  theme_classic()
```

Number of Households is highly right-skewed.

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

Number of Households Per County - With Transformation

Log transformed Data Histogram

```{r}
#|label: log transformed hh data histogram code
hist_lhholds <- cnty_data4 |>
  ggplot() +
  geom_histogram(aes(x=households1K),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Number of Households", 
       title="Histogram of Natural Log of US Household Data",
       y = "Count",
       subtitle="Unit is 1000 Households and Data are Log-transformed") +
  theme_classic() + 
  scale_x_continuous(trans="log", breaks=c(1,10,100,1000))
```

Log-transformed data appear approximately normally distributed

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

Number of Households Per County

```{r untransformed hh plot map code}
cnty_hholds <- cnty_data4 |>
  ggplot(aes(x=long, y=lat, group=group, fill=households1K)) +
  geom_polygon() +
  theme_map() +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  labs(fill= "", title="Number of Households by County",
       subtitle="Unit is 1000 Households") +
  scale_fill_continuous(type = "viridis") +
  theme(legend.position = "bottom",
        legend.key.width = unit(1, "cm"))
```

Map of untransformed households per county data is uninformative.

In-class Exercise 2

Log-transformed data map is more informative about geographic variability.

Submit R code to create log transformed households map in a text file with your name.

Households per County Plot Grid

```{r household grid plot code, eval=F}
grid.arrange(hist_hholds, hist_lhholds, cnty_hholds, cnty_lhholds, ncol=2)
```

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

Quiz 2 Information

Questions and Material from Quiz 1 may be on Quiz 2
Practice Questions and demo videos are posted.
- Review Quiz 1 and Quiz 1 Practice Questions
- Review HW 4, HW 5 - Part 1, and recent lectures
Converting text (character) date information to a date using lubridate commands (Week 5)
- Example R commands: ymd, dmy, mdy, ym combined with paste to combine columns
Extracting year, month, or day from the date variable using lubridate commands (Week 6)
- Example R commands: year, month, quarter, wday, day
Converting an xts dataset to a tibble and vice versa
- Creating a line plot from a time series (non-xts) dataset
- Creating an interactive hchart or dygraph from an xts.
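
A short refresher on the lubridate commands listed above, using made-up dates rather than quiz data:

```r
library(lubridate)

# Text dates in different orders parse to the same Date
d1 <- ymd("2026-03-24")
d2 <- mdy("03/24/2026")
d3 <- dmy("24-03-2026")

# Combine separate year and month columns with paste(), then parse
yr <- 2026
mo <- 3
d4 <- ym(paste(yr, mo, sep = "-"))   # first day of that month

# Extract components from a Date
year(d1)
month(d1)
quarter(d1)
day(d1)
wday(d1, label = TRUE)               # day of the week as a labelled factor
```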

Quiz 2 Info Continued

You should be familiar with the bls_tidy function we created and how to use it to import similar datasets.
There will be datasets to be imported AND combined (joined)
- Data sets can be joined by matching rows on shared key columns.
- You should know how to do the different joins we covered and what each one does:
  - full_join
  - right_join
  - left_join
  - inner_join
Data sets can be stacked (rows appended) if their columns are identical
- In BUA 455 we covered bind_rows
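
To compare what each join keeps, here is a small sketch with two made-up tibbles that share an id column:

```r
library(dplyr)

x <- tibble(id = c("a", "b", "c"), x_val = 1:3)
y <- tibble(id = c("b", "c", "d"), y_val = 4:6)

full_join(x, y, by = "id")    # every id from either table: a, b, c, d
left_join(x, y, by = "id")    # every id from x: a, b, c
right_join(x, y, by = "id")   # every id from y: b, c, d
inner_join(x, y, by = "id")   # only ids found in both: b, c

# bind_rows stacks tables that have the same columns
x2 <- tibble(id = c("e", "f"), x_val = 4:5)
bind_rows(x, x2)              # 5 rows, same two columns
```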

Quiz 2 Info Continued

Cleaning messy data (Week 5 and Weeks 8-9)
- Dealing with text (character variables)
  - gsub
  - separate
  - unite or paste or paste0
  - ifelse can be used for text OR for numeric data
  - ifelse followed by factor allows you to make any categorical variable you want.
Other commands for modifying text:
- tolower and toupper
- str_trim, str_squish and str_pad
- ifelse sometimes followed by factor
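
The text commands above can be tried on a few made-up strings. The cutoff used with ifelse is arbitrary and only illustrates the pattern.

```r
library(stringr)

raw <- c("  New   York ", "los angeles", "CHICAGO  ")

str_trim(raw)                       # removes leading and trailing spaces only
str_squish(raw)                     # also collapses repeated inner spaces
tolower(str_squish(raw))            # consistent case, useful before a join
gsub(" ", "_", str_squish(raw))     # replace the remaining spaces

str_pad("7", width = 3, pad = "0")  # pads on the left to a fixed width

# ifelse() followed by factor() builds a category with a chosen level order
pct <- c(42, 88, 67)
level <- ifelse(pct >= 60, "High", "Low") |>
  factor(levels = c("Low", "High"))
```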

Additional Text Commands

NOT ON QUIZ 2:

Commands to convert case
str_to_title: Capitalizes the first letter of each word
str_to_sentence: Capitalizes only the first letter of the first word

Additional skills for Quiz 2 from HW 5 - Part 1:

summing across rows using sum(c_across(...))
using pivot_wider and then pivot_longer and then replacing NAs with 0 to create a ‘complete’ data set
- Useful for area plots
Plotting skills for Quiz 2
- unformatted line plot, area plot, or grouped bar chart
- hchart in highcharter package

Key Points from This Week

Introduction to Geographic Data

Use skills already covered to
- clean data and check text variables
- join datasets
- create plot grids for comparing variables
Determine if variables need to be log-transformed.
Quiz 2 Practice Questions are available.
Quiz 2 will be on Tuesday, 4/7.

You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions are required during the semester.

- There will be datasets to be imported AND combined (joined) - Data sets can be joined by row. - You should know how to do the different joins we covered and what each one does: - `full_join` - `right_join` - `left_join` - `inner_join` - Data sets can be stacked by column if the columns are identical - In BUA 455 we covered `bind_rows` ## ### Quiz 2 Info Continued - Cleaning messy data (Week 5 and Weeks 8-9) - Dealing with text (character variables) - `gsub` - `separate` - `unite` or `paste` or `paste0` - `ifelse` can be be used for text OR for numeric data - `ifelse` followed by `factor` allows you to make any categorical variable you want. - Other commands for modifying text: - `tolower` and `toupper` - `str_trim`, `str_squish` and `str_pad` - `ifelse` sometimes followed by `factor` ## Additional Text Commands ::: fragment **NOT ON QUIZ 2:** ::: - **[Commands to covert case](https://stringr.tidyverse.org/reference/case.html)** - `str_to_title`: First letter of each word - `str_to_sentence`: First letter of first word ::: fragment **Additional skills for Quiz 2 from HW 5 - Part 1:** ::: - summing across rows using `sum(c_across(...))` - using `pivot_wider` and then `pivot_longer` and then replacing NAs with 0 to create a 'complete' data set - Useful for area plots - Plotting skills for Quiz 2 - unformatted line plot, area plot, or grouped bar chart - `hchart` in **highcharter** package ## ### Key Points from This Week ::: fragment **Introduction to Geographic Data** ::: - Use skills already covered to - clean data and check text variables - join datasets - create plot grids for comparing variables - Determine if variables need to be log-transformed. - Quiz 2 Practice Questions are available. - Quiz 2 will be on Tuesday, 4/7. ::: fragment You may submit an 'Engagement Question' about each lecture until midnight on the day of the lecture. **A minimum of four submissions are required during the semester.** :::