Weeks 10 and 11

Introduction To Geographic Data and Quiz 2 Review

Penelope Pooler Eisenbies

2025-03-25

Housekeeping

Upcoming Dates

HW 5 - Part 1 is due on Wed. 3/26 - Grace Period ends 3/28
Draft Proposals are due Thursday 3/27
Thursday 3/27: In-class work day and NO OFFICE HOURS
Tuesday 4/1: Review using practice questions
Quiz 2 is on Thursday, 4/3
- Mostly on Weeks 5 through 9, but material is cumulative.
- Mostly on HW Assignments 4 and 5, but material is cumulative.
- Data Mgmt. tasks will require 2 or 3 steps and then you will answer questions.
HW 5 - Part 2 will be posted after Quiz 2.

Today, Thursday, and Tuesday, 4/1

Practice Questions for Quiz 2 are posted.
Tue. 4/1: Skills/Concepts Review for Quiz 2
- Putting skills together for different goals
- Come with questions or submit them as Engagement Questions.
Today: Intro to Managing and Plotting Geographic Data
- Today’s geographic maps will not be on Quiz 2.
- Mapping data is an effective data curation tool that may be useful in this class, other classes, and your career.
We will cover more about geographic data and map visualizations after Quiz 2.
If you want help with mapping project data, please reach out to me or the TA.

In-class Exercise - Week 10

Purpose:

To gain some experience and understanding of map data available in R and elsewhere.
To experiment with mapping data
Students are encouraged to use domestic or international map data in their dashboards if appropriate.

Data Preparation

The data for today’s in-class exercise is part of R.
These geographic data are useful if you have information by state or county and you want to show a choropleth map of your data.
R also has world information, e.g., countries, continents, etc.

us_states <- map_data("state") |>               # state polygons (not used today)
  rename("state" = "region")
us_counties <- map_data("county") |>            # county polygons
  rename("state" = "region", "county" = "subregion") |>
  mutate(county = gsub(" ", "", county), 
         county = gsub("'","", county) |> tolower())
#unique(us_counties$county[us_counties$state=="louisiana"]) # note issue Louisiana counties
cnty2019_all <- county_2019
#unique(cnty2019_all$name[cnty2019_all$state=="Louisiana"]) # note issue Louisiana counties

cnty2019_all <- cnty2019_all |> 
  mutate(state = tolower(state),
         county = tolower(name),
         county = gsub(" county", "", county),
         county = gsub(" parish", "", county),
         county = gsub("\\.", "", county),          # \\ escapes the dot, which otherwise matches any character in a regex
         county = gsub(" ", "", county),
         county = gsub("'","", county)) |>
  relocate(county, .before=name)

cnty2019_all <- full_join(us_counties,cnty2019_all)  # geo data and demographic data
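The name cleaning above matters because the join only matches rows whose state and county text agree exactly. The following is a minimal sketch, using made-up toy data rather than the real map and demographic datasets, of cleaning names the same way and then using anti_join to list any rows that would fail to match. The object names geo_toy and demo_toy are hypothetical.

```r
library(dplyr)

# Toy stand-ins (made-up values); not the real map or demographic data
geo_toy <- tibble(state  = "louisiana",
                  county = c("stjohnthebaptist", "eastbatonrouge"))
demo_toy <- tibble(state   = "Louisiana",
                   name    = c("St. John the Baptist Parish",
                               "East Baton Rouge Parish"),
                   hs_grad = c(85, 90))

# Same cleaning steps as in the lecture code
demo_clean <- demo_toy |>
  mutate(state  = tolower(state),
         county = tolower(name),
         county = gsub(" county", "", county),
         county = gsub(" parish", "", county),
         county = gsub("\\.", "", county),
         county = gsub(" ", "", county),
         county = gsub("'", "", county))

# Rows in the demographic data with no partner in the map data
unmatched <- anti_join(demo_clean, geo_toy, by = c("state", "county"))
nrow(unmatched)
```

If unmatched has any rows, the names still differ somewhere and need another cleaning step before the join.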

Creating a new dataset is not required, but it is helpful
Plot code could be converted into a function

#|label: select data and plot 1 map
cnty_data1 <- cnty2019_all |> 
  select(long:county, hs_grad, bachelors, 
         household_has_computer, 
         household_has_broadband) 
cnty_hs_grad <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, fill=hs_grad)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent with High School Degree") +
    scale_fill_continuous(type = "viridis") + 
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, 'cm'),
          plot.title = element_text(size = 10),
          legend.text= element_text(size = 8))

Font in plot adjusted for screen.

#|label: plot 2 code
cnty_bachelors <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=bachelors)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent with Bachelor's Degree") +
    scale_fill_continuous(type = "viridis") + 
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, 'cm'),
          plot.title = element_text(size = 10),
          legend.text= element_text(size = 8))

Font in plot adjusted for screen.

#|label:  plot 3 code
cnty_computer <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=household_has_computer)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent of Households with a Computer") +
    scale_fill_continuous(type = "viridis") + 
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, 'cm'),
          plot.title = element_text(size = 10),
          legend.text= element_text(size = 8))

Font in plot adjusted for screen

#|label:  plot 4 code
cnty_brdbd <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=household_has_broadband)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent of Households with Broadband") +
    scale_fill_continuous(type = "viridis") + 
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, 'cm'),
          plot.title = element_text(size = 10),
          legend.text= element_text(size = 8))

Font in plot adjusted for screen
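The four plot blocks above differ only in the fill variable and the title, so they are a natural candidate for a function. Below is a minimal sketch of one possible helper. The name county_map is hypothetical, the toy data are made up, and theme_void() is used as a stand-in for theme_map() so that the sketch depends only on ggplot2.

```r
library(ggplot2)

# Hypothetical helper (not part of the lecture code).
# fill_var is the name of the fill column, given as a string.
county_map <- function(data, fill_var, title) {
  ggplot(data, aes(x = long, y = lat, group = group,
                   fill = .data[[fill_var]])) +
    geom_polygon() +
    labs(fill = "", title = title) +
    scale_fill_viridis_c() +     # viridis palette, as in the lecture plots
    theme_void() +               # assumption: stand-in for theme_map()
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, "cm"),
          plot.title = element_text(size = 10),
          legend.text = element_text(size = 8))
}

# Toy data: two square "counties" with made-up coordinates and values
toy <- data.frame(
  long    = c(0, 1, 1, 0,  1, 2, 2, 1),
  lat     = c(0, 0, 1, 1,  0, 0, 1, 1),
  group   = rep(c(1, 2), each = 4),
  hs_grad = rep(c(80, 92), each = 4)
)

p_toy <- county_map(toy, "hs_grad", "Percent with High School Degree")
```

With the lecture data, a call such as county_map(cnty_data1, "bachelors", "Percent with Bachelor's Degree") + coord_map("albers", lat0 = 39, lat1 = 45) should give a plot similar to the ones above.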

Demographic Plot Grid - Order Matters

Grid is populated left to right and then top to bottom unless otherwise specified.
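The fill order can be illustrated without drawing anything. The sketch below places the plot names in a two-column matrix filled by row, which mirrors the default left-to-right, top-to-bottom layout. In cowplot, plot_grid also has a byrow argument that can be set to FALSE to fill by column instead.

```r
# Plot names in the order they are passed to plot_grid()
plots <- c("cnty_hs_grad", "cnty_computer", "cnty_bachelors", "cnty_brdbd")

# Fill a 2-column layout row by row, as the grid does by default
layout <- matrix(plots, ncol = 2, byrow = TRUE)
print(layout)
```

Reading the matrix shows which plot lands in each cell: the first two names fill Row 1 and the last two fill Row 2.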

#|label: grid of pct plots

# note alternative to grid.arrange used
# code shown is for export version of plots
grid <- plot_grid(cnty_hs_grad, cnty_computer, cnty_bachelors, cnty_brdbd, ncol=2)

In-class Exercise - Plot Grid (Steps 1 & 2)

Steps to Follow

Examine the available variables in the cnty2019_all dataset created above.
Create Individual Plots. Variable names in R dataset and definitions:
- household_has_smartphone: Households with Smart Phones
- median_age: Median Age
- median_household_income: Median Household Income
- median_individual_income: Median Individual Income

You can copy provided plot code and modify it for these variables or you could try to create a function (not required today).

#|label: usmaps demographic maps exercise
# select data for plot (not required, but helpful)
# cnty_data2 <- 

# create four individual plots, one for each variable
# use provided plot code and modify

In-class Exercise - Plot Grid (Steps 3 & 4)

Create 2x2 plot grid of these four variables by US County as part of your class participation credit for this week.

For full credit, plots must be in the order specified.
- Row 1: should have smartphone variable and median age.
- Row 2: median household and individual income.
Create Plot Grid using plot_grid command.

#|label: in-class exercise 2x2 plot grid
# for full credit grid of plots must be in order specified
# use plot_grid command
# export plots to img folder using save_plot

Right click on plot grid, then save as… and save to img folder with correct name.

or use save_plot command
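The save_plot option can be sketched as follows. This is a minimal example with toy scatterplots from the built-in mtcars data, not the exercise maps. The file name toy_grid.png is hypothetical, and the file is written to a temporary folder here; in class the path would point to the img folder.

```r
library(ggplot2)
library(cowplot)

# Four toy plots standing in for the four county maps
p1 <- ggplot(mtcars, aes(wt, mpg))   + geom_point()
p2 <- ggplot(mtcars, aes(hp, mpg))   + geom_point()
p3 <- ggplot(mtcars, aes(disp, mpg)) + geom_point()
p4 <- ggplot(mtcars, aes(qsec, mpg)) + geom_point()

grid_toy <- plot_grid(p1, p2, p3, p4, ncol = 2)

# ncol and nrow tell save_plot how many panels to size the file for
out_file <- file.path(tempdir(), "toy_grid.png")
save_plot(out_file, grid_toy, ncol = 2, nrow = 2)
```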

Mapping Log Transformed Data

The previous examples all use percent data.
No transformations are needed.
In contrast, population data or financial data are often right-skewed and need to be log transformed.
Recall from MAS 261, BUA 345 (and perhaps FIN classes):
- An effective transformation for right skewed data is the natural log (LN) transformation.
- The following demo shows how useful it is for mapping right skewed data.
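As a quick numeric check of right skew, the mean can be compared to the median before and after taking the natural log. The sketch below uses simulated data (an assumption for illustration, not the county data): in right-skewed data the mean is pulled well above the median, and after the log transformation the two are close.

```r
set.seed(455)

# Simulated right-skewed positive data (stand-in for population or income)
x <- rlnorm(1000, meanlog = 3, sdlog = 1)

raw_stats <- c(mean = mean(x), median = median(x))

# Natural log transformation; log() in R is LN by default
lx <- log(x)
log_stats <- c(mean = mean(lx), median = median(lx))

raw_stats
log_stats
```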

Plot of Skewed Data

#|label: untransformed pop. map code

cnty_data3 <- cnty2019_all |> 
  select(long:county, pop) |>
  mutate(pop1k = pop/1000)
       
cnty_pop <- cnty_data3 |>
    ggplot(aes(x=long, y=lat, group=group, fill=pop1k)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Population by County",
         subtitle="Unit is 1000 People") +
    scale_fill_continuous(type = "viridis") +
    theme(legend.key.width = unit(.4, "cm"))

Histogram Clarifies Data Skewness

#|label: untransformed pop hist code
hist_pop <- cnty_data3 |>
  ggplot() +
  geom_histogram(aes(x=pop1k),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Population", title="Histogram of US Population Data",
       y = "Count",subtitle="Unit is 1000 People") +
  theme_classic()

The Problem: The data are clearly right-skewed, BUT mapping log-transformed values can make the legend harder to interpret.

Solution - Natural Log Transformed Plot

The data values are not transformed, but the fill scale and its legend are.
Options specified in scale_fill_continuous:
- trans = "log"
- breaks = c(...)
break intervals determined by examining data
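One way to choose break points by examining the data is to look at the range and take the powers of 10 that fall inside it. The values below are made up for illustration and are not the county populations.

```r
# Made-up values in units of 1000 people
pop1k_toy <- c(0.9, 2.5, 31, 480, 10040)

rng <- range(pop1k_toy)

# Smallest and largest powers of 10 inside the data range
lo <- ceiling(log10(rng[1]))
hi <- floor(log10(rng[2]))

breaks_toy <- 10^(lo:hi)
breaks_toy
```

A vector built this way can be passed to the breaks option, so the legend shows round values in the original units even though the color scale is log-spaced.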

#|label: log transformed map plot code
cnty_lpop <- cnty_data3 |>
  ggplot(aes(x=long, y=lat, group=group, fill=pop1k)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Population by County",
         subtitle="Unit is 1000 People and Data are Log-transformed") +
    scale_fill_continuous(type = "viridis",trans="log",
                          breaks=c(1,10,100,1000,10000)) +
    theme(legend.key.width = unit(.4, "cm"))

Histogram of Log transformed Data

Final dashboard doesn’t include exploratory plots
Data exploration plots:
- Histograms, scatterplots, and boxplots are all useful

#|label: log transformed pop hist code
hist_lpop <- cnty_data3 |>
  ggplot() +
  geom_histogram(aes(x=pop1k),fill="lightblue", col="darkblue") +
  labs(x="Population", 
       title="Histogram of Natural Log of US Population Data",
       y = "Count",
       subtitle="Unit is 1000 People and Data are Log-transformed") +
  theme_classic() + 
  scale_x_continuous(trans="log", breaks=c(1,10,100,1000,10000))

Histogram of Log-transformed Data

Population Plot Grid

#|label: plot code for pop grid
grid.arrange(cnty_pop, hist_pop,cnty_lpop, hist_lpop, ncol=2)

When and How to Log Transform

Log transformations are useful if you have right-skewed POSITIVE data such as
- Prices
- Population
- Sales
- Income
Note: If data (x) have zeros, a good option is to use log(x + 1)
- ln(1) = 0 (In R log(1) = 0)
- 0 values in the data will still be zeros
In the following example we will create plots for number of households by county:
- Histograms with and without LN transformation
- Map plots with and without LN transformation
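The log(x + 1) option noted above can be sketched with a small made-up vector that contains a zero. Base R also provides log1p, which computes the same quantity.

```r
# Made-up counts that include a zero
x <- c(0, 1, 9, 99, 999)

plain_log <- log(x)       # the zero becomes -Inf, which cannot be plotted
shifted   <- log(x + 1)   # the zero stays at 0
with_fn   <- log1p(x)     # base R function equivalent to log(x + 1)

plain_log
shifted
```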

Number of Households Per County

Without Transformation

Untransformed Data Histogram

#|label: data and untransformed hh data histogram code
cnty_data4 <- cnty2019_all |> 
  select(long:county, households) |>
  mutate(households1K = households/1000)

hist_hholds <- cnty_data4 |>
  ggplot() +
  geom_histogram(aes(x=households1K),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Number of Households", title="Histogram of US Household Data",
       y = "Count",subtitle="Unit is 1000 Households") +
  theme_classic()

Number of Households is highly right-skewed.

Number of Households Per County

With Transformation

Log transformed Data Histogram

#|label: log transformed hh data histogram code
hist_lhholds <- cnty_data4 |>
  ggplot() +
  geom_histogram(aes(x=households1K),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Number of Households", 
       title="Histogram of Natural Log of US Household Data",
       y = "Count",
       subtitle="Unit is 1000 Households and Data are Log-transformed") +
  theme_classic() + 
  scale_x_continuous(trans="log", breaks=c(1,10,100,1000))

Log-transformed data appear approximately normally distributed.

Number of Households Per County

cnty_hholds <- cnty_data4 |>
  ggplot(aes(x=long, y=lat, group=group, fill=households1K)) +
  geom_polygon() +
  theme_map() +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  labs(fill= "", title="Number of Households by County",
       subtitle="Unit is 1000 Households") +
  scale_fill_continuous(type = "viridis") +
  theme(legend.key.width = unit(.4, "cm"))

Map of untransformed households per county data is uninformative.

In-class Exercise 2

Log-transformed data map is more informative about geographic variability.

Submit R code to create log transformed households map in a text (.txt) file with your name.

Households per County Plot Grid

grid.arrange(hist_hholds, hist_lhholds, cnty_hholds, cnty_lhholds, ncol=2)

Quiz 2 Information

Questions and Material from Quiz 1 may be on Quiz 2
Practice Questions are posted
- Review Quiz 1 and Quiz 1 Practice Questions
- Review HW 4 and HW 5 - Part 1 and recent lectures
Study Tip: Feel free to add on to practice questions .qmd file with extra chunks and notes so that all of your notes are in one place.

Quiz 2 Information Cont’d

Converting text (character) date information to a date using lubridate commands (Week 5)
- Example R commands: ymd, dmy, mdy, ym combined with paste to combine columns
Extracting year, month, or day from the date variable using lubridate commands (Week 6)
- Example R commands: year, month, quarter, wday, day
Converting an xts dataset to a tibble (standard R dataset) (Week 7)
- Creating a lineplot from time series (non-xts) dataset
Converting a tibble to an xts dataset
- Creating an interactive hchart

Quiz 2 Info Cont’d

You should be familiar with the bls_tidy function we created and how to use it to import similar datasets.
There will be datasets to be imported AND combined (joined)
- Data sets are joined by matching rows on shared key columns, which adds columns.
- You should know how to do the different joins we covered and what each one does:
  - full_join
  - right_join
  - left_join
  - inner_join
Data sets can be stacked (rows appended) if the columns are identical
- In BUA 455 we covered bind_rows
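The differences between the joins are easiest to see on two small tables that only partly overlap. The sketch below uses made-up toy data; a_toy and b_toy are hypothetical names.

```r
library(dplyr)

# ids 2 and 3 appear in both tables; 1 only in a_toy, 4 only in b_toy
a_toy <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
b_toy <- tibble(id = c(2, 3, 4), y = c(10, 20, 30))

j_inner <- inner_join(a_toy, b_toy, by = "id")  # only ids found in both
j_left  <- left_join(a_toy, b_toy, by = "id")   # all rows of a_toy
j_right <- right_join(a_toy, b_toy, by = "id")  # all rows of b_toy
j_full  <- full_join(a_toy, b_toy, by = "id")   # all ids from either table

# Stacking: append rows of tables that have the same columns
stacked <- bind_rows(a_toy, a_toy)

c(inner = nrow(j_inner), left = nrow(j_left),
  right = nrow(j_right), full = nrow(j_full), stacked = nrow(stacked))
```

Rows without a partner are kept by left, right, or full joins and get NA in the columns that come from the other table.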

Quiz 2 Info Cont’d

Cleaning messy data (Week 5 and Weeks 8-9)
- Dealing with text (character variables)
  - gsub
  - separate
  - unite or paste or paste0
  - ifelse can be used for text OR for numeric data
  - ifelse followed by factor allows you to make any categorical variable you want.
Other commands for modifying text:
- tolower and toupper
- str_trim, str_squish and str_pad
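Several of the text commands listed above can be sketched on a small made-up vector. The variable names and the cutoff of 7 characters are arbitrary choices for illustration.

```r
library(stringr)

raw <- c("  New   York ", "los angeles", "CHICAGO  ")

trimmed  <- str_trim(raw)      # removes spaces at the ends only
squished <- str_squish(raw)    # also collapses repeated spaces inside
lower    <- tolower(squished)
nospace  <- gsub(" ", "", lower)

# Pad short codes with leading zeros to a fixed width
ids <- str_pad(c("7", "42"), width = 3, pad = "0")

# ifelse followed by factor creates a categorical variable
name_len <- ifelse(nchar(nospace) > 7, "long", "short") |>
  factor(levels = c("short", "long"))
```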

Additional Text Commands

Additional skills for Quiz 2 from HW 5 - Part 1:

summing across rows using sum(c_across(...))
using pivot_wider and then pivot_longer and then replacing NAs with 0 to create a ‘complete’ data set
- Useful for area plots
Plotting skills for Quiz 2
- unformatted line plot, area plot, or grouped bar chart
- hchart in highcharter package
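The two data management skills from HW 5 - Part 1 can be sketched on made-up toy data. The names scores_toy and sales_toy are hypothetical, and replace_na from tidyr is one of several ways to replace NAs with 0.

```r
library(dplyr)
library(tidyr)

# Summing across rows: rowwise() makes sum() work within each row
scores_toy <- tibble(q1 = c(1, 2), q2 = c(3, 4), q3 = c(5, 6)) |>
  rowwise() |>
  mutate(total = sum(c_across(q1:q3))) |>
  ungroup()

# Region B has no 2024 row, so the data are not 'complete'
sales_toy <- tibble(region = c("A", "A", "B"),
                    year   = c(2023, 2024, 2023),
                    amt    = c(5, 7, 3))

# Widening exposes the missing combination as NA;
# lengthening again keeps it as a row, which is then set to 0
complete_toy <- sales_toy |>
  pivot_wider(names_from = year, values_from = amt) |>
  pivot_longer(cols = -region, names_to = "year", values_to = "amt") |>
  mutate(amt = replace_na(amt, 0))
```

After these steps every region has a row for every year, which is what an area plot needs in order to stack correctly.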

NOT ON QUIZ 2:

Commands to convert case
str_to_title: Capitalizes the first letter of each word
str_to_sentence: Capitalizes the first letter of the first word

Key Points

Introduction to Geographic Data

Use skills already covered to
- clean data and check text variables
- join datasets
- create plot grids for comparing variables
Determine if variables need to be log-transformed.
Quiz 2 Practice Questions are posted.
Quiz 2 will be on Thursday, 4/3.

You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions is required during the semester.